Linköping studies in science and technology Dissertation No. 1489

Bioinformatic methods for characterization of viral pathogens in metagenomic samples

Fredrik Lysholm

Department of Physics, Chemistry and Biology

Linköping, 2012

Cover art by Fredrik Lysholm 2012. The cover shows a rhinovirus consisting of 4 amino acid chains (VP1: Blue, VP2: Red, VP3: Green and VP4: Yellow) in complex, repeated 60 times (in total 240 chains). The icosahedral shape of the capsid can be seen, especially through observing the placement of each 4-chain complex and the edges. The structure is fetched from RCSB PDB [1] (4RHV [2]) and the ~6500 atoms were rendered with ray tracing using PyMOL 1.4.1 at 3200x3200 pixels, in ~3250 seconds.

During the course of research underlying this thesis, Fredrik Lysholm was enrolled in Forum Scientium, a multidisciplinary doctoral programme at Linköping University.

Copyright © 2012 Fredrik Lysholm, unless otherwise noted. All rights reserved.

Fredrik Lysholm
Bioinformatic methods for characterization of viral pathogens in metagenomic samples
ISBN: 978-91-7519-745-6
ISSN: 0345-7524
Linköping studies in science and technology, dissertation No. 1489

Printed by LiU-Tryck, Linköping, Sweden, 2012.

Bioinformatic methods for characterization of viral pathogens in metagenomic samples

Fredrik Lysholm, [email protected]
IFM, Linköping University, Linköping, Sweden

Abstract

Virus infections impose a huge disease burden on humanity and new viruses are continuously found. As most studies of viral disease are limited to the investigation of known viruses, it is important to characterize all circulating viruses. Thus, a broad and unselective exploration of the viral flora would be a highly productive development for modern medicine. Fueled by the reduction in sequencing costs and the unbiased nature of the approach, viral metagenomics has rapidly become the strategy of choice for this exploration.

This thesis mainly focuses on improving key methods used in viral metagenomics, as well as the complete viral characterization of two sets of samples using these methods. The major methods developed are an efficient automated analysis pipeline for metagenomics data and two novel, more accurate, alignment algorithms for 454 sequencing data. The automated pipeline facilitates rapid, complete and effortless analysis of metagenomics samples, which in turn enables detection of potential pathogens, for instance in patient samples. The two new alignment algorithms developed cover comparisons against both nucleotide and protein databases, while retaining the underlying 454 data representation. Furthermore, a simulator for 454 data was developed in order to evaluate these methods. This simulator is currently the fastest and most complete simulator of 454 data, which enables further development of algorithms and methods. Finally, we have successfully used these methods to fully characterize a multitude of samples, including samples collected from children suffering from severe lower respiratory tract infections as well as patients diagnosed with chronic fatigue syndrome, both of which are presented in this thesis. In these studies, complete viral characterization has revealed the presence of both expected and unexpected viral pathogens as well as many potential novel viruses.



Populärvetenskaplig sammanfattning

Methods for determining the composition of viruses in patient samples.

This thesis primarily addresses how we can find out which viruses infect us, by studying the genetic material (DNA) in collected samples.

Over the past decade there has been an explosive development of the possibilities to determine the DNA present in a sample. It is, for example, now possible to determine a person's complete genome for less than 10,000 dollars, compared with the 3 billion dollars spent on the first human genome, which was completed about 10 years ago. This development has not only made it possible to determine a person's DNA "cheaply", but has also enabled large-scale studies of the genetic material of the viruses and bacteria that live in our bodies. To make sense of all the data that is generated, however, we need the help of computers. Bioinformatics is the field that encompasses storing, managing and analyzing precisely these data.

Analysis of DNA collected from samples with unknown content (so-called metagenomics) requires the use of a large number of bioinformatic methods. For example, in order to understand what has been found in a sample, it must be compared against large databases of known content. Unfortunately, we have so far collected and characterized no more than a fraction of all the genetic material that is to be found among bacteria, viruses and other pathogens. These incomplete databases, together with the errors that arise during data collection, make in-depth analyses considerably more difficult. Furthermore, the increasing amount of data generated, as well as the growing databases, have made these analyses more and more time consuming.

In this thesis we have used and developed new bioinformatic methods to understand and analyze metagenomic data better and faster. We have developed new ways of using existing search methods to determine as quickly as possible what a sample contains, as well as several methods for handling the errors that arise when the DNA of a sample is determined. These bioinformatic methods and tools have already been used to characterize several materials, including the viral content of samples collected from children suffering from severe lower respiratory tract infections and from patients suffering from chronic fatigue syndrome. Following a large number of analyses, most of the processes have been polished and automated so that computations and analyses can run without human intervention. This makes it possible, for example, to use this or similar methods in the future as a clinical tool for identifying the viruses that may be infecting an individual patient.


Acknowledgments

This is perhaps the most important part of my thesis, or at least the most read part. While much of the work you do as a PhD student is done on your own, there are many people who made this thesis possible whom I really ought to thank a lot!

Firstly, I would like to thank my supervisors Bengt Persson and Björn Andersson. Thank you Bengt for always believing in the work I have done and for an always so very positive attitude towards our projects. I also appreciate that you let me work on whatever I felt like and always supported me in my projects. Björn I would really like to thank for your involvement and help in the metagenomic projects we shared and for all those nice lunches at MF or the local fast food place at KI.

I would also like to thank my two former co-workers, Jonas Carlsson and Joel Hedlund. Thank you both for all the cool discussions, pool-playing in Brag and for being stable lunch and fika partners during my first years as a PhD student. Jonas, my mentor during my master thesis time, you are always happy and crazy in a way that made it impossible not to love you. Joel, with whom I shared an office after Jonas left, thanks for helping me to take a break and look at some “something awful pictures” or to discuss really odd stuff such as how best to make an utterly immovable nail-bed.

Also, I would like to thank the newcomers to Linköping, Björn Wallner and Robert Pilestål. Björn, you are probably the only person I have ever met who shares my occupation, my interest in carpentry and building stuff as well as playing floorball (a true He-Man). Robert, PhD student in Björn’s group, with whom I now share an office, thanks for sharing your thoughts on everything from evolution to marriage! Thank you both (as well as Malin Larsson, new BILS person in Linköping) for sharing lunches and fikas with me during this last year.

I would also like to thank all other fellow PhD students at IFM over the years. Special thanks goes to Leif “leffe” Johansson and Karin “karma”


Magnusson for being my two allies over the years. Also, special thanks go to Sara, Viktor, Jonas, Per, Erik, Anders and all the other members of the “coffee club”. Thank you all for the greetings sent and conveyed by Karin at my wedding and for all the weird but incredibly funny discussions over coffee. My appreciation also goes to Stefan Klintström and all the members of Forum Scientium. It has been a lot of fun and very educational to be part of Forum!

My colleagues at CMB, KI and SciLifeLab also deserve special thanks. First of all thanks to Anna Wetterbom, for helping and teaching me how to write a manuscript. I would also like to thank Oscar, Stefanie, Stephen and Hamid for the nice discussions, lunches and for housing me in their small office at KI!

I also really have to thank all other friends and collaborators over the years: Tobias Allander, Michael Lindberg, Dave Messina, Erik Sonnhammer and Patrik Björkholm. Tobias and Michael for help with all the viruses and Dave and Erik for the joint effort in trying to make sense of the “dark matter”. Patrik, you are probably the most cheerful guy I know and you have been a great support over the years. It is downright strange we have not published together yet!

Finally, I would really like to thank my lovely wife, Ida, who has always encouraged me and been at my side.

Linköping, December 2012.


Papers

Papers included in this thesis

I. Lysholm F*, Andersson B, Persson B: An efficient simulator of 454 data using configurable statistical models. BMC Res Notes 2011, 4: 449 Fredrik Lysholm drafted the study, implemented the software and wrote the manuscript.

II. Lysholm F*, Andersson B, Persson B: FAAST: Flow-space Assisted Alignment Search Tool. BMC Bioinformatics 2011, 12: 293. Fredrik Lysholm drafted the study, implemented the software and wrote the manuscript.

III. Sullivan PF*, Allander T, Lysholm F, Goh S, Persson B, Jacks A, Evengård B, Pedersen NL, Andersson B*: An unbiased metagenomic search for infectious agents using monozygotic twins discordant for chronic fatigue. BMC Microbiol 2011, 11: 2. Fredrik Lysholm performed the bioinformatic analysis and wrote the bioinformatics section of the manuscript.

IV. Lysholm F, Wetterbom A, Lindau C, Darban H, Bjerkner A, Fahlander K, Lindberg AM, Persson B, Allander T, Andersson B*: Characterization of the viral microbiome in patients with severe lower respiratory tract infections, using metagenomic sequencing. PLoS One 2012, 7: e30875. Fredrik Lysholm performed the bioinformatic analysis and wrote the manuscript.


V. Lysholm F*: Highly improved homopolymer aware nucleotide-protein alignments with 454 data. BMC Bioinformatics Sep 2012, 13: 230. Fredrik Lysholm drafted the study, implemented the software and wrote the manuscript.

Additional papers not included in this thesis

VI. Bzhalava D, Ekström J, Lysholm F, Hultin E, Faust H, Persson B, Lehtinen M, de Villiers EM, Dillner J*: Phylogenetically diverse TT virus viremia among pregnant women. Virology Oct 2012, 432: 427-434. Fredrik Lysholm assisted in analysis of TTVs.

VII. Messina DN, Lysholm F, Allander T, Andersson B, Sonnhammer ELL*: Discovery of novel protein families in metagenomic samples. Submitted to Research. Fredrik Lysholm performed initial data analysis of sequencing data and prepared the dataset used.

* – Corresponding author

Contents

Abstract ...... iii
Populärvetenskaplig sammanfattning ...... v
Acknowledgments ...... vii
Papers ...... ix
1 Introduction ...... 1
1.1 Life ...... 1
1.2 DNA and RNA ...... 3
1.3 Proteins ...... 7
1.4 Mutations and evolution ...... 10
1.5 Bioinformatics ...... 11
1.5.1 Sequence analysis ...... 11
1.5.2 Structure analysis ...... 12
1.5.3 Text mining ...... 13
1.6 DNA sequencing ...... 14
1.6.1 454 sequencing ...... 16
2 Materials and Methods ...... 19
2.1 Biological databases ...... 19
2.2 Pairwise sequence alignment ...... 20
2.2.1 Dynamic programming ...... 21
2.2.2 Pairwise sequence alignment ...... 22
2.2.3 Scoring models and scoring matrices ...... 26
2.2.4 Heuristics and homology search ...... 29
2.2.5 E-value: Making sense of the score ...... 30
2.3 De novo sequence assembly ...... 31
2.4 Multiple sequence alignment ...... 32
2.4.1 Evolutionary analysis ...... 33
2.5 Evaluation and assessment ...... 33
3 Present investigations ...... 37
3.1 Aims ...... 37
3.2 Paper I – An efficient 454 simulator ...... 37
3.3 Paper II – Alignments with 454 data ...... 39
3.4 Paper III – Do viruses cause CFS? ...... 41
3.5 Paper IV – Viruses in the LRT ...... 43
3.6 Paper V – X-alignments with 454 data ...... 46
4 Concluding remarks ...... 51
4.1 Summary ...... 51
4.2 Future studies ...... 52
5 Acronyms ...... 55
6 References ...... 57


1 Introduction

My thesis work has been carried out in the field of bioinformatics, applied to viral metagenomics. The main goal of the work has been to develop, or find existing, methods needed to enable the search for new viral pathogens. In this effort, source materials collected from various places, and from individuals suffering from a broad range of illnesses, have been analyzed. Through the analysis of these samples, I have come in contact with many areas of bioinformatics and been faced with several hitherto unsolved problems. To understand these challenges, this introduction will cover some of the fundamental concepts of molecular biology as well as provide a brief introduction to the field of bioinformatics and DNA sequencing.

1.1 Life

Life is a somewhat complex term with many definitions, yet to most people there is a clear distinction between living and non-living. However, when one goes into detail it becomes more diffuse, and the term “life” is constantly challenged by the advances of science. For example, are viruses living? If so, are small elements of DNA that can change genetic location on their own (transposable elements, transposons) [3] living? If viruses are not living, are small parasitic bacteria, that have much smaller genomes than the largest viruses such as Mimi- and Marseilleviruses (which can in turn be plagued by viruses) [4,5], living? With the rapid exploration of microbes and viruses I would not be surprised if we end up finding a continuum from clearly living organisms to dead material. Further challenging the definition of life, new supercomputers armed with vast amounts of knowledge can occasionally appear to be highly intelligent. For example, since early 2011 the champion of a highly complex quiz game, Jeopardy, is a machine (IBM’s Watson). Restricting ourselves to biological life, the central dogma of molecular biology states that biological information is transferred from deoxyribonucleic acid (DNA), inherited from past generations, via ribonucleic acid (RNA) to synthesized proteins [6], see Figure 1.1. Although this description is highly simplified, it still serves as


Figure 1.1: The central dogma of molecular biology

The figure shows the central dogma of molecular biology for eukaryotic life, where information is passed down from the inherited genes, DNA, via mRNA to proteins. The gene contains a promoter region followed by the first exon containing a start codon (ATG) and further introns and exons. The exons are collected through a process called transcription and capped with a 5’-cap while a poly-A 3’ tail is added. Finally, the mRNA is translated into protein sequence in a ribosome and occasionally further modified before becoming a functional protein. This figure was adapted from wikipedia.org, used with permission.

a good introduction to molecular biology. Two major groups of life that are also important to this introduction are prokaryotes (bacteria and archaea) and eukaryotes (all other cellular life), as there are slight differences between the two in the journey from DNA to protein.


1.2 DNA and RNA

Deoxyribonucleic acid, DNA, consists of four rather simple molecules, called nucleotides or bases (short for nucleobases): adenine (A), cytosine (C), guanine (G) and thymine (T). These form long double-stranded strings, a structure referred to as a double helix. The two strands are complements of each other; thus each strand on its own holds the complete genetic information encoded in these four molecules, see Figure 1.2. These long molecules can be interpreted as a sequence of the letters A, C, G and T, which is by far the most common way for a bioinformatician to think of DNA. The genome of an organism, i.e. its DNA, can be considered the blueprint of the organism, from which all parts are built. Different organisms have different amounts of DNA, ranging from as few as 160 kilobases in Carsonella ruddii [7] (less than 1/10th of the information stored in a normal digital camera picture), via the human genome of 3.3 gigabases [8], to the approximately 50 times larger genome of the Paris japonica plant. In contrast, the genome of the common cold virus, Human Rhinovirus, consists of only 7.5 kilobases1, while the smallest viruses2 can be as small as 220 bases [9] (approximately the same number of characters as in this sentence). The structure of DNA was first described in full by James D. Watson and Francis Crick in 1953, who were awarded the 1962 Nobel Prize in Physiology or Medicine [10]. Once they found out that DNA is double stranded, Watson and Crick instantly realized that the information being encoded twice also suggests a method for how DNA can be copied. In fact, in the article they wrote “It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material” [10]. It turned out they were correct and that DNA is replicated by separating the two strands and letting each of the two strands serve as the template for a new strand, see Figure 1.3.
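Since DNA is routinely treated as a plain string of the letters A, C, G and T, the complementary strand can be derived with a few lines of code. The sketch below is purely illustrative (the function names are my own and not part of any software described in this thesis); it reproduces the strand pairing shown in Figure 1.2.

```python
# A minimal sketch of handling DNA as a string of the letters A, C, G and T.
# Illustrative only; function names are not from any specific software package.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(seq: str) -> str:
    """Return the complementary base for every position, in the same order."""
    return "".join(COMPLEMENT[base] for base in seq.upper())

def reverse_complement(seq: str) -> str:
    """Return the complementary strand read in its own 5' -> 3' direction."""
    return complement(seq)[::-1]

if __name__ == "__main__":
    fragment = "ACTG"                      # the left strand in Figure 1.2
    print(complement(fragment))            # TGAC
    print(reverse_complement(fragment))    # CAGT, the paired strand in Figure 1.2
```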

The second major macromolecule central to the dogma of molecular biology is ribonucleic acid, RNA. While RNA is structurally highly

1 Rhinovirus genomes are composed of single stranded RNA (ssRNA). 2 Viroids, viruses of plants consisting of just RNA, i.e. lacking a protein coat.


Figure 1.2: A 2D molecular model of a DNA fragment

The figure shows two paired four nucleotide long DNA-fragments (5’-ACTG-3’ on the left paired with 5’-CAGT-3’ on the right). The DNA backbone is highlighted with bold black sticks and the gray shaded arrows indicate the 5’ → 3’ reading direction of DNA. The 1’ to 5’ notation, in the top left nucleotide (adenine), denotes the carbons in the backbone. Adenine and Thymine (A–T) form two hydrogen bonds while Guanine and Cytosine (G–C) form three. This gives G–C rich regions higher stability compared to A–T regions. This figure was adapted from wikipedia.org, used with permission.


Figure 1.3: DNA replication model

The figure describes the DNA replication event, which involves several DNA enzymes. The double helix is unwound by topoisomerase, the strands are separated by DNA helicase, exposing the strands, and new strands are built upon the two template strands by DNA polymerases. DNA polymerase can only work in a 5’ → 3’ direction, which makes the replication straightforward for the 3’ → 5’ template strand (leading strand). The replication at the 5’ → 3’ template strand (lagging strand) is instead performed in small steps that leave discontinuous segments, which are finally joined by DNA ligase. This figure was adapted from wikipedia.org, used with permission.

similar to single-stranded DNA, there are two major differences: (a) RNA is composed of ribose instead of deoxyribose (which is a ribose that lacks an oxygen atom) and (b) RNA has replaced the DNA nucleobase thymine (T) with uracil (U). Unlike DNA, RNA is often single stranded and much shorter than DNA in most biological contexts. Furthermore, the additional oxygen atom (as deoxyribose is replaced by ribose) makes RNA less stable than DNA.

Although certain aspects of the “language” of DNA are still a mystery to scientists, many features of DNA are known. For instance, some portions of the DNA are used to produce proteins and functional RNA molecules. These regions are genes, and they are often composed of a promoter region

followed by exons and introns3, see Figure 1.1. The introns and exons are copied into a message carrier molecule called messenger-RNA (mRNA, which is a single RNA strand). In eukaryotes, the introns are then excised, referred to as splicing, from the mRNA molecule (called pre-mRNA prior to splicing), the mRNA is capped at the 5’-end and a poly-A tail is added, see Figure 1.1. Although the product of each exon is fixed, a single gene can still produce several proteins through a process called alternative splicing. Alternative splicing means that different sets of exons can be put together into a finished mRNA, which significantly alters the resulting protein, exemplified in Figure 1.4. As a consequence of the relative instability of RNA, mRNA is degraded in the cell within anything from seconds to a few days. Thus, the stability of a

Figure 1.4: DNA splicing

The illustration shows how alternative splicing can produce two protein isoforms from the same gene. Each bar represents an exon, where the paths between them show the two different splicing possibilities in this example. The bottom shapes represent the different three-dimensional (3D) folds the exons take in their final native structure. This figure was adapted from wikipedia.org, used with permission.

3 True for eukaryotes (e.g. animals), while prokaryotes (bacteria) lack introns.

particular mRNA affects to some extent the amount of protein that will be expressed from that particular mRNA molecule.

RNA is not only used as a message carrier to enable the synthesis of new proteins, but also folds into many functional structures in the cell. A few examples of other uses of RNA are ribosomal RNA (rRNA) and transfer RNA (tRNA), both of which are essential structures used in the synthesis of protein within the cell, RNA interference (RNAi) or small interfering RNA (siRNA), and small nuclear RNA (snRNA), left-over introns from splicing which often regulate the amount of protein produced.

1.3 Proteins

Proteins are key components of all life and serve many functions in the living cell. A large portion of the proteins in the body serve as amazing nano-machines that perform astonishing tasks at blazing speed, either on their own or in larger complexes where several different proteins work together. These proteins are often referred to as enzymes, as they catalyze biochemical reactions and enable the chemistry that keeps us alive. Other proteins serve as building blocks or perform mechanical functions, for example actin and myosin, which enable contraction in our muscles. Proteins also help defend us (immune responses) and are used in cell signaling/adhesion and adhesion to foreign objects (for example to help barnacles stick to the sea bottom or, to much dismay in the marine industry, the bottom of boats). Finally, proteins may also serve as an important food and energy resource.

The cell builds its proteins in micro-molecular factories called ribosomes. The ribosome is composed of rRNA (ribosomal RNA) folded into structural units as well as proteins. The prokaryote ribosome tends to be slightly smaller (20 nm) and to be composed of RNA to a higher degree (65%), compared to the eukaryote ribosome (25-30 nm and approximately 50/50 RNA/protein). The ribosome interprets the messenger-RNA and translates it into an amino acid residue chain (i.e. a polypeptide chain or protein) with the help of transfer-RNA molecules [11]. In the ribosome, each set of three nucleotides, called a codon, is recognized by a corresponding tRNA molecule that carries a specific amino acid, which thus


Figure 1.5: Protein translation in the ribosome

The figure illustrates how the messenger-RNA (mRNA) is interpreted and translated into a protein (amino acid chain) in the ribosome. Each amino acid is carried by a transfer-RNA (tRNA) molecule which matches the exposed codon in the ribosome at the A-site. The amino acid is bound to the amino acid chain which is being synthesized and moved into the P-site. The tRNA is disassociated from the synthesized protein and released by the ribosome (E-site, not shown). This figure was adapted from wikipedia.org, used with permission.

determines the exact amino acid sequence, see Figure 1.5. There are 4^3 = 64 possible nucleotide triplets (codons) but only 20 different4 amino acids plus STOP normally coded for in the ribosome. Consequently, some of these codons must code for the same amino acid (referred to as degeneracy). While some minor differences occur, most organisms use the same codon to amino acid translations, see Table 1.1.

4 Usually, there are twenty-one different amino acids incorporated by the ribosome (thus not counting post-translational modifications); however, the extra amino acid, selenocysteine, is not directly coded for but is incorporated when a specific larger mRNA structure is found in place of the UGA codon (one of the three STOP codons) [95].


Table 1.1: The genetic code

                              Second base
First base      U              C              A              G
    U       UUU Phe/F      UCU Ser/S      UAU Tyr/Y      UGU Cys/C
            UUC Phe/F      UCC Ser/S      UAC Tyr/Y      UGC Cys/C
            UUA Leu/L      UCA Ser/S      UAA STOP       UGA STOP
            UUG Leu/L      UCG Ser/S      UAG STOP       UGG Trp/W
    C       CUU Leu/L      CCU Pro/P      CAU His/H      CGU Arg/R
            CUC Leu/L      CCC Pro/P      CAC His/H      CGC Arg/R
            CUA Leu/L      CCA Pro/P      CAA Gln/Q      CGA Arg/R
            CUG Leu/L      CCG Pro/P      CAG Gln/Q      CGG Arg/R
    A       AUU Ile/I      ACU Thr/T      AAU Asn/N      AGU Ser/S
            AUC Ile/I      ACC Thr/T      AAC Asn/N      AGC Ser/S
            AUA Ile/I      ACA Thr/T      AAA Lys/K      AGA Arg/R
            AUG Met/M      ACG Thr/T      AAG Lys/K      AGG Arg/R
    G       GUU Val/V      GCU Ala/A      GAU Asp/D      GGU Gly/G
            GUC Val/V      GCC Ala/A      GAC Asp/D      GGC Gly/G
            GUA Val/V      GCA Ala/A      GAA Glu/E      GGA Gly/G
            GUG Val/V      GCG Ala/A      GAG Glu/E      GGG Gly/G

The cell translates triplets of nucleotides (codons) into amino acids, the building blocks of proteins. The AUG codon both codes for Methionine and serves as an initiation site for mRNA translation (START). Three of the sixty-four combinations code for “STOP” (UAA, UAG, UGA), which causes the translation of the mRNA into amino acid sequence to be halted and the mRNA and the finished protein to dissociate from the ribosome.
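As a concrete illustration of Table 1.1, the sketch below translates an mRNA string into an amino acid sequence using the standard genetic code. It is a minimal, illustrative example (it starts at the first AUG and ignores special cases such as selenocysteine) and is not part of any software described in this thesis.

```python
# A minimal sketch of mRNA-to-protein translation using the standard genetic code
# (Table 1.1). Illustrative only; exceptions such as selenocysteine are ignored.

GENETIC_CODE = {
    "UUU": "F", "UUC": "F", "UUA": "L", "UUG": "L",
    "UCU": "S", "UCC": "S", "UCA": "S", "UCG": "S",
    "UAU": "Y", "UAC": "Y", "UAA": "*", "UAG": "*",
    "UGU": "C", "UGC": "C", "UGA": "*", "UGG": "W",
    "CUU": "L", "CUC": "L", "CUA": "L", "CUG": "L",
    "CCU": "P", "CCC": "P", "CCA": "P", "CCG": "P",
    "CAU": "H", "CAC": "H", "CAA": "Q", "CAG": "Q",
    "CGU": "R", "CGC": "R", "CGA": "R", "CGG": "R",
    "AUU": "I", "AUC": "I", "AUA": "I", "AUG": "M",
    "ACU": "T", "ACC": "T", "ACA": "T", "ACG": "T",
    "AAU": "N", "AAC": "N", "AAA": "K", "AAG": "K",
    "AGU": "S", "AGC": "S", "AGA": "R", "AGG": "R",
    "GUU": "V", "GUC": "V", "GUA": "V", "GUG": "V",
    "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "GAU": "D", "GAC": "D", "GAA": "E", "GAG": "E",
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",
}

def translate(mrna: str) -> str:
    """Translate an mRNA string, starting at the first AUG and halting at a STOP codon."""
    start = mrna.find("AUG")
    if start == -1:                    # no start codon: nothing is translated
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        residue = GENETIC_CODE[mrna[i:i + 3]]
        if residue == "*":             # STOP codon: translation is halted
            break
        protein.append(residue)
    return "".join(protein)

print(translate("GGAUGCCUAAAUGA"))     # MPK: Met-Pro-Lys, then the UGA STOP codon
```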

As mentioned, proteins are amino acid polymers, i.e. chains of amino acid residues. These amino acids are linked together by a covalent chemical bond called the peptide bond5. The properties of the finished protein are essentially determined by its 3D structure, referred to as its fold. In turn, the fold of the protein is mainly determined by the order and properties of the amino acids. The process of transcribing DNA into mRNA and translating mRNA into an amino acid sequence is well understood and also easily predicted. Unfortunately, it has proven very difficult to

5 Forming the bond causes the release of water, hence occasionally referred to as dehydration synthesis.

accurately predict protein 3D structure from sequence and also to predict the function of a protein given its 3D structure, as further described in 1.5.2.

1.4 Mutations and evolution

The word mutation is common to most people and thoughts often go to things such as nuclear radiation followed by birth defects, cancer or perhaps even Godzilla and giant earthworms. While mutations are often bad for the particular individual or cell, they are crucial for the emergence and diversification (evolution) of life. As it turns out, a new mutation can impose anything from no effect at all to a very dire effect for the individual, or sometimes even have a positive effect. Mutations can even have both a positive and a negative effect at the same time. For example, a glutamic acid (E) to valine (V) amino acid substitution (a nucleotide point mutation) in a globin gene causes the disease sickle-cell anemia in humans6, while it can also provide significant protection against malaria [12,13]. Mutations may occur as errors during DNA replication, often in connection with various types of DNA damage. DNA damage most often occurs due to environmental stress factors such as UV light, free radicals or reactive chemicals. Fortunately, most DNA damage and replication errors are corrected by repair systems that protect the integrity of the cellular DNA. If these systems are damaged or disabled, mutations arise too fast, resulting in programmed cell death (apoptosis) or occasionally cancer. Mutations can either be somatic or passed on to future generations (germ line mutations). Evolution is a process enabled by mutations over time, passed on across successive generations, which over millions of years has populated the Earth with an impressive range of diversity. The idea that the species of the Earth have arisen through evolution is often referred to as Darwinism, after Charles Darwin, who postulated that the evolution of species arises from natural selection [14]. However, more recently scientists have begun to adopt a more complex picture of evolution. For example, a major player omitted by Darwin is the virus.

6 Humans have two copies (alleles) of most genes; if both copies carry the mutation, the person suffers from sickle-cell anemia, while if only one copy carries the mutation, the protection against malaria is obtained.


Viruses and other pathogens have always exerted evolutionary pressure on life, occasionally reducing population sizes to just a few resistant (or lucky) individuals [15]. Viruses also play an important role in non-inherited gene transfer, so-called horizontal gene transfer, as viruses can ferry genetic material between or within species [16]. These additions to the theory may help explain how completely new traits can appear in a short time, sometimes referred to in popular science and film as an evolutionary leap.

1.5 Bioinformatics

Bioinformatics tries to make sense of all these intricate systems and dependencies through analyses of biological information. More formally, bioinformatics is the application of computer science and information technology to the field of biology and medicine. In other words, bioinformatics is a very broad field. Bioinformatics entails sequence analysis, genome annotation, evolutionary biology, text mining (literature analysis) and much more, using algorithms, machine learning, information theory, statistics, modeling, simulations and many other techniques. In this section, a few of these will be outlined.

1.5.1 Sequence analysis

Sequence analysis is one of the major fields of bioinformatics and most of the work performed in this thesis falls into this category. As we touched upon in 1.1, the central dogma of molecular biology [6] takes us from DNA to protein, see Figure 1.1. Both DNA/RNA and protein can be described using a sequence of letters, one for each type of molecule (nucleotides or amino acids) in the chain. Sequence analysis is used to derive features, function, structure, or evolutionary traits from sequences.

The principle of analysis is the same for DNA, RNA and protein and is, in all three cases, based on evolution. Simply put, preserved sequence is a sign of important/preserved biological function and structure. Most of the analysis is carried out through comparing sequences against each other or with models (based on sequences). To be able to compare sequences we need to define a measurement of distance. Since biological sequences evolve through mutations, distance is often defined as the set

of most probable mutations that have given rise to two separate sequences from a common ancestor. The technique of finding this set of mutations is called sequence alignment and is more thoroughly explained in section 2.2.

Sequence analysis may also involve other models where sequence comparisons are not central. One example is gene prediction, where the sequence is scanned for gene features such as promoter regions, introns and exons. Another example is RNA fold recognition, where structures are formed through nucleotide base-pairing.

1.5.2 Structure analysis

Many reactions within the cell are governed by molecular structure, and hence it is very difficult to perform certain types of analysis on sequences without also inferring structure. For instance, we mostly cannot determine how proteins interact with each other or with ligands, co-factors etc. from sequence analysis alone7. Obtaining a protein structure is now on most occasions relatively easy8, using X-ray diffraction, once a protein crystal is formed. Unfortunately, it has proved difficult to crystallize many interesting proteins, for instance most membrane-bound proteins [17]. As a consequence, there are many more known proteins than protein structures, see Figure 1.6. As seen from the figure, the gap between known protein sequences and structures is immense and it is growing. Filling this gap is an ongoing effort, both by X-ray crystallography and by bioinformaticians [17,18,19]. Fortunately, some information regarding the structure of known proteins can be transferred to sequentially similar proteins. The transfer of structural information from a protein of common origin (a homolog) is called homology modeling. Homology modeling is based on the fact that homologous proteins often share a highly similar structure. As a consequence, given a protein without a determined structure and a related protein whose structure has been determined, the structure can be calculated. While it

7 To be precise, information cannot be obtained without inferring information from a related protein.
8 It can for some proteins be very hard and for instance require a re-crystallization of a clone with an introduced electron-dense metal atom.


Figure 1.6: The increase of biological sequences and structures

The figure shows the increase of available biological sequences and structures in UniProtKB, PDB and SwissProt, as well as the ratio between UniProtKB and PDB, over time. The UniProtKB data is derived from the integration date (the date a sequence was added to the database) of the sequences in TrEMBL and SwissProt, downloaded (flatfile) October 1st, 2012. The PDB data is collected from the PDB website, October 1st, 2012.

is possible to obtain a valid structure just using the underlying chemical/physical properties that determine fold (often called de novo protein structure prediction), it is very difficult. Some of the best methods utilize both structural information available from known structures as well as de novo energy minimization calculations, for example Rosetta/FoldIt [20,21]. Nonetheless, the best methods available to date rarely predict a correct structure for proteins of much more than 100 amino acids. On the other hand, with the increasing coverage of known protein structures through X-ray crystallography, the success of these methods increases. Furthermore, these methods are continuously being improved and are likely to be the only viable option for closing the gap between known proteins and structures [17,22].

1.5.3 Text mining

Another field of bioinformatics is text mining. The purpose of text mining is to extract and compile knowledge from a natural language format, potentially spread among thousands of articles. Through

databases such as PubMed [23] there is an ever-increasing amount of information freely available. For example, as of October 1st, 2012 there are 46,323 bioinformatics articles where the full text is freely available in PubMed (search term: bioinformatics, with the “Free full text available” filter enabled). To successfully extract meaningful data from vast amounts of text, automated methods are crucial [24]. The major challenge is the processing of natural language, which is intrinsically incompatible with structured and ordered data. However, through advances in the identification of biological entities such as protein and gene names, using so-called ontologies [25], much information can be gained [26,27,28].

1.6 DNA sequencing

Although the structure of DNA was discovered in the early 1950s [10], some time passed before scientists were able to determine the sequence letters of DNA, i.e. the order of nucleotides in the structure. The first real DNA sequencing started in the early 1970s. In 1973, Gilbert and Maxam sequenced 24 basepairs (bp) using a painstaking method known as wandering-spot analysis [29]. However, in the mid-1970s two more efficient DNA sequencing methods emerged; the first by Maxam and Gilbert [30] and the second by Frederick Sanger and co-workers [31,32]. Soon the so-called Sanger sequencing (or chain termination sequencing) strategy became the favored method, as it was reliable and easy to use [32]. The first full DNA genome (the bacteriophage phi X174) was completed in 1977, by Sanger and his co-workers [33].

Over the years, the chain-termination-based method was continuously used and developed further, gaining automation and some degree of parallelization, and it remained the method of choice for several decades. For example, as of 1999, new fully automated machines9 had just become available which enabled sequencing of 96 sequences in parallel in only 2 hours [34].

9 The MegaBACE 1000 DNA Sequencing System from Amersham Pharmacia/Molecular and the 3700 DNA Analyzer from Perkin-Elmer Biosystems


Figure 1.7: The rapid drop of sequencing costs

The rapid drop of sequencing cost poses huge future problems in data storage and data management. The graph shows the cost of sequencing one megabase pair (y-axis, logarithmic scale) from September 2001 to January 2012 (x-axis). A reference line declining according to Moore’s law (i.e. the price cut in half every 18 months), which describes the price drop of computing, is also shown.

In 2001, the first draft of the Human Genome was published [35,36], several years earlier than initially anticipated. A major breakthrough was the use of the so-called shotgun sequencing methodology [37,38], where the DNA sequence is fragmented randomly prior to sequencing. Overlapping fragments are subsequently assembled together into the original DNA sequence with the aid of computers, see section 2.3.

In 2005, fueled by the sequencing success of the Human Genome and increasing demand for cheap sequencing, the first so-called “next generation sequencing” platform, 454 sequencing, was launched [39]. The next-generation methods provide massively parallel sequencing of, nowadays, millions of fragments at a time. The major next generation sequencing methods are 454 sequencing [39,40,41], Illumina (Solexa) sequencing [42], and (ABI) SOLiD sequencing [43]. However, in recent years many more new inventive techniques have emerged, for example ion semiconductor sequencing [44] and so-called single molecule techniques providing increased read lengths.


Figure 1.8: A 454 sequencing flowgram

The flowgram shows the recorded flow peak value for each flowed base (TACG cycled). Peaks are called using a maximum likelihood estimate, i.e. peaks with a peak value between 0.5 and 1.5 are called as a 1-mer, peaks between 1.5 and 2.5 as a 2-mer, etc. A typical peak call is shown where a 1.75 (C), a 0.96 (G) and a 0.96 (T) would correspond to “CCGT”. As each homopolymer (an n-mer of the same base, e.g. “CC”) length is estimated from the peak intensity shown in the flowgram, read inaccuracies in 454 data are often due to incorrect homopolymer estimation.

The continuous and rapid drop in sequencing cost, see Figure 1.7, has enabled and will continue to enable many previously unimaginable studies. However, the progress also poses huge challenges for bioinformaticians, in terms of storing, analyzing and communicating all the sequencing data produced, as depicted in Figure 1.7 where the cost is compared to Moore’s law10.

1.6.1 454 sequencing

454 sequencing has for several years been one of the major next generation sequencing platforms. It is also the method used for the studies in this thesis. Roche launched the 454 sequencing platform in 2005, with the

10 Moore’s law states that computing and storage capacity doubles every 18 months (or, analogously, that the cost of computing/storage decreases).


GS20 instrument. The GS20 instrument produced around 20 million bases (Mb) per run [39], which, at the time, was a huge leap in performance over Sanger sequencing-based techniques. Since the launch, 454 sequencing has continuously been improved, and today the GS FLX Titanium XL+ instrument produces approximately 1 million reads of ~700 bp/read (i.e. on average 700 Mbp/run) in 23 hours [40,41]. 454 sequencing is a pyrosequencing11-based method. In brief, nucleotide reagents corresponding to detection of each nucleotide (thymine, adenine, cytosine and guanine) are repeatedly cycled over each single-stranded DNA fragment being sequenced. The flowed base reacts with the single DNA fragment in the synthesis of a complementary strand. The intensity of each flowed nucleotide reagent is recorded as a so-called flowpeak and collected in a flowgram [39], see Figure 1.8. Since pyrosequencing techniques employ no chain terminators, several nucleotides may react at a place where a nucleotide is repeated, for instance “AAA” (a homopolymer). As a consequence, the length of the homopolymer must be estimated from the peak intensity [39], which is different from most other methods.

11 Originally developed in Sweden, by Pål Nyrén and Mostafa Ronaghi [96]
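As a rough illustration of how a flowgram is turned into a read (cf. Figure 1.8), the sketch below simply rounds each flowpeak to the nearest integer homopolymer length. The flow values are made up, and real base callers use considerably more elaborate statistical models than this.

```python
# A minimal sketch of flowgram base calling for 454-style data (cf. Figure 1.8).
# Each flowpeak is rounded to the nearest integer homopolymer length; this is
# illustrative only and not the base caller used on real instruments.

FLOW_ORDER = "TACG"  # the nucleotide reagents are cycled in this order

def call_flowgram(flowpeaks):
    """Convert a list of flowpeak intensities into a nucleotide sequence."""
    sequence = []
    for i, peak in enumerate(flowpeaks):
        base = FLOW_ORDER[i % len(FLOW_ORDER)]
        homopolymer_length = int(round(peak))   # e.g. 1.75 -> 2, 0.96 -> 1, 0.1 -> 0
        sequence.append(base * homopolymer_length)
    return "".join(sequence)

# Made-up flowgram: T = 0.1, A = 1.1, C = 1.75, G = 0.96, T = 0.96
print(call_flowgram([0.1, 1.1, 1.75, 0.96, 0.96]))  # ACCGT
```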



2 Materials and Methods

The resources, methods and algorithms used in the thesis will be discussed in this chapter. Methods for pairwise sequence alignment and database searches will be discussed in more detail, as these methods and algorithms are essential to this thesis.

2.1 Biological databases

Biological databases are important resources for any bioinformatician, but even more important for people like me, doing sequence analysis. Fortunately, a lot of these databases are public and freely available. The shape and quality of these databases range from unstructured, poorly annotated sequence repositories to highly curated databases with verified content. As one would expect, there is also often an inverse correlation between the quantity and the quality of the databases.

The two major repositories of nucleotide sequences are EMBL of the European Bioinformatics Institute (EBI) [45] and GenBank of the National Center for Biotechnology Information (NCBI) [46]. As of October 2012 there were approximately 158 million sequences in GenBank (v. 192) and 252 million in EMBL (v. 113). These databases are very large and are also often unnecessarily cumbersome to search in. As a consequence, most searches performed with a nucleotide query involve a smaller nucleotide database. NCBI provides such a database for their BLAST search tool [47,48] (see section 2.2.4). NCBI NT [46] contains data from GenBank, EMBL and DDBJ [49] while excluding a lot of material such as shotgun sequencing data which may be highly redundant and non-annotated.

As with nucleotides, there are two major repositories for protein sequences, maintained by EBI and NCBI. UniProtKB [50], provided by EBI, is the product of a collaboration between EBI, the Swiss Institute of Bioinformatics (SIB, which founded SwissProt [51]) and the Protein Information Resource (PIR). UniProtKB is divided into two subsections, SwissProt and TrEMBL [50]. SwissProt, launched in 1986, is one of the oldest sequence databases in bioinformatics. SwissProt

contains only high-quality data, i.e. manually reviewed and curated information, and is considered highly reliable. The much larger TrEMBL, on the other hand, contains (mostly) non-reviewed, computer-generated translations from EMBL. As a consequence, TrEMBL is much larger than SwissProt; in October 2012, TrEMBL contained approximately 24 million sequences, compared to the 530 thousand sequences of SwissProt. The major repository for protein sequences maintained by NCBI is NR [46]. NR is non-redundant (identical entries are clustered and grouped) and contains entries from SwissProt, PIR [52], PDB [1] and a few more, as well as translated entries from GenBank. In comparison, as of October 2012 it contained close to 21 million sequences. NR is the default protein database when using the protein version (BLASTp) of the popular BLAST search tool.

2.2 Pairwise sequence alignment

A very central concept in the field of sequence analysis, and thereby the whole field of bioinformatics, is pairwise sequence alignment. It is basically, just as it sounds, the matching of two sequences, searching for an optimal fit given an alignment model, see Figure 2.1. The sequences are matched together by allowing three types of modifications: substitutions, insertions and deletions. These modifications are considered as they correspond to evolutionary events, where DNA can be deleted or inserted (moved), or where read errors can occur (substitutions of a nucleotide). The sequence alignment in its simplest form reflects just a measurement of distance between two sequences. However, matched positions in the alignment should ideally correspond to equivalent positions in the two sequences, for example the same structural position in a protein, which thus imposes similar effects on properties such as activity, affinity, stability, membrane integration etc. The alignment problem is an optimality problem (find the best match) and is solved exactly by using a mathematical optimization strategy known as dynamic programming12 [53].

12 “Programming” here refers to finding an optimal program/plan and thus has nothing to do with computer programming.


Figure 2.1: An alignment example (BLAST)

Query= gi|344030298|gb|AEM76814.1| polyprotein, partial [Human rhinovirus C35] (2138 letters)

Searching...... done

>gi|145208674|gb|ABP38409.1| VP1 [Human rhinovirus QPM] Length=275

Score = 375 bits (962), Expect = 9e-122, Method: Compositional matrix adjust. Identities = 175/273 (64%), Positives = 213/273 (78%), Gaps = 2/273 (1%)

Query 565 NPVETFTEEVLKEVLVVPNTQPSGPSHTVRPTALGALEIGASSTAIPETTIETRYVINNH 624 NPVE F E LKEVLVVP+TQ SGP HT +P ALGA+EIGA++ PET IETRYV+N++ Sbjct 1 NPVEEFVEHTLKEVLVVPDTQASGPVHTTKPQALGAVEIGATADVGPETLIETRYVMNDN 60

Query 625 VNNEALIENFLGRSSLWANLTLNSSGFVRWDINFQEQAQIRKKLEMFTYARFDMEVTVVT 684 N EA +ENFLGRS+LWANL L+ GF +W+INFQE AQ+RKK EMFTY RFD+E+T+VT Sbjct 61 TNAEAAVENFLGRSALWANLRLDQ-GFRKWEINFQEHAQVRKKFEMFTYVRFDLEITIVT 119

Query 685 NNRGLMQIMFVPPGAPAPSTHNDKKWDGASNPCVFYQPKSGFPRFTIPFTGLGSAYYMFY 744 NN+GLMQIMFVPPG P + ++WD ASNP VF+QP SGFPRFTIPFTGLGSAYYMFY Sbjct 120 NNKGLMQIMFVPPGITPPGGKDGREWDTASNPSVFFQPNSGFPRFTIPFTGLGSAYYMFY 179

Query 745 DGYDETNPNSVSYGTTIFNDMGKLCFRALEDTEQQTIKVYIKPKHISTWCPRPPRATQYV 804 DGYD T+ +++YG ++ NDMG LCFRAL+ T IKV+ KPKHI+ W PRPPRATQY+ Sbjct 180 DGYDGTDDANINYGISLTNDMGTLCFRALDGTGASDIKVFGKPKHITAWIPRPPRATQYL 239

Query 805 HKHSPNYHV-NIGETKELTERHYLKPRDDITTV 836 HK S NY+ + EL +H+ K R DIT++ Sbjct 240 HKFSTNYNKPKTSGSTELEPKHFFKYRQDITSI 272

The figure shows an example alignment (protein BLAST, BLASTp) between the polyprotein of Human Rhinovirus C, subtype 35 [93] and the VP1 protein of Human Rhinovirus C, subtype 3 [94]. As seen in the figure, a number of alignment metrics are typically reported, such as the score and the associated E-value, the number of identical letters, the number of gaps etc. Percent identity or E-value is often used as a measurement of sequence similarity. For a more detailed description of the E-value, see section 2.2.5.

2.2.1 Dynamic programming

The term dynamic programming was originally coined by Richard Bellman in the 1950s to describe a method for optimal planning as a series of “best decisions” [53]. The strength of dynamic programming stems from subdividing a problem into a number of smaller subproblems which can each be solved easily. The use of dynamic programming in a general context can be illustrated by the problem of finding an optimal path through a directed (non-circular) graph, see Figure 2.2. In the figure, the optimal path problem is subdivided into the problem of finding the optimal path from “START” to each node. Each node can be calculated as the


Figure 2.2: Finding the optimal path in a graph (dynamic programming)

The illustration shows how the optimal path in a graph can be found by the use of dynamic programming. The problem is subdivided into finding the optimal path to each node; dynamic programming thus partitions this particular problem into one subproblem per node. The cost of each edge is given, the summed cost to each node is shown in parenthesis, and the edges constituting the optimal path are highlighted with bold arrows.

minimum of the cost of walking to previous neighboring nodes (neighbors with a directed edge to the node being calculated) plus the cost of walking from that neighbor to the node. If the nodes are solved in the correct order, finding the optimal path becomes trivial. In this particular case, dynamic programming partitions the problem into one subproblem for each node. Thus, if there are n nodes there will be n subproblems, and solving them all is O(n)13.

2.2.2 Pairwise sequence alignment In 1970, Needleman and Wunsch published their paper detailing the alignment of two sequences and a computational method for solving it, using dynamic programming [54]. This has later been referred to as global alignment since the sequences are matched over the their entire length. The problem of finding an optimal alignment betweeen sequence

13 This is called the “big O notation” and means that the time complexity of the algorithm is linear to the number of sub-problems, .


Figure 2.3: A global alignment of two nucleotide sequences

The dynamic programming matrix calculated for a global alignment between the sequences “CAGATC” (y-axis) and “CTTAGCTG” (x-axis), using the scoring model match = 4, mismatch = –2 and a fixed gap penalty (insertion/deletion) of 3. Arrowheads represent which path to the cell resulted in the maximum score (trace-back) and the path marked by (red) bold cells and bold arrowheads shows the optimal path. Some cells have several equally optimal paths to them; however, the globally optimal path is in this case unique.

and b is divided into the subproblem of finding an optimal alignment up to position i in a and position j in b. The subproblem is then solved by the best path from ‘neighbouring’ positions, i.e. (i–1, j) (up), (i, j–1) (left) and (i–1, j–1) (diagonal). The dynamic programming problem is then solved by setting up a matrix of size (m + 1) by (n + 1), where m and n are the lengths of the two sequences and each matrix cell corresponds to a subproblem. An example

alignment can be seen in Figure 2.3. The figure shows the score calculated for each cell in the dynamic programming matrix. Arrowheads also note which path to the cell was the optimal one (there can be several), often referred to as the trace (information). By using the trace, the alignment can be constructed in reverse by following the trace from the bottom-right corner of the matrix. The optimal alignment is also shown in the figure (bottom) using the typical notation format for nucleotide sequences, where a pipe character (|) indicates a match and a dash (–) indicates a gap. The matrix cell score, S(i, j), is defined according to equation 2.1, where g is the gap penalty and s(a_i, b_j) is a match function which returns the score of matching the i:th letter in a with the j:th letter in b.

\[
S(i,j) = \max
\begin{cases}
S(i-1,\,j-1) + s(a_i, b_j) \\
S(i-1,\,j) - g \\
S(i,\,j-1) - g
\end{cases}
\tag{2.1}
\]

If the lengths of the two sequences are m and n, the algorithm solves the problem in O(mn) time (or O(n^2), if they are equally long).
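As an illustration of equation 2.1, the sketch below fills the dynamic programming matrix for the scoring model of Figure 2.3 (match = 4, mismatch = -2, gap = 3) and returns the optimal global score. It is a minimal, illustrative implementation (the trace-back needed to output the actual alignment is omitted) and not the implementation used in the papers of this thesis.

```python
# A minimal sketch of global alignment (Needleman-Wunsch) scoring, equation 2.1.
# Scoring model as in Figure 2.3: match = 4, mismatch = -2, fixed gap penalty = 3.
# Only the score matrix is filled in; trace-back is omitted for brevity.

def global_alignment_score(a, b, match=4, mismatch=-2, gap=3):
    m, n = len(a), len(b)
    # S[i][j] = best score of aligning a[:i] against b[:j]
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] - gap            # leading gaps in b
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] - gap            # leading gaps in a
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s,  # match/mismatch (diagonal)
                          S[i - 1][j] - gap,    # gap in b (up)
                          S[i][j - 1] - gap)    # gap in a (left)
    return S[m][n]

print(global_alignment_score("CAGATC", "CTTAGCTG"))
```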

Although the global alignment algorithm solves the problem exactly, given the scoring model, sequences tend to share local similarities more often than global ones, for instance between two protein domains, while the other parts of the two proteins are not homologous (of common origin). Addressing this issue, the local alignment, as well as a modified dynamic programming algorithm to solve it, was published a decade later14 by Smith and Waterman [55]. To solve the local alignment problem, two modifications are needed.

\[
S(i,j) = \max
\begin{cases}
S(i-1,\,j-1) + s(a_i, b_j) \\
S(i-1,\,j) - g \\
S(i,\,j-1) - g \\
0
\end{cases}
\tag{2.2}
\]

First, a zero is added to the max function, see equation 2.2, and second, the trace is started from the highest-scoring cell instead of the bottom-right corner and terminated early if a zero cell is reached, see Figure 2.4.

14 The paper was published in 1981, incidentally the year I was born.


Figure 2.4: A local alignment of two nucleotide sequences

The dynamic programming matrix calculated for a local alignment between the same two sequences, “CAGATC” (y-axis) and “CTTAGCTG” (x-axis), with the same scoring model: match = 4, mismatch = –2 and a fixed gap penalty (insertion/deletion) of 3. Arrowheads represent which path to the cell was used (trace-back) and the path marked by bold cells/arrowheads shows the optimal path. Note that the trace-back is initiated from the maximum cell score of 10 and followed back until a zero cell is found, at which point the trace is terminated.
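The corresponding change for local alignment (equation 2.2) is small. The sketch below, under the same illustrative scoring model as the global version above, returns the highest local score and again omits the trace-back.

```python
# A minimal sketch of local alignment (Smith-Waterman) scoring, equation 2.2.
# Compared with the global version: border cells start at zero, a zero is added
# to the max, and the result is the highest cell score rather than the corner cell.

def local_alignment_score(a, b, match=4, mismatch=-2, gap=3):
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]   # borders stay at zero
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s,
                          S[i - 1][j] - gap,
                          S[i][j - 1] - gap,
                          0)                    # never go below zero
            best = max(best, S[i][j])
    return best

print(local_alignment_score("CAGATC", "CTTAGCTG"))  # 10, as in Figure 2.4
```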

Yet another year later, in 1982, Gotoh [56] made an amendment adding affine gap cost to the sequence alignment algorithms. Since several letters in biological sequences seem to be inserted or deleted at a time, it should be less costly to group gaps rather than disperse them over the alignment. Gotoh solved this problem by defining the gap cost as a cost of opening a gap (gap open cost) as well as the cost of extending the gap


(gap extension cost). Thus, the total gap cost of a gap of length L became g_open + L · g_ext rather than L · g. In order to solve the dynamic programming matrix, Gotoh added two new variables to the equation, E and F, see equation 2.3.

\[
S(i,j) = \max
\begin{cases}
S(i-1,\,j-1) + s(a_i, b_j) \\
E(i,j) \\
F(i,j) \\
0
\end{cases}
\tag{2.3}
\]

The two variables, E(i, j) and F(i, j), represent the optimal alignment up to position i in a and j in b, ending with a gap in b or in a, respectively. The variables are defined as E(i, j) = max(S(i–1, j) – g_open – g_ext, E(i–1, j) – g_ext) and F(i, j) = max(S(i, j–1) – g_open – g_ext, F(i, j–1) – g_ext)15. The modification of the global alignment algorithm is analogous (not shown) and these two modified algorithms are still today the preeminent methods for solving the two problems exactly. While dynamic programming methods are guaranteed to find the optimal alignment (the exact solution) given the scoring model used, this does not imply that the alignment is biologically relevant, or that the score found is statistically significant.

2.2.3 Scoring models and scoring matrices

As seen, for instance, in Figure 2.3 and Figure 2.4, a scoring model for the alignment must be declared. In these particular examples, the scoring model is made up to make the two sequences seem related; in fact they are two random sequences. For alignment of nucleotide sequences there are a few different models commonly used, mostly depending on the expected similarity between the two sequences aligned, see Table 2.1. The ratio between the match and mismatch scores determines the minimum possible percent identity (for local alignments). For example, if the 1/–2 (match/mismatch) model is used, each mismatch has to be outweighed by two matches to keep the score above zero, and as a consequence the theoretical minimum alignment identity is >66%. This model would be appropriate for alignments with a target identity of 95% [57].

15 Occasionally the cost of opening a gap and extending it to length 1 is described as a single opening cost g_open; the gap cost of a gap of length L then becomes g_open + (L – 1) · g_ext.


Table 2.1: A few nucleotide alignment models available in BLAST

Alignment model   Match/mismatch ratio (minimum identity)   Appropriate identity
1/–3 (5 / 2)      0.33 (75%)                                99%
1/–2 (5 / 2)      0.50 (67%)                                95%
2/–3 (5 / 2)      0.66 (60%)                                90%
4/–5 (12 / 8)     0.80 (56%)                                83%
1/–1 (5 / 2)      1.00 (50%)                                75%

The table shows some of the possible alignment models (match/mismatch scores, with gap opening/extension costs in parenthesis) available when using nucleotide BLAST (BLASTn). The corresponding match/mismatch ratio and the derived minimum possible identity for a local alignment are shown in the second column. The different models are appropriate for different expected identities, as described by States et al. [57], shown in the third and last column.
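The minimum-identity column of Table 2.1 can be recovered with a short break-even argument (this derivation is not spelled out in the text, but follows directly from the scoring model; gaps are ignored): with match score p and mismatch score −q, an ungapped local alignment with k matches and l mismatches keeps a positive score only while kp > lq, so at the break-even point

\[
\text{identity} = \frac{k}{k + l} = \frac{q}{p + q}.
\]

For the 1/–2 model this gives 2/3 ≈ 67%, for 2/–3 it gives 3/5 = 60% and for 4/–5 it gives 5/9 ≈ 56%, in agreement with the second column of Table 2.1.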

For protein alignments, instead of a single match/mismatch score, a substitution scoring matrix is used. A scoring matrix offers the ability to score substitutions differently (in contrast to the nucleotide16 case where all substitutions are assigned the mismatch score). A scoring matrix makes it possible to take into account the various properties of amino acids, and the expected impact a substitution will have on the folded protein. There are various substitution matrices in use, and as with nucleotide alignment models, some matrices are better suited for particular tasks. Still, the BLOSUM62 matrix, see Table 2.2, of the BLOSUM (BLOcks of amino acid SUbstitution Matrix) series of matrices [58], is most often the matrix of choice, and the default choice in most bioinformatic software. The matrix is based on observed substitutions in a multiple alignment (see 2.4) database of conserved ungapped protein families, BLOCKS. The different matrices were calculated from sets that were redundancy-reduced to various degrees. For instance, the BLOSUM62 matrix was calculated from a set where sequences of more than 62% pairwise identity were clustered [58]. Other matrices used are, for instance, PAM (Percentage of Acceptable point Mutations per 10^8 years) [59] and the matrix presented by Gonnet et al. [60].

16 Scoring matrices are occasionally used for nucleotide alignments as well.

27 Materials and Methods

Table 2.2: The BLOSUM62 matrix

     A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  J  Z  X
A    4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1 -1 -1
R   -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1 -2  0 -1
N   -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  4 -3  0 -1
D   -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4 -3  1 -1
C    0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -1 -3 -1
Q   -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0 -2  4 -1
E   -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1 -3  4 -1
G    0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -4 -2 -1
H   -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0 -3  0 -1
I   -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3  3 -3 -1
L   -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4  3 -3 -1
K   -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0 -3  1 -1
M   -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3  2 -1 -1
F   -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3  0 -3 -1
P   -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -3 -1 -1
S    1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0 -2  0 -1
T    0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1 -1 -1
W   -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -2 -2 -1
Y   -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -1 -2 -1
V    0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3  2 -2 -1
B   -2 -1  4  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4 -3  0 -1
J   -1 -2 -3 -3 -1 -2 -3 -4 -3  3  3 -3  2  0 -3 -2 -1 -2 -1  2 -3  3 -3 -1
Z   -1  0  0  1 -3  4  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -2 -2 -2  0 -3  4 -1
X   -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

The table shows the BLOSUM62 substitution matrix [58]. The matrix lists the scores for substituting one amino acid for another. For example, a substitution from Asparagine (N) to Aspartic acid (D) yields 1, while Asparagine to Alanine (A) yields −2. Note also that the match score is not fixed (the diagonal): a match of Asparagine (N) yields 6 while a match of Alanine (A) yields 4. Substitution matrices are symmetrical, and the lower half (under the diagonal) mirrors the upper half. The additional letters B, J and Z denote residues that are indistinguishable using certain techniques (B: asparagine or aspartic acid, J: leucine or isoleucine, Z: glutamine or glutamic acid). X denotes a residue of unknown type.


2.2.4 Heuristics and homology search
Dynamic programming is, compared to full exhaustive testing (brute force), very fast: there are roughly 2^(2n)/√(2πn) ways to align two sequences of length n [53]. For instance, for a sequence length of 100, dynamic programming requires 10^4 cell calculations to find the optimal alignment among the staggering ~10^60 possible combinations. Still, calculating the full alignment between a query sequence and millions of database sequences takes a lot of time. On the other hand, since the result of interest in a database search is most often only the best/significant hits for the particular query, approximations can be used to eliminate the calculation of most alignments. The first such tool was FASTA [61,62] (1985), which builds an index of substrings (k-tuples, where k is the length of the substring) present in a query and investigates alignments involving a cluster of such substrings. The sensitivity vs. speed of the method is obviously dependent on the size of k. Five years later (1990), Altschul et al. published a paper detailing an improved tool called BLAST (Basic Local Alignment Search Tool) [47,48]. The initial BLAST (version 1) did not support gapped alignments, but in 1997 the gapped version of BLAST was published [48]. The gapped version of BLAST (version 2) is still the BLAST version used today. BLAST employs multiple indexing,17 where not just the query substring itself is indexed but also all substrings that, in an alignment, would score more than a threshold score T. For example, given T = 15, the substring PQG aligned to itself (exact match) using BLOSUM62 would score 7 + 5 + 6 = 18 (see the diagonal of Table 2.2). The related substring PEG would score 7 + 2 + 6 = 15 (P vs. P, Q vs. E and G vs. G) if aligned to the query substring PQG, and would therefore also be indexed. BLAST searches for pairs of matches, so-called high-scoring segment pairs (HSPs), sufficiently close together (within a certain distance) and spaced equally in the query and the database sequence.18 BLAST then evaluates the HSPs and extends an alignment from the most promising pairs, using a Smith-Waterman modification. This method achieves the same sensitivity as FASTA, while investigating far less

17 True for a protein query (e.g. BLASTx and tBLASTx), i.e. all but BLASTn.
18 Referred to as being on the same diagonal, as matches are often plotted with query positions on one axis and database positions on the other.

potential alignments, which makes BLAST faster [63]. BLAST has been improved and extended countless times over the years, and the BLAST tools have become the most widely used tools for sequence alignment searches. One of the newer tools that has been added is, for example, MegaBLAST [64]. MegaBLAST performs batch queries to reduce the number of scans over the database and by default indexes longer words (substrings). Furthermore, BLAST does not only support nucleotide-nucleotide (BLASTn) and protein-protein (BLASTp) alignments, but also nucleotide-protein alignments (BLASTx, tBLASTn) using a codon table (see Table 1.1). It can also scan a nucleotide database with a nucleotide query, translating both and comparing them as proteins using a protein scoring model (tBLASTx).
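The neighborhood-word indexing described above can be sketched as follows: for a query word, enumerate all words whose aligned score against it reaches the threshold T. The tiny score table in the sketch is a hand-copied subset of BLOSUM62, and the reduced alphabet and T = 15 are chosen only to reproduce the PQG/PEG example; a real implementation would use the full matrix, all 20 residues and BLAST's own defaults.

```python
from itertools import product

# Hand-picked BLOSUM62 entries, just enough for the PQG example (see Table 2.2).
B62 = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("E", "Q"): 2, ("E", "E"): 5,
    ("Q", "A"): -1, ("A", "Q"): -1, ("A", "A"): 4,
}
ALPHABET = ["P", "Q", "G", "E", "A"]          # toy alphabet, not all 20 residues

def neighborhood(word, T):
    """All words over the toy alphabet whose summed substitution score vs. `word` is >= T."""
    hits = []
    for cand in product(ALPHABET, repeat=len(word)):
        score = sum(B62.get((a, b), -4) for a, b in zip(word, cand))  # -4 for pairs not listed
        if score >= T:
            hits.append(("".join(cand), score))
    return sorted(hits, key=lambda h: -h[1])

print(neighborhood("PQG", T=15))   # [('PQG', 18), ('PEG', 15)]
```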

Both BLAST and FASTA use query indexing, i.e. query substrings (or words) are indexed and the database is parsed for matching words. However, sometimes the complete index of database words is small enough to fit into the primary memory of a computer. If a database index can be used, all matching words can be retrieved by simply looking the query words up in the database index, and parsing the large database is not needed. Tools that employ this method very successfully are, for example, SSAHA [65] and BLAT [66]. Both these tools are generally faster than BLAST and very useful when many queries need to be matched.
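A database index of the kind used by SSAHA and BLAT can be pictured as a hash table from every k-mer in the database to its positions; a query is then looked up k-mer by k-mer without scanning the database sequences. The sketch below is a simplified illustration of that idea only, not the actual data structures or parameters of either tool.

```python
from collections import defaultdict

def build_index(database, k=8):
    """Map every k-mer in the database to a list of (sequence name, offset) positions."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index

def lookup(index, query, k=8):
    """Return k-mer hits for the query, grouped by database sequence and diagonal."""
    hits = defaultdict(list)
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            hits[(name, dpos - qpos)].append(qpos)   # same diagonal => candidate alignment
    return hits

db = {"seqA": "ACGTACGTGGGTTTACGTACGTACCA", "seqB": "TTTTTTTTCCCCAAAAGGGG"}
idx = build_index(db, k=8)
print(dict(lookup(idx, "ACGTGGGTTTACG", k=8)))
```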

2.2.5 E-value: Making sense of the score
As mentioned before, it might be hard to tell whether an alignment is biologically relevant or not. Actually, even a score of, for instance, 100 may be good or bad depending on the context, such as the scoring model and the query/database size. For instance, if a single match is rewarded a score of 100, it would only take a database containing all 20 amino acids or all 4 nucleotides to give at least such a score for any query of at least one letter. And if the database is infinitely large, a perfect match would be found for every query. Addressing this issue, an E-value is used to tell whether an alignment is statistically relevant or not. The E-value declares how frequently a particular score appears by chance, given the alignment model, query and database length [67]. For example, an E-value of 0.01 (1/100) for a score S is interpreted as: 1 in 100 alignments between a random query sequence of equal length and the same

database would render a score of S or greater. Or similarly, if we pose 100 such queries we should expect at least one query to score at least S by chance. Most alignment search tools, such as BLAST and FASTA, provide an E-value for each alignment to help assess whether it is a statistically significant alignment or not. With that said, just because an alignment is statistically significant does not prove that two sequences are homologs or that the alignment is biologically relevant. For instance, evolution occasionally provides similar solutions to a problem, and the observed similarity between two sequences is then due to convergent evolution or just chance, and not due to homology.

2.3 De novo sequence assembly
In the sequencing of the human genome, shotgun sequencing [37,38] was for the first time applied successfully to a large genome [36]. The shotgun sequencing methodology in brief employs:

1. Random fragmenting of the DNA, often through shearing.
2. Sequencing of the fragments.
3. Re-assembly in silico, a process called sequence assembly.

The de novo sequence assembly methodology relies on sequence alignment to find an unambiguous set of overlaps between fragments so that these can form a longer continuous sequence (contig). The sequence assembly problem can be compared to trying to lay a puzzle of millions of pieces where there may be a lot of sky (repetitive motifs), pieces may be damaged and pieces from other puzzles are thrown into the mix. There are several algorithms designed to tackle this problem, for instance algorithms using de Bruijn graphs [68]. De Bruijn graphs are, in brief, built by representing all words of a certain length, k−1, as nodes in a graph. The reads (sequenced fragments) are then mapped as edges connecting these nodes. For instance, given k = 3 and that the k-mer ATG is found in a read, a directed edge is added between the two nodes of length k−1, AT and TG. A path that visits a set of edges in the graph exactly once is then sought and, if recorded, will spell out a continuous sequence (contig). The process is iterated until no more edges are left and all contigs are found [68]. Another type of algorithm often employed is the so-called greedy algorithms, where the most overlapping

reads are assembled first into contiguous sequences (contigs). The contigs are then iteratively extended until no more significant overlaps are found. The choice of algorithm depends on the data, but the ability of de Bruijn graphs to tackle both repetitive elements and millions of short reads makes them the method of choice for handling the data of modern sequencing platforms [68]. There are many different assembly programs which use these two and other algorithms. A few of those which I have worked with during my PhD studies are Newbler (a reference assembly program for 454 sequencing data by Roche), MIRA and the Celera WGA Assembler.

2.4 Multiple sequence alignment
Occasionally, it is also useful to be able to analyze a group of sequences (most often proteins) which one has found to be homologous, for instance by using a search tool such as BLAST. By arranging (aligning) several homologous sequences at once, much information can be gained. The information sought here is the actual alignment and not so much the score (E-value), as with pairwise alignments. The objective is often to find regions of importance (conserved regions), but it can also be to perform further analysis to predict structural features such as secondary structure (e.g. PSIPRED [69]) or residue-residue proximity (e.g. PSICOV [70]).

The exact solution procedure of a multiple sequence alignment (MSA) can be extended from the pairwise alignment with analogous reasoning, referred to as simultaneous alignment. However, the time complexity also follows and thus is O(L^N), where L is the sequence length and N is the number of sequences in the alignment. This makes the exact method highly impractical for most applications; for instance, the small alignment task of 20 sequences of length 100 means the calculation of 100^20 = 10^40 cells. A frequently used alternative heuristic (approximate) method is the progressive method, which performs the following steps:

• Pairwise align all sequences and build a distance matrix.
• Build a so-called guide tree from the distance matrix.
• Start by aligning the two most similar sequences and then iteratively add sequences according to the guide tree.


On the downside, progressive methods are not guaranteed to find the globally optimal solution. The problem arises because errors made when adding sequences to the growing MSA are never corrected and are thus propagated to the final result. An example of a progressive method is ClustalW, published in 1994 [71]. ClustalW is one of the most popular MSA methods and uses weights that are adjusted during the progress of building the MSA, which counters the most serious drawbacks of the progressive method. Since ClustalW was published, many newer methods which are both more accurate and faster, e.g. Mafft [72] and Muscle [73], have been published. Still, ClustalW remains one of the most popular methods [74]. Clustal has also been developed further, for instance with a supporting graphical interface in ClustalX [75] and most recently Clustal Omega, which provides improved accuracy and scalability [76].

2.4.1 Evolutionary analysis
Multiple sequence alignments can also be used to construct evolutionary trees, see Figure 2.5. The phylogenetic tree in the figure was created using ClustalX [75], which employs bootstrapping to assess the branching order. Bootstrapping (in phylogenetics) is a process in which columns of the MSA are sampled with replacement (called pseudoreplicates) [77]. Through this process, a large number of trees are built from different subsets of columns. The trees, and especially their branching, are then analyzed. If a particular branching occurs in a large number of the trees, the branching is considered confident (for instance if more than 90% of the trees branched in the same way). In Figure 2.5, a thousand pseudoreplicates were constructed and the number of trees with the same branching is shown for confident branches.

2.5 Evaluation and assessment
The predictive ability of a bioinformatics algorithm or model can be assessed in many different ways. A typical bioinformatics model would predict whether a data point is of a particular type/class or not, for example whether a 454 flow-peak has been correctly called or not. Given such a prediction, and that the true value is known, the prediction falls into one of four classes: a true positive (TP), a true negative (TN), a false


Figure 2.5: An evolutionary tree of a few enteroviruses (VP1 region)

The figure illustrates an evolutionary (phylogenetic) tree of the VP1 region of a few enteroviruses (a genus in the Picornaviridae family). The tree shows a few subtypes of Human rhinovirus A, B and C as well as two subtypes of Human enterovirus A (outgroup). The tree was evaluated by 1,000 bootstrap pseudoreplicates. Significant bootstrap values (over 85%) are shown. The different species are grouped together, shown by background coloring and labels to the right. The scale bar (bottom left) represents genetic distance (sequence diversity).

positive (FP) or a false negative (FN) prediction. Models are often assessed using a test set which was unknown at the time of making the model (or at least not used while building the model). Through making predictions for all entries of the test set, the numbers of TP, TN, FP and FN can be recorded. While these variables alone provide information about the predictive ability of the method, further assessment terms are most often used:

• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• False Positive Rate (FPR) = FP / (FP + TN)
• False Discovery Rate (FDR) = FP / (FP + TP)

Different prediction methods are often tuned differently, due to the specific costs/risks associated with making a prediction error. For instance, in cancer screening one might tune towards finding most true positives at the cost of a few more false alarms (FPs). On the other hand, in a criminal trial, DNA matching might be analyzed with a very low tolerance for false positives at the cost of letting a few criminals slip through (a few more FNs).

As most prediction problems are not balanced, in the sense that positive and negative events do not occur equally often, it might be useful to use an assessment variable which includes all of TP, TN, FP and FN. Two such balanced assessment variables are accuracy and Matthews' correlation coefficient (MCC) [78], defined as:

• Accuracy = (TP + TN) / (TP + TN + FP + FN)
• MCC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

The MCC value is basically a correlation coefficient between the true and the predicted states. The value is between −1 and +1, where +1 represents perfect prediction and 0 is no better than random prediction (−1 represents perfect disagreement).
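The assessment terms above can be computed directly from the four counts; the sketch below uses the standard definitions, and the imbalanced toy counts are made up purely to show how accuracy and MCC can disagree.

```python
import math

def assess(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)                      # also called sensitivity
    fpr       = fp / (fp + tn)                      # false positive rate
    fdr       = fp / (fp + tp)                      # false discovery rate
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return precision, recall, fpr, fdr, accuracy, mcc

# Imbalanced toy example: accuracy looks fine (0.91) while MCC (~0.14)
# reveals that the predictions are in fact rather weak.
print(assess(tp=10, tn=900, fp=50, fn=40))
```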



3 Present investigations
The main focus of this thesis has been to develop a metagenomic pipeline for analysis of sequenced samples. These have typically been enriched for viruses through filtering and chemical treatment. In order to get the most out of the sequenced data, my focus has also been on methods for performing more accurate sequence alignments with Roche 454 sequencing data. To improve the alignment quality of 454 reads, and to be able to spot and correct for reading errors, I have dug deep into the development of novel dynamic programming algorithms.

3.1 Aims
The aims of this thesis are:

• To be able to reliably evaluate methods that handle 454 data (Paper I). As discussed in 2.5, in order to assess a prediction the true state has to be known. This is done by simulating 454 data.
• To be able to, in an efficient manner, process a metagenomic sequencing library and produce a reliable sample breakdown by virus family and species (Paper III and IV).
• To produce a modular pipeline that will efficiently characterize metagenomic samples (Paper III and IV).
• To extract genomes from metagenomic samples and correct for apparent reading errors in the sequences (Paper II, IV and V).
• To perform homopolymer reading-error-corrected nucleotide-nucleotide and protein-nucleotide alignments, considering a 454 data model, achieving higher alignment accuracy (Paper II and Paper V).

3.2 Paper I – An efficient 454 simulator
The main use of a data simulator is to enable the assessment of methods which process such data. Different assessments focus on different aspects of the simulator, and often it is important for the simulator to produce more than just realistic data as output. The work presented in this paper focuses on, besides providing realistic data, producing

additional simulation information which enables proper assessments. Furthermore, the speed at which a simulation is executed was also addressed with an efficient implementation.

The work on the software presented in this paper (454sim) actually started as early as 2009, to facilitate the evaluation of FAAST (Paper II). The only simulator at the time capable of simulating 454 sequencing data was MetaSIM [79]. However, MetaSIM lacked several important features: it did not produce flowgram output (see Figure 1.8), it produced no simulation metrics and it did not support modifiable statistical models for 454 data, all of which preclude many studies, in particular the study I wanted to perform: testing how reading errors could be better recovered if the flowgram information was used in alignments (Paper II). To address these shortcomings I started the development of new 454 data simulation software. The programming language C++ was chosen to implement the new software. The benefits of C++ are that it is platform independent, very fast on most systems compared to many other languages, and freely available (through the GNU compiler). On the downside, if the statistical models were written into the source code (often called hardcoded), it would be difficult for most people to modify and create new models. The ability to modify and create new models is essential in order to simulate new 454 techniques, or other factors such as non-protocol sample preparations. To address this issue, 454sim loads statistical models from a text format which allows for easy modification of the models. Finally, simulation information crucial for proper assessments was added. This information details when and where sequencing errors occur.
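As a rough illustration of what simulating a flowgram involves, one can draw a signal for each flow, roughly centered on the true homopolymer length for positive flows and near zero for negative flows, and then call bases by rounding the signal. The sketch below is only meant to convey that idea; the flow order, the noise distributions and the parameters are placeholders and do not describe 454sim's actual statistical models.

```python
import random

FLOW_ORDER = "TACG"   # cyclic nucleotide flow order, used here for illustration only

def simulate_flowgram(sequence, n_cycles=42, sigma=0.15, seed=1):
    """Toy flowgram: one signal value per flow; the caller rounds to homopolymer lengths."""
    rng = random.Random(seed)
    flows, pos = [], 0
    for flow in range(n_cycles * len(FLOW_ORDER)):
        base = FLOW_ORDER[flow % len(FLOW_ORDER)]
        run = 0
        while pos < len(sequence) and sequence[pos] == base:   # incorporated homopolymer length
            run += 1
            pos += 1
        noise = rng.gauss(0, sigma) if run else abs(rng.gauss(0, 0.1))  # placeholder noise models
        flows.append(max(0.0, run + noise))
        if pos >= len(sequence):
            break
    called = "".join(FLOW_ORDER[i % len(FLOW_ORDER)] * round(f) for i, f in enumerate(flows))
    return flows, called

flows, called = simulate_flowgram("TTTAGGGC")
print([round(f, 2) for f in flows], called)
```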

During the development, a new tool named flowsim [80] was published, in which attention was paid to advancing the statistical models used to sample 454 data. Gratefully, these advancements were migrated into 454sim, which solved a few issues with the quality of the simulations present in the original version, for example the lack of a model for degrading quality along the sequence. When 454sim was benchmarked against flowsim it became apparent that the new software showed a remarkable speed improvement of ~200 times over flowsim (reducing a typical run from 5½ hours down to less than 2 minutes). Furthermore, 454sim

produces the crucial detailed simulation output as well as provides easily adjustable statistical models. These improvements over flowsim enable more thorough studies of methods and algorithms which deal with 454 data. 454sim is available as open source under the GNU General Public License (GPL), and can be found at http://sourceforge.net/projects/bioinfo-454sim.

3.3 Paper II – Alignments with 454 data
Although sequencing data has become increasingly cheap over the years, it is still important to make good use of the data collected. Since insertions and deletions (indels) are relatively uncommon evolutionary events (compared to substitutions), it is fairly often possible to spot homopolymer length inaccuracies in alignments with 454 data, see Figure 1.8. In this paper, I wanted to investigate whether such errors could be better modeled and whether I could construct a tool that made alignments with 454 data which would highlight plausible homopolymer errors.

The problem of utilizing 454 data (flowpeak information) was addressed by a new, modified local alignment dynamic programming algorithm. Considering a 454 query sequence q (positions i in q) and a database sequence t (positions j in t), the novel alignment algorithm adds a new variable (equation 3.3.2) to the local alignment algorithm (3.3.1), now referring diagonally to a third, virtual variable holding the optimal cell score.

H(i, j) = max( H(i−1, j−1) + s(q_i, t_j), E(i, j), F(i, j), 0 )        (3.3.1)

T(i, j) = max over homopolymer corrections of 1–4 bases of [ the optimal score of the corresponding earlier cell + the score of matching the corrected homopolymer − the pre-calculated down-calling (α) or up-calling (β) penalty ]        (3.3.2)


The new variable, T, describes the optimal alignment (up to position i, j) which ends in a homopolymer correction, and is only non-negative for the last position of a homopolymer stretch. The penalties α and β are the down-calling (the homopolymer length has been overestimated) and up-calling (underestimated) penalties, respectively. The homopolymer correction penalties, α and β, can be pre-calculated for a given query based on the underlying flowpeak representation. As seen from equation 3.3.2, the new variable is limited to the assessment of corrections of up to four bases (up to four previous cells are evaluated). However, corrections of more than four bases are extremely improbable and, to be at all plausible, would require the sequencing of very long homopolymer stretches. The optimal cell score of the modified algorithm can now be redefined as the maximum of the conventional variable H and the new variable T. It should be noted that H and T cannot be merged into a single variable; T is kept separate to ensure that a homopolymer cannot be corrected twice in a matrix column. The virtual variable holding the optimal cell score is, however, only used to better explain the algorithm from a local alignment perspective and is in practice merged into H. Finally, E and F refer back to the optimal cell score instead of H.
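The pre-calculation of correction penalties can be pictured roughly as asking how plausible an alternative homopolymer length is given the observed flowpeak value. The sketch below uses a simple Gaussian flowpeak model with arbitrary parameters; it is only meant to convey the idea and is not the penalty function used by FAAST.

```python
import math

def correction_penalties(flowpeak, called_len, sigma=0.17, scale=2.0, max_shift=4):
    """Toy penalties for re-calling a homopolymer as called_len -/+ k bases.

    A length nearly as consistent with the observed flowpeak as the called
    length gets a small penalty; implausible lengths get large ones.
    """
    def neg_log_like(length):
        return ((flowpeak - length) ** 2) / (2 * sigma ** 2)

    base = neg_log_like(called_len)
    down = {k: scale * (neg_log_like(called_len - k) - base)    # homopolymer over-called by k
            for k in range(1, max_shift + 1) if called_len - k >= 0}
    up   = {k: scale * (neg_log_like(called_len + k) - base)    # under-called by k
            for k in range(1, max_shift + 1)}
    return down, up

# A flowpeak of 6.45 called as a 6-mer: re-calling it as a 7-mer is fairly cheap,
# while re-calling it as a 4-mer is very expensive.
down, up = correction_penalties(flowpeak=6.45, called_len=6)
print({k: round(v, 1) for k, v in down.items()}, {k: round(v, 1) for k, v in up.items()})
```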

When the algorithm was to be put to the test, I realized that there were no good data simulators available. The available simulator MetaSIM did model 454 data but did not produce flowgram output. This limitation made it impossible to investigate how the use of flowpeak signals could improve the alignment. To address the issue, I started working on my own simulator, 454sim. However, before the paper was published, a simulator named flowsim emerged which did produce the crucial flowgram output. Although it lacked simulation information, I decided to use the already published tool instead of 454sim. The simulation information needed to assess the ability to correctly identify homopolymer length inaccuracies was instead recovered through the use of alignment (slightly modified to only allow gaps).

Finally, with a tool at hand which could reveal the performance of the algorithm, it was put to the test. It turned out to be more complicated than I initially realized to test the abilities of the new algorithm. The problem stems from the fact that homopolymer length calls are, on average,

really good, and that the event I wanted to test (homopolymer inaccuracies) occurs seldom compared to the correct calling of a base, see section 2.5. This means that if the number of correctly aligned bases is compared, only a very small improvement can be observed. However, a few incorrectly aligned bases still cause major problems for some applications, and if the focus is turned to the number of incorrectly aligned bases the difference is large. For instance, in an alignment of 100,000 simulated reads, the new algorithm showed a 2.5- to 5-fold (depending on alignment identity) improvement in reducing the number of incorrectly aligned nucleotides, compared to regular local alignment. The new algorithm was implemented in C++ in a tool called FAAST (Flowspace Assisted Alignment Search Tool). FAAST also implements some heuristics which show that these alignments can be performed at a relatively low computational cost. FAAST is also available as open source under the GNU GPL, at http://bioinfo.ifm.liu.se/454tools/faast.

3.4 Paper III – Do viruses cause CFS?
Paper III is the first viral metagenomics paper, using deep sequencing, in this thesis. The aim of the study was to investigate the cause of chronic fatigue syndrome and, more precisely, whether it could have a viral etiology.

Chronic fatigue syndrome (CFS) is characterized by an impairing and prolonged fatigue of unknown etiology. By definition, CFS requires extreme fatigue over 6 months without medical explanation, although appropriately investigated, along with 4 of 8 defined signs and symptoms. While immune dysfunction is the major etiological hypothesis and could be the result of prolonged infection, no agent has been successfully linked to CFS. Several inconsistent findings have implied possible viral infections, for example cytomegalovirus, Epstein-Barr virus, human herpes virus-6 and parvovirus B19. In 2009, Lombardi et al. [81] lit new hope for this patient group when they published the finding of a significant link between CFS and a retrovirus called XMRV (Xenotropic


Murine leukemia virus-Related Virus19). The study was controversial, as many other studies had not found XMRV (when the paper was published we explicitly searched for XMRV-like sequences in our samples, without success). Finally, it turned out that the XMRV finding was a lab contaminant, and the publication was retracted, first partially [82] and later in full [83].

In our study, 45 monozygotic twin pairs were investigated, where one twin was diagnosed with CFS and the other was healthy. Out of these 45, 32 met the criteria for CFS (13 met the criteria for ICF, i.e. idiopathic chronic fatigue). Serum samples from the affected as well as the unaffected twins were collected and pooled separately. Each pool was split into two pools and enriched for DNA and RNA viruses, respectively, using an enrichment protocol developed by Allander et al. [84]. All pools were sequenced using Sanger sequencing, and the two pools from affected twins were deep-sequenced using 454 sequencing (GS FLX).

The sequencing data was analyzed through a metagenomic pipeline in which progressively more sensitive BLAST searches were performed. The pipeline utilizes the fact that fewer computational resources are needed (i.e. more heuristics can be applied) when searching for highly similar matches, and that a large portion of the sample is highly similar to sequences in the databases searched. Furthermore, if a sequence found in samples collected from humans shows high similarity towards a human sequence, it is highly likely that it is of human origin (without further studies it cannot be shown otherwise). Thus, sequences are screened initially against the human genome to reduce the amount of data which is further processed. The remaining sequences were then assembled into overlapping fragments using the MIRA sequence assembly program [85], and then queried against NCBI NR/NT [46]. The search process was performed in three steps with progressively increased sensitivity. The first search was performed using MegaBLAST [64] against NT, the second using nucleotide BLAST (BLASTn) against NT, and the final

19 A xenotropic virus is a virus found in cells of one species (in this case mouse) that will only produce virus particles when infecting cells of another organism (in this case human).

against NR using translated BLAST (BLASTx). At each step, sequences which could be confidently classified were disregarded in downstream searches (the classification threshold was set at ≥90% identity for a local alignment covering ≥70% of the query). Finally, sequences were assigned a tentative origin according to the best hit, and a taxonomic breakdown was performed.
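The stepwise classification logic can be sketched as below. The ≥90% identity / ≥70% coverage rule is the one stated above, but the function names, the hit representation and the toy search function are hypothetical; running the actual MegaBLAST, BLASTn and BLASTx searches is outside the scope of the sketch.

```python
def confidently_classified(hit, query_len, min_identity=90.0, min_coverage=0.70):
    """Apply the classification threshold to one best hit (hypothetical hit format)."""
    return (hit is not None
            and hit["identity"] >= min_identity
            and hit["align_length"] / query_len >= min_coverage)

def classify(sequences, search_steps):
    """Run progressively more sensitive searches, dropping classified sequences.

    `search_steps` is a list of functions mapping a sequence to its best hit
    (or None), e.g. wrappers around MegaBLAST, BLASTn and BLASTx runs.
    """
    classified, remaining = {}, dict(sequences)
    for search in search_steps:
        still_unclassified = {}
        for name, seq in remaining.items():
            hit = search(seq)
            if confidently_classified(hit, len(seq)):
                classified[name] = hit            # tentative origin taken from the best hit
            else:
                still_unclassified[name] = seq    # passed on to the next, more sensitive step
        remaining = still_unclassified
    return classified, remaining

# Toy demonstration with a fake "search" in place of a real BLAST wrapper.
fake_search = lambda seq: {"identity": 99.0, "align_length": len(seq)} if "ACGTACGT" in seq else None
done, left = classify({"r1": "ACGTACGTACGT", "r2": "TTTTGGGGCCCC"}, [fake_search])
print(sorted(done), sorted(left))   # ['r1'] ['r2']
```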

The results from the pipeline showed no clear-cut evidence of novel viruses in any of the twins (control or affected). However, in 8.9% (4/45, p = 0.019) of the affected twins, an excess of nucleic acid from GBV-C (GB virus C, also referred to as hepatitis G virus) was found. Furthermore, Hepatitis C nucleic acid was also found in one of the affected twins (neither of the two viruses was found among the unaffected twins). The previously undiagnosed Hepatitis C infection was confirmed using a standard diagnostic serology test and provides a plausible explanation for chronic fatigue in this person. GBV-C is similar to Hepatitis C and is transmitted similarly (e.g. vertically, sexually and parenterally). GBV-C is present in ~2% of healthy blood donors, and far more show evidence of past infection [86]. GBV-C has not been found to cause any disease (in the liver or elsewhere) [86,87], and prior studies have concluded that it does not cause CFS [88]. Still, it is likely that the etiology of CFS is not unitary, and it cannot be completely ruled out that GBV-C under certain conditions could cause CFS. The possible connection between GBV-C and CFS deserves further study, preferably in much larger patient groups.

3.5 Paper IV – Viruses in the LRT
This study primarily concerns children admitted to the children's hospital (Astrid Lindgrens barnsjukhus), Stockholm, with severe LRTI (lower respiratory tract (LRT) infections). The samples analyzed in this study had previously been sequenced using Sanger sequencing, whereby two novel viruses were discovered by Allander et al. [84,89]. The aim of this study was to go deeper, through the use of 454 sequencing, to potentially discover further novel viruses hiding in the multitude of viruses found. Paper IV provides a full metagenomic characterization of the viruses found in these patients. Furthermore, the paper also details an enhanced version of the bioinformatic pipeline utilized in Paper III.


Figure 3.1: The metagenomic analysis pipeline

The figure shows a metagenomic analysis pipeline. Steps 1 and 2 illustrate sample collection, preparation and sequencing, while steps 3 through 6 show the in silico efforts of screening, assembly, homology search, breakdown and analysis.

In this study, 210 nasopharyngeal aspirates were analyzed, randomly selected from samples submitted to the Karolinska University Laboratory, Sweden, from March 2004 to May 2005. These samples were pooled, enriched for DNA and RNA viruses and sequenced using 454 sequencing. The DNA and RNA samples were sequenced twice, first using the GS20 instrument [39] and secondly using the GS FLX instrument [40], in total 4 runs. The sequencing produced in total 703,790 reads. All reads were processed through the new pipeline, see Figure 3.1. Several improvements over the previously existing pipeline (Paper III) were made

in order to both achieve higher efficiency and increase the reliability of the predictions made. The major improvements made are:

• Proper repeat screening through the use of RepeatMasker [90] and a novel maximum likelihood method for selecting the best non-repetitive subsequence.
• Sequence splitting when a sequence is recognized as being composed of several subsequences showing homology towards different major groups. For instance, a sequence may show high identity towards the human genome from position 1 to 200, and from 200 to 400 towards a virus. Such a sequence is most likely the product of a mis-assembly or a chimeric sequence formed during sample pre-processing or sequencing, and should be split into two subsequences.
• Comparative analysis of homologs, i.e. all homologs found are taken into account. If a sequence is almost equally likely to be of bacterial origin as of viral origin, the sequence is not predicted as either but as of ambiguous origin (a minimal sketch of this rule is shown below).
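In its simplest form, the comparative analysis in the last point can be pictured as comparing the best hit of each major group and refusing to assign an origin when two groups score almost equally well. The bit-score margin in the sketch is an arbitrary placeholder, not the threshold used in the paper.

```python
def assign_origin(best_hits, margin=0.95):
    """best_hits: dict mapping a major group ('viral', 'bacterial', ...) to its best bit score.

    Returns the top group, or 'ambiguous' when the runner-up is within `margin`
    of the best score. The margin value is a placeholder for illustration.
    """
    if not best_hits:
        return "unclassified"
    ranked = sorted(best_hits.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[1][1] >= margin * ranked[0][1]:
        return "ambiguous"
    return ranked[0][0]

print(assign_origin({"viral": 410.0, "bacterial": 405.0}))   # ambiguous
print(assign_origin({"viral": 410.0, "bacterial": 120.0}))   # viral
```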

The pre-screening, see Figure 3.1, step 3, significantly reduced the set of reads from the 703,790 sequenced reads down to 296,647. These were first assembled using MIRA [85], and the resulting ~110,000 sequences then entered the BLAST search pipeline, see Figure 3.1, step 5.

The final resulting breakdown showed an impressive viral portion: 39% of the reads20 were categorized as viral. The second largest group (counting reads) was mammalian, 25%, followed by bacterial, 23%, and finally approximately 12% which could not be classified. A small portion (1.4%) was also found to be of other origin, for instance phages and fungi. A further breakdown of the viral portion of the sample exposed a wide array of respiratory tract viruses, especially those expected in children suffering from severe lower respiratory tract infections, for example Human respiratory syncytial virus, Influenza A, Human parainfluenza virus and Human rhinovirus A/B/C. The study also

20 If a contig consisting of 20 reads was categorized as viral, all 20 reads were marked as viral.

found more unexpected viruses such as non-vaccine-related measles virus, the previously described Human bocavirus [84] and KI polyomavirus [89], as well as coronaviruses. A multitude of TTV [91] and circoviruses (primarily found in pigs) [92] were also found (none of which are known to be pathogenic in humans).

Interesting subsets, such as the set of reads primarily homologous to the Human rhinovirus C species, were further analyzed. Hidden within approximately 11,000 reads, two new subtypes were found, designated HRV-C34 and HRV-C35 by the Picornaviridae study group. The HRV-C35 genome extracted from these metagenomic sequences initially had a homopolymer reading error in its sequence, easily spotted as it broke the open reading frame (ORF) of the polyprotein (which covers most of the genome in the Picornaviridae virus family). When further analyzed using FAAST (Paper II) and BLASTx, the sequencing error was traced to a likely overcall of a poly-G stretch (called as a 7-mer instead of a 6-mer).

In conclusion, the paper highlights the power of viral metagenomics, both in the unbiased search21 for novel viruses and for known viruses. The study also shows how the complete viral flora can be characterized in patient samples, and how a viral agent can be isolated without prior knowledge. With the continuously decreasing cost of sequencing, an automated method such as the one described in this paper could become a viable diagnostic tool.

3.6 Paper V – X-alignments with 454 data
The genomes of most viruses have a high degree of coding material (most of the genome consists of genes) and are also often more conserved in protein space than in nucleotide space. Furthermore, nucleotide alignments often fail to produce significant results for remote homologs due to the "low complexity" of nucleotide sequences22. These factors

21 An undirected search performed without any prior knowledge of which types of virus are present in the sample.
22 For example, there are only 4 states (T, A, C and G) in a given position, which means that on average two sequences have 25% identity without alignment (compared to the 20 states of a protein sequence, which yield 5% identity).

limit the usability of FAAST (Paper II) as a tool to recover broken ORFs in our metagenomic pipeline (Paper IV). This was also an apparent lesson from Paper IV, where vast amounts of rhinovirus sequences were analyzed. To address this issue, I started working on a new algorithm which would be able to produce homopolymer-aware nucleotide-protein alignments (so-called translated alignments, or cross (X) alignments).

The algorithm needed to solve the translated nucleotide-protein alignment problem is somewhat different from a straight-up nucleotide or protein alignment23. The first modification needed is the scoring of three nucleotides (a base triplet coding for an amino acid, see Table 1.1) against one amino acid in each dynamic programming cell. There is still one column in the matrix for each nucleotide position, and thus all three forward frames are matched simultaneously. Such a modification of the local alignment algorithm would look like this:

H(i, j) = max( H(i−3, j−1) + s3(q_{i−2..i}, p_j), E(i, j), F(i, j), 0 )        (3.6.1)

Here q is the nucleotide query, p is the protein database sequence, and the gap variable F is redefined as F(i, j) = max( F(i−3, j) − e, H(i−3, j) − d − e )24. Note the reference shift in F from −1 to −3 to account for a nucleotide triplet representing an amino acid. Furthermore, the substitution function s(q_i, p_j) from the local alignment needs to be replaced by a match function which matches the nucleotide triplet q_{i−2..i} against the amino acid p_j (q_{i−2..i} denotes the nucleotides in positions i−2, i−1 and i). This new algorithm can now be extended to allow shifts between the three frames, i.e. gaps of one or two nucleotides25, by matching 1, 2, 4 or 5 nucleotides to an amino acid, see equation 3.6.2:

23 Note that BLASTx does not perform a translated alignment per se, but rather translates the query and then performs regular protein-protein alignments.
24 The other gap variable, E, remains unchanged for protein gaps.
25 Gaps of 3 nucleotides would correspond to a gap in protein space.


H(i, j) = max( H(i−3, j−1) + s3(q_{i−2..i}, p_j),
               H(i−1, j−1) + s1(q_i, p_j),
               H(i−2, j−1) + s2(q_{i−1..i}, p_j),
               H(i−4, j−1) + s4(q_{i−3..i}, p_j),
               H(i−5, j−1) + s5(q_{i−4..i}, p_j),
               E(i, j), F(i, j), 0 )        (3.6.2)

The match function is here extended into a set of functions which match 1 to 5 nucleotides against an amino acid, given that 2 or 1 nucleotides are inserted (the s1 and s2 functions) or that 1 or 2 nucleotides are deleted

(the s4 and s5 functions). The score of the match function for an aligned amino acid can be pre-calculated for each position, given a 454 sequence and the corresponding flowgram. In brief, the match function returns the highest-scoring insertion or deletion for a matched amino acid. For instance, there are 4 ways of deleting a nucleotide amongst 4 nucleotides to produce a codon (three nucleotides), and the deletion is selected so that the gap penalty (homopolymer adjustment penalty) plus the match score (using a codon table and a substitution matrix, see Table 2.2) is maximized.
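For the in-frame case (three nucleotides against one amino acid), the match function reduces to translating the codon and looking the pair up in the substitution matrix. The sketch below shows only that part, using a trimmed codon table and a few BLOSUM62 entries; the 1-, 2-, 4- and 5-nucleotide variants and the flowgram-derived penalties are left out, and all names here are illustrative.

```python
# Trimmed codon table and BLOSUM62 entries, enough for a small example only.
CODON_TABLE = {"ATG": "M", "AAA": "K", "AAG": "K", "GAT": "D", "GAC": "D",
               "AAT": "N", "AAC": "N", "TTT": "F", "TTC": "F"}
B62 = {("M", "M"): 5, ("K", "K"): 5, ("K", "R"): 2, ("D", "D"): 6,
       ("N", "D"): 1, ("D", "N"): 1, ("F", "F"): 6}

def match3(codon, amino_acid):
    """Score an in-frame codon against a database amino acid (the s3 case above)."""
    translated = CODON_TABLE.get(codon.upper(), "X")
    return B62.get((translated, amino_acid), -4)   # -4 for pairs not in the trimmed table

# 'AAT' translates to N; aligned against D it scores the BLOSUM62 N/D value of 1.
print(match3("AAT", "D"), match3("ATG", "M"), match3("AAA", "R"))   # 1 5 2
```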

The novel alignment algorithm was implemented in a tool called HAXAT (Homopolymer Aware Cross(X) Alignment Tool), using C++. The software enables nucleotide-protein alignments in three major modes: 1) alignment with flowgram information, 2) without flowgram information (regular nucleotide sequence input) but assuming 454-like data, and 3) regular nucleotide sequences using a neutral data model (nucleotide gaps are penalized equally). The alignment model parameters were then evaluated (4,144 parameter combinations in total) for various degrees of identity using simulated data (454sim, Paper I). The evaluation, as well as a test on real data, showed a highly increased alignment accuracy when using flowgram information (and a 454 data model), compared to the neutral model. When the MCC value (see 2.5) of correctly predicted nucleotide insertions/deletions (indels) was calculated at, for instance, 50% identity, the neutral model scored an MCC value of ~11% compared to ~68% when flowgram information was used.


HAXAT is available as open source and pre-compiled binaries, as well as through a simple web-based tool. The web-based tool allows both matching of a nucleotide sequence against a protein sequence and searching against SwissProt through the use of BLASTx. BLASTx is first used to produce a reduced database set, and the alignments are then refined using HAXAT and presented to the user. The web-based tool, as well as binaries and source, are available at http://bioinfo.ifm.liu.se/454tools/haxat.



4 Concluding remarks

4.1 Summary
We have shown (Papers III and IV) that it is possible to give an in-depth characterization of the viral content of a sample through metagenomic sequencing and sequence analysis. Furthermore, a lot of work which was initially manual labor has now been automated. By submitting the data to a bioinformatic pipeline involving screening, assembly, homology searches, taxonomic grouping and the automated generation of statistics, a complete characterization of a metagenomic sample can be performed fairly effortlessly. Metagenomic sequencing and automated characterization could serve as a crucial tool in future virus discovery. Given the expected future drop in sequencing costs, this method could even serve as a clinical tool in cases where the causative agent is obscure or unknown.

We have also addressed the problem of homopolymer inaccuracies in 454 data and shown that these can be predicted through the use of homology, both in nucleotide space (FAAST, Paper II) and in protein space (HAXAT, Paper V). Furthermore, the "unbroken" alignments of HAXAT greatly simplify the automated analysis and prediction of novel viral pathogens. For example, in the characterization of novel rhinovirus species and subtypes, typically the percent identity in the structural gene VP1 is used. Using HAXAT, a corrected alignment is obtained which gives a much better identity estimate than the mean identity of several broken BLASTx alignments. In the development of these novel algorithms, calibration and evaluation are crucial. To address this issue we have presented a new 454 simulator (Paper I) which not only simulates 454 data at an unprecedented speed but also allows for easy manipulation of the statistical models and provides detailed simulation output, all of which are crucial for complete evaluations of these algorithms.


4.2 Future studies
While this thesis covers work done during several years of my lifetime, it provides only a very small advancement of science, and there is a lot more work to be done. In fact, as science progresses, often more questions are raised than answers found. Similarly, there are many more possible advancements of the methods discussed in this thesis. For example, the prediction of homopolymer inaccuracies has been done simply through the use of a single, presumably closest, homolog. These predictions could potentially be substantially improved through the use of multiple homologs and a machine learning method such as an SVM or an ANN. Both these algorithms could also be further developed and potentially incorporated into heuristic tools such as BLAST. The effects of the parameters, which essentially determine how the data is expected to behave, could also be explored further, especially with regard to different 454 chemistries and Ion semiconductor sequencing data. In the same way, 454sim could be fitted with additional statistical models to produce new types of data, most interestingly Ion semiconductor sequencing data26.

There are also many other challenges ahead, for instance the growing data management problem. During the analyses presented here, the data have been fairly manageable, as they have not exceeded gigabytes. However, in the analysis of other datasets the runtime for the "BLAST result analysis" part of the pipeline (based on Perl scripts) has exceeded days. An improvement of this pipeline through complexity analysis, algorithm optimization and migration to a more efficient implementation, such as C++, would be a suitable path forward. The most obvious bottleneck in the analysis pipeline is the query hit comparisons, where pairwise comparisons are needed. This is implemented using a double loop, which can execute 2-3 orders of magnitude slower in Perl than in C++. The data management problem also concerns all other parts of the analysis, i.e. data transfer, initial storage, screening, assembly, homology search and final storage/presentation. As a first step, the homology search phase, using BLAST, has been migrated away from

26 Given that it fits roughly within the same degradation model and that the data are log-normally distributed for negative flows and normally distributed for positive flows.

the pipeline, since it is the most time-consuming step. Through this modification, it can easily be parallelized and executed at a supercomputer center, such as NSC (National Supercomputer Centre) in Linköping, Sweden. By loading hundreds of nodes at NSC (instead of a single computer) with the computational task, the execution time is brought down from weeks to within a day (including queue time). Still, if sequencing costs continue to drop more steeply than computational costs, see Figure 1.7, the use of supercomputers will not hold back the "data flood". As a consequence, it will be continuously important to advance bioinformatics and the related computational algorithms in order to process tomorrow's amounts of data. Finally, presentation of data has become a growing problem as the amounts of data increase. Currently, the results of the pipeline are viewed in a simple text-file format, but as the number of interesting alignments grows this quickly becomes cumbersome, since sorting and filtering are hard. A possible path forward would be to develop a visualization program for these types of metagenomic results.



5 Acronyms
3D – Three dimensional
A – Adenine
A base – 1 nucleotide (short for nucleobase)
ANN – Artificial Neural Network
BLAST – Basic Local Alignment Search Tool
BLOSUM – BLOcks of amino acid SUbstitution Matrix
Bp – Base pair (two paired nucleotides (DNA/RNA))
C – Cytosine
CFS – Chronic fatigue syndrome
Codon – A coding triplet (3) of nucleotides coding for an amino acid
Contig – Contiguous sequence
DNA – Deoxyribonucleic acid (a chain of nucleotides)
E-value (BLAST) – The number of expected hits for a similar random query
EBI – European Bioinformatics Institute
EMBL – European Molecular Biology Laboratory
FASTA – An alignment search tool, short for FAST-All
FASTA – Also a sequence format used to describe biological sequences
G – Guanine
Gigabases or Gbp – 1,000,000,000 nucleotides
GNU GPL – GNU General Public License
Homopolymer – A sequence of nucleotides of the same kind, e.g. "AAA"
HRV – Human rhinovirus
Indel – Insertion or deletion
LRT – Lower respiratory tract
LRTI – Lower respiratory tract infection
Kilobases or Kbp – 1,000 nucleotides
MCC – Matthews' correlation coefficient
Megabases or Mbp – 1,000,000 nucleotides
mRNA – Messenger ribonucleic acid
MSA – Multiple Sequence Alignment
k-mer – A sequence of k nucleotides
NCBI – National Center for Biotechnology Information
NSC – National Supercomputer Centre in Linköping, Sweden


ORF – Open Reading Frame
PDB – Protein Data Bank
PIR – Protein Information Resource
Read – A sequenced DNA fragment
RNA – Ribonucleic acid
rRNA – Ribosomal ribonucleic acid
SIB – Swiss Institute of Bioinformatics
Shotgun sequencing – Sequencing of random fragments
SVM – Support vector machine
T – Thymine
TrEMBL – Translated EMBL
Triplet – Three nucleotides
tRNA – Transfer ribonucleic acid
TTV – Torque teno virus
U – Uracil
UV-light – Ultraviolet light


6 References
1. The Research Collaboratory for Structural Bioinformatics PDB [Online]. http://www.rcsb.org/

2. Arnold E, Rossmann MG: The use of molecular‐replacement phases for the refinement of the human rhinovirus 14 structure. Acta Crystallogr A May 1988, 44 ( Pt 3): 270‐282.

3. McClintock B: The origin and behavior of mutable loci in maize. Proc Natl Acad Sci USA 1950, 36: 344‐355.

4. Boyer M, Yutin N, Pagnier I, Barrassi L, Fournous G et al.: Giant Marseillevirus highlights the role of amoebae as a melting pot in emergence of chimeric microorganisms. Proc Natl Acad Sci U S A 2009, 106: 21848‐21853.

5. Scola BL, Audic S, Robert C, Jungang L, Lamballerie Xd et al.: A giant virus in amoebae. Science 2003, 299: 2033.

6. Crick F: Central dogma of molecular biology. Nature 1970, 227: 561‐563.

7. Nakabachi A, Yamashita A, Toh H, Ishikawa H, Dunbar HE et al.: The 160‐kilobase genome of the bacterial endosymbiont Carsonella. Science 2006, 314: 267.

8. International Human Genome Sequencing Consortium: Finishing the euchromatic sequence of the human genome. Nature 2004, 431: 931‐945.

9. Collins RF, Gellatly DL, Sehgal OP, Abouhaidar MG: Self‐cleaving circular RNA associated with rice yellow mottle virus is the smallest viroid‐like RNA. Virology 1998, 241: 269‐275.

10. Watson JD, Crick FH: Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 1953, 171: 737‐738.

11. Rodnina MV, Beringer M, Wintermeyer W: How ribosomes make peptide bonds. Trends Biochem Sci 2007, 32: 20‐26.


12. Ingram VM: Abnormal human haemoglobins. III. The chemical difference between normal and sickle cell haemoglobins. Biochim Biophys Acta 1959, 36: 402‐411.

13. Friedman MJ, Trager W: The biochemistry of resistance to malaria. Sci Am 1981, 244: 154‐155, 158‐164.

14. Darwin C: On The Origin of Species 1859, XIV.

15. Van Blerkom LM: Role of viruses in human evolution. Am J Phys Anthropol 2003, Suppl 37: 14‐46.

16. Pearson H: 'Virophage' suggests viruses are alive. Nature Aug 2008, 454: 677.

17. Yonath A: X‐ray crystallography at the heart of life science. Curr Opin Struct Biol Oct 2011, 21: 622‐626.

18. Chandonia JM, Brenner SE: The impact of structural genomics: expectations and outcomes. Science Jan 2006, 311: 347‐351.

19. Todd AE, Marsden RL, Thornton JM, Orengo CA: Progress of structural genomics initiatives: an analysis of solved target structures. J Mol Biol May 2005, 348: 1235‐1260.

20. Simons KT, Bonneau R, Ruczinski I, Baker D: Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins 1999, Suppl 3: 171‐176.

21. Cooper S, Khatib F, Treuille A, Barbero J, Lee J et al.: Predicting protein structures with a multiplayer online game. Nature Aug 2010, 466: 756‐760.

22. Pantazes RJ, Grisewood MJ, Maranas CD: Recent advances in computational protein design. Curr Opin Struct Biol Aug 2011, 21: 467‐472.

23. PubMed ‐ PubMed [Online] 2012. [Cited: 01 nov. 2012.] http://www.ncbi.nlm.nih.gov/pubmed

24. Ananiadou S, Kell DB, Tsujii Ji: Text mining and its potential applications in systems biology. Trends Biotechnol Dec 2006, 24: 571‐579.


25. Lambrix P, Habbouche M, Pérez M: Evaluation of ontology development tools for bioinformatics. Bioinformatics Aug 2003, 19: 1564‐1571.

26. Rodriguez‐Esteban R: Biomedical text mining and its applications. PLoS Comput Biol Dec 2009, 5: e1000597.

27. Altman RB, Bergman CM, Blake J, Blaschke C, Cohen A et al.: Text mining for biology – the way forward: opinions from leading scientists. Genome Biol 2008, 9 Suppl 2: S7.

28. Cohen AM, Hersh WR: A survey of current work in biomedical text mining. Brief Bioinform Mar 2005, 6: 57‐71.

29. Gilbert W, Maxam A: The nucleotide sequence of the lac operator. Proc Natl Acad Sci U S A 1973, 70: 3581‐3584.

30. Maxam AM, Gilbert W: A new method for sequencing DNA. Proc Natl Acad Sci U S A 1977, 74: 560‐564.

31. Sanger F, Coulson AR: A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol 1975, 94: 441‐448.

32. Sanger F, Nicklen S, Coulson AR: DNA sequencing with chain‐terminating inhibitors. Proc Natl Acad Sci U S A 1977, 74: 5463‐5467.

33. Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR et al.: Nucleotide sequence of bacteriophage phi X174 DNA. Nature Feb 1977, 265: 687‐695.

34. Mullikin JC, McMurragy AA: Techview: DNA sequencing. Sequencing the genome, fast. Science Mar 1999, 283: 1867‐1869.

35. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC et al.: Initial sequencing and analysis of the human genome. Nature Feb 2001, 409: 860‐921.

36. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ et al.: The sequence of the human genome. Science Feb 2001, 291: 1304‐1351.


37. Staden R: A strategy of DNA sequencing employing computer programs. Nucleic Acids Res Jun 1979, 6: 2601‐2610.

38. Anderson S: Shotgun DNA sequencing using cloned DNase I‐generated fragments. Nucleic Acids Res Jul 1981, 9: 3015‐3027.

39. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS et al.: Genome sequencing in microfabricated high‐density picolitre reactors. Nature Sep 2005, 437: 376‐380.

40. Droege M, Hill B: The Genome Sequencer FLX System – longer reads, more applications, straight forward bioinformatics and more complete data sets. J Biotechnol 2008, 136: 3‐10.

41. GS FLX+ System ‐ 454 Life Sciences, a Roche Company [Online] 2012. [Cited: 01 nov. 2012.] http://454.com/products/gs‐flx‐system/

42. Mardis ER: Next‐generation DNA sequencing methods. Annu Rev Genomics Hum Genet 2008, 9: 387‐402.

43. Valouev A, Ichikawa J, Tonthat T, Stuart J, Ranade S et al.: A high‐resolution, nucleosome position map of C. elegans reveals a lack of universal sequence‐ dictated positioning. Genome Res Jul 2008, 18: 1051‐1063.

44. Rothberg JM, Hinz W, Rearick TM, Schultz J, Mileski W et al.: An integrated semiconductor device enabling non‐optical genome sequencing. Nature Jul 2011, 475: 348‐352.

45. Leinonen R, Akhtar R, Birney E, Bower L, Cerdeno‐Tárraga A et al.: The European Nucleotide Archive. Nucleic Acids Res Jan 2011, 39: D28‐D31.

46. Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res Jan 2012, 40: D13‐D25.

47. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol Oct 1990, 215: 403‐410.


48. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z et al.: Gapped BLAST and PSI‐ BLAST: a new generation of protein database search programs. Nucleic Acids Res Sep 1997, 25: 3389‐3402.

49. Sugawara H, Ogasawara O, Okubo K, Gojobori T, Tateno Y: DDBJ with new system and face. Nucleic Acids Res Jan 2008, 36: D22‐D24.

50. Magrane M, Consortium U: UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford) 2011, 2011: bar009.

51. Bairoch A, Boeckmann B, Ferro S, Gasteiger E: Swiss‐Prot: juggling between evolution and stability. Brief Bioinform Mar 2004, 5: 39‐55.

52. Barker WC, Garavelli JS, Huang H, McGarvey PB, Orcutt BC et al.: The protein information resource (PIR). Nucleic Acids Res Jan 2000, 28: 41‐44.

53. Eddy SR: What is dynamic programming? Nat Biotechnol Jul 2004, 22: 909‐910.

54. Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol Mar 1970, 48: 443‐453.

55. Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol Mar 1981, 147: 195‐197.

56. Gotoh O: An improved algorithm for matching biological sequences. J Mol Biol Dec 1982, 162: 705‐708.

57. States DJ, Gish W, Altschul SF: Improved sensitivity of nucleic acid database similarity searches using application specific scoring matrices. Methods: A companion to Methods in Enzymology Aug 1991, 3: 66‐70.

58. Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A Nov 1992, 89: 10915‐10919.

59. Dayhoff MO, Schwartz RM, Orcutt BC. Chapter 22: A model of evolutionary changes in proteins. In Atlas of Protein Sequence and Structure 1978.


60. Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database. Science Jun 1992, 256: 1443‐1445.

61. Lipman DJ, Pearson WR: Rapid and sensitive protein similarity searches. Science Mar 1985, 227: 1435‐1441.

62. Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A Apr 1988, 85: 2444‐2448.

63. Brenner SE, Chothia C, Hubbard TJ: Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc Natl Acad Sci U S A May 1998, 95: 6073‐6078.

64. Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol 2000, 7: 203‐214.

65. Ning Z, Cox AJ, Mullikin JC: SSAHA: a fast search method for large DNA databases. Genome Res Oct 2001, 11: 1725‐1729.

66. Kent WJ: BLAT – the BLAST‐like alignment tool. Genome Res Apr 2002, 12: 656‐664.

67. Pagni M, Jongeneel CV: Making sense of score statistics for sequence alignments. Brief Bioinform Mar 2001, 2: 51‐67.

68. Compeau PE, Pevzner PA, Tesler G: How to apply de Bruijn graphs to genome assembly. Nat Biotechnol Nov 2011, 29: 987‐991.

69. Jones DT: Protein secondary structure prediction based on position‐specific scoring matrices. J Mol Biol Sep 1999, 292: 195‐202.

70. Jones DT, Buchan DW, Cozzetto D, Pontil M: PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics Jan 2012, 28: 184‐190.

71. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position‐specific gap penalties and weight matrix choice. Nucleic Acids Res Nov 1994, 22: 4673‐4680.

72. Katoh K, Misawa K, Kuma Ki, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res Jul 2002, 30: 3059‐3066.

73. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32: 1792‐1797.

74. Nuin PA, Wang Z, Tillier ER: The accuracy of several multiple sequence alignment programs for proteins. BMC Bioinformatics 2006, 7: 471.

75. Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG: The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res Dec 1997, 25: 4876‐4882.

76. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K et al.: Fast, scalable generation of high‐quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 2011, 7: 539.

77. Felsenstein J: Confidence Limits on Phylogenies: An Approach Using the Bootstrap. Evolution Jul 1985, 39: 783‐791.

78. Matthews BW: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta Oct 1975, 405: 442‐451.

79. Richter DC, Ott F, Auch AF, Schmid R, Huson DH: MetaSim: a sequencing simulator for genomics and metagenomics. PLoS One 2008, 3: e3373.

80. Balzer S, Malde K, Lanzén A, Sharma A, Jonassen I: Characteristics of 454 pyrosequencing data – enabling realistic simulation with flowsim. Bioinformatics Sep 2010, 26: i420‐i425.

81. Lombardi VC, Ruscetti FW, Gupta JD, Pfost MA, Hagen KS et al.: Detection of an infectious retrovirus, XMRV, in blood cells of patients with chronic fatigue syndrome. Science Oct 2009, 326: 585‐589.

82. Silverman RH, Gupta JD, Lombardi VC, Ruscetti FW, Pfost MA et al.: Partial retraction. Detection of an infectious retrovirus, XMRV, in blood cells of patients with chronic fatigue syndrome. Science Oct 2011, 334: 176.

83. Alberts B: Retraction. Detection of an infectious retrovirus, XMRV, in blood cells of patients with chronic fatigue syndrome. Science Dec 2011, 334: 1636.

84. Allander T, Tammi MT, Eriksson M, Bjerkner A, Tiveljung‐Lindell A et al.: Cloning of a human parvovirus by molecular screening of respiratory tract samples. Proc Natl Acad Sci U S A Sep 2005, 102: 12891‐12896.

85. Chevreux B, Pfisterer T, Drescher B, Driesel AJ, Müller WEG et al.: Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res Jun 2004, 14: 1147‐1159.

86. Alter HJ, Nakatsuji Y, Melpolder J, Wages J, Wesley R et al.: The incidence of transfusion‐associated hepatitis G virus infection and its relation to liver disease. N Engl J Med Mar 1997, 336: 747‐754.

87. George SL, Wünschmann S, McCoy J, Xiang J, Stapleton JT: Interactions Between GB Virus Type C and HIV. Curr Infect Dis Rep Dec 2002, 4: 550‐558.

88. Jones JF, Kulkarni PS, Butera ST, Reeves WC: GB virus‐C – a virus without a disease: we cannot give it chronic fatigue syndrome. BMC Infect Dis 2005, 5: 78.

89. Allander T, Andreasson K, Gupta S, Bjerkner A, Bogdanovic G et al.: Identification of a third human polyomavirus. J Virol Apr 2007, 81: 4130‐4136.

90. Smit AFA, Hubley R, Green P: RepeatMasker [Online]. http://repeatmasker.org

91. Nishizawa T, Okamoto H, Konishi K, Yoshizawa H, Miyakawa Y et al.: A novel DNA virus (TTV) associated with elevated transaminase levels in posttransfusion hepatitis of unknown etiology. Biochem Biophys Res Commun Dec 1997, 241: 92‐97.

92. Tischer I, Mields W, Wolff D, Vagt M, Griem W: Studies on epidemiology and pathogenicity of porcine circovirus. Arch Virol 1986, 91: 271‐276.

93. Lysholm F, Wetterbom A, Lindau C, Darban H, Bjerkner A et al.: Characterization of the viral microbiome in patients with severe lower respiratory tract infections, using metagenomic sequencing. PLoS One 2012, 7: e30875.

94. McErlean P, Shackelton LA, Lambert SB, Nissen MD, Sloots TP et al.: Characterisation of a newly identified human rhinovirus, HRV‐QPM, discovered in infants with bronchiolitis. J Clin Virol Jun 2007, 39: 67‐75.

95. Walczak R, Westhof E, Carbon P, Krol A: A novel RNA structural motif in the selenocysteine insertion element of eukaryotic selenoprotein mRNAs. RNA 1996, 2: 367‐379.

96. Ronaghi M, Uhlén M, Nyrén P: A sequencing method based on real‐time pyrophosphate. Science Jul 1998, 281: 363, 365.
