The Development and Use of Bioinformatic Web Applications for Infectious Disease Microbiology

The Development and use of Bioinformatic Web Applications for Infectious Disease Microbiology David Michael Aanensen Imperial College London, Department of Infectious Disease Epidemiology Thesis submitted for the degree of Doctor of Philosophy of Imperial College 1 Dedicated to my parents Susan and Keith Aanensen 2 ABSTRACT The ever-increasing generation and submission of DNA sequences, and associated biological data, to publicly available databases demands software for the analysis of the biological meanings held within. The web provides a common platform for the provision of tools enabling the concurrent deposition, visualisation and analysis of data collected by many users from many different locations. Open programmatic web standards allow the development of applications addressing diverse biological questions and, more recently, are providing methods enabling functionality more traditionally associated with desktop software to be provided via the internet. In this work I detail the development and use of web applications addressing different, but not exclusive, areas of infectious disease microbiology. Firstly, an application utilised by a group of researchers (including myself) to undertake a comparative genetic analysis of the capsular biosynthetic locus from serotypes of the pathogen Streptococcus pneumoniae is detailed. Secondly, an application widely used by communities of researchers and public health laboratories for the assignment of microbial isolates to strains via the internet: mlst.net is described. Thirdly, I describe a generic electronic taxonomy application for assigning strains to bacterial species, exemplified using sequences from the viridans group streptococci, the taxonomy of which is notoriously difficult to define. Lastly, I describe the use of web mapping tools for molecular epidemiological databases, such as mlst.net, and further detail their application to the European distribution of genotypes of Staphylococcus aureus, to catalogue the global distribution of the emerging amphibian fungal pathogen Batrachochytrium dendrobatidis and to provide geocoding tools to encourage users to submit and explore their own data in the geographical context they were collected. Each application has been designed with generality in mind and through reference to user workflows and biological examples, I demonstrate the extensibility and general applicability of current web development methodologies to enable the provision of applications addressing a diversity of biological questions. 3 ABSTRACT 3 TABLE OF CONTENTS 4 LIST OF TABLES AND FIGURES 8 TEXT CONVENTIONS 11 ACKNOWLEDGEMENTS 12 Chapter 1: General Introduction 13 1.1 Hypertext Markup Language (HTML) 15 1.2 Cascading Style Sheets (CSS) 16 1.3 Simple Web Framework 17 1.4 Dynamic Web Framework: Server-side Scripting 18 1.5 The Document Object Model (DOM) 21 1.6 JavaScript 22 1.7 Combining JavaScript, CSS and the DOM 22 1.8 Rich Internet Applications: Web 2.0, AJAX and APIs 24 1.9 AJAX Web Framework 24 1.10 Application Programming Interfaces (APIs) 26 1.11 Usability 28 Chapter 2: Streptococcus pneumoniae Capsular Polysaccharide (CPS) Web Application 30 2.1 Introduction 31 2.1.1 Capsule biosynthesis in Streptococcus pneumoniae 34 2.1.2 Wzx/Wzy-dependant biosynthesis 37 2.1.3 Serotypes 3 and 37: Synthase-dependant pathway 40 2.1.4 Sequencing of 90 S.pneumoniae cps loci 41 2.2 Materials and Methods 43 2.2.1 Data handling 43 2.2.2 TribeMCL 43 2.2.3 Gene naming 44 2.2.4 Encoding of raw data 45 2.2.4.1 DNA sequences 45 2.2.4.1.1 Graphical encoding of cps loci 45 2.2.4.1.2 In silico storage of sequences 45 2.2.4.2 Biochemical repeat-unit structures 46 2.2.4.2.1 Graphical encoding of CPS structures 46 2.2.4.2.2 In silico encoding of CPS structures 47 2.2.4.3 Combining sequence and structure representations 47 2.2.4.4 Serology 48 2.2.5 Sequence alignment methods 49 2.2.5.1 Basic Local Alignment Tool (BLAST) 49 2.2.5.2 Clustal 49 2.2.5.3 Needle 50 2.2.6 CPS Web Application 51 2.2.6.1 Database schema 51 2.2.6.2 Presentation and application logic 53 2.2.6.3 Database 53 2.2.6.3.1 Database querying 53 4 2.2.6.3.2 Homology group querying 55 2.2.6.4 WebACT 56 2.2.6.5 Serotypes 58 2.2.6.6 Structures 59 2.2.6.7 Statistics 60 2.2.7 Serotype clustering 60 2.2.7.1 Dendrogram based on homology group sharing 61 2.2.7.2 Dendrogram incorporating the degree of sequence similarity 62 2.2.8 Interrogating the dendrogram within the CPS Web Application 63 2.2.9 Demonstration of workflow for analysing clusters of serotypes 65 2.3 Results 67 2.3.1 General features of the dexB-aliA locus from 90 S.pneumoniae serotypes 67 2.3.2 Initial transferases and Wzy linkages 68 2.3.3 cps locus clustering 68 2.3.4 Serotypes 44 and 46 are related to serogroup 12 71 2.3.5 Serogroup 11 74 2.4 Discussion 76 2.5 Further work 81 Chapter 3: The Multilocus Sequence Typing Network: mlst.net 84 3.1 Introduction 85 3.2 mlst.net v1 88 3.2.1 mlst.net database schema 89 3.2.2 Curator‟s interface 90 3.2.3 User‟s interface 90 3.2.4 Clustering MLST data 94 3.2.4.1 Comparing query strains to the database: Allelic profiles 94 3.2.4.2 Comparing query strains to the database: eBURST 95 3.2.4.3 Comparing query strains to the database: Concatenated sequences 97 3.3 mlst.net v2 98 3.3.1 Interface re-development 99 3.3.2 Integration of Mapping tools for viewing MLST databases 103 3.3.3 Integration of comparative and geographical eBURST 109 3.3.3.1 Geographical eBURST comparisons 111 3.3.3.2 Database query eBURST comparisons 112 3.3.3.3 eBURST menu option 113 3.3.4 Additional mlst.net v2 features 113 3.3.4.1 Publications 113 3.3.4.2 Download data 113 3.3.4.3 Organism-specific tools 114 3.4 Discussion 116 5 Chapter 4: Assigning Strains to Bacterial Species via the Internet: eMLSA.net 119 4.1 Introduction 120 4.2 Sequencing and cluster analysis 122 4.3 Cluster and species concordance 126 4.4 Using MLSA to identify new species clusters 128 4.5 Single gene trees and identification of interspecies imports 128 4.6 Development of a generic electronic taxonomy web application 129 4.6.1 Method of tree production 129 4.6.2 Method of tree visualisation 132 4.6.3 eMLSA.net database schema 134 4.6.4 Definition of application logic 135 4.7 Walkthrough of eMLSA.net 136 4.7.1 Portal homepage 136 4.7.2 Viridans group sub-site: viridans.eMLSA.net 138 4.7.3 Querying the database with sequences 139 4.7.4 Viewing the species tree: The tree view 140 4.7.5 Collapsing nodes on the tree 143 4.7.6 Viewing individual locus species assignment and trees: The locus view 144 4.7.7 Other options 146 4.8 Discussion 148 Chapter 5: Spatialepidemiology.net: Integrating Web Mapping Tools with Molecular Epidemiological Databases 151 5.1 Introduction 152 5.1.1 Dynamic Mapping within Web Applications 154 5.1.2 Google Maps 154 5.1.3 Google Earth 158 5.1.4 JavaScript libraries 161 5.1.4.1 Prototype 161 5.1.4.2 Script.aculo.us 161 5.1.4.3 Mapstraction 161 5.2 Spatialepidemiology.net 163 5.2.1 Multilocus Sequence Typing Maps: MLST-Maps 164 5.2.1.1 Introduction 164 5.2.1.2 Google Map interface 166 5.2.1.3 Google Earth interface 168 5.2.2 Staphylococcal Reference Laboratory Maps: SRL-Maps 171 5.2.2.1 Introduction 171 6 5.2.2.2 Data summary 174 5.2.2.3 SRL-Maps Web Application 175 5.2.2.4 Isolate database 175 5.2.2.5 Spa type Google Map 176 5.2.2.6 Spa type Google Earth 179 5.2.3 Batrachochytrium dendrobatidis Maps: Bd-Maps 181 5.2.3.1 Introduction 181 5.2.3.1 Data summary 183 5.2.3.2 Handling large datasets using Google Maps 184 5.2.3.3 Bd-Maps Web Application 186 5.2.3.4 Bd Google Map 187 5.2.3.5 Surveillance 192 5.2.3.6 Other options 194 5.2.4 Geocoding and generating user maps: User-Maps 196 5.2.4.1 Introduction 196 5.2.4.2 User-Maps Web Application 196 5.2.4.3 Map latitude/longitude lookup 196 5.2.4.4 Batch geocoding 197 5.2.4.5 Simple map creation tool 198 5.2.4.6 Advanced Map creation tool 198 5.3 Discussion 199 Chapter 6: General Discussion 204 Bibliography 209 Publications 209 Web references 221 APPENDIX 1: Polysaccharide Biochemistry 224 APPENDIX 2: Relevant Publications 227 7 LIST OF TABLES AND FIGURES Table 3.1 Species-specific databases hosted at mlst.net 88 Table 4.1 Loci used for viridans group MLSA scheme 123 Figure 1.1 Simple web server framework 17 Figure 1.2 Dynamic web server framework 18 Figure 1.3 Three-tiered Web Application architecture 19 Figure 1.4 AJAX Web Application architecture 25 Figure 2.1 Makeup of a generic cps locus 36 Figure 2.2 Schematic representation of the biosynthesis of CPS by the Wzx/Wzy-dependant pathway 39 Figure 2.3 Graphical representations of the CPS repeat-unit structure and cps locus of serotype 9A 48 Figure 2.4 CPS Application database schema 52 Figure 2.5 Database query form 54 Figure 2.6 Database search results including subsequent clustalW alignment and dendrogram visualised using ATV 55 Figure 2.7 Homology group query results 56 Figure 2.8 A three-way ACT comparison: Serotypes 18A vs 18B vs 18C 57 Figure 2.9 Serotype summary detailing serotype 6B 59 Figure 2.10 Clustering of serotype 14 and serogroup 15 based on gene content and sequence-weighted gene content 62 Figure 2.11 Dendrogram of sub-cluster containing serogroup 11, viewed within the CPS application 63 Figure 2.12 ACT comparison based on dendrogram interrogation: Serotypes 11B vs 11C vs 11A vs 11D vs 11F 64 Figure 2.13 Dendrogram based on proportion of shared HGs, adjusted for sequence similarities between the shared HGs 70 Figure 2.14 Comparison of those cps loci within cluster 2 assigned to the same subcluster: Serogroup 12 and serotypes 44 and 46 73 Figure

The Development and Use of Bioinformatic Web Applications for Infectious Disease Microbiology

Computational Pan-Genomics: Status, Promises and Challenges

Serratia Marcescens: Outer Membrane Porins and Comparative Genomics

CURRICULUM VITAE George M. Weinstock, Ph.D

Science & Policy Meeting Jennifer Lippincott-Schwartz Science in The

The Complete Genome Sequence of Mycobacterium Bovis

EMBO Conference Takes to the Sea Life Sciences in Portugal

November 2019 Volume 57 Issue 11 E00858-19 Journal of Clinical Microbiology Jcm.Asm.Org 1 Brown Et Al

SGM Meeting Abstracts: UMIST, 8-11 September 2003

Annual Scientific Report 2011 Annual Scientific Report 2011 Designed and Produced by Pickeringhutchins Ltd

Novel Mycobacterium Tuberculosis Complex Isolate from a Wild

Pdf Ment and Disease Emergence in Humans and Wildlife

BIOLOGY 639 SCIENCE ONLINE the Unexpected Brains Behind Blood Vessel Growth 641 THIS WEEK in SCIENCE 668 U.K