In silico virulence prediction and virulence gene discovery of Streptococcus agalactiae

Frank Po-Yen LIN

Centre for Health Informatics
School of Public Health and Community Medicine
University of New South Wales

A thesis submitted in fulfilment of requirements for the degree

of Doctor of Philosophy

October 2009

Declaration of originality

I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, nor material which to a substantial extent has been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in this thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis.

I also declare that the intellectual content of the thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation, and linguistic expression is acknowledged.

Frank Po-Yen LIN

October 2009

To my parents

and

my aunt

Abstract

Physicians frequently face challenges in predicting which bacterial subpopulations are likely to cause severe infections. A more accurate prediction of virulence would improve diagnostics and limit the extent of antibiotic resistance. Nowadays, bacterial pathogens can be typed with high accuracy using advanced genotyping technologies. However, effective translation of bacterial genotyping data into assessments of clinical risk remains largely unexplored.

The discovery of unknown virulence genes is another key determinant of successful prediction of infectious outcomes. The trial-and-error method for virulence gene discovery is time-consuming and resource-intensive. Selecting candidate genes with higher precision can thus reduce the number of futile trials. Several in silico candidate gene prioritisation (CGP) methods have been proposed to aid the search for genes responsible for inherited diseases in humans. How the CGP concept can assist with virulence gene discovery in bacterial pathogens has not yet been investigated.

The main contribution of this thesis is to demonstrate the value of translational bioinformatics methods in addressing challenges in virulence prediction and virulence gene discovery. This thesis studied an important perinatal bacterial pathogen, group B streptococcus (GBS), the leading cause of neonatal sepsis and meningitis in developed countries. While several antibiotic prophylactic programs have successfully reduced the number of early-onset neonatal diseases (infections that occur within 7 days of life), the incidence of late-onset infections (infections that occur between 7–30 days of life) has remained constant. In addition, the widespread use of intrapartum prophylactic antibiotics may introduce undue risk of penicillin allergy and may trigger the development of antibiotic-resistant bacteria. To minimise such potential harm, a more targeted approach to antibiotic use is required. Distinguishing virulent GBS strains from colonising counterparts thus lays the cornerstone of achieving the goal of tailored therapy.

There are three aims of this thesis:

1. Prediction of virulence by analysis of bacterial genotype data:

To identify markers that may be associated with GBS virulence, statistical analysis was performed on GBS genotype data consisting of 780 invasive and 132 colonising S. agalactiae isolates. From a panel of 18 molecular markers studied, only the alp3 gene (which encodes a surface antigen commonly associated with serotype V) showed an increased association with invasive diseases (OR=2.93, p=0.0003, Fisher’s exact test). Molecular serotype II (OR=10.0, p=0.0007) was found to have a significant association with early-onset neonatal disease when compared with late-onset diseases.

To investigate whether clinical outcomes can be predicted by the panel of genotype markers, logistic regression and machine learning algorithms were applied to distinguish invasive isolates from colonising isolates. However, the predictive analysis yielded only weak predictive power (area under ROC curve, AUC: 0.56–0.71, stratified 10-fold cross-validation). It was concluded that a definitive predictive relationship between the molecular markers and clinical outcomes may be lacking, and that more discriminative markers of GBS virulence need to be investigated.

2. Development of two computational CGP methods to assist with functional discovery of prokaryotic genes:

Two in silico CGP methods were developed based on comparative genomics: statistical CGP exploits the differences in gene frequency against phenotypic groups, while inductive CGP applies supervised machine learning to identify genes with similar occurrence patterns across a range of bacterial genomes. Three rediscovery experiments were carried out to evaluate the CGP methods:

• Rediscovery of peptidoglycan genes was attempted with 417 published bacterial genome sequences. Both CGP methods achieved their best AUC of >0.911 in the Escherichia coli K-12 and >0.978 in the Streptococcus agalactiae 2603 (SA-2603) genomes, with an average improvement in precision of >3.2-fold and a maximum of >27-fold using statistical CGP. A median AUC of >0.95 could still be achieved with as few as 10 genome examples in each group in the rediscovery of the peptidoglycan metabolism genes.

• A maximum of 109-fold improvement in precision was achieved in the rediscovery of anaerobic fermentation genes.

• In the rediscovery experiment with genes of 31 metabolic pathways in SA-2603, 14 pathways achieved an AUC >0.9 and 28 pathways achieved an AUC >0.8 with the best inductive CGP algorithms.

The results from the rediscovery experiments demonstrated that the two CGP methods can assist with the study of functionally uncategorised genomic regions and the discovery of bacterial gene-function relationships.

3. Application of the CGP methods to discover GBS virulence genes:

Both statistical and inductive CGP were applied to assist with the discovery of unknown GBS virulence factors. Among a list of hypothetical protein genes, several highly-ranked genes were plausibly involved in molecular mechanisms of GBS pathogenesis, including several genes encoding family 8 glycosyltransferase, family 1 and family 2 glycosyltransferase, multiple adhesins, streptococcal neuraminidase, staphylokinase, and other factors that may contribute to GBS virulence. Such genes may be candidates for further biological validation. In addition, the co-occurrence of these genes with currently known virulence factors suggested that the virulence mechanisms of GBS in causing perinatal diseases are multifactorial. The procedure demonstrated in this prioritisation task should assist with the discovery of virulence genes in other pathogenic bacteria.
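As a rough illustration of the statistical CGP idea summarised in aim 2 (ranking genes by how differently they occur across phenotypic groups of genomes), the following minimal Python sketch builds a 2 × 2 contingency table for each gene and ranks genes by a chi-square score. The gene names, genome identifiers, and function names are hypothetical, the toy occurrence data are invented purely for illustration, and the plain chi-square statistic is only one possible scoring function; this is a sketch under those assumptions, not the implementation used in the thesis.

```python
from typing import Dict, List, Set, Tuple


def contingency(gene_genomes: Set[str], positive: Set[str],
                negative: Set[str]) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d): gene present/absent in the positive group,
    then present/absent in the negative group."""
    a = len(gene_genomes & positive)
    b = len(positive) - a
    c = len(gene_genomes & negative)
    d = len(negative) - c
    return a, b, c, d


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic of a 2 x 2 table (no continuity correction).
    Returns 0.0 when a row or column total is zero."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def rank_genes(occurrence: Dict[str, Set[str]], positive: Set[str],
               negative: Set[str]) -> List[Tuple[str, float]]:
    """Score every gene and sort by descending score (ties broken by name)."""
    scored = [(gene, chi_square(*contingency(genomes, positive, negative)))
              for gene, genomes in occurrence.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical toy data: genomes g1-g3 show the phenotype, g4-g6 do not.
positive = {"g1", "g2", "g3"}
negative = {"g4", "g5", "g6"}
occurrence = {
    "geneA": {"g1", "g2", "g3"},                    # only in positive genomes
    "geneB": {"g1", "g2", "g3", "g4", "g5", "g6"},  # present everywhere
    "geneC": {"g1", "g4"},                          # evenly spread
    "geneD": {"g1", "g2", "g4"},                    # mildly enriched
}

ranking = rank_genes(occurrence, positive, negative)
for gene, score in ranking:
    print(f"{gene}\t{score:.3f}")
```

Note that the plain chi-square statistic is undirected: a gene found only in the negative group would score as highly as one found only in the positive group, so a directional variant would be needed to favour enrichment in one particular group.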



Acknowledgements

First of all, I wish to express my gratitude to my supervisor Professor Enrico Coiera for his guidance and encouragement over the last three years. In particular, I could not have finished my work without his constant optimism, experience, and striving for perfection. My gratitude goes equally to Dr Vitali Sintchenko, my co-supervisor, whose vision and remarkable attention to detail have truly inspired me. Both Enrico and Vitali have strengthened my interest in the fields of clinical decision support and genomics. Their encouragement has been an essential element of my candidature.

Coming from a non-technical, non-laboratory background, I could not have completed this thesis without the expertise and assistance of the following people: Professor Lyn Gilbert and Dr Fanrong Kong for leading me into the fascinating fields of clinical microbiology and molecular epidemiology; Heather Hiddings for her helpful discussions in biostatistics; Danny Ko for collecting and curating GBS genotyping data; Dr. Ruiting Lan for his expert knowledge on microbial genetics; and Drs. Mike Bain, Ashwin Srinivasan and Guy Tsafnat for sharing their knowledge on machine learning and their generous comments in assisting me with experimental design. I am also greatly indebted to Enrico, Vitali, Lyn, Kong, and Ruiting for their assistance in editing the earlier drafts of this thesis.

I would also like to thank the many anonymous reviewers and editors of BMC Bioinformatics, Journal of Infectious Diseases, Clinical Microbiology and Infection, and Pathology for their invaluable insights on my work. Their constructive criticisms constituted a substantial part of my learning in conducting rigorous scientific research (albeit sometimes the hard way!).

Over the years, I received much useful advice and many helpful discussions from my colleagues and seniors at the Centre for Health Informatics, including (in alphabetical order) Stephen Anthony, Farshid Anvari, Grace Chung, Adam Dunn, Blanca Gallego Luxan, Andrew Georgiou, Simon Li, Annie Lau, Farah Magrabi, Geoff McDonnell, Hieu Phan, Victor Vickland, Prof. Johanna Westbrook, Tatjana Zrimec, and from my fellow students past and present: Afra, David, Jaron, MeiSing, Nerida, Rosie, Torsten, and Zafar. Also, I could not have done without the assistance of the dedicated admin team: Sarah Behman, Keri Bell, Danielle Del Pizzo, Janice Ooi, Samantha Sheridan, Denise Tsiros, and Gerard Viswasam.

Financially, I would like to thank the National Health and Medical Research Council of Australia for funding my scholarship.

I wish to thank my family for their constant encouragement during my candidature. I thank my parents for supporting my decisions on what I wanted to do. Finally, I could not have completed my work without the support of Jerlyna, my long-term girlfriend (now fiancée), who shared much of my frustration and emotional upheaval over the last few years. Maintaining a long-distance relationship was challenging, and I am extremely grateful that she stood by my side, with much understanding and patience, throughout the journey in pursuing my goal.

Table of abbreviations

ADTree      Alternating decision tree algorithm
amss        Arithmetic mean of sensitivity and specificity (scoring function)
AUC         Area under receiver operating characteristic curve
bchisq      Directional chi-square scoring function
BLAST       Basic Local Alignment Search Tool
CG          Cumulative gain function
CGP         Candidate gene prioritisation
chisq       Chi-square scoring function
CMP-NeuNAc  Cytidine 5’-monophospho-N-acetylneuraminic acid
COG         Clusters of orthologous groups
CoNS        Coagulase-negative staphylococci
CSF         Cerebrospinal fluid
DNA         Deoxyribonucleic acid
ECM         Extracellular matrix
EOD         Early-onset (neonatal) disease
EST         Expressed sequence tag
F           F-measure scoring function
GAG         Glycosaminoglycan
GBS         Group B streptococcus
GC content  Guanine-Cytosine content
GT1/2/8     Glycosyltransferase family 1/2/8
HGT         Horizontal gene transfer
Hib         Haemophilus influenzae type b
hmss        Harmonic mean of sensitivity and specificity (scoring function)
IBk         k-nearest neighbour algorithm
IgA/G/M     Immunoglobulin A/G/M
IR          Information retrieval
KEGG        Kyoto Encyclopaedia of Genes and Genomes
LOD         Late-onset (neonatal) disease
LPS         Lipopolysaccharide
LR          Logistic regression
MGE         Mobile genetic element
MLP         Multilayer perceptrons
MLST        Multilocus sequence typing
mPCR/RLB    Multiplex PCR-based reverse line blot
MRSA        Methicillin-resistant Staphylococcus aureus
MS          Molecular serotype
MSST        Molecular serosubtype (of MS type III)
MSSA        Methicillin-sensitive Staphylococcus aureus
NB          Naïve Bayes algorithm
NCBI        National Center for Biotechnology Information
NeuNAc      N-acetylneuraminic acid
NICU        Neonatal intensive care unit
npv         Negative predictive value (scoring function)
OMIM        Online Mendelian Inheritance in Man (database)
OR          Odds ratio
OR          Odds ratio scoring function
orf         Open reading frame(s)
PCR         Polymerase chain reaction
pct         Percentile
PE          Probability enrichment
PGP         Protein genetic profiles
ppv         Positive predictive value (scoring function)
PROM        Premature rupture of the membranes
PRU         Polysaccharide repeating units
RFLP        Restriction fragment length polymorphism
ROC         Receiver operating characteristic
sens        Sensitivity (scoring function)
SMO         Sequential minimal optimization algorithm
spec        Specificity (scoring function)
ST          Sequence type (of MLST)
SVM         Support vector machines
SVM/RBF     SVM with radial basis function kernel
SVM/Poly    SVM with polynomial kernel
VSM         Vector-space model
WEKA        Waikato Environment for Knowledge Analysis
ZeroR       ZeroR majority-class classifier



List of Publications

1. F Lin, V Sintchenko, F Kong, GL Gilbert, and E Coiera. Commonly-used molecular epidemiology markers of Streptococcus agalactiae do not appear to predict virulence. Pathology (Accepted 29 October 2008).

2. F Lin, E Coiera, RT Lan, and V Sintchenko. In silico prioritisation of candidate genes for prokaryotic gene function discovery. BMC Bioinformatics 2009, 10:86.

3. F Lin. The role of data mining in clinical predictive medicine: a narrative review. In: Westbrook, J. and Callen, J., eds. Bridging the Digital Divide: Clinician, Consumer and Computer: Proceedings of 14th Annual Conference of Health Informatics Society Australia (HIC 2006); Sydney, Australia, August 2006.

4. F Lin. Factors affecting the classification performance of machine learning algorithms in clinical genomics. (poster) presented at Bioinformatics Australia 2006 Conference; Sydney, Australia, 21 November 2006.

5. F Lin, V Sintchenko, and E Coiera. A comparative genomic approach for suggesting candidate virulence genes in Streptococcus pneumoniae. (poster) presented at Genetics and Genomics of Infectious Diseases 2009 (Nature conference); Singapore, 21–24 March 2009.

This work applied the inductive candidate gene prioritisation method developed in Chapter 5 to S. pneumoniae, an important respiratory pathogen, in an identical fashion to the methods described in Chapter 9.



Contents

Abstract
Acknowledgements
Table of abbreviations
List of Publications

1 Introduction
1.1 Explosion of genetic information
1.2 Translational bioinformatics
1.3 Bacterial pathogens
1.3.1 Bacterial classification and typing
1.3.2 Virulence mechanisms and virulence genes
1.4 Antibiotics and their optimal use
1.5 Virulence gene discovery
1.5.1 Classical approach
1.5.2 Bacterial comparative genomics
1.5.3 Tools for comparing different bacterial genomes
1.6 Candidate gene prioritisation (CGP)
1.7 Guide to thesis

I Virulence prediction in Streptococcus agalactiae

2 Group B streptococcal diseases and typing methods
2.1 Introduction
2.2 Clinical manifestations of GBS diseases
2.2.1 Maternal carriage and disease
2.2.2 Early-onset neonatal disease
2.2.3 Late-onset disease
2.3 Typing of GBS strains
2.3.1 Serotyping
2.3.2 Molecular characterisation
2.3.3 Genome sequences

3 GBS virulence classification
3.1 Introduction
3.2 Descriptive analysis of GBS markers
3.2.1 Material and Methods
3.2.2 Results
3.2.3 Discussion
3.3 Predictive analysis by machine learning
3.3.1 Material and methods
3.3.2 Results
3.3.3 Discussion
3.4 Potential limitations
3.5 Selection of discriminatory features
3.5.1 Methods
3.5.2 Results
3.5.3 Discussion
3.6 Evolutionary considerations
3.6.1 Horizontal gene transfer
3.6.2 Positive selection of virulence genes
3.6.3 Virulence gene typing
3.7 Chapter summary

II Co-occurrence-based candidate gene prioritisation

4 Candidate gene prioritisation: literature review
4.1 Introduction
4.1.1 Overview of the in silico CGP process
4.2 Types of CGP
4.2.1 Characteristics-based prioritisation
4.2.2 Inductive prioritisation
4.3 Data sources
4.3.1 Primary data source
4.3.2 Secondary data sources
4.3.3 Primary versus secondary data sources
4.4 Gene features
4.4.1 Co-occurrence
4.4.2 Similarity
4.5 Feature integration
4.5.1 ad hoc sorting and filtering
4.5.2 Vector-space model
4.5.3 Statistical models
4.5.4 Machine learning models
4.6 Methods of evaluation
4.6.1 Validation by rediscovery experiments
4.6.2 Threshold-dependent measures
4.6.3 Area under ROC curve
4.6.4 External validation
4.7 Discussion
4.7.1 Conservation hypothesis and biases
4.7.2 Challenges in GBS virulence gene discovery

5 CGP methods for prokaryotic genes
5.1 Gene-function co-occurrence
5.2 Formal Definitions
5.3 Statistical CGP
5.3.1 Occurrence matrix
5.3.2 The 2 × 2 contingency tables
5.3.3 Scoring functions
5.4 Inductive CGP
5.5 Evaluation by rediscovery experiments
5.5.1 Threshold-dependent measures
5.5.2 Area under ROC curve
5.5.3 Evaluation measures used in this thesis

6 Co-occurrence-based CGP: case studies
6.1 Case study 1: Rediscovery of peptidoglycan genes
6.1.1 Methods
6.1.2 Results
6.1.3 Discussion
6.2 How many genome examples are needed?
6.2.1 Methods
6.2.2 Results
6.2.3 Discussion
6.3 Case study 2: Anaerobic mixed-acid fermentation genes
6.3.1 Methods
6.3.2 Results
6.3.3 Discussion
6.4 Case study 3: Rediscovery of KEGG pathways
6.4.1 Methods
6.4.2 Results
6.4.3 Discussion
6.5 Potential limitations
6.6 Summary

III Virulence gene discovery in S. agalactiae

7 Review: virulence genes of Streptococcus agalactiae
7.1 Introduction
7.2 Adhesins
7.2.1 Fibrinogen-binding protein genes
7.2.2 Fibronectin-binding protein gene
7.2.3 Streptococcal C5a-peptidase
7.2.4 Laminin-binding protein gene
7.2.5 Minor pilin gene cluster
7.3 Invasins
7.3.1 Cα-protein and α-like protein family genes
7.3.2 β-haemolysin/cytolysin gene cluster
7.3.3 Hyaluronate lyase gene
7.3.4 Other invasins
7.4 Immune system evasion
7.4.1 Cβ protein gene
7.4.2 Streptococcal C5a-peptidase
7.4.3 Capsular polysaccharide cps gene cluster
7.4.4 Sialic acid synthases gene cluster
7.4.5 Serine protease CspA
7.4.6 Penicillin-binding protein 1A gene
7.5 Summary

8 GBS virulence gene discovery by statistical CGP
8.1 Material and methods
8.1.1 Genome example selection
8.1.2 Statistical CGP
8.1.3 Comparison with currently known virulence factors
8.1.4 Clustering of orthologous genes
8.2 Results
8.3 Discussion
8.3.1 Possible significance of top-ranked genes
8.3.2 Few overlaps with currently known virulence factors
8.3.3 Sampling bias involved in genome example selection
8.3.4 Overlapping with anaerobic-specific genes

9 GBS virulence gene discovery by inductive CGP
9.1 Methods
9.1.1 Rediscovery of training genes
9.1.2 Sub-sampling of negative examples
9.2 Results
9.2.1 Rediscovery of GBS virulence genes or gene categories
9.2.2 De novo discovery of GBS virulence genes
9.2.3 Adhesion genes fbsA, fbsB, and pavA (Table 9.3)
9.2.4 lmb and scpB genes (Table 9.4)
9.2.5 GBS minor pilin genes (Table 9.5)
9.2.6 Genes encoding invasins spb1 and bca (Table 9.6)
9.2.7 Cytolysins: the cyl gene cluster and CAMP factor
9.2.8 cspA and hylB genes
9.2.9 cps and neu gene clusters
9.3 Discussion

10 Biological significance of prioritised genes
10.1 Glycosyltransferases
10.1.1 Family 8 glycosyltransferases
10.1.2 Family 1 and 2 glycosyltransferases
10.2 Adhesins
10.2.1 Adherence to ECM and epithelial cells
10.2.2 Adherence to collagen
10.3 Degradation of ECM
10.3.1 Metalloproteases
10.4 Evasion of immune system
10.4.1 Neuraminidase
10.4.2 Staphylokinase homologue
10.5 Other genes

11 Conclusions
11.1 Summary of contributions
11.1.1 “Typeable” versus “Predictive”
11.1.2 Development of two CGP methods
11.1.3 Used the CGP methods in virulence gene discovery
11.1.4 Some prioritised genes have biological plausibility
11.2 Future directions
11.2.1 Bench-side validation
11.2.2 Virulence prediction studies
11.2.3 Applying other data sources and algorithms
11.2.4 Practical CGP tools
11.2.5 Virulence gene occurrence patterns
11.3 Concluding remarks

Bibliography

Appendices

A Genomes used in the sCGP of peptidoglycan genes
B Genomes used in the sCGP of anaerobic genes
C Using KEGG pathways as validation sets
D Bacterial pathogens causing neonatal sepsis
D.1 Introduction
D.2 Gram-positive pathogens
D.2.1 Group B streptococcus (GBS)
D.2.2 Staphylococcus spp.
D.2.3 Other gram-positive pathogens
D.3 Gram-negative pathogens
D.3.1 E. coli
D.3.2 Klebsiella pneumoniae
D.3.3 Haemophilus influenzae
D.4 Other pathogens
E Genomes used in the sCGP of GBS virulence genes

List of Figures

1.1 Number of bacterial whole-genome sequences (1995–2007)
1.2 Translational research
1.3 Tracking of bacterial clonal lineages by typing
1.4 The classical approach of virulence gene discovery
3.1 Virulence classification
3.2 The important GBS clinical subgroups
3.3 Distributions of MS and PGP in 912 GBS isolates
3.4 Predictive analysis by machine learning
3.5 Clustering versus predictive classifications
3.6 Pseudocode: modified cross-validation procedure
3.7 AUC versus number of features post-feature selection
3.8 Horizontal gene transfers and positive gene selections
4.1 The workflow of in silico candidate gene prioritisation
4.2 Characteristics-based prioritisation
4.3 Inductive candidate gene prioritisation
4.4 Co-occurrence and similarity
4.5 Statistical feature integration
5.1 Gene-function co-occurrence in bacterial genomes
5.2 Using gene-function co-occurrence for gene prioritisation
5.3 The workflow of statistical CGP
5.4 The occurrence matrix
5.5 2 × 2 contingency table
6.1 Case study 1: peptidoglycan genes
6.2 Case study 1: statistical CGP performance
6.3 Case study 1: partial precision and P-R graphs
6.4 Case study 1: the ADTree model
6.5 Case study 1, simulation 1
6.6 Case study 1, simulation 2
6.7 Case study 1, simulation 3
6.8 Case study 3: performance on KEGG pathways
6.9 A mechanisms of gene occurrence in genomes
7.1 Currently known GBS virulence factors
9.1 Sub-sampling of training data
10.1 Potential GBS virulence factors
11.1 A prototype web-based CGP tool
C.1 Variations in CGP performance using KEGG pathways

List of Tables

2.1 Clinical manifestations of perinatal GBS diseases
2.2 GBS clinical risk factors
3.1 Characteristics of 912 GBS isolates
3.2 Frequencies of GBS markers in invasive vs. colonising isolates
3.3 Frequencies of GBS markers in clinical subgroups (a)
3.4 Frequencies of GBS markers in clinical subgroups (b)
3.5 Predictive analysis with machine learning classifiers
3.6 (Nopt, A) of classifier-feature-ranking algo. combinations
3.7 Top GBS markers ranked by feature-ranking algorithms
4.1 Comparison of CGP and IR systems
6.1 Case study 1: List of peptidoglycan-related genes
6.2 Case study 1: SA-2603 genes ranked by statistical CGP
6.3 Case study 1: CGP performance on the SA-2603 genome
6.4 Case study 1: CGP performance on the EC-K12 genome
6.5 Case study 1: CGP performance on the control validation set
6.6 Case study 2: CGP performance
6.7 Case study 2: list of genes in the validation set
6.8 Case study 2: statistical CGP prioritised genes
6.9 Case study 3: evaluated KEGG pathways
6.10 Case study 3: inductive CGP of KEGG pathway genes
7.1 List of GBS virulence factors
7.2 The β-haemolysin/cytolysin gene cluster
7.3 The GBS cps–neu gene cluster
8.1 Positive genome examples used in statistical CGP
8.2 Highly-ranked gene clusters from statistical CGP
8.3 Known virulence genes versus the sCGP prioritised rank
9.1 Training genes for virulence gene discovery by inductive CGP
9.2 Rediscovery performance of training genes by inductive CGP
9.3 Highly-ranked genes (training set: fbsAB and pavA genes)
9.4 Highly-ranked genes (training set: scpB and lmb genes)
9.5 Highly-ranked genes (training set: GBS minor pilin genes)
9.6 Highly-ranked genes (training set: spb1 and bca genes)
9.7 Highly-ranked genes (training set: cyl and cfb genes)
9.8 Highly-ranked genes (training set: cspA and hylB genes)
9.9 Highly-ranked genes (training set: cps and neu gene clusters)
9.10 Highly-ranked genes (training set: pbp1A gene)
A.1 Genome examples used in Chapter 6.1
B.1 Genome examples used in Chapter 6.3
C.1 AUC of original vs. processed sag00400 validation sets
C.2 Genes in original vs. processed sag00400 validation sets
C.3 KEGG pathway listed in Figure C.1
E.1 Negative genome examples in GBS virulence gene discovery

Chapter 1

Introduction

Accurate prediction of pathogen virulence (the ability of a pathogen to cause infection) remains a challenge in clinical practice and the basic sciences. The ability to discriminate pathogenic microorganisms from non-pathogenic counterparts can improve therapeutic decisions in infectious diseases medicine. Improved virulence prediction contributes to health economic benefits and, through more precise antibiotic prescribing, limits the extent of antibiotic resistance. In the last two decades, a variety of molecular genetic techniques have been developed that allow different subpopulations of bacterial pathogens to be identified. Using such data to predict virulence, however, remains largely unexplored.

A second challenge lies in the discovery of virulence genes, the genetic determinants that confer on pathogenic microbes the ability to cause infections. Identification of such genes can lead to improvements in diagnostics, therapeutics, and vaccine development. The traditional method for finding such genes can be characterised as a “fishing expedition” involving years of painstaking laboratory screening. The use of comparative genomics with high-throughput experimental methods, such as microarrays and whole-genome sequencing, is gradually accelerating this process, although such methods are considerably more expensive to conduct.

The number of whole-genome sequences of bacteria has been increasing over the last decade. Effective bioinformatic methods of genomic data analysis can provide a low-cost alternative for assisting with virulence gene discovery in the laboratory.

This thesis focuses on the topics of virulence prediction and virulence gene discovery. The two challenges were addressed in the analysis of virulence of an important perinatal pathogen, Streptococcus agalactiae or group B streptococcus (GBS). GBS, normally a part of the microflora of the colon and female urogenital tract, is the leading cause of neonatal sepsis in developed countries. The first part of this thesis will describe using machine learning models to predict clinical outcomes by analysing GBS molecular epidemiology data. Subsequent chapters will focus on developing two computational candidate gene prioritisation (CGP) methods to assist with virulence gene discovery. The later part of this thesis will apply the CGP methods to the task of identifying unknown virulence factors in GBS. Lastly, the potential virulence genes identified by the CGP methods will be discussed in detail.

1.1 Explosion of genetic information

The modern foundation of genetics was established with Mendel’s experiment with garden peas in 1865 [1, 2]. Decades later, the molecular mystery of genetic inheritance was unravelled by the discovery of deoxyribonucleic acid (DNA), a scientific breakthrough that marked the beginning of the era of molecular biology and genetics [3]. In 1977, Frederick Sanger et al. first published the whole-genome sequence of bacteriophage Φ-X174 with 5,375 nucleotides [4]. A decade later, the development of the polymerase chain reaction (PCR) by Mullis et al. permitted large-scale studies of molecular genetics through sequencing projects [5]. In 1995, the first genome sequence of a free-living organism, Haemophilus influenzae Rd, was determined by using the whole-genome shotgun sequencing method [6]. Since then, the number of sequenced genomes has grown exponentially (Figure 1.1). In 2001, an international collaborative effort sequenced the complete haploid genome of Homo sapiens and yielded approximately three billion base-pairs of genetic information [7]. In 2008, a massively parallel sequencing project achieved the complete sequence of the diploid genome of a single individual (Dr. James Watson) within two months at a fraction of the cost [8].

These rapid advancements in modern genomics have now led to the anticipation of how a “thousand-dollar” genome sequence will revolutionise healthcare and medical sciences [9]. Besides our interest in understanding the biological world, the potential for using genomic data in medicine is driving speculation about promised individualised care. Developing effective methods for translating bench-side genetic and genomic data into useful clinical information is thus an important task in the decades to come.

1.2 Translational bioinformatics

The novel field of translational medicine bridges the “research-clinical gap” by achieving a close collaboration between laboratory science and clinical practice. As shown in Figure 1.2, translational medicine is a two-way process that connects progress in both basic and clinical research [12]. The expansion of genetic information mandates an increasing requirement for rapid and accurate data analysis. To serve these needs, the field of translational bioinformatics develops effective analytic methods. The resources and tools at the cornerstone of an effective translation process include data mining and advanced statistical methods.

Figure 1.1: Number of completely sequenced bacterial genomes in NCBI databases between 1995 and 2007 [10] (plot of the number of genomes, n, on a logarithmic scale against year). Abbreviation: H. inf. Rd: Haemophilus influenzae Rd.

The high-level theme in this thesis is to apply the translational bioinformatic approach to the domain of clinical microbiology and infectious disease medicine.

The main contribution of this thesis is to demonstrate how biomedical informatics methods are able to assist clinical sciences in the formulation of new scientific hypotheses. Specifically, within the “translational” framework, this thesis will first study how to apply bacterial genetics to predict clinical outcome by analysing bacterial genetic markers with data mining tools. The later part of this thesis will explore methods based on comparative genomics and bacterial sequence data to discover potential virulence genes.

[Figure 1.2 diagram. Labels: Bench-side, Bed-side, Exploration, Application, Biomarker discovery, Biostatistics methods, Phase I clinical trials, Rapid diagnostics, Predictive algorithms.]

Figure 1.2: Translational research. Developing systematic scientific methodology in both directions is required to achieve effective translation [11]. In molecular genetics, methods for bridging the clinical–research gap include screening for de novo biomarkers and developing systematic approaches to the discovery of disease-related genes. Rigorous evaluation of research methodology is an important step in the exploration and the application of translational efforts. In translational research of bacterial virulence studies, the exploration phase of the translational cycle may involve identification of virulent strains via descriptive molecular epidemiology studies. In the opposite direction, the application phase may utilise the bacterial genetic data to predict clinical outcomes.

1.3 Bacterial pathogens

The observation of animalcules in 1684 by Antonie van Leeuwenhoek marked the first discovery of bacteria in human history. Bacteria are the simplest free-living unicellular microorganisms, organised in a variety of shapes and arrangements, and contain a nucleoid with circular DNA [13]. Most bacteria possess a rigid layer of peptidoglycan with various surface components such as flagella or fimbriae.

1.3.1 Bacterial classification and typing

Bacterial classification

Bacteria have been classified by phenotypic features such as morphology, metabolism, staining results, and specific enzymatic activities. Bacterial classification not only has implications in taxonomical studies, but also has clinical significance in empirically distinguishing potential pathogens from non-virulent commensals [14].


Figure 1.3: Tracking of bacterial clonal lineages by typing. Classification of bacterial subpopulations (for example types A, B, and C) can be made by identifying features that distinguish individual clonal lineages (depicted by different node colours in the diagram).

In clinical settings, bacterial classification is important in identifying causal agents and in guiding appropriate drug therapies [14]. Nevertheless, it is evident that these broad classification schemes cannot explain the wide range of pathology caused by bacteria within simple taxonomical groups; better discrimination is required to differentiate subpopulations within a bacterial species.

Bacterial typing

In order to distinguish different bacterial subpopulations, or strains, from one another (for example, within a species), many methods for bacterial typing have been developed. Typing establishes the clonal lineage of a particular bacterial strain, which enables tracking of its ancestry and origin. Methods of bacterial typing are important in defining the molecular epidemiology of a bacterial species, a discipline that focuses on the investigation of the spread of bacterial clusters in host populations [15,16]. The goal of typing is to determine which bacterial strains are colloquially the “same” or indistinguishable (that is, whether they are derived from the same ancestor, Figure 1.3). This enables the determination of pathogen origins and the tracking of transmission routes, and aids the discovery of virulence genes [17].
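The notion of “indistinguishable” strains can be illustrated with a minimal sketch, assuming each isolate is represented by a tuple of typing marker results. The isolate names, the marker values, and the function `group_by_profile` are hypothetical and serve only as an illustration; practical typing schemes apply method-specific similarity criteria rather than exact matching.

```python
from collections import defaultdict


def group_by_profile(isolates):
    """Group isolate names that share an identical marker profile.

    `isolates` maps an isolate name to a sequence of marker results.
    Isolates with exactly the same profile are treated as one type.
    """
    groups = defaultdict(list)
    for name, profile in isolates.items():
        groups[tuple(profile)].append(name)
    return dict(groups)


# Hypothetical isolates typed at three made-up markers.
isolates = {
    "iso1": ("A", "1", "+"),
    "iso2": ("A", "1", "+"),
    "iso3": ("B", "2", "-"),
}

types = group_by_profile(isolates)
for profile, members in types.items():
    print(profile, members)
```

Exact matching is the strictest possible criterion; it is used here only because it keeps the sketch short.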

Typing provides the ability to track the clonal expansion of a bacterial population and has application in infection control. For example, the use of pulsed-field gel electrophoresis (PFGE) typing in enterohaemorrhagic strains of Escherichia coli (O157:H7) is a well-established method in public health surveillance in the United States, as typing provides the ability to track the transmission of this pathogenic strain during sporadic food-borne outbreaks [18]. In addition, bacterial typing has been applied to clinical diagnostics and the study of population genetics [19].

Advances in biotechnology have led to better typing techniques and improved clonal discrimination. Typing can be either phenotype-based or based on molecular targets (gene-based). The classical typing methods are phenotype-based and require differential expression of bacterial phenotypes, such as patterns of electrophoretic mobility of soluble cellular enzymes (for example, multilocus enzyme electrophoresis or MLEE [20]), the use of lytic bacteriophage (for example, Staphylococcus aureus phage typing [21]), and serotyping (for example, Streptococcus pneumoniae [22]). Gene-based typing methods involve the characterisation of genetic variations in bacterial DNA. Several gene fingerprinting methods have been investigated, including restriction fragment length polymorphism (RFLP) [23], ribotyping [24], variable number tandem repeat analysis, single nucleotide polymorphisms, and multilocus sequence typing [25]. In general, gene-based typing methods offer better discriminatory power compared to the phenotypic methods [25].

1.3.2 Virulence mechanisms and virulence genes

Definition of virulence

Broadly speaking, the virulence of a pathogen is defined as the degree of pathogenicity or the relative capability of a microbe to cause host damage. To define the term “virulence” sensu stricto, however, is non-trivial. Many attributes of virulence have been proposed, including the toxicity of a pathogen (the “dosage” of bacteria required to cause disease), the aggressiveness of the pathogen (the timeliness or severity of disease), the ability to colonise and multiply in the host, the ability of the pathogen to spread from one host to another (pathogen transmission), and the ability of the pathogen to evade or elicit inappropriate immune responses in the host [26, 27]. Because virulence encompasses complex interactions with the host, several attempts have been made to define virulence with the inclusion of host factors and host immune status [26, 28].

Virulence genes

Virulence genes are the genes that encode a constellation of biochemical mechanisms to confer pathogenicity in a microorganism. In bacterial pathogens, the virulence gene products may have roles in attachment, colonisation, multiplication, spreading, invasion, and self-protection [29]. Wassenaar and Gaastra classified virulence genes into three broad categories, namely the true virulence genes (protein invasins or genes that directly interact with the host), virulence-associated genes (genetic factors required to activate true virulence genes), and virulence-lifestyle genes (genes assisting with the colonisation and replication of the pathogen) [30].

Characteristics of virulence genes

Identification of virulence genes is important in assisting with the scientific development of clinical diagnosis and therapeutics of infectious diseases. In the 1880s, Koch proposed a set of criteria that must be fulfilled (known as Koch’s postulates) in establishing an aetiological role of a microbe in causing human infection [31]. Analogous criteria were proposed by Falkow to cover the contributions of bacterial genetic factors in human infections (the molecular Koch’s postulates [32]):

1. The virulence genes should be present in all pathogens and absent in non-pathogens.

2. The presentation of disease should be associated with the pathogenic microbes and vice versa.

3. The inactivation of the virulence gene(s) should demonstrate an attenuated virulence in the pathogen and its clones.

4. The reversal of virulence gene functions, with allelic replacement, should restore pathogenicity.

Alternative frameworks to Falkow’s postulates have been proposed, for example, Fredericks and Relman’s adoption of Hill’s criteria of causation in inferring a potential pathogenic role of a putative pathogenicity gene [33]. In the second part of this thesis, the first rule of molecular Koch’s postulates will be explored for developing statistical candidate gene prioritisation (CGP), a bioinformatic method that assists with the task of virulence gene discovery.
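As a rough illustration of how the first postulate might be turned into a ranking, the sketch below scores each gene by the fraction of pathogenic genomes carrying it minus the fraction of non-pathogenic genomes carrying it. This scoring rule, the function names, and the toy genomes are assumptions made purely for illustration; they are not the statistical CGP method developed later in this thesis.

```python
def postulate_score(present_in, pathogens, non_pathogens):
    """Fraction of pathogens with the gene minus fraction of non-pathogens with it.

    A score of 1.0 means present in all pathogens and absent in all
    non-pathogens; a score of -1.0 means the opposite pattern.
    """
    in_pathogens = len(present_in & pathogens) / len(pathogens)
    in_non_pathogens = len(present_in & non_pathogens) / len(non_pathogens)
    return in_pathogens - in_non_pathogens


def rank_genes(gene_presence, pathogens, non_pathogens):
    """Return gene names sorted from highest to lowest score."""
    scores = {
        gene: postulate_score(genomes, pathogens, non_pathogens)
        for gene, genomes in gene_presence.items()
    }
    return sorted(scores, key=lambda gene: scores[gene], reverse=True)


# Hypothetical genomes and gene presence data.
pathogens = {"P1", "P2", "P3"}
non_pathogens = {"N1", "N2"}
gene_presence = {
    "geneA": {"P1", "P2", "P3"},              # pathogens only
    "geneB": {"P1", "N1", "N2"},              # mostly non-pathogens
    "geneC": {"P1", "P2", "P3", "N1", "N2"},  # present everywhere
}

ranking = rank_genes(gene_presence, pathogens, non_pathogens)
print(ranking)
```

A gene found in every genome, such as the hypothetical geneC, scores zero under this rule, which reflects the requirement that a virulence gene be absent in non-pathogens as well as present in pathogens.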

1.4 Antibiotics and their optimal use

Antibiotics

The discovery of the antibacterial action of the Penicillium fungus by Fleming revolutionised the treatment of bacterial infections [34]. Antibiotics (antibacterial drugs used to treat infectious diseases) selectively target pathogenic bacteria either by directly damaging bacterial cells (bactericidal) or by inhibiting their multiplication (bacteriostatic). Advances in bacteriology have led to the development of antibacterials targeting different components of bacterial cells, including inhibiting cell wall synthesis (β-lactams, vancomycin) or protein synthesis (macrolides, tetracycline, aminoglycosides, linezolid), interfering with metabolism (sulphonamides and trimethoprim), and inhibiting nucleic acid synthesis and replication (metronidazole, rifampin, and quinolones) [35].

Antibiotic resistance

Despite the success of antibiotics in treating bacterial infections, it was realised as early as 1948 that overuse of penicillin could lead to a phenomenon, known as drug indifference, allowing bacteria to survive under constant antibiotic treatments [36].

The decrease in antibiotic sensitivity is a significant clinical problem as it limits the choices available to clinicians in managing infectious diseases effectively. Bacteria possess an array of versatile mechanisms to combat antibiotics, including efflux of antibiotic molecules (macrolide and tetracycline efflux pumps), inactivation of drug molecules (β-lactamase), and modification of enzymatic drug targets (altered penicillin-binding proteins) or of genes responsible for replication and translation (e.g. quinolone, linezolid, sulphonamide, trimethoprim, and rifampin resistance) [35].

Resistance to antibiotics plays an important part in pathogen colonisation [37]. Administration of antibiotics exerts a positive selection pressure on heterogeneous bacterial populations and favours the survival of resistance phenotypes, which can then undergo clonal expansion and dissemination, leading to the expansion of a resistant bacterial population [37, 38]. The emergence of antibiotic resistance is evident in a recent study, where resistance of oral streptococcal flora to macrolides was demonstrated to be inducible by only a short-term administration of azithromycin and clarithromycin [39].

Mechanisms for acquiring antibiotic resistance

The genes encoding resistance phenotypes can be acquired either through de novo mutations or horizontal gene transfer (HGT). HGT allows bacteria to share their resistance genes through mechanisms of DNA uptake (transformation), bacteriophage-mediated transfer (transduction), or sharing via transposon-plasmid mechanisms (conjugation) [38]. HGT facilitates the spread of resistant phenotypes among pathogen strains or across bacterial species. For example, genes contributing to tetracycline resistance (tet) are almost invariably associated with mobile genetic elements in the form of plasmids, conjugative transposons, or gene cassettes (integrons), allowing their spread between different pathogen strains and species [40].

Optimal prescribing of empirical antibiotic therapies

Empirical antimicrobial therapy is the “blind” treatment of infectious diseases with potential virulent pathogens in mind. An empirical regimen is a “cover-all” approach in infectious disease management with unknown pathogen virulence. The imprecision of such antibiotic use poses a risk of accelerated emergence of antibiotic resistance.

To minimise the risk of emerging resistance phenotypes in bacteria, antibiotic guidelines frequently advocate “prudent prescribing”. However, such “prudence” cannot be efficiently achieved without precise identification of invasive bacterial types. Optimal antibiotic treatment thus requires accurate differentiation of virulent pathogens from their non-virulent counterparts.

Translating bacterial genetics to optimise antibiotic prescribing

The study of bacterial genetics may help us in identifying pathogens in clinical settings. There are two main types of methods that can be applied to study the association between bacterial genetics and clinical outcomes. Descriptive studies with molecular markers investigate the specific distribution of bacterial strains in host populations. This type of study is useful in identifying pathogen transmission patterns through the tracking of molecular patterns for outbreak detection [15, 19].

As an alternative to descriptive studies, predictive studies aim to build models using multiple biological markers to forecast clinical outcomes [41,42]. Predicting clinical outcomes of infectious diseases from pathogen characteristics, in the form of “pathogen profiles” comprising genomic, transcriptomic, proteomic, and metabolomic data, may contribute to early diagnosis, thus resulting in better risk stratification and improved antibiotic therapy [43].

1.5 Virulence gene discovery

1.5.1 Classical approach

As illustrated in Figure 1.4, searching for virulence genes in bacteria requires a two-stage experimental approach. The first stage involves the confirmation of the biological role of a candidate gene by inactivation, followed by its isolation from invasive isolates and insertion into a non-virulent strain to reproduce virulent behaviour [44]. Mutants with inserted virulence genes then demonstrate their pathogenicity in an animal model [44]. The second stage involves methods such as a transgenic animal model (for example, a knockout mouse) or positional cloning to investigate the host response to the virulence gene [32]. Overall, such experimental procedures are time-consuming, requiring years of laboratory work.

The use of comparative genomics with computational analysis or other high-throughput methods is gradually gaining wider acceptance and replacing the traditional approach.

[Figure 1.4 diagram: a microbial-factor pathway (candidate gene extraction, gene knock-out, virulent clone, avirulent mutant, virulence gene identified) and a host-factor pathway (animal model, host factor identification, transgenic animal, virulence gene characterised).]

Figure 1.4: The classical approach of virulence gene discovery. The traditional method of virulence gene discovery necessitates the demonstration of the gene’s pathogenicity by gene-knockout studies. The pathogenic mechanism of the virulence gene is subsequently investigated by using a transgenic animal model.

1.5.2 Bacterial comparative genomics

Aspects of genomes that can be compared

Many aspects of bacterial genomes can be compared. Crude comparisons of bacterial genomes, as summarised by Binnewies et al., may involve comparison of chromosome alignment, genome size, guanine-cytosine (GC) content, tRNA and codon usage, analysis of transcription factors, and BLAST atlas and matrices [45].
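Of the crude comparisons listed above, GC content is the simplest to compute. The following sketch assumes that a genome is available as a plain string of nucleotide letters; the function name `gc_content` and the example sequences are illustrative only.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a nucleotide sequence.

    Every character counts towards the length, so ambiguous bases
    such as N lower the reported fraction.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    gc = seq.count("G") + seq.count("C")
    return gc / len(seq)


# Two made-up sequences compared by GC content.
print(gc_content("ATGCGC"))
print(gc_content("ATATAT"))
```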

In-depth analysis with comparative genomics aims to ask the following fundamental questions, which are important in understanding microbes and assisting in the identification of gene functions [46]:

1. Which genes are unique to a bacterial species or genome? (Unique genes)

2. Which genes are required for the normal functioning of bacteria? (Core genes)

3. Which genes undergo selection, thus conferring a survival advantage to a bacterium in the host environment?

Unique genes

The central dogma of molecular biology is the unidirectional flow of information from DNA to protein to phenotype. As stated by Falkow’s molecular postulates, genes encoding a particular function are expected to be present in the strains displaying the phenotype and absent in the genomes that do not. In virulence gene research, comparative genomics can reveal different gene compositions between pathogenic and non-pathogenic organisms, and has led to the identification of novel diagnostic markers and vaccine targets in several pathogenic bacterial species including E. coli, S. aureus, and Helicobacter pylori [44, 47].

Core genes

Genes required for the basic functioning of a bacterium are usually well conserved across a large number of species. These genes, or the core set of genes, can be identified by comparing multiple genomes. For example, Arigoni et al. performed a comparative analysis of an Escherichia coli genome and a Mycoplasma genitalium genome to reveal essential genes that are needed for bacterial survival [48].
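The unique and core gene questions can both be phrased as set operations once each genome is reduced to a set of gene identifiers. The sketch below makes that simplifying assumption; in practice, deciding that two genes from different genomes are “the same” requires sequence comparison, which is not shown. The genome names, gene labels, and function names are hypothetical.

```python
def core_genes(genomes):
    """Genes present in every genome (intersection of all gene sets)."""
    return set.intersection(*genomes.values())


def unique_genes(genomes, target):
    """Genes present in the target genome and in no other genome."""
    others = set().union(
        *(genes for name, genes in genomes.items() if name != target)
    )
    return genomes[target] - others


# Hypothetical genomes, each reduced to a set of gene labels.
genomes = {
    "g1": {"a", "b", "c"},
    "g2": {"a", "b", "d"},
    "g3": {"a", "e"},
}

print(core_genes(genomes))
print(unique_genes(genomes, "g1"))
```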

Genes under pressure

Virulence genes are constantly under natural selection pressure in hostile environments. Comparative genomics can be used to identify genes undergoing positive selection. Chen et al. compared uropathogenic E. coli with other E. coli strains by phylogenetic analysis, identified positively selected genes which may contribute to urinary tract infection, and subsequently validated these in a case-control study [49].

1.5.3 Tools for comparing different bacterial genomes

DNA microarrays

The use of DNA microarrays to search for virulence genes allows population-based comparison of gene composition between pathogenic and non-pathogenic genomes [50]. Earlier studies with DNA microarrays included H. pylori and N. meningitidis. Analysis of H. pylori has identified a pathogenicity island (cag gene cluster) associated with increased virulence, and several candidate virulence genes were also suggested [51]. DNA microarrays have also been applied to Neisseria spp. to identify specific genes of N. meningitidis and have identified several potential genes on laterally acquired pathogenicity islands associated with virulence [52, 53].

Genome sequence comparison

The increasing number of publicly available whole-genome sequences should now allow in-depth analysis of pathogen biology. Genome sequences can help both gene screening and DNA microarray construction. Genome sequence analysis has several advantages over DNA microarrays in virulence gene discovery:

1. DNA microarrays are unable to detect unknown or uncharacterised genes.

2. The intergenic regions of the bacterial chromosome are typically not printed on DNA microarrays and so these areas are usually missed.

3. The highly specific process of microarray hybridisation means that unknown mutations are less likely to be detected.

In S. agalactiae, genome sequence comparison has led to the discovery and rediscovery of potential virulence candidates, including the Sip protein, CAMP factor, hyaluronidase, and β-haemolysin [54].

Reverse vaccinology is a novel application of comparative genomics to identify likely vaccine candidates in bacteria. Classical “forward” vaccine development involves using a live-attenuated pathogen to test for immunogenicity, or using protein screening to identify immunogenic bacterial components. The process of vaccine development may take decades of research. With reverse vaccinology, candidates are first screened by in silico bioinformatic analysis, followed by concurrent immunogenicity testing using parallel recombinant DNA expression techniques [55]. This approach drastically shortened the vaccine development time to 1–2 years [56], and has successfully been applied to develop a vaccine for N. meningitidis serogroup B infections [57].

Box 1.1: Reverse vaccinology – an example of in silico comparative genomics

1.6 Candidate gene prioritisation (CGP)

Biological validation with wet-lab experiments is the rate-determining step in virulence gene discovery. Thus, each incorrect candidate gene adds time to the discovery cycle. An improvement in gene selection can thus accelerate the rate of discovery.

One approach to improve the “hit-rate” of gene search is to rank the candidate genes according to an objective likelihood measure, thus increasing the chance of finding “correct” genes with the least number of trials. Several in silico CGP methods have been reported for ranking human genes in the search for genes underlying inheritable diseases. CGP methods will be reviewed in detail in Chapter 4.

1.7 Guide to thesis

This thesis explores several research questions around in silico virulence prediction and virulence gene discovery in three parts:

Part I examines the clinical and molecular epidemiology of a perinatal pathogen, group B streptococcus (GBS), with an aim to translate bacterial genetic data into useful clinical information by performing descriptive and predictive analyses.

Chapter 2 reviews the clinical manifestations, epidemiology, and methods of typing of GBS.

Chapter 3 contains a descriptive study of 912 GBS clinical isolates using 18 distinct molecular markers to investigate potential markers associated with GBS virulence. The second part of this chapter investigates the hypothesis that GBS virulence can be predicted by using molecular markers with supervised machine learning models.

Part II of the thesis develops and evaluates two CGP methods, based on phylogenetic profiles, to prioritise bacterial genes associated with particular phenotypic traits. The purpose of developing such bioinformatic methods is to assist with the labour-intensive task of virulence gene discovery in bacterial pathogens.

Chapter 4 reports a literature review of methods of in silico CGP that are currently used for discovering genes related to inheritable diseases in humans.

Chapter 5 develops two methods for prioritising bacterial genes in functional discovery. Statistical CGP extends the concept of Falkow’s molecular Koch’s postulates by ranking candidate genes based on their frequency of occurrence in genomes of known phenotypes. The second method, inductive CGP, aims to find genes of similar function by applying supervised machine learning to patterns of gene occurrence across multiple genomes.

Chapter 6 benchmarks these CGP methods with three systematic evaluations: rediscovering peptidoglycan-related genes, genes responsible for anaerobic mixed-acid fermentation, and genes in the pathways curated in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database.

Part III of this thesis applies both CGP methods to find unknown virulence genes in S. agalactiae.

Chapter 7 reviews the currently known virulence genes in GBS, which are used for discovering genes with inductive CGP in Chapter 9.

Chapters 8 and 9 perform CGP experiments to discover GBS virulence factors yet to be characterised.

Chapter 10 discusses the biological significance of the prioritised genes in Chapters 8 and 9.

Part I

Virulence prediction in Streptococcus agalactiae

Chapter 2

Group B streptococcal diseases and typing methods

2.1 Introduction

Streptococcus agalactiae, or group B streptococcus (GBS), is a Gram-positive β-haemolytic bacterium which emerged as an important pathogen in the early 1960s [58,59]. GBS is normally part of the microflora in the human female urogenital tract, with a sporadic ability to cause serious infection around the pregnancy period. In recent years, GBS infections have also been increasingly reported in non-pregnant adults with underlying comorbidities, such as in immunocompromised individuals or in patients with diabetes mellitus or neoplastic disorders [60].

The significant morbidity associated with neonatal GBS infections poses a challenge for obstetricians and paediatricians. The manifestations of GBS neonatal infections may include life-threatening pneumonia, meningitis, or sepsis, a range of serious clinical consequences that prompted scientific investigations into effective public health measures to reduce their incidence. The institution of prophylactic antibiotic therapy has been effective against an important subgroup of neonatal infections, known as early-onset disease (EOD). However, the other subgroup of GBS neonatal disease, late-onset disease (LOD), has not been reduced by the preventative measures.

Table 2.1: Clinical manifestations of perinatal GBS diseases

Neonatal diseases                      Maternal diseases
Sepsis                                 Urinary tract infection
Pneumonia                              Chorioamnionitis
Acute respiratory distress syndrome    Endometritis
Meningitis                             Bacteraemia
Septic arthritis
Osteomyelitis
Cellulitis

As discussed in Chapter 1, improvements in microbial typing technology have enabled the tracking of bacterial subpopulations and identification of individual strains of bacteria. Evidence suggests that some virulent GBS strains are associated with certain typing markers. The later part of this chapter will discuss different typing methods used in distinguishing different GBS strains. With our ability to track GBS strains improving, it is anticipated that these typing methods can assist with risk stratification in clinical settings. In Chapter 3, methods for building predictive models with molecular markers will be explored in more detail.

2.2 Clinical manifestations of GBS diseases

Most GBS strains isolated from humans can be found colonising the female urogenital tract. However, a small proportion of GBS can display virulent behaviour and cause a spectrum of diseases in pregnant women or in neonates.

2.2.1 Maternal carriage and disease

GBS is found to colonise 15–35% of women during the normal course of pregnancy [61, 62]. Although most individuals colonised with GBS are clinically asymptomatic, up to 1% of these pregnant women can develop clinical or sub-clinical presentation of bacteraemia with leukocytosis, bacteriuria, concurrent urinary tract infection, chorioamnionitis, and endometritis [63]. The development of signs and symptoms of maternal infection is closely associated with, although not a prerequisite for, early-onset neonatal infection (Table 2.2) and stillbirth [64].

2.2.2 Early-onset neonatal disease

EOD is defined as the clinical manifestation of bacterial infection within the first seven days of life. Data from 1970–1990 indicated that the incidence of EOD caused by GBS was up to 2.09 cases per 1000 live births [65]. EOD constituted up to 73% of all neonatal GBS diseases [66]. The pathogenesis of EOD caused by GBS is believed to be via vertical transmission (from mother to baby). The development of chorioamnionitis in pregnancy indicates an early infectious process, which leads to ascending spread of bacteria from the female genital tract to the uterus and causes uteritis [67]. Foetal contact with GBS is believed to be caused by subsequent aspiration of contaminated fluids during delivery [67].

The mortality of EOD caused by GBS is considerably higher in premature infants than in term infants. The overall mortality of EOD in term infants was found to be up to 7.4%, whereas 19–25% of infants born before 37 weeks of gestation can die as a consequence [68].

Antibiotic prophylaxis in preventing GBS EOD

A risk-based prophylactic strategy has been used to treat the high-risk pregnancy group (pregnancies with clinical risk factors) with intrapartum penicillin, and evidence has demonstrated that such strategies have effectively reduced the incidence of GBS EOD by 60–75% [71–74]. Several surveys conducted in 1990–2003 suggested that the incidence of EOD caused by GBS was reduced to 0.4–0.8 per 1000 live births in developed countries [68, 74, 75].

Table 2.2: Maternal and obstetric risk factors associated with early-onset GBS diseases [61]

Risk factors                                              Ref.
Chorioamnionitis                                          [69]
Maternal GBS bacteriuria or urinary tract infection       [69]
Heavy maternal colonisation                               [61]
Fever (> 38.0 °C)                                         [61]
Pre-term labour (< 37 weeks)                              [61]
Premature rupture of membrane (PROM)                      [61]
Prolonged PROM (> 12 hours)                               [69, 70]
Previous stillbirth                                       [64]
Low birth weight                                          [70]
Prolonged obstetric monitoring or vaginal examinations    [61]
Gestational diabetes                                      [68]
Twin with early-onset GBS disease                         [69]

A screening-based strategy has been proposed as an alternative approach to treating high-risk groups [76], based on evidence that heavy maternal GBS colonisation is strongly associated with EOD. The strategy recommends that intrapartum antibiotics should be given if GBS is present in cultures at the late stage of pregnancy (35–37 weeks) [77]. Studies have found that universal screening prevents more cases of EOD than the risk-based method [78]. The benefits of antenatal screening were demonstrated in several population cohorts, in which the incidence of EOD caused by GBS was further reduced to 0.34 per 1000 live births in the United States [79, 80]. On this basis, the screening-based approach was recommended in the revised guidelines published by the Centers for Disease Control and Prevention of the United States [77].

Potential concerns over the emergence of antibiotic resistance

Current GBS guidelines recommend a risk-based or screening-based approach in preventing EOD in neonates. The high carriage rate suggests that these approaches grossly over-treat many GBS types that are part of the commensal microflora. As discussed in Chapter 1, antibiotic resistance is a potential concern with any antimicrobial therapy. Although there is currently no evidence to suggest an increase in antibiotic resistance among GBS and non-GBS organisms after the introduction of chemoprophylactic regimes [81, 82], there remains a theoretical threat of the emergence of resistant pathogens due to the high prevalence of GBS carriage among pregnant women. In addition, indiscriminate exposure of mother and baby to penicillin may also increase the risk of replacing penicillin-sensitive anaerobes with penicillin-resistant counterparts. Exposure of infants to penicillin during delivery may also be associated with delayed or distorted colonisation by normal flora.

To reduce the over-treatment of GBS carriers, a more targeted approach to antibiotic prescribing than the empirical recommendations is needed. Accurate classification of invasive GBS strains is essential in the identification of high-risk pregnancies. At present, there has been only limited work attempting to predict GBS virulence. The next chapter (Chapter 3) will investigate prediction of GBS virulence using GBS genetic markers and supervised machine learning.

2.2.3 Late-onset disease

A distinct group of GBS infections in neonates, late-onset disease (LOD), occurs between 7 and 30 days of life. LOD was reported to have an incidence of 0.2 to 0.5 per 1000 live births and carried a mortality of up to 6% [61,80]. As with EOD, major risk factors associated with GBS LOD include prematurity and heavy maternal colonisation as evident in antenatal cultures [83]. Although the exact pathogenesis of LOD remains elusive, a nosocomial mode of transmission has been suggested as there were case reports of GBS dissemination within neonatal intensive care units (NICU) [67]. Infants with LOD present with bacteraemia, meningitis, osteoarthritis, and cellulitis, in contrast to EOD where pneumonia is the most common presentation [84].

Antibiotic prophylaxis for EOD has not affected the overall incidence of LOD caused by either GBS or other non-GBS microorganisms [85], an observation that further supports the hypothesis that EOD and LOD have different aetiologies [73].

2.3 Typing of GBS strains

2.3.1 Serotyping

Methods of GBS typing allow GBS populations to be grouped and tracked. Typing of GBS strains is traditionally performed by immunophenotypical methods. Recent DNA-based typing methods allow more accurate and rapid detection.

Group B antigen

All GBS strains possess the Lancefield group B antigen in the classification of β-haemolytic streptococci [86]. The group B specific antigen is a complex polysaccharide made up of four distinct oligosaccharides containing L-rhamnose, D-galactose, D-glucitol, and N-acetylglucosamine forming a multiantennary structure [87]. The group B antigen, however, does not confer protective immunity in human and animal models, nor is it associated with the invasiveness of GBS [84].

Polysaccharide capsule

The serological sub-classification of group B haemolytic streptococci was originally performed by Lancefield in 1934 [86]. Nevertheless, the duration and poor sensitivity of capillary precipitation prompted the development of alternative methods to improve efficiency and accuracy, including latex agglutination [88, 89], microimmunodiffusion [90], co-agglutination patterns [91], and inhibition enzyme-linked immunosorbent assay [92]. To date, nine distinct serotypes have been described (Ia, Ib, II to VIII) and a further serotype has been proposed [93]. Variations in serotype originate from the genetic polymorphism of the cps gene cluster [94]. Genetics of the cps cluster will be reviewed in detail in Chapter 7.4.3.

Distribution of GBS serotypes in clinical populations

Serotypes Ia (16–32%) and III (20–59%) are the predominant serotypes in EOD [61, 62, 66, 75, 80]. Between 50% and 71% of LOD isolates belong to serotype III [80, 95], which is particularly associated with meningitis [75, 95]. Serotype V is the most common cause of infections in non-pregnant adults [80, 96, 97], although it is also responsible for up to 23% of neonatal invasive infections [75]. Serotypes VIII (25%) and VI (36%) are the most common serotypes colonising healthy Japanese women [98].

2.3.2 Molecular characterisation

Characterising GBS strains by serotyping can be subjective and up to 20% of isolates are non-typeable [99, 100]. Reasons for failure of serotyping methods include non-expression of polysaccharide capsule or inadequacy of bacterial count [15]. Such poor discrimination has prompted the development of gene-based typing methods to improve the efficiency and accuracy in characterising GBS isolates [101].

Genotyping of the cps gene cluster

The GBS capsular exopolysaccharide gene cluster (cps cluster) is responsible for the serotype diversity of GBS (see Chapter 7.4.3). Several DNA-based methods have been proposed to characterise GBS genotypes in concordance with serotypes, including the analysis of restriction fragment length polymorphism (RFLP) patterns [102, 103] and DNA microarrays [104]. Kong et al. developed a multiplex PCR-based reverse line blot (mPCR/RLB) method, based on the mapping of sequence polymorphism in the cpsE–G region to individual serotype groups (molecular serotyping, MS) [105, 106]. MS is included in the panel of markers for the analysis of GBS virulence in the next chapter.

Multilocus sequence typing

Multilocus sequence typing (MLST) distinguishes different clonal lineages of bacteria by examining the genetic polymorphism of housekeeping genes. Jones et al. developed a GBS MLST system consisting of seven housekeeping genes [107]. The MLST system was used to characterise 158 clinical GBS isolates, in which two clonal complexes (CC), CC17 and CC10, were found to cosegregate with the mobile genetic elements GBSi1 and IS1548 respectively [108].

Single nucleotide polymorphisms

Single nucleotide polymorphisms (SNPs) have also been used to discriminate different clonal complexes of S. agalactiae. Honsa et al. described a five-SNP typing method, together with the Not-N bioinformatic algorithm, capable of subclassifying GBS isolates into MLST-assigned clonal complexes, including the clinically significant strain CC-17 [109, 110].

Surface proteins C

The surface proteins Cα and Cβ were among the first antigenic proteins characterised in GBS. Cα is a protein invasin (a protein that facilitates invasion into the host) with several allelic variants (bca, alp1–5, and rib), which can be detected by PCR-based methods to achieve surrogate serotyping [111]. The presence of the rib gene is associated with the MLST sequence types (ST) ST-17 and ST-19 [112,113], whereas ST-22 was found to be associated with bca [113]. Multiplex PCR-based characterisation of the second C protein antigen, Cβ (bac, Chapter 7.4.1), has also been reported [114]. Invasiveness was found to be associated with shorter tandem repeats in serotype Ib [114].

Mobile genetic elements

Mobile genetic elements (MGE) play important roles in horizontal gene transfer and alteration of virulence in pathogenic bacteria. Several MGEs are well characterised in GBS. IS861 (1,442 bp), an insertion sequence sharing homology with IS3 and IS150 of S. pneumoniae, was reported to be present in the hypervirulent strain COH-1 (serotype III) [115]. ISSa4 (963 bp) was found in some isolates to be inserted into cylB (β-haemolysin/cytolysin gene cluster, Chapter 7.3.2), producing non-haemolytic GBS strains [116]. IS1548 (1,317 bp) has been reported in some strains, and its insertion into the hylB gene inactivates streptococcal hyaluronan lyase [117]. The group II intron GBSi1 was found to be inserted between the C5a peptidase gene (scpB) and the laminin-binding protein gene (lmb) in GBS strains not containing IS1548 [118]. The presence of GBSi1 is associated with a higher proportion of meningitis type III isolates [119]. Two copies of ISSag2 flank the edges of a composite transposon containing the scpB–lmb gene region, which is present in nearly all human GBS isolates [120].

In addition to the association of MGEs with virulence, the acquisition of new mobile elements can be used to track the evolutionary lineages of GBS isolates when compared with MLST studies [121], making MGEs candidates for molecular classification.

The distribution of the MGEs in Australasian GBS strains is: IS1381 (87%), IS861 (52%), IS1548 (17%), ISSa4 (6%), and GBSi1 (18%) [122].

Antibiotic resistance genes

The phenotypic traits of antibiotic resistance may assist in pathogen colonisation and are important factors associated with virulence. In particular, genes encoding mechanisms of antibiotic resistance are frequently associated with horizontal gene transfer. For example, the conjugative transposon Tn916 is found in several streptococcal species and is associated with the horizontal transfer of antibiotic resistance mediated by the tetracycline resistance gene tetM [123]. Zeng et al. described a multiplex PCR-based reverse line blot method for detecting a panel of antibiotic resistance genes simultaneously [124]. Genes associated with tetracycline resistance (tetM, 77–82%) and the integrase of Tn916 (int-Tn, 57–84%) were the most prevalent markers in the 512 Australasian isolates studied [124].

2.3.3 Genome sequences

The genome sequences of several virulent strains of GBS have been determined. In 2002, two GBS genomes were sequenced by Glaser et al. and Tettelin et al. respectively, including an invasive serotype III strain (NEM316, MLST ST-23) and a serotype V strain (2603 V/R) [54,125]. The analysis of the NEM316 genome revealed 14 pathogenicity islands containing surface proteins [125]. The whole-genome sequences of six further pathogenic GBS reference strains (A909, H36B, 18RS21, 515, COH1, and CJB111) were determined in 2005 [126]. Comparative genomic analysis of the eight GBS genomes suggested that there are potentially infinite dispensable genes in the GBS “pan-genome”, a phenomenon that greatly contributes to the genetic diversity of S. agalactiae [126].

The later part of this thesis analyses whole-genome sequences to search for currently uncharacterised virulence genes. At the time the experiments were conducted (April 2007), three GBS whole-genome sequences (NEM316, 2603 V/R, A909) were available from the National Center for Biotechnology Information (NCBI) database.

Chapter 3

Classification of GBS virulence with bacterial genetic markers

3.1 Introduction

As discussed in Chapter 2, the balance between benefit and cost needs to be considered in any use of antimicrobial chemotherapy. Figure 3.1 illustrates that more accurate identification of invasive microorganisms among the colonising microflora should lead to clinical benefits, as the potential emergence of antibiotic resistance will be minimised by more precise targeting of pathogenic strains of S. agalactiae.

Several molecular methods have been described to characterise the genetics of invasive GBS strains [97,127–129]. Despite advances in GBS genotyping methods, it remains largely unknown whether molecular markers can be applied to predict the spectrum of GBS diseases. This chapter tests the hypothesis that clinical outcomes can be accurately predicted using bacterial genetic markers, by exploring the predictive relationships between GBS genotypes and virulence. Both statistical and supervised machine learning methods are applied to test this hypothesis.

[Figure 3.1 (flow diagram): bacterial isolates → genotyping → bacterial genotype → virulence prediction. Isolates predicted to be invasive (pathogens, high-risk) are directed to antimicrobial therapy or prophylaxis; isolates predicted to be non-invasive (commensals, low-risk) receive no therapy.]

Figure 3.1: The goals of virulence classification. Accurate identification of invasive pathogens can improve on empirical therapy, reducing imprecise antibiotic usage and thereby the risk of associated complications.

The results in this chapter have been reported in Pathology [130].

3.2 Descriptive analysis of GBS markers

3.2.1 Material and methods

GBS Isolates

Nine hundred and twelve human GBS isolates from routine laboratory cultures were obtained from Australia (n=331), New Zealand (n=448), North America (n=58), Germany (n=10), and East Asia (n=65) between 1991 and 2005. GBS isolates were collected by collaborators at the Centre for Infectious Diseases and Microbiology, Sydney West Area Health Service, Sydney, Australia. Isolates with unknown sites of isolation were excluded from the analysis. The samples were grouped into two clinical categories (invasive versus colonising). In the invasive group, 780 isolates were acquired from sterile sites including cerebrospinal fluid, blood and joint fluid aspirates (from patients of any age). The colonising group consisted of 132 isolates obtained from vaginal swabs collected from routine antenatal screening according to the screening-based . Characteristics of the two categories of subjects from whom the isolates were obtained are shown in Table 3.1.

Table 3.1: Characteristics of 912 group B streptococcus isolates

Isolate characteristics                          Invasive      Colonising
                                                 (n = 780)     (n = 132)
Age group
  Neonate, age <7 days                           182 (23%)     0 (0%)
  Neonate, age 7–90 days                         105 (13%)     0 (0%)
  Adult                                          421 (54%)     132 (100%)
  Unknown or not specified                       72 (9%)       0 (0%)
Gender
  Female                                         351 (45%)     132 (100%)
  Male                                           278 (36%)     0 (0%)
  Unknown or not specified                       12 (2%)       0 (0%)
Site of isolation
  Vaginal swabs from routine gestational         0 (0%)        132 (100%)
    screening at 37 weeks
  Blood                                          680 (87%)     0 (0%)
  Cerebrospinal fluid                            42 (5%)       0 (0%)
  Joint aspirate                                 7 (1%)        0 (0%)
  Other sterile sites                            51 (7%)       0 (0%)
Geographical origin
  Australia                                      230 (29%)     101 (76%)
  Germany                                        10 (1%)       0 (0%)
  Hong Kong                                      54 (7%)       0 (0%)
  Korea                                          11 (1%)       0 (0%)
  New Zealand                                    447 (57%)     1 (1%)
  North America                                  28 (4%)       30 (23%)

[Figure 3.2 (tree diagram): GBS isolates (n = 912) divide into invasive isolates (n = 780) and vaginal colonising isolates (n = 132), compared in Table 3.2. Invasive isolates divide into adult invasive (n = 493) and neonatal invasive (n = 287) isolates. Adult invasive isolates include women at childbearing age (n = 100); neonatal invasive isolates divide into early-onset (n = 182) and late-onset (n = 105) diseases. Subgroup pairs (a) to (d) are compared in Tables 3.3 and 3.4.]

Figure 3.2: The important GBS clinical subgroups. Invasive isolates were compared with vaginal colonising isolates. Four clinical subgroup pairs were also compared: a) neonatal versus vaginal colonising isolates, b) early- versus late-onset neonatal diseases, c) adult versus neonatal invasive diseases, and d) women at childbearing age versus vaginal colonising isolates.

In addition to the overall comparison of invasive with colonising isolates, four clinically important subgroup pairs were compared (Figure 3.2). Invasive neonatal GBS isolates (n=287) were compared with antenatal vaginal isolates (n=132). Isolates from early-onset neonatal diseases (n=182, defined as infections occurring within the first 6 days of life) were compared with isolates from late-onset diseases (n=105, infections occurring at 7 days of life or later). Adult invasive isolates (n=493) were compared with neonatal invasive isolates. Invasive isolates from women of childbearing age (WCBA, defined as women aged 15–45, n=100) were also compared with colonising isolates.

GBS molecular markers

A panel of GBS genotype markers from several virulence-associated genes was selected for genotyping, including markers for polysaccharide capsule genes (molecular serotype), variants of the surface antigen/invasin (Cα-like protein family or protein genetic profiles), mobile genetic elements, and antibiotic resistance-related genes. Multiplex PCR-based reverse line blot (mPCR/RLB) assays were performed by collaborators at the Centre for Infectious Diseases and Microbiology, Sydney West Area Health Service, Australia. The GBS markers studied in this chapter are listed below. These markers were described in detail in Chapter 2.3.2:

• molecular serotypes (MS: 9 types; Ia, Ib, II to VIII),

• molecular subtypes of MS-III (MSST: 4 types; III-1 to III-4),

• protein genetic profiles (PGP: 6 types; bca, alp1–3, rib, or none),

• 7 mobile genetic elements (presence or absence of insertion sequences IS1381, IS1548, IS861, ISSag1, ISSag2, ISSa4 and group II intron GBSi1), and

• 8 antibiotic resistance-related genes (presence or absence of the following binary markers: aad-6 and aph-3, aminoglycoside resistance genes; ermB and ermTR, ribosomal methylase genes; int-Tn, encoding the integrase of transposon Tn916; mefA/E, encoding macrolide efflux pumps; tetM and tetO, both associated with tetracycline resistance; and mreA, encoding a flavokinase).

Statistical analysis

The differences in binary marker distributions were analysed by Pearson’s Chi-square statistic, except for groups with fewer than 5 isolates, where Fisher’s exact test was used. For genotypes with more than 2 alleles or types (MS, MSST, and PGP), data were first analysed as n × 2 tables to derive the test statistic. Statistical significance was determined at the level of α = 0.01. The standard errors of odds ratios were calculated by the methods described by Bland and Altman [131].
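As an illustration of the single-marker analysis described above, the following minimal Python sketch computes Pearson's Chi-square statistic for a 2 × 2 table and an odds ratio with a 99% confidence interval. It assumes the standard error of the log odds ratio is sqrt(1/a + 1/b + 1/c + 1/d), which is our reading of the method referred to above; the function names are hypothetical and the code is not the software used in this study. The interval from this approximation need not reproduce the intervals reported in Table 3.2, which may have been derived by a different (e.g. exact) method.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 df, no continuity correction) and p-value.

    a, b: isolates with / without the marker in the first group
    c, d: isolates with / without the marker in the second group
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def odds_ratio_ci(a, b, c, d, z=2.5758):
    """Odds ratio with a log-method interval; z = 2.5758 gives a 99% interval."""
    odds_ratio = (a * d) / (b * c)
    # Assumed standard error of log(OR): square root of the summed reciprocals.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# alp3 counts from Table 3.2: 164 of 780 invasive, 11 of 132 colonising isolates.
stat, p_value = chi_square_2x2(164, 780 - 164, 11, 132 - 11)
odds_ratio, lower, upper = odds_ratio_ci(164, 780 - 164, 11, 132 - 11)
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
print(f"OR = {odds_ratio:.2f}, 99% CI = ({lower:.2f}, {upper:.2f})")
```

For cells containing fewer than 5 isolates the Chi-square approximation is unreliable, which is why Fisher's exact test was used for those groups; the sketch does not implement that test.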

3.2.2 Results

[Figure 3.3 (pie charts). (a) Molecular serotypes: Ia (201), Ib (111), II (76), III (317), IV (21), V (161), VI (16), VII (6), VIII (3). (b) Protein genetic profiles: rib (309), alp1 (219), alp3 (175), bca (169), alp2 (32), none (8).]

Figure 3.3: Distributions of MS and PGP in 912 GBS isolates. The numbers in parentheses are the number of isolates in each molecular serotype or protein genetic profile.

The overall distribution of molecular serotypes (MS) was Ia: 22.1%, Ib: 12.2%, II: 8.3%, III: 34.7%, IV: 2.3%, V: 17.7%, VI: 1.8%, VII: 0.7%, VIII: 0.3% (Figure 3.3(a)). The distribution of protein genetic profiles was rib: 33.9%, alp1: 24.0%, alp3: 19.2%, bca: 18.5%, alp2: 3.5% (Figure 3.3(b)). Eight isolates contained neither rib nor any Cα-like protein genes. The frequencies of the other genetic markers are shown in Table 3.2.

Of the 18 markers studied, only alp3 demonstrated an association with invasiveness, as shown by a significantly increased odds ratio (OR: 2.93, 99% CI: 1.29–7.90, p=0.0003). In contrast, MS III, rib, IS1548, IS861, and int-Tn were inversely associated with invasive isolates (and also with the WCBA subgroup) when compared with antenatal vaginal isolates. Serotype III (OR: 2.71), III-2 (OR: 6.72), rib (OR: 2.45), GBSi1 (OR: 2.21), and IS861 (OR: 1.59) were associated with neonatal invasive disease when compared with adult infections, in which serotype V (OR: 0.31), alp3 (OR: 0.38) and IS1381 (OR: 0.40) predominated. Molecular serotype II was found to be associated with early-onset diseases when compared with late-onset diseases [132].

Table 3.2: Frequencies of GBS molecular markers in invasive versus colonising groups of isolates

Markers                                 Invasive    Colonising   OR     99% C.I.       p-value
                                        (n = 780)   (n = 132)
Molecular serotypes (MS, 9 types)                                                      < 0.01 ∗
  Ia                                    179 (23%)   22 (17%)     1.49   (0.79–2.02)    0.11
  Ib                                    98 (13%)    13 (10%)     1.31   (0.60–3.32)    0.47
  II                                    63 (8%)     13 (10%)     0.80   (0.35–2.07)    0.50
  III                                   252 (32%)   65 (49%)     0.49   (0.30–0.82)    < 0.01 ‡
  Serosubtypes of MS III (MSST, 4 types)                                               0.05
    III-1                               150 (19%)   48 (36%)     0.42   (0.24–0.72)    NS
    III-2                               65 (8%)     15 (11%)     0.71   (0.32–1.72)    NS
    III-3                               7 (1%)      0 (0%)                             NS
    III-4                               30 (4%)     2 (2%)       2.60   (0.48–53.3)    NS
  IV                                    21 (3%)     0 (0%)                             0.06
  V                                     145 (19%)   16 (12%)     1.65   (0.81–3.78)    0.08
  VI                                    15 (2%)     1 (1%)       2.57   (0.27–548.)    0.49
  VII                                   4 (1%)      2 (2%)       0.34   (0.03–8.89)    0.21
  VIII                                  3 (0%)      0 (0%)                             1.00
Protein genetic profiles (PGP)                                                         < 0.01 ∗
  A (bca)                               152 (20%)   17 (13%)     1.64   (0.81–3.64)    0.09
  alp1                                  195 (25%)   24 (18%)     1.50   (0.81–2.96)    0.10
  alp2                                  29 (4%)     3 (2%)       1.66   (0.38–15.9)    0.61
  alp3                                  164 (21%)   11 (8%)      2.93   (1.29–7.90)    < 0.01 †
  R (rib)                               234 (30%)   75 (57%)     0.33   (0.19–0.54)    < 0.01 ‡
  None                                  6 (1%)      2 (2%)       0.50   (0.06–12.2)    0.33
Mobile genetic elements (MGE)
  GBSi1                                 145 (19%)   37 (28%)     0.59   (0.33–1.06)    0.02
  IS1381                                661 (85%)   105 (80%)    1.43   (0.73–2.65)    0.16
  IS1548                                176 (23%)   51 (39%)     0.46   (0.27–0.79)    < 0.01 ‡
  IS861                                 381 (49%)   91 (69%)     0.43   (0.25–0.73)    < 0.01 ‡
  ISSag1                                757 (97%)   131 (99%)    0.25   (0.00–2.20)    0.24
  ISSag2                                763 (98%)   131 (99%)    0.34   (0.00–3.16)    0.50
  ISSa4                                 39 (5%)     2 (2%)       3.42   (0.65–69.3)    0.11
Antibiotic resistance-related genes
  aad-6                                 16 (2%)     1 (1%)       2.74   (0.29–583.)    0.49
  aph-3                                 12 (2%)     1 (1%)       2.05   (0.20–444.)    0.71
  ermB                                  26 (3%)     2 (2%)       2.24   (0.41–46.3)    0.41
  ermTR                                 24 (3%)     8 (6%)       0.49   (0.17–1.78)    0.12
  int-Tn                                461 (59%)   95 (72%)     0.56   (0.32–0.97)    < 0.01 ‡
  mef                                   19 (2%)     1 (1%)       3.27   (0.36–687.)    0.34
  mre                                   780 (100%)  131 (99%)                          0.14
  tetM                                  651 (84%)   114 (86%)    0.80   (0.36–1.60)    0.44
  tetO                                  23 (3%)     3 (2%)       1.31   (0.29–12.7)    1.00

Note: The groups MS, MSST and PGP were first analysed as 9×2, 4×2, and 6×2 contingency tables respectively. The significance of subgroups was only reported if statistical non-independence was found in the overall group. The remaining binary markers were analysed as 2 × 2 tables. Statistical significance was determined by using the Chi-square test or Fisher’s exact test (for groups with fewer than 5 isolates). Groups marked with (∗) were found statistically significant towards invasive (†) or colonising (‡) at α = 0.01. Abbreviations: OR: odds ratio; 99% C.I.: 99% confidence interval; NS: not significant.

Table 3.3: Frequencies of GBS molecular serotypes and Cα-like protein family genes in 4 clinical subgroups

          a) N. inv. vs. col.        b) EOD vs. LOD             c) N. inv. vs. A. inv.     d) WCBA vs. col.
Markers   N. inv.    Col.       Sig. EOD        LOD        Sig. N. inv.    A. inv.    Sig. WCBA       Col.       Sig.
          (n = 287)  (n = 132)       (n = 182)  (n = 105)       (n = 287)  (n = 493)       (n = 100)  (n = 132)
Molecular serotypes (MS, 9 types)
Ia        71 (25)    22 (17)         46 (25)    25 (24)         71 (25)    108 (22)        31 (31)    22 (17)
Ib        33 (12)    13 (10)         19 (10)    14 (13)         33 (12)    65 (13)         15 (15)    13 (10)
II        17 (6)     13 (10)         16 (9)     1 (1)      †    17 (6)     46 (9)          8 (9)      13 (10)
III       133 (46)   65 (49)         82 (45)    51 (49)         133 (46)   119 (24)   ∗    22 (22)    65 (49)    ‡
  III-1   68 (24)    48 (36)         48 (26)    20 (19)         68 (24)    82 (17)         15 (15)    48 (36)
  III-2   50 (17)    15 (11)         24 (13)    26 (25)         50 (17)    15 (3)     †    4 (4)      15 (11)
  III-3   1 (0)      0 (0)           1 (1)      0 (0)           1 (0)      6 (1)           1 (1)      0 (0)
  III-4   14 (5)     2 (2)           9 (5)      5 (5)           14 (5)     16 (3)          2 (2)      2 (2)
IV        4 (1)      0 (0)           3 (2)      1 (1)           4 (1)      17 (3)          1 (1)      0 (0)
V         26 (9)     16 (12)         13 (7)     13 (12)         26 (9)     119 (24)   ‡    19 (9)     16 (12)
VI        2 (1)      1 (1)           2 (1)      0 (0)           2 (1)      13 (3)          3 (3)      1 (1)
VII       0 (0)      2 (2)                                      0 (0)      4 (1)           1 (1)      2 (2)
VIII      1 (0)      0 (0)           1 (1)      0 (0)           1 (0)      2 (0)
NT        0 (0)      1 (1)                                                                 0 (0)      1 (1)
Protein genetic profiles (PGP)
A (bca)   46 (16)    17 (13)         30 (17)    16 (15)         46 (16)    106 (22)        20 (20)    17 (13)
alp1      76 (27)    24 (18)         52 (29)    24 (23)         76 (27)    119 (24)        32 (32)    24 (18)
alp2      10 (4)     3 (2)           7 (4)      3 (3)           10 (4)     19 (4)          5 (5)      3 (2)
alp3      34 (12)    11 (8)          19 (10)    15 (14)         34 (12)    130 (26)   ‡    22 (22)    11 (8)     †
R (rib)   121 (42)   75 (57)         74 (41)    47 (45)         121 (42)   113 (23)   †    21 (21)    75 (57)    ‡
None      0 (0)      2 (2)                                      0 (0)      6 (1)

Note: all values shown in the table are in number (percent) of isolates. The statistical significance was determined by the Chi-square test or Fisher’s exact test (for groups with fewer than 5 isolates). Groups marked with (∗) were found statistically significant towards the left (†) or right (‡) group at α = 0.01. MS, MSST and PGP were first analysed as 9×2, 4×2, and 6×2 contingency tables respectively. The significance of subgroups was only reported if statistical non-independence was found in the overall group. Abbreviations: N. inv.: neonatal invasive isolates; A. inv.: adult invasive isolates; Col.: colonising isolates from routine gestational screening; EOD: early-onset diseases; LOD: late-onset diseases; WCBA: women at childbearing age; NT: isolates that were not typeable.

Table 3.4: Frequencies of GBS mobile genetic elements and antibiotic-resistance genes in 4 clinical subgroups

          a) N. inv. vs. col.        b) EOD vs. LOD             c) N. inv. vs. A. inv.     d) WCBA vs. col.
Markers   N. inv.    Col.       Sig. EOD        LOD        Sig. N. inv.    A. inv.    Sig. WCBA       Col.       Sig.
          (n = 287)  (n = 132)       (n = 182)  (n = 105)       (n = 287)  (n = 493)       (n = 100)  (n = 132)
Mobile genetic elements
GBSi1     76 (27)    37 (28)         43 (24)    33 (31)         76 (27)    69 (14)    †    11 (11)    37 (28)    ‡
IS1381    221 (77)   105 (80)        147 (81)   74 (71)         221 (77)   440 (89)   ‡    88 (88)    105 (80)
IS1548    72 (25)    51 (39)    ‡    52 (29)    20 (19)         72 (25)    104 (21)        20 (20)    51 (39)    ‡
IS861     161 (56)   91 (69)    ‡    98 (54)    63 (60)         161 (56)   220 (45)   †    42 (42)    91 (69)
ISSag1    279 (97)   131 (99)        177 (97)   102 (97)        279 (97)   478 (97)        99 (99)    131 (99)
ISSag2    284 (99)   131 (99)        180 (99)   104 (99)        284 (99)   479 (97)        98 (98)    131 (99)
ISSa4     13 (5)     2 (2)           9 (5)      4 (4)           13 (5)     26 (5)          4 (4)      2 (2)
Antibiotic resistance-related genes
aad-6     2 (1)      1 (1)           1 (1)      1 (1)           2 (1)      14 (3)          4 (4)      1 (1)
aph-3     2 (1)      1 (1)           1 (1)      1 (1)           2 (1)      10 (2)          2 (2)      1 (1)
ermB      8 (3)      2 (2)           5 (3)      3 (3)           8 (3)      18 (4)          3 (3)      2 (2)
ermTR     9 (3)      8 (6)           9 (5)      0 (0)           9 (3)      15 (3)          1 (1)      8 (6)
int-Tn    175 (61)   95 (72)    ‡    114 (63)   61 (58)         175 (61)   286 (58)        52 (52)    95 (72)    ‡
mef       4 (1)      1 (1)           3 (2)      1 (1)           4 (1)      15 (3)          1 (1)      1 (1)
mre       287 (100)  131 (99)        182 (100)  105 (100)       287 (100)  493 (100)       100 (100)  131 (99)
tetM      248 (86)   114 (86)        160 (88)   88 (84)         248 (86)   403 (82)        85 (86)    114 (86)
tetO      5 (2)      3 (2)           5 (3)      0 (0)           5 (2)      18 (4)          3 (3)      3 (2)

Note: all values shown in the table are in number (percent) of isolates. The statistical significance was determined by the Chi-square test or Fisher’s exact test (for groups with fewer than 5 isolates) at the significance level of α = 0.01. Groups marked with (∗) were found statistically significant towards the left (†) or right (‡) group at α = 0.01. Abbreviations: N. inv.: neonatal invasive isolates; A. inv.: adult invasive isolates; Col.: colonising isolates from routine gestational screening; EOD: early-onset diseases; LOD: late-onset diseases; WCBA: women at childbearing age.

3.2.3 Discussion

GBS molecular markers associated with virulence

In the single marker analysis, only alp3 was significantly associated with invasive disease, and serotype II was associated with early-onset invasive disease (p<0.01). MS III and rib (previously known to be associated with each other [133]) were both associated with antenatal vaginal isolates. Serotype III GBS isolates have frequently been reported to be associated with neonatal invasive disease [134]. In our comparison, the aggregate analysis suggested an inverse relationship (i.e., an association with colonising rather than invasive isolates; Table 3.2). This may be attributable to the inclusion of a relatively large number of adult invasive isolates, which include a smaller proportion of serotype III than neonatal disease isolates. Further subgroup comparison, however, revealed no significant difference between vaginal colonising and neonatal invasive isolates (Table 3.3). This result suggests that there may be only a limited association between serotype III and neonatal diseases. A similar association with the colonising group was found for the protein rib. Thus, while certain markers may be present more frequently in certain populations (for example, serotype III and protein rib in the neonatal period, and serotype V in adults [96, 97]), there is a lack of evidence for a definitive association between these molecular epidemiological markers and overall GBS invasive capability. The differences in serotype and genotype distribution may also illustrate the degree of genetic heterogeneity in infective GBS diseases. This diversity highlights the difficulty in ascertaining the relationships between specific GBS genotypes and their clinical manifestations.

[Figure 3.4 (workflow diagram): GBS isolates (n = 912) are grouped into invasive (n = 780) and non-invasive (n = 132) isolates and genotyped by mPCR/RLB; the genotype data are used to train ML models (algorithms: NB, LR, SVM, J48, MLP, IBk, ...), whose generalisation performance is evaluated by 10-fold cross-validation.]

Figure 3.4: Predictive analysis of GBS genotyping data by supervised machine learning. Both invasive and non-invasive GBS isolates were typed by using mPCR/RLB. The genotype data were then used to train machine learning models. Predictive accuracies were estimated by 10-fold cross-validation. mPCR/RLB: multiplex-PCR-based reverse line blot; ML: machine learning.

3.3 Predictive analysis by machine learning

In this section, machine learning classifiers are used to construct predictive models that distinguish isolates according to clinical outcomes, using experimental genotype data.

3.3.1 Material and methods

Machine learning algorithms

Six machine learning algorithms were selected from the Waikato Environment for Knowledge Analysis (WEKA, version 3.5.2) [135]. Logistic regression (LR) was applied with a ridge value of the log-likelihood of $10^{-8}$. The k-nearest neighbour classifier (IBk) was applied with an inverse-distance weighting function, and the number of neighbours was determined by leave-one-out cross-validation. A feed-forward multilayer perceptron (MLP) network with one hidden layer consisting of 16 nodes was trained by the backpropagation algorithm. The information-theoretic decision tree J48 was studied with the pruning confidence level set at 0.70. A support vector machine (SVM) with a second-degree polynomial kernel was trained by the sequential minimal optimisation (SMO) algorithm. Logistic models were built to allow proper estimation of posterior probabilities in trained SVM models [136]. The naïve Bayes classifier (NB) was also used. A majority class predictor (ZeroR), which always predicts the isolates as invasive, was used as a control against which to compare the performance of the above classifiers.

All markers described in Section 3.2.1 were included in classifier training and virulence prediction.
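The role of the ZeroR control can be illustrated with a minimal Python sketch of a majority class predictor evaluated under stratified 10-fold cross-validation. This is not the WEKA implementation used in this study; the helper names (`stratified_folds`, `zero_r_cv_accuracy`) and the round-robin fold assignment are assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the members of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def zero_r_cv_accuracy(labels, k=10, seed=0):
    """Cross-validated accuracy of a majority class predictor."""
    correct = 0
    for test_fold in stratified_folds(labels, k, seed):
        held_out = set(test_fold)
        training = [y for i, y in enumerate(labels) if i not in held_out]
        # "Training" only records the most frequent class in the training folds.
        majority = Counter(training).most_common(1)[0][0]
        correct += sum(labels[i] == majority for i in test_fold)
    return correct / len(labels)


# Class sizes of the overall comparison: 780 invasive, 132 colonising isolates.
labels = ["invasive"] * 780 + ["colonising"] * 132
accuracy = zero_r_cv_accuracy(labels)
print(f"ZeroR accuracy: {accuracy:.1%}")
```

With these class sizes the predictor labels every isolate as invasive, so its accuracy equals the proportion of invasive isolates (780/912); any trained classifier must exceed this figure to show an improvement in accuracy.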

Evaluation

Performance of classifiers was assessed by both classification accuracy and area under the ROC curve (AUC). AUC measures the discriminative power of a classifier. An AUC of 0.5 indicates that the classifier is no better than chance in discriminating clinical groups of isolates, while an AUC of 1 denotes perfect discrimination; an AUC greater than 0.8 is usually expected for clinical applications. Classification accuracy is defined as the proportion of cases correctly classified as invasive or colonising at the default threshold, and its standard error was estimated from the binomial distribution, where $\hat{s} = \sqrt{p(1-p)/n}$, with p the proportion of cases correctly classified and n the number of cases. AUC measures the probability of differentiation between groups from a randomly collected sample, independent of prior probabilities or test threshold [137, 138]. In this analysis, AUC was estimated non-parametrically by the trapezoidal rule and the standard error of AUC was estimated by the Hanley-McNeil method [139]. The generalisation performance of all classifiers was evaluated by stratified 10-fold cross-validation. GBS subgroups were compared in the same pairs as described in Section 3.2.1 and shown in Figure 3.2.
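The evaluation measures above can be sketched in a few lines of Python. The sketch assumes the usual forms of each estimator: the non-parametric AUC computed as the Mann-Whitney statistic (which equals the trapezoidal area under the empirical ROC curve when ties count as one half), the Hanley-McNeil standard error with Q1 = A/(2 − A) and Q2 = 2A²/(1 + A), and the binomial standard error of accuracy. The function names and the toy scores are hypothetical, and the resulting standard errors are not claimed to reproduce the intervals in Table 3.5.

```python
import math


def auc_mann_whitney(pos_scores, neg_scores):
    """Probability that a random positive case scores above a random negative case."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as one half
    return wins / (len(pos_scores) * len(neg_scores))


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUC estimate (assumed Hanley-McNeil form)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


def accuracy_se(p, n):
    """Binomial standard error of a proportion p estimated from n cases."""
    return math.sqrt(p * (1.0 - p) / n)


# Toy classifier scores (hypothetical), higher = more likely invasive.
toy_auc = auc_mann_whitney([0.9, 0.8, 0.4], [0.7, 0.3, 0.4])

# Illustrative values using the class sizes of the overall comparison.
se_auc = hanley_mcneil_se(0.67, n_pos=780, n_neg=132)
se_acc = accuracy_se(0.855, n=912)
print(f"toy AUC = {toy_auc:.3f}, SE(AUC) = {se_auc:.4f}, SE(accuracy) = {se_acc:.4f}")
```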

3.3.2 Results

In both the aggregated and subgroup predictive analyses, most classifiers were not significantly more accurate than the majority class predictor ZeroR; the exceptions were LR in group comparisons (a) and (c), SVM in (c), and NB in (c) and (d) (Table 3.5). Overall, machine learning algorithms separated invasive and colonising groups of GBS isolates with suboptimal accuracy but achieved AUCs significantly better than chance (0.5). The AUCs under 10-fold cross-validation were between 0.64 and 0.67. In subgroup analyses (Table 3.5), classifiers trained to distinguish neonatal invasive from colonising isolates produced AUCs between 0.57 and 0.60. AUCs were between 0.55 and 0.59 when comparing early-onset with late-onset isolates, between 0.58 and 0.66 when comparing adult with neonatal invasive disease isolates, and between 0.63 and 0.68 when comparing invasive strains among women at childbearing age with colonising isolates around the perinatal period.

3.3.3 Discussion

Predictive versus descriptive analysis of virulence by GBS molecular markers

This analysis applied machine learning algorithms to predict clinical outcomes from GBS genotypes. GBS genotypes have traditionally been assigned based on combinations of genetic markers (for example, the genotyping system developed by Kong et al. studied in this chapter [112]) or by studying sequence polymorphisms (e.g. MLST developed by Jones et al. [107]). Based on genotypes, virulence clusters are then assigned from a phylogenetic dendrogram generated by clustering algorithms such as the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) algorithm.

Table 3.5: Predictive analysis with machine learning classifiers: performance measured in classification accuracy and AUC with stratified 10-fold cross-validation

            Accuracy                        AUC
Classifier  Accuracy  95% C.I.   p-value∗   AUC   95% C.I.      p-value†
Overall comparison: Invasive versus colonising isolates
ZeroR       85.5%     (84.7%–)   control    0.49  (0.48–0.51)   control
LR          84.9%     (84.0%–)   0.88       0.67  (0.66–0.69)   < 0.05
IBk         84.2%     (83.4%–)   0.98       0.66  (0.64–0.68)   < 0.05
MLP         83.9%     (83.0%–)   0.99       0.65  (0.63–0.67)   < 0.05
J48         84.0%     (83.1%–)   0.99       0.65  (0.63–0.66)   < 0.05
SVM         85.6%     (84.8%–)   0.42       0.64  (0.62–0.67)   < 0.05
NB          75.3%     (74.3%–)   1          0.64  (0.63–0.66)   < 0.05
(a) Neonatal invasive versus colonising isolates
ZeroR       68.5%     (66.9%–)   control    0.49  (0.47–0.51)   control
LR          71.1%     (69.5%–)   < 0.05     0.60  (0.58–0.62)   < 0.05
IBk         69.5%     (67.8%–)   0.19       0.60  (0.58–0.62)   < 0.05
MLP         67.1%     (65.4%–)   0.90       0.59  (0.55–0.61)   < 0.05
J48         69.0%     (67.4%–)   0.33       0.58  (0.55–0.60)   < 0.05
SVM         69.9%     (68.3%–)   0.10       0.57  (0.55–0.59)   < 0.05
NB          63.3%     (61.6%–)   1          0.57  (0.55–0.59)   < 0.05
(b) Early-onset versus late-onset diseases
ZeroR       63.4%     (61.4%–)   control    0.48  (0.46–0.51)   control
LR          59.9%     (57.9%–)   0.99       0.55  (0.52–0.57)   < 0.05
IBk         63.8%     (61.7%–)   0.40       0.57  (0.55–0.60)   < 0.05
MLP         58.5%     (56.5%–)   1          0.58  (0.56–0.61)   < 0.05
J48         62.7%     (60.7%–)   0.70       0.59  (0.56–0.61)   < 0.05
SVM         65.2%     (63.2%–)   0.10       0.57  (0.54–0.59)   < 0.05
NB          60.3%     (58.2%–)   0.98       0.58  (0.56–0.61)   < 0.05
(c) Adult versus neonatal invasive isolates
ZeroR       63.2%     (62.0%–)   control    0.49  (0.47–0.52)   control
LR          66.5%     (65.3%–)   < 0.05     0.66  (0.64–0.69)   < 0.05
IBk         64.4%     (63.2%–)   0.08       0.64  (0.62–0.67)   < 0.05
MLP         64.0%     (62.7%–)   0.17       0.63  (0.61–0.66)   < 0.05
J48         63.1%     (61.8%–)   0.57       0.63  (0.61–0.66)   < 0.05
SVM         67.7%     (66.5%–)   < 0.05     0.58  (0.55–0.60)   < 0.05
NB          65.6%     (64.4%–)   < 0.05     0.65  (0.62–0.67)   < 0.05
(d) WCBA versus colonising isolates
ZeroR       56.9%     (51.5%–)   control    0.49  (0.42–0.57)   control
LR          63.4%     (58.1%–)   0.75       0.63  (0.55–0.70)   < 0.05
IBk         63.8%     (58.6%–)   0.34       0.64  (0.56–0.71)   < 0.05
MLP         62.9%     (57.7%–)   0.25       0.68  (0.61–0.75)   < 0.05
J48         65.5%     (60.4%–)   0.29       0.65  (0.58–0.72)   < 0.05
SVM         60.8%     (55.5%–)   0.79       0.64  (0.57–0.71)   < 0.05
NB          62.5%     (57.3%–)   < 0.05     0.66  (0.59–0.73)   < 0.05

∗ one-tailed t-test compared with the majority class classifier (ZeroR), df=9
† two-tailed t-test compared with chance (0.5), df=9

[Figure 3.5 (two parallel flowcharts starting from the study isolates, which first undergo isolate characterisation). Clustering: clustering → assign genotypes → match site of isolation to genotypes → label genotype clusters. Machine learning: group isolates by site of isolation → select predictive models → train models using the isolates → trained models. Isolates for prediction undergo isolate characterisation and are then either matched to a genotype cluster (clustering) or passed to the trained models (machine learning) to give the predicted outcome.]

Figure 3.5: The traditional clustering technique versus predictive classifications using supervised machine learning

There are several differences between using machine learning to classify isolates and assigning genotypes with clustering methods (Figure 3.5):

1. With supervised machine learning, GBS isolates were first grouped and assigned a class label, which allows the training of models that best separate these classes. This differs from unsupervised clustering methods, where the genotypes were first described before the corresponding phenotypes were matched onto the genotype groups.

2. Clustering is prone to model overfitting, which reduces the generalisability of genotype assignment in virulence prediction. In theory, each clinical group can be perfectly matched to at least one genotype, given infinite combinations of genetic markers for characterising a bacterial strain. However, the many-to-one (genotype-clinical group) relationships can be overly optimistic, as newly discovered strains may not have the exact genotype described in the study samples. Most modern machine learning algorithms circumvent the problem of indiscriminately adding genotype markers by early stopping (e.g. in artificial neural networks) or by pruning trained models (e.g. the J48 decision tree).

3. Clustering methods focus on maximising the descriptive power of bacterial epidemiology, whereas predictive methods focus on translating the genotypes into forecastable results, such as clinical outcomes.

GBS molecular markers are poor predictors of clinical outcome

The molecular markers chosen in this study have been previously found to discriminate different GBS clonal lineages [122]. It was anticipated that these markers could be "linked" with the virulence phenotypes, which would enable us to forecast the risks of developing invasive diseases. The predictive powers of the chosen classification algorithms, however, were poor in both aggregate (Table 3.5) and subgroup data (Table 3.5(a)–(d)). Given that similar performance was obtained across all classifiers, it seems unlikely that all classification methods would fail, by chance, to discriminate between the different groups. Our observations suggest that the overall predictive power of the individual molecular markers and combinations studied here may be weak, which may contradict the positive associations suggested by previous studies (for example, GBSi1 and serotype III). A similar study of group A streptococcus by machine learning also failed to identify significant links between genotypes and phenotypes, despite occasional previous studies reporting associations of individual genes with virulence [140]. These findings highlight the need for discovery of new genes and for a more comprehensive and potentially discriminatory collection of markers of virulence, together with host and environmental factors, to achieve improved disease risk stratification.

3.4 Potential limitations

The poor classification performance could be attributed to several factors apart from the poor discriminability of the molecular markers investigated:

Lack of suitable comparators for adult invasive diseases

To delineate the genetic characteristics of invasive GBS isolates, isolates from routine antenatal screening were collected for comparison. Differences in bacterial populations, however, could exist between isolates from non-pregnant adults and perinatal isolates. The colonising isolates from non-pregnant adults were unavailable due to practical constraints. Future studies of infections in non-pregnant adults should consider a systematic collection of colonising isolates (for example, rectal swabs) for comparison. Despite the lack of a suitable comparative group, the analytic approach presented in this chapter should reveal markers associated with the "general invasiveness" of GBS. The selection of vaginal isolates was also justified, as vaginal colonisation isolates are major sources of neonatal invasion.

Overlapping of clinical groups

It is common practice to analyse microbial virulence by dichotomising isolates by case-control assignment ("colonising" versus "invasive"), but separation between the two groups cannot be perfect using this method. It seems unlikely that samples from sterile sites such as CSF, or obtained in neonatal infections, would have been incorrectly labelled as invasive. On the other hand, a small proportion of colonising isolates from mucosal sites are clearly capable of displaying virulent behaviour, either in different circumstances or in different hosts, since these sites are the sources of invasive isolates. Consequently, there is overlap between the two classes when site of isolation is used as the class assignment rule, and it will not be possible to build completely discriminating classifiers in such a circumstance. The best that could be achieved would be to rank colonising strains as more or less likely to cause invasive disease than would be explained by their prevalence at mucosal sites. This could partially explain why serotype III and rib isolates were overrepresented among antenatal colonising isolates: serotype III and rib isolates are common among vaginal colonisers and are therefore the most common isolates to which susceptible neonates are exposed.

Other practical factors

Several practical factors could have also contributed to the poor classification performance. For example, the choice of GBS isolates was limited, such that most colonising isolates were collected from Australia and New Zealand. Clinical variables, such as obstetric risk factors associated with pregnancy for neonatal isolates or morbidities associated with adult invasive isolates (for example, diabetes mellitus), were also unavailable for analysis. Integration of additional clinical data may improve the predictive power of the models and should be considered in future investigations.

3.5 Selection of discriminatory features

Reducing the number of non-discriminative features, especially in data with high dimensionality, may improve the performance of machine learning algorithms [141].

In Section 3.3, the poor classification results could in part be attributed to poorly discriminating genetic markers. Table 3.2 shows that only a few binary markers were significantly associated with GBS invasion at α=0.01. Therefore, it can be hypothesised that the use of a discriminative subset of molecular markers (from the panel of 18 markers) may improve classification performance. Several research questions can be posed:

1. Can a reduction in the number of markers improve classification performance?

2. What is the minimum set of markers needed to achieve the best classifier performance?

3. Which markers are associated with the best classification performance?

4. What feature selection algorithm is best suited to perform this task?

3.5.1 Methods

Four feature ranking algorithms (ReliefF [142], symmetrical uncertainty [143], information gain [144], and chi-square feature selection [145]) were studied in combination with the six classifiers listed in Section 3.3 to ensure reproducibility. The ranks produced by the feature ranking algorithms were determined by a modified 10-fold cross-validation, as the standard cross-validation technique could overestimate classifier performance when the subset of features is also selected from the same data set [146].

Figure 3.6 gives the pseudocode of the procedure for obtaining the optimal number of features (N_opt,A) for each feature ranking algorithm A in combination with each classifier.

X = GBS genotype data (18 markers, 912 isolates)
y = the corresponding clinical outcome of X (invasive or colonising)

for each classifier ŷ = C(y, X) do
    for each feature ranking algorithm A do
        divide X into 10 folds
        for fold i = 1 to 10 do
            X_{i,ts} = test set of fold i (10% of data)
            X_{i,tr} = training set of fold i (the remaining 90% of data)
            apply A on (y, X_{i,tr}) to obtain feature rank R_i
            for N = 1 to 18 do
                X_{i,N,tr} = X_{i,tr}, with only the top N features ranked in R_i
                X_{i,N,ts} = X_{i,ts}, with only the top N features ranked in R_i
                train C using (y, X_{i,N,tr})
                test C on (y, X_{i,N,ts}) to obtain AUC_i(C(y, X_{i,N,ts}))
            end for
        end for
        calculate the mean AUC for each N, where
            \overline{\mathrm{AUC}}\bigl(C(X_N), A\bigr) = \frac{1}{10} \sum_{i=1}^{10} \mathrm{AUC}_i\bigl(C(y, X_{i,N,ts})\bigr)
        obtain N_{opt,A} \leftarrow \arg\max_N \overline{\mathrm{AUC}}\bigl(C(X_N), A\bigr)
    end for
end for

Figure 3.6: Pseudocode: the modified cross-validation procedure for evaluating classifier performance after feature selection
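The procedure in Figure 3.6 can be sketched in Python for a single classifier and a single feature ranking algorithm. This is a simplified illustration under stated assumptions, not the implementation used in the study: the ranker (absolute difference in class-conditional marker frequency) and the classifier (a Bernoulli naïve Bayes with Laplace smoothing) are stand-ins for the ranking algorithms and classifiers named in Section 3.5.1, the folds are interleaved rather than stratified, every test fold is assumed to contain both classes, and the data are synthetic. All function names are hypothetical.

```python
import math
import random

def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_features(X, y):
    """Stand-in ranker: order features by |P(f=1 | y=1) - P(f=1 | y=0)|."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    def gap(j):
        p1 = sum(x[j] for x, l in zip(X, y) if l == 1) / n_pos
        p0 = sum(x[j] for x, l in zip(X, y) if l == 0) / n_neg
        return abs(p1 - p0)
    return sorted(range(len(X[0])), key=gap, reverse=True)

def train_nb(X, y, feats):
    """Bernoulli naive Bayes with Laplace smoothing; returns a log-odds scorer."""
    n = {c: sum(1 for l in y if l == c) for c in (0, 1)}
    p = {c: {j: (sum(x[j] for x, l in zip(X, y) if l == c) + 1) / (n[c] + 2)
             for j in feats} for c in (0, 1)}
    def score(x):
        s = math.log(n[1] / n[0])
        for j in feats:
            s += math.log((p[1][j] if x[j] else 1 - p[1][j]) /
                          (p[0][j] if x[j] else 1 - p[0][j]))
        return s
    return score

def optimal_n(X, y, k=10):
    """Mean test AUC for each N, with ranking redone inside every training fold."""
    n_feat = len(X[0])
    folds = [list(range(i, len(X), k)) for i in range(k)]
    total = [0.0] * (n_feat + 1)
    for test_idx in folds:
        held = set(test_idx)
        tr = [i for i in range(len(X)) if i not in held]
        Xtr, ytr = [X[i] for i in tr], [y[i] for i in tr]
        Xts, yts = [X[i] for i in test_idx], [y[i] for i in test_idx]
        rank = rank_features(Xtr, ytr)  # ranking sees the training fold only
        for N in range(1, n_feat + 1):
            model = train_nb(Xtr, ytr, rank[:N])
            total[N] += auc([model(x) for x in Xts], yts) / k
    best = max(range(1, n_feat + 1), key=lambda N: total[N])
    return best, total[1:]

# Synthetic data: marker 0 agrees with the outcome about 80% of the time, the rest are noise
rng = random.Random(0)
y = [(i // 10) % 2 for i in range(200)]
X = [[(label if rng.random() > 0.2 else 1 - label)] + [rng.randint(0, 1) for _ in range(5)]
     for label in y]
best_n, mean_aucs = optimal_n(X, y)
```

The key design point is that `rank_features` is called inside the fold loop, so the held-out isolates never influence which markers are selected.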

3.5.2 Results

Classification performance

Classifier performance after feature selection is shown in Figure 3.7.

Compared with AUCs without feature selection, no statistically significant improvement was achieved at p<0.05 (Table 3.6). The feature ranking algorithms achieved the best AUC with between 1 and 11 markers (overall median = 2.5, Table 3.6). The best AUCs achieved were between 0.62 and 0.70 across the 6 classifiers after feature selection. The naïve Bayes classifier achieved the best overall AUC (0.70) when combined with the ReliefF algorithm.

Selected markers

Most feature ranking algorithms identified MS and PGP as the top molecular markers for classification. Within the top third of the ranking of the 18 markers, the markers int-Tn (ranked by 4 algorithms), IS861 (ranked by 3 algorithms), IS1548 (ranked by 3 algorithms), and GBSi1 (ranked by 3 algorithms) were consistently placed at the top of the list (Table 3.7).

Table 3.6: Number of features (N_opt,A) required to achieve the best AUC for each classifier.

             Feature ranking algorithm
Classifier∗  ReliefF  Sym.Unc.  Inf.Gain  ChiSq.  Median  Best AUC  p-value†
NB           7        4         3         3       3.5     0.70      0.09
IBk          2        1         11        5       3.5     0.64      0.64
J48          3        3         3         3       3       0.62      0.25
MLP          3        1         2         2       2       0.64      0.69
LR           5        1         2         2       2       0.66      0.63
SVM          2        1         1         1       1       0.65      0.79

∗ Abbreviations: machine learning algorithms: NB: naïve Bayes classifier; IBk: k-nearest neighbour classifier; J48: J48 decision tree; MLP: multilayer perceptron; LR: logistic regression; SVM: support vector machine with linear kernel. The parameters of the classifiers are identical to those described in Section 3.3.1. Feature selection algorithms: ReliefF: ReliefF algorithm; Sym.Unc.: symmetric uncertainty algorithm; Inf.Gain: information gain feature selection algorithm; ChiSq.: chi-square feature selection algorithm.
† The null hypothesis was defined as no difference from the AUC produced by the same algorithm without feature selection (Table 3.5). A paired t-test with df = 18 was used to compare the differences in AUCs. Standard error was estimated by the Hanley-McNeil method [139].

Table 3.7: Top one-third of the GBS markers ranked by feature ranking algorithms

        Feature ranking algorithm
Rank∗   ReliefF  Sym.Unc.  Inf.Gain  ChiSq.
1       MS       PGP       PGP       PGP
2       PGP      IS861     MS        MS
3       int-Tn   MS        IS861     IS861
4       tetM     IS1548    IS1548    IS1548
5       IS1381   mre       int-Tn    int-Tn
6       GBSi1    int-Tn    GBSi1     GBSi1

∗ The ranks were determined by 10-fold cross-validation. Feature selection algorithms: ReliefF: ReliefF algorithm; Sym.Unc.: symmetric uncertainty algorithm; Inf.Gain: information gain feature selection algorithm; ChiSq.: chi-square feature selection algorithm.
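The Hanley-McNeil standard error mentioned in the footnote to Table 3.6 can be computed directly from an AUC estimate and the two class sizes. The sketch below implements the standard Hanley-McNeil formula; the function name is hypothetical and the class sizes in the usage line are invented for illustration, not taken from the study.

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUC estimate by the Hanley-McNeil formula."""
    q1 = auc / (2 - auc)               # P(two random positives both outrank one negative)
    q2 = 2 * auc ** 2 / (1 + auc)      # P(one random positive outranks two negatives)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(var)

# Invented class sizes, for illustration only
se = hanley_mcneil_se(0.70, n_pos=100, n_neg=100)
```

The standard error shrinks as either class grows, which is why AUC comparisons on small subgroups carry wide confidence intervals.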

3.5.3 Discussion

The results after feature selection suggest that reducing the number of markers did not improve classification performance. In addition, the median number of optimal features (N_opt,A) was between 1 and 3.5, and the top-ranked markers included MS and PGP, indicating that only these markers are important predictors of GBS virulence. This finding also suggests that most of the remaining markers have weak predictive power in virulence classification.

3.6 Evolutionary considerations

So far we have investigated the prediction of GBS clinical outcomes by bacterial genotypes using both statistical and machine learning methods. It was found that although different GBS strains could be distinguished with genotyping systems, no markers or their combinations yielded sufficient power to predict clinical outcomes robustly. Apart from the limitations discussed in Section 3.4, several evolutionary aspects of bacterial genetics may have roles in affecting the predictability of the molecular markers.

[Figure 3.7: six panels (NB, IBk, J48, MLP, LR, SMO), each plotting AUC (0.4 to 0.7) against the number of features N (0 to 20), with one series per feature ranking algorithm (ReliefF, SymmetricalUncert, InfoGain, ChiSquared).]

Figure 3.7: Performance of machine learning classifiers (AUC) versus number of features as selected by feature ranking algorithms. Each point indicates an AUC obtained from one run of 10-fold cross-validation by a given classifier trained with the top-N attributes. Abbreviations: machine learning algorithms: NB: naïve Bayes; J48: J48 decision tree; LR: logistic regression; IBk: k-nearest neighbour algorithm; SMO: support vector machine (linear kernel); MLP: multilayer perceptron. Feature ranking algorithms: ReliefF: ReliefF algorithm; SymmetricalUncert: symmetrical uncertainty algorithm; InfoGain: information gain algorithm; ChiSquared: chi-squared algorithm.

[Figure 3.8: two diagrams, each showing bacterial types A, B and C and the types associated with invasive disease before and after the evolutionary event.]

(a) Horizontal gene transfers (HGT). Assume the virulence trait is contributed by the true virulence genes (marked in the diagram); HGT of the virulence gene from B to C invalidates the ability to track invasive behaviour by the markers originally capable of distinguishing subpopulations of type A, B, and C.

(b) De novo mutations under positive selection pressure. In this example, a new virulence gene arising from type C degrades the predictive power of the typing system over the bacterial generations.

Figure 3.8: Evolutionary mechanisms for non-cosegregation of markers with the virulence traits. As illustrated by the diagrams, there are two potential scenarios that may cause disruption of linkage between pathogen subtypes and virulence.

3.6.1 The clonal assumption of genotyping is defied by mechanisms of horizontal gene transfer

Bacterial typing fundamentally assumes a perfect clonal relationship between a bacterial type and its precursors. Thus, an ideal scenario would be that virulence properties are genetically linked with molecular markers (Figure 3.8(a)), and the cosegregation of virulence mechanisms with the markers would enable tracking of the invasive phenotype. Realistically, however, bacteria readily share their genes by mechanisms of horizontal gene transfer (HGT), which would allow the spread of true virulence factors without the cosegregation of molecular markers. The weak linkage of marker and true- may cause a predictive marker to gradually lose its predictive power over generations.

3.6.2 Positive selection of virulence genes

Positive selection pressure applied to true virulence genes could also be a contributing mechanism to the poor predictive power. Bacterial pathogens constantly undergo selection pressures to survive the hostile environment created by the host immune system and antimicrobials. Fitter virulence genes that confer survival advantage are rapidly positively selected (Figure 3.8(b)). The evolutionary rates for virulence genes are faster than those of the rest of the genome [49]. Thus, typing by labelling a "hypervirulent clone" with markers that are more stable than the virulence genes may disrupt the linkage between the virulence trait and the markers selected for typing, resulting in a gradual decay of predictive power. In support of this hypothesis, there was evidence that GBS virulence traits could segregate independently of commonly used markers including serotypes and MLST, a phenomenon that may be explained by the process of positive selection [126].

3.6.3 Virulence gene typing

To aid in effective selection of molecular markers to predict virulence, two empirical criteria are suggested:

1. An ideal molecular marker should be closely linked with true virulence genes.

2. An ideal molecular marker should have a rate of evolution faster than or equal to that of virulence genes, such that

\[
\mathrm{rate}_{\text{clone}} \ll \mathrm{rate}_{\text{virulence gene}} \leq \mathrm{rate}_{\text{marker}} \ll \mathrm{rate}_{\text{strain}} \tag{3.1}
\]

Intuitively, the perfect typing method would be whole-genome sequencing (perfect linkage rate and $\mathrm{rate}_{\text{marker}} = \mathrm{rate}_{\text{strain}}$). Nevertheless, genome sequencing for individual pathogens is impractical, as the associated cost and time do not match the immediacy required in clinical decision making. A practical alternative is to perform genetic characterisation on true virulence genes, or virulence gene typing, which would yield a perfect linkage with $\mathrm{rate}_{\text{virulence gene}} = \mathrm{rate}_{\text{marker}}$. One challenge in identifying true virulence genes, as stated in Chapter 1, is the time and resource constraints associated with experimental failures from incorrect selection of candidate genes. In GBS, several virulence genes have been experimentally verified (Chapter 7). However, it is likely that our knowledge of virulence genes is non-exhaustive and more virulence genes are yet to be discovered. To assist with the functional discovery of genes that may help to explain GBS virulence, the later part of this thesis will develop an in silico candidate gene prioritisation (CGP) method by adapting comparative genome analysis and current CGP methods (which have been studied in the discovery of human genetic diseases) to bacteriology.

3.7 Chapter summary

This chapter examined the predictive relationship between GBS virulence and the eighteen molecular markers. The commonly attributed markers of virulence, such as molecular serotypes and protein genetic profiles, were analysed by both machine learning and statistical methods. Some markers were associated with invasive diseases (alp3) and antenatal colonising isolates (rib, GBSi1, IS1548, IS861, and int-Tn) respectively. Other markers were also associated with important subgroups.

The molecular epidemiology markers studied in this chapter were previously found to be discriminative in differentiating GBS strains and defining important GBS clusters (for example, the association of III-2 and MLST ST-17 with neonatal invasive diseases). Based on machine learning analysis, however, these markers alone were inadequate in predicting clinical outcomes. Feature selection algorithms did not further improve classification performance. Although several limitations such as overlapping groups and other practical constraints could have affected the results, better classification results would be expected if the molecular markers were more discriminative. Horizontal gene transfers and positive selective pressure on virulence genes are possible mechanisms that could have also contributed to the disruption of linkage between the markers and the virulence factors. Thus, characterisation of virulence genes, or virulence gene typing, is postulated to be the key to achieving better discriminative power in virulence prediction. Developing effective methods of virulence gene discovery is therefore an important step in achieving this goal.

Part II

Co-occurrence-based candidate gene prioritisation

Chapter 4

Candidate gene prioritisation

4.1 Introduction

Challenges in identifying bacterial virulence genes

The importance of virulence gene identification for predicting infectious disease outcomes was demonstrated in Chapter 3. In the discovery of virulence genes in pathogenic bacteria, Raskin et al. observed a trend toward increasing use of high-throughput genomic technologies with bioinformatic analysis compared with the traditional gene screening methods [44]. However, biological experiments to discover virulence genes are resource intensive and time consuming. Thus, improving pre-experimental selection of candidate genes with computational methods could reduce the number of cycles of negative trials and thus increase the likelihood of gene discovery.

Genomic screening for positively-selected genes as virulence candidates

A computational comparative genomic approach has been described by Chen et al., where a uropathogenic Escherichia coli genome (E. coli UTI89) was compared with non-uropathogenic E. coli genomes to identify the positively selected genes using phylogenetic analysis by maximum likelihood [49]. These genes were subsequently validated in 50 clinical isolates, where significant variations in several genes (fepE, ompC, and amiA) were found. However, the approach proposed by Chen et al. requires multiple genomes of the same species with phenotypic variations. Therefore, the method is not directly applicable to our task of searching for GBS virulence genes, as there is only a limited subset of phenotypic variations in the current database of published genomes (2603 V/R, A909 and NEM316 were all invasive isolates). Alternative methods of in silico comparative genomic analysis are thus needed for this task of virulence gene discovery.

Methods of in silico candidate gene prioritisation (CGP) can assist with discovering genes responsible for inherited diseases in humans

Several bioinformatic methods for prioritising candidate genes have been described in the search for human disease genes. Selecting appropriate genes for biological validation remains a key constraint in gene function discovery, given that there are more than 20,000 genes in the human genome [147]. CGP covers a wide variety of methods that automate the initial gene selection task for researchers. In this chapter, methods for computational prioritisation of candidate genes are reviewed and practical constraints in their application are discussed. The purpose of this review is to explore the possibility of adopting the concept of CGP, and then to construct such a system for discovering bacterial gene functions and genes associated with virulence. Conceptual analogies between CGP and information retrieval (IR) methods will be drawn. Currently, there are no similar gene-ranking tools available in the field of microbiology.

[Figure 4.1: workflow diagram with the stages Candidate genes, Database query, Data sources (DS1 ... DSn), Feature processing, Features (f1 ... fn), Feature integration (Σ), and Prioritised ranks.]

Figure 4.1: The workflow of in silico candidate gene prioritisation. For each candidate gene to be ranked, databases are queried to retrieve information associated with the genes. Such information is then transformed into various gene features, which are quantities suggestive of the degree of relevance to the function of interest. Subsequently, individual gene features are integrated (by feature integration algorithms) to produce an overall rank. DS: data source; f1, f2, ..., fn: gene features.

4.1.1 Overview of the in silico CGP process

Experimental biologists frequently encounter the problem of having a large set of candidate genes needing functional validation. Computational CGP methods aim to automate this gene selection process by ranking the list of candidate genes by a relevance score. The process for gene prioritisation begins with the user providing a list of candidate genes, together with certain search criteria (for example, disease names, keywords, numerical criteria, or sequence features) or a list of genes with known relationship to the disease of interest (the training genes). The prioritiser then retrieves information related to each candidate gene from different data sources and derives gene features associated with each candidate gene (see Section 4.4). A feature integration algorithm then aggregates scores from individual features of each gene to derive a final relevance score for the gene, which is used to assign each candidate gene a rank. An ideal prioritiser would place relevant candidate genes (genes that are relevant to the disease of interest) higher on the candidate list. The workflow of CGP is illustrated in Figure 4.1.

A recent case of using the in silico inductive gene prioritiser ENDEAVOUR to rank disease genes was described by Aerts et al. The authors performed CGP to identify the potential role of the YPEL1 gene in DiGeorge syndrome, a rare congenital condition caused by a deletion in human chromosome 22. Features derived from 11 different data sources and four different ranks related to each of the disease characteristics were used. In their experiments, the YPEL1 gene was ranked within the top 3 of the 58 candidate genes in the 2-Mb deletion region on the chromosome. The role of YPEL1 was subsequently confirmed with a gene-knockdown zebrafish model [148].

Box 4.1: A case study of candidate gene prioritisation

4.2 Types of CGP

A computational gene prioritiser can be viewed as a specialised IR system (Table 4.1). A fundamental assumption in IR posits that documents relevant to a given query share similar attributes such as occurrence of keywords (cluster hypothesis) [149]. By analogy, the process of gene prioritisation attempts to retrieve the most relevant genes from a gene collection (analogous to finding the most relevant documents in a document corpus). A corresponding "gene cluster" hypothesis holds for identifying genes related to inherited diseases, such that the disease-related genes tend to be well conserved and have important biological roles [150]. Exploiting these conservation properties, candidate genes with similar characteristics could be identified through two well-understood inference mechanisms: characteristics-based or inductive prioritisation.

Table 4.1: Comparison of CGP and IR systems

System feature            Candidate gene prioritiser          Information retrieval system
Search space              Candidate genes                     Document corpus
Primary task              Prioritisation                      Best-match retrieval
Target object             Disease-relevant genes              Document of interest
Input (search criteria)
  Characteristics-based   Gene features of interest           ad hoc queries
                          Name of disease-related genes       Search terms
  Inductive               Training genes                      Documents (training sets in document classification and clustering)
Output                    Prioritised gene list               Search results sorted by relevance
Relevance model           Conservation hypothesis             Clustering hypothesis
Data sources
  Primary                 Crude DNA or protein sequences      Document text and structure data
  Secondary               Biomedical literature terms         Metadata (for example, tags of image or audio multimedia documents)
                          Gene annotations
Example of features
  Co-occurrence           Term co-existence in abstracts      Query term frequencies in a document
                          Proximity of genes in genome        Frequency of term co-occurrence
                          Gene co-expression in tissues       Semantic relations
                          Protein-protein interactions
  Similarity              Sequence similarity                 Similarity between documents

4.2.1 Characteristics-based prioritisation – ranking candidate genes by ad hoc criteria

A characteristics-based prioritiser ranks candidate genes based on a set of user-defined gene features. Such features are usually positively correlated with both the disease of interest and the candidate genes. For example, prioritisers may use vocabulary or literature data, as the co-existence of disease and gene vocabulary in the same biomedical document (e.g. the abstract of a paper) may signify that the gene is associated with the disease of interest [151]. The traditional reciprocal BLAST search for orthology can also be viewed as characteristics-based prioritisation, as genes sharing similar sequences (i.e. lower E-values with higher identity) are generally assumed to have a higher likelihood of sharing similar biological functions [152]. Such a sequence-function relationship forms the basis by which BLAST extracts relevant genes from a list of candidates.

[Figure 4.2: diagram in which candidate genes, each described by feature values (A, B, C, D, ...), are passed through a scoring function ϕ(A, B, C, D, ...) to produce a prioritised list ranked #1 to #9 by score.]

Figure 4.2: In characteristics-based prioritisation, scores from different gene features are integrated by using an ab initio scoring function.
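Characteristics-based prioritisation as described above can be illustrated with a small sketch. It assumes, purely for illustration, that the scoring function is a user-weighted sum of feature scores; the gene names, feature names, scores and weights are all hypothetical and do not come from any of the prioritisers cited.

```python
def prioritise(candidates, weights):
    """Rank genes by a weighted sum of their feature scores (highest first)."""
    def phi(features):
        # Assumed scoring function: missing features contribute zero
        return sum(weights[name] * features.get(name, 0.0) for name in weights)
    ranked = sorted(candidates.items(), key=lambda kv: phi(kv[1]), reverse=True)
    return [(gene, round(phi(f), 3)) for gene, f in ranked]

# Hypothetical candidate genes with two feature scores each
candidates = {
    "geneA": {"literature_cooccurrence": 1.0, "blast_identity": 0.2},
    "geneB": {"literature_cooccurrence": 0.0, "blast_identity": 0.9},
    "geneC": {"literature_cooccurrence": 1.0, "blast_identity": 0.8},
}
weights = {"literature_cooccurrence": 0.5, "blast_identity": 0.5}
ranking = prioritise(candidates, weights)
```

Because the weights are supplied by the user rather than learned, the quality of the ranking depends entirely on how well the chosen criteria reflect the function of interest.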

4.2.2 Inductive prioritisation – finding genes with similar features

An inductive prioritiser builds inference models using genes known to be associated with the function or disease of interest (training genes). Inductive prioritisers assume that functionally similar genes should share similar biological, ontological, or literature features. Inductive CGP is useful in recovering important genes that would otherwise be neglected. Several recently described gene prioritisers, including DGP [153], ENDEAVOUR [148], and PROSPECTR [154], all use inductive models as the method of inference for in silico disease gene discovery.

[Figure 4.3: two-row diagram. Top row: known genes, each described by feature values (A, B, C, D, ...), are used to train a machine learning model M(A, B, C, D, ...). Bottom row: candidate genes are scored by the trained model M∗(A, B, C, D, ...) to produce a prioritised list ranked #1 to #9 by score.]

Figure 4.3: In inductive prioritisation, the scoring function is replaced by a machine learning model, which is trained by using a list of genes known to be associated with the function of interest (training genes). The trained model is then used to predict genes sharing similar characteristics with the training genes.
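The inductive approach can be sketched with a deliberately simple model. The example below uses a nearest-centroid scorer as a stand-in for the machine learning model; it is not the method used by DGP, ENDEAVOUR or PROSPECTR. It also assumes that both relevant and irrelevant training genes are available, and all gene names and feature vectors are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def train(known_relevant, known_irrelevant):
    """Fit a nearest-centroid model; returns a scoring function for candidates."""
    pos, neg = centroid(known_relevant), centroid(known_irrelevant)
    # Higher score = closer to the relevant training genes than to the irrelevant ones
    return lambda v: dist(v, neg) - dist(v, pos)

def rank_candidates(model, candidates):
    """Order candidate gene names by model score, most relevant first."""
    return sorted(candidates, key=lambda g: model(candidates[g]), reverse=True)

# Hypothetical training genes and candidates, each with two feature scores
model = train(known_relevant=[[1, 0.9], [1, 0.8]],
              known_irrelevant=[[0, 0.1], [0, 0.2]])
candidates = {"g1": [1, 1.0], "g2": [0, 0.0], "g3": [1, 0.3]}
ranked = rank_candidates(model, candidates)
```

Unlike a characteristics-based scorer, nothing here is specified by the user except the training genes; the notion of relevance is induced from their features.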

4.3 Data sources

4.3.1 Primary data source – raw gene or protein sequences

Gene prioritisers use many sources of knowledge to assess how relevant a gene could be. Primary data sources provide raw nucleotide or polypeptide sequences of a gene. Gene features derived from the primary data sources may include gene length, untranslated terminal regions, intergenic distances, number of exons, or GC content [154]. Gene expression profiles from expressed sequence tag (EST) databases (for example, dbEST¹), microarray expression data, and transcription factor databases (for example, TRANSFAC²) have also been employed [148].
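Two of the primary-source features named above, gene length and GC content, can be derived directly from a nucleotide sequence. The following is a minimal sketch with a hypothetical function name; the example sequence is arbitrary and not taken from any genome discussed in this thesis.

```python
def sequence_features(seq):
    """Return gene length and GC content (fraction of G or C bases) for a DNA sequence."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    return {"length": len(seq), "gc_content": gc / len(seq) if seq else 0.0}

features = sequence_features("ATGCGC")  # arbitrary example sequence
```

Such values would then serve as numerical feature scores for each candidate gene.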

4.3.2 Secondary data sources – meta-knowledge about a gene

The use of external knowledge associated with a gene, or secondary data sources, provides information that is otherwise not available in primary data sources. Secondary data can be viewed as "meta-data" of a gene (by analogy with the meta-data of a document or a multimedia file). The use of biomedical literature data (for example, MEDLINE abstracts³), ontological relations (for example, Gene Ontology⁴), gene-gene interactions (for example, Biomolecular Interaction Network Database⁵), functional databases (for example, Kyoto Encyclopaedia of Genes and Genomes⁶), or gene homology (for example, BLAST databases⁷) all belong to this category. Intuitively, secondary data sources reflect the state of knowledge about a candidate gene and its relevance to the disease of interest.

4.3.3 Differences between primary and secondary data sources

Primary data sources are generally well determined, but the gene-disease relationships are less obvious compared with those found in secondary data sources. For instance, it is difficult to predict gene function by examining raw polypeptide sequences or gene expression levels. In contrast, secondary data sources provide links with other biological entities and thus present a stronger gene-disease relationship compared with crude sequence data. However, missing values are frequently found in secondary data sources, representing "gaps" in our knowledge. These missing values may introduce potential biases, which are discussed in Section 4.7.1.

¹ dbEST: http://www.ncbi.nlm.nih.gov/dbEST/
² TRANSFAC: http://www.gene-regulation.com/pub/databases.html
³ PubMed: http://www.ncbi.nlm.nih.gov/pubmed/
⁴ Gene Ontology: http://www.geneontology.org/
⁵ BIND: http://www.bind.ca/
⁶ KEGG: http://www.genome.jp/kegg/
⁷ BLAST: http://www.ncbi.nlm.nih.gov/BLAST/

4.4 Gene features

Raw data from primary or secondary data sources need to be refined into meaningful entities, or gene features, suitable for the prioritisation task. A gene feature is a gene-specific characteristic that correlates with a phenotype of interest. To facilitate the ranking process, the strengths of gene features are usually expressed as Boolean or numerical values (feature scores). A characteristics-based prioritiser combines the feature scores with user-specified criteria, similar to the “search terms” or “queries” used for database searches in an IR system. In an inductive prioritiser, the features form the “attributes” on which the inference models are trained.

Gene features can be broadly classified into two categories: co-occurrence-based and similarity-based.

(a) Co-occurrence: similar to gene 1a, gene 1b may also be a contributing factor to disease A because the gene name “gene 1b” co-occurs with the keyword “disease A” in a biomedical article.

(b) Similarity: genes sharing similar features (for example, nucleotide sequences) are likely to have contributing roles to similar traits.

Figure 4.4: Co-occurrence and similarity are common relevance measures of a gene to a function of interest.

4.4.1 Co-occurrence

Co-occurrence of vocabularies in a biomedical text may suggest functional association between gene and disease

Co-occurrence is a frequently used concept in information retrieval tasks. In CGP, the co-occurrence of gene and disease names in a biomedical text may suggest a possible association between the two, as both vocabularies need to coexist to describe a causal relationship within a single document. The degree of co-occurrence of vocabularies has been quantified by measuring frequencies or fuzzy membership in previous CGP studies [151, 155, 156].

Co-occurrence of biological entities within a higher structure may suggest a functional association

The concept of co-occurrence can be extended to entities other than biomedical texts, such as gene ontologies or metabolic pathways. In these cases, genes sharing the same ontology terms or metabolic pathway could be postulated to have similar functions or similar pathogenic potential. In addition, interactions between genes or gene products can also be viewed as co-occurrences. For example, the protein-protein interaction database BIND may assist the discovery of genes participating in the same functional unit [157]. Expression databases such as dbEST have also been used to identify genes sharing similar co-expression patterns in the same tissue, as these are more likely to have similar functional roles. Finally, genes located close to a known disease gene or genomic region may be likely candidates, as they may lie in a gene cluster and are likely to be expressed together in vivo [148, 155].

4.4.2 Similarity

Similarity between sequences may suggest similar function

The degree of similarity between the features of a candidate gene and a known gene may suggest functional similarity. For example, sequence comparison algorithms may determine gene similarity. Figure 4.4(b) illustrates how similarity measures (e.g. identity, E-value of BLAST) may suggest that a candidate gene contributes to the disease of interest through homology. In this example, gene 2a is known to be associated with disease B in a mouse model. Another gene (gene 2b), whose correlation with the disease is unknown, shares a similar sequence with gene 2a. Thus, it is reasonable to postulate that a defect in gene 2b may also contribute to disease B in human.

Similar biomedical texts referring to a target disease name may suggest likely gene candidates

Similarity measures can also be derived from the content of biomedical texts or ontology terms. The concept of similarity may be represented by a distance or similarity function between two text vectors, such as Euclidean distance or cosine similarity (for cosine similarity, more similar texts have higher scores), to indicate how closely a document with a known disease-gene relationship resembles another document [148, 155].
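As an illustration of text-based similarity, the following minimal sketch computes the cosine similarity between two bag-of-words count vectors. The whitespace tokenisation, the function name `cosine_similarity`, and the example sentences are assumptions made for illustration; they are not taken from the prioritisers cited above.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the terms the two texts share.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical texts: one with a known gene-disease link, two candidates.
known = "gene 2a causes disease b in mouse"
candidate = "gene 2b linked to disease b in human"
unrelated = "weather report for sydney"

sim_candidate = cosine_similarity(known, candidate)
sim_unrelated = cosine_similarity(known, unrelated)
```

Here `sim_candidate` is larger than `sim_unrelated`, so the candidate text would be ranked as closer to the text with the known disease-gene relationship.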

4.5 Feature integration

Using a single feature to rank a list of candidate genes may be insufficient to discriminate relevant genes, and multiple features are usually required. The process of feature integration aggregates individual features into a single score that generates an overall gene rank.

In general, the relevance score function can be written as:

$$\text{relevance} \equiv \varphi(g) = \phi\big(\mathbf{F}(g)\big) = \phi\big(f_1(g), f_2(g), \ldots, f_n(g)\big)$$

where $\varphi(g)$ is the scoring function, $\phi(\mathbf{F}(g))$ is the feature integration function, and the feature vector $\mathbf{F}(g) = \langle f_1(g), f_2(g), \ldots, f_n(g) \rangle$ consists of individual features related to the candidate gene $g$.

4.5.1 Simplest methods of integration — ad hoc sorting and filtering

The simplest method for determining the overall relevance score of a gene is to sort on a single feature. The relevance score is simply:

$$\varphi(g) = \phi\big(\mathbf{F}(g)\big) = f_i(g)$$

where $i \in \{1, \ldots, n\}$ is specified by the user.

One implementation of the ad hoc ranking method is the UCSC gene sorter⁸, where candidate genes may be ranked by gene proximity, expression profiles, BLAST E-value, and other features relevant to the candidate genes [158]. Although ad hoc ranking is simple and intuitive, more complex gene-function relationships may not be captured.

Filtering by user-specified criteria is another simple method for processing candidate gene lists. The user can specify a set of Boolean criteria $Q = (q_1, q_2, \ldots, q_n)$ and apply them to the feature vector such that

$$\varphi(g) = \begin{cases} 1 & \text{if } f_i(g) \in q_i \ \ \forall i \\ 0 \ (\text{excluded}) & \text{otherwise} \end{cases}$$

⁸ UCSC gene sorter: http://genome.brc.mcw.edu/cgi-bin/hgNear

As is well known in information retrieval systems, Boolean retrieval is a method of exact matching. The practical application of filtering may be limited because the concept of relevance cannot be modelled with precision [159]. Nevertheless, filtering may be important in characteristics-based prioritisation, where it can be used in conjunction with other feature integration methods. For example, it is useful for limiting a lengthy gene list by certain criteria (for example, restricting the output to a GO group) or for post-processing the results [156].
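The Boolean filter described above can be sketched as follows. This is a minimal illustration only: the gene records, the field names (`go_group`, `evalue`), and the two criteria are hypothetical and do not come from any of the cited systems.

```python
from typing import Callable, Dict, List

Gene = Dict[str, object]
Criterion = Callable[[Gene], bool]


def boolean_filter(genes: List[Gene], criteria: List[Criterion]) -> List[Gene]:
    """Keep only the genes that satisfy every criterion (score 1); drop the rest (score 0)."""
    return [g for g in genes if all(q(g) for q in criteria)]


# Hypothetical candidate genes with two features each.
genes = [
    {"name": "geneA", "go_group": "cell wall", "evalue": 1e-20},
    {"name": "geneB", "go_group": "transport", "evalue": 1e-30},
    {"name": "geneC", "go_group": "cell wall", "evalue": 1e-2},
]

# Hypothetical criteria: restrict to one GO group and to strong BLAST hits.
criteria = [
    lambda g: g["go_group"] == "cell wall",
    lambda g: g["evalue"] <= 1e-5,
]

kept = [g["name"] for g in boolean_filter(genes, criteria)]
```

Because matching is exact, a gene that narrowly misses one criterion (such as `geneC`) is excluded outright, which is the limitation of Boolean retrieval noted above.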

4.5.2 Vector-space model

Vector-space models (VSM) are used extensively in IR, as various similarity functions can be applied to measure the distance between two document vectors. Fuhr [159] described the retrieval of a document $d$ based on query $q$, using relevant terms $t$, under a maximum entropy condition as:

$$\text{relevance} \equiv P(d \rightarrow q) = \sum_{t} P(d \rightarrow t)\,P(t \rightarrow q) = \mathbf{d} \cdot \mathbf{q}$$

where $\mathbf{d}$ is the “document indexing” vector (sensitivity of each term to each document) and $\mathbf{q}$ is the vector describing individual term weights.

Similarly, an analogy can be drawn with characteristics-based prioritisation by constructing a VSM vector $\mathbf{F}$ such that:

$$\text{relevance} \equiv \text{gene } g \rightarrow \text{features } \mathbf{F}$$

and thus

$$\phi\big(\mathbf{F}(g)\big) = \sum_{i=1}^{n} w_i f_i(g) = \mathbf{w} \cdot \mathbf{F}$$

The weight vector is determined by the user in characteristics-based prioritisers. Commonly used integrating functions, such as the sum and mean of individual feature scores, can be viewed as special cases of VSM where $\mathbf{w} = (1, 1, \cdots, 1)$ and $\mathbf{w} = (\frac{1}{n}, \frac{1}{n}, \cdots, \frac{1}{n})$ respectively [155, 160].

Figure 4.5: Feature integration with a generalised statistical model (axes: score against the probability distribution of scores; annotations: top-ranked genes, prioritised rank).
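The weighted-sum integration of Section 4.5.2 can be sketched in a few lines. The function name `vsm_score` and the feature values below are illustrative assumptions; the sum and mean appear as the two special weight vectors mentioned in the text.

```python
from typing import Sequence


def vsm_score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of feature scores, i.e. the dot product w . F."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * f for w, f in zip(weights, features))


# Hypothetical feature scores f_1..f_3 for one candidate gene.
features = [0.9, 0.2, 0.4]
n = len(features)

sum_score = vsm_score(features, [1.0] * n)        # w = (1, 1, ..., 1)
mean_score = vsm_score(features, [1.0 / n] * n)   # w = (1/n, ..., 1/n)
custom_score = vsm_score(features, [0.5, 0.3, 0.2])  # user-chosen weights
```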

4.5.3 Statistical models

As discussed in Section 4.4, the relevance of a gene is positively correlated with its feature scores (more relevant genes have higher relevance scores). From a probabilistic point of view, the top-ranked genes have feature scores above a certain threshold in the distribution of all scores (Figure 4.5). By assuming that feature scores are independently and identically distributed for each candidate gene to be ranked, the overall probability distribution of the individual feature ranks can be viewed as the binomial transformation of the respective probability distribution. In particular, the probability of obtaining a score above an arbitrary threshold $\tau$ can be described using the first order statistic such that

$$\begin{aligned} \text{relevance} &\equiv P\big(\mathbf{F}(g) \geq \tau\big) \\ &= S_{(1)}(\tau) \\ &= 1 - P\big(f_i(g) < \tau_i,\ \forall i = 1 \ldots n\big) \\ &= 1 - \prod_{i=1}^{n} \big(1 - S(\tau_i)\big) \end{aligned} \tag{4.1}$$

where $n$ is the number of features and $S(x_i)$ is the corresponding cumulative probability distribution of the feature scores. The POCUS prioritiser uses the probability from the above formula to normalise the probability derived from each feature score (shared terms in functional annotations) [160].
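The last line of Equation 4.1 is simple to evaluate once the per-feature values $S(\tau_i)$ are available. The sketch below assumes those values are supplied by the caller and that features are independent, as stated in the text; the function name `prob_any_above` and the example values are hypothetical.

```python
import math
from typing import Sequence


def prob_any_above(s_values: Sequence[float]) -> float:
    """Evaluate 1 - prod_i (1 - S(tau_i)) for the supplied per-feature values S(tau_i)."""
    for s in s_values:
        if not 0.0 <= s <= 1.0:
            raise ValueError("each S(tau_i) must be a probability in [0, 1]")
    return 1.0 - math.prod(1.0 - s for s in s_values)


# Hypothetical values of S(tau_i) for two features.
relevance = prob_any_above([0.1, 0.2])
```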

The ENDEAVOUR prioritisation system applies a joint distribution of order statistics to combine multiple p-values, obtained from individual feature score ranks, into a single rank (data fusion), where a combined p-value is approximated by calculating the distribution of a Q-statistic such that:

$$\text{relevance} \equiv 1 - P\big(Q(r_1, r_2, \ldots, r_n)\big)$$

where

$$Q(r_1, r_2, \ldots, r_n) = n! \int_{0}^{r_1} \int_{s_1(g)}^{r_2} \cdots \int_{s_{n-1}(g)}^{r_n} ds_n(g)\, ds_{n-1}(g)\, ds_{n-2}(g) \cdots ds_1(g)$$

where $s_1(g) \cdots s_n(g)$ are the probability distributions of the individual rank scores of $f_1(g) \cdots f_n(g)$, and $r_1 \cdots r_n$ are the rank ratios obtained from each feature rank [148].
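The nested integral for $Q$ has a polynomial closed form, so it can be evaluated without numerical integration. The sketch below uses the recursion $V_0 = 1$, $V_k = \sum_{i=1}^{k} (-1)^{i-1} \frac{V_{k-i}}{i!}\, r_{n-k+1}^{\,i}$, $Q = n!\,V_n$, applied to rank ratios sorted in ascending order. The recursion is supplied here as one way of evaluating the integral (it reproduces direct integration for $n = 1, 2, 3$); it should not be read as a description of how ENDEAVOUR itself is implemented, and the function name is hypothetical.

```python
from math import factorial
from typing import Sequence


def q_statistic(rank_ratios: Sequence[float]) -> float:
    """Evaluate Q(r_1..r_n) = n! * nested integral, via a recursion on sorted rank ratios."""
    r = sorted(rank_ratios)          # assumption: limits require r_1 <= r_2 <= ... <= r_n
    n = len(r)
    v = [1.0]                        # V_0 = 1
    for k in range(1, n + 1):
        x = r[n - k]                 # r_{n-k+1} in 1-based notation
        v_k = sum(
            (-1) ** (i - 1) * v[k - i] / factorial(i) * x ** i
            for i in range(1, k + 1)
        )
        v.append(v_k)
    return factorial(n) * v[n]


# For n = 2, direct integration gives Q = 2*r1*r2 - r1**2.
q_two = q_statistic([0.2, 0.5])
```

Small values of $Q$ arise when a gene ranks near the top in several features at once, which is the behaviour the data-fusion step relies on.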

4.5.4 Machine learning models

Supervised machine learning models can be trained as integrative models to predict the relevance of candidate genes to the known ones. Different machine learning models have been applied in gene prioritisation. In DGP, López-Bigas and Ouzounis applied an information-theoretic decision tree using conservation scores derived from taxonomic groups as features [153]. In PROSPECTR, an alternating decision tree (ADTree) was applied to rank the candidate genes based on various sequence features [154].

4.6 Methods of evaluation

A critical evaluation of gene prioritisers can provide insights into the relative benefits of the different prioritisation models.

4.6.1 Validation of CGP methods by rediscovery experiments

The internal validity of a CGP method can be assessed by performing rediscovery experiments, where a set of “gold-standard” genes involved in disease or phenotype of interest (validation genes or validation sets) are selected as benchmarks to judge prioritiser output. Results from characteristics-based prioritisation methods can be directly compared with the validation genes. In inductive prioritisation, techniques such as cross-validation, jackknife, or a separate validation gene set can be used to estimate the ranking performance [148, 153, 154].

4.6.2 Threshold-dependent performance measures

IR statistics: recall, precision and F-measure

The performance of a CGP method can be quantified by traditional information retrieval statistics such as precision and recall. An ad hoc cut-off value can be assigned to the prioritised gene list, from which threshold-dependent measures are obtained. Precision, or positive predictive value, measures the proportion of correct genes among those ranked above the threshold (a posterior probability). Recall, or sensitivity, is the fraction of disease-related genes that are identified by the prioritisation model. The harmonic mean of precision and recall, or F-measure, is another measure frequently used to select the optimal threshold.
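The threshold-dependent measures above can be computed directly from a scored gene list and a validation set. This is a minimal sketch: the function name, the strict `>` comparison against the threshold, the zero value returned for undefined ratios, and the example scores are assumptions for illustration. The F-measure is computed in its conventional form, 2PR / (P + R).

```python
from typing import Dict, Set, Tuple


def precision_recall_f(
    scores: Dict[str, float], disease_genes: Set[str], tau: float
) -> Tuple[float, float, float]:
    """Precision, recall and F-measure for genes scoring above the threshold tau."""
    selected = {g for g, s in scores.items() if s > tau}
    tp = len(selected & disease_genes)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(disease_genes) if disease_genes else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical prioritiser output and validation genes.
scores = {"g1": 0.9, "g2": 0.8, "g3": 0.4, "g4": 0.3, "g5": 0.1}
validation = {"g1", "g3"}

precision, recall, f_measure = precision_recall_f(scores, validation, tau=0.5)
```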

Probability enrichments

In addition to conditional probabilities, several authors describe the use of probability enrichment to evaluate prioritiser performance [154, 155, 160]. Probability enrichment indicates how many times more likely the prioritiser is to discover correct genes than an unprioritised candidate gene list, such that:

$$\text{fold enrichment } \eta = \frac{\dfrac{\text{number of disease genes with score} > \tau}{\text{number of genes with score} > \tau}}{\dfrac{\text{number of disease genes}}{\text{number of genes}}} \tag{4.2}$$

where $\tau$ refers to the ad hoc threshold score.
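Equation 4.2 can be evaluated as in the sketch below, which divides the proportion of disease genes above the threshold by their proportion in the whole candidate list. The function name and example data are hypothetical, and raising an error for empty selections is an assumption about how undefined cases should be handled.

```python
from typing import Dict, Set


def fold_enrichment(scores: Dict[str, float], disease_genes: Set[str], tau: float) -> float:
    """Fold enrichment: hit rate above tau divided by the baseline rate in all candidates."""
    selected = [g for g, s in scores.items() if s > tau]
    if not selected:
        raise ValueError("no genes score above tau")
    hit_rate = sum(g in disease_genes for g in selected) / len(selected)
    base_rate = sum(g in disease_genes for g in scores) / len(scores)
    if base_rate == 0:
        raise ValueError("no disease genes among the candidates")
    return hit_rate / base_rate


# Hypothetical prioritiser output: 2 disease genes among 5 candidates.
scores = {"g1": 0.9, "g2": 0.8, "g3": 0.4, "g4": 0.3, "g5": 0.1}
validation = {"g1", "g3"}

eta = fold_enrichment(scores, validation, tau=0.5)  # (1/2) / (2/5)
```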

Most CGP systems report internal validation on a variety of genetic diseases by testing the model against the Online Mendelian Inheritance in Man (OMIM) database. Reported performance varies widely, from 1-fold to 300-fold enrichment [148, 155].

Limitation of threshold-dependent measures

Although η is a convenient measure for describing and comparing the improvements conferred by CGP systems, one should exercise caution in interpreting performance by probability enrichment η alone because:

1. The ad hoc choice of τ means that true ranking performance cannot be evaluated objectively with certainty. As pointed out by Gaulton et al., η is not calculated with a random expectation [155].

2. The selection of gene candidates, particularly the negative controls, may grossly under- or over-estimate the true value of η.

3. η does not take into account the specificity, or true negative measures, of the model.

4.6.3 Area under receiver operating characteristic (ROC) curve

A threshold-independent measure that balances sensitivity and specificity, such as ROC curve analysis, can provide a more objective view of the true performance of a prioritised rank. The area under the ROC curve (AUC) provides an aggregate measure based on the average sensitivity and specificity over all possible thresholds [148, 154]. The meaning of AUC was discussed in Chapter 3.
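The AUC of a prioritised rank can be computed without choosing any threshold by using its pairwise interpretation: it equals the probability that a randomly chosen validation gene scores higher than a randomly chosen non-validation gene, with ties counted as one half. The sketch below uses this formulation; the function name and example scores are hypothetical.

```python
from typing import Dict, Set


def auc(scores: Dict[str, float], positives: Set[str]) -> float:
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical prioritiser output and validation genes.
scores = {"g1": 0.9, "g2": 0.8, "g3": 0.4, "g4": 0.3, "g5": 0.1}
validation = {"g1", "g3"}

area = auc(scores, validation)  # 5 of the 6 positive/negative pairs are ordered correctly
```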

4.6.4 External or functional validation

The ultimate method of evaluation is external or functional validation of the selected candidate rank. To date, only ENDEAVOUR has reported functional validation based on prioritised ranks. ENDEAVOUR used scores derived from position weight matrices in the cis-regulatory elements library TRANSFAC as features to identify additional genes that were differentially expressed in the differentiated versus undifferentiated HL-60 cell line, in addition to identifying YPEL1 in craniofacial dysmorphism [148].

4.7 Discussion

4.7.1 The conservation hypothesis and potential biases

Computational CGP methods exploit the conservation hypothesis, which states that genes associated with inherited diseases are usually well conserved and often have important physiological roles. Nevertheless, prioritising candidate genes by this hypothesis has several methodological limitations. For instance, features in CGP are usually based on the frequency of co-occurrence or the degree of similarity between biological or contextual entities. Because the approach rests on “finding alike genes”, disease genes that are dissimilar to the specified features are scored unfavourably. Literature bias is the specific problem that better-studied genes receive higher prioritisation scores than genes about which little is known. Similarly, less well-annotated genes may be inappropriately penalised when annotation databases are used (annotation bias). Techniques to minimise these biases have been described, including using a correlation matrix between features [155], using rank ratios instead of absolute ranks [148], or normalising feature scores prior to feature integration [161]. In inductive prioritisation, sampling biases can also occur when the chosen training genes are not representative of the true disease gene candidates.

4.7.2 Challenges in finding GBS virulence genes by CGP

In adopting CGP for prioritising bacterial candidate genes, several challenges remain with regard to the availability of secondary data sources. For example, no systematically collected gene-phenotype databases comparable to OMIM exist for bacterial genomes, with the exception of a limited number of systematically curated databases (for example, EcoCyc for E. coli⁹). In addition, when the aim is de novo discovery of unknown virulence genes, secondary data sources are likely to introduce literature or annotation biases. Fortunately, primary data sources are abundant: approximately 500 bacterial genome sequences were available in public databases for exploitation (as of April 2007). In the next chapters, a prioritisation system with multi-genomic analysis will be developed and explored in detail.

⁹ EcoCyc: http://ecocyc.org/

Chapter 5

Methods for prokaryotic gene prioritisation using phylogenetic profiles

The challenges in selecting appropriate genes for functional investigation are multifactorial. Firstly, the potential gene space is enormous, leading to information overload in gene selection. Secondly, there is a persistent challenge of information obscurity when predicting gene function solely from raw gene sequences. Thirdly, ad hoc selection of genes through exhaustive literature survey and database screening is inefficient.

The same challenges exist in virulence gene discovery in bacterial pathogens. Firstly, there are thousands of genes in a bacterial genome, rendering the discovery of virulence genes by trial-and-error screening impractical. Secondly, the process of bacterial infection involves complex interactions between pathogen and host. Bacteria may only demonstrate a virulence trait through interaction with hosts, which means that the concept of virulence is meaningless without the same attention to host factors; thus the gene-virulence relationships may not be immediately obvious from analysing the bacterial genome alone. Thirdly, as discussed in Chapter 1.5, validation of virulence genes requires labour-intensive experiments in animal models of infection. Improving accuracy in identifying potential virulence candidates may therefore accelerate the discovery of genes responsible for pathogenesis. Methods of computational candidate gene prioritisation (CGP) aim to address these issues by ranking candidate genes through systematic integration of existing data and knowledge.

The material in this chapter was published in BMC Bioinformatics. 2009; 10:86 [162].

Phylogenetic profiles for prokaryotic functional discovery

Several methods for computational gene function discovery have been studied, including the chromosomal proximity method, domain fusion analysis, analysis of gene expression patterns, and phylogenetic profiles [163]. In particular, the phylogenetic profile method exploits the knowledge of gene occurrences across a range of sequenced genomes and postulates that genes involved in the same metabolic pathway are frequently co-inherited. Phylogenetic profiles have been applied to unsupervised clustering of proteins to discover their functional linkages [164] and to discover conserved gene clusters in microbes (with probabilistic phylogenetic tree models) [165]. Supervised approaches to phylogenetic profiles have also been applied in inferring protein networks (with canonical correlation analysis [166]), in predicting protein functional class in (with tree-based kernels [167]), in the discovery of protein localisation in eukaryotes [168], and in functional annotation of genes (by correlation enrichments [169]). These studies suggested that the phylogenetic profile method provides a valuable tool for predicting gene-function linkage. It was thus hypothesised that phylogenetic profiles can also be exploited as gene features for prioritising genes contributing to a particular phenotypic trait of interest, thus providing a practical and generalisable tool to guide microbiologists in gene selection.

This chapter explores how the concept of phylogenetic profiles may be applied to develop a CGP tool that accelerates the discovery of gene functions in prokaryotes. The specific goal of developing such a tool is to facilitate the discovery of GBS virulence genes, which will be studied in later chapters.

5.1 Gene-function co-occurrence

(a) The link between gene and function is unknown.

(b) The missing link between gene and function can be provided by using bacterial genome sequences: a gene is present in a genome, the genome belongs to a species, and the species displays a function, so an association between gene and function is hypothesised.

Figure 5.1: Using bacterial genomes to provide indirect links between gene and function

As explained in the previous chapter (Chapter 4), the concept of co-occurrence can be used to assess the degree of relevance of a candidate gene to a function of interest. Figure 5.1 illustrates how genome sequences can be used to perform gene prioritisation. To discover the unknown association between gene and function, we may connect these two entities through three relationships:

1. Gene-genome relationship: The first relationship is defined by whether a gene, or its homologous counterpart, is present or absent in a given genome. This relationship can be readily determined by examining genome sequences and open reading frames (ORFs) in genomes. Sequence comparison algorithms such as the Basic Local Alignment Search Tool (BLAST) can elicit this relationship.

2. Genome-species relationship: The second relationship, between a genome and a given bacterial species, is defined by taxonomic classification (Chapter 1.3.1). For example, the Streptococcus agalactiae 2603 V/R genome belongs to the species S. agalactiae.

3. Species-function relationship: The third relationship is that between a bacterial species and the function of interest. It is well acknowledged that bacterial species display certain phenotypes or functions specific to the species or taxonomic group. For example, the genus Staphylococcus possesses a Gram-positive cell wall, whereas Escherichia coli always displays the traits of Gram-negative bacteria.

By using a sequenced genome to provide the missing link between gene and function, it can be postulated that genes contributing to a particular function should occur more frequently in the genomes displaying the phenotype (Figure 5.2). In other words, a function and its contributing gene are frequently co-present in genomes that display the phenotypic trait. Similarly, the gene should occur less frequently in the genomes that do not display the phenotype (co-absence of the gene and the function of interest). If a gene is unrelated to the function, it is expected to be randomly distributed among genomes in both phenotypic groups. From the perspective of bacterial pathogenesis, this concept of gene-function co-presence and co-absence is in concordance with the first of the molecular Koch's postulates proposed by Falkow (see Chapter 1.3.2) [32].

Figure 5.2: Using gene-function co-occurrence in genomes to prioritise candidate genes. Genes contributing to the phenotype of interest are expected to be found more frequently in genomes displaying the phenotype (positive examples) than in genomes that do not display it (negative examples); candidate genes are ordered from most relevant to least relevant.

5.2 Formal Definitions

• Gene $g_i$: a gene.

• Strain $s_j$: a bacterial isolate.

• Genome $G_j$: the corresponding genome of $s_j$, which consists of the entire set of genes such that

$$s_j \xrightarrow{\text{has}} G_j = \langle g_{j1}, g_{j2}, \ldots, g_{jm} \rangle$$

• Gene product $y_k$: the product of a gene (i.e. protein or RNA). For every gene product $y_k$, there exists at least one encoding gene $g_i$:

$$g_i \Rightarrow y_k$$

• Gene equivalence: two genes $g_i$ and $g_j$ are considered equivalent if both encode the same gene product $y_k$, i.e.

$$g_i \equiv g_j \ \text{ if } \ g_i \Rightarrow y_k \ \text{ and } \ g_j \Rightarrow y_k$$

• Set of gene products: the gene products of strain $s_j$,

$$G_j \xrightarrow{\text{encodes for}} Y_j = \langle y_{j1}, y_{j2}, \ldots, y_{jn} \rangle$$

• Phenotype or function $p$: a phenotypic trait displayed by a bacterial strain.

• Phenotypic examples $E_p$: for each phenotype $p$, a list of phenotype examples can be gathered. Each $e_j$ corresponds to a bacterial strain $s_j$.

$$E_p = \langle e_{p1}, e_{p2}, \ldots, e_{pn} \rangle$$

where $e_{p1 \ldots n} \in \{s_{p1}, s_{p2}, \ldots, s_{pn}\}$ are selected from bacterial strains displaying phenotype $p$.

Figure 5.3: The workflow of statistical CGP. In statistical CGP, two groups of genome examples (positive and negative) are selected according to the study phenotype. The occurrences of each candidate gene in the genome examples are determined by BLAST and stored in the occurrence (homolog) matrix. The frequencies of occurrence for each candidate gene in both groups are then counted and statistical scoring functions are applied to assign a final score which forms the basis of the gene rank. It is expected that genes ranked at the top of the list are more likely to be associated with the phenotype of interest.

5.3 Statistical CGP with frequencies of gene occurrence

This chapter develops an implementation of characteristics-based prioritisation, statistical CGP, a prioritisation method based on the concept described in Section 5.1. Statistical CGP begins with the user specifying a phenotype of interest p (study phenotype), together with a list of potential genes to be ranked (candidate genes). The relevance of a candidate gene to p is determined by the method described below. The workflow of statistical CGP is illustrated in Figure 5.3.

5.3.1 Determination of genomic occurrences of candidate genes

For each prioritisation task, the polypeptide sequences of the n candidate genes are compared with all ORFs of the m genome sequences (including all genes on the chromosomes and plasmids) downloaded from the National Center for Biotechnology Information (NCBI, accessed April 2007) using the BLASTP program¹. If a candidate gene reaches the critical expect value (E-value) of $10^{-5}$ in a given genome, the gene or a gene homolog is defined as present in that genome. If a gene does not reach the critical E-value in a genome, the gene is recorded as absent from the genome. The binary states of presence or absence of genes in the example genomes are recorded in the n × m occurrence matrix (Figure 5.4).
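The construction of the occurrence matrix can be sketched as a thresholding step over BLASTP results. The sketch assumes the best E-value per gene-genome pair has already been extracted into a dictionary (parsing BLAST output is not shown), and it interprets "reaching" the critical value as an E-value less than or equal to the cut-off. The gene names, genome names, and E-values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

E_VALUE_CUTOFF = 1e-5  # critical E-value stated in the text


def occurrence_matrix(
    best_evalue: Dict[Tuple[str, str], float],
    genes: Sequence[str],
    genomes: Sequence[str],
    cutoff: float = E_VALUE_CUTOFF,
) -> List[List[int]]:
    """Return an n x m matrix: 1 if the gene has a hit at or below the cutoff, else 0."""
    return [
        [
            # A missing entry means no hit was reported, treated as absent.
            1 if best_evalue.get((gene, genome), float("inf")) <= cutoff else 0
            for genome in genomes
        ]
        for gene in genes
    ]


# Hypothetical best BLASTP E-values for two candidate genes in three genomes.
genes = ["gene1", "gene2"]
genomes = ["genomeA", "genomeB", "genomeC"]
hits = {
    ("gene1", "genomeA"): 1e-40,  # strong hit: present
    ("gene1", "genomeB"): 1e-3,   # weak hit: absent
    ("gene2", "genomeC"): 1e-5,   # exactly at the cutoff: present
}

matrix = occurrence_matrix(hits, genes, genomes)
```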

As the primary objective of the occurrence matrix is not the detection of orthologous genes, a less conservative E-value cut-off of $10^{-5}$, which lies at the least stringent end of the $10^{-5}$ to $10^{-10}$ range commonly used to detect protein orthologs, is appropriate. Although such a value is more likely to admit paralogs, it also allows proteins with shorter conserved domains to be identified. Furthermore, it should produce richer occurrence profiles, resulting in better discrimination between genes.

¹ http://blast.ncbi.nlm.nih.gov/Blast.cgi

Figure 5.4: The occurrence (homolog) matrix. This schematic heat-map illustrates the occurrence profile of whether the n candidate genes (down) have homologs present (red squares) or absent (white squares) in each of the m genomes (across). Gene occurrences are determined by BLASTP.

| | Positive genome examples ($E_p^+$) | Negative genome examples ($E_p^-$) | Total |
|---|---|---|---|
| $g_i$ present | True positives (TP): number of genomes in $E_p^+$ containing $g_i$ | False positives (FP): number of genomes in $E_p^-$ containing $g_i$ | number of genomes containing $g_i$ |
| $g_i$ absent | False negatives (FN): number of genomes in $E_p^+$ not containing $g_i$ | True negatives (TN): number of genomes in $E_p^-$ not containing $g_i$ | number of genomes not containing $g_i$ |
| Total | number of genomes in $E_p^+$ = $m^+$ | number of genomes in $E_p^-$ = $m^-$ | |

Figure 5.5: The 2 × 2 contingency table derived from counts in the occurrence matrix. Note: $g_i$: candidate gene; $E_p^+$: positive genome examples; $E_p^-$: negative genome examples; $m^+$: number of positive genome examples; $m^-$: number of negative genome examples.

5.3.2 The 2 × 2 contingency tables

From the m genomes, $m^+$ genomes known to display the phenotype of interest p are selected as positive genome examples, and $m^-$ genomes not displaying p are chosen as negative examples. For each of the n candidate genes, the numbers of co-presences (homologs present in positive genome examples) and co-absences (homologs absent in negative genome examples) are counted and entered into a 2 × 2 contingency table (Figure 5.5), from which a number of statistical scoring functions are calculated. These scores form the basis of the gene ranks R.
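The counting step can be sketched as follows for a single candidate gene. The input is one row of the occurrence matrix together with a flag per genome indicating whether it is a positive example; the `Contingency` type, the function name, and the example data are illustrative assumptions.

```python
from typing import NamedTuple, Sequence


class Contingency(NamedTuple):
    tp: int  # gene present, positive genome example (co-presence)
    fp: int  # gene present, negative genome example
    fn: int  # gene absent, positive genome example
    tn: int  # gene absent, negative genome example (co-absence)


def contingency(occurrence_row: Sequence[int], is_positive: Sequence[bool]) -> Contingency:
    """Count the four cells of the 2 x 2 table for one candidate gene."""
    if len(occurrence_row) != len(is_positive):
        raise ValueError("one label is required per genome")
    tp = fp = fn = tn = 0
    for present, positive in zip(occurrence_row, is_positive):
        if present and positive:
            tp += 1
        elif present:
            fp += 1
        elif positive:
            fn += 1
        else:
            tn += 1
    return Contingency(tp, fp, fn, tn)


# Hypothetical gene profile over six genomes: four positive, two negative examples.
row = [1, 1, 1, 0, 0, 1]
labels = [True, True, True, True, False, False]

table = contingency(row, labels)
```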

5.3.3 Scoring functions

Sensitivity (sens) and specificity (spec)

Sensitivity is the proportion of genomes displaying phenotype p in which the candidate gene g is present, whereas specificity is the proportion of genomes not displaying p from which g is absent. These measures are equivalent to the normalised rates of co-presence and co-absence of genes in the positive and negative genome examples respectively:

$$\mathrm{sens}(g) = P(g \mid G \in E_p^+) = \frac{TP}{TP + FN} \tag{5.1}$$

$$\mathrm{spec}(g) = P(\neg g \mid G \in E_p^-) = \frac{TN}{TN + FP} \tag{5.2}$$

Positive (ppv) and negative (npv) predictive values

The positive predictive value (ppv), or precision, measures the proportion of genomes containing the gene that are positive examples. Similarly, the negative predictive value (npv) measures the proportion of genomes lacking the gene that are negative examples.

$$\mathrm{ppv}(g) = P(G \in E_p^+ \mid g_i) = \frac{TP}{TP + FP} \tag{5.3}$$

$$\mathrm{npv}(g) = P(G \in E_p^- \mid \neg g_i) = \frac{TN}{TN + FN} \tag{5.4}$$
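Equations 5.1 to 5.4 translate directly into code once the four cells of the contingency table are known. In the sketch below, returning 0.0 when a denominator is empty is an assumption made here (the text does not say how undefined ratios are handled), and the example counts are hypothetical.

```python
def safe_ratio(numerator: int, denominator: int) -> float:
    # Assumption: a ratio with an empty denominator is scored as 0.0.
    return numerator / denominator if denominator else 0.0


def sens(tp: int, fp: int, fn: int, tn: int) -> float:
    """Rate of co-presence: TP / (TP + FN)."""
    return safe_ratio(tp, tp + fn)


def spec(tp: int, fp: int, fn: int, tn: int) -> float:
    """Rate of co-absence: TN / (TN + FP)."""
    return safe_ratio(tn, tn + fp)


def ppv(tp: int, fp: int, fn: int, tn: int) -> float:
    """Proportion of gene-containing genomes that are positive: TP / (TP + FP)."""
    return safe_ratio(tp, tp + fp)


def npv(tp: int, fp: int, fn: int, tn: int) -> float:
    """Proportion of gene-lacking genomes that are negative: TN / (TN + FN)."""
    return safe_ratio(tn, tn + fn)


# Hypothetical counts for one candidate gene.
counts = dict(tp=3, fp=1, fn=0, tn=3)
scores = {f.__name__: f(**counts) for f in (sens, spec, ppv, npv)}
```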

Arithmetic (amss) and harmonic (hmss) means of sensitivity and specificity

Both scoring functions amss and hmss balance the rates of co-presence and co-absence. The amss scoring function is the arithmetic midpoint between sensitivity and specificity. The hmss scoring function, which defines the harmonic mean between the conditional probabilities, is conceptually similar to amss but penalises genes with very low sensitivities or specificities.

$$\mathrm{amss}(g) = \frac{1}{2}\big(\mathrm{sens}(g) + \mathrm{spec}(g)\big) \tag{5.5}$$

$$\mathrm{hmss}(g) = \frac{1}{\dfrac{1}{\mathrm{sens}(g)} + \dfrac{1}{\mathrm{spec}(g)}} \tag{5.6}$$

F-measure (F)

F-measure is a frequently used statistic for evaluating the performance of information retrieval systems. It is defined as the harmonic mean between the sensitivity and precision, such that:

$$F(g) = \frac{1}{\dfrac{1}{\mathrm{sens}(g)} + \dfrac{1}{\mathrm{ppv}(g)}} \tag{5.7}$$
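The combined scores of Equations 5.5 to 5.7 can be sketched as below. The code follows the equations as printed, with a numerator of 1; this is half of the conventional harmonic mean and therefore yields the same gene ranking. Scoring a gene as 0.0 when either input is zero is an assumption made here to avoid division by zero, and the input values are hypothetical.

```python
def amss(sens: float, spec: float) -> float:
    """Arithmetic mean of sensitivity and specificity (Equation 5.5)."""
    return 0.5 * (sens + spec)


def reciprocal_sum_mean(a: float, b: float) -> float:
    """1 / (1/a + 1/b), as printed in Equations 5.6 and 5.7."""
    if a == 0 or b == 0:
        return 0.0  # assumption: a zero component gives the lowest score
    return 1.0 / (1.0 / a + 1.0 / b)


def hmss(sens: float, spec: float) -> float:
    """Harmonic-type mean of sensitivity and specificity (Equation 5.6)."""
    return reciprocal_sum_mean(sens, spec)


def f_measure(sens: float, ppv: float) -> float:
    """Harmonic-type mean of sensitivity and precision (Equation 5.7)."""
    return reciprocal_sum_mean(sens, ppv)


# Hypothetical gene with sens = 1.0, spec = 0.75, ppv = 0.75.
a = amss(1.0, 0.75)
h = hmss(1.0, 0.75)
f = f_measure(1.0, 0.75)

# A gene with one very low rate is penalised far more by hmss than by amss.
skewed_amss = amss(1.0, 0.02)
skewed_hmss = hmss(1.0, 0.02)
```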

Odds ratios (OR)

The odds ratio compares the odds of a gene being present in the positive examples with the odds of it being present in the negative examples, such that:

$$\mathrm{OR}(g) = \frac{TP / FP}{FN / TN} = \frac{TP \times TN}{FP \times FN} \tag{5.8}$$

Chi-square (chisq) and directional chi-square (bchisq) scoring functions

$\chi^2$ is a frequently used statistic for testing variations between groups in discrete data. The chisq scoring function measures the deviation of the observed frequency from the expected proportion such that:

$$\mathrm{chisq}(g) = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{\big(a_{ij} - E(a_{ij})\big)^2}{E(a_{ij})} \tag{5.9}$$

where $E(a_{ij}) = \dfrac{(a_{1j} + a_{2j})(a_{i1} + a_{i2})}{a_{11} + a_{21} + a_{12} + a_{22}}$ and $a_{ij}$ are the elements of the 2 × 2 contingency table.

The directional chi-square function (bchisq) is similar to chisq, but genes that display an inverse association with p are given a negative score and so moved to the bottom of the rank, which effectively excludes them:

$$\mathrm{bchisq}(g) = \begin{cases} +\mathrm{chisq}(g) & \text{if } \mathrm{OR}(g) \geq 1 \\ -\mathrm{chisq}(g) & \text{if } \mathrm{OR}(g) < 1 \end{cases} \tag{5.10}$$
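The odds ratio and the two chi-square scoring functions (Equations 5.8 to 5.10) can be sketched as follows. The handling of zero cells is an assumption made here: an odds ratio with a zero denominator is treated as infinite when the numerator is positive and as 1 (no association) otherwise, and cells with zero expected count contribute nothing to the statistic. No continuity correction is applied.

```python
def odds_ratio(tp: int, fp: int, fn: int, tn: int) -> float:
    """(TP x TN) / (FP x FN), with assumed conventions for zero cells."""
    if fp * fn == 0:
        return float("inf") if tp * tn > 0 else 1.0
    return (tp * tn) / (fp * fn)


def chisq(tp: int, fp: int, fn: int, tn: int) -> float:
    """Pearson chi-square statistic of the 2 x 2 contingency table."""
    a = [[tp, fp], [fn, tn]]
    total = tp + fp + fn + tn
    if total == 0:
        return 0.0
    stat = 0.0
    for i in range(2):
        for j in range(2):
            row_total = a[i][0] + a[i][1]
            col_total = a[0][j] + a[1][j]
            expected = row_total * col_total / total
            if expected > 0:
                stat += (a[i][j] - expected) ** 2 / expected
    return stat


def bchisq(tp: int, fp: int, fn: int, tn: int) -> float:
    """Directional chi-square: negative when the gene is inversely associated."""
    score = chisq(tp, fp, fn, tn)
    return score if odds_ratio(tp, fp, fn, tn) >= 1 else -score


# Hypothetical genes: perfectly associated, inversely associated, unrelated.
associated = bchisq(tp=3, fp=0, fn=0, tn=3)
inverse = bchisq(tp=0, fp=3, fn=3, tn=0)
unrelated = bchisq(tp=2, fp=2, fn=2, tn=2)
```

The first two genes have the same chisq value, but bchisq separates them by sign, so only the positively associated gene reaches the top of the rank.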

5.4 Inductive CGP with gene occurrence patterns

Inductive CGP ranks genes by using supervised machine learning to find genes with similar occurrence patterns across a number of bacterial genomes. A number of genes known to be associated with the study phenotype p are selected as positive gene examples for the training set. Similarly, genes that do not contribute to p are selected as negative gene examples. The occurrences of genes in k genome examples, determined as described in Section 5.3.1, are used as features for model training.

Candidate genes are ranked by the score or posterior probability from the output of machine learning classifiers, such that:

$$\varphi(g_i) = \phi\bigl(F(g_i)\bigr) = M^{*}\bigl(F(g_i)\bigr)$$

where $M^{*}\bigl(F(g_i)\bigr)$ is the model $M(F)$ optimised to classify the training set. For the experiments reported in Chapter 6, major classes of machine learning algorithms from the Waikato Environment for Knowledge Analysis (WEKA) 3.5.6 are used in inductive CGP [135]. The algorithms evaluated include the naïve Bayes classifier (NB), logistic regression (LR; ridge = 10⁻⁵), J48 decision tree (J48; pruning confidence = 0.25, minimum instances per node = 2), k-nearest neighbour classifier (IBk; with inverse distance weighting; k was determined by leave-one-out cross-validation), alternating decision tree (ADTree; boosting iterations = 10), and support vector machines with polynomial (SVM/Poly; linear kernel) and radial basis function (SVM/RBF; γ = 0.01) kernels. Both SVMs were trained by the sequential minimal optimisation (SMO) algorithm, and the posterior probabilities are estimated by logistic models as previously described in Chapter 3. Unless otherwise stated, the performance of machine learning classifiers is evaluated by stratified 10-fold cross-validation.
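The ranking step of inductive CGP can be illustrated with a small, self-contained classifier. The sketch below is not WEKA and does not reproduce any of the configurations listed above. It is a hypothetical pure-Python Bernoulli naïve Bayes with Laplace smoothing (the `alpha` parameter is an assumption), trained on binary occurrence vectors and used to rank candidates by posterior probability. It assumes both classes are present in the training set.

```python
import math


def train_bernoulli_nb(vectors, labels, alpha=1.0):
    """Fit class priors and per-genome occurrence probabilities."""
    n_features = len(vectors[0])
    model = {}
    for cls in (0, 1):
        rows = [v for v, y in zip(vectors, labels) if y == cls]
        prior = len(rows) / len(vectors)
        # Laplace-smoothed probability that a gene of this class occurs in genome k
        probs = [
            (sum(r[k] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for k in range(n_features)
        ]
        model[cls] = (prior, probs)
    return model


def posterior_positive(model, vector):
    """Posterior probability that a gene belongs to the positive class."""
    log_scores = {}
    for cls, (prior, probs) in model.items():
        score = math.log(prior)
        for present, p in zip(vector, probs):
            score += math.log(p if present else 1.0 - p)
        log_scores[cls] = score
    top = max(log_scores.values())
    weights = {cls: math.exp(s - top) for cls, s in log_scores.items()}
    return weights[1] / (weights[0] + weights[1])


def rank_candidates(model, candidates):
    """Sort candidate gene names by descending posterior probability."""
    scored = [(posterior_positive(model, v), name) for name, v in candidates.items()]
    return [name for _, name in sorted(scored, key=lambda pair: -pair[0])]
```

Any classifier that outputs a score or a posterior probability could take the place of `posterior_positive`; only the final sort defines the candidate gene rank.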

5.5 Evaluation by rediscovery experiments

Systematic evaluations are important for benchmarking CGP systems, of which the discovery power can be examined by performing rediscovery experiments. In this section, the steps in CGP evaluation will be formalised.

The rank and rank fraction

The candidate gene rank R is defined as an ordered set of genes sorted in descending order of score, so that genes with higher scores are placed at the top of the rank:

$$R = \langle r_1, r_2, \ldots, r_n \rangle \tag{5.11}$$

where $\{r_1, r_2, \ldots, r_n\}$ is the set of genes $\{g_1, g_2, \ldots, g_n\}$ and $\varphi(r_i) \ge \varphi(r_{i+1})$ for all $i = 1, \ldots, n-1$.

The position of a gene in the rank R is defined as its index counted from the top of the rank, such that $\mathrm{pos}_R(r_i) = i$ for all $i = 1, \ldots, n$. Similarly, the rank fraction is defined as the normalised distance from the top of the rank, such that

$$T_R(r_i) = \frac{1}{n}\,\mathrm{pos}_R(r_i) = \frac{i}{n} = f \tag{5.12}$$

f can also be expressed in top percentiles (pct), such that

$$\text{Top } x \text{ percentile (pct)} = T_R(r_i) \times 100\% \tag{5.13}$$
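Equations 5.11 to 5.13 can be illustrated with a short sketch. The function names and example scores are hypothetical, and ties are simply left in input order, which need not match how tied genes are handled in the thesis.

```python
def build_rank(scores):
    """Return gene names sorted by descending score (Equation 5.11)."""
    return sorted(scores, key=lambda gene: scores[gene], reverse=True)


def rank_fraction(rank, gene):
    """T_R(r_i) = pos_R(r_i) / n, with positions counted from 1 (Equation 5.12)."""
    return (rank.index(gene) + 1) / len(rank)


def top_percentile(rank, gene):
    """Rank fraction expressed as a top percentile (Equation 5.13)."""
    return rank_fraction(rank, gene) * 100.0


# Hypothetical scores for four candidate genes
scores = {"murA": 0.96, "glk": 0.40, "pbp2A": 0.97, "pgi": 0.10}
rank = build_rank(scores)
```

With these scores, murA is second of four genes, so its rank fraction is 0.5 and it sits in the top 50 pct of the rank.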

Inverse rank fraction

The inverse rank fraction is defined as:

$$\text{Inverse function of } T_R(r_i) = T_R^{-1}(f) = r_i \tag{5.14}$$

where $T_R(r_i) \le f$

Inverse rank fraction of a subrank

For a given subset $R' = \{r'_1, r'_2, \ldots, r'_{n'}\} \subset R = \{r_1, r_2, \ldots, r_n\}$, the rank fraction of gene $r_i$ in subrank $R'$ is defined as:

$$T_{R'}(r_i) = \frac{j}{n'}, \quad \text{where } \varphi(r'_j) \ge \varphi(r_i) > \varphi(r'_{j+1}) \tag{5.15}$$

Similarly, the inverse rank fraction $T_{R'}^{-1}(f)$ can be defined as:

$$T_{R'}^{-1}(f) = r'_i \tag{5.16}$$

   ≤  ≤ where TR (ri) f=1.

5.5.1 Evaluation by threshold functions and threshold-dependent measures

In rediscovery experiments, sets of correct genes are used as the gold standard to determine the performance of a prioritised gene list. The simplest and most practical method for evaluating a candidate gene rank is to perform a binary split of the candidate genes into two classes, $R^+$ and $R^-$. The positive gene examples $R^+$ consist of genes from rank R that are known to contribute to phenotype p, such that $R^+ = \{g_1^+, g_2^+, \ldots, g_i^+\} \subset R$ and $g_i^+ \xrightarrow{\text{contributes to}} p$ for every i. Similarly, the negative gene examples, $R^-$, are genes known not to contribute to phenotype p, such that $R^- = \{g_1^-, g_2^-, \ldots, g_j^-\} \subset R$ and $g_j^- \nrightarrow p$ for every j.

The cumulative gain function

The cumulative gain (CG) function is useful in assessing the number of genes that can be discovered above a certain position in the rank. By defining the genes above the rank fraction threshold t as predicted positive (genes predicted to contribute to the phenotype), we can express the CG function as the rank fraction of the subrank $R^+$ (Equation 5.15):

$$\mathrm{CG}(t) = T_{R^+}\bigl(T_R^{-1}(t)\bigr), \quad 0 \le t \le 1 \tag{5.17}$$
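Read operationally, CG(t) is the fraction of the known positive genes found among the genes whose rank fraction does not exceed t. A minimal sketch under that reading follows. The function name and the small tolerance used when converting t to a gene count are assumptions.

```python
import math


def cumulative_gain(rank, positives, t):
    """Fraction of positive genes found within the top fraction t of the rank."""
    n = len(rank)
    # Number of genes with rank fraction i/n <= t; the tolerance guards
    # against floating-point products such as 0.29 * 100 = 28.999...
    k = math.floor(t * n + 1e-9)
    found = sum(1 for gene in rank[:k] if gene in positives)
    return found / len(positives)
```

For a rank of four genes with positives in the first and third positions, CG is 0.5 at t = 0.25 and t = 0.5, and reaches 1.0 from t = 0.75 onwards.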

Partial precision

The practical purpose of gene prioritisation is to increase the efficiency of candidate gene selection. The selection efficiency is achieved by the reduction of potential search space from {g1,g2,...,gn} = {r1,r2,...,rn} to {r1,r2,...,rk}, k

$$\mathrm{pppv}(t) = \frac{N^{+}\, T_{R^+}\bigl(T_R^{-1}(t)\bigr)}{N t} \tag{5.18}$$

where $N^{+}$ is the number of genes known to display phenotype p, and N is the total number of candidate genes in the rank.

The average partial precision ($\overline{\mathrm{pppv}}$) can be calculated as the area under the pppv(t) function, which can be evaluated numerically using the trapezium rule between rank fractions 0 and 1:

$$\overline{\mathrm{pppv}} = \frac{1}{1 - 0}\int_0^1 \mathrm{pppv}(t)\,dt = \int_0^1 \mathrm{pppv}(t)\,dt \tag{5.19}$$

Probability enrichments

The absolute precision and partial precision of two prioritised ranks cannot be compared directly, as the baseline proportions of correct genes normally differ. The measurement of the relative partial precision, or probability enrichment (PE), enables this comparison. PE is expressed as the fold improvement over the baseline precision (Equation 4.2). It indicates how many times more likely a correct gene is to be found than when a gene is selected at random.

Extending the definition of PE by Turner et al. [160], the probability enrichment $\eta(t)$ at threshold t and the average enrichment $\bar{\eta}$ are defined as:

$$\eta(t) = \frac{\mathrm{pppv}(t)}{\mathrm{ppv}} = \frac{T_{R^+}\bigl(T_R^{-1}(t)\bigr)}{t} \tag{5.20}$$

$$\bar{\eta} = \int_0^1 \eta(t)\,dt = \int_0^1 \frac{T_{R^+}\bigl(T_R^{-1}(t)\bigr)}{t}\,dt \tag{5.21}$$
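The enrichment measures can be sketched numerically as below. The sketch samples η(t) at the rank fractions t = i/n and applies the trapezium rule to those samples. Because η(t) is undefined at t = 0, the integration here starts at t = 1/n, which is an assumption of this sketch and may differ from the numerical scheme used in the thesis. Function names are hypothetical.

```python
def enrichment_curve(rank, positives):
    """Return [(t_i, eta(t_i))] for t_i = i/n, i = 1..n (Equation 5.20)."""
    n = len(rank)
    n_pos = len(positives)
    curve = []
    found = 0
    for i, gene in enumerate(rank, start=1):
        if gene in positives:
            found += 1
        t = i / n
        curve.append((t, (found / n_pos) / t))
    return curve


def average_enrichment(curve):
    """Trapezium-rule estimate of Equation 5.21 over the sampled thresholds."""
    area = 0.0
    for (t0, e0), (t1, e1) in zip(curve, curve[1:]):
        area += 0.5 * (e0 + e1) * (t1 - t0)
    return area


def max_enrichment(curve):
    """Largest enrichment observed at any sampled threshold."""
    return max(eta for _, eta in curve)
```

When the top-ranked gene is a positive, the first sample is η(1/n) = n / N⁺, which is the largest value the curve can take. For a rank of 2124 genes with 25 positives this bound is about 85.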

5.5.2 Area under receiver operating characteristic (ROC) curve in prioritisation rank

As discussed in Chapter 4, threshold-dependent analyses require the assignment of an arbitrary threshold and are thus unable to provide an assessment of the overall performance. To evaluate CGP performance in a threshold-independent manner, receiver operating characteristic (ROC) curves can be used to evaluate the trade-off between true positive and false positive signals [170]. The ROC curve is defined by the following parametric equation:

$$\mathrm{ROC}\bigl(x(t), y(t)\bigr) \equiv \Bigl(1 - T_{R^-}\bigl(T_R^{-1}(t)\bigr),\; T_{R^+}\bigl(T_R^{-1}(t)\bigr)\Bigr) \tag{5.22}$$

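The area under the ROC curve (AUC) indicates the probability of finding correct genes when samples are randomly drawn [137]. The AUC estimated by the trapezium rule, the approximation applied in Chapters 6, 8 and 9, is calculated as:

$$\begin{aligned} \mathrm{AUC} &= \int_{-\infty}^{\infty} \mathrm{TPR}(\tau)\, d\,\mathrm{FPR}(\tau) \\ &\approx \frac{1}{2}\sum_{i=0}^{N} \bigl(\mathrm{TPR}(t_{i+1}) + \mathrm{TPR}(t_i)\bigr)\bigl(\mathrm{FPR}(t_{i+1}) - \mathrm{FPR}(t_i)\bigr) \\ &= \frac{1}{2}\sum_{i=0}^{N} \bigl(T_{R^+}(r_{i+1}) + T_{R^+}(r_i)\bigr)\bigl(T_{R^-}(r_{i+1}) - T_{R^-}(r_i)\bigr) \end{aligned} \tag{5.23}$$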

where τ is the threshold score, TPR(τ) and FPR(τ) are the true and false positive rates above threshold τ, ti is the rank fraction at the i-th threshold, and ri is the i-th gene from the top of the candidate gene rank with N genes.
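The trapezium approximation can be sketched as follows. The sketch walks down the rank and accumulates the area between consecutive ROC points. It assumes, for illustration only, that every gene in the rank that is not a known positive is treated as a negative, and that the rank contains at least one gene of each class. Function names are hypothetical.

```python
def roc_points(rank, positives):
    """ROC points (FPR, TPR) after each successive gene from the top of the rank."""
    n_pos = len(positives)
    n_neg = len(rank) - n_pos
    tp = 0
    fp = 0
    points = [(0.0, 0.0)]
    for gene in rank:
        if gene in positives:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezium(points):
    """Trapezium-rule area under the ROC points (second line of Equation 5.23)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += 0.5 * (y1 + y0) * (x1 - x0)
    return area
```

For the rank a, b, c, d with positives a and c, three of the four positive-negative pairs are ordered correctly, and the trapezium estimate is 0.75 accordingly.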

5.5.3 Evaluation measures used in this thesis

This section summarised and provided mathematical formalisations of standard evaluation measures. In addition, the novel use of the measure of average probability enrichment ($\bar{\eta}$) was described to help evaluate the general performance of CGP. The measures of AUC, $\eta$, and $\bar{\eta}$ will be used to evaluate the performance of rediscovery experiments in later chapters.

Chapter 6

Co-occurrence-based CGP: case studies

The results in this chapter were published in BMC Bioinformatics. 2009; 10:86 [162].

6.1 Case study 1: Rediscovery of peptidoglycan genes

Peptidoglycan is an integral component present in most bacterial cell walls. The rigidity of the cross-linked polymer gives bacteria the ability to withstand osmotic pressure and maintains the shape of the bacterial cell [13]. Peptidoglycan is also an important target of antibiotics. Alteration of peptidoglycan structure by genetic mutations in the responsible enzymes is associated with antibiotic resistance and treatment failures [171]. The pathway of peptidoglycan metabolism has been well studied [172] and is thus suitable for rediscovery experiments.

6.1.1 Methods

Selection of validation gene sets

Well-characterised genes responsible for peptidoglycan biosynthesis and metabolism in bacteria were used for CGP testing. However, the interweaving nature of biochemical pathways makes the selection of exact “causal genes” difficult. The anabolism of peptidoglycan involves many pathways in precursor synthesis, backbone assembly, and cross-linking of monomers. To address the complexity of the selection process, genes responsible for peptidoglycan metabolism were grouped into three nested validation sets as shown in Figure 6.1.

• The C (core) validation set consisted of genes responsible for the synthesis of the peptidoglycan backbone, N-acetylmuramate–pentapeptide, from UDP-N-acetyl-glucosamine (murA to murG, mraY).

• The B (biosynthesis) validation set extended the C set with genes involved in precursor pathways including N-acetyl–D-glucosamine, meso-diaminopimelate and D-alanyl–D-alanine, as well as genes responsible for undecaprenyl phosphate biosynthesis and recycling.

• The M (metabolism) validation set further extended the B set by including genes responsible for the modification, recycling, and cross-linking of peptidoglycan such as penicillin-binding proteins and N-acetylmuramoyl–L-alanine amidase.

The genes in the validation sets were identified using two pathway resources, the Kyoto Encyclopaedia of Genes and Genomes (KEGG) [173] and EcoCyc [174].

Selection of candidate genes

Genomes of one Gram-positive bacterium (S. agalactiae 2603 V/R, SA-2603, 2124 genes, GenBank ID: AE009948) and one Gram-negative bacterium (E. coli K-12, EC-K12, 4134 genes, GenBank ID: U00096) were selected for prioritisation. Genes from the three validation sets in the two genomes are shown in Table 6.1.

Figure 6.1: Case study 1: genes involved in peptidoglycan metabolism. The C validation set (shaded area) includes genes responsible for the biosynthesis of the peptidoglycan backbone. The B validation set includes various accessory pathways (UAG-synthesis, D-Glu and D-Ala synthesis, meso-DAP synthesis, and und-PP synthesis and recycling). The M validation set further includes genes responsible for transpeptidation, transglycosylation, and other aspects of peptidoglycan metabolism.
Abbreviations: UDP: uridine diphosphate; NAG: N-acetylglucosamine; NAG-1P: N-acetylglucosamine–1-phosphate; NAM: N-acetylmuramate; NAG-EP: N-acetylglucosamine–enolpyruvate; Ala: alanine; Glu: glutamate; (D-Ala)2: D-alanyl–D-alanine; m-DAP: meso-diaminopimelate; und-PP: undecaprenyl diphosphate; und-P: undecaprenyl phosphate; F6P: fructose-6-phosphate; D-Glc: D-glucosamine; D-Glc-6P: D-glucosamine-6-phosphate; D-Glc-1P: D-glucosamine-1-phosphate; L-Asp: L-aspartate; L-Asp-4P: L-aspartate-4-phosphate; ASA: aspartate semialdehyde; DHDP: L-2,3-dihydrodipicolinate; THDP: tetrahydrodipicolinate; NS-AKP: N-succinyl-2-amino-6-ketopimelate; NS-DAP: N-succinyl–L,L-2,6-diaminopimelate; L,L-DAP: L,L-diaminopimelate

Table 6.1: Case study 1: List of peptidoglycan-related genes

| Gene | Gene product | Validation set | SA-2603 | EC-K12 |
|---|---|---|---|---|
| Peptidoglycan biosynthesis and associated pathways | | | | |
| murA | UDP-N-acetylglucosamine 1-carboxyvinyltransferase | C, B, M | SAG0843, SAG0866 | b3189 |
| murB | UDP-N-acetylmuramate dehydrogenase | C, B, M | SAG1112 | b3972 |
| murC | UDP-N-acetylmuramate–alanine ligase | C, B, M | SAG1615 | b0091 |
| murD | UDP-N-acetylmuramoylalanine–D-glutamate ligase | C, B, M | SAG0475 | b0088 |
| murE | UDP-N-acetylmuramoylalanyl-D-glutamate–2,6-diaminopimelate ligase | C, B, M | SAG1391 | b0085 |
| murF | UDP-N-acetylmuramoylalanyl-D-glutamyl-2,6-diaminopimelate–D-alanyl–D-alanine ligase | C, B, M | SAG0768 | b0086 |
| mraY | phospho-N-acetylmuramoyl-pentapeptide-transferase | C, B, M | SAG0288 | b0087 |
| murG | UDP-N-acetylglucosamine–N-acetylmuramyl-(pentapeptide)-pyrophosphoryl-undecaprenol N-acetylglucosamine transferase | C, B, M | SAG0476 | b0090 |
| UDP-N-acetylmuramate biosynthesis | | | | |
| glmU | glucosamine-1-phosphate N-acetyltransferase UDP-N-acetylglucosamine pyrophosphorylase | B, M | SAG1538 | b3730 |
| glmM | phosphoglucomutase / phosphomannomutase | B, M | SAG0887 | b3176 |
| glmS | glucosamine–fructose-6-phosphate aminotransferase (isomerizing) | B, M | SAG0944 | b3729 |
| Undecaprenyl biosynthesis and recycling | | | | |
| uppP/bacA | undecaprenyl pyrophosphate phosphatase | B, M | SAG0138 | b3057 |
| uppS/ispU | undecaprenyl diphosphate synthase | B, M | SAG1916 | b0174 |
| D-alanyl–D-alanine metabolism | | | | |
| ddl | D-alanine–D-alanine ligase | B, M | SAG0767 | b0381, b0092 |
| alr/dadX | alanine racemase | B, M | SAG1684 | b1190, b1190 |
| D-glutamate metabolism | | | | |
| murI | glutamate racemase | B, M | SAG1600 | b3967 |
| meso-diaminopimelate biosynthesis | | | | |
| dapF | diaminopimelate epimerase | B, M | | b3809 |
| dapE | N-succinyl-diaminopimelate deacylase | B, M | | b2472 |
| dapC | acetylornithine delta-aminotransferase | B, M | | b3359 |
| dapD | 2,3,4,5-tetrahydropyridine-2-carboxylate N-succinyltransferase | B, M | | b0166 |
| dapB | dihydrodipicolinate reductase | B, M | | b0031 |
| dapA | dihydrodipicolinate synthase | B, M | | b2478 |
| asd | aspartate-semialdehyde dehydrogenase | B, M | | b3433 |
| lysC | aspartokinase | B, M | | b4024 |
| Peptidoglycan modification | | | | |
| ami | N-acetylmuramoyl–L-alanine amidase | M | SAG0094 | |
| amiA | N-acetylmuramoyl–L-alanine amidase | M | | b2435 |
| amiB | N-acetylmuramoyl–L-alanine amidase | M | | b4169 |
| amiC | N-acetylmuramoyl–L-alanine amidase | M | | b2817 |
| ybjR | N-acetylmuramoyl–L-alanine amidase | M | | b0867 |
| ampD | N-acetyl-anhydromuramyl-L-alanine amidase | M | | b0110 |
| ampG | muropeptide transporter | M | | b0433 |
| mltA | membrane-bound lytic murein transglycosylase A | M | | b2813 |
| mltB | membrane-bound lytic murein transglycosylase B | M | | b2701 |
| mltC | membrane-bound lytic murein transglycosylase C | M | | b2963 |
| mltD | membrane-bound lytic murein transglycosylase D (predicted) | M | | b0211 |
| slt | lytic murein transglycosylase, soluble | M | | b4392 |
| mepA | murein DD-endopeptidase | M | | b2328 |
| glnA | glutamine synthetase | M | SAG1763 | b3870 |
| mpl | UDP-N-acetylmuramate:L-alanyl-gamma-D-glutamyl-meso-diaminopimelate ligase | M | | b4233 |
| Penicillin-binding proteins | | | | |
| pbp1A | penicillin-binding protein 1A | M | SAG0298 | b3396 |
| pbp1B | penicillin-binding protein 1B | M | SAG0159 | |
| pbp2/mrdA | penicillin-binding protein 2 | M | | b0635 |
| pbp2A | penicillin-binding protein 2A | M | SAG2066 | |
| pbp2B | penicillin-binding protein 2B | M | SAG0765 | |
| pbp2X | penicillin-binding protein 2X | M | SAG0287 | |
| pbp3/pbpB | penicillin-binding protein 3 | M | | b0084 |
| ftsI | carboxy-terminal protease for penicillin-binding protein 3 (EC-K12) | M | | b1830 |
| mrcB | fused glycosyl transferase and transpeptidase | M | | b0149 |
| pbp4 | penicillin-binding protein 4 | M | SAG0146 | |
| pbp5/dacA | penicillin-binding protein 5: D-alanyl–D-alanine carboxypeptidase | M | | b0632 |
| dacC | penicillin-binding protein 6a/b: D-alanyl–D-alanine carboxypeptidase | M | | b0839 |
| dacD | penicillin-binding protein 6a/b: D-alanyl–D-alanine carboxypeptidase | M | | b2010 |
| pbpG | D-alanyl–D-alanine endopeptidase | M | | b2134 |
| pbpC | transglycosylase/transpeptidase | M | | b2519 |
| mtgA | biosynthetic peptidoglycan transglycosylase | M | | b3208 |

The validation sets C (core), B (biosynthesis), and M (metabolism) are defined in Section 6.1.1. Abbreviations: SA-2603: Streptococcus agalactiae 2603; EC-K12: Escherichia coli K-12.

Statistical CGP

Four hundred and eighty-three fully sequenced bacterial genomes were downloaded from the National Center for Biotechnology Information (NCBI) FTP site¹. For the statistical CGP of peptidoglycan genes, 400 bacterial genomes known to produce peptidoglycan were selected from the downloaded genomes as positive genome examples. Genomes of 17 bacterial species lacking a cell wall, including Mycoplasma spp., Ureaplasma spp., Anaplasma spp. and Phytoplasma spp., were selected as negative genome examples. The genome examples used for statistical CGP in this experiment are listed in Appendix A.1.

For each of the candidate genes, the polypeptide sequence was queried against the nr database using the blastp program from the Basic Local Alignment Search Tool (BLAST). The search space comprised all chromosomal and plasmid genes from the 417 sequenced genomes. The existence of a potential homologous gene in a given genome was determined by an E-value threshold of < 10⁻⁵ and recorded in an occurrence matrix as described in Section 5.3.1. All scoring functions described in Section 5.3.3 were applied to produce candidate gene ranks.
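The thresholding step can be sketched as follows. The input format is hypothetical: a dictionary mapping each (gene, genome) pair to the lowest blastp E-value found, with missing pairs meaning that no hit was reported. The `contingency` helper additionally assumes that a homolog in a positive genome example counts as a true positive, and a homolog in a negative genome example as a false positive.

```python
E_VALUE_CUTOFF = 1e-5  # threshold stated in the text


def occurrence_matrix(best_evalues, genes, genomes, cutoff=E_VALUE_CUTOFF):
    """Map each gene to a 0/1 row: 1 if a hit below the cutoff exists in that genome."""
    matrix = {}
    for gene in genes:
        matrix[gene] = [
            1 if best_evalues.get((gene, genome), float("inf")) < cutoff else 0
            for genome in genomes
        ]
    return matrix


def contingency(row, genome_is_positive):
    """Count (TP, FP, FN, TN) for one gene across labelled genome examples."""
    tp = fp = fn = tn = 0
    for present, positive in zip(row, genome_is_positive):
        if present and positive:
            tp += 1
        elif present:
            fp += 1
        elif positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn
```

With 400 positive and 17 negative genome examples, each row has 417 entries, and the four counts returned by `contingency` are the inputs to the scoring functions of Section 5.3.3.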

Inductive CGP

The occurrences of each candidate gene in the same 417 genome examples were used as features for machine learning training. Each gene was labelled as positive if it belonged to one of the validation sets (C, B, or M), or negative if it was not listed in the validation set. All algorithms described in Section 5.4 were evaluated.

Evaluation

The relative position of a ranked candidate gene was measured in percentiles from the top of the rank (pct, Equation 5.13). The performance of statistical CGP was measured by the area under the ROC curve (AUC, Equation 5.23). While statistical CGP can be directly evaluated using the gene lists in Section 6.1.1, the performance of inductive CGP cannot be evaluated in the same way, because the performance estimate (AUC) becomes overly optimistic if the algorithm is tested with the same genes that were used to develop it. A better test is to evaluate the algorithm against a previously unseen set of genes. To obtain a more accurate estimate, the generalisation performance of inductive CGP was estimated by AUC using stratified 10-fold cross-validation. This involves an algorithm being developed on a randomly selected subset of 90% of the data and tested against the remaining 10%. Its performance is then assessed as the average over 10 such trials. The performance of each inductive CGP algorithm was thus obtained by averaging the areas under the receiver operating characteristic curves (AUC) over the 10 runs. The cross-validation procedure described in this section is identical to the procedure used in Section 3.3.1.

¹ ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/, as of April 2007

The average and maximum probability enrichments were calculated by using Equations 5.20 and 5.21.
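The stratified 10-fold procedure described above can be sketched in a few lines. This is a generic illustration and not the WEKA routine used for the reported results. The fold assignment scheme, the `seed` parameter, and the `evaluate` callback (a hypothetical function that trains on one index list and returns a score such as AUC on the other) are all assumptions.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign example indices to k folds so each fold keeps similar class ratios."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled examples of this class round-robin across folds
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def cross_validated_mean(labels, evaluate, k=10, seed=0):
    """Average evaluate(train_indices, test_indices) over the k held-out folds."""
    folds = stratified_folds(labels, k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        scores.append(evaluate(train, held_out))
    return sum(scores) / k
```

One caveat: when a class has fewer than k examples, at least one fold contains no example of that class, and a per-fold AUC is then undefined. The sketch does not address that case.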

Control

To compare the effectiveness of statistical CGP, the relative positions of the peptidoglycan genes were compared with an unrelated metabolic pathway (glycolysis genes) that acts as the control validation set. Genes encoding the glycolytic enzymes (12 in SA-2603 and 15 in EC-K12) were also identified from KEGG.

6.1.2 Results

Statistical CGP

Area under ROC curve analysis. The best scoring functions for rediscovering metabolic genes (M set) were amss, hmss, and npv (AUC > 0.970) using the whole genome of SA-2603 (2124 genes) as candidate genes. The chisq scoring function achieved an AUC of 0.959 in rediscovering M set genes. Of the 25 known peptidoglycan-related genes, all except one were identified within the top 13% (median: top 1.0 pct) in SA-2603 (Table 6.3 and Figure 6.2). The top-scored genes in the SA-2603 genome are listed in Table 6.2. For M set genes in EC-K12 (51 known genes out of 4134 genes in the bacterial genome), an AUC of 0.911 was achieved by amss, and the median of the rediscovered genes was at the top 3.2 pct of the rank (Table 6.4). For B set genes, the best AUC was 0.969 for EC-K12 and 0.980 for SA-2603, both using hmss. For C set genes, the best AUC was 0.989 for EC-K12 and 0.986 for SA-2603, both also achieved by hmss. In the control validation set (glycolysis), the AUCs from the amss scoring function were 0.398 for SA-2603 and 0.341 for EC-K12.

Probability enrichments. The performance of statistical CGP was also measured by the fold increase in precision (probability enrichment, Equation 5.21) compared with the non-prioritised rank. With the chisq scoring function, the ranked gene list achieved an average enrichment of 3.65-fold (maximum 28.3-fold) for SA-2603 and 3.16-fold (maximum 27-fold) for EC-K12. The probability enrichments of other validation sets are listed in Tables 6.3 and 6.4. The partial precision and precision-recall curves are shown in Figure 6.3.

[Figure 6.2: two panels, S. agalactiae 2603 and E. coli K-12. x-axis: validation set (C, B, M, Glycolysis); y-axis: rank position (pct, n-th percentile from top), 0 to 100.]

Figure 6.2: The performance of statistical CGP in rediscovering peptidoglycan-related genes (with amss scoring function). The validation sets C (core), B (biosynthesis), and M (metabolism) are defined in Section 6.1.1. The control validation sets (glycolysis) are included for both genomes for comparison of statistical CGP performance.

Table 6.2: Case study 1: SA-2603 genes ranked by amss scoring function

| Rank | pct | Score | COG | Valid. set | Gene | Locus | Function |
|---|---|---|---|---|---|---|---|
| 1 | 0.05 | 0.97 | M | M | pbp2A | SAG2066 | penicillin-binding protein 2A |
| 2 | 0.09 | 0.97 | O | | groES | SAG2075 | co- GroES |
| 3 | 0.14 | 0.96 | | | aroC | SAG1377 | chorismate synthase |
| 3 | 0.14 | 0.96 | M | B,M | alr | SAG1684 | alanine racemase |
| 5 | 0.24 | 0.96 | M | B,M | ddl | SAG0767 | D-alanylalanine synthetase |
| 6 | 0.28 | 0.96 | M | M | pbp1A | SAG0298 | penicillin-binding protein 1A |
| 7 | 0.33 | 0.96 | M | M | | SAG0159 | penicillin-binding protein 1B, putative |
| 7 | 0.33 | 0.96 | E | | aroB | SAG1378 | 3-dehydroquinate synthase |
| 9 | 0.42 | 0.96 | M | C,B,M | murD | SAG0475 | UDP-N-acetylmuramoyl–L-alanyl-D-glutamate synthetase |
| 9 | 0.42 | 0.96 | | | ftsW | SAG0761 | cell division protein, FtsW/RodA/SpoVE family |
| 9 | 0.42 | 0.96 | M | C,B,M | murF | SAG0768 | UDP-N-acetylmuramoylalanyl-D-glutamyl-2,6-diaminopimelate–D-alanyl-D-alanyl ligase |
| 9 | 0.42 | 0.96 | M | C,B,M | murA-1 | SAG0843 | UDP-N-acetylglucosamine 1-carboxyvinyltransferase |
| 9 | 0.42 | 0.96 | M | C,B,M | murA-2 | SAG0866 | UDP-N-acetylglucosamine 1-carboxyvinyltransferase |
| 14 | 0.66 | 0.96 | | M | pbpX | SAG0287 | penicillin-binding protein 2X |
| 14 | 0.66 | 0.96 | M | B,M | mraY | SAG0288 | phospho-N-acetylmuramoyl-pentapeptide-transferase |
| 14 | 0.66 | 0.96 | M | C,B,M | murC | SAG1615 | UDP-N-acetylmuramate–L-alanine ligase |
| 17 | 0.8 | 0.95 | M | | | SAG0140 | glycosyl transferase, group 4 family protein |
| 18 | 0.85 | 0.95 | E | | aroE | SAG1680 | shikimate 5-dehydrogenase |
| 19 | 0.89 | 0.95 | TK | | | SAG0322 | DNA-binding response regulator |
| 20 | 0.94 | 0.95 | E | | cysK | SAG0334 | cysteine synthase A |
| 21 | 0.99 | 0.95 | M | B,M | murE | SAG1391 | UDP-N-acetylmuramoylalanyl-D-glutamate–2,6-diaminopimelate ligase |
| 22 | 1.04 | 0.95 | T | | | SAG1507 | phoH family protein |
| 23 | 1.08 | 0.94 | M | B,M | glmU | SAG1538 | UDP-N-acetylglucosamine pyrophosphorylase |
| 24 | 1.13 | 0.94 | R | | | SAG0480 | ylmE protein, putative |
| 25 | 1.18 | 0.94 | O | | radA | SAG0110 | DNA repair protein RadA |
| 26 | 1.22 | 0.94 | Q | | dltA | SAG1790 | D-alanine–D-alanyl carrier protein ligase |
| 27 | 1.27 | 0.94 | | | rodA | SAG0621 | rod shape-determining protein RodA, putative |
| 28 | 1.32 | 0.94 | EH | | | SAG1528 | chorismate binding enzyme |
| 29 | 1.37 | 0.94 | O | | clpX | SAG1312 | ATP-dependent protease ATP-binding subunit |
| 29 | 1.37 | 0.94 | OU | | clpP | SAG1585 | ATP-dependent Clp protease proteolytic subunit |
| 31 | 1.46 | 0.94 | J | | rpsA | SAG1150 | 30S ribosomal protein S1 |
| 32 | 1.51 | 0.94 | H | | | SAG0498 | geranyltranstransferase, putative |
| 32 | 1.51 | 0.94 | H | | | SAG1738 | polyprenyl synthetase family protein |
| 34 | 1.6 | 0.94 | E | | | SAG0339 | aspartate kinase |
| 35 | 1.65 | 0.94 | O | | | SAG1530 | peptidyl-prolyl cis-trans isomerase, cyclophilin-type |
| 36 | 1.69 | 0.94 | I | | accC | SAG0352 | acetyl-CoA carboxylase |
| 37 | 1.74 | 0.93 | M | B,M | murI | SAG1600 | glutamate racemase |
| … | | | | | | | |
| Remaining peptidoglycan genes | | | | | | | |
| 43 | 2.02 | 0.93 | I | B,M | uppS | SAG1916 | undecaprenyl diphosphate synthase |
| 56 | 2.64 | 0.92 | M | B,M | glmS | SAG0944 | D-fructose-6-phosphate amidotransferase |
| 62 | 2.92 | 0.92 | | B,M | uppP | SAG0138 | undecaprenyl pyrophosphate phosphatase |
| 73 | 3.44 | 0.92 | M | C,B,M | murG | SAG0476 | N-acetylglucosaminyl transferase |
| 90 | 4.24 | 0.91 | M | M | | SAG0765 | penicillin-binding protein 2b |
| 108 | 5.08 | 0.90 | | B,M | glnA | SAG1763 | glutamine synthetase, type I |
| 159 | 7.49 | 0.86 | M | C,B,M | murB | SAG1112 | UDP-N-acetylenolpyruvoylglucosamine reductase |
| 264 | 12.43 | 0.81 | G | B,M | | SAG0887 | phosphoglucomutase/phosphomannomutase family protein |
| 276 | 12.99 | 0.81 | M | M | | SAG0146 | penicillin-binding protein 4, putative |
| 566 | 26.65 | 0.66 | NU | M | | SAG0094 | N-acetylmuramoyl–L-alanine amidase, family 4 protein |

Abbreviations: COG: Cluster of orthologous groups [175] (E: amino acid transport and metabolism; G: carbohydrate transport and metabolism; H: co-enzyme transport and metabolism; I: lipid transport and metabolism; K: transcription; M: cell wall, membrane, envelope biosynthesis and metabolism; N: cell motility; O: post-translational modification, protein turnover, chaperones; Q: secondary metabolites biosynthesis, transport and catabolism; R: general function prediction only; T: signal transduction mechanisms. COG functional category descriptions referenced from http://www.ncbi.nlm.nih.gov/COG/grace/fiew.cgi). Valid. set: validation sets (C: core; B: biosynthesis; M: metabolism; see Section 6.1.1). Locus: the position of gene in the SA-2603 genome. pct: position (in percentile) from the top of the rank. Rank: the order of gene in the prioritised gene list. Score: score by amss scoring function (arithmetic mean of sensitivity and specificity).

[Figure 6.3(a): "Folds-enrichment of precision at rank position". x-axis: rank position (fraction from the top of the rank), 0.0 to 1.0; y-axis: probability enrichment, η (folds), 0 to 80.]

(a) Partial precision graph. The best probability enrichment (85-fold) was achieved within the top 0.2 pct of the rank.

[Figure 6.3(b): "Precision-recall graph". x-axis: recall, 0.0 to 1.0; y-axis: precision, 0.0 to 1.0.]

(b) Precision-recall graph. Recall at 0.5 precision was approximately 60%.

Figure 6.3: Case study 1: partial precision and precision-recall graphs of the rank prioritised by amss scoring function on M validation set.

Table 6.3: Case study 1: Prioritisation performance in AUC (Streptococcus agalactiae 2603 V/R, 2124 genes)

| Methods | C (9 genes): AUC (η/ηmax) | B (18 genes): AUC (η/ηmax) | M (25 genes): AUC (η/ηmax) |
|---|---|---|---|
| Statistical CGP∗ (scoring functions) | | | |
| sens | 0.858 (2.0/5.4) | 0.853 (1.9/4.3) | 0.830 (1.8/3.8) |
| spec | 0.396 (0.5/1.5) | 0.427 (0.7/2.6) | 0.506 (1.1/5.2) |
| ppv | 0.420 (0.6/1.57) | 0.504 (1.3/29.5) | 0.590 (2.1/85.0) |
| npv | 0.966 (3.6/30.1) | 0.964 (3.5/21.7) | 0.978 (3.2/17.3) |
| amss | 0.985 (4.6/88.5) | 0.980 (4.4/59) | 0.970 (4.4/85.0) |
| hmss | 0.986 (4.8/88.5) | 0.980 (4.5/64) | 0.969 (4.5/85.0) |
| OR | 0.415 (0.5/1.57) | 0.509 (1.3/29.5) | 0.592 (2.1/85.0) |
| chisq | 0.978 (4.2/59.0) | 0.975 (3.9/34.7) | 0.959 (3.7/28.3) |
| bchisq | 0.978 (4.2/59.0) | 0.975 (3.9/34.7) | 0.960 (3.7/28.3) |
| F | 0.932 (3.3/32.9) | 0.915 (3.1/23.1) | 0.881 (2.8/18.5) |
| Inductive CGP† (machine learning algorithms) | | | |
| NB | 0.901 | 0.879 | 0.843 |
| LR | 0.980 | 0.905 | 0.887 |
| ADTree | 0.996 | 0.944 | 0.975 |
| IBk | 0.948 | 0.950 | 0.974 |
| J48 | 0.885 | 0.832 | 0.752 |
| SMO/Poly | 0.999 | 0.948 | 0.879 |
| SMO/RBF | 0.998 | 0.991 | 0.909 |

∗ Areas under ROC curve (AUC) tested against validation sets C, B, and M. † Average AUCs of stratified 10-fold cross-validations.
Abbreviations: η: average probability enrichment (n-fold); ηmax: maximum probability enrichment (n-fold); sens: sensitivity; spec: specificity; ppv: positive predictive value; npv: negative predictive value; amss: arithmetic mean of sensitivity and specificity; hmss: harmonic mean of sensitivity and specificity; OR: odds ratio; chisq: χ² scoring function; bchisq: signed χ² scoring function; F: F-measure; NB: naïve Bayes classifier; LR: logistic regression; ADTree: alternating decision tree; IBk: k-nearest neighbour classifier; J48: J48 decision tree; SMO: support vector machine trained by SMO algorithm; Poly: polynomial kernel; RBF: radial basis function kernel

Table 6.4: Case study 1: Prioritisation performance in AUC (Escherichia coli K-12, 4131 genes)

| Methods | C (8 genes): AUC (η/ηmax) | B (28 genes): AUC (η/ηmax) | M (51 genes): AUC (η/ηmax) |
|---|---|---|---|
| Statistical CGP∗ (scoring functions) | | | |
| sens | 0.913 (2.5/10.6) | 0.891 (2.3/6.0) | 0.818 (1.9/4.2) |
| spec | 0.321 (0.4/1.4) | 0.310 (0.4/1.2) | 0.418 (0.8/2.0) |
| ppv | 0.405 (0.8/5.2) | 0.423 (1.2/18.4) | 0.553 (1.7/28.6) |
| npv | 0.974 (3.9/42.0) | 0.956 (3.5/20.9) | 0.891 (2.8/13.2) |
| amss | 0.989 (4.8/110.) | 0.966 (4.1/53.7) | 0.911 (3.5/44.7) |
| hmss | 0.989 (4.9/113.) | 0.969 (4.2/55.3) | 0.909 (3.5/45.6) |
| OR | 0.403 (0.8/5.2) | 0.424 (1.2/18.4) | 0.552 (1.7/28.6) |
| chisq | 0.984 (4.7/73.8) | 0.963 (3.9/35.9) | 0.902 (3.2/27.0) |
| bchisq | 0.984 (4.7/73.8) | 0.963 (3.9/35.9) | 0.903 (3.2/27.0) |
| F | 0.965 (4.0/45.8) | 0.921 (3.2/22.5) | 0.838 (2.5/15.1) |
| Inductive CGP† (machine learning algorithms) | | | |
| NB | 0.930 | 0.889 | 0.820 |
| LR | 0.882 | 0.935 | 0.828 |
| ADTree | 0.976 | 0.981 | 0.925 |
| IBk | 0.998 | 0.929 | 0.946 |
| J48 | 0.935 | 0.828 | 0.752 |
| SMO/Poly | 0.997 | 0.876 | 0.933 |
| SMO/RBF | 0.963 | 0.932 | 0.964 |

∗ Areas under ROC curve (AUC) tested against validation sets C, B, and M. † Average AUCs of stratified 10-fold cross-validations.
Abbreviations: η: average probability enrichment (n-fold); ηmax: maximum probability enrichment (n-fold); sens: sensitivity; spec: specificity; ppv: positive predictive value; npv: negative predictive value; amss: arithmetic mean of sensitivity and specificity; hmss: harmonic mean of sensitivity and specificity; OR: odds ratio; chisq: χ² scoring function; bchisq: signed χ² scoring function; F: F-measure; NB: naïve Bayes classifier; LR: logistic regression; ADTree: alternating decision tree; IBk: k-nearest neighbour classifier; J48: J48 decision tree; SMO: support vector machine trained by SMO algorithm; Poly: polynomial kernel; RBF: radial basis function kernel

Inductive CGP

For SA-2603 peptidoglycan genes, both SVMs achieved AUCs higher than 0.99 in both the C and B sets of genes upon stratified 10-fold cross-validation, whereas ADTree had the best AUC of 0.975 for M set genes. The trained ADTree for identifying M set genes is shown in Figure 6.4. For the EC-K12 genome, the best AUC in the M set was achieved by SVM/RBF (0.964). The best AUCs of 0.998 and 0.981 were achieved in the rediscovery of C and B set genes (by IBk and ADTree, respectively).

Comparison with control validation set (glycolysis)

The best AUCs in the statistical CGP experiment were achieved by the amss and hmss scoring functions in prioritising peptidoglycan-related genes. The amss prioritised rank is compared against an unrelated validation set (glycolysis) to verify that the rank is specific to peptidoglycan-related genes. Eleven enzymes from the glycolysis pathway (equivalent to 12 genes in SA-2603 and 15 genes in EC-K12) were identified through KEGG. AUCs of 0.398 and 0.504 were achieved by the amss and hmss scoring functions respectively. Table 6.5 and Figure 6.2 show the positions of glycolysis genes in the rank produced by the amss scoring function.

[Figure 6.4: alternating decision tree diagram. Root node (Start): −2.196. Each decision node asks "Homolog in genome X?" and carries one score for "n" and one for "y". Node labels and n/y score pairs, listed in extraction order: first tier: Oen. oeni PSU-1, Clos. tetan. E88, Nit. europ., Wig. brevipalpis. (−2.534/0.539, −1.065/0.323, −1.633/0.188, −1.449/0.224); second tier: Ehr. ruminantium str. Welgevonden, Buc. aphidicol. Cc Cinara cedri, Hah. chejuensis KCTC 2396, Myc. mycoides. (0.428/−2.164, 0.466/−0.873, 0.095/−1.013, 0.532/−1.378); third tier: Ric. felis URRWXCal2, Por. gingivalis W83 (−0.46/0.623, 0.725/−1.471).]

Figure 6.4: Case study 1: the alternating decision tree (ADTree) model trained by using SA-2603 M validation set for the identification of peptidoglycan genes. The model predicts whether the gene is peptidoglycan-related by summing the scores from each of the preceding nodes (from root, Start). A higher score produced by ADTree ranks the candidate gene higher in the prioritised list. This model achieved an AUC of 0.975 estimated by using stratified 10-fold cross-validation. Abbreviations of genome names: Nit. europ.: Nitrosomonas europaea (GenBank accession: AL954747); Wig. brevipalpis.: Wigglesworthia brevipalpis (AB063523, BA000021); Oen. oeni PSU-1: Oenococcus oeni PSU-1 (CP000411); Clos. tetan. E88: Clostridium tetani E88 (AE015927, AF528097); Myc. mycoides.: Mycoplasma mycoides (BX293980); Ehr. ruminantium str.: Ehrlichia ruminantium str. Welgevonden (CR925678); Buc. aphidicol. Cc Cinara cedri.: Buchnera aphidicola Cc Cinara cedri (CP000263); Hah. chejuensis: Hahella chejuensis KCTC 2396 (CP000155); Ric. felis URRWXCal2: Rickettsia felis URRWXCal2 (CP000053–5); Por. gingivalis. W83: Porphyromonas gingivalis W83 (AE015924)

Table 6.5: Case study 1: positions of glycolysis genes using amss scoring functions on peptidoglycan genome examples

SA-2603 EC-K12 Gene Gene product Locus pct Locus pct glk glucokinase SAG0471 19.5 b2388 36.4 pgi glucose-6-phosphate isomerase SAG0402 95.8 b4025 5.9 pfkA 6-phosphofructokinase SAG0940 97.6 b1723 49.7 b3916 99.6 fba, dhnA fructose-bisphosphate aldolase SAG0127 97.7 b2097 96.1 122 b2925 99.7 gap glyceraldehyde-3-phosphate dehydrogenase SAG1768 94.0 b1779 3.0 pgk phosphoglycerate kinase SAG1766 94.5 b2926 2.4 gpm phosphoglycerate mutase family protein SAG0092 31.6 b0755 7.2 SAG0752 16.9 b3612 100.0 SAG0764 5.0 b4395 17.0 eno phosphopyruvate hydratase SAG0628 32.1 b2779 11.3 pyk pyruvate kinase SAG0941 55.7 b1676 83.9 b1854 6.8 yccX acylphosphatase SAG1607 38.4 b0968 9.5

Note: Locus: the position of gene in the SA-2603 or EC-K12 genome. pct: position (in percentile) from the top of the rank. 6.1.3 Discussion

Both statistical and inductive CGP achieved encouraging results, as demonstrated by the good AUCs in the rediscovery of peptidoglycan-related genes in both bacterial genomes. Statistical CGP achieved best AUCs of > 0.97 in these experiments, suggesting that these CGP methods are able to recover peptidoglycan genes with high accuracy. In addition, the probability enrichments indicate that these genes were up to 4.8 times more likely to be found after statistical CGP. A maximum enrichment of up to 113-fold suggested that some genes were ranked very highly on the list. In contrast, the genes of the control validation set (glycolysis) were scattered throughout the prioritised peptidoglycan rank with near-chance AUCs, indicating that the prioritised rank was specific to the peptidoglycan phenotype.

The specificity of the prioritised rank can be further examined by inspecting individual genes close to the top of the rank, where several genes indirectly related to peptidoglycan metabolism were discovered. For example, the gene product of ftsW is required for the localisation of the transpeptidase ftsI (penicillin-binding protein 3) during cell division [176]. The exact enzymatic function of the cell-shape determining protein rodA is unknown, but it is closely associated with penicillin-binding protein 2A and ftsI and is essential for the elongation of bacteria during the growth phase [177]. The gene dltA (which encodes a D-alanine–D-alanyl carrier protein ligase) is involved in the esterification of lipoteichoic acid, which is a major component and virulence factor found in Gram-positive cell walls [178]. Each of these genes has a specific functional role that is only present in cell-walled bacteria.

The results from inductive CGP were equally encouraging, with best AUCs close to 1.0 achieved by the ADTree, IBk and SVM algorithms. These results supported our hypothesis that genes with similar functions have similar occurrence patterns across genomes, which enables inductive CGP algorithms to discover genes with functions similar to those of the training genes.
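The hypothesis can be illustrated with a toy nearest-neighbour scorer in the spirit of IBk. This is only a minimal sketch and not the implementation used in this thesis: the occurrence profiles, the agreement-based similarity measure, the choice of k = 1 and all function names are illustrative assumptions.

```python
# Toy illustration: rank candidate genes by how closely their
# presence/absence profile matches the profiles of known (training) genes.
# All profiles are made up; 1 = homolog present in a genome, 0 = absent.

def similarity(profile_a, profile_b):
    """Fraction of genomes in which two occurrence profiles agree."""
    matches = sum(1 for a, b in zip(profile_a, profile_b) if a == b)
    return matches / len(profile_a)

def nearest_neighbour_score(candidate, training_profiles):
    """Score a candidate by its best agreement with any training gene (k = 1)."""
    return max(similarity(candidate, t) for t in training_profiles)

# Hypothetical training genes sharing a known function.
training = [
    [1, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 1],
]

# Hypothetical candidate genes to prioritise.
candidates = {
    "geneA": [1, 1, 1, 0, 1, 0],  # identical to a training profile
    "geneB": [0, 0, 1, 1, 0, 1],  # largely different
    "geneC": [1, 1, 0, 0, 1, 0],  # differs from a training profile in one genome
}

# Higher score = higher position in the prioritised list.
ranked = sorted(
    candidates,
    key=lambda g: nearest_neighbour_score(candidates[g], training),
    reverse=True,
)
print(ranked)
```

Candidates whose profiles agree with a training gene in most genomes rise to the top of the list, which is the behaviour the hypothesis relies on.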

In the selection of genome examples, only a handful of negative genome examples (from the evolutionary branch of the class Mollicutes) were available. Despite this, we were still able to recover peptidoglycan-related genes with very high accuracy. CGP performance could nevertheless be affected by sampling biases, which will be discussed in a later section. It is also unclear how many examples are needed to achieve optimal prioritisation. The effect of the number of examples on performance is investigated in the next section.

6.2 How many genome examples are needed to achieve optimal performance?

When selecting genome examples for statistical CGP, the number of genome examples needed to achieve the best prioritisation is uncertain. There is a concern that sampling biases could arise from the relatively few (17) negative genome examples in the peptidoglycan experiment. To determine how many genome examples are required, simulations were performed to investigate the relationship between the number of genome examples and CGP performance.

6.2.1 Methods

Three simulations were performed to investigate the effect of the number of genome examples on CGP performance. The first two simulations evaluated this effect on statistical CGP. The first simulation repeatedly applied the amss scoring function to randomly selected subsets of the 417 positive and negative genome examples, using genes from validation set M. The numbers of positive genome examples (Np) and negative genome examples (Nn) were gradually increased in each subset. For each combination of Np and Nn, 25 runs of prioritisation were performed and the median AUC was obtained. The second simulation was performed to determine the variability in performance. In this simulation, the proportion of positive to negative genome examples was kept the same (400:17), and the median and the range of AUC were obtained over 1000 runs for each Np and Nn. The last simulation was performed to determine the effect of the number of genome examples on inductive CGP performance. Twenty-five subsets of N genomes (from 417 genomes) were randomly selected as features, with N gradually increased from 1 to 417. For each N, stratified 10-fold cross-validation was performed with SVM/Poly using all genes from the SA-2603 genome as candidates. Median AUCs from the 25 random subsets of N genomes were obtained.
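The subsampling procedure of the first simulation can be sketched as follows. This is a simplified illustration under stated assumptions rather than the code used in the thesis: sensitivity is assumed to be the fraction of positive genome examples containing a homolog, specificity the fraction of negative genome examples lacking one, the AUC is computed by pairwise comparison of scores, and the tiny occurrence matrix and all function names are hypothetical.

```python
import random
from statistics import median

def amss(profile, pos_idx, neg_idx):
    """Arithmetic mean of sensitivity and specificity for one gene.

    Assumed definitions: sensitivity = fraction of positive genomes that
    contain a homolog; specificity = fraction of negative genomes that lack one.
    """
    sens = sum(profile[i] for i in pos_idx) / len(pos_idx)
    spec = sum(1 - profile[i] for i in neg_idx) / len(neg_idx)
    return (sens + spec) / 2

def auc(scores, labels):
    """Probability that a known gene outscores another gene (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def median_auc(matrix, labels, pos_genomes, neg_genomes, n_pos, n_neg,
               runs=25, seed=0):
    """Median AUC over repeated random subsets of genome examples."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(runs):
        pos_subset = rng.sample(pos_genomes, n_pos)
        neg_subset = rng.sample(neg_genomes, n_neg)
        scores = [amss(row, pos_subset, neg_subset) for row in matrix]
        aucs.append(auc(scores, labels))
    return median(aucs)

# Synthetic occurrence matrix: rows are genes, columns are genomes.
# Columns 0-3 are positive genome examples, columns 4-5 are negative ones.
matrix = [
    [1, 1, 1, 1, 0, 0],  # known gene, present only in positive genomes
    [1, 1, 1, 0, 0, 0],  # known gene
    [1, 0, 1, 0, 1, 1],  # other gene
    [0, 0, 0, 0, 1, 1],  # other gene
]
labels = [1, 1, 0, 0]  # 1 = gene in the validation set

print(median_auc(matrix, labels, [0, 1, 2, 3], [4, 5], n_pos=2, n_neg=1))
```

Repeating the call for increasing values of n_pos and n_neg gives one median AUC per combination, which is the quantity plotted against Np and Nn.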

6.2.2 Results

A total of 401×18×25 runs were performed in simulation 1. The median AUC increased monotonically with increasing Np and Nn (Figure 6.5). From the simulation data, a model of

\[
\widehat{\mathrm{AUC}}_{\mathrm{median}} = 0.98253 - \frac{0.05717}{\sqrt{N_p}} - \frac{0.04061}{\sqrt{N_n}}
\]

was obtained by non-linear regression with approximation by ordinary least squares, adjusted $R^2 = 0.8499$ (p-value $< 2.2 \times 10^{-16}$). From the regression model, the best median AUC was found to increase asymptotically towards 0.983 with increasing number of genome examples.
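The fitted model can be evaluated directly to see how quickly the predicted median AUC approaches its asymptote. The sketch below assumes that both fractions have a square root in the denominator, as the radical signs in the typeset equation suggest; the function name is hypothetical, and the three (Np, Nn) pairs are taken from the tick labels of Figure 6.6.

```python
from math import sqrt

def predicted_median_auc(n_pos, n_neg):
    """Regression model of simulation 1, with coefficients as reported in the text.

    Assumption: both terms are divided by the square root of the example count.
    """
    return 0.98253 - 0.05717 / sqrt(n_pos) - 0.04061 / sqrt(n_neg)

# Evaluate the model for a few (Np, Nn) pairs with a fixed 400:17 proportion.
for n_pos, n_neg in [(24, 1), (94, 4), (400, 17)]:
    print(n_pos, n_neg, round(predicted_median_auc(n_pos, n_neg), 3))
```

Under this reading, the model gives about 0.970 for the full set of 400 positive and 17 negative genome examples, and the prediction stays below the intercept of 0.98253 for any finite number of examples.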

The second simulation demonstrated the distribution of AUC with varying numbers of examples. The range of AUCs was found to be considerably larger with very few genome examples (Figure 6.6). However, a high median AUC (> 0.95) could still be achieved with as few as 4 negative genome examples in the rediscovery of M-set genes, compared with the maximum AUC of 0.97 using all 417 bacterial genomes.

The final simulation, performed on inductive CGP (with SVM/Poly), also achieved a median AUC > 0.90 with only 20 genome examples (Figure 6.7). It was noted, however, that the AUCs peaked at 70 to 160 genome examples (with corresponding AUCs between 0.93 and 0.95) and that performance gradually declined as more genome examples were added to the profile. Considerable variation in AUC was also noted when all 417 genomes were included in the profile panel (median AUC: 0.872; interquartile range: 0.858–0.898; range: 0.824–0.925).

[Figure 6.5 plot: Median AUC versus number of genome examples. Statistical CGP (amss scoring function) on S. agalactiae 2603 V/R peptidoglycan-related genes. Axes: median AUC (0.5 to 1.0), positive examples (Np, 0 to 400) and negative examples (Nn). Fitted surface: $\widehat{\mathrm{median}}(\mathrm{AUC}) = 0.98253 - 0.05717/\sqrt{N_p} - 0.04061/\sqrt{N_n}$, $r^2 = 0.8499$.]

Figure 6.5: Case study 1, simulation 1: median AUC versus number of genome examples (25 runs)

[Figure 6.6 plot: Areas under ROC curve versus number of genome examples; amss scoring function, peptidoglycan genes (M validation set). Vertical axis: area under ROC curve (0.70 to 1.00). Horizontal axis: number of genome examples, from 1 negative and 24 positive to 17 negative and 400 positive.]

Figure 6.6: Case study 1, simulation 2: median AUC versus number of genome examples (1000 runs, equal proportion of positive and negative genome examples)

[Figure 6.7 plot: Median AUC versus number of genome examples in inductive CGP; 25 runs of stratified 10-fold cross-validation, M validation set. Vertical axis: area under ROC curve (AUC, 0.4 to 1.0). Horizontal axis: number of genome examples (N, 0 to 417).]

Figure 6.7: Case study 1, simulation 3: AUC versus number of genome examples in inductive CGP (SVM/Poly algorithm) on Streptococcus agalactiae 2603 peptidoglycan-related genes, M validation set. Twenty-five simulation runs of stratified 10-fold cross-validation were performed. The black line indicates the median AUC, the grey solid lines indicate the upper and lower quartiles, and the dotted grey lines indicate the maximum and minimum AUC obtained from the simulations.

6.2.3 Discussion

The effect of the number of genome examples on statistical CGP was evaluated by two simulations. It was found that even with very few genome examples, the amss scoring function could still discover genes associated with peptidoglycan metabolism with high accuracy. The second simulation revealed a wide range of AUCs with very few (< 3) genomes, suggesting that careful selection of genome examples is needed when few genome examples are available to represent the phenotype of interest.

For inductive CGP, increasing the profile dimension beyond 200 genome examples resulted in decreases in the median AUC, implying that a proportion of genome examples was less informative and might not have contributed to the identification of genes of interest. This finding agrees with the observation by Jothi et al. that increasing the phylogenetic profile dimension with redundant genomes did not necessarily improve the accuracy of eukaryotic gene function prediction [179].

6.3 Case study 2: Anaerobic mixed-acid fermentation genes

In bacteria, energy generation for diverse cellular processes takes the form of aerobic or anaerobic respiration. Anaerobic bacteria can thrive in oxygen-deprived environments by utilising many molecules as terminal electron acceptors. Anaerobic respiration is another well-studied metabolic process in bacteriology. The experiments in this section evaluate both CGP methods by testing their ability to identify the genes responsible for anaerobic mixed-acid fermentation.

6.3.1 Methods

Enzymes responsible for anaerobic respiration and fermentation were identified from the pathway databases [173, 174] and literature searches [180, 181]. The statistical and inductive CGP methods described in Chapter 5 were again applied to derive the occurrence matrix and to rank candidate genes in order to rediscover anaerobic mixed-acid fermentation genes in EC-K12. All 4134 genes in the EC-K12 genome were used as candidates for prioritisation. For statistical CGP, 200 bacterial genomes of obligate and facultative anaerobes capable of performing anaerobic metabolism were selected as positive examples, and 142 genomes of obligate aerobes that do not perform anaerobic respiration were selected as negative examples (Appendix Table B.1). Both AUC and probability enrichment were obtained for all scoring functions (Equations 5.23 and 5.21) evaluated.

For inductive CGP, the occurrence patterns of the 4134 candidate genes in the same 342 genomes were used to train the supervised machine learning models described in Section 5.4. AUC with stratified 10-fold cross-validation was used to measure the generalisation performance of the inductive CGP algorithms.

6.3.2 Results

Statistical CGP on the anaerobic mixed-acid fermentation rediscovery task for EC-K12 performed poorly (AUC: 0.46–0.77). However, the maximum probability enrichment was high (up to 108-fold, Table 6.6). The average probability enrichments were between 0.8- and 3.2-fold across all the scoring functions. The chisq scoring function achieved the best prioritisation performance with an AUC of 0.767. The positions of the anaerobic genes are listed in Table 6.8.

Bacterial genes specific to anaerobic metabolism were ranked highly by amss (the pfl complex genes: within the top 0.27 pct; adhE: 2.1 pct; ackA: 2.1 pct; pta: 12 pct; Tables 6.7 and 6.8). In contrast, genes shared with aerobic respiration, for example the fumarase genes (fumABC) and the phosphoenolpyruvate carboxylase gene (ppc), were ranked much lower (61–96 pct and 57 pct respectively). For genes encoding the fumarate reductase complex, the results were mixed: the membrane anchor subunits (frdCD) were ranked highly (10.1 and 7.8 pct respectively), whereas the catalytic subunits (frdAB) were placed at the bottom of the rank (93 and 99 pct respectively).

Table 6.6: Case study 2: performance (in AUC) of anaerobic mixed-acid fermentation genes in E. coli K-12.

| Statistical CGP: scoring function | AUC (η/ηmax) | Inductive CGP*: algorithm | AUC |
|---|---|---|---|
| sens | 0.634 (1.2/1.8) | NB | 0.695 |
| spec | 0.464 (0.8/1.5) | LR | 0.796 |
| ppv | 0.519 (1.1/2.0) | ADTree | 0.780 |
| npv | 0.594 (1.8/11.0) | IBk | 0.860 |
| amss | 0.578 (2.4/96.6) | J48 | 0.663 |
| hmss | 0.628 (2.4/95.1) | SMO/Poly | 0.848 |
| OR | 0.537 (1.2/2.3) | SMO/RBF | 0.782 |
| chisq | 0.767 (3.2/109) | | |
| bchisq | 0.585 (2.5/109) | | |
| F | 0.698 (2.5/69.9) | | |

* Average AUCs of stratified 10-fold cross-validations.
Note: Thirty-eight genes were labelled as known (out of 4131 genes in the EC-K12 genome). The AUCs in inductive CGP were calculated using stratified 10-fold cross-validation.
Abbreviations: η: average probability enrichment (n-fold); ηmax: maximum probability enrichment (n-fold); sens: sensitivity; spec: specificity; ppv: positive predictive value; npv: negative predictive value; amss: arithmetic mean of sensitivity and specificity; hmss: harmonic mean of sensitivity and specificity; OR: odds ratio; chisq: χ2 scoring function; bchisq: signed χ2 scoring function; F: F-measure; NB: naïve Bayes classifier; LR: logistic regression; ADTree: alternating decision tree; IBk: k-nearest neighbour classifier; J48: J48 decision tree; SMO: support vector machine trained by SMO algorithm; Poly: polynomial kernel; RBF: radial basis function kernel

In contrast to statistical CGP, the best inductive CGP methods achieved reasonable performance. The best AUCs with inductive CGP were produced by IBk and by SVM/Poly (0.86 and 0.85).

Table 6.7: Case study 2: List of anaerobic mixed-acid fermentation genes in the validation set, ranked by amss scoring function

| Gene | Gene product | Locus | pct |
|---|---|---|---|
| Genes specific to mixed-acid fermentation | | | |
| Pyruvate formate-lyase complex | | | |
| pflA | pyruvate formate-lyase activating enzyme [E.C. 1.97.1.4] | b0902 | 0.19 |
| pflB | pyruvate formate lyase I [E.C. 2.3.1.54] | b0903 | 0.15 |
| pflD | formate C-acetyltransferase [E.C. 2.3.1.54] | b3951 | 0.17 |
| tdcE | pyruvate formate-lyase 4/2-ketobutyrate formate-lyase | b3114 | 0.05 |
| pflC | pyruvate formate lyase II activase | b3952 | 0.07 |
| ybiY | predicted pyruvate formate lyase activating enzyme | b0824 | 0.07 |
| yjjW | predicted pyruvate formate lyase activating enzyme | b4379 | 0.07 |
| ybiW | predicted pyruvate formate lyase | b0823 | 0.22 |
| yfiD | pyruvate formate lyase subunit | b2579 | 0.27 |
| Acetate formation | | | |
| pta | phosphate acetyltransferase [E.C. 2.3.1.8] | b2297 | 12.03 |
| ackA | acetate kinase [E.C. 3.6.1.7] | b2296 | 2.08 |
| Ethanol and lactate formation | | | |
| adhE | alcohol / acetaldehyde dehydrogenase [E.C. 1.1.1.1 and 1.2.1.10] | b1241 | 2.06 |
| ldhA | D-lactate dehydrogenase (NAD dependent) [E.C. 1.1.1.28] | b1380 | 66.28 |
| Formate hydrogenlyase complex | | | |
| hycA | regulator of the transcriptional regulator FhlA | b2725 | 30.19 |
| hycB | hydrogenase 3, Fe-S subunit | b2724 | 10.34 |
| hycC | hydrogenase 3, membrane subunit | b2723 | 99.27 |
| hycD | hydrogenase 3, membrane subunit | b2722 | 99.71 |
| hycE | hydrogenase 3, large subunit | b2721 | 99.83 |
| hycF | formate hydrogenlyase complex iron-sulphur protein | b2720 | 99.69 |
| hycG | hydrogenase 3 and formate hydrogenase complex, HycG subunit | b2719 | 99.93 |
| hycH | protein required for maturation of hydrogenase 3 | b2718 | 21.98 |
| hycI | protease involved in processing C-terminal end of HycE | b2717 | 23.00 |
| fdhD | formate dehydrogenase formation protein | b3895 | 70.61 |
| fdhE | formate dehydrogenase formation protein | b3891 | 15.47 |
| fdhF | formate dehydrogenase-H, selenopolypeptide subunit | b4079 | 88.31 |
| Fumarate/Succinate formation | | | |
| frdA | fumarate reductase (anaerobic) catalytic and NAD/flavoprotein subunit | b4154 | 93.00 |
| frdB | fumarate reductase (anaerobic), Fe-S subunit | b4153 | 99.15 |
| frdC | fumarate reductase (anaerobic), membrane anchor subunit | b4152 | 10.38 |
| frdD | fumarate reductase (anaerobic), membrane anchor subunit | b4151 | 7.84 |
| Genes shared with other pathways | | | |
| Pyruvate kinases | | | |
| pykF | pyruvate kinase I | b1676 | 41.06 |
| pykA | pyruvate kinase II | b1854 | 41.06 |
| ppc | phosphoenolpyruvate carboxylase | b3956 | 57.15 |
| Fumarate/Succinate formation | | | |
| mdh | malate dehydrogenase, NAD(P)-binding | b3236 | 2.35 |
| fumA | fumarate hydratase (fumarase A), aerobic Class I | b1612 | 61.49 |
| fumB | anaerobic class I fumarate hydratase (fumarase B) | b4122 | 61.49 |
| fumC | fumarate hydratase (fumarase C), aerobic Class II | b1611 | 95.59 |
| Ethanol and lactate formation | | | |
| lldD | FMN-linked L-lactate dehydrogenase | b3605 | 91.77 |
| adhP | ethanol-active dehydrogenase/acetaldehyde-active reductase | b1478 | 87.9 |

Note: Locus: position of gene in the EC-K12 genome. pct: position of gene in the prioritised rank (percentile from the top of the rank).

Table 6.8: Case study 2: gene rank produced by amss scoring function (anaerobic mixed-acid fermentation)

| Rank | pct | Score | COG | Gene | Locus | Function |
|---|---|---|---|---|---|---|
| 1 | 0.02 | 0.806 | F | nrdD | b4238 | anaerobic ribonucleoside-triphosphate reductase |
| 2 | 0.05 | 0.806 | C | tdcE | b3114 | pyruvate formate-lyase 4/2-ketobutyrate formate-lyase |
| 3 | 0.07 | 0.805 | O | ybiY | b0824 | predicted pyruvate formate lyase activating enzyme |
| 3 | 0.07 | 0.805 | O | pflC | b3952 | pyruvate formate lyase II activase |
| 3 | 0.07 | 0.805 | O | yjjW | b4379 | predicted pyruvate formate lyase activating enzyme |
| 6 | 0.15 | 0.803 | C | pflB | b0903 | pyruvate formate lyase I |
| 7 | 0.17 | 0.802 | C | pflD | b3951 | predicted formate acetyltransferase 2 (pyruvate formate lyase II) |
| 8 | 0.19 | 0.797 | O | pflA | b0902 | pyruvate formate lyase activating enzyme 1 |
| 9 | 0.22 | 0.787 | C | ybiW | b0823 | predicted pyruvate formate lyase |
| 10 | 0.24 | 0.781 | O | nrdG | b4237 | anaerobic ribonucleotide reductase activating protein |
| 11 | 0.27 | 0.781 | R | yfiD | b2579 | pyruvate formate lyase subunit |
| 12 | 0.29 | 0.774 | R | yhcC | b3211 | predicted Fe-S oxidoreductase |
| 13 | 0.31 | 0.765 | R | ybhA | b0766 | predicted hydrolase |
| 14 | 0.34 | 0.757 | E | pepT | b1127 | peptidase T |
| 15 | 0.36 | 0.748 | G | nagE | b0679 | fused N-acetyl glucosamine specific PTS enzyme: IIC, IIB, and IIA components |
| 15 | 0.36 | 0.748 | | crr | b2417 | glucose-specific enzyme IIA component of PTS |
| 15 | 0.36 | 0.748 | G | bglF | b3722 | fused beta-glucoside-specific PTS enzymes: IIA component/IIB component/IIC component |
| 18 | 0.44 | 0.746 | G | treB | b4240 | fused trehalose(maltose)-specific PTS enzyme: IIB component/IIC component |
| 19 | 0.46 | 0.744 | G | murP | b2429 | fused predicted PTS enzymes: IIB component/IIC component |
| 20 | 0.48 | 0.737 | G | ascF | b2715 | fused cellobiose/arbutin/salicin-specific PTS enzymes: IIB component/IC component |
| 21 | 0.51 | 0.734 | P | thiI | b0423 | sulfurtransferase required for thiamine and 4-thiouridine biosynthesis |
| 22 | 0.53 | 0.733 | G | ptsH | b2415 | phosphohistidinoprotein-hexose phosphotransferase component of PTS system (Hpr) |
| 23 | 0.56 | 0.732 | E | brnQ | b0401 | predicted branched chain amino acid transporter (LIV-II) |
| 24 | 0.58 | 0.725 | F | deoD | b4384 | purine-nucleoside phosphorylase |
| 25 | 0.61 | 0.721 | R | ybiV | b0822 | predicted hydrolase |
| 26 | 0.63 | 0.721 | G | nagB | b0678 | glucosamine-6-phosphate deaminase |
| 27 | 0.65 | 0.718 | P | nirC | b3367 | nitrite transporter |
| 28 | 0.68 | 0.715 | G | chbC | b1737 | N,N’-diacetylchitobiose-specific enzyme IIC component of PTS |
| 29 | 0.70 | 0.714 | | manZ | b1819 | mannose-specific enzyme IID component of PTS |
| 29 | 0.70 | 0.714 | | agaV | b3133 | N-acetylgalactosamine-specific enzyme IIB component of PTS |
| 31 | 0.75 | 0.713 | R | cof | b0446 | thiamine pyrimidine pyrophosphate hydrolase |
| 32 | 0.77 | 0.711 | S | yjjP | b4364 | predicted inner membrane protein |
| 33 | 0.80 | 0.710 | P | focB | b2492 | predicted formate transporter |
| 34 | 0.82 | 0.710 | | luxS | b2687 | S-ribosylhomocysteinase |
| 35 | 0.85 | 0.709 | | agaD | b3140 | N-acetylgalactosamine-specific enzyme IID component of PTS |
| 36 | 0.87 | 0.708 | G | agaI | b3141 | galactosamine-6-phosphate isomerase |
| 37 | 0.90 | 0.707 | | manY | b1818 | mannose-specific enzyme IIC component of PTS |
| 38 | 0.92 | 0.707 | GT | ulaC | b4195 | L-ascorbate-specific enzyme IIA component of PTS |
| 39 | 0.94 | 0.707 | | ulaA | b4193 | L-ascorbate-specific enzyme IIC component of PTS |
| 40 | 0.97 | 0.706 | | agaB | b3138 | N-acetylgalactosamine-specific enzyme IIB component of PTS |

Abbreviations: COG: Cluster of orthologous groups [175] (C: energy production and conversion; E: amino acid transport and metabolism; F: nucleotide transport and metabolism; G: carbohydrate transport and metabolism; O: post-translational modification, protein turnover, chaperones; P: inorganic ion transport and metabolism; R: general function prediction only; S: function unknown; T: signal transduction mechanisms). COG functional category descriptions referenced from http://www.ncbi.nlm.nih.gov/COG/grace/fiew.cgi. Locus: the position of gene in the EC-K12 genome. pct: position (in percentile) from the top of the rank. Rank: the order of gene in the prioritised gene list. Score: score by amss scoring function (arithmetic mean of sensitivity and specificity).

6.3.3 Discussion

In this experiment, anaerobic fermentation genes were prioritised using the same frameworks as in the peptidoglycan experiment. Statistical CGP performed poorly compared with its near-perfect performance in the peptidoglycan experiment. Despite the poor overall performance, the specificity of statistical CGP was again demonstrated. The results suggested that genes specific to anaerobic respiration were placed very highly on the rank (ηmax > 95-fold), whereas genes shared with the obligate aerobic bacteria (the negative genome examples) were ranked much lower. As these shared genes are present in both phenotypic groups, finding them by applying statistical CGP alone is a challenging task. Alternative methods are needed to aid in the discovery of such non-specific genes.

Inspection of the genes at the top of the rank revealed several “false positive” genes appearing among the correct genes listed in our validation set. Among these false positives, the nrdD and nrdG genes were found amongst the pfl genes within the top 10 (Table 6.8). The nrdD and nrdG genes encode the class III ribonucleotide reductase and its activating protein, which have been shown to be expressed only in an oxygen-deprived environment in anaerobic bacteria [182]. Thus, the apparent false positive genes could be attributed to the discovery of unrelated anaerobic genes that are not specific to the fermentation process. Filters based on other knowledge about the gene, such as gene ontology or clusters of orthologous groups (COG), could be applied to limit the candidate genes to a particular functional group and thus improve the practicality of the CGP methods.
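The filtering idea suggested above can be sketched with the ten highest-ranked genes of Table 6.8. This is an illustrative example only: no such filter was applied in this experiment, and the function name and the choice of COG category C as the category of interest are assumptions.

```python
# Rows transcribed from the top of Table 6.8: (rank, COG category, gene).
top_ranked = [
    (1, "F", "nrdD"),
    (2, "C", "tdcE"),
    (3, "O", "ybiY"),
    (3, "O", "pflC"),
    (3, "O", "yjjW"),
    (6, "C", "pflB"),
    (7, "C", "pflD"),
    (8, "O", "pflA"),
    (9, "C", "ybiW"),
    (10, "O", "nrdG"),
]

def filter_by_cog(rows, allowed):
    """Keep only the candidates whose COG category is in `allowed`."""
    return [row for row in rows if row[1] in allowed]

# Hypothetical choice: restrict candidates to category C
# (energy production and conversion).
energy_genes = [gene for _, _, gene in filter_by_cog(top_ranked, {"C"})]
print(energy_genes)
```

Restricting the list to category C removes nrdD and nrdG, but it also removes the pyruvate formate-lyase activating enzymes annotated under category O, so the choice of categories involves a trade-off between precision and recall.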

The better AUCs (> 0.84) achieved by the inductive CGP algorithms (IBk and SMO/Poly) compared with statistical CGP may be explained as follows. In the statistical CGP of anaerobic fermentation genes, specific gene occurrence patterns in genomes were explicitly specified, such that the relevant genes were assumed to occur more frequently in anaerobic bacteria and less frequently in aerobic bacteria. However, the better results from the inductive CGP experiment indicate that the actual occurrence pattern differs from the pattern specified by the statistical CGP experiment groupings. While statistical CGP is useful in discovering genes specific to a function, inductive CGP can identify functionally similar genes from a set of genes with known functional characteristics. It seems reasonable to use machine learning to assess whether a candidate gene is likely to be associated with a particular function. To validate this claim, a large-scale experiment was performed to further examine the generalisability of inductive CGP.

6.4 Case study 3: Validation of inductive CGP by using KEGG pathways

Kyoto Encyclopaedia of Genes and Genomes (KEGG, http://www.genome.jp/kegg/) is an online resource that provides systematic curation of large collections of gene and genome data annotated into major functional categories [173]. In this comprehensive library, biochemical pathways are linked with gene loci in different genomes, which provides high-quality validation sets for rediscovery experiments. In this section, KEGG pathways were used to evaluate the generalisability of inductive CGP methods.

6.4.1 Methods

A large-scale rediscovery experiment was conducted on the curated KEGG metabolic pathways. Thirty-one metabolic pathways and functional groups with at least 10 genes involved in each pathway were selected for evaluation from the 81 known pathways available for the SA-2603 genome in KEGG (Table 6.9). The same occurrence matrix of 2124 genes in 483 available genomes was obtained as described in Section 6.1. All seven inductive CGP algorithms described in Section 5.4 were applied, and the generalisation performance of the algorithms was evaluated by stratified 10-fold cross-validation. Biochemical pathways with fewer than 10 genes were not selected as validation sets, as there were insufficient genes for training the positive class in each fold (< 1 positive training gene per fold), which does not allow meaningful comparisons by stratified 10-fold cross-validation.
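The reason for excluding small pathways can be illustrated by counting how many positive genes each fold receives when the positives are spread as evenly as possible over 10 folds. This is a counting sketch only, with a hypothetical function name; the cross-validation software used in the experiments may assign genes to folds differently in detail.

```python
def positives_per_fold(n_positive, k=10):
    """Spread n_positive genes over k folds as evenly as possible,
    which is what stratification aims to achieve."""
    base, extra = divmod(n_positive, k)
    return [base + 1 if i < extra else base for i in range(k)]

# A pathway with 7 genes, one with exactly 10, and one with 27 genes.
for n in (7, 10, 27):
    folds = positives_per_fold(n)
    print(n, folds, "folds without a positive gene:", folds.count(0))
```

With fewer than 10 positive genes, at least one fold contains no positive gene at all, so a per-fold AUC cannot be computed for that fold.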

6.4.2 Results

The best supervised machine learning algorithms identified 14 pathways (45%) with AUCs > 0.90 and 27 pathways (87%) with AUCs > 0.80 (Figure 6.8). The best performing algorithm was IBk, which had the highest AUC in 10 pathways. ADTree and SVM/Poly also performed well, each with the best AUC in 8 pathways. SVM/RBF achieved the best AUC in 4 pathways and LR in one. NB and J48 did not achieve the best AUC in any of the 31 pathways studied. The AUCs are outlined in Table 6.10.
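The threshold counts can be checked by tallying the best AUC of each pathway. The sketch below uses the row maxima transcribed from Table 6.10, in the order of the table rows; the variable and function names are hypothetical.

```python
# Best AUC per pathway (maximum of each row of Table 6.10, in table order).
best_auc = [
    0.914, 0.809, 0.960, 0.923, 0.962, 0.784, 0.799, 0.860,
    0.994, 0.861, 0.795, 0.875, 0.872, 0.843, 0.908, 0.858,
    0.906, 0.939, 0.948, 0.940, 0.729, 0.990, 0.985, 0.857,
    0.813, 0.816, 0.868, 0.866, 0.893, 0.936, 0.930,
]

def count_above(values, threshold):
    """Number of values strictly greater than the threshold."""
    return sum(1 for v in values if v > threshold)

n_pathways = len(best_auc)
for threshold in (0.90, 0.80):
    count = count_above(best_auc, threshold)
    share = 100 * count / n_pathways
    print(f"AUC > {threshold:.2f}: {count} of {n_pathways} pathways ({share:.0f}%)")
```

The same list can be reused to inspect other thresholds, or paired with the pathway names to see which pathways fall below a given AUC.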

[Figure 6.8 plot: Inductive CGP performance in evaluating 31 KEGG metabolic pathways. One row per pathway; horizontal axis: area under ROC curve (0.5 to 1.0); one marker per algorithm (NB, LR, ADTree, IBk, J48, SMO/Poly, SMO/RBF).]

Figure 6.8: Case study 3: the performance of inductive CGP in prioritising 31 KEGG metabolic pathways (stratified 10-fold cross-validation). Genes from the 31 metabolic pathways from S. agalactiae 2603 genome were rediscovered by 7 machine learning algorithms. Abbreviations: NB: naïve Bayes classifier; LR: logistic regression; ADTree: alternating decision tree; IBk: k-nearest neighbour classifier; J48: J48 decision tree; SMO: support vector machine trained by SMO algorithm; Poly: polynomial kernel; RBF: radial basis function kernel

Table 6.9: Case study 3: KEGG pathways (SA-2603) used in the evaluation of inductive CGP

| Pathway ID | Description | Genes |
|---|---|---|
| sag00061 | Fatty acid biosynthesis | 10 |
| sag00271 | Methionine metabolism | 10 |
| sag00640 | Propanoate metabolism | 10 |
| sag00790 | Folate biosynthesis | 10 |
| sag00190 | Oxidative phosphorylation | 11 |
| sag00550 | Peptidoglycan biosynthesis | 11 |
| sag00650 | Butanoate metabolism | 11 |
| sag00040 | Pentose and glucuronate interconversions | 13 |
| sag00052 | Galactose metabolism | 13 |
| sag00350 | Tyrosine metabolism | 13 |
| sag00330 | Arginine and proline metabolism | 14 |
| sag00400 | Phenylalanine, tyrosine and tryptophan biosynthesis | 14 |
| sag00530 | Aminosugars metabolism | 14 |
| sag03060 | Protein export | 14 |
| sag00710 | Carbon fixation | 15 |
| sag02020 | Two-component system - General | 15 |
| sag00260 | Glycine, serine and threonine metabolism | 17 |
| sag00500 | Starch and sucrose metabolism | 18 |
| sag03030 | DNA replication | 18 |
| sag00051 | Fructose and mannose metabolism | 19 |
| sag00030 | Pentose phosphate pathway | 20 |
| sag00252 | Alanine and aspartate metabolism | 20 |
| sag00620 | Pyruvate metabolism | 21 |
| sag00251 | Glutamate metabolism | 22 |
| sag00970 | Aminoacyl-tRNA biosynthesis | 22 |
| sag00010 | Glycolysis / Gluconeogenesis | 27 |
| sag02060 | Phosphotransferase system (PTS) | 29 |
| sag00240 | Pyrimidine metabolism | 48 |
| sag00230 | Purine metabolism | 55 |
| sag03010 | Ribosome | 56 |
| sag02010 | ABC transporters - General | 111 |

List of available KEGG pathways of SA-2603 genome (accessed April 2007): http://www.genome.jp/kegg-bin/show_organism?menu_type=pathway_maps&org=sag.

Table 6.10: Case study 3: cross-validation results of inductive CGP algorithms on 31 selected KEGG pathways

| KEGG pathway or functional group | NB | LR | ADTree | IBk | J48 | SVM/P | SVM/R |
|---|---|---|---|---|---|---|---|
| ABC transporters - General | 0.754 | 0.861 | 0.873 | **0.914** | 0.824 | 0.912 | 0.905 |
| Alanine and aspartate metabolism | 0.802 | 0.737 | 0.770 | 0.737 | 0.785 | 0.795 | **0.809** |
| Aminoacyl-tRNA biosynthesis | 0.886 | 0.928 | **0.960** | 0.905 | 0.926 | 0.929 | 0.893 |
| Aminosugars metabolism | 0.725 | 0.674 | **0.923** | 0.827 | 0.639 | 0.899 | 0.762 |
| Arginine and proline metabolism | 0.766 | **0.962** | 0.907 | 0.875 | 0.821 | 0.935 | 0.943 |
| Butanoate metabolism | 0.589 | 0.750 | 0.681 | 0.765 | 0.563 | 0.671 | **0.784** |
| Carbon fixation | 0.789 | 0.724 | 0.764 | **0.799** | 0.715 | 0.791 | 0.728 |
| DNA replication | 0.790 | 0.789 | 0.853 | 0.810 | 0.854 | **0.860** | 0.791 |
| Fatty acid biosynthesis | 0.893 | 0.989 | **0.994** | 0.974 | 0.776 | 0.988 | 0.857 |
| Folate biosynthesis | 0.751 | 0.787 | 0.849 | **0.861** | 0.607 | 0.836 | 0.812 |
| Fructose and mannose metabolism | 0.656 | 0.524 | 0.723 | **0.795** | 0.632 | 0.787 | 0.626 |
| Galactose metabolism | 0.643 | 0.742 | 0.791 | 0.811 | 0.743 | **0.875** | 0.831 |
| Glutamate metabolism | 0.766 | 0.767 | 0.810 | **0.872** | 0.746 | 0.802 | 0.797 |
| Glycine and serine and threonine metabolism | 0.746 | 0.772 | **0.843** | 0.755 | 0.730 | 0.802 | 0.812 |
| Glycolysis / Gluconeogenesis | 0.752 | 0.879 | **0.908** | 0.873 | 0.808 | 0.894 | 0.879 |
| Methionine metabolism | 0.739 | 0.846 | 0.812 | 0.753 | 0.777 | **0.858** | 0.793 |
| Oxidative phosphorylation | 0.666 | 0.740 | 0.785 | **0.906** | 0.720 | 0.806 | 0.751 |
| Pentose and glucuronate interconversions | 0.803 | 0.720 | 0.919 | 0.933 | 0.721 | **0.939** | 0.808 |
| Pentose phosphate pathway | 0.717 | 0.795 | 0.898 | **0.948** | 0.737 | 0.894 | 0.852 |
| Peptidoglycan biosynthesis | 0.760 | 0.882 | 0.853 | 0.902 | 0.653 | 0.757 | **0.940** |
| Phenylalanine and tyrosine and tryptophan biosynthesis | 0.709 | 0.663 | 0.668 | 0.715 | 0.621 | 0.682 | **0.729** |
| Phosphotransferase system (PTS) | 0.883 | 0.949 | 0.962 | 0.961 | 0.864 | **0.990** | 0.975 |
| Propanoate metabolism | 0.739 | 0.969 | **0.985** | 0.964 | 0.675 | 0.905 | 0.919 |
| Protein export | 0.646 | 0.788 | 0.745 | 0.692 | 0.646 | **0.857** | 0.749 |
| Purine metabolism | 0.740 | 0.776 | 0.789 | **0.813** | 0.783 | 0.797 | 0.791 |
| Pyrimidine metabolism | 0.761 | 0.666 | **0.816** | 0.741 | 0.678 | 0.713 | 0.718 |
| Pyruvate metabolism | 0.749 | 0.789 | 0.853 | 0.844 | 0.503 | **0.868** | 0.824 |
| Starch and sucrose metabolism | 0.792 | 0.738 | 0.766 | **0.866** | 0.746 | 0.858 | 0.772 |
| Two-component system - General | 0.783 | 0.698 | 0.772 | 0.772 | 0.723 | **0.893** | 0.840 |
| Tyrosine metabolism | 0.760 | 0.806 | 0.693 | **0.936** | 0.654 | 0.727 | 0.832 |
| Ribosome | 0.794 | 0.851 | **0.930** | 0.897 | 0.812 | 0.910 | 0.922 |
| Number of pathways in which the algorithm achieved best AUC | 0 | 1 | 8 | 10 | 0 | 8 | 4 |

Values shown in this table are the areas under ROC curves. The bold values indicate the best AUC achieved in the pathway evaluated. NB: naïve Bayes classifier; LR: logistic regression; ADTree: alternating decision tree; IBk: k-nearest neighbour; J48: J48 decision tree; SVM: support vector machine; SVM/P: SVM with polynomial kernel; SVM/R: SVM with radial basis kernel.

6.4.3 Discussion

Evolutionary pressure may contribute to specific gene occurrence patterns

This experiment demonstrated that inductive CGP performs well in rediscovering many KEGG metabolic pathways, suggesting that genes encoding complex phenotypes are frequently co-present and co-absent across genomes, forming specific occurrence patterns that allow functional prediction by supervised machine learning. This gene co-occurrence phenomenon may reflect the process of natural selection and the adaptation of microorganisms to different evolutionary niches.

For bacteria undergoing positive selection, the acquisition of a particular gene group may result in phenotypes conferring a survival advantage for the microorganism to adapt to a new environment. It is known that genes contributing to symbiosis or pathogenesis are frequently organised into genomic islands, in which gene mobility is facilitated by horizontal gene transfer, conferring the ability to form a new relationship with the host [183]. The good AUC achieved by inductive CGP in KEGG pathways suggested that such specific functional co-occurrence patterns of genes do exist, regardless of the physical proximity of the genes or the presence of a mobile genetic structure.

Similarly, negative selection can also contribute to the co-absence of functional gene units across multiple genomes. For a complex phenotype encoded by multiple genes, the deletion of a critical gene could result in the non-expression of the phenotype, leading to the subsequent loss of the other, now non-functional, genes over time. Conversely, as illustrated in Figure 6.9, the loss of a gene with a minor, non-critical contribution to a phenotype p (illustrated by g3) would see retention of the co-occurrence pattern of the phenotype-critical genes. Thus, the differential co-occurrence patterns in genes can be exploited for comparative genomic studies, as demonstrated by our methods, in assisting our understanding of gene functions.

[Figure 6.9 (flow diagram). An ancestral species displays phenotype p, encoded by genes g1, g2, g3, and g4. Left branch: loss of non-critical gene g3; phenotype p persisted; phenotype p selected for; further purifying selection; retention of functional genes g1, g2, g4; co-presence of g1, g2, g4 in descendant species. Right branch: loss of critical gene g2; no phenotype p; phenotype p selected against; further purifying selection; loss of non-functional genes g1, g3, g4; co-absence of g1, g2, g4 in descendant species.]

Figure 6.9: Potential mechanism for the co-occurrence of functionally similar genes in genomes
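The co-presence and co-absence pattern described for Figure 6.9 can be made concrete with a small sketch. The six genomes and the 0/1 profiles below are hypothetical (they are not data from this thesis), and the `agreement` measure is only one simple way of quantifying how similar two occurrence profiles are.

```python
def cooccurrence(profile_a, profile_b):
    """Count co-present, co-absent and discordant genomes for two 0/1 profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    pairs = list(zip(profile_a, profile_b))
    co_present = sum(1 for a, b in pairs if a and b)
    co_absent = sum(1 for a, b in pairs if not a and not b)
    discordant = len(pairs) - co_present - co_absent
    return co_present, co_absent, discordant


def agreement(profile_a, profile_b):
    """Fraction of genomes in which both genes are present or both are absent."""
    co_present, co_absent, _ = cooccurrence(profile_a, profile_b)
    return (co_present + co_absent) / len(profile_a)


# Hypothetical profiles across six genomes (1 = homolog present, 0 = absent).
# Genomes 1-3 retain phenotype p; genomes 4-6 descend from a lineage that
# lost the critical gene g2 and subsequently the other phenotype-critical genes.
profiles = {
    "g1": [1, 1, 1, 0, 0, 0],
    "g2": [1, 1, 1, 0, 0, 0],
    "g3": [1, 0, 1, 0, 1, 0],  # non-critical gene, gained or lost independently
    "g4": [1, 1, 1, 0, 0, 0],
}

for gene in ("g2", "g3", "g4"):
    print("g1 vs", gene, agreement(profiles["g1"], profiles[gene]))
```

In this toy example the phenotype-critical genes g1, g2 and g4 agree in every genome, whereas the non-critical gene g3 agrees with g1 in only four of the six genomes.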

6.5 Potential limitations

Sampling biases

As discussed in Chapter 4.7.1, prioritising candidate genes with reliance on gene features derived from secondary data sources such as literature, ontology, or annotation as background knowledge may introduce literature or annotation biases [148, 155, 161]. While we minimised the use of external knowledge to avoid such biases, several sources of sampling bias could have limited the performance of our gene prioritisation methods. With inductive CGP, performance may be impeded by incorporating inconsistent (or wrong) genes in the training set. Increasing the heterogeneity of the training genes may adversely influence prioritisation performance. An example can be found in the slight decrease of performance (in median AUC) from C to M validation sets in the peptidoglycan experiments in EC-K12 candidates, as shown in Figure 6.2. With statistical CGP, accuracy can be affected by ad hoc selection of genome examples, especially by the choice of positive and negative examples representing the variations in the study phenotype.

Using KEGG as a validation data source

There were considerable variations in inductive CGP performance across different KEGG categories (Case study 3). Manual inspection of the worst-performing functional category (phenylalanine, tyrosine, and tryptophan biosynthesis) showed that the phenylalanine and tyrosine tRNA synthetase genes were also included in the validation set. The tRNA synthetases have roles downstream of the biosynthesis pathways and thus are not involved in the anabolism of these essential amino acids. Removal of the unrelated genes improved overall performance (with a best AUC of 0.852 achieved by SVM/Poly, see Appendix C). This contrasts with the best-performing pathways (for example, the fatty acid biosynthesis and peptidoglycan synthesis pathways), where only function-specific genes were included in the validation set. Since KEGG is a commonly used resource for benchmarking computational methods of functional discovery [165, 166, 179], this finding suggests that validation sets must be constructed with care, as a mixture of distinct functional groups could lead to inconsistent results. This specific sampling bias needs to be considered when explaining variations in the prediction of gene functions by in silico methods.
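The effect of a mixed validation set on the AUC can be illustrated with a minimal sketch. The scores and labels below are hypothetical, and the AUC is computed with the standard rank-based (Mann-Whitney) formulation rather than with the software used in this thesis.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical prioritisation scores for six genes.
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]

# Clean validation set: only the three pathway-specific genes are positives.
clean_labels = [1, 1, 1, 0, 0, 0]

# Mixed validation set: a functionally unrelated gene (score 0.2) is also
# labelled as a pathway member.
mixed_labels = [1, 1, 1, 0, 1, 0]

print(auc(scores, clean_labels))  # 1.0
print(auc(scores, mixed_labels))  # 0.875
```

In this toy case the classifier scores are identical in both runs; only the composition of the validation set changes the measured AUC.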

The inclusion of paralogs in the occurrence matrix

Reciprocal best BLAST hits are frequently used in the search for orthologous genes. In our experiments, we applied a non-reciprocal BLAST E-value < 10^-5 as the criterion for determining the sequence similarity between genes. While our results supported its use in functional discovery at the gene level, such a criterion may include many paralogs and may affect CGP performance for large gene families with diverse functions. Detecting and excluding paralogs may be required to refine the gene ranking and warrants further study.
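The difference between the two criteria can be sketched as follows. This is a minimal illustration under stated assumptions: the hit lists, gene names and the helper names `occurrence_row` and `is_reciprocal_best_hit` are hypothetical, and the code does not reproduce the pipeline used in this thesis.

```python
E_VALUE_CUTOFF = 1e-5  # threshold stated in the text


def occurrence_row(hits, genomes, cutoff=E_VALUE_CUTOFF):
    """Build one row of an occurrence matrix from non-reciprocal hits.

    hits: list of (genome, e_value) pairs for a single candidate gene.
    Returns a 0/1 list ordered as `genomes`.
    """
    present = {genome for genome, e_value in hits if e_value < cutoff}
    return [1 if genome in present else 0 for genome in genomes]


def is_reciprocal_best_hit(gene, forward_best, reverse_best):
    """True if the best hit of `gene` in the target genome has `gene`
    as its own best hit back in the query genome."""
    target = forward_best.get(gene)
    return target is not None and reverse_best.get(target) == gene


# Hypothetical hits of one candidate gene against three genomes.
genomes = ["genome_A", "genome_B", "genome_C"]
hits = [("genome_A", 1e-30), ("genome_B", 1e-3),
        ("genome_C", 1e-6), ("genome_C", 0.5)]
print(occurrence_row(hits, genomes))  # [1, 0, 1]

# Hypothetical paralog case: two query genes both hit the same target gene,
# but the target's best hit back is only one of them.
forward_best = {"geneX": "target1", "geneX_paralog": "target1"}
reverse_best = {"target1": "geneX"}
print(is_reciprocal_best_hit("geneX", forward_best, reverse_best))          # True
print(is_reciprocal_best_hit("geneX_paralog", forward_best, reverse_best))  # False
```

Under the non-reciprocal criterion both query genes would be scored as present in the target genome, which is how paralogs can enter the occurrence matrix.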

6.6 Summary

A range of rediscovery experiments for benchmarking different CGP approaches was designed and performed in this chapter. Promising results were obtained from the rediscovery of peptidoglycan-related genes, anaerobic respiration genes, and a diverse range of bacterial metabolic pathways listed in the KEGG database. The CGP methods can be generalised and extrapolated to perform gene function discovery when a pair of positive and negative genome examples is available (statistical CGP) or when a subset of genes with known function can be used for training machine learning models (inductive CGP). With more genome sequences becoming available, it is anticipated that the demand for such methods will grow as many different scenarios can be formulated and analysed. In the remaining chapters of this thesis, both CGP methods will be applied to GBS to discover unknown virulence genes that may further explain the pathobiology of GBS neonatal infections.

Part III

Virulence gene discovery in Streptococcus agalactiae

Motivation

The task of identifying virulence factors from more than 2000 genes in the S. agalactiae genome is not trivial. In Chapter 3, we attempted to predict the virulence of GBS using bacterial genotyping data, but the models did not reach sufficient discriminatory power for clinical utility. It was concluded that GBS markers associated with true virulence genes were needed to improve the power of prediction. Identifying such virulence factors can advance the development of a typing system that improves risk stratification of GBS infective diseases.

In Chapter 6, promising results were demonstrated in rediscovering genes specific to particular functions (statistical CGP) and genes with similar functions (inductive CGP). By considering bacterial virulence as a special trait, these computational methods can be applied to guide the identification of unknown virulence factors in S. agalactiae.

The last part of this thesis applies the in silico methods to the task of GBS virulence gene discovery. Chapter 7 reviews the literature to enumerate the currently known GBS virulence factors. In Chapter 8, statistical CGP is used to discover potential virulence genes shared among bacterial pathogens causing neonatal sepsis. Similarly, inductive CGP is applied in Chapter 9 to identify genes sharing similar occurrence profiles with currently known GBS virulence factors. The putative biological significance of the discovered virulence gene candidates is discussed in Chapter 10.

[Figure: Schematic diagram of GBS virulence gene prioritisation. Statistical CGP (Chapter 8): frequent neonatal pathogens (30 genomes) and non-pathogens (331 genomes) are selected from the available genome sequences following a literature review (Appendix), with infrequent neonatal pathogens shown as a separate group; scoring functions are applied to the occurrence matrix to identify genes frequently co-present and co-absent in neonatal pathogens, yielding a prioritised rank. Inductive CGP (Chapter 9): known virulence genes (vir. gene 1 to vir. gene 134) identified by literature review (Chapter 7) in the GBS reference genomes are used to model the occurrence patterns of virulence genes by supervised machine learning; the candidate genes to be prioritised are all GBS genes in three genomes, yielding a prioritised rank for each class of virulence genes.]

Chapter 7

Review: virulence genes of Streptococcus agalactiae

7.1 Introduction

Bacterial pathogens use a variety of mechanisms to survive and multiply in the human host. Certain virulence factor combinations may improve the success of pathogen colonisation and invasion. It was previously demonstrated in Chapter 6.4 that genes contributing to the same biological function usually occur together. This co-occurrence property allows such genes to be identified by the inductive CGP method described in Chapter 5.4. By considering GBS virulence as a specific phenotypic trait, it can be hypothesised that these virulence genes share similar occurrence patterns across bacterial genomes, analogous to the rediscovery experiment of the KEGG pathways in Chapter 6.4. The purpose of this chapter is to survey the literature on currently known GBS virulence genes, for which mechanisms have been demonstrated in biological experiments, to serve as the evidence base for the identification of unknown virulence genes in Chapter 9.
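The hypothesis above, that unknown virulence genes share occurrence patterns with known ones, can be illustrated with a minimal sketch. It uses a simple nearest-neighbour comparison on hypothetical 0/1 profiles as a stand-in; it is not the trained classifiers evaluated in Chapter 6, and all gene names and profiles are invented for illustration.

```python
def hamming_similarity(a, b):
    """Fraction of genomes in which two 0/1 occurrence profiles agree."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def rank_candidates(candidates, known_profiles):
    """Rank candidate genes by their best similarity to any known profile."""
    scored = []
    for name, profile in candidates.items():
        best = max(hamming_similarity(profile, k) for k in known_profiles)
        scored.append((name, best))
    # Highest similarity first; ties broken alphabetically for determinism.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical occurrence profiles of two known virulence genes (five genomes).
known_profiles = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 1, 1],
]

# Hypothetical candidate genes to be prioritised.
candidates = {
    "cand_a": [1, 1, 0, 0, 1],
    "cand_b": [0, 0, 1, 1, 0],
    "cand_c": [1, 1, 0, 0, 0],
}

for name, score in rank_candidates(candidates, known_profiles):
    print(name, score)
```

Candidates whose profiles closely match a known virulence gene rise to the top of the list, which is the behaviour a prioritised rank is meant to capture.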

In general, GBS virulence genes can be broadly classified into three functional categories [184, 185]:

1. Adhesins: genes responsible for the adherence to host epithelium and extracellular matrix.

2. Invasins: genes responsible for the invasion of host cells or the spreading of bacteria across the extracellular matrix.

3. Evasins: genes responsible for evading the host immune system and defence mechanisms.

Table 7.1: List of known virulence genes of S. agalactiae

| Gene | Function | NEM316 (III) | A909 (Ia) | 2603 (V) |
|---|---|---|---|---|
| I. Adhesins | | | | |
| fbsA | fibrinogen-binding protein FbsA | GBS1087 | SAK1142 | SAG1052 |
| fbsB | fibrinogen-binding protein FbsB | GBS0850 | SAK0955 | SAG0832 |
| pavA | fibronectin-binding protein | GBS1263 | SAK1277 | SAG1190 |
| scpB | C5a peptidase | GBS1308 | SAK1320 | SAG1236∗ |
| lmb | laminin-binding protein | GBS1307 | SAK1319 | SAG1234 |
| minor pilin | streptococcal minor pilin cluster | GBS0628-32 | SAK0776-80 | SAG0645-49 |
| II. Invasins | | | | |
| cyl† | β-haemolysin/cytolysin | GBS0644-55 | SAK0790-0801 | SAG0662-73 |
| SAG0499 | haemolysin A | GBS0545 | SAK0600 | SAG0499 |
| SAG1319 | haemolysin III | GBS1389 | SAK1350 | SAG1319 |
| SAG1966 | haemolysin precursor | GBS1953 | SAK1927 | SAG1966 |
| cfb | CAMP factor | GBS2000 | SAK1983 | SAG2043 |
| spb1 | surface protein Spb1 | GBS1477 | SAK1440 | SAG1407 |
| hylB | hyaluronate lyase | GBS1270 | SAK1284 | SAG1197 |
| SAG1215 | exfoliative A | GBS1287 | SAK1301 | SAG1215 |
| rib‡ | surface protein rib | GBS0470 | | SAG0433 |
| bca‡ | C-α protein | | SAK0517 | |
| III. Immune system evading genes | | | | |
| bac | C-β protein | | SAK0186 | |
| cps | cps cluster | GBS1237-47 | SAK1251-62 | SAG1162-75 |
| neu | neu cluster | GBS1233-36 | SAK1247-50 | SAG1158-61 |
| scpB§ | C5a peptidase | (see above) | (see above) | (see above) |
| SAG0416 | putative protease | GBS0451 | (?absent) | SAG0416 |
| cspA‡ | serine protease | GBS2008 | SAK1991 | SAG2053 |
| pbp1A/ponA | penicillin-binding protein 1A | GBS0288 | SAK0370 | SAG0298 |

The three right-hand columns give the gene loci in the reference genomes. NB: ∗) IS1548 is embedded upstream of the scpB gene in S. agalactiae 2603 V/R. †) Although primarily an invasin, cyl is capable of damaging phagocytes and hence also has a role in immune system evasion. ‡) Dual roles as both an invasin and an immune system evading gene. §) Dual roles as both an adhesin and an immune system evading gene.

7.2 Adhesins

7.2.1 Fibrinogen-binding protein genes fbsA and fbsB

The fbsA gene (327 amino acids) is located at GBS1087, SAG1052, and SAK1142. The gene product of fbsA is a surface protein that contains several 16-amino acid repeats of fibrinogen-binding motifs which promote adhesion to epithelial cells via binding to human fibrinogen [186, 187]. The characteristic multivalent fibrinogen-binding sites on the FbsA protein cause rapid fibrin polymerisation and form large mesh-like frameworks [188, 189]. The molecular reaction triggers thrombogenesis in GBS sepsis and shields GBS from [188, 189]. The FbsA protein also interacts with glycoprotein IIb/IIIa, a major platelet integrin, and promotes platelet aggregation [190]. In mouse models, the presence of the fbsA gene is associated with a higher bacterial count in experimental infection, resulting in a higher likelihood for the infected animals to develop septicaemia and septic arthritis [191].

A second fibrinogen-binding protein, FbsB, encoded by the fbsB gene, is a 753-amino acid protein located at GBS0850, SAK0955, and SAG0832. The FbsB protein was demonstrated to interact with human fibrinogen in vitro, and an important role of the protein in the internalisation of GBS into host cells was also identified [192]. Specific variants of the fbsA and fbsB genes were found to be associated with the hypervirulent MLST ST-17 clone [193].

7.2.2 Fibronectin-binding protein gene pavA

The pavA gene, which encodes a surface protein in GBS, is located at GBS1263, SAK1277, and SAG1190. A homolog of the GBS pavA gene was found in Streptococcus pneumoniae. The pneumococcal PavA protein was shown to bind to fibronectin, a major glycoprotein in the mammalian extracellular matrix [194].

7.2.3 Streptococcal C5a-peptidase (encoded by the scpB gene)

The scpB gene is located at GBS1308 and SAK1320 and encodes a protein of 1150 amino acids in length, a surface protease expressed by all serotypes of GBS [195]. The gene product of scpB serves a dual role in GBS pathogenesis, acting as both an adhesin to human fibronectin and an immune evasion gene by acting as C5a-peptidase. The fibronectin-binding activity of the scpB gene product was demonstrated in an adhesion assay, where a portion of the scpB protein was found to have affinity to human mucosal fibronectin [196]. This fibronectin-binding ability is independent of the C5a-ase activity [197], which will be discussed in Section 7.4.2.

7.2.4 Laminin-binding protein (encoded by the lmb gene)

The lmb gene (GBS1307, SAK1319, SAG1234) is located immediately downstream of the scpB gene. The gene product of lmb shares homology with streptococcal LraI adhesin family proteins and has a role in binding to human laminin [198, 199]. In vitro experiments with GBS strains carrying a Δlmb deletion showed a reduced degree of invasion into human brain microvascular endothelial cells [200].

7.2.5 Minor pilin gene cluster

A recent multi-genome screening study revealed pilus-like structures extending from the GBS cell surface, which were visible by immunogold electron microscopy [201, 202]. The minor pilin gene cluster is located at SAG0645-49 in 2603V/R. It consists of several sortase genes and genes responsible for pilus assembly [203]. The minor pilin protein GBS52 (SAG0645) contains two immunoglobulin G (IgG)-like domains, similar to the structure of staphylococcal adhesion CnaB protein, and was demonstrated to have binding specificity to human lung epithelial cell A549 [203].

7.3 Invasins

7.3.1 Cα-protein (bca), Rib (rib), and α-like protein (alp) family

The Cα-like protein family belongs to a class of high molecular weight surface proteins consisting of several genetic variants (encoded by bca [204], rib: GBS0470, SAG0433 [205], and alp1–3 [206, 207]). The Cα-protein is regulated by 5-base pair short tandem repeats upstream of the gene [208]. All alp family proteins contain the Gram-positive LPXTG cell-wall anchoring motif and possess large numbers of tandem repeats in the genes [204], forming characteristic ladder-like formations on Western blots [209]. The variations in tandem repeats may contribute to variations in antigenicity, which may also help in evading antibody recognition and the host immune system.

GBS strains with inactivated Cα protein have attenuated virulence in mouse models [210]. Both proteins Rib and Cα elicit passive immunity in experimental models [211, 212]. Cα protein promotes GBS adherence to glycosaminoglycan (GAG) and facilitates the translocation of bacteria across the membrane of epithelial cells via an actin-dependent mechanism [213, 214]. A decreased number of tandem repeats in the alp genes is associated with increased virulence in neonatal invasion in both murine models [210] and clinical isolates [215].

7.3.2 β-haemolysin/cytolysin (encoded by the cyl gene cluster)

The extracellular virulence factor β-haemolysin/cytolysin (β-h/c) is encoded by the cyl gene cluster (GBS0644-0655, SAK0790-0801, SAG0662-0673, Table 7.2) and is among the most well-characterised GBS virulence factors. β-h/c causes epithelial cell injury by forming transmembrane pores on the surface of host cells, leading to subsequent cell lysis [216]. Mutants of a key enzyme of the haemolysin cluster, encoded by cylE, have a 100-fold decrease in bacterial inoculum count in intratracheally infected rabbits [217]. Similar results were found in murine models, where deletion of the cylE gene was found to reduce GBS survival from phagocytosis [218].

Table 7.2: The β-haemolysin/cytolysin gene cluster

| Gene | NEM316 | 2603 | A909 | Function | Ref. |
|---|---|---|---|---|---|
| cylX | GBS0644 | SAG0662 | SAK0790 | unknown function | [224] |
| cylD | GBS0645 | SAG0663 | SAK0791 | fatty acid biosynthesis | [224] |
| cylG | GBS0646 | SAG0664 | SAK0792 | fatty acid biosynthesis | [224] |
| acpC | GBS0647 | SAG0665 | SAK0793 | acyl carrier protein | [224] |
| cylZ | GBS0648 | SAG0666 | SAK0794 | fatty acid biosynthesis | [224] |
| cylA | GBS0649 | SAG0667 | SAK0795 | ABC transporter | [224] |
| cylB | GBS0650 | SAG0668 | SAK0796 | ABC transporter | [224] |
| cylE | GBS0651 | SAG0669 | SAK0797 | required for (complete) haemolysis | [216, 224] |
| cylF | GBS0652 | SAG0670 | SAK0798 | aminomethyltransferase | |
| cylI | GBS0653 | SAG0671 | SAK0799 | fatty acid biosynthesis | |
| cylJ | GBS0654 | SAG0672 | SAK0800 | glycosyltransferase | [225] |
| cylK | GBS0655 | SAG0673 | SAK0801 | unknown, required for haemolysis | [225] |

The NEM316, 2603 and A909 columns give the loci in the reference genomes.

In addition to its cytolytic activities, β-h/c is also known to trigger inappropriate immune responses in hosts. β-h/c is implicated in septic shock-like syndrome by inducing the inducible nitric-oxide synthase, resulting in effects similar to those of Gram-negative lipopolysaccharide [219]. β-h/c also contributes to GBS meningitis by activating the early acute response of brain endothelium (the blood-brain barrier endothelial cells), mediated by the upregulation of several proinflammatory cytokines including IL-8, Groα, Groβ, GM-CSF at bone marrow, ICAM-1 and Mcl-1 [220]. β-h/c causes direct injuries to pulmonary epithelial and microvascular endothelial cells in vitro [221, 222], possibly through an inappropriate immune response with neutrophil chemotaxis via an IL-8-mediated mechanism [223].

7.3.3 Hyaluronate lyase (encoded by the hylB gene)

The hyaluronate lyase (hylB gene: GBS1270, SAK1284, SAG1197) catalyses the cleavage of the N-acetylglucosamine β(1→4) glycosidic bond of hyaluronan, a major component of the extracellular matrix, into glucosamine and glucuronic acid [226–229]. Although hylB was postulated to assist the bacteria in travelling through the extracellular matrix, its pathogenic role in GBS is debated, as insertion of IS1548 into hylB is a common finding among invasive isolates [230].

7.3.4 Other invasins

The spb1 gene encodes a surface protein that was suggested to have a role in mediating internalisation of bacteria into epithelial cells in virulent serotype III GBS [231]. The streptococcal CAMP factor¹, encoded by the cfb gene, is a cytolysin that forms discrete transmembrane pores on susceptible membranes and causes lysis of epithelial cells [232].

7.4 Immune system evasion

7.4.1 Cβ protein (encoded by the bac gene)

The second C protein and antigen, Cβ surface protein (encoded by the bac gene), is known to bind to and the Fc portion of immunoglobulin A (IgA) [233]. GBS strains expressing Cβ protein cause attenuated opsonisation and phagocytosis in host phagocytes [234, 235]. Episodes of GBS bacteraemia can elicit Cβ-specific immunoglobulin M (IgM) and IgG in bac-positive strains [236]. Both C proteins (C-α and C-β) elicit protective immunity in murine models [237] and hence are potential vaccine candidates. The Cβ protein is mostly associated with serotype Ib, and shorter tandem repeats are also associated with isolates with higher degrees of invasive propensity [114].

¹ Originally appeared in: Christie, R., Atkins, N. E. and Munch-Petersen, E. A note on a lytic phenomenon shown by group B streptococci. Aust. J. Exp. Biol. Med. Sci. 1944; 22: 197–200.

7.4.2 Streptococcal C5a-peptidase (encoded by the scpB gene)

In addition to the fibronectin-binding activity described in Section 7.2.3, the streptococcal C5a-peptidase also facilitates the cleavage of human complement C5a, which leads to diminished activities of the potent anaphylatoxin and neutrophil chemotaxin of the complement cascade [195]. The C5a-peptidase activity can be inactivated by a 4-amino acid deletion at the histidine active site, where the mutant was shown to have decreased colonisation potential in animal models [238]. Two integrin-binding motifs (Arg-Gly-Asp) exist on the C5a-peptidase, and these motifs were believed to have roles in stabilising and activating the C5a-ase activities [239].

7.4.3 Polysaccharide capsule (encoded by the cps cluster)

The cps gene cluster is the genetic determinant of GBS capsular polysaccharide composition, which is an important factor against opsonophagocytosis. The structural diversity of GBS serotypes is determined by polymorphism of the cps gene cluster, leading to variations in serotypes [94]. Different serotypes are associated with varying degrees of virulence in GBS infective diseases [127]. Experiments with transposon mutagenesis demonstrated that acapsular type III mutants have reduced lethality in neonatal rat models, suggesting that the polysaccharide capsule plays an important role in GBS pathogenesis [253].

Table 7.3: The GBS neu gene cluster and the enzymatic functions of individual genes

| Gene | Function of gene product | References |
|---|---|---|
| Backbone synthesis | | |
| cpsE | β-1 glycosyltransferase; initiation of PRU synthesis | [240] |
| cpsH | serotype-specific polysaccharide polymerase | [241] |
| cpsG | β-1,4 galactosyltransferase (serotypes Ia, Ib, II, and III) | [241] |
| | α-1,4 galactosyltransferase (serotypes IV–VII) | [241] |
| cpsR | β-1,4 rhamnosyltransferase (serotype VIII) | [241] |
| Side-chain synthesis | | |
| cpsI | β-1,3-N-acetylglutaminyltransferase (serotype Ia) | [240] |
| | β-1,6-N-acetylglutaminyltransferase (serotypes Ib, IV, V, and VII) | [240] |
| cpsK | sialic acid transferase | [242, 243] |
| Regulatory genes | | |
| cpsY | LysR-family transcriptional regulator | [244] |
| cpsA | transcriptional enhancer | [245] |
| cpsB–D | chain length regulation; PRU polymerisation coordination | [245] |
| Sialic acid synthesis | | |
| neuB | catalyses the formation of free NeuNAc from ManNAc | [246] |
| neuC | UDP-GlcNAc 2-epimerase; converts GlcNAc to ManNAc | [247] |
| neuA | 1) synthesis of CMP-NeuNAC from free Neu5NAc | [248] |
| | 2) competitive O-acetylation of Neu5NAc | [249, 250] |
| neuD | 1) converts free NeuNAc into CMP-NeuNAc | [251] |
| | 2) O-acetyltransferase activity | [252] |

Abbreviations: PRU: polysaccharide repeating units; NeuNAc: N-acetylneuraminic acid; ManNAc: N-acetyl-D-mannosamine; GlcNAc: N-acetylglucosamine; CMP: cytidine monophosphate; UDP: uridine diphosphate.

The cps operon

The cps gene operon consists of a 26 kb gene cluster containing 16–20 open reading frames transcribed as a single polycistronic mRNA transcript. The operon contains four distinct functional sections: chain length regulation (cpsB–D), backbone synthesis (cpsE–HR), sialic side chain synthesis (cpsIJK), and transportation of polysaccharide products (cpsL). Regulatory genes exist both upstream (cpsY) and downstream (cpsA) of the origin of the cps operon. A sialic acid synthesis gene cluster (neu cluster, neuA-D) is located immediately downstream of the cps cluster. The promoter regions are located in front of the cpsA, cpsE and ung1 genes [240].

Backbone synthesis

The cpsE gene encodes a 462-amino acid glycosyltransferase which transfers the β-1-glycosyl backbone onto the new subunit. The presence of a functional cpsE gene is essential for initiating the synthesis of polysaccharide repeating units (PRU), as demonstrated by ΔcpsE mutants, which produced bacterial strains without capsules [240].

The serotype-specific polysaccharide polymerase, encoded by the cpsH gene, is present in all GBS serotypes. The serotype specificity of cpsH was demonstrated by in vivo expression of a type III cpsH in serotype Ia, which produced type III-specific capsular polysaccharide [241].

The cpsG gene is also present in all serotypes except serotype VIII. In serotypes Ia, Ib, II, and III, the cpsG product is a β-1,4 galactosyltransferase transferring the β-D-Gal-P unit onto position 4 of GlcP [240], an enzymatic activity that is enhanced by the cpsF gene product [241]. The cpsG gene is phylogenetically similar to genes cpsI and cpsJ, which encodes a 315 amino acid β-1,4-galactosyltransferase [94]. In serotypes IV–VII, the cpsG product catalyses an α-1,4 glucosyltransferase activity. In serotype VIII, cpsG–K are replaced by cpsR, which acts as the β-1,4 rhamnosyltransferase that links the rhamnose unit to the Glc-P unit [240].

Side-chain synthesis (cpsI–K)

The cpsI gene in serotype Ia encodes a 324 amino acid β-1,3-N-acetylglutaminyltransferase [240], whereas in serotypes Ib, IV, V, and VII, the cpsI product acts as a β-1,6-N-acetylglutaminyltransferase. In serotypes II, III, and VI, CpsI is responsible for catalysing the formation of the polysaccharide backbone.

The cpsK gene encodes a sialic acid transferase (α-2,3-N-acetylneuraminic-transferase) that catalyses terminal sialylation, a process that attaches cytidine monophospho-N-acetylneuraminic acid (CMP-NeuNAc) onto the galactose unit of the side chains [242, 243].

Regulatory genes

Two regulatory genes exist in the GBS cps operon. The gene cpsY (previously annotated as cpsR in serotype Ia) is a LysR-family transcriptional regulator, which is encoded on the opposite strand to the cps cluster and controls the expression of the cps operon [244]. The cpsA gene is a transcriptional enhancer that increases the amount of capsular produced by the bacteria [245]. The cpsA-D genes are responsible for coordinating the polymerisation of PRUs [245].

7.4.4 Sialic acid synthesis (encoded by the neu gene cluster)

The neu gene cluster (neuA–D, Table 7.3) is located on a genomic region highly conserved across GBS strains, suggesting that the terminal sialylation of the polysaccharide capsule is an important virulence factor and required for GBS survival [252]. Terminal sialylation of lipooligosaccharide in Neisseria meningitidis is known to interfere with phagocytosis [254]. In GBS, the terminally sialylated polysaccharide capsule with N-acetylneuraminic acid residue (NeuNAc) also interferes with opsonophagocytosis, as well as preventing the deposition of C3a, resulting in the diminished activation of the alternative complement pathway [235, 255, 256]. In clinical isolates, a 15-kbp DNA restriction enzyme fragment containing the neuA gene was found to be associated with neonatal meningitis [119].

7.4.5 Serine protease CspA

The cspA gene (GBS2008, SAK1991, SAG2053) encodes an extracellular serine protease with a role in cleaving the α-chain of human fibrinogen [257]. Mutants of the cspA gene have decreased invasiveness in murine models and are more sensitive to opsonophagocytosis by human neutrophils in vitro [257].

7.4.6 Penicillin-binding protein 1A (encoded by pbp1A/ponA)

Penicillin-binding proteins (pbp) are enzyme families responsible for modifying and cross-linking peptidoglycan [258]. Pbp1A protein (GBS0608, SAK0713, and SAG0298) has a role in resisting phagocytosis. An experimental model has shown that GBS strains with ΔponA deletion are more susceptible to phagocytic killing [259]. The Δpbp1a mutants are also more susceptible to cationic antimicrobial peptides (AMP) [260]. A lower inoculum count with higher pulmonary clearance was found in neonatal rat models infected with Δpbp1a mutants in vivo [261].

7.5 Summary

GBS possesses many virulence mechanisms to support the successful colonisation and invasion of susceptible perinatal hosts. Many molecular factors and their genetic determinants have been found to facilitate GBS adherence to epithelium, invasion into tissues and cells, and evasion of host immune systems. While most of these genes are present in the majority of strains and serotypes, it is possible that polymorphisms of these genes or their regulators give GBS strains different degrees of invasive capability (for example, variations in serotypes and in the number of tandem repeats in C proteins). In summary, this chapter has reviewed a list of GBS virulence factors whose pathogenic roles are currently recognised by experimental confirmation. This list of GBS virulence genes will guide our in silico candidate gene prioritisation to discover virulence factors that are currently uncharacterised.


Figure 7.1: Currently known GBS virulence factors. This illustration is an original work except for the image of neutrophil granulocyte, which was obtained from Wikipedia Commons (http://en.wikipedia.org/wiki/File:Neutrophil.png) under GNU Free Documentation License.

Chapter 8

GBS virulence gene discovery by statistical CGP

The objective of this chapter is to use statistical CGP to discover common genes that are present in neonatal pathogens. Different bacterial pathogens tend to cause characteristic infections. For example, S. pneumoniae has a high propensity for causing lower respiratory tract infection at both extremes of life; uropathogenic E. coli and Proteus spp. are frequent causal pathogens of urinary tract infection; and Clostridium difficile is known to cause pseudomembranous colitis associated with the use of certain antibiotics. Since different species of bacterial pathogens may cause similar syndromes, it can be hypothesised that these pathogens possess specific genes that contribute to their characteristic clinical manifestations.

The approach presented in this chapter differs from other comparative genomic studies. Methods such as whole genome microarray studies [53, 54] and genome sequence comparisons [49] aim to compare pathogenic strains of bacteria to colonising counterparts to discover genes distinct to the pathogenic strains. It is known that genes associated with pathogenicity are likely to be strain-specific and frequently belong to the “dispensable” part of the bacterial genome [126]. However, virulence genes that are “indispensable” to a bacterial species are less well studied by comparative genomics approaches. Instead of identifying genes that are specific to pathogenic strains, statistical CGP examines potential candidate genes that are shared among pathogenic bacteria causing neonatal sepsis (and absent in non-pathogens). Highly ranked genes may thus have important molecular roles in close association with the invasive capability they provide to pathogens. Polymorphisms in these genes and their regulators may explain certain variations in virulent behaviours.

8.1 Material and methods

8.1.1 Genome example selection

Thirty genomes associated with both early- and late-onset neonatal sepsis were selected as positive genome examples, and the selection process is detailed in Appendix D (Table 8.1). Three hundred and thirty-one bacterial genomes that rarely if ever cause neonatal infections were selected as negative examples. The negative genome examples selected for this prioritisation task are listed in Appendix E, Table E.1.

Some genomes were excluded from both positive and negative genome examples, either because the pathogens are infrequently associated with neonatal invasion or because there was no sequenced genome in the database. For example, although Pseudomonas aeruginosa is associated with lethal outcomes in infected neonates, pseudomonal infections are infrequent and are usually only associated with late-onset diseases of nosocomial origin [262]; Salmonella enterica, a Gram-negative causing bacterial gastroenteritis in humans, was inconsistently reported as a neonatal pathogen in epidemiological studies; Klebsiella pneumoniae, the second most common Gram-negative pathogen in neonates, did not have a complete sequenced genome at the time of analysis (April 2007). Similarly, anaerobes were excluded from the positive example group because they are primarily associated with neonates with significant co-morbidities [263].

The inclusion of non-pathogenic reference strains (K12/W3110), a uropathogenic strain (UTI89), enterohaemorrhagic strains (O157:H7), and other strains of E. coli in the positive genome examples was considered justifiable, despite there being no evidence that these E. coli strains cause neonatal sepsis. As discussed earlier, the rationale for this analysis was to uncover genes common to all neonatal pathogens (as listed in the positive genome examples, which include all E. coli strains) that may be associated with virulence. It was hypothesised that such genes may have common roles in pathogens capable of causing severe neonatal infections. Genes unique to individual pathogenic strains were thus not considered in this analysis; such virulence genes would be ranked lower by statistical CGP because their frequency of occurrence would be low. The same criterion was also applied when selecting other bacterial species for the group of positive genome examples.

8.1.2 Statistical CGP

All genes from three S. agalactiae genomes (2603 V/R, GenBank accession: AE009948; A909, accession: CP000114; and NEM316, accession: AL732656) were selected as candidate genes for CGP. One of the best-performing scoring functions from Chapter 6.1, amss, was selected for the prioritisation task. The occurrence matrix was determined by performing candidate-against-all BLASTP (all candidate genes against all open reading frames) in the 30 + 331 genomes. The criteria for determining the existence of a homolog were the same as those described in Chapter 5.3.1.
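As a minimal sketch of how a score such as amss (the arithmetic mean of sensitivity and specificity) could be computed from one row of the occurrence matrix, the following Python fragment assumes that each candidate gene is represented by a list of 0/1 homolog calls, one per genome, with sensitivity taken over the positive genome examples and specificity over the negative ones. The function names, the toy genes, and the seven-genome layout are hypothetical and are not taken from the software used in this thesis.

```python
def amss(occurrence, is_positive):
    """Arithmetic mean of sensitivity and specificity for one candidate gene.

    occurrence[i]  -- 1 if a homolog was found in genome i, otherwise 0
    is_positive[i] -- True for positive genome examples, False for negative ones
    """
    pos = [o for o, p in zip(occurrence, is_positive) if p]
    neg = [o for o, p in zip(occurrence, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative genome")
    sensitivity = sum(pos) / len(pos)               # present in positive genomes
    specificity = (len(neg) - sum(neg)) / len(neg)  # absent from negative genomes
    return (sensitivity + specificity) / 2


def rank_genes(matrix, is_positive):
    """Return gene identifiers sorted by descending amss score."""
    return sorted(matrix, key=lambda g: amss(matrix[g], is_positive), reverse=True)


# Toy data (hypothetical): three positive genomes followed by four negative genomes.
labels = [True, True, True, False, False, False, False]
toy = {
    "geneA": [1, 1, 1, 0, 0, 0, 0],  # in every positive genome, in no negative genome
    "geneB": [1, 1, 1, 1, 1, 1, 1],  # present everywhere (housekeeping-like)
    "geneC": [1, 0, 0, 0, 0, 0, 0],  # strain-specific
}
ranking = rank_genes(toy, labels)
```

In this toy example the strain-specific gene scores lower than the gene shared by all positive genomes, which mirrors the remark in Section 8.1.1 that genes unique to individual strains would be ranked lower.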

8.1.3 Comparison with currently known virulence factors

An interesting question about the prioritisation result is how many currently known virulence factors were highly ranked. To answer this question, the 134 known virulence genes listed in Table 9.1 were used as benchmarks and tested against the rank produced by statistical CGP, and the prioritisation performance of nine scoring functions (Chapter 5.3.3) was evaluated.

8.1.4 Clustering of orthologous genes

Because more than one genome of S. agalactiae was used as a source of candidate genes, orthologous genes would appear multiple times in close proximity in a prioritised rank because they have similar co-occurrence profiles. To allow better visualisation and analysis, the prioritised results were collated by grouping the genes into orthologous clusters based on the reciprocal best BLAST hit method described by Hirsh et al. [152].
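The idea of reciprocal best hits can be illustrated with a short Python sketch. It assumes that the best BLAST hit of every gene in each pairwise genome comparison has already been tabulated as a dictionary, and it merges reciprocal pairs into clusters by connected components; both the input format and the merging step are assumptions made for illustration, and the gene identifiers are hypothetical. The handling of paralogs in the method of Hirsh et al. [152] is not modelled here.

```python
def reciprocal_best_hits(best_ab, best_ba):
    """Return pairs (a, b) where b is a's best hit and a is b's best hit."""
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


def cluster(pairs):
    """Group genes linked by reciprocal best hits into clusters (union-find)."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = {}
    for gene in list(parent):
        groups.setdefault(find(gene), []).append(gene)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical best-hit tables for three genomes X, Y and Z.
x_to_y = {"X1": "Y1", "X2": "Y2"}
y_to_x = {"Y1": "X1", "Y2": "X3"}   # Y2 prefers X3, so X2/Y2 is not reciprocal
y_to_z = {"Y1": "Z1"}
z_to_y = {"Z1": "Y1"}

pairs = reciprocal_best_hits(x_to_y, y_to_x) | reciprocal_best_hits(y_to_z, z_to_y)
clusters = cluster(pairs)
```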

8.2 Results

Three genes (C1925, C1375, and C1374) encoding family 8 glycosyltransferases (GT8) were ranked among the top-10. All three top-ranked GT8 genes have sensitivities of 93% in the positive genome examples and specificities of 89–93% in the negative genome examples (amss score = 0.91–0.93). Genes specific for anaerobic respiration, namely the pyruvate-formate lyase genes (pfl genes: C1625, C0322, C1324, and C0316), the ribonucleoside-triphosphate reductase activating gene (nrdG: C1945), and a gene encoding a hypothetical protein (C1982) that is a homolog of the C4-dicarboxylate anaerobic carrier protein DcuC in other species, were ranked among the top-10. The top-25 genes ranked by statistical CGP are listed in Table 8.2.

Table 8.1: Positive genome examples used in the statistical CGP of GBS virulence genes

Genome                                  GenBank accessions∗

Gram-positive bacteria
  Streptococcus and Enterococcus
    Str. agalactiae 2603 (Ch. D.2.1)    AE009948
    Str. agalactiae A909                CP000114
    Str. agalactiae NEM316              AL732656
    Str. pneumoniae D39 (Ch. D.2.3)     CP000410
    Str. pneumoniae R6                  AE007317
    Str. pneumoniae TIGR4               AE005672
    E. faecalis V583 (Ch. D.2.3)        AE016830–3
  Staphylococcus aureus (Ch. D.2.2)
    S. aureus COL                       CP000045–6
    S. aureus MW2                       BA000033
    S. aureus Mu50                      AP003367, BA000017
    S. aureus N315                      AP003139, BA000018
    S. aureus NCTC 8325                 CP000253
    S. aureus RF122                     AJ938182
    S. aureus USA300                    CP000255–8
    S. aureus MRSA252                   BX571856
    S. aureus MSSA476                   BX571857–8
  Coagulase-negative staphylococci (Ch. D.2.2)
    S. epidermidis ATCC 12228           AE015929–35
    S. epidermidis RP62A                CP000028–29
  Listeria monocytogenes (Ch. D.2.3)    AL591824
    Listeria monocytogenes 4b F2365     AE017262

Gram-negative bacteria
  Escherichia coli (Ch. D.3.1)
    E. coli 536                         CP000247
    E. coli APEC O1                     CP000468
    E. coli CFT073                      AE014075
    E. coli K12†                        U00096
    E. coli O157H7                      AB011548–9, BA000007
    E. coli O157H7 EDL933               AE005174, AF074613
    E. coli UTI89                       CP000243–4
    E. coli W3110†                      AP009048
  Haemophilus influenzae (Ch. D.3.3)
    H. influenzae Rd                    L42023
    H. influenzae 86-028NP              CP000057

∗) Both bacterial chromosomes and accessory plasmids were included in the analysis.
†) Although these E. coli strains are non-pathogenic, this analysis intends to reveal genes that are shared among E. coli and other neonatal pathogen species. It is postulated that variations in such well-conserved genes may contribute to the virulence of a neonatal pathogen. See text for more discussion.

Table 8.2: Top ranked gene clusters from statistical CGP using amss scoring function

#   Score  Cluster    Gene / function                                                            Gene loci in genomes
1   0.942  C1812/G    [lacD] tagatose 1,6-diphosphate aldolase                                   SAG1928, SAK1887, GBS1333/1915
2   0.930  C1925/M    glycosyl transferase, family 8                                             SAG2060, SAK1489, GBS1525/2015
3   0.921  C1375/M    glycosyl transferase, family 8                                             SAG1460/2061, SAK1491, GBS1527/2016
4   0.918  C1625/C    [pflD-2] formate acetyltransferase 1                                       SAG1727, SAK1735, GBS1772
5   0.918  C1982/S    hypothetical protein                                                       SAG2124, SAK2063, GBS2083
6   0.917  C0319      [celC] PTS system, IIA component, lactose/cellobiose family                SAG0328, SAK0398, GBS0316/1331
7   0.912  C2024/E    [arcC-2] carbamate kinase                                                  SAG2167, SAK2125, GBS2126
8   0.912  C1983/E    [arcC-1] carbamate kinase                                                  SAG2125, SAK2064, GBS2084
9   0.912  C1945/O    [nrdG] anaerobic ribonucleoside-triphosphate reductase activating protein  SAG2082, SAK2021, GBS2037
10  0.912  C1374/M    glycosyl transferase, family 8                                             SAG1459, SAK1490, GBS1526
11  0.912  C0321/G    [celB] PTS system, IIC component, lactose/cellobiose family                SAG0330, SAK0400, GBS0318/1330
12  0.912  C1701/G    PTS system, galactitol-specific IIC component                              SAG1805, SAK0529/1825, GBS1846
13  0.912  C1817/G    PTS system, galactitol-specific IIC component, putative                    SAG1933, SAK0526/1893, GBS1920
14  0.911  C1315/E    [pepT] peptidase T                                                         SAG1389, SAK1422, GBS1459
15  0.911  C0322/C    [pflD-1] formate acetyltransferase 2                                       SAG0331, SAK0401, GBS0319
16  0.909  C0191/KGT  transcriptional antiterminator, BglG family                                SAG0194/0196, SAK0259/0523, GBS0191
17  0.909  C1697/K    transcriptional antiterminator, BglG family                                SAG1801, SAK1762/1821, GBS1842
18  0.899  C0297      [luxS] S-ribosylhomocysteinase                                             SAG0305, SAK0376, GBS0294
19  0.897  C1819/GT   PTS system, galactitol-specific IIA component, putative                    SAG1935, SAK0524/0528/1895, GBS1922
20  0.896  C0274/G    PTS system, IIBC components                                                SAG0282, SAK0354, GBS0272
21  0.894  C1324/O    [pflA-2] pyruvate formate-lyase-activating enzyme                          SAG1398, SAK1431, GBS1468
22  0.894  C0316/O    [pflA-1] pyruvate formate-lyase-activating enzyme, putative                SAG0325, SAK0395, GBS0313
23  0.894  C0032/G    N-acetylmannosamine-6-phosphate 2-epimerase                                SAG0033, SAK0066, GBS0032
24  0.892  C0756/K    transcriptional antiterminator, BglG family                                SAG0789, SAK0914, GBS0809/1332
25  0.890  C0807/H    [thiM] hydroxyethylthiazole kinase                                         SAG0841, SAK0964, GBS0859

Note: #: rank position. Score: best score (amss) achieved in the ortholog cluster. Cluster: cluster ID/COG functional group [175] (C: Energy production and conversion; E: Amino acid transport and metabolism; G: Carbohydrate transport and metabolism; H: Coenzyme transport and metabolism; K: Transcription; M: Cell wall/membrane/envelope biogenesis; O: Posttranslational modification, protein turnover, chaperones; S: Function unknown; T: Signal transduction mechanisms). Gene loci: positions of genes in S. agalactiae reference genomes [SAGnumber: 2603 (V/R); SAKnumber: A909 (Ia); GBSnumber: NEM316 (III)].

Overall, the AUCs of scoring functions were between 0.38–0.60 when tested against the known virulence factors (Table 8.3). The average probability enrichments (η) were between 0.69–1.34, with ηmax of up to 11-fold.

8.3 Discussion

8.3.1 Possible significance of top-ranked genes

Several interesting genes, for example the family 8 glycosyltransferases, were highly ranked by the amss scoring function. Such highly-scored genes may have biological significance in other bacterial pathogens, and this will be discussed in more detail in the next chapter.

8.3.2 Few overlaps between the top genes and the currently known virulence factors

In the testing against known GBS virulence genes, statistical CGP achieved AUCs between 0.4–0.6 (with poor η between 0.7–1.3), suggesting that there was very little overlap between the known factors and the prioritised list. This discrepancy may be explained by differences in the methodologies of traditional gene screening, comparative genomics by DNA microarrays, and statistical CGP. Gene screening focuses on discovering genes with specific traits such as antigenic proteins (for example, Cα protein). In addition, surface-located proteins have been better studied because virulence factors are frequently anchored on cell walls, and heuristic strategies such as searching for Gram-positive sortase motifs (LPXTG) in genes have been applied in selecting candidate genes for experimental validation [264]. This method of gene screening relies on preferential selection of genes with known characteristics. On the other hand, virulence gene discovery by comparative microarrays usually focuses on comparing dispensable genes specific to pathogenic strains of a species, resulting in omissions of virulence genes common to all pathogens. The analysis presented in this chapter thus provides an alternative perspective to the traditional frameworks of candidate gene selection, which may be applicable for finding genes associated with other diseases of interest.

Table 8.3: Statistical CGP performance of S. agalactiae genomes tested on currently known virulence genes

                    Prioritisation performance
Scoring function    AUC      η       ηmax
sens                0.382    0.69    1.11
spec                0.603    1.27    2.39
ppv                 0.557    1.34    11.3
npv                 0.387    0.68    1.05
amss                0.454    0.73    1.09
hmss                0.469    0.76    1.16
chisq               0.490    0.92    2.17
bchisq              0.491    0.92    2.17
F                   0.488    0.88    2.68

Note: AUC: area under ROC curve; η: average probability enrichment (n-fold); ηmax: maximum probability enrichment (n-fold). Scoring functions: sens: sensitivity; spec: specificity; ppv: positive predictive value; npv: negative predictive value; amss: arithmetic mean of sensitivity and specificity; hmss: harmonic mean of sensitivity and specificity; OR: odds ratio; chisq: χ² scoring function; bchisq: signed χ² scoring function; F: F-measure.

8.3.3 Sampling bias involved in genome example selection

Selecting appropriate genome examples requires a definition of what exactly constitutes a neonatal pathogen. In this experiment, only pathogens frequently involved in neonatal infections were included as positive examples, in the hope of reducing the uncertainty associated with the prioritisation task. These sampling criteria are empirical, and sampling bias could have arisen.

8.3.4 Overlapping with anaerobic-specific genes

Among the top-25 genes in the prioritised rank, there was a significant overlap with anaerobic-specific genes (see Chapter 6.3). This is not unexpected, however, as the majority of positive genome examples are also facultative anaerobes. Since S. agalactiae is naturally a facultative anaerobe, it is not possible to rule out that some of these genes may be related to virulence. Nevertheless, statistical CGP is unable to determine these relationships, and biological validation will be required to examine the role of these genes in virulence.

Chapter 9

GBS virulence gene discovery by inductive CGP

Inductive CGP (Chapter 5.4) is based on the rationale that genes contributing to the same function should share similar occurrence profiles (being present and absent together across genomes). In Chapter 7, the experimentally validated virulence genes of GBS were reviewed by literature search. By applying inductive CGP to these genes, unknown virulence genes sharing similar biological mechanisms with the currently known virulence factors may be revealed.

9.1 Methods

An occurrence matrix was calculated from 483 available genomes as described in Chapter 5. Currently known virulence genes, as reviewed in Chapter 7, were used as training genes for machine learning (see Table 9.1). All genes from the three genome sequences (2603 V/R, A909, and NEM316) were used as candidate genes for ranking.

Four machine learning algorithms were applied to each orthologous gene or gene category. The algorithms were selected from the best-performing algorithms in Chapter 6: support vector machines with linear (SVM/Poly, trained by the sequential minimal optimisation algorithm) and RBF (SVM/RBF) kernels, an alternating decision tree (ADTree, with the number of boosting iterations set to 10), and a k-nearest neighbour classifier (IBk with inverse distance weighting, with k determined by leave-one-out cross-validation).

Table 9.1: Genes used in training inductive CGP models for prioritising GBS virulence genes

                                                                  Number of genes
Gene categories             Gene product/function                 A909 (Ia)  NEM316 (III)  2603V/R (V)  Total
Virulence-related genes                                           43         45            46           134
Adhesins                                                          10         10            10           30
  fbsA                      fibrinogen-binding protein            1          1             1            3
  fbsB                      fibrinogen-binding protein            1          1             1            3
  pavA                      fibronectin-binding protein           1          1             1            3
  scpB∗                     C5a peptidase                         1          1             1†           3
  lmb                       laminin-binding protein               1          1             1            3
  minor pilin gene cluster  streptococcus minor-pilin structure   5          5             5            15
Invasins                                                          17         17            17           51
  cyl                       β-haemolysin/cytolysin cluster        12         12            12           36
  cfb                       CAMP factor                           1          1             1            3
  spb1                      surface protein Spb1                  1          1             1            3
  hylB                      hyaluronate lyase                     1          1             1            3
  C-α protein family∗
    rib                     surface protein Rib                   1          -             1            2
    bca                     C-α protein                           -          1             -            1
Immune evasion genes                                              19         20            21           60
  bac                       C-β protein                           -          1             -            1
  cps gene cluster          capsular polysaccharide               11         12            14           37
  neu gene cluster          terminal sialic acid synthesis        4          4             4            12
  cspA‡                     serine protease                       1          1             1            3
  pbp1A/ponA                penicillin-binding protein 1A         1          1             1            3

∗ also have roles as immune system evading gene(s)
† scpB is truncated by IS1548 in S. agalactiae 2603 V/R
‡ also an invasin

9.1.1 Rediscovery of training genes

Genes within the same functional group are expected to be ranked highly, as demonstrated by the rediscovery of KEGG pathway genes with inductive CGP in Chapter 6.4. A successful rediscovery of the training set genes would demonstrate the internal consistency of the inductive CGP models for virulence gene discovery. The following procedure was used to evaluate the rediscovery performance: for each gene category with m virulence genes, a leave-one-out (m-fold) cross-validation was performed, with the remaining candidate genes assigned a negative class. The cross-validation procedure was identical to the method on page 110, with 10-fold replaced by m-fold. Rediscovery performance was measured by AUC for each combination of algorithm and gene category.
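The AUC used as the performance measure can be read as the probability that a held-out training gene receives a higher score than a gene assigned the negative class. The Python sketch below uses this standard rank-based definition, with ties counted as one half; the function name and the example scores are illustrative only and are not results from this thesis.

```python
def auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical scores: two held-out virulence genes and three negative-class genes.
held_out = [0.9, 0.8]
negatives = [0.1, 0.2, 0.85]
example_auc = auc(held_out, negatives)  # 5 of the 6 pairs are ordered correctly
```

An AUC of 1 therefore means that every held-out gene was scored above every negative-class gene, and 0.5 corresponds to a random ordering.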

9.1.2 Sub-sampling of negative examples in the de novo discovery of GBS virulence genes

A fundamental requirement for any rediscovery experiment is the precise selection of training and validation sets to evaluate the CGP performance. For example, in the earlier rediscovery experiment, all the remaining genes had to be labelled as negative, on the assumption that they are not related to the virulence function under study, in order to evaluate the consistency of the inductive CGP models. This requirement was also demonstrated earlier in Chapter 6, where several validation gene sets in the three case studies were clearly constructed to benchmark different algorithms.

In contrast, the task of selecting negative examples is less trivial in the de novo discovery of virulence genes. Specifically, some of the genes labelled negative in Section 9.1.1 may be genuinely related to pathogen virulence and are thus interesting candidates for discovery. Since it is impossible to determine which of these genes are not virulence-related, the training set is likely to be highly biased and strongly skewed towards the negative group if the machine learning models are trained with a large proportion of true virulence genes that are falsely labelled as negative (Figure 9.1(a)).

One method to ameliorate such bias is to perform multiple classifications with smaller sets of negative examples that are sampled randomly (Figure 9.1(b)). The aim of random sub-sampling is to create some training sets containing a smaller proportion of false-negative genes. With the positive genes (the known virulence genes) kept constant in the training set, candidate virulence genes are more likely to be revealed by training sets that contain few false negatives. Such genes are anticipated to be discovered more readily instead of being erroneously classified as negative by a biased aggregated training set. In support of the sub-sampling method, Terabe et al. previously demonstrated that sub-sampling does not produce inferior performance when compared with an aggregated sample set [265].

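Proof. Let $A$ = the number of positive gene examples in the training set ($A > 0$), $B$ = the number of unknown gene examples in the test set ($B > 0$), $p$ = the proportion of positive genes in $B$ falsely labelled negative, and $k$ = the proportion of $B$ sub-sampled. Define

\[ f(k) = \frac{kBp}{A + kB} \]

as the proportion of unknown genes mislabelled as negative in the training set. It can be shown that

\[ \frac{df(k)}{dk} = \frac{pAB}{(A + kB)^2} > 0 \]

for all values of $0 < k \le 1$ (given $p > 0$). Hence $f(k)$ increases with $k$, and sub-sampling a smaller proportion of $B$ reduces the proportion of mislabelled genes in the training set.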
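The monotonic behaviour of f(k) can be checked numerically. The following Python sketch evaluates f(k) = kBp / (A + kB) and its closed-form derivative pAB / (A + kB)^2 for a few sub-sampling proportions; the values chosen for A, B, and p are hypothetical and serve only as an illustration.

```python
def mislabelled_fraction(k, A, B, p):
    """f(k) = kBp / (A + kB): share of the training set that is mislabelled."""
    return k * B * p / (A + k * B)


def slope(k, A, B, p):
    """Closed-form derivative df/dk = pAB / (A + kB)^2."""
    return p * A * B / (A + k * B) ** 2


# Hypothetical values chosen only for illustration.
A, B, p = 30, 200, 0.05
ks = [0.1, 0.25, 0.5, 0.75, 1.0]
values = [mislabelled_fraction(k, A, B, p) for k in ks]

# Central finite difference at k = 0.5, to compare with the closed form.
h = 1e-6
numeric_slope = (mislabelled_fraction(0.5 + h, A, B, p)
                 - mislabelled_fraction(0.5 - h, A, B, p)) / (2 * h)
```

With these toy values the mislabelled fraction grows steadily as k increases, in agreement with the positive derivative.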

A simulation study was previously performed to determine the effect of proportion of overlapping classes on the performance of machine learning classifier [266]. It demonstrated that the performance of classifiers, as measured by AUC, is poorer when a higher proportion of overlapping classes are present in the training set [266]. It is therefore reasonable to apply the sub-sampling technique, which sets k < 1 to reduce bias and improve performance.

[Figure 9.1, panel (a): Training machine learning models with only one training set. Using all genes for model building: true virulence genes may be incorrectly labelled as negative gene examples, resulting in the non-discovery of other false negative genes.]

[Figure 9.1, panel (b): Training machine learning models with multiple training subsets. Random exclusion of a subset of non-positive training genes (indicated by ?) permits false negative genes to be predicted positive.]

Figure 9.1: Sub-sampling of training data to allow the discovery of false-negative genes. ◦: known virulence genes. ×: genes labelled as non-virulence-related genes in the training set. ?: genes excluded from the training set.

Procedure

Each inductive CGP model was trained with only a subset of the unknown genes as negatives. For each gene or gene group, all virulence genes in the group were labelled as positive gene examples. Three-quarters of the remaining candidate genes were randomly sampled without replacement and assigned a negative class. Predictions were made on the remaining one-quarter of the unknown genes, and a score was obtained from each run for each gene to be predicted. This procedure was repeated for 1000 runs for each gene category to improve coverage. The scores from all runs were averaged by arithmetic mean for each gene, and these averages formed the basis of the ranking.
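The sub-sampling loop can be sketched in Python as follows. This is a simplified illustration and not the implementation used in this thesis: the four machine learning algorithms are replaced by a nearest-centroid stand-in scorer, and the function names, the toy occurrence profiles, and the reduced number of runs are all hypothetical. In each run, three-quarters of the unknown genes are sampled without replacement as negatives, the remaining quarter is scored, and the scores are averaged over runs.

```python
import random
from statistics import mean


def centroid(profiles):
    """Column-wise mean of a list of 0/1 occurrence profiles."""
    return [mean(column) for column in zip(*profiles)]


def similarity(profile, centre):
    """1 minus the mean absolute difference to a centroid (1.0 = identical)."""
    return 1 - mean(abs(x - c) for x, c in zip(profile, centre))


def subsampled_scores(positives, unknowns, runs=1000, train_fraction=0.75, seed=0):
    """Average held-out score per unknown gene over repeated sub-sampling runs."""
    rng = random.Random(seed)
    gene_ids = sorted(unknowns)
    n_train = int(len(gene_ids) * train_fraction)
    pos_centre = centroid(list(positives.values()))
    collected = {g: [] for g in gene_ids}
    for _ in range(runs):
        negatives = set(rng.sample(gene_ids, n_train))  # sampled without replacement
        neg_centre = centroid([unknowns[g] for g in negatives])
        for g in gene_ids:
            if g not in negatives:  # predict only on the held-out quarter
                score = (similarity(unknowns[g], pos_centre)
                         - similarity(unknowns[g], neg_centre))
                collected[g].append(score)
    return {g: mean(s) for g, s in collected.items() if s}


# Toy occurrence profiles over four genomes (hypothetical).
positives = {"vir1": [1, 1, 0, 0], "vir2": [1, 1, 0, 0]}
unknowns = {
    "hidden": [1, 1, 0, 0],  # unlabelled gene sharing the virulence profile
    "u1": [0, 0, 1, 1], "u2": [0, 1, 1, 1], "u3": [0, 0, 1, 0], "u4": [1, 0, 1, 1],
    "u5": [0, 0, 0, 1], "u6": [0, 1, 1, 0], "u7": [1, 0, 0, 1],
}
scores = subsampled_scores(positives, unknowns, runs=200)
best = max(scores, key=scores.get)
```

Because a gene is only ever scored in runs where it was left out of the negative set, an unlabelled gene that shares the occurrence profile of the training genes is never forced into the negative class at the moment it is predicted.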

Combining ranks from multiple models

A final aggregated rank was obtained by combining the four ranks from all machine learning algorithms by using Equation 4.1 (page 78). The gene ranks were also clustered into orthologous groups as described in Section 8.1.4.

9.2 Results

9.2.1 Rediscovery of GBS virulence genes or gene categories

The cross-validation procedure described in Section 9.1.1 was performed. As expected, most gene categories with exactly one orthologous gene achieved perfect AUCs in leave-one-out cross-validation, suggesting that the occurrence profiles of orthologous genes are near-identical. A perfect AUC was also achieved in the rediscovery of the GBS minor pilin cluster and the neu cluster. Overall, the currently known GBS virulence genes achieved rediscovery AUCs between 0.80–0.96 with n-fold stratified cross-validation. AUCs of 0.96–0.97 were achieved for rediscovering GBS adhesins. AUCs of 0.89–0.99 were obtained for the rediscovery of invasins, and 0.95–0.98 for the recovery of genes related to host immune evasion. The results of the rediscovery experiments are listed in Table 9.2.

9.2.2 De novo discovery of GBS virulence genes

A total of 60,000 (15 categories × 4 algorithms × 1,000 sub-sampling runs) instances of inductive CGP were performed. The top-10 genes (out of 2370 orthologous groups) from the final 15 aggregated ranks are shown in Tables 9.3 to 9.10.

9.2.3 Adhesion genes fbsA, fbsB, and pavA (Table 9.3)

The top-10 genes ranked with the fibrinogen-binding protein gene fbsA included a β-haemolysin/cytolysin (β-h/c) gene, cylJ, and the cps cluster genes cpsK/L. Several genes encoding hypothetical proteins were also ranked within the top-10, including C0348, C1856, C1977, C0753 (which encodes a possible methyltransferase type 11), C0255, and C1271 (which encodes a putative membrane protein). A gene encoding a patatin-like phospholipase family protein (C1924) was ranked second in the combined prioritised rank.

The top of the fbsB gene rank included genes encoding a peptidoglycan-linked pathogenicity protein (C1927), a plasmid recombination protein (C0222), and a putative staphylokinase homolog (C1080). Several highly ranked genes were found to have collagen- or fibronectin-binding domains, including a surface protein gene (C1377, similar to a staphylococcal adhesion protein) and a gene encoding a Cna protein B-type domain protein (C0623).

The pavA gene was found to be associated with two metallo-β-lactamase genes (C1137, C1659) and the competence gene comFA (C0327). Other genes are listed in Table 9.3.

9.2.4 lmb and scpB genes (Table 9.4)

The lmb gene is associated with zinc-binding adhesion lipoprotein and permease genes (adcA/B; C0520 and C0154), and manganese-binding adhesion lipoprotein and permease genes (mtsA/C; C1445 and C1443). A different gene encoding a laminin-binding surface lipoprotein (C1821) was also highly ranked. A collagenase U32-family peptidase gene (C0709) was also ranked within the top-10.

The scpB gene was closely linked with the cspA gene, an AcuB family protein gene (C1483), a PfkB family gene (C0664), a hypothetical protein gene (C2042), a cadmium resistance accessory protein gene (C1194), and a gene encoding a helicase family protein containing an Snf2 domain (C1211).

9.2.5 GBS minor pilin genes (Table 9.5)

Genes in the minor pilin cluster were closely associated with cspA. Also highly ranked were two MutT/NUDIX family protein genes (C1419, C1146), a putative cell-surface hydrolase gene (C0106), and a gene encoding an Snf2 family protein (C1211, which was also highly ranked in the scpB rank).

9.2.6 Genes encoding invasins spb1 and bca (Table 9.6)

The spb1 gene, which encodes a cell-surface invasin, was found to be associated with genes in the minor pilin cluster (C0619, C0622), the recombination protein gene (pre, C0222), and a gene encoding a CnaB-type domain protein (C2129). The spb1 gene was also closely linked with a transmembrane protein gene (vex3, C0651).

The inductive CGP of Cα-protein genes produced a rank with the fbsA and cpsO genes placed at the top of the aggregated rank. Other top-scored genes associated with bca are enumerated in Table 9.6.

9.2.7 Cytolysins: the cyl gene cluster and CAMP factor (cfb gene)

The cyl gene cluster is closely associated with the fatty acid biosynthesis genes (fabDFGZ) and the acetoin reductase gene C0508. The fibrinogen-binding protein gene fbsA (C1007) was ranked among the top-10 genes in the cyl rank. The cfb gene was found to be associated with the streptococcal hyaluronate lyase gene hylB and a number of hypothetical proteins listed in Table 9.7.

9.2.8 Genes encoding enzymes that degrade components of the extracellular matrix (cspA and hylB genes, Table 9.8)

The cspA gene is closely associated with scpB, the cadmium resistance protein gene cadC, the AcuB family protein gene C1483, the SNF2-family protein gene C1211, the bca and bac genes, and the cpsL gene from the cps gene cluster. The hyaluronidase gene hylB is closely associated with the glucuronyl hydrolase gene (ugl; C1789), the β-glucuronidase gene C0665, the α-galactosidase gene C2046, the neuraminidase/sialidase gene C1816, and the gene encoding exfoliative toxin A (C1166).

9.2.9 Genes involved in host immune system evasion (cps and neu gene clusters, Table 9.9, and penicillin-binding protein 1A, Table 9.10)

The cps gene cluster is closely linked with several family 1 and family 2 glycosyltransferases (family 1: C1330, C1367; and family 2: C1459). It is also closely associated with the bac and Mur ligase (C0849) genes, and the signal peptidase gene lepB (C1622). The neu genes are associated with the same family 1 glycosyltransferases, C1330 and C1367, as the cps cluster, with sensor histidine kinases (vncS and C1985), and with the cpsL gene. Several hypothetical proteins were also highly ranked.

The penicillin-binding protein 1A is associated with other peptidoglycan-related genes, including several penicillin-binding protein genes (pbp1B, pbp2B, and pbp2A), undecaprenyl pyrophosphate phosphatase gene (uppP), alanine racemase gene (alr), and the glutamate racemase gene (murI).

9.3 Discussion

This chapter used inductive CGP to suggest a number of candidate genes sharing similar phylogenetic profiles with currently known virulence genes. As discussed in Chapter 5, inductive CGP searches for genes that co-occur with the training set genes across multiple genomes. Genes contributing to GBS virulence are postulated to share similar occurrence patterns and thus to be discoverable by machine learning. Some top-ranked genes could have biological significance, and their possible pathogenic roles are discussed in the next chapter.

Inductive CGP achieved encouraging cross-validation results in the rediscovery experiment with training genes in each category. In addition, near-perfect results were achieved in the gene categories possessing exactly one orthologous gene.

Similar to the findings in Chapter 6.4, the inductive CGP experiments carried out on each functional category (adhesins, invasins, immune evasion genes, and all virulence-related genes) all achieved very good AUCs, which further supported the hypothesis that genes contributing to the same function tend to coexist in bacterial genomes.

Examining the quality of the genes discovered at the top of the ranks provides another validation of inductive CGP in the de novo discovery of virulence genes. For instance, several peptidoglycan-specific genes were identified from the pbp1A rank (Table 9.10). Seven peptidoglycan-related genes were ranked within the top-10 of the aggregated rank, suggesting that inductive CGP is a reliable method for functional discovery. This result is comparable with the rank derived in Chapter 6.1, where pbp1A was ranked 6th (0.28 pct) in the statistical CGP of peptidoglycan-related genes.

Table 9.2: Inductive CGP performance (AUC) of rediscovering the training set genes with stratified n-fold cross-validation

                                    Machine learning algorithms
Gene/Gene categories          n     SVM/Poly  SVM/RBF  ADTree  IBk
Virulence-related genes       134   0.960     0.951    0.848   0.980
Adhesins                      30    0.965     0.960    0.968   0.961
  fbsA                        3     0.961     0.754    0.888   0.677
  fbsB                        3     0.957     0.959    0.874   0.971
  pavA                        3     1         1        1       1
  scpB∗                       3     1         1        1       1
  lmb                         3     1         1        1       1
  Minor pilin gene cluster    15    1         1        1       1
Invasins                      51    0.982     0.954    0.929   0.974
  cyl gene cluster            36    0.980     0.962    0.950   0.988
  cfb                         3     1         1        1       1
  spb1                        3     1         1        1       1
  hylB                        3     1         1        1       1
  Cα family proteins†         3     0.978     0.979    0.933   0.967
Immune evasion genes          60    0.982     0.954    0.929   0.974
  bac‡                        –     –         –        –       –
  cps gene cluster            37    0.967     0.948    0.960   0.966
  neu gene cluster            12    1         1        1       1
  cps-neu gene cluster§       49    0.979     0.960    0.970   0.974
  cspA¶                       3     1         1        1       1
  pbp1A/ponA                  3     1         1        1       1

∗ scpB was also included as an immune evasion gene.
† Including both bca and rib; also included as immune evasion genes.
‡ bac was represented by fewer than two genes in the three reference genomes studied. No rediscovery experiment was performed.
§ Including all genes from the cps-neu operon.
¶ cspA was also included as an invasin.

Table 9.3: Top-10 genes ranked by inductive CGP (fbsA/B and pavA genes)

Rank  Cluster    Gene / function                                           Locus tags in ref. genomes

FbsA gene training set (n_p = 3)
1     C0348/S    hypothetical protein                                      SAG0357, SAK0431, GBS0344
2     C1924/R    patatin-like phospholipase family protein                 SAG2059, SAK1997, GBS2014
3     C1856/S    hypothetical protein                                      SAG1975, SAK1935, GBS1961
4     C0642/GC   [cylJ] cylJ protein                                       SAG0672, SAK0800, GBS0654
5     C1977/S    hypothetical protein                                      SAG2119
6     C1115      [cpsK] capsular polysaccharide biosynthesis protein CpsK  SAG1163, SAK1252, GBS1237
7     C0753/QR   hypothetical protein                                      SAG0786, SAK0911, GBS0806
8     C0255      hypothetical protein                                      SAG0263, SAK0335, GBS0253
9     C1271/S    hypothetical protein                                      SAG1345, SAK1376, GBS1415
10    C2124      type IIG restriction enzyme and methyltransferase         SAK1333, GBS1324

FbsB gene training set (n_p = 3)
1     C1927      pathogenicity protein, putative, interruption N-terminus  SAG2063, SAK2002, GBS2018
2     C0222/D    [pre] plasmid recombination enzyme                        SAG0226, SAK0286, GBS0219
3     C1080      hypothetical protein                                      SAG1127, GBS1195
4     C2178      hypothetical protein                                      SAK2093
5     C1377      cell wall surface anchor family protein                   SAG1462, SAK1493, GBS1529
6     C2043/G    PTS system, galactitol-specific IIB component, putative   SAK0525
7     C0623/M    Cna protein B-type domain                                 SAG0651, SAK0782, GBS0636
8     C1412      hypothetical protein                                      SAG1498, SAK1524, GBS1559
9     C2045/G    PTS system, galactitol-specific IIB component, putative   SAK0530
10    C1177/L    ISSdy1, transposase OrfA                                  SAG1228/1243, SAK1314/1322, GBS1300/1310

PavA gene training set (n_p = 3)
1     C0498/R    HD domain protein                                         SAG0512, SAK0662, GBS0558
2     C1161/R    [rnz] ribonuclease Z                                      SAG1210, SAK1296, GBS1282
3     C1137/R    metallo-β-lactamase superfamily protein                   SAG1186, SAK1273, GBS1259
4     C1072/E    [thrB] homoserine kinase                                  SAG1119, SAK1204, GBS1186
5     C1659/R    metallo-β-lactamase superfamily protein                   SAG1761, SAK1783, GBS1804
6     C0850      conserved hypothetical protein TIGR00159                  SAG0885, SAK1008, GBS0902
7     C0972/R    GTP-binding protein                                       SAG1013, SAK1108, GBS1048
8     C1249/C    [fni] isopentenyl pyrophosphate isomerase                 SAG1323, SAK1354, GBS1393
9     C0327/L    [comFA] competence protein ComFA, putative                SAG0336, SAK0406, GBS0324
10    C1194/K    [cadC] cadmium resistance accessory protein CadX          SAG1258, SAK2052, GBS2065

NB: Top-10 (out of 2370) genes were listed. Orthologous genes were collated by using reciprocal best BLAST hits [152] across the three index genomes. The functional cluster of orthologous groups (COG) [175] for each gene was also described.

Table 9.4: Top-10 genes ranked by inductive CGP (scpB and lmb genes)

Rank  Cluster    Gene / function                                                           Locus tags in ref. genomes

scpB gene training set (n_p = 3)
1     C0646/O    [cspA] cell surface serine endopeptidase CspA                             SAG0676/2053, SAK0804/1991, GBS2008
2     C1483/R    AcuB family protein                                                       SAG1577, SAK1593, GBS1627
3     C0664/G    carbohydrate kinase, PfkB family                                          SAG0697/1906, SAK0823, GBS0670/1893
4     C2042/QR   hypothetical protein                                                      SAK0522, GBS0486
5     C1194/K    [cadC] cadmium resistance accessory protein CadX                          SAG1258, SAK2052, GBS2065
6     C1211/KL   SNF2 family protein                                                       SAG1280/1618, SAK1633, GBS1352/1353/1666
7     C1260      [def] peptide deformylase                                                 SAG1334, SAK1365, GBS1404
8     C0790      polysaccharide deacetylase family protein                                 SAG0824, SAK0948, GBS0842
9     C0245/J    acetyltransferase, GNAT family                                            SAG0252, SAK0327, GBS0245
10    C2136/R    phenazine biosynthesis protein, PhzF family                               SAK1767

lmb gene training set (n_p = 3)
1     C0520/RP   [adcA] zinc ABC transporter, zinc-binding adhesion lipoprotein            SAG0535, SAK0685, GBS0580
2     C1443/P    [mtsC] manganese ABC transporter, permease protein                        SAG1531, SAK1554, GBS1587
3     C1821/P    laminin-binding surface protein                                           SAG1938, SAK1898, GBS1926
4     C0154/P    [adcB] zinc ABC transporter, permease protein                             SAG0156, SAK0219, GBS0152
5     C1445/P    [mtsA] manganese ABC transporter, manganese-binding adhesion lipoprotein  SAG1533, SAK1556, GBS1589
6     C1497/P    K+ transporter (Trk) family protein TrkH, putative                        SAG1591, SAK1606, GBS1640
7     C0709      peptidase, U32 (collagenase) family                                       SAG0742, SAK0868, GBS0763
8     C2042/QR   hypothetical protein                                                      SAK0522, GBS0486
9     C0880/L    LambdaSa2, site-specific recombinase, phage integrase family              SAG0915/1885/1986/1993, SAK1943/2059/2094, GBS0482/1224/1969/2073
10    C1097/E    sodium:alanine symporter family protein                                   SAG1145, SAK1231, GBS1212

NB: Top-10 (out of 2370) genes were listed. Orthologous genes were collated by using reciprocal best BLAST hits [152] across the three index genomes. The functional cluster of orthologous groups (COG) [175] for each gene was also described.

Table 9.5: Top-10 genes ranked by inductive CGP (GBS minor pilin genes)

Rank | Cluster | Gene / function | Locus tags in ref. genomes

Minor pilin genes training set (np = 15)
1 | C0646/O | [cspA] cell surface serine endopeptidase CspA | SAG0676/2053, SAK0804/1991, GBS2008
2 | C1419/LR | MutT/nudix family protein | SAG1505, SAK1529, GBS1564
3 | C2103 | prophage LambdaSa04, holin | SAK0760
4 | C0106/R | hypothetical protein | SAG0108, SAK0158, GBS0107
5 | C1211/KL | SNF2 family protein | SAG1280/1618, SAK1633, GBS1352/1353/1666
6 | C0249/K | RNA polymerase sigma factor, ECF subfamily | SAG0256, SAK0331, GBS0249
7 | C1146/LR | MutT/nudix family protein | SAG1195, SAK1282, GBS1268
8 | C1901 | hypothetical protein | SAG2030, SAK1968, GBS1988
9 | C0442/T | hypothetical protein | SAG0455, SAK0556, GBS0502
10 | C0784/S | probable proton-coupled thiamine transporter YuaJ | SAG0817, SAK0940, GBS0835

NB: Top-10 (out of 2370) genes were listed. Orthologous genes were collated by using reciprocal best BLAST hits [152] across the three index genomes. The functional cluster of orthologous groups (COG) [175] for each gene was also described.

Table 9.6: Top-10 genes ranked by inductive CGP (spb1 and bca genes)

Rank | Cluster | Gene / function | Locus tags in ref. genomes

spb1 gene training set (np = 3)
1 | C2129/M | cna B-type domain protein | SAK1441
2 | C2046/G | α-galactosidase, putative | SAK0535
3 | C0651/V | [vex3] ABC transporter, permease protein Vexp3 | SAG0615/0682, SAK0700/0810, GBS0657
4 | C0619/M | cell wall surface anchor family protein | SAG0646/1404, SAK0777, GBS0629/1474
5 | C1191 | Tn5252, Orf 9 protein | SAG1251, GBS1339
6 | C0036 | hypothetical protein | SAG0037, SAK0070, GBS0036
7 | C0218 | replication initiation factor, RepA family | SAG0222/1299, SAK0282, GBS0215/0408/0738/0971/1149/1372
8 | C0622/M | cell wall surface anchor family protein, putative | SAG0649/1408, SAK0780, GBS0632/1478
9 | C0451 | hypothetical protein | SAG0465, SAK0567, GBS0512
10 | C2044/G | [rhaD] rhamnulose-1-phosphate aldolase | SAK0527

bca gene training set (np = 3)
1 | C1007 | cell wall surface anchor family protein, putative | SAG1052, SAK1142, GBS1087
2 | C0613/S | hypothetical protein | SAG0636/2111, SAK0719/0769, GBS0616
3 | C0435/L | transposase, IS256 family | SAG0448
4 | C2061/KL | prophage LambdaSa03, helicase, putative | SAK0617
5 | C1224/L | prophage LambdaSa2, type II DNA modification methyltransferase, putative | SAG1297/1869, SAK0739, GBS1370
6 | C0413/R | pyridoxamine 5’-phosphate oxidase family | SAG0422, SAK0503, GBS0457
7 | C0897/D | Tn916, FtsK/SpoIIIE family protein | SAG0933, SAK2056, GBS1320/2069
8 | C1117/M | [cpsO] capsular polysaccharide biosynthesis protein CpsI | SAG1165/1166/1455, SAK1254/1488, GBS1239/1524
9 | C0585/M | prophage LambdaSa03, peptidoglycan endolysin | SAG0604, SAK0653
10 | C1367/M | glycosyl transferase, group 1 family protein | SAG1448, SAK1481, GBS1517

NB: Top-10 (out of 2370) genes were listed. Orthologous genes were collated by using reciprocal best BLAST hits [152] across the three index genomes. The functional cluster of orthologous groups (COG) [175] for each gene was also described.

Table 9.7: Top-10 genes ranked by inductive CGP (cyl and cfb genes)

Rank | Cluster | Gene / function | Locus tags in ref. genomes

cyl gene cluster training set (np = 36)
1 | C0342/I | [fabZ] (3R)-hydroxymyristoyl ACP dehydratase | SAG0351, SAK0425, GBS0338
2 | C0340/IQ | [fabF] 3-oxoacyl-(acyl carrier protein) synthase | SAG0349, SAK0423, GBS0336
3 | C0508/IQR | acetoin reductase | SAG0523, SAK0674, GBS0569
4 | C0339/IQR | [fabG] 3-ketoacyl-(acyl-carrier-protein) reductase | SAG0348/1904, SAK0422, GBS0335/1891
5 | C0338/I | [fabD] acyl-carrier-protein S-malonyltransferase | SAG0347, SAK0421, GBS0334
6 | C1044/R | oxidoreductase, short chain dehydrogenase/reductase family | SAG1091, SAK1176, GBS1158
7 | C1462/S | hypothetical protein | SAG1554, SAK1573, GBS1608
8 | C1455/IQR | [fabG] 3-ketoacyl-(acyl-carrier-protein) reductase | SAG1544, SAK1566, GBS1600
9 | C1860 | hypothetical protein | SAG1979/2034, SAK1974, GBS1992
10 | C0257 | hypothetical protein | SAG0265, SAK0337, GBS0255

cfb gene training set (np = 3)
1 | C0538/K | prophage LambdaSa1, antirepressor, putative | SAG0555
2 | C0560 | hypothetical protein | SAG0577, SAK0625
3 | C1716/S | hypothetical protein | SAG1820, SAK1840, GBS1861
4 | C0429 | hypothetical protein | SAG0441, SAK0544, GBS0488
5 | C0617 | transcriptional regulator, AraC family | SAG0644, SAK0775, GBS0627
6 | C0862 | CRISPR-associated SAG0897 family protein | SAG0897, SAK1020, GBS0914
7 | C0438 | bacteriocin transport accessory protein, putative | SAG0451, SAK0553, GBS0498
8 | C1148 | [hylB] hyaluronate lyase | SAG1197, SAK1284, GBS1270
9 | C1819/GT | PTS system, galactitol-specific IIA component, putative | SAG1935, SAK0524/0528/1895, GBS1922
10 | C1177/L | ISSdy1, transposase OrfA | SAG1228/1243, SAK1314/1322, GBS1300/1310

NB: Top-10 (out of 2370) genes were listed. Orthologous genes were collated by using reciprocal best BLAST hits [152] across the three index genomes. The functional cluster of orthologous groups (COG) [175] for each gene was also described.

Table 9.8: Top-10 genes ranked by inductive CGP (cspA and hylB genes)

Rank | Cluster | Gene / function | Locus tags in ref. genomes

cspA gene training set (np = 3)
1 | C0407/O | [scpB] streptococcal C5a peptidase | SAG0416, SAK1320, GBS0451/1308
2 | C1194/K | [cadC] cadmium resistance accessory protein CadX | SAG1258, SAK2052, GBS2065
3 | C1483/R | AcuB family protein | SAG1577, SAK1593, GBS1627
4 | C1211/KL | SNF2 family protein | SAG1280/1618, SAK1633, GBS1352/1353/1666
5 | C2061/KL | prophage LambdaSa03, helicase, putative | SAK0617
6 | C0753/QR | hypothetical protein | SAG0786, SAK0911, GBS0806
7 | C0430/J | acetyltransferase, GNAT family | SAG0442/0443, SAK0545, GBS0489/0490
8 | C1332/R | polysaccharide biosynthesis protein | SAG1412, SAK1447, GBS1482
9 | C1161/R | [rnz] ribonuclease Z | SAG1210, SAK1296, GBS1282
10 | C1074 | polysaccharide deacetylase family protein | SAG1121, SAK1206, GBS1188

hylB gene training set (np = 3)
1 | C1789/R | glucuronyl hydrolase | SAG1901, GBS1889
2 | C1665/R | 5’-nucleotidase, lipoprotein e(P4) family | SAG1767, SAK1789, GBS1810
3 | C1816/G | neuraminidase-related protein | SAG1932, SAK1891/1892, GBS1919
4 | C1819/GT | PTS system, galactitol-specific IIA component, putative | SAG1935, SAK0524/0528/1895, GBS1922
5 | C2046/G | α-galactosidase, putative | SAK0535
6 | C0665/G | β-glucuronidase | SAG0698, SAK0824, GBS0671
7 | C1166/P | exfoliative toxin A, putative | SAG1215, SAK1301, GBS1287
8 | C1014/E | glycine cleavage system H protein, putative | SAG1059, SAK1148, GBS1093
9 | C0664/G | carbohydrate kinase, PfkB family | SAG0697/1906, SAK0823, GBS0670/1893
10 | C1654/E | [ltaE] low specificity L-threonine aldolase | SAG1756, SAK1778, GBS1799

NB: Top-10 (out of 2370) genes were listed. Orthologous genes were collated by using reciprocal best BLAST hits [152] across the three index genomes. The functional cluster of orthologous groups (COG) [175] for each gene was also described.

Table 9.9: Top-10 genes ranked by inductive CGP (cps and neu gene clusters)

Rank | Cluster | Gene / function | Locus tags in ref. genomes

cps gene cluster training set (np = 37)
1 | C1330/M | glycosyl transferase, group 1 family protein | SAG1410, SAK1445, GBS1480
2 | C1622/U | [lepB] signal peptidase I | SAG1723, SAK1443/1731, GBS1768
3 | C1367/M | glycosyl transferase, group 1 family protein | SAG1448, SAK1481, GBS1517
4 | C0359/K | transcriptional regulator, putative | SAG0368, SAK0442, GBS0355
5 | C0849/M | mur ligase family protein | SAG0884, SAK1007, GBS0901
6 | C0424/D | [bag] cell wall surface anchor family protein, truncation | SAG0433, SAK0186/0517/0722/0771, GBS0470/0619
7 | C1695/G | [xfp] putative phosphoketolase | SAG1799, SAK1819, GBS1840
8 | C1231/E | [mmuM] homocysteine methyltransferase | SAG1305, SAK1337, GBS1377
9 | C1459/M | glycosyl transferase, group 2 family protein | SAG1548/1551, SAK1570, GBS1605
10 | C1193/L | transposase, ISL3 family | SAG1253

neu gene cluster training set (np = 12)
1 | C1977/S | hypothetical protein | SAG2119
2 | C1985/T | sensor histidine kinase, putative | SAG2127, SAK2066, GBS2086
3 | C1483/R | AcuB family protein | SAG1577, SAK1593, GBS1627
4 | C0596/T | [vncS] sensor histidine kinase VncS, putative | SAG0617, SAK0188/0702, GBS0598
5 | C1367/M | glycosyl transferase, group 1 family protein | SAG1448, SAK1481, GBS1517
6 | C0054/E | [thrC] threonine synthase | SAG0055, SAK0088, GBS0055
7 | C1172/S | hypothetical protein | SAG1223, SAK1309, GBS1295
8 | C1330/M | glycosyl transferase, group 1 family protein | SAG1410, SAK1445, GBS1480
9 | C1114/R | [cpsL] capsular polysaccharide repeat unit transporter | SAG1162, SAK1251, GBS1237
10 | C2124 | type IIG restriction enzyme and methyltransferase | SAK1333, GBS1324

NB: Top-10 (out of 2370) genes were listed. Orthologous genes were collated by using reciprocal best BLAST hits [152] across the three index genomes. The functional cluster of orthologous groups (COG) [175] for each gene was also described.

Table 9.10: Top-10 genes ranked by inductive CGP (pbp1A gene)

Rank | Cluster | Gene / function | Locus tags in ref. genomes

pbp1A/ponA gene training set (np = 3)
1 | C0156/M | penicillin-binding protein 1B, putative | SAG0159, SAK0222, GBS0155
2 | C1930/M | [pbp2A] penicillin-binding protein 2A | SAG2066, SAK2005, GBS2020
3 | C0030/M | [zooA] peptidase, M23/M37 family | SAG0031, SAK0064, GBS0030
4 | C0136 | [uppP] undecaprenyl pyrophosphate phosphatase | SAG0138, SAK0196, GBS0134
5 | C1519 | hypothetical protein | SAG1613, SAK1628, GBS1662
6 | C1586/M | [alr] alanine racemase | SAG1684, SAK1696, GBS1728
7 | C1823/TK | [relA] GTP pyrophosphokinase family protein | SAG1940, SAK1900, GBS1928
8 | C0491 | [hup] HU like DNA-binding protein | SAG0505, SAK0606, GBS0551
9 | C1506/M | [murI] glutamate racemase | SAG1600, SAK1615, GBS1649
10 | C0732/M | penicillin-binding protein 2b | SAG0765, SAK0890, GBS0785

NB: Top-10 (out of 2370) genes were listed. Orthologous genes were collated by using reciprocal best BLAST hits [152] across the three index genomes. The functional cluster of orthologous groups (COG) [175] for each gene was also described.

Chapter 10

Biological significance of prioritised genes

Figure 10.1: Potential GBS virulence factors discovered by co-occurrence-based CGP methods. GT: glycosyltransferase families; βGL: β-glucuronidase; UGL: unsaturated glucuronyl hydrolase. This illustration is original work.

10.1 Glycosyltransferases

10.1.1 Family 8 glycosyltransferases

Three genes encoding family 8 glycosyltransferases (GT8), C1925, C1375, and C1374, were ranked within the top five of the prioritised list produced by statistical CGP. GT8 enzymes are involved in lipopolysaccharide (LPS) synthesis, adopting the roles of LPS α-1,3-galactosyltransferase and LPS 1,2-glucosyltransferase [267]. The GBS GT8 protein genes identified by statistical CGP showed up to 27% sequence identity and were 50% conserved compared with GT8 genes in Escherichia coli K-12 (rfaI/waaO genes: LPS α-1,3-galactosyltransferase; rfaJ/waaR genes: LPS 1,3-galactosyltransferase and LPS 1,2-glucosyltransferase) and in Salmonella enterica serovar Typhi (rfaI/waaI genes and rfaJ/waaJ genes).

Lipopolysaccharide is an integral component of the cell envelope of Gram-negative bacteria. The structure of LPS consists of a hydrophobic lipid A, inner and outer core oligosaccharides, and a highly variable distal O-antigen responsible for the determination of bacterial serovars [268]. In mammals, the LPS of Gram-negative bacteria is a major virulence factor, in which the lipid A component triggers a toll-like receptor (TLR) 4-CD14 mediated mechanism that induces septic-shock syndrome [269].

The family 8 glycosyltransferases encoded by rfaI and rfaJ are known to be involved in the biosynthesis of the LPS outer core [270, 271], a main structural component that is important for bacterial adhesion and internalisation into epithelial cells [272, 273]. Disruption of the LPS outer core increases susceptibility to antimicrobial peptides and attenuates virulence in Actinobacillus pleuropneumoniae [274].

The multi-genomic analysis with statistical CGP revealed the presence of several genes containing sequences of GT8 domains. These genes achieved high sensitivities (93%) and specificities (89–93%) across a range of positive genome examples consisting of both Gram-positive and Gram-negative neonatal pathogens. The presence of GT8 genes in genomes of neonatal pathogens may suggest that these GT8 enzymes are contributing factors to neonatal infections. Such genes are yet to be characterised in GBS and other Gram-positive pathogens. Thus, it can be postulated that these glycosyltransferases may be involved in the synthesis of cell-surface exopolysaccharides that mediate bacterial adhesion or trigger inappropriate innate immune responses in newborns.
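
The sensitivity and specificity of a gene, in the sense used above, can be read as the fraction of positive genome examples that carry the gene and the fraction of negative genome examples that lack it. The following is a minimal sketch under that reading; the function name, the genome labels and the toy presence profile are illustrative assumptions, not data from this work.

```python
def sensitivity_specificity(presence, positives, negatives):
    """Score one gene's presence/absence profile against genome labels.

    presence  -- set of genome names in which the gene is present
    positives -- set of genomes showing the phenotype of interest
    negatives -- set of genomes not showing it
    """
    true_pos = len(presence & positives)
    false_neg = len(positives - presence)
    true_neg = len(negatives - presence)
    false_pos = len(presence & negatives)
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical genome examples: P* are positive, N* are negative.
positives = {"P1", "P2", "P3", "P4"}
negatives = {"N1", "N2", "N3", "N4", "N5"}
presence = {"P1", "P2", "P3", "N1"}  # genomes carrying the candidate gene

sens, spec = sensitivity_specificity(presence, positives, negatives)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```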

10.1.2 Family 1 and 2 glycosyltransferases

Several genes encoding family 1 (GT1: C1330, C1367) and family 2 (GT2: C1459) glycosyltransferases ranked highly in the prioritised lists for the cps and neu gene clusters (Table 9.9). Both family 1 and 2 glycosyltransferases have a diverse range of glycosyltransferase and galactosyltransferase activities [267]. Many genes in the GBS cps cluster are glycosyltransferase genes, including cpsF/G genes (GT1) and cpsJMNO genes (GT2) [267]. The co-occurrence of these glycosyltransferase genes may suggest roles in the synthesis of currently unidentified antigenic structures, which may be involved in defining the serovariants of non-typeable GBS strains.

10.2 Adhesins

10.2.1 Gene products that may promote adherence to extracellular matrix and epithelial cells

Several adhesin genes were placed highly in the inductive CGP-ranked list trained with the fibrinogen-binding protein gene fbsB. Based on sequence homology, these genes are predicted to encode a zinc-binding (C0520) and a manganese-binding (C1445) adhesion lipoprotein, which have homologs (adcA and psaA/mtsA genes) in S. pneumoniae genomes [275]. These metal-binding lipoproteins have been demonstrated to mediate indirect adherence to epithelial cells and to contribute to the virulence of several Gram-positive pathogens [275]. In addition to these lipoproteins, the gene product of C1927, a putative pathogenicity protein, is distantly similar (24% identity and 34% conserved) to Ebh, a large 1.1-MDa fibronectin-binding surface protein in S. aureus [276]. The co-occurrence of these genes with fbsB may suggest that binding to the extracellular matrix (ECM) by GBS requires synergistic actions of different bacterial surface components.

10.2.2 Adherence to collagen

The Cna protein is a cell-surface collagen-binding adhesin in S. aureus associated with staphylococcal keratitis [277]. The protein possesses a trench-like structure which provides a binding site for the collagen triple helix [203, 278]. Monoclonal antibodies against Cna caused the detachment of the protein from collagen in vivo [279]. In S. aureus, the affinity of the Cna protein for collagen is a determinant of the development of septic arthritis in a mouse model [280].

Tables 9.3 and 9.6 reveal that the products of several GBS genes containing the staphylococcal cna-B domain (C0623, C2129) were highly ranked in the inductive CGP of adhesin gene spb1 and fibronectin-binding protein gene fbsA. A Cna-like domain also exists in the gene encoding GBS52 (C0619: SAG0646) in the GBS minor pilin gene cluster [203]. The coexistence of cna-domain proteins with other adhesion genes may suggest that collagen adherence is a crucial step in assisting with GBS colonisation.

As multiple adhesin genes were highly ranked, the presence of these genes in S. agalactiae suggests that adherence to the host is facilitated by multiple complex mechanisms.

10.3 Mechanisms that may contribute to the degradation of extracellular matrix

Two classes of glycosidic hydrolases, the unsaturated glucuronyl hydrolase (C1789) and the β-glucuronidase (C0665), scored highly and may have roles in degrading glycosaminoglycan (GAG), complementing the function of hylB. The normal catabolism of hyaluronan involves several lysosomal enzymes, including hyaluronidases and β-glucuronidases [281]. Unsaturated glucuronyl hydrolase (ugl) is known to cleave the GlcA–Glc bond in partially degraded hyaluronan [282]. The disaccharide or oligosaccharide units produced by cleavage are further degraded by glucosidases or hyaluronate lyases, which release the end products as unsaturated glucuronate [283]. Acting together with the hyaluronate lyase HylB, these enzymes may facilitate the degradation of GAG and may thus assist with the spreading and colonisation of GBS in neonatal infection.

10.3.1 Metalloproteases

The peptidase U32 (C0709) is a metalloprotease sharing 34% sequence identity (50% conservation at the amino-acid level) with the product of the metalloprotease gene prtC found in the Gram-negative anaerobe Porphyromonas gingivalis, an important pathogen causing periodontitis. The PrtC metalloprotease is known to degrade type I collagen in gingival infection [284]. In humans, metalloprotease family proteins also have important physiological roles in pregnancy. While metalloproteases are tightly regulated in different stages of pregnancy and labour, an elevated expression of human matrix metalloprotease-9 (a gelatinase) is associated with bacterial invasion of the amniotic sac [285–288]. Whether the GBS metalloprotease could play a similar role in facilitating ascending infection from the lower genital tract of pregnant women is plausible but remains to be investigated.

10.4 Genes with possible roles in immune system evasion

10.4.1 Neuraminidase

The neuraminidase gene (C1816) was ranked third in the inductive CGP of the GBS hyaluronate lyase gene hylB. In S. pneumoniae, neuraminidase (encoded by nanA and nanB genes) is known to promote the formation of a biofilm, an important virulence factor that assists with immune system evasion, and whose formation could be suppressed by neuraminidase inhibitors [289]. Bacterial neuraminidase is thought to cause host tissue injury by cleaving terminal sialic acid residues of host polysaccharides [290].

10.4.2 Staphylokinase homologue

The gene encoding hypothetical protein C1080 (SAG1127 and GBS1195) shares 34% sequence identity with the staphylokinase gene in lysogenic strains of S. aureus. Staphylokinase is an enzyme that cleaves the Fc portion of human IgG and complement C3b, thus providing the pathogen with the ability to resist opsonophagocytosis [291]. Staphylokinase also facilitates bacterial colonisation by inactivating antimicrobial peptides (α-defensins) produced by neutrophils [292]. The virulence factor also catalyses the conversion of plasminogen to plasmin, causing degradation of collagen and elastin, which leads to widespread tissue damage [292]. The co-presence of C1080 with fbsB may suggest a synergistic role of a staphylokinase homologue in the GBS colonisation of pregnant women and neonates.

10.5 Other genes that may contribute to GBS virulence

The vex3 and vncS genes were highly ranked in the inductive CGP of the spb1 gene and the neu gene cluster, respectively (Tables 9.6 and 9.9). Both genes are located in a gene cluster at SAG0613-0618 in the SA-2603 genome. The vex3 gene encodes an ABC transporter and vncS encodes the sensor histidine kinase in a two-component (vncRS) regulatory system. The exact functions of the vex–vnc gene clusters have not been fully determined, but mutants lacking vex3 have been found to be associated with higher vancomycin tolerance in S. pneumoniae [293, 294].

Other highly ranked genes may also contribute to GBS virulence. These include a gene encoding an α-galactosidase, an enzyme found to be essential in pneumococcal virulence [295]. The gene encoding a putative exfoliative toxin A (C1166) was found co-occurring with the hylB gene; exfoliative toxin A has been implicated in exfoliative epidermal diseases in S. aureus infections [296]. The competence protein gene comFA was ranked highly with the pavA gene. Competence is an important naturally occurring mechanism for maintaining genetic diversity in streptococci, as genetic transformation through the acquisition of external DNA can assist with the sharing of virulence genes in Gram-positive bacteria [297].

Chapter 11

Conclusions

11.1 Summary of contributions

The main contribution of this thesis is to demonstrate how a rigorous application of biomedical informatics methods can contribute to the challenges of virulence prediction and virulence gene discovery in microbiology and infectious diseases medicine. It has attempted to predict clinical outcomes by applying machine learning methods to bacterial molecular epidemiology markers, and has explored two novel gene prioritisation methods for suggesting potential virulence genes. The specific contributions of this thesis are listed in the following sections.

11.1.1 “Typeable” may not translate into “predictive”

The predictive power of 18 well-studied bacterial molecular markers of virulence in Streptococcus agalactiae, a common Gram-positive pathogen that causes severe perinatal infections, was examined. GBS is part of the normal genital microflora carried by up to half of all pregnant women. Sporadic perinatal infections can occur, which manifest as sepsis, pneumonia, and meningitis in neonates (Chapter 2). Virulence prediction has the potential to optimise antibiotic prescribing. To investigate the ability of these markers to predict virulence from GBS genotypes, machine learning was applied to build and evaluate predictive models based on the molecular markers (Chapter 3).

The results from multiple analyses suggested that the molecular markers studied did not yield satisfactory classification of 912 invasive and colonising GBS isolates. Apart from the poor discriminatory power of the markers, the poor results could also be attributed to the overlapping of clinical groups in the case-control design, or to extensive horizontal gene transfer and positive selection pressure in bacteria, resulting in the disruption of linkage association and non-cosegregation between the markers and the true virulence genes. It was thus concluded that, until rapid whole-genome sequencing of individual isolates can be attained for clinical decision support, virulence prediction should be based on true virulence genes (virulence gene typing).

11.1.2 Development of new CGP methods for functional discovery of prokaryotic genes

Methods of comparative genomics have been used in virulence gene discovery by comparing strains of bacteria with differential pathogenicity. Several techniques of candidate gene prioritisation (CGP) have been applied to find genes associated with inheritable diseases in humans.

This thesis integrated both concepts of comparative genomics and CGP and developed methods to assess how relevant a candidate gene is to a phenotype of interest. Two CGP methods were evaluated:

• Statistical CGP extended the concept of searching for unique genes in comparative genomics. Statistical CGP postulated that genes should occur more frequently in strains displaying the phenotype of interest and less frequently in strains that do not. The discovery power of statistical CGP was demonstrated in two rediscovery experiments, using peptidoglycan-related genes and genes specific to anaerobic mixed-acid fermentation. It was also demonstrated that statistical CGP can function well even with very few genome examples. This method should be applicable to a variety of scenarios where pre-experimental prioritisation of likely candidate genes specific to a particular function is desired.

• Inductive CGP further extended the concept of gene occurrence to use gene occurrence patterns in bacterial genomes to train machine learning models. The encouraging results produced by inductive CGP experiments supported the hypothesis that genes with similar functions are frequently co-present and co-absent in microbial genomes. Through benchmarking experiments, inductive CGP was demonstrated to be a robust and useful method in assisting with the discovery of genes contributing to the same biological function.

• For both statistical and inductive CGP methods, this thesis provides a framework for evaluating and comparing the performances of different prioritisation functions and algorithms. Measures of performance such as probability enrichment and area under ROC curve were formalised in Chapter 5. Currently there is no good comparative database for evaluating CGP methods of studying bacterial genes. This work should serve as a benchmark for future CGP systems designed for the prioritisation of bacterial candidate genes.
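
The ideas summarised in the list above can be illustrated with a minimal sketch of an occurrence-based score and a rediscovery-style evaluation. It is an assumption-laden illustration rather than the prioritisation functions formalised in this thesis: the score (difference in occurrence frequency between positive and negative genomes), the function names and the toy profiles are all hypothetical, and the area under the ROC curve is computed in its pairwise (Mann-Whitney) form.

```python
def occurrence_score(profile, positives, negatives):
    """Occurrence frequency in positive genomes minus that in negative genomes."""
    pos_rate = sum(g in profile for g in positives) / len(positives)
    neg_rate = sum(g in profile for g in negatives) / len(negatives)
    return pos_rate - neg_rate


def prioritise(profiles, positives, negatives):
    """Return (ranked gene list, score per gene), best candidates first."""
    scores = {gene: occurrence_score(profile, positives, negatives)
              for gene, profile in profiles.items()}
    ranking = sorted(scores, key=lambda gene: (-scores[gene], gene))
    return ranking, scores


def auc(scores, known_genes):
    """Probability that a known gene outscores another gene (ties count 0.5)."""
    known = [s for g, s in scores.items() if g in known_genes]
    other = [s for g, s in scores.items() if g not in known_genes]
    wins = sum((k > o) + 0.5 * (k == o) for k in known for o in other)
    return wins / (len(known) * len(other))


# Hypothetical genome examples and gene occurrence profiles.
positives = ["A", "B", "C"]
negatives = ["X", "Y"]
profiles = {
    "gene1": {"A", "B", "C"},            # only in positive genomes
    "gene2": {"A", "B", "X"},
    "gene3": {"A", "B", "C", "X", "Y"},  # present everywhere
    "gene4": {"X", "Y"},                 # only in negative genomes
}

ranking, scores = prioritise(profiles, positives, negatives)
auc_value = auc(scores, known_genes={"gene1", "gene3"})
print(ranking, auc_value)
```

A gene present in every genome scores zero under this function, which reflects the intuition that ubiquitous genes carry no information about the phenotype of interest.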

11.1.3 Application of co-occurrence-based candidate gene prioritisation in GBS virulence gene discovery

Part III of this thesis applied both CGP methods to the discovery of unknown virulence genes in GBS. Statistical CGP was applied to discover genes overrepresented in neonatal pathogens. In Appendix D, bacterial pathogens that are known to cause neonatal sepsis were reviewed and selected as positive and negative genome examples, which were used in Chapter 8 for de novo discovery of candidate genes by statistical CGP. Inductive CGP was also applied to help discover unknown genes contributing to GBS virulence. In Chapter 7, the currently known GBS virulence factors were reviewed by literature search and selected as gene examples for use in Chapter 9. Together, these analyses demonstrated the application of statistical and inductive CGP to GBS for the discovery of unknown virulence factors. The procedure demonstrated in this thesis should also assist with the discovery of virulence genes in other pathogenic bacteria.
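
The use of known virulence genes as training examples can be sketched as follows. This is a deliberately simplified stand-in: instead of the machine learning models used for inductive CGP, candidates are ranked by the mean Jaccard similarity between their occurrence profile and the profiles of the training genes. The function names, gene labels and genome labels are illustrative assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of genomes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(profiles, training_genes):
    """Rank non-training genes by mean profile similarity to the training genes."""
    seeds = [profiles[gene] for gene in training_genes]
    scores = {
        gene: sum(jaccard(profile, seed) for seed in seeds) / len(seeds)
        for gene, profile in profiles.items()
        if gene not in training_genes
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical occurrence profiles: gene -> genomes in which it is present.
profiles = {
    "seedA": {"G1", "G2", "G3"},
    "seedB": {"G1", "G2", "G3", "G4"},
    "cand1": {"G1", "G2", "G3"},
    "cand2": {"G4", "G5"},
    "cand3": {"G1", "G2", "G3", "G4", "G5"},
}

ranked = rank_candidates(profiles, training_genes={"seedA", "seedB"})
for gene, score in ranked:
    print(gene, round(score, 3))
```

Candidates whose presence and absence across genomes mirror those of the training genes rise to the top of the list, which is the co-occurrence hypothesis in its simplest form.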

11.1.4 Some highly-ranked genes are biologically plausible in explaining additional GBS virulence mechanisms

Several genes ranked at the top of the prioritised lists showed biological plausibility in explaining GBS pathogenesis (Chapter 10). For example, statistical CGP discovered several family 8 glycosyltransferase genes (which are involved in E. coli LPS outer core synthesis) that may have roles in synthesising important surface polysaccharide structures in GBS. The inductive CGP analysis, in turn, revealed interesting candidate genes such as additional adhesin genes (adcA, psaA) co-occurring with currently known adhesins, ECM invasin genes (β-glucuronidase, unsaturated glucuronyl hydrolase, and collagenase genes) co-occurring with the hyaluronidase gene, and genes encoding neuraminidase, staphylokinase, and other factors that may contribute to GBS virulence. Several family 1 and family 2 glycosyltransferase genes were found cosegregating with the cps gene cluster, and may have roles in determining serotypes yet to be described. The co-occurrence of these highly-ranked genes with known virulence factors suggested multifactorial mechanisms of GBS in colonisation and pathogenesis of perinatal diseases.

11.2 Future directions

11.2.1 Bench-side validation of the prioritised candidate genes

A number of interesting genes were identified by the CGP methods, and these genes were shown to have putative biological significance in other species. It is likely that some of these genes participate in the pathogenesis of GBS diseases. Biological validation of the discovered candidate genes with in vitro experiments would, if positive results are obtained, further support the validity of the CGP methods.

11.2.2 Virulence prediction studies with discovered GBS genetic markers

In Chapter 3, predictive analysis with the 18 genotype markers yielded poor discrimination between the invasive and colonising isolates. The optimal design of molecular markers for virulence prediction requires the characterisation and typing of the virulence genes themselves. Better results might be achieved by designing a similar study based on the genes reviewed in Chapter 7 and suggested in Chapters 8 and 9. A favourable result would support the concept of virulence typing in the classification of bacterial pathogens and help guide the development of clinical decision support systems based on bacterial genetics.

11.2.3 Applying other data sources and algorithms to improve CGP performances

This thesis developed two CGP methods based on phylogenetic profiles of candidate genes. Alternative data sources have been used in other CGP studies for prioritising human genes. It has been demonstrated that the predictive power of a prioritisation system can be improved by increasing the number of data sources [148, 155]. Although the CGP methods described in this thesis achieved good generalisable performance, integrating other data sources, such as literature or annotation databases, may further improve the performance of future systems.

The application of supervised learning to document ranking has been an increasing focus of current information retrieval research. Specifically, several learning-to-rank algorithms have recently been proposed, notably RankNet and LambdaRank by Burges et al., and SVM-based ranking methods by Cao et al. and Yue et al. [298–301]. While the formalisation of such algorithms is outside the scope of the present work, future exploration of inductive CGP should consider the potential of these newly developed algorithms. A comparative effectiveness analysis of these new approaches against the machine learning algorithms described in this thesis may be the most reasonable approach to pursuing such an investigation.
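
To make the pairwise idea behind such learning-to-rank methods concrete, the sketch below trains a linear scorer with a RankNet-style logistic cost on pairs of items whose preferred order is known. It is not the implementation of Burges et al.: a linear model replaces the neural network, and the feature vectors, gene labels, learning rate and number of epochs are arbitrary assumptions for illustration.

```python
import math


def pairwise_loss(score_hi, score_lo):
    """Logistic cost for a pair in which the first item should rank higher."""
    return math.log1p(math.exp(-(score_hi - score_lo)))


def train(features, pairs, epochs=200, lr=0.5):
    """Fit linear weights by gradient descent on the pairwise logistic cost."""
    n_features = len(next(iter(features.values())))
    weights = [0.0] * n_features
    for _ in range(epochs):
        for hi, lo in pairs:
            diff = [a - b for a, b in zip(features[hi], features[lo])]
            margin = sum(w * d for w, d in zip(weights, diff))
            # Derivative of log(1 + exp(-margin)) with respect to margin.
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            weights = [w - lr * grad_scale * d for w, d in zip(weights, diff)]
    return weights


def score(weights, vector):
    return sum(w * x for w, x in zip(weights, vector))


# Hypothetical two-feature descriptions of three candidate genes.
features = {"g1": [1.0, 0.0], "g2": [0.5, 0.5], "g3": [0.0, 1.0]}
# Training pairs: (gene that should rank higher, gene that should rank lower).
pairs = [("g1", "g2"), ("g2", "g3"), ("g1", "g3")]

weights = train(features, pairs)
ranking = sorted(features, key=lambda g: -score(weights, features[g]))
print(ranking)
```

The model never sees absolute relevance labels, only preferences between pairs, which is the property that distinguishes these methods from ordinary classification.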

11.2.4 Development of a practical tool to support CGP and virulence gene discovery

The encouraging results from the CGP experiments indicate that it should be feasible to develop practical CGP tools that assist scientists with the task of gene function discovery. It should be straightforward to encapsulate the methods described here into a set of protocols that assist researchers in problem definition, and in the application and evaluation of statistical and machine learning methods for the gene identification task. In particular, the biological plausibility of the GBS virulence genes identified here suggests a specific role for these methods in further investigating virulence genes in bacterial pathogens. The CGP workflow described in this thesis can also be automated to a large extent, for example by being encapsulated into a web-based tool. A screenshot of the prototype web-based CGP tool is shown in Figure 11.1.

Figure 11.1: A screenshot of the prototype web-based tool for prioritising candidate genes based on their phylogenetic profiles.

11.2.5 The concept of virulence gene occurrence patterns in the pan-genome

The concept of a “microbial pan-genome” was independently proposed by Tetz and by Tettelin et al., who viewed the bacterial pan-genome as a “sea of genes” [126, 302, 303]. Within the pan-genome structure, each bacterial genome encloses sets of genes conferring specific survival advantages. Results from the KEGG experiment and the prioritisation of GBS virulence genes suggested that genes with similar functions tend to aggregate together in genomes, which indirectly supports the pan-genome concept. Further questions can be asked. For example: What are the gene co-occurrence patterns associated with bacterial pneumonia? What survival advantages or functions do those genes confer on the pathogens, together as a group? Are we able to predict emerging bacterial pathogens by identifying the co-occurrence patterns of known pathogens and tracing their evolution? Many potential areas can be explored as extensions of this work.
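As a starting point for such questions, gene co-occurrence can be tabulated directly from gene content. The sketch below counts, for every pair of genes, the number of genomes that carry both. It is a hypothetical illustration rather than an analysis performed in this thesis; the genome labels and gene names are invented.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Set


def cooccurrence_counts(genomes: Dict[str, Set[str]]) -> Counter:
    """Count, for every gene pair, the number of genomes carrying both genes."""
    counts: Counter = Counter()
    for genes in genomes.values():
        # Sorting gives each pair a canonical (alphabetical) order.
        for pair in combinations(sorted(genes), 2):
            counts[pair] += 1
    return counts


# Hypothetical gene content for three genomes.
toy_genomes = {
    "pathogen_1": {"geneA", "geneB", "geneC"},
    "pathogen_2": {"geneA", "geneB"},
    "commensal_1": {"geneC", "geneD"},
}
pairs = cooccurrence_counts(toy_genomes)
```

Restricting the input to genomes associated with a given clinical syndrome, and comparing the resulting counts with those from unrelated genomes, would be one simple way to look for syndrome-specific co-occurrence patterns.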

11.3 Concluding remarks

In closing, it is hoped that the methods described in this thesis have contributed to improving the discovery of virulence genes and their clinical identification with bacterial genetic markers. Given that our ability to sequence bacterial genomes is rapidly improving, the occurrence-based, multi-genomic candidate gene prioritisation methods developed in this work should serve as a useful tool in accelerating the discovery of virulence biomarkers. They should also assist in improving our understanding of microbial pathogenesis and in exploring potential therapeutic targets and vaccine candidates.

Bibliography

[1] G Mendel. Experiments in Plant Hybridization. Verhandlungen des naturforschenden Vereines, 1865.

[2] W Bateson and CT Druery. Experiments in plant hybridization. Journal of

the Royal Horticultural Society, 26(1-32), 1901.

[3] JD Watson and FH Crick. Molecular structure of nucleic acids; a structure

for deoxyribose nucleic acid. Nature, 171(4356):737–8, 1953.

[4] F Sanger, GM Air, BG Barrell, NL Brown, AR Coulson, CA Fiddes,

CA Hutchison, PM Slocombe, and M Smith. Nucleotide sequence of bacte-

riophage phi X174 DNA. Nature, 265(5596):687–95, 1977.

[5] K Mullis, F Faloona, S Scharf, R Saiki, G Horn, and H Erlich. Specific

enzymatic amplification of DNA in vitro: the polymerase chain reaction.

Cold Spring Harb Symp Quant Biol, 51 Pt 1:263–73, 1986.

[6] RD Fleischmann, MD Adams, O White, RA Clayton, EF Kirkness,

AR Kerlavage, CJ Bult, JF Tomb, BA Dougherty, JM Merrick, and et al.

Whole-genome random sequencing and assembly of Haemophilus influen-

zae Rd. Science, 269(5223):496–512, 1995.

[7] ES Lander, LM Linton, B Birren, C Nusbaum, MC Zody, J Baldwin, K Devon, K Dewar, M Doyle, W FitzHugh, et al. Initial sequencing and analysis of the human genome. Nature, 409:860–921, Feb 2001.

[8] DA Wheeler, M Srinivasan, M Egholm, Y Shen, L Chen, A McGuire, W He,

YJ Chen, V Makhijani, GT Roth, et al. The complete genome of an indi-

vidual by massively parallel DNA sequencing. Nature, 452(7189):872–6,

2008.

[9] H Wolinsky. The thousand-dollar genome. Genetic brinkmanship or person-

alized medicine? EMBO Rep, 8(10):900–3, 2007.

[10] ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/lproks0.txt.

URL, 2008. Accessed July 2008.

[11] M Wehling. Translational medicine: science or wishful thinking? J Transl

Med, 6:31, 2008.

[12] F Marincola. Translational Medicine: A two-way road. J Transl Med,

1(1):1, 2003.

[13] M Madigan and J Martinko, editors. Brock Biology of Microorganisms.

Prentice Hall, 11th edition, 2005.

[14] GF Brooks, KC Carroll, JS Butel, and SA Morse. Criteria for Classification

of Bacteria, chapter 3. McGraw-Hill’s AccessMedicine, 2007.

[15] JN Maslow, ME Mulligan, and RD Arbeit. Molecular epidemiology: ap-

plication of contemporary techniques to the typing of microorganisms. Clin

Infect Dis, 17(2):153–62; quiz 163–4, 1993.

[16] B Foxman and L Riley. Molecular epidemiology: focus on infection. Am J

Epidemiol, 153(12):1135–41, 2001.

[17] BR Levin, M Lipsitch, and S Bonhoeffer. Population biology, evolution, and infectious disease: convergence and synthesis. Science, 283(5403):806–9, 1999.

[18] TG Boyce, DL Swerdlow, and PM Griffin. Escherichia coli O157:H7 and

the hemolytic-uremic syndrome. N Engl J Med, 333(6):364–8, 1995.

[19] A Van Belkum. Molecular typing of micro-organisms: at the centre of di-

agnostics, genomics and pathogenesis of infectious diseases? J Med Micro-

biol, 51(1):7–10, 2002.

[20] RK Selander, DA Caugant, H Ochman, JM Musser, MN Gilmour, and

TS Whittam. Methods of multilocus enzyme electrophoresis for bacterial

population genetics and systematics. Appl Environ Microbiol, 51(5):873–

84, 1986.

[21] MJ Struelens, A Deplano, C Godard, N Maes, and E Serruys. Epidemio-

logic typing and delineation of genetic relatedness of methicillin-resistant

Staphylococcus aureus by macrorestriction analysis of genomic DNA by

using pulsed-field gel electrophoresis. J Clin Microbiol, 30(10):2599–605,

1992.

[22] E Mørch and H Andersen. Serological studies on the pneumococci. Einar Mundsgaard London: Humphrey Milford, Copenhagen, 1943.

[23] FC Tenover, RD Arbeit, RV Goering, PA Mickelsen, BE Murray, DH Pers-

ing, and B Swaminathan. Interpreting chromosomal DNA restriction pat-

terns produced by pulsed-field gel electrophoresis: criteria for bacterial

strain typing. J Clin Microbiol, 33(9):2233–9, 1995.

[24] P Severino and S Brisse. Ribotyping in Clinical Microbiology, pages 573–

581. Springer, 2005.

[25] MC Maiden, JA Bygraves, E Feil, G Morelli, JE Russell, R Urwin, Q Zhang, J Zhou, K Zurth, DA Caugant, et al. Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci U S A, 95(6):3140–5, 1998.

[26] A Casadevall and L Pirofski. Host-pathogen interactions: the attributes of

virulence. J Infect Dis, 184(3):337–44, 2001.

[27] A Casadevall and LA Pirofski. Host-pathogen interactions: redefining the

basic concepts of virulence and pathogenicity. Infect Immun, 67(8):3703–

13, 1999.

[28] RA Weiss. Virulence and pathogenesis. Trends Microbiol, 10(7):314–7,

2002.

[29] GF Brooks and KC Carroll. Pathogenesis of Bacterial Infection, chapter 9. McGraw-Hill’s AccessMedicine, 24th edition, 2008.

[30] TM Wassenaar and W Gaastra. Bacterial virulence: can we draw the line?

FEMS Microbiol Lett, 201(1):1–7, 2001.

[31] V Jay. The legacy of Robert Koch. Arch Pathol Lab Med, 125(9):1148–9,

2001.

[32] S Falkow. Molecular Koch’s postulates applied to bacterial pathogenicity–a

personal recollection 15 years later. Nat Rev Microbiol, 2(1):67–72, 2004.

[33] DN Fredericks and DA Relman. Sequence-based identification of micro-

bial pathogens: a reconsideration of Koch’s postulates. Clin Microbiol Rev,

9(1):18–33, 1996.

[34] A Fleming. On the antibacterial action of cultures of a Penicillium, with special reference to their use in the isolation of B. influenzae. Brit. J. Exp. Path, 10:226–236, 1929.

[35] GL Archer and RE Polk. Treatment and Prophylaxis of Bacterial Infections. In D.L. Kasper, E. Braunwald, A.S. Fauci, S.L. Hauser, D.L. Longo, J.L. Jameson, and K.J. Isselbacher, editors, Harrison’s Principles of Internal Medicine, chapter 127. McGraw-Hill’s AccessMedicine, 17th edition, 2005.

[36] W McDermott. Microbial persistence. Yale J Biol Med, 30(4):257–91, 1958.

[37] JL Martínez, F Baquero, and DI Andersson. Predicting antibiotic resistance. Nat Rev Microbiol, 5(12):958–65, 2007.

[38] EY Furuya and FD Lowy. Antimicrobial-resistant bacteria in the community

setting. Nat Rev Microbiol, 4(1):36–45, 2006.

[39] S Malhotra-Kumar, C Lammens, S Coenen, K Van Herck, and H Goossens.

Effect of azithromycin and clarithromycin therapy on pharyngeal carriage

of macrolide-resistant streptococci in healthy volunteers: a randomised,

double-blind, placebo-controlled study. Lancet, 369(9560):482–90, 2007.

[40] I Chopra and M Roberts. Tetracycline antibiotics: mode of action, applications, molecular biology, and epidemiology of bacterial resistance. Microbiol Mol Biol Rev, 65(2):232–60, 2001.

[41] R Bellazzi and B Zupan. Predictive data mining in clinical medicine: current

issues and guidelines. Int J Med Inform, 77(2):81–97, 2008.

[42] P Sajda. Machine learning for detection and diagnosis of disease. Annu Rev

Biomed Eng, 8:537–65, 2006.

[43] V Sintchenko, JR Iredell, and GL Gilbert. Pathogen profiling for disease management and surveillance. Nat Rev Microbiol, 5(6):464–70, 2007.

[44] DM Raskin, R Seshadri, SU Pukatzki, and JJ Mekalanos. Bacterial ge-

nomics and pathogen evolution. Cell, 124(4):703–14, 2006.

[45] TT Binnewies, Y Motro, PF Hallin, O Lund, D Dunn, T La, DJ Hamp-

son, M Bellgard, TM Wassenaar, and DW Ussery. Ten years of bacterial

genome sequencing: comparative-genomics-based discoveries. Funct In-

tegr Genomics, 6(3):165–85, 2006.

[46] RC Hardison. Comparative genomics. PLoS Biol, 1(2):E58, 2003.

[47] JR Fitzgerald and JM Musser. Evolutionary genomics of pathogenic bacte-

ria. Trends Microbiol, 9(11):547–53, 2001.

[48] F Arigoni, F Talabot, M Peitsch, MD Edgerton, E Meldrum, E Allet, R Fish,

T Jamotte, ML Curchod, and H Loferer. A genome-based approach for

the identification of essential bacterial genes. Nat Biotechnol, 16(9):851–6,

1998.

[49] SL Chen, CS Hung, J Xu, CS Reigstad, V Magrini, A Sabo, D Blasiar,

T Bieri, RR Meyer, P Ozersky, et al. Identification of genes subject to pos-

itive selection in uropathogenic strains of Escherichia coli: a comparative

genomics approach. Proc Natl Acad Sci U S A, 103(15):5977–82, 2006.

[50] GK Schoolnik. Functional and comparative genomics of pathogenic bacte-

ria. Curr Opin Microbiol, 5(1):20–6, 2002.

[51] N Salama, K Guillemin, TK McDaniel, G Sherlock, L Tompkins, and

S Falkow. A whole-genome microarray reveals genetic diversity among

Helicobacter pylori strains. Proc Natl Acad Sci U S A, 97(26):14668–73,

2000.

[52] A Perrin, S Bonacorsi, E Carbonnelle, D Talibi, P Dessen, X Nassif, and C Tinsley. Comparative genomics identifies the genetic islands that distinguish Neisseria meningitidis, the agent of cerebrospinal meningitis, from other Neisseria species. Infect Immun, 70(12):7063–72, 2002.

[53] JCD Hotopp, R Grifantini, N Kumar, YL Tzeng, D Fouts, E Frigimelica,

M Draghi, MM Giuliani, R Rappuoli, DS Stephens, et al. Comparative ge-

nomics of Neisseria meningitidis: core genome, islands of horizontal trans-

fer and pathogen-specific genes. Microbiology, 152(Pt 12):3733–49, 2006.

[54] H Tettelin, V Masignani, MJ Cieslewicz, JA Eisen, S Peterson, MR Wes-

sels, IT Paulsen, KE Nelson, I Margarit, TD Read, et al. Complete

genome sequence and comparative genomic analysis of an emerging human

pathogen, serotype V Streptococcus agalactiae. Proc Natl Acad Sci U S A,

99(19):12391–6, 2002.

[55] M Mora, D Veggi, L Santini, M Pizza, and R Rappuoli. Reverse vaccinol-

ogy. Drug Discov Today, 8(10):459–64, 2003.

[56] AK Johri, LC Paoletti, P Glaser, M Dua, PK Sharma, G Grandi, and R Rap-

puoli. Group B Streptococcus: global incidence and vaccine development.

Nat Rev Microbiol, 4(12):932–42, 2006.

[57] M Pizza, V Scarlato, V Masignani, MM Giuliani, B Aricò, M Comanducci, GT Jennings, L Baldi, E Bartolini, B Capecchi, et al. Identification of vaccine candidates against serogroup B meningococcus by whole-genome sequencing. Science, 287(5459):1816–20, 2000.

[58] M Hood, A Janney, and G Dameron. Beta hemolytic streptococcus group

B associated with problems of the perinatal period. Am J Obstet Gynecol,

82:809–18, 1961.

[59] TC Eickhoff, JO Klein, AK Daly, D Ingall, and M Finland. Neonatal sepsis and other infections due to group B beta-hemolytic streptococci. N Engl J Med, 271:1221–8, 1964.

[60] P Sendi, L Johansson, and A Norrby-Teglund. Invasive group B strepto-

coccal disease in non-pregnant adults : a review with emphasis on skin and

soft-tissue infections. Infection, 36(2):100–11, 2008.

[61] A Schuchat. Epidemiology of group B streptococcal disease in the United

States: shifting paradigms. Clin Microbiol Rev, 11(3):497–513, 1998.

[62] ME Hickman, MA Rench, P Ferrieri, and CJ Baker. Changing epidemiology

of group B streptococcal colonization. Pediatrics, 104(2 Pt 1):203–9, 1999.

[63] RS Gibbs, S Schrag, and A Schuchat. Perinatal infections due to group B

streptococci. Obstet Gynecol, 104(5 Pt 1):1062–76, 2004.

[64] RL Goldenberg and C Thompson. The infectious origins of stillbirth. Am J

Obstet Gynecol, 189(3):861–73, 2003.

[65] A Schuchat. Group B streptococcal disease in newborns: a global perspec-

tive on prevention. Biomed Pharmacother, 49(1):19–25, 1995.

[66] K Ekelund and HB Konradsen. Invasive group B streptococcal disease in

infants: a 19-year nationwide study. Serotype distribution, incidence and

recurrent infection. Epidemiol Infect, 132(6):1083–90, 2004.

[67] AL Jones and CE Rubens. Molecular pathogenesis of Group B Streptococ-

cal Infection. In Regine M Hakenbeck and Singh Chhatwal, editors, Molecu-

lar Biology of Streptococci, chapter 15, pages 379–410. Horizon Bioscience,

2007.

[68] S Håkansson and K Källén. Impact and risk factors for early-onset group B streptococcal morbidity: analysis of a national, population-based cohort in Sweden 1997-2001. BJOG, 113(12):1452–8, 2006.

[69] WE Benitz, JB Gould, and ML Druzin. Risk factors for early-onset group B

streptococcal sepsis: estimation of odds ratios by critical literature review.

Pediatrics, 103(6):e77, 1999.

[70] JD Conroy and NJ Sharp. Bibliography of comparative and veterinary der-

matology. Part 12. Int J Dermatol, 17(2):142–4, 1978.

[71] SL Lukacs, KC Schoendorf, and A Schuchat. Trends in sepsis-related

neonatal mortality in the United States, 1985-1998. Pediatr Infect Dis J,

23(7):599–603, 2004.

[72] MR Moore, SJ Schrag, and A Schuchat. Effects of intrapartum antimicrobial prophylaxis for prevention of group-B-streptococcal disease on the incidence and ecology of early-onset neonatal sepsis. Lancet Infect Dis, 3(4):201–13, 2003.

[73] SJ Schrag, S Zywicki, MM Farley, AL Reingold, LH Harrison,

LB Lefkowitz, JL Hadler, R Danila, PR Cieslak, and A Schuchat. Group

B streptococcal disease in the era of intrapartum antibiotic prophylaxis. N

Engl J Med, 342(1):15–20, 2000.

[74] S Velaphi, JD Siegel, GD Wendel, N Cushion, WM Eid, and PJ Sánchez. Early-onset group B streptococcal infection after a combined maternal and neonatal group B streptococcal chemoprophylaxis strategy. Pediatrics, 111(3):541–7, 2003.

[75] DF Zaleznik, MA Rench, S Hillier, MA Krohn, R Platt, ML Lee, AE Flores, P Ferrieri, and CJ Baker. Invasive disease due to group B Streptococcus in pregnant women and neonates from diverse population groups. Clin Infect Dis, 30(2):276–81, 2000.

[76] H Wolf, AH Schaap, BJ Smit, L Spanjaard, and AH Adriaanse. Liberal di-

agnosis and treatment of intrauterine infection reduces early-onset neonatal

group B streptococcal infection but not sepsis by other pathogens. Infect Dis

Obstet Gynecol, 8(3-4):143–50, 2000.

[77] S Schrag, R Gorwitz, K Fultz-Butts, and A Schuchat. Prevention of Peri-

natal Group B Streptococcal Disease. In Recommendations and Reports,

volume 51, pages 1–22. Centers for Disease Control and Prevention, 2002.

[78] NE Rosenstein and A Schuchat. Opportunities for prevention of perina-

tal group B streptococcal disease: a multistate surveillance analysis. The

Neonatal Group B Streptococcal Disease Study Group. Obstet Gynecol,

90(6):901–6, 1997.

[79] SJ Schrag, ER Zell, R Lynfield, A Roome, KE Arnold, AS Craig, LH Har-

rison, A Reingold, K Stefonek, G Smith, et al. A population-based com-

parison of strategies to prevent early-onset group B streptococcal disease in

neonates. N Engl J Med, 347(4):233–9, 2002.

[80] CR Phares, R Lynfield, MM Farley, J Mohle-Boetani, LH Harrison, S Petit,

AS Craig, W Schaffner, SM Zansky, K Gershman, et al. Epidemiology

of invasive group B streptococcal disease in the United States, 1999-2005.

JAMA, 299(17):2056–65, 2008.

[81] KT Chen, KM Puopolo, EC Eichenwald, AB Onderdonk, and E Lieberman.

No increase in rates of early-onset neonatal sepsis by antibiotic-resistant

group B Streptococcus in the era of intrapartum antibiotic prophylaxis. Am

J Obstet Gynecol, 192(4):1167–71, 2005.

[82] KT Chen, RE Tuomala, AP Cohen, EC Eichenwald, and E Lieberman. No increase in rates of early-onset neonatal sepsis by non-group B Streptococcus or ampicillin-resistant organisms. Am J Obstet Gynecol, 185(4):854–8, 2001.

[83] FYC Lin, LE Weisman, J Troendle, and K Adams. Prematurity is the ma-

jor risk factor for late-onset group B streptococcus disease. J Infect Dis,

188(2):267–71, 2003.

[84] JL Rowen, CJ Baker, et al. Group B streptococcal infections. In R.D. Feigin

and J.D. Cherry, editors, Textbook of Pediatric Infectious Diseases, chap-

ter 88. Philadelphia: Saunders, fifth edition, 2004.

[85] SJ Schrag and BJ Stoll. Early-onset neonatal sepsis in the era of widespread

intrapartum chemoprophylaxis. Pediatr Infect Dis J, 25(10):939–40, 2006.

[86] R Lancefield. A serological differentiation of human and other groups of hemolytic Streptococci. J Exp Med, 59:441–458, 1934.

[87] F Michon, JR Brisson, A Dell, DL Kasper, and HJ Jennings. Multianten-

nary group-specific polysaccharide of group B Streptococcus. Biochemistry,

27(14):5341–51, 1988.

[88] HC Slotved, J Elliott, T Thompson, and HB Konradsen. Latex assay for

serotyping of group B Streptococcus isolates. J Clin Microbiol, 41(9):4445–

7, 2003.

[89] JA Elliott, TA Thompson, RR Facklam, and HC Slotved. Increased sensi-

tivity of a latex agglutination method for serotyping group B streptococcus.

J Clin Microbiol, 42(8):3907, 2004.

[90] CB Cropp, RA Zimmerman, J Jelinkova, AH Auernheimer, RA Bolin, and BC Wyrick. Serotyping of group B streptococci by slide agglutination fluorescence microscopy, and microimmunodiffusion. J Lab Clin Med, 84(4):594–603, 1974.

[91] S Håkansson, LG Burman, J Henrichsen, and SE Holm. Novel coagglutination method for serotyping group B streptococci. J Clin Microbiol, 30(12):3268–9, 1992.

[92] G Arakere, AE Flores, P Ferrieri, and CE Frasch. Inhibition enzyme-linked

immunosorbent assay for serotyping of group B streptococcal isolates. J

Clin Microbiol, 37(8):2564–7, 1999.

[93] HC Slotved, F Kong, L Lambertsen, S Sauer, and GL Gilbert. Serotype

IX, a proposed new Streptococcus agalactiae serotype. J Clin Microbiol,

45(9):2929–36, 2007.

[94] MJ Cieslewicz, D Chaffin, G Glusman, D Kasper, A Madan, S Rodrigues,

J Fahey, MR Wessels, and CE Rubens. Structural and genetic diversity of

group B streptococcus capsular polysaccharides. Infect Immun, 73(5):3096–

103, 2005.

[95] AM Weisner, AP Johnson, TL Lamagni, E Arnold, M Warner, PT Heath,

and A Efstratiou. Characterization of group B streptococci recovered

from infants with invasive disease in England and Wales. Clin Infect Dis,

38(9):1203–8, 2004.

[96] S Berg, B Trollfors, T Lagergård, G Zackrisson, and BA Claesson. Serotypes and clinical manifestations of group B streptococcal infections in western Sweden. Clin Microbiol Infect, 6(1):9–13, 2000.

[97] E Persson, S Berg, B Trollfors, P Larsson, E Ek, E Backhaus, BEB Claesson, L Jonsson, G Rådberg, T Ripa, and S Johansson. Serotypes and clinical manifestations of invasive group B streptococcal infections in western Sweden 1998-2001. Clin Microbiol Infect, 10(9):791–6, 2004.

[98] CS Lachenauer, DL Kasper, J Shimada, Y Ichiman, H Ohtsuka, M Kaku,

LC Paoletti, P Ferrieri, and LC Madoff. Serotypes VI and VIII predominate

among group B streptococci isolated from pregnant Japanese women. J

Infect Dis, 179(4):1030–3, 1999.

[99] SJ Bliss, SD Manning, P Tallman, CJ Baker, MD Pearlman, CF Marrs, and

B Foxman. Group B Streptococcus colonization in male and nonpregnant

female university students: a cross-sectional prevalence study. Clin Infect

Dis, 34(2):184–90, 2002.

[100] LH Harrison, JA Elliott, DM Dwyer, JP Libonati, P Ferrieri, L Billmann, and

A Schuchat. Serotype distribution of invasive group B streptococcal isolates

in Maryland: implications for vaccine formulation. Maryland Emerging In-

fections Program. J Infect Dis, 177(4):998–1002, 1998.

[101] SM Borchardt, B Foxman, DO Chaffin, CE Rubens, PA Tallman, SD Manning, CJ Baker, and CF Marrs. Comparison of DNA dot blot hybridization and Lancefield capillary precipitin methods for group B streptococcal capsular typing. J Clin Microbiol, 42(1):146–50, 2004.

[102] M Sellin, C Olofsson, S Håkansson, and M Norgren. Genotyping of the capsule gene cluster (cps) in nontypeable group B streptococci reveals two major cps allelic variants of serotypes III and VII. J Clin Microbiol, 38(9):3420–8, 2000.

[103] SD Manning, DW Lacher, HD Davies, B Foxman, and TS Whittam. DNA

polymorphism and molecular subtyping of the capsular gene cluster of

group B streptococcus. J Clin Microbiol, 43(12):6113–6, 2005.

[104] L Wen, Q Wang, Y Li, F Kong, GL Gilbert, B Cao, L Wang, and L Feng. Use of a serotype-specific DNA microarray for identification of group B Streptococcus (Streptococcus agalactiae). J Clin Microbiol, 44(4):1447–52, 2006.

[105] F Kong, S Gowan, D Martin, G James, and GL Gilbert. Serotype identifi-

cation of group B streptococci by PCR and sequencing. J Clin Microbiol,

40(1):216–26, 2002.

[106] F Kong, L Ma, and GL Gilbert. Simultaneous detection and serotype iden-

tification of Streptococcus agalactiae using multiplex PCR and reverse line

blot hybridization. J Med Microbiol, 54(Pt 12):1133–8, 2005.

[107] N Jones, JF Bohnsack, S Takahashi, KA Oliver, MS Chan, F Kunst,

P Glaser, C Rusniok, DWM Crook, RM Harding, et al. Multilocus sequence

typing system for group B streptococcus. J Clin Microbiol, 41(6):2530–6,

2003.

[108] SL Luan, M Granlund, M Sellin, T Lagergård, BG Spratt, and M Norgren. Multilocus sequence typing of Swedish invasive group B streptococcus isolates indicates a neonatally associated genetic lineage and capsule switching. J Clin Microbiol, 43(8):3727–33, 2005.

[109] EP Price, J Inman-Bamber, V Thiruvenkataswamy, F Huygens, and PM Gif-

fard. Computer-aided identification of polymorphism sets diagnostic for

groups of bacterial and viral genetic variants. BMC Bioinformatics, 8:278,

2007.

[110] E Honsa, T Fricke, AJ Stephens, D Ko, F Kong, GL Gilbert, F Huygens, and PM Giffard. Assignment of Streptococcus agalactiae isolates to clonal complexes using a small set of single nucleotide polymorphisms. BMC Microbiol, 8:140, 2008.

[111] Z Zhao, F Kong, and GL Gilbert. Reverse line blot assay for direct iden-

tification of seven Streptococcus agalactiae major surface protein antigen

genes. Clin Vaccine Immunol, 13(1):145–9, 2006.

[112] Y Sun, F Kong, Z Zhao, and GL Gilbert. Comparison of a 3-set genotyp-

ing system with multilocus sequence typing for Streptococcus agalactiae

(Group B streptococcus). J Clin Microbiol, 43(9):4704–7, 2005.

[113] D Marchaim, S Efrati, R Melamed, L Gortzak-Uzan, K Riesenberg,

R Zaidenstein, and F Schlaeffer. Clonal variability of group B Streptococcus

among different groups of carriers in southern Israel. Eur J Clin Microbiol

Infect Dis, 25(7):443–8, 2006.

[114] F Kong, HF Gidding, R Berner, and GL Gilbert. Streptococcus agalac-

tiae C-beta protein gene (bac) sequence types, based on the repeated region

of the cell-wall-spanning domain: relationship to virulence and a proposed

standardized nomenclature. J Med Microbiol, 55(Pt 7):829–37, 2006.

[115] CE Rubens, LM Heggen, and JM Kuypers. IS861, a group B streptococcal

insertion sequence related to IS150 and IS3 of Escherichia coli. J Bacteriol,

171(10):5531–5, 1989.

[116] B Spellerberg, S Martin, C Franken, R Berner, and R Lütticken. Identification of a novel insertion sequence element in Streptococcus agalactiae. Gene, 241(1):51–6, 2000.

[117] M Granlund, L Oberg, M Sellin, and M Norgren. Identification of a novel

insertion element, IS1548, in group B streptococci, predominantly in strains

causing endocarditis. J Infect Dis, 177(4):967–76, 1998.

[118] M Granlund, F Michel, and M Norgren. Mutually exclusive distribution of IS1548 and GBSi1, an active group II intron identified in human isolates of group B streptococci. J Bacteriol, 183(8):2560–9, 2001.

[119] P Bidet, N Brahimi, C Chalas, Y Aujard, and E Bingen. Molecular char-

acterization of serotype III group B-streptococcus isolates causing neonatal

meningitis. J Infect Dis, 188(8):1132–7, 2003.

[120] C Franken, G Haase, C Brandt, J Weber-Heynemann, S Martin, C Lämmler, A Podbielski, R Lütticken, and B Spellerberg. Horizontal gene transfer and host specificity of beta-haemolytic streptococci: the role of a putative composite transposon containing scpB and lmb. Mol Microbiol, 41(4):925–35, 2001.

[121] G Héry-Arnaud, G Bruant, P Lanotte, S Brun, A Rosenau, N van der Mee-Marquet, R Quentin, and L Mereghetti. Acquisition of insertion sequences and the GBSi1 intron by Streptococcus agalactiae isolates correlates with the evolution of the species. J Bacteriol, 187(17):6248–52, 2005.

[122] F Kong, D Martin, G James, and GL Gilbert. Towards a genotyping system

for Streptococcus agalactiae (group B streptococcus): use of mobile genetic

elements in Australasian invasive isolates. J Med Microbiol, 52(Pt 4):337–

44, 2003.

[123] DB Clewell, SE Flannagan, Y Ike, JM Jones, and C Gawron-Burke. Se-

quence analysis of termini of conjugative transposon Tn916. J Bacteriol,

170(7):3046–52, 1988.

[124] X Zeng, F Kong, H Wang, A Darbar, and GL Gilbert. Simultaneous detection of nine antibiotic resistance-related genes in Streptococcus agalactiae using multiplex PCR and reverse line blot hybridization assay. Antimicrob Agents Chemother, 50(1):204–9, 2006.

[125] P Glaser, C Rusniok, C Buchrieser, F Chevalier, L Frangeul, T Msadek, M Zouine, E Couvé, L Lalioui, C Poyart, et al. Genome sequence of Streptococcus agalactiae, a pathogen causing invasive neonatal disease. Mol Microbiol, 45(6):1499–513, 2002.

[126] H Tettelin, V Masignani, MJ Cieslewicz, C Donati, D Medini, NL Ward,

SV Angiuoli, J Crabtree, AL Jones, AS Durkin, et al. Genome analysis of

multiple pathogenic isolates of Streptococcus agalactiae: implications for

the microbial “pan-genome”. Proc Natl Acad Sci U S A, 102(39):13950–5,

2005.

[127] N Dore, D Bennett, M Kaliszer, M Cafferkey, and CJ Smyth. Molecu-

lar epidemiology of group B streptococci in Ireland: associations between

serotype, invasive status and presence of genes encoding putative virulence

factors. Epidemiol Infect, 131(2):823–33, 2003.

[128] K Fluegge, S Supper, A Siedler, and R Berner. Serotype distribution of

invasive group B streptococcal isolates in infants: results from a nationwide

active laboratory surveillance study over 2 years in Germany. Clin Infect

Dis, 40(5):760–3, 2005.

[129] N Brimil, E Barthell, U Heindrichs, M Kuhn, R Lütticken, and B Spellerberg. Epidemiology of Streptococcus agalactiae colonization in Germany. Int J Med Microbiol, 296(1):39–44, 2006.

[130] F Lin, V Sintchenko, F Kong, GL Gilbert, and E Coiera. Commonly-used

molecular epidemiology markers of Streptococcus agalactiae do not appear

to predict virulence. Pathology, 41(6):576–581, 2009.

[131] JM Bland and DG Altman. Statistics notes. The odds ratio. BMJ, 320(7247):1468, 2000.

[132] Z Zhao, F Kong, X Zeng, HF Gidding, J Morgan, and GL Gilbert. Distribu-

tion of genotypes and antibiotic resistance genes among invasive Streptococ-

cus agalactiae (group B streptococcus) isolates from Australasian patients

belonging to different age groups. Clin Microbiol Infect, 14(3):260–7, 2008.

[133] F Kong, S Gowan, D Martin, G James, and GL Gilbert. Molecular pro-

files of group B streptococcal surface protein antigen genes: relationship to

molecular serotypes. J Clin Microbiol, 40(2):620–6, 2002.

[134] SD Manning, M Ki, CF Marrs, KJ Kugeler, SM Borchardt, CJ Baker, and

B Foxman. The frequency of genes encoding three putative group B strepto-

coccal virulence factors among invasive and colonizing isolates. BMC Infect

Dis, 6:116, 2006.

[135] IH Witten and E Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2005.

[136] J Platt. Probabilistic outputs for support vector machines and comparisons

to regularized likelihood methods. In Advances in Large Margin Classifiers,

volume 1, pages 61–74. MIT Press, 1999.

[137] TA Lasko, JG Bhagwat, KH Zou, and L Ohno-Machado. The use of receiver

operating characteristic curves in biomedical informatics. J Biomed Inform,

38(5):404–15, 2005.

[138] T Hastie, R Tibshirani, and J Friedman. The Elements of Statistical Learn-

ing: Data Mining, Inference, and Prediction. Springer, 2001.

[139] JA Hanley and BJ McNeil. The meaning and use of the area under a receiver

operating characteristic (ROC) curve. Radiology, 143(1):29–36, 1982.

[140] DJ McMillan, RG Beiko, R Geffers, J Buer, LM Schouls, BJM Vlaminckx, WJB Wannet, KS Sriprakash, and GS Chhatwal. Genes for the majority of group A streptococcal virulence factors and extracellular surface proteins do not confer an increased propensity to cause invasive disease. Clin Infect Dis, 43(7):884–91, 2006.

[141] I Guyon, A Elisseeff, and LP Kaelbling. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3(7-8):1157–1182, 2003.

[142] I Kononenko. Estimating Attributes: Analysis and Extensions of RELIEF.

Lecture Notes in Computer Science, pages 171–171, 1994.

[143] L Yu and H Liu. Feature Selection for High-Dimensional Data: A Fast

Correlation-Based Filter Solution. In Proc 12th Int Conf on Machine Learn-

ing (ICML-03), volume 20, page 856, 2003.

[144] JR Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[145] KCC Chan and AKC Wong. Statistical Technique for Extracting Classifi-

catory Knowledge from Databases. In Knowledge Discovery in Databases,

pages 107–124. AAAI/MIT Press, 1991.

[146] J Reunanen, I Guyon, and A Elisseeff. Overfitting in Making Comparisons

Between Variable Selection Methods. Journal of Machine Learning Re-

search, 3(7-8):1371–1382, 2003.

[147] L Peltonen and VA McKusick. Genomics and medicine. Dissecting human

disease in the postgenomic era. Science, 291(5507):1224–9, 2001.

[148] S Aerts, D Lambrechts, S Maity, P Van Loo, B Coessens, F De Smet, LC Tranchevent, B De Moor, P Marynen, B Hassan, et al. Gene prioritization through genomic data fusion. Nat Biotechnol, 24(5):537–44, 2006.

[149] N Fuhr. Getting into Information Retrieval. In M Agosti, F Crestani, and

G Pasi, editors, Lecture notes in computer science, volume 1980, pages 1–

20. Springer-Verlag, Berlin, 2000.

[150] G Jimenez-Sanchez, B Childs, and D Valle. Human disease genes. Nature,

409(6822):853–5, 2001.

[151] C Perez-Iratxeta, P Bork, and MA Andrade. Association of genes to geneti-

cally inherited diseases using data mining. Nat Genet, 31(3):316–9, 2002.

[152] AE Hirsh and HB Fraser. Protein dispensability and rate of evolution. Na-

ture, 411(6841):1046–9, 2001.

[153] N López-Bigas and CA Ouzounis. Genome-wide identification of genes likely to be involved in human genetic disease. Nucleic Acids Res, 32(10):3108–14, 2004.

[154] EA Adie, RR Adams, KL Evans, DJ Porteous, and BS Pickard. Speeding

disease gene discovery by sequence based candidate prioritization. BMC

Bioinformatics, 6:55, 2005.

[155] KJ Gaulton, KL Mohlke, and TJ Vision. A computational system to select

candidate genes for complex human traits. Bioinformatics, 23(9):1132–40,

2007.

[156] GR Grimes, TQ Wen, M Mewissen, RM Baxter, S Moodie, JS Beattie,

and P Ghazal. PDQ Wizard: automated prioritization and characteriza-

tion of gene and protein lists using biomedical literature. Bioinformatics,

22(16):2055–7, 2006.

[157] GD Bader, I Donaldson, C Wolting, BF Ouellette, T Pawson, and CW Hogue. BIND–The Biomolecular Interaction Network Database. Nucleic Acids Res, 29(1):242–5, 2001.

[158] WJ Kent, F Hsu, D Karolchik, RM Kuhn, H Clawson, H Trumbower, and

D Haussler. Exploring relationships and mining data with the UCSC Gene

Sorter. Genome Res, 15(5):737–41, 2005.

[159] N Fuhr. Models in Information Retrieval. In M Agosti, F Crestani, and

G Pasi, editors, Lecture notes in computer science, volume 1980, pages 21–

50. Springer-Verlag, Berlin, 2000.

[160] FS Turner, DR Clutterbuck, and CAM Semple. POCUS: mining genomic

sequence annotation to predict disease genes. Genome Biol, 4(11):R75,

2003.

[161] EA Adie, RR Adams, KL Evans, DJ Porteous, and BS Pickard. SUSPECTS:

enabling fast and effective prioritization of positional candidates. Bioinfor-

matics, 22(6):773–4, 2006.

[162] F Lin, E Coiera, R Lan, and V Sintchenko. In silico prioritisation of candi-

date genes for prokaryotic gene function discovery: an application of phy-

logenetic profiles. BMC Bioinformatics, 10:86, 2009.

[163] MY Galperin and EV Koonin. Who’s your neighbor? New computational

approaches for functional genomics. Nat Biotechnol, 18(6):609–13, 2000.

[164] M Pellegrini, EM Marcotte, MJ Thompson, D Eisenberg, and TO Yeates.

Assigning protein functions by comparative genome analysis: protein phy-

logenetic profiles. Proc Natl Acad Sci U S A, 96(8):4285–8, 1999.

[165] Y Zheng, BP Anton, RJ Roberts, and S Kasif. Phylogenetic detection of

conserved gene clusters in microbial genomes. BMC Bioinformatics, 6:243,

2005.

[166] Y Yamanishi, JP Vert, and M Kanehisa. Protein network inference from

multiple genomic data: a supervised approach. Bioinformatics, 20 Suppl

1:i363–70, 2004.

[167] JP Vert. A tree kernel to analyse phylogenetic profiles. Bioinformatics, 18

Suppl 1:S276–84, 2002.

[168] EM Marcotte, I Xenarios, AM van Der Bliek, and D Eisenberg. Localizing

proteins in the cell from their phylogenetic profiles. Proc Natl Acad Sci U S

A, 97(22):12115–20, 2000.

[169] J Wu, Z Hu, and C DeLisi. Gene annotation and network inference by

phylogenetic profiling. BMC Bioinformatics, 7:80, 2006.

[170] LB Lusted. Signal detectability and medical decision-making. Science,

171(977):1217–9, 1971.

[171] BG Spratt. Resistance to antibiotics mediated by target alterations. Science,

264(5157):388–93, 1994.

[172] KH Schleifer and O Kandler. Peptidoglycan types of bacterial cell walls and

their taxonomic implications. Bacteriol Rev, 36(4):407–77, 1972.

[173] M Kanehisa and S Goto. KEGG: kyoto encyclopedia of genes and genomes.

Nucleic Acids Res, 28(1):27–30, 2000.

[174] PD Karp, IM Keseler, A Shearer, M Latendresse, M Krummenacker, SM Paley,

I Paulsen, J Collado-Vides, S Gama-Castro, M Peralta-Gil, et al. Multidimensional

annotation of the Escherichia coli K-12 genome. Nucleic Acids

Res, 35(22):7577–90, 2007.

[175] RL Tatusov, EV Koonin, and DJ Lipman. A genomic perspective on protein

families. Science, 278(5338):631–7, 1997.

[176] KLN Mercer and DS Weiss. The Escherichia coli cell division protein FtsW

is required to recruit its cognate transpeptidase, FtsI (PBP3), to the division

site. J Bacteriol, 184(4):904–12, 2002.

[177] MT Cabeen and C Jacobs-Wagner. Bacterial cell shape. Nat Rev Microbiol,

3(8):601–10, 2005.

[178] M Perego, P Glaser, A Minutello, MA Strauch, K Leopold, and W -

cher. Incorporation of D-alanine into lipoteichoic acid and wall teichoic

acid in Bacillus subtilis. Identification of genes and regulation. J Biol Chem,

270(26):15598–606, 1995.

[179] R Jothi, TM Przytycka, and L Aravind. Discovering functional linkages and

uncharacterized cellular pathways using phylogenetic profile comparisons:

a comprehensive assessment. BMC Bioinformatics, 8:173, 2007.

[180] EC Lin and S Iuchi. Regulation of gene expression in fermentative and res-

piratory systems in Escherichia coli and related bacteria. Annu Rev Genet,

25:361–87, 1991.

[181] G Michal. Biochemical pathways: An Atlas of Biochemistry and Molecular

Biology. Wiley-Spektrum, 1998.

[182] P Nordlund and P Reichard. Ribonucleotide reductases. Annu Rev Biochem,

75:681–706, 2006.

[183] JP Gogarten and JP Townsend. Horizontal gene transfer, genome innovation

and evolution. Nat Rev Microbiol, 3(9):679–87, 2005.

[184] MA Herbert, CJE Beveridge, and NJ Saunders. Bacterial virulence factors in

neonatal sepsis: group B streptococcus. Curr Opin Infect Dis, 17(3):225–9,

2004.

[185] KS Doran and V Nizet. Molecular pathogenesis of neonatal group B strep-

tococcal infection: no longer in its infancy. Mol Microbiol, 54(1):23–31,

2004.

[186] A Schubert, K Zakikhany, M Schreiner, R Frank, B Spellerberg, BJ Eik-

manns, and DJ Reinscheid. A fibrinogen receptor from group B Streptococ-

cus interacts with fibrinogen by repetitive units with novel ligand binding

sites. Mol Microbiol, 46(2):557–69, 2002.

[187] A Schubert, K Zakikhany, G Pietrocola, A Meinke, P Speziale, BJ Eik-

manns, and DJ Reinscheid. The fibrinogen receptor FbsA promotes adher-

ence of Streptococcus agalactiae to human epithelial cells. Infect Immun,

72(11):6197–205, 2004.

[188] G Pietrocola, L Visai, V Valtulina, E Vignati, S Rindi, CR Arciola, R Piazza,

and P Speziale. Multiple interactions of FbsA, a surface protein from Strep-

tococcus agalactiae, with fibrinogen: affinity, stoichiometry, and structural

characterization. Biochemistry, 45(42):12840–52, 2006.

[189] M Pierno, L Maravigna, R Piazza, L Visai, and P Speziale. FbsA-driven

fibrinogen polymerization: a bacterial “deceiving strategy”. Phys Rev Lett,

96(2):028108, 2006.

[190] G Pietrocola, A Schubert, L Visai, M Torti, JR Fitzgerald, TJ Foster,

DJ Reinscheid, and P Speziale. FbsA, a fibrinogen-binding protein

from Streptococcus agalactiae, mediates platelet aggregation. Blood,

105(3):1052–9, 2005.

[191] IM Jonsson, G Pietrocola, P Speziale, M Verdrengh, and A Tarkowski. Role

of fibrinogen-binding adhesin expression in septic arthritis and septicemia

caused by Streptococcus agalactiae. J Infect Dis, 192(8):1456–64, 2005.

[192] H Gutekunst, BJ Eikmanns, and DJ Reinscheid. The novel fibrinogen-

binding protein FbsB promotes Streptococcus agalactiae invasion into ep-

ithelial cells. Infect Immun, 72(6):3495–504, 2004.

[193] A Rosenau, K Martins, S Amor, F Gannier, P Lanotte, N van der Mee-

Marquet, L Mereghetti, and R Quentin. Evaluation of the ability of Strep-

tococcus agalactiae strains isolated from genital and neonatal specimens to

bind to human fibrinogen and correlation with characteristics of the fbsA

and fbsB genes. Infect Immun, 75(3):1310–7, 2007.

[194] D Pracht, C Elm, J Gerber, S Bergmann, M Rohde, M Seiler, KS Kim,

HF Jenkinson, R Nau, and S Hammerschmidt. PavA of Streptococcus pneu-

moniae modulates adherence, invasion, and meningeal inflammation. Infect

Immun, 73(5):2680–9, 2005.

[195] Q Cheng, D Stafslien, SS Purushothaman, and P Cleary. The group B strep-

tococcal C5a peptidase is both a specific protease and an invasin. Infect

Immun, 70(5):2408–13, 2002.

[196] C Beckmann, JD Waggoner, TO Harris, GS Tamura, and CE Rubens. Iden-

tification of novel adhesins from Group B streptococci by use of phage dis-

play reveals that C5a peptidase mediates fibronectin binding. Infect Immun,

70(6):2869–76, 2002.

[197] GS Tamura, JR Hull, MD Oberg, and DG Castner. High-affinity interaction

between fibronectin and the group B streptococcal C5a peptidase is

unaffected by a naturally occurring four-amino-acid deletion that eliminates

peptidase activity. Infect Immun, 74(10):5739–46, 2006.

[198] B Spellerberg, E Rozdzinski, S Martin, J Weber-Heynemann, N Schnitzler,

R Lütticken, and A Podbielski. Lmb, a protein with similarities to the LraI

adhesin family, mediates attachment of Streptococcus agalactiae to human

laminin. Infect Immun, 67(2):871–8, 1999.

[199] A Elsner, B Kreikemeyer, A Braun-Kiewnick, B Spellerberg, BA Buttaro,

and A Podbielski. Involvement of Lsp, a member of the LraI-lipoprotein

family in , in eukaryotic cell adhesion and internal-

ization. Infect Immun, 70(9):4859–69, 2002.

[200] T Tenenbaum, B Spellerberg, R Adam, M Vogel, KS Kim, and H Schroten.

Streptococcus agalactiae invasion of human brain microvascular endothelial

cells is promoted by the laminin-binding protein Lmb. Microbes Infect,

9(6):714–20, 2007.

[201] P Lauer, CD Rinaudo, M Soriani, I Margarit, D Maione, R Rosini, AR Tad-

dei, M Mora, R Rappuoli, G Grandi, and JL Telford. Genome analysis

reveals pili in Group B Streptococcus. Science, 309(5731):105, 2005.

[202] D Maione, I Margarit, CD Rinaudo, V Masignani, M Mora, M Scarselli,

H Tettelin, C Brettoni, ET Iacobini, R Rosini, et al. Identification of a uni-

versal Group B streptococcus vaccine by multiple genome screen. Science,

309(5731):148–50, 2005.

[203] V Krishnan, AH Gaspar, N Ye, A Mandlik, H Ton-That, and SVL Narayana.

An IgG-like domain in the minor pilin GBS52 of Streptococcus agalactiae

mediates lung epithelial cell adhesion. Structure, 15(8):893–903, 2007.

[204] JL Michel, LC Madoff, K Olson, DE Kling, DL Kasper, and FM Ausubel.

Large, identical, tandem repeating units in the C protein alpha antigen gene,

bca, of group B streptococci. Proc Natl Acad Sci U S A, 89(21):10060–4,

1992.

[205] M Stålhammar-Carlemalm, L Stenberg, and G Lindahl. Protein rib: a novel

group B streptococcal cell surface protein that confers protective immunity

and is expressed by most strains causing invasive infections. J Exp Med,

177(6):1593–603, 1993.

[206] CS Lachenauer, R Creti, JL Michel, and LC Madoff. Mosaicism in the

alpha-like protein genes of group B streptococci. Proc Natl Acad Sci U S A,

97(17):9630–5, 2000.

[207] JA Maeland, L Bevanger, and RV Lyng. Antigenic determinants of alpha-

like proteins of Streptococcus agalactiae. Clin Diagn Lab Immunol,

11(6):1035–9, 2004.

[208] KM Puopolo and LC Madoff. Upstream short sequence repeats regulate

expression of the alpha C protein of group B Streptococcus. Mol Microbiol,

50(3):977–91, 2003.

[209] M Wästfelt, M Stålhammar-Carlemalm, AM Delisse, T Cabezon, and

G Lindahl. Identification of a family of streptococcal surface proteins with

extremely repetitive structure. J Biol Chem, 271(31):18892–7, 1996.

[210] C Gravekamp, B Rosner, and LC Madoff. Deletion of repeats in the alpha C

protein enhances the pathogenicity of group B streptococci in immune mice.

Infect Immun, 66(9):4347–54, 1998.

[211] C Larsson, M Stålhammar-Carlemalm, and G Lindahl. Protection against

experimental infection with group B streptococcus by immunization with a

bivalent protein vaccine. Vaccine, 17(5):454–8, 1999.

[212] C Larsson, M Stålhammar-Carlemalm, and G Lindahl. Experimental vaccination

against group B streptococcus, an encapsulated bacterium, with

highly purified preparations of cell surface proteins Rib and alpha. Infect

Immun, 64(9):3518–23, 1996.

[213] MJ Baron, GR Bolduc, MB Goldberg, TC Aupérin, and LC Madoff. Alpha

C protein of group B Streptococcus binds host cell surface glycosamino-

glycan and enters cells by an actin-dependent mechanism. J Biol Chem,

279(23):24714–23, 2004.

[214] GR Bolduc, MJ Baron, C Gravekamp, CS Lachenauer, and LC Madoff. The

alpha C protein mediates internalization of group B Streptococcus within

human cervical epithelial cells. Cell Microbiol, 4(11):751–8, 2002.

[215] LC Madoff, JL Michel, EW Gong, DE Kling, and DL Kasper. Group B

streptococci escape host immunity by deletion of tandem repeat elements of

the alpha C protein. Proc Natl Acad Sci U S A, 93(9):4131–6, 1996.

[216] CA Pritzlaff, JC Chang, SP Kuo, GS Tamura, CE Rubens, and V Nizet.

Genetic basis for the beta-haemolytic/cytolytic activity of group B Strepto-

coccus. Mol Microbiol, 39(2):236–47, 2001.

[217] ME Hensler, GY Liu, S Sobczak, K Benirschke, V Nizet, and GP Heldt. Vir-

ulence role of group B Streptococcus beta-hemolysin/cytolysin in a neonatal

rabbit model of early-onset pulmonary infection. J Infect Dis, 191(8):1287–91, 2005.

[218] GY Liu, KS Doran, T Lawrence, N Turkson, M Puliti, L Tissi, and V Nizet.

Sword and shield: linked group B streptococcal beta-hemolysin/cytolysin

and carotenoid pigment function to subvert host phagocyte defense. Proc

Natl Acad Sci U S A, 101(40):14491–6, 2004.

[219] A Ring, C Depnering, J Pohl, V Nizet, JL Shenep, and W Stremmel. Syn-

ergistic action of nitric oxide release from murine macrophages caused by

group B streptococcal cell wall and beta-hemolysin/cytolysin. J Infect Dis,

186(10):1518–21, 2002.

[220] KS Doran, GY Liu, and V Nizet. Group B streptococcal beta-

hemolysin/cytolysin activates neutrophil signaling pathways in brain en-

dothelium and contributes to development of meningitis. J Clin Invest,

112(5):736–44, 2003.

[221] V Nizet, RL Gibson, EY Chi, PE Framson, M Hulse, and CE Rubens. Group

B streptococcal beta-hemolysin expression is associated with injury of lung

epithelial cells. Infect Immun, 64(9):3818–26, 1996.

[222] RL Gibson, V Nizet, and CE Rubens. Group B streptococcal beta-hemolysin

promotes injury of lung microvascular endothelial cells. Pediatr Res, 45(5

Pt 1):626–34, 1999.

[223] KS Doran, JCW Chang, VM Benoit, L Eckmann, and V Nizet. Group B

streptococcal beta-hemolysin/cytolysin promotes invasion of human lung

epithelial cells and the release of interleukin-8. J Infect Dis, 185(2):196–

203, 2002.

[224] B Spellerberg, B Pohl, G Haase, S Martin, J Weber-Heynemann, and

R Lütticken. Identification of genetic determinants for the hemolytic activity

of Streptococcus agalactiae by ISS1 transposition. J Bacteriol,

181(10):3212–9, 1999.

[225] MP Forquin, A Tazi, M Rosa-Fraile, C Poyart, P Trieu-Cuot, and S Dramsi.

The putative glycosyltransferase-encoding gene cylJ and the group B Strep-

tococcus (GBS)-specific gene cylK modulate hemolysin production and vir-

ulence of GBS. Infect Immun, 75(4):2063–6, 2007.

[226] GN Qui, H Toyoda, T Toida, I Koshiishi, and T Imanari. Compositional

analysis of hyaluronan, chondroitin sulfate and dermatan sulfate: HPLC of

disaccharides produced from the glycosaminoglycans by solvolysis. Chem.

Pharm. Bull. Tokyo., 44:1017–1020, 1996.

[227] S Li and MJ Jedrzejas. Hyaluronan binding and degradation by Streptococ-

cus agalactiae hyaluronate lyase. J Biol Chem, 276(44):41407–16, 2001.

[228] DG Pritchard, JO Trent, X Li, P Zhang, ML Egan, and JR Baker. Characteri-

zation of the active site of group B streptococcal hyaluronan lyase. Proteins,

40(1):126–34, 2000.

[229] LV Mello, BL De Groot, S Li, and MJ Jedrzejas. Structure and flexibility

of Streptococcus agalactiae hyaluronate lyase complex with its substrate.

Insights into the mechanism of processive degradation of hyaluronan. J Biol

Chem, 277(39):36678–88, 2002.

[230] K Rolland, C Marois, V Siquier, B Cattier, and R Quentin. Genetic features

of Streptococcus agalactiae strains causing severe neonatal infections, as

revealed by pulsed-field gel electrophoresis and hylB gene analysis. J Clin

Microbiol, 37(6):1892–8, 1999.

[231] EE Adderson, S Takahashi, Y Wang, J Armstrong, DV Miller, and JF Bohnsack.

Subtractive hybridization identifies a novel predicted protein mediating

epithelial cell invasion by virulent serotype III group B Streptococcus

agalactiae. Infect Immun, 71(12):6857–63, 2003.

[232] S Lang and M Palmer. Characterization of Streptococcus agalactiae CAMP

factor as a pore-forming toxin. J Biol Chem, 278(40):38167–73, 2003.

[233] PG Jerlström, SR Talay, P Valentin-Weigand, KN Timmis, and GS Chhatwal.

Identification of an immunoglobulin A binding motif located in the

beta-antigen of the c protein complex of group B streptococci. Infect Im-

mun, 64(7):2787–93, 1996.

[234] T Areschoug, M Stålhammar-Carlemalm, I Karlsson, and G Lindahl. Streptococcal

beta protein has separate binding sites for human factor H and IgA-Fc.

J Biol Chem, 277(15):12642–8, 2002.

[235] H Jarva, TS Jokiranta, R Würzner, and S Meri. Complement resistance

mechanisms of streptococci. Mol Immunol, 40(2-4):95–107, 2003.

[236] PS Pannaraj, JK Kelly, LC Madoff, MA Rench, CS Lachenauer, MS Ed-

wards, and CJ Baker. Group B Streptococcus bacteremia elicits beta C

protein-specific IgM and IgG in humans. J Infect Dis, 195(3):353–6, 2007.

[237] JL Michel, LC Madoff, DE Kling, DL Kasper, and FM Ausubel. Cloned

alpha and beta C-protein antigens of group B streptococci elicit protective

immunity. Infect Immun, 59(6):2023–8, 1991.

[238] KN Seifert, EE Adderson, AA Whiting, JF Bohnsack, PJ Crowley, and

LJ Brady. A unique serine-rich repeat protein (Srr-2) and novel surface anti-

gen (epsilon) associated with a virulent lineage of serotype III Streptococcus

agalactiae. Microbiology, 152(Pt 4):1029–40, 2006.

[239] CK Brown, ZY Gu, YV Matsuka, SS Purushothaman, LA Winter,

PP Cleary, SB Olmsted, DH Ohlendorf, and CA Earhart. Structure of

the streptococcal cell wall C5a peptidase. Proc Natl Acad Sci U S A,

102(51):18391–6, 2005.

[240] S Yamamoto, K Miyake, Y Koike, M Watanabe, Y Machida, M Ohta, and

S Iijima. Molecular characterization of type-specific capsular polysaccha-

ride biosynthesis genes of Streptococcus agalactiae type Ia. J Bacteriol,

181(17):5176–84, 1999.

[241] DO Chaffin, SB Beres, HH Yim, and CE Rubens. The serotype of type Ia

and III group B streptococci is determined by the polymerase gene within

the polycistronic capsule operon. J Bacteriol, 182(16):4466–77, 2000.

[242] DO Chaffin, K McKinnon, and CE Rubens. CpsK of Streptococcus agalac-

tiae exhibits alpha2,3-sialyltransferase activity in Haemophilus ducreyi. Mol

Microbiol, 45(1):109–22, 2002.

[243] DO Chaffin, LM Mentele, and CE Rubens. Sialylation of group B strep-

tococcal capsular polysaccharide is mediated by cpsK and is required for

optimal capsule polymerization and expression. J Bacteriol, 187(13):4615–

26, 2005.

[244] S Koskiniemi, M Sellin, and M Norgren. Identification of two genes, cpsX

and cpsY, with putative regulatory function on capsule expression in group

B streptococci. FEMS Immunol Med Microbiol, 21(2):159–68, 1998.

[245] MJ Cieslewicz, DL Kasper, Y Wang, and MR Wessels. Functional analysis

in type Ia group B Streptococcus of a cluster of genes involved in extracel-

lular polysaccharide production by diverse species of streptococci. J Biol

Chem, 276(1):139–46, 2001.

[246] V Suryanti, A Nelson, and A Berry. Cloning, over-expression, purification,

and characterisation of N-acetylneuraminate synthase from Streptococcus

agalactiae. Protein Expr Purif, 27(2):346–56, 2003.

[247] WF Vann, DA Daines, AS Murkin, ME Tanner, DO Chaffin, CE Rubens,

J Vionnet, and RP Silver. The NeuC protein of Escherichia coli K1 is a

UDP N-acetylglucosamine 2-epimerase. J Bacteriol, 186(3):706–12, 2004.

[248] RF Haft, MR Wessels, MF Mebane, N Conaty, and CE Rubens. Character-

ization of cpsF and its product CMP-N-acetylneuraminic acid synthetase, a

group B streptococcal enzyme that can function in K1 capsular polysaccha-

ride biosynthesis in Escherichia coli. Mol Microbiol, 19(3):555–63, 1996.

[249] AL Lewis, H Cao, SK Patel, S Diaz, W Ryan, AF Carlin, V Thon,

WG Lewis, A Varki, X Chen, and V Nizet. NeuA sialic acid O-

acetylesterase activity modulates O-acetylation of capsular polysaccharide

in group B Streptococcus. J Biol Chem, 282(38):27562–71, 2007.

[250] AL Lewis, V Nizet, and A Varki. Discovery and characterization of sialic

acid O-acetylation in group B Streptococcus. Proc Natl Acad Sci U S A,

101(30):11123–8, 2004.

[251] DA Daines, LF Wright, DO Chaffin, CE Rubens, and RP Silver. NeuD plays

a role in the synthesis of sialic acid in Escherichia coli K1. FEMS Microbiol

Lett, 189(2):281–4, 2000.

[252] AL Lewis, ME Hensler, A Varki, and V Nizet. The group B streptococ-

cal sialic acid O-acetyltransferase is encoded by neuD, a conserved com-

ponent of bacterial sialic acid biosynthetic gene clusters. J Biol Chem,

281(16):11186–92, 2006.

[253] CE Rubens, MR Wessels, LM Heggen, and DL Kasper. Transposon mutagenesis

of type III group B Streptococcus: correlation of capsule expression

with virulence. Proc Natl Acad Sci U S A, 84(20):7208–12, 1987.

[254] A Unkmeir, U Kämmerer, A Stade, C Hübner, S Haller, A Kolb-Mäurer,

M Frosch, and G Dietrich. Lipooligosaccharide and polysaccharide capsule:

virulence factors of Neisseria meningitidis that determine meningococcal

interaction with human dendritic cells. Infect Immun, 70(5):2454–62, 2002.

[255] MS Edwards, DL Kasper, HJ Jennings, CJ Baker, and A Nicholson-Weller.

Capsular sialic acid prevents activation of the alternative complement path-

way by type III, group B streptococci. J Immunol, 128(3):1278–83, 1982.

[256] MB Marques, DL Kasper, MK Pangburn, and MR Wessels. Prevention of

C3 deposition by capsular polysaccharide is a virulence mechanism of type

III group B streptococci. Infect Immun, 60(10):3986–93, 1992.

[257] TO Harris, DW Shelver, JF Bohnsack, and CE Rubens. A novel streptococ-

cal surface protease promotes virulence, resistance to opsonophagocytosis,

and cleavage of human fibrinogen. J Clin Invest, 111(1):61–70, 2003.

[258] JM Ghuysen. Serine beta-lactamases and penicillin-binding proteins. Annu

Rev Microbiol, 45:37–67, 1991.

[259] AL Jones, RHV Needham, A Clancy, KM Knoll, and CE Rubens. Penicillin-

binding proteins in Streptococcus agalactiae: a novel mechanism for eva-

sion of immune clearance. Mol Microbiol, 47(1):247–56, 2003.

[260] A Hamilton, DL Popham, DJ Carl, X Lauth, V Nizet, and AL Jones.

Penicillin-binding protein 1a promotes resistance of group B streptococcus

to antimicrobial peptides. Infect Immun, 74(11):6179–87, 2006.

[261] AL Jones, RH Mertz, DJ Carl, and CE Rubens. A streptococcal penicillin-binding

protein is critical for resisting innate airway defenses in the neonatal

lung. J Immunol, 179(5):3196–202, 2007.

[262] S Vergnano, M Sharland, P Kazembe, C Mwansambo, and PT Heath.

Neonatal sepsis: an international perspective. Arch Dis Child Fetal Neonatal

Ed, 90(3):F220–4, 2005.

[263] I Brook. Clinical review: bacteremia caused by anaerobic bacteria in chil-

dren. Crit Care, 6(3):205–11, 2002.

[264] G Lindahl, M Stålhammar-Carlemalm, and T Areschoug. Surface proteins

of Streptococcus agalactiae and related proteins in other bacterial

pathogens. Clin Microbiol Rev, 18(1):102–27, 2005.

[265] M Terabe, T Washio, and H Motoda. S3Bagging: Fast Classifier Induction

Method with Subsampling and Bagging, pages 177–186. Lecture notes in

computer science. Springer-Verlag, London, UK, 2001.

[266] F Lin. Factors affecting the classification performance of machine learning

algorithms in clinical genomics. Poster presented at Bioinformatics Australia

2006 Conference; Sydney, Australia.

[267] BL Cantarel, PM Coutinho, C Rancurel, T Bernard, V Lombard, and B Hen-

rissat. The Carbohydrate-Active EnZymes database (CAZy): an expert re-

source for Glycogenomics. Nucleic Acids Res, 2008.

[268] CRH Raetz and C Whitfield. Lipopolysaccharide endotoxins. Annu Rev

Biochem, 71:635–700, 2002.

[269] B Beutler. Tlr4: central component of the sole mammalian LPS sensor. Curr

Opin Immunol, 12(1):20–6, 2000.

[270] K Shibayama, S Ohsuka, T Tanaka, Y Arakawa, and M Ohta. Conserved

structural regions involved in the catalytic mechanism of Escherichia coli

K-12 WaaO (RfaI). J Bacteriol, 180(20):5313–8, 1998.

[271] E Pradel, CT Parker, and CA Schnaitman. Structures of the rfaB, rfaI, rfaJ,

and rfaS genes of Escherichia coli K-12 and their roles in assembly of the

lipopolysaccharide core. J Bacteriol, 174(14):4736–45, 1992.

[272] A Hoare, M Bittner, J Carter, S Alvarez, M Zaldívar, D Bravo, MA Valvano,

and I Contreras. The outer core lipopolysaccharide of Salmonella enter-

ica serovar Typhi is required for bacterial entry into epithelial cells. Infect

Immun, 74(3):1555–64, 2006.

[273] TS Zaidi, SM Fleiszig, MJ Preston, JB Goldberg, and GB Pier. Lipopolysac-

charide outer core is a ligand for corneal cell binding and ingestion of Pseu-

domonas aeruginosa. Invest Ophthalmol Vis Sci, 37(6):976–86, 1996.

[274] M Ramjeet, V Deslandes, F St Michael, AD Cox, M Kobisch, M Gottschalk,

and M Jacques. Truncation of the lipopolysaccharide outer core affects sus-

ceptibility to antimicrobial peptides and virulence of Actinobacillus pleu-

ropneumoniae serotype 1. J Biol Chem, 280(47):39104–14, 2005.

[275] JP Claverys. A new family of high-affinity ABC manganese and zinc per-

meases. Res Microbiol, 152(3-4):231–43, 2001.

[276] SR Clarke, LG Harris, RG Richards, and SJ Foster. Analysis of Ebh, a

1.1-megadalton cell wall-associated fibronectin-binding protein of Staphy-

lococcus aureus. Infect Immun, 70(12):6680–7, 2002.

[277] MN Rhem, EM Lech, JM Patti, D McDevitt, M Höök, DB Jones, and

KR Wilhelmus. The collagen-binding adhesin is a virulence factor in

Staphylococcus aureus keratitis. Infect Immun, 68(6):3776–9, 2000.

[278] J Symersky, JM Patti, M Carson, K House-Pompeo, M Teale, D Moore,

L Jin, A Schneider, LJ DeLucas, M Höök, and SV Narayana. Structure of

the collagen-binding domain from a Staphylococcus aureus adhesin. Nat

Struct Biol, 4(10):833–8, 1997.

[279] L Visai, Y Xu, F Casolini, S Rindi, M Höök, and P Speziale. Monoclonal

antibodies to CNA, a collagen-binding microbial surface component rec-

ognizing adhesive matrix molecules, detach Staphylococcus aureus from a

collagen substrate. J Biol Chem, 275(51):39837–45, 2000.

[280] Y Xu, JM Rivas, EL Brown, X Liang, and M Höök. Virulence potential of

the staphylococcal adhesin CNA in experimental arthritis is determined by

its affinity for collagen. J Infect Dis, 189(12):2323–33, 2004.

[281] R Stern. Hyaluronan catabolism: a new metabolic pathway. Eur J Cell Biol,

83(7):317–25, 2004.

[282] W Hashimoto, E Kobayashi, H Nankai, N Sato, T Miya, S Kawai, and

K Murata. Unsaturated glucuronyl hydrolase of Bacillus sp. GL1: novel en-

zyme prerequisite for metabolism of unsaturated oligosaccharides produced

by polysaccharide lyases. Arch Biochem Biophys, 368(2):367–74, 1999.

[283] T Itoh, S Akao, W Hashimoto, B Mikami, and K Murata. Crystal structure

of unsaturated glucuronyl hydrolase, responsible for the degradation of glycosaminoglycan,

from Bacillus sp. GL1 at 1.8 Å resolution. J Biol Chem,

279(30):31804–12, 2004.

[284] T Kato, N Takahashi, and HK Kuramitsu. Sequence analysis and character-

ization of the Porphyromonas gingivalis prtC gene, which expresses a novel

collagenase activity. J Bacteriol, 174(12):3889–95, 1992.

[285] A Sellers and JF Woessner. The extraction of a neutral metalloproteinase

from the involuting rat uterus, and its action on cartilage proteoglycan.

Biochem J, 189(3):521–31, 1980.

[286] GJ Locksmith, P Clark, P Duff, and GS Schultz. Amniotic fluid matrix

metalloproteinase-9 levels in women with preterm labor and suspected intra-

amniotic infection. Obstet Gynecol, 94(1):1–6, 1999.

[287] H Harirah, SE Donia, and CD Hsu. Amniotic fluid matrix metalloproteinase-

9 and interleukin-6 in predicting intra-amniotic infection. Obstet Gynecol,

99(1):80–4, 2002.

[288] RK Edwards, P Clark, GJ Locksmith, and P Duff. Performance characteristics

of putative tests for subclinical chorioamnionitis. Infect Dis Obstet

Gynecol, 9(4):209–14, 2001.

[289] G Soong, A Muir, MI Gomez, J Waks, B Reddy, P Planet, PK Singh,

Y Kaneko, MC Wolfgang, YS Hsiao, et al. Bacterial neuraminidase facili-

tates mucosal infection by participating in biofilm production. J Clin Invest,

116(8):2297–2305, 2006.

[290] MJ Jedrzejas. Pneumococcal virulence factors: structure and function. Microbiol

Mol Biol Rev, 65(2):187–207, 2001.

[291] SHM Rooijakkers, WJB van Wamel, M Ruyken, KPM van Kessel, and JAG

van Strijp. Anti-opsonic properties of staphylokinase. Microbes Infect,

7(3):476–84, 2005.

[292] MI Bokarewa, T Jin, and A Tarkowski. Staphylococcus aureus: Staphylok-

inase. Int J Biochem Cell Biol, 38(4):504–9, 2006.

[293] W Haas, J Sublett, D Kaushal, and EI Tuomanen. Revising the role of

the pneumococcal vex-vncRS locus in vancomycin tolerance. J Bacteriol,

186(24):8463–71, 2004.

[294] W Haas, D Kaushal, J Sublett, C Obert, and EI Tuomanen. Vancomycin

stress response in a sensitive and a tolerant strain of Streptococcus pneumo-

niae. J Bacteriol, 187(23):8205–10, 2005.

[295] K Shen, J Gladitz, P Antalis, B Dice, B Janto, R Keefe, J Hayes, A Ahmed,

R Dopico, N Ehrlich, et al. Characterization, distribution, and expression

of novel genes among eight clinical isolates of Streptococcus pneumoniae.

Infect Immun, 74(1):321–30, 2006.

[296] AD Johnson, JF Metzger, and L Spero. Production, purification, and chem-

ical characterization of Staphylococcus aureus exfoliative toxin. Infect Im-

mun, 12(5):1206–10, 1975.

[297] S Guiral, M Moscoso, A Dagkessamanskaia, and JP Claverys. Competence

in Streptococcus pneumoniae: What is it for? In Regine M Hakenbeck

and Singh Chhatwal, editors, Molecular Biology of Streptococci, chapter 3,

pages 61–82. Horizon Bioscience, 2007.

[298] C Burges, T Shaked, E Renshaw, A Lazier, M Deeds, N Hamilton, and

G Hullender. Learning to rank using gradient descent. In ICML ’05: Pro-

ceedings of the 22nd international conference on Machine learning, pages

89–96, New York, NY, USA, 2005. ACM.

[299] C Burges, R Ragno, and QV Le. Learning to rank with nonsmooth cost

functions. In Bernhard Schölkopf, John C. Platt, and Thomas Hoffman,

editors, NIPS, pages

193–200. MIT Press, 2006.

[300] Y Cao, J Xu, TY Liu, H Li, Y Huang, and HW Hon. Adapting ranking

svm to document retrieval. In SIGIR ’06: Proceedings of the 29th annual

international ACM SIGIR conference on Research and development in in-

formation retrieval, pages 186–193, New York, NY, USA, 2006. ACM.

[301] Y Yue, T Finley, F Radlinski, and T Joachims. A support vector method

for optimizing average precision. In SIGIR ’07: Proceedings of the 30th

annual international ACM SIGIR conference on Research and development

in information retrieval, pages 271–278, New York, NY, USA, 2007. ACM.

[302] VV Tetz. The pangenome concept: a unifying view of genetic information.

Med Sci Monit, 11(7):HY24–9, 2005.

[303] D Medini, C Donati, H Tettelin, V Masignani, and R Rappuoli. The micro-

bial pan-genome. Curr Opin Genet Dev, 15(6):589–94, 2005.

[304] RH Pantell, TB Newman, J Bernzweig, DA Bergman, JI Takayama, M Se-

gal, SA Finch, and RC Wasserman. Management and outcomes of care of

fever in early infancy. JAMA, 291(10):1203–12, 2004.

[305] BD Gessner, L Castrodale, and M Soriano-Gabarro. Aetiologies and risk

factors for neonatal sepsis and pneumonia mortality among Alaskan infants.

Epidemiol Infect, 133(5):877–81, 2005.

[306] JH Jiang, NC Chiu, FY Huang, HA Kao, CH Hsu, HY Hung, JH Chang, and

CC Peng. Neonatal sepsis in the neonatal intensive care unit: characteristics

of early versus late onset. J Microbiol Immunol Infect, 37(5):301–6, 2004.

[307] SS Mehr, JL Sadowsky, LW Doyle, and J Carr. Sepsis in neonatal intensive

care in the late 1990s. J Paediatr Child Health, 38(3):246–51, 2002.

[308] K Mayor-Lynn, VH González-Quintero, MJ O’Sullivan, AI Hartstein,

S Roger, and M Tamayo. Comparison of early-onset neonatal sepsis caused

by Escherichia coli and group B Streptococcus. Am J Obstet Gynecol,

192(5):1437–9, 2005.

[309] KJ Gray, SL Bennett, N French, AJ Phiri, and SM Graham. Invasive group

B streptococcal infection in infants, Malawi. Emerg Infect Dis, 13(2):223–9,

2007.

[310] KM Puopolo, LC Madoff, and EC Eichenwald. Early-onset group B strepto-

coccal disease in the era of maternal screening. Pediatrics, 115(5):1240–6,

2005.

[311] RS Baltimore, SM Huie, JI Meek, A Schuchat, and KL O’Brien. Early-onset

neonatal sepsis in the era of group B streptococcal prevention. Pediatrics,

108(5):1094–8, 2001.

[312] RK Edwards, WE Jamie, D Sterner, S Gentry, K Counts, and P Duff. In-

trapartum antibiotic prophylaxis and early-onset neonatal sepsis patterns.

Infect Dis Obstet Gynecol, 11(4):221–6, 2003.

[313] J Huebner and DA Goldmann. Coagulase-negative staphylococci: role as

pathogens. Annu Rev Med, 50:223–36, 1999.

[314] D Isaacs. A ten year, multicentre study of coagulase negative staphylococcal

infections in Australasian neonatal units. Arch Dis Child Fetal Neonatal Ed,

88(2):F89–93, 2003.

[315] R Ghotaslou, Z Ghorashi, and MR Nahaei. Klebsiella pneumoniae in neona-

tal sepsis: a 3-year-study in the pediatric hospital of Tabriz, Iran. Jpn J Infect

Dis, 60(2-3):126–8, 2007.

[316] CH Park, JH Seo, JY Lim, HO Woo, and HS Youn. Changing trend of

neonatal infection: experience at a newly established regional medical cen-

ter in Korea. Pediatr Int, 49(1):24–30, 2007.

251 [317] D Greenberg, ES Shinwell, P Yagupsky, S Greenberg, E Leibovitz, M Ma-

zor, and R Dagan. A prospective study of neonatal sepsis and meningitis in

southern Israel. Pediatr Infect Dis J, 16(8):768–73, 1997.

[318] D Isaacs, S Fraser, G Hogg, and HY Li. Staphylococcus aureus infec-

tions in Australasian neonatal nurseries. Arch Dis Child Fetal Neonatal Ed,

89(4):F331–5, 2004.

[319] JA Hoffman, EO Mason, GE Schutze, TQ Tan, WJ Barson, LB Givner,

ER Wald, JS Bradley, R Yogev, and SL Kaplan. Streptococcus pneumoniae

infections in the neonate. Pediatrics, 112(5):1095–102, 2003.

[320] KG Dawson, JC Emerson, and JL Burns. Fifteen years of experience with

bacterial meningitis. Pediatr Infect Dis J, 18(9):816–22, 1999.

[321] MJ Bizzarro, C Raskind, RS Baltimore, and PG Gallagher. Seventy-five

years of neonatal sepsis at Yale: 1928-2003. Pediatrics, 116(3):595–602,

2005.

[322] JM Farber and PI Peterkin. Listeria monocytogenes, a food-borne pathogen.

Microbiol Rev, 55(3):476–511, 1991.

[323] A Dent and P Toltzis. Descriptive and molecular epidemiology of Gram-

negative infections in the neonatal intensive care unit. Curr Opin

Infect Dis, 16(3):279–83, 2003.

[324] S Rahman, A Hameed, MT Roghani, and Z Ullah. Multidrug resistant

neonatal sepsis in Peshawar, Pakistan. Arch Dis Child Fetal Neonatal Ed,

87(1):F52–4, 2002.

[325] B Jones, K Peake, AJ Morris, LM McCowan, and MR Battin. Escherichia

coli: a growing problem in early onset neonatal sepsis. Aust N Z J Obstet

Gynaecol, 44(6):558–61, 2004.

252 [326] S Hershckowitz, MB Elisha, V Fleisher-Sheffer, M Barak, R Kudinsky, and

Z Weintraub. A cluster of early neonatal sepsis and pneumonia caused by

nontypable Haemophilus influenzae. Pediatr Infect Dis J, 23(11):1061–2,

2004.

[327] WG Adams, KA Deaver, SL Cochi, BD Plikaytis, ER Zell, CV Broome,

and JD Wenger. Decline of childhood Haemophilus influenzae type b (Hib)

disease in the Hib vaccine era. JAMA, 269(2):221–6, 1993.

[328] MJ Donnelly, BC Herold, SG Jenkins, and RS Daum. Obstacles to the elim-

ination of Haemophilus influenzae type b disease: three illustrative cases.

Pediatrics, 112(6 Pt 1):1465–6, 2003.

[329] JM O’Neill, JW St Geme, D Cutter, EE Adderson, J Anyanwu, RF Jacobs,

and GE Schutze. Invasive disease due to nontypeable Haemophilus influen-

zae among children in Arkansas. J Clin Microbiol, 41(7):3064–9, 2003.

Appendices

Appendix A

Genome examples used in statistical CGP of peptidoglycan genes

This table lists the 400 positive genome examples and 17 negative genome examples used in the statistical CGP of peptidoglycan-related genes (Case study 1).

Table A.1: Positive and negative genome examples used in statistical CGP of peptidoglycan-related genes in Chapter 6.1

Genome | GenBank Accession
Positive genome examples (400)
Acidobacteria bacterium Ellin345 | CP000360
Acidothermus cellulolyticus 11B | CP000481
Acidovorax JS42 | CP000539–41
Acidovorax avenae citrulli AAC00-1 | CP000512
Acinetobacter sp ADP1 | CR543861
Aeromonas hydrophila ATCC 7966 | CP000462
Agrobacterium tumefaciens C58 UWash | AE008687–90
Alcanivorax borkumensis SK2 | AM286690
Alkalilimnicola ehrlichei MLHE-1 | CP000453
Anabaena variabilis ATCC 29413 | CP000117, CP000119–21
Anaeromyxobacter dehalogenans 2CP-C | CP000251
Aquifex aeolicus | AE000657, AE000667
Arthrobacter FB24 | CP000454
Arthrobacter aurescens TC1 | CP000474–6
Azoarcus BH72 | AM406670
Azoarcus sp EbN1 | CR555306–8
Bacillus anthracis Ames | AE016879
Bacillus anthracis Ames 0581 | AE017334–6
Bacillus anthracis str Sterne | AE017225
Bacillus cereus ATCC14579 | AE016877–8
Bacillus cereus ATCC 10987 | AE017194–5
Bacillus cereus ZK | CP000001, CP000040–4
Bacillus clausii KSM-K16 | AP006627
Bacillus halodurans | BA000004
Bacillus licheniformis DSM 13 | AE017333
Bacillus subtilis | AL009126
Bacillus thuringiensis Al Hakam | CP000485–6
Bacillus thuringiensis konkukian | AE017355, CP000047
Bacteroides fragilis NCTC 9434 | CR626927–8
Bacteroides fragilis YCH46 | AP006841–2
Bacteroides thetaiotaomicron VPI-5482 | AE015928, AY171301
Bartonella bacilliformis KC583 | CP000524
Bartonella henselae Houston-1 | BX897699
Bartonella quintana Toulouse | BX897700
Baumannia cicadellinicola Homalodisca coagulata | CP000238
Bdellovibrio bacteriovorus | BX842601
Bifidobacterium adolescentis ATCC 15703 | AP009256
Bifidobacterium longum | AE014295, AF540971
Bordetella bronchiseptica | BX470250
Bordetella parapertussis | BX470249
Bordetella pertussis | BX470248
Borrelia afzelii PKo | CP000395–403
Borrelia burgdorferi | AE000784–94, AE001115, AE001575–84
Borrelia garinii PBi | CP000013–5
Bradyrhizobium japonicum | BA000040
Brucella abortus 9-941 | AE017223–4
Brucella melitensis | AE008917–8
Brucella melitensis biovar Abortus | AM040264–5
Brucella suis 1330 | AE014291–2
Buchnera aphidicola | AE016826, AF492591
Buchnera aphidicola Cc Cinara cedri | CP000263
Buchnera aphidicola Sg | AE013218
Buchnera sp | AP001070–1, BA000003
Burkholderia 383 | CP000150–2
Burkholderia cenocepacia AU 1054 | CP000378–80
Burkholderia cenocepacia HI2424 | CP000458–61
Burkholderia cepacia AMMD | CP000440–3
Burkholderia mallei ATCC 23344 | CP000010–1
Burkholderia mallei NCTC 10229 | CP000545–6
Burkholderia mallei SAVP1 | CP000525–6
Burkholderia pseudomallei 1710b | CP000124–5
Burkholderia pseudomallei K96243 | BX571965–6
Burkholderia thailandensis E264 | CP000085–6
Burkholderia xenovorans LB400 | CP000270–2
Campylobacter fetus 82-40 | CP000487
Campylobacter jejuni | AL111168
Campylobacter jejuni RM1221 | CP000025
Candidatus Blochmannia floridanus | BX248583
Candidatus Blochmannia pennsylvanicus BPEN | CP000016
Candidatus Carsonella ruddii | AP009180
Candidatus Pelagibacter ubique HTCC1062 | CP000084
Candidatus Ruthia magnifica Cm Calyptogena magnifica | CP000488
Carboxydothermus hydrogenoformans Z-2901 | CP000141
Caulobacter crescentus | AE005673
Chlamydia muridarum | AE002160, AE002162
Chlamydia trachomatis | AE001273
Chlamydia trachomatis A HAR-13 | CP000051–2
Chlamydophila abortus S26 3 | CR848038
Chlamydophila caviae | AE015925–6
Chlamydophila felis Fe C-56 | AP006861–2
Chlamydophila pneumoniae AR39 | AE002161
Chlamydophila pneumoniae CWL029 | AE001363
Chlamydophila pneumoniae J138 | BA000008
Chlamydophila pneumoniae TW 183 | AE009440
Chlorobium chlorochromatii CaD3 | CP000108
Chlorobium phaeobacteroides DSM 266 | CP000492
Chlorobium tepidum TLS | AE006470
Chromobacterium violaceum | AE016825
Chromohalobacter salexigens DSM 3043 | CP000285
Clostridium acetobutylicum | AE001437–8
Clostridium novyi NT | CP000382
Clostridium perfringens | AP003515, BA000016
Clostridium perfringens ATCC 13124 | CP000246
Clostridium perfringens SM101 | CP000312–5
Clostridium tetani E88 | AE015927, AF528097
Clostridium thermocellum ATCC 27405 |
Colwellia psychrerythraea 34H | CP000083
Corynebacterium diphtheriae | BX248353
Corynebacterium efficiens YS-314 | BA000035
Corynebacterium glutamicum ATCC 13032 Bielefeld | BX927147
Corynebacterium jeikeium K411 | AF401314, CR931997
Coxiella burnetii | AE016828–9
Cyanobacteria bacterium Yellowstone A-Prime | CP000239
Cyanobacteria bacterium Yellowstone B-Prime | CP000240
Cytophaga hutchinsonii ATCC 33406 | CP000383
Dechloromonas aromatica RCB | CP000089
Dehalococcoides CBDB1 | AJ965256
Dehalococcoides ethenogenes 195 | CP000027
Deinococcus geothermalis DSM 11300 | CP000358–9
Deinococcus radiodurans | AE000513, AE001825–7
Desulfitobacterium hafniense Y51 | AP008230
Desulfotalea psychrophila LSv54 | CR522870–2
Desulfovibrio desulfuricans G20 | CP000112
Desulfovibrio vulgaris DP4 | CP000527–8
Desulfovibrio vulgaris Hildenborough | AE017285–6
Ehrlichia canis Jake | CP000107
Ehrlichia chaffeensis Arkansas | CP000236
Ehrlichia ruminantium Gardel | CR925677
Ehrlichia ruminantium str. Welgevonden | CR925678
Enterococcus faecalis V583 | AE016830–3
Erwinia carotovora atroseptica SCRI1043 | BX950851
Erythrobacter litoralis HTCC2594 | CP000157
Escherichia coli 536 | CP000247
Escherichia coli APEC O1 | CP000468
Escherichia coli CFT073 | AE014075
Escherichia coli K12 | U000096
Escherichia coli O157H7 | AB011548–9, BA000007
Escherichia coli O157H7 EDL933 | AE005174, AF074613
Escherichia coli UTI89 | CP000243–4
Escherichia coli W3110 |
Francisella tularensis FSC 198 | AM286280
Francisella tularensis holarctica | AM233362
Francisella tularensis holarctica OSU18 | CP000437
Francisella tularensis novicida U112 | CP000439
Francisella tularensis tularensis | AJ749949
Frankia CcI3 | CP000249
Frankia alni ACN14a | CT573213
Fusobacterium nucleatum | AE009951
Geobacillus kaustophilus HTA426 | AP006520, BA000043
Geobacter metallireducens GS-15 | CP000148–9
Geobacter sulfurreducens | AE017180
Gloeobacter violaceus | BA000045
Gluconobacter oxydans 621H | CP000004–9
Gramella forsetii KT0803 | CU207366
Granulobacter bethesdensis CGDNIH1 | CP000394
Haemophilus ducreyi 35000HP | AE017143
Haemophilus influenzae |
Haemophilus influenzae 86 028NP | CP000057
Haemophilus somnus 129PT | CP000019, CP000436
Hahella chejuensis KCTC 2396 | CP000155
Halorhodospira halophila SL1 | CP000544
Helicobacter acinonychis Sheeba | AM260522–3
Helicobacter hepaticus | AE017125
Helicobacter pylori 26695 | AE000511
Helicobacter pylori HPAG1 | CP000241–2
Helicobacter pylori J99 | AE001439
Hyphomonas neptunium ATCC 15444 | CP000158
Idiomarina loihiensis L2TR | AE017340
Jannaschia CCS1 | CP000264–5
Lactobacillus acidophilus NCFM | CP000033
Lactobacillus brevis ATCC 367 | CP000416–8
Lactobacillus casei ATCC 334 | CP000423–4
Lactobacillus delbrueckii bulgaricus | CR954253
Lactobacillus delbrueckii bulgaricus ATCC BAA-365 | CP000412
Lactobacillus gasseri ATCC 33323 | CP000413
Lactobacillus johnsonii NCC 533 | AE017198
Lactobacillus plantarum | AL935263, CR377164–6
Lactobacillus sakei 23K | CR936503
Lactobacillus salivarius UCC118 | AF488831–2, CP000233–4
Lactococcus lactis | AE005176
Lactococcus lactis cremoris MG1363 | AM406671
Lactococcus lactis cremoris SK11 | CP000425–30
Lawsonia intracellularis PHE MN1-00 | AM180252–5
Legionella pneumophila Lens | CR628337, CR628339
Legionella pneumophila Paris | CR628336, CR628338
Legionella pneumophila Philadelphia 1 | AE017354
Leifsonia xyli xyli CTCB0 | AE016822
Leptospira borgpetersenii serovar Hardjo-bovis JB197 | CP000350–1
Leptospira borgpetersenii serovar Hardjo-bovis L550 | CP000348–9
Leptospira interrogans serovar Copenhageni | AE016823–4
Leptospira interrogans serovar Lai | AE010300–1
Leuconostoc mesenteroides ATCC 8293 | CP000414–5
Listeria innocua | AL592022, AL592102
Listeria monocytogenes | AL591824
Listeria monocytogenes 4b F2365 | AE017262
Listeria welshimeri serovar 6b SLCC5334 | AM263198
Magnetococcus MC-1 | CP000471
Magnetospirillum magneticum AMB-1 | AP007255
Mannheimia succiniciproducens MBEL55E | AE016827
Maricaulis maris MCS10 | CP000449
Marinobacter aquaeolei VT8 |
Mesorhizobium BNC1 | CP000389–92
Mesorhizobium loti | AP003017, BA000012–3
Methylibium petroleiphilum PM1 | CP000555–6
Methylobacillus flagellatus KT | CP000284
Methylococcus capsulatus Bath | AE017282
Moorella thermoacetica ATCC 39073 | CP000232
Mycobacterium KMS | CP000518–20
Mycobacterium MCS | CP000384–5
Mycobacterium avium 104 | CP000479
Mycobacterium avium paratuberculosis | AE016958
Mycobacterium bovis | BX248333
Mycobacterium bovis BCG Pasteur 1173P2 | AM408590
Mycobacterium leprae | AL450380
Mycobacterium smegmatis MC2 155 | CP000480
Mycobacterium tuberculosis CDC1551 | AE000516
Mycobacterium tuberculosis H37Rv | AL123456
Mycobacterium ulcerans Agy99 | CP000325
Mycobacterium vanbaalenii PYR-1 | CP000511
Myxococcus xanthus DK 1622 | CP000113
Neisseria gonorrhoeae FA 1090 | AE004969
Neisseria meningitidis FAM18 | AM421808
Neisseria meningitidis MC58 | AE002098
Neisseria meningitidis Z2491 | AL157959
Neorickettsia sennetsu Miyayama | CP000237
Nitrobacter hamburgensis X14 | CP000319–22
Nitrobacter winogradskyi Nb-255 | CP000115
Nitrosococcus oceani ATCC 19707 | CP000126–7
Nitrosomonas europaea | AL954747
Nitrosomonas eutropha C71 | CP000450–2
Nitrosospira multiformis ATCC 25196 | CP000103–6
Nocardia farcinica IFM10152 | AP006618–20
Nocardioides JS614 | CP000508–9
Nostoc sp | AP003602–6, BA000019–20
Novosphingobium aromaticivorans DSM 12444 | CP000248
Oceanobacillus iheyensis | BA000028
Oenococcus oeni PSU-1 | CP000411
Parachlamydia sp UWE25 | BX908798
Paracoccus denitrificans PD1222 | CP000489–91
Pasteurella multocida | AE004439
Pediococcus pentosaceus ATCC 25745 | CP000422
Pelobacter carbinolicus | CP000142
Pelobacter propionicus DSM 2379 | CP000482–4
Pelodictyon luteolum DSM 273 | CP000096
Photobacterium profundum SS9 | CR354531–2, CR377818
Photorhabdus luminescens | BX470251
Pirellula sp | BX119912
Polaromonas JS666 | CP000316–8
Polaromonas naphthalenivorans CJ2 | CP000529–37
Porphyromonas gingivalis W83 | AE015924
Prochlorococcus marinus AS9601 | CP000551
Prochlorococcus marinus CCMP1375 | AE017126
Prochlorococcus marinus MED4 | BX548174
Prochlorococcus marinus MIT9313 | BX548175
Prochlorococcus marinus MIT 9303 | CP000554
Prochlorococcus marinus MIT 9312 | CP000111
Prochlorococcus marinus MIT 9515 | CP000552
Prochlorococcus marinus NATL1A | CP000553
Prochlorococcus marinus NATL2A | CP000095
Propionibacterium acnes KPA171202 | AE017283
Pseudoalteromonas atlantica T6c | CP000388
Pseudoalteromonas haloplanktis TAC125 | CR954246–7
Pseudomonas aeruginosa | AE004091
Pseudomonas aeruginosa UCBPP-PA14 | CP000438
Pseudomonas entomophila L48 | CT573326
Pseudomonas fluorescens Pf-5 | CP000076
Pseudomonas fluorescens PfO-1 | CP000094
Pseudomonas putida KT2440 | AE015451
Pseudomonas syringae phaseolicola 1448A | CP000058–60
Pseudomonas syringae pv B728a | CP000075
Pseudomonas syringae tomato DC3000 | AE016853–5
Psychrobacter arcticum 273-4 | CP000082
Psychrobacter cryohalolentis K5 | CP000323–4
Psychromonas ingrahamii 37 | CP000510
Ralstonia eutropha H16 | AM260479–80
Ralstonia eutropha JMP134 | CP000090–3
Ralstonia metallidurans CH34 | CP000352–5
Ralstonia solanacearum | AL646052–3
Rhizobium etli CFN 42 | CP000133–8
Rhizobium leguminosarum bv viciae 3841 | AM236080–6
Rhodobacter sphaeroides 2 4 1 | CP000143–7, DQ232586–7
Rhodococcus RHA1 | CP000431–4
Rhodoferax ferrireducens T118 | CP000267–8
Rhodopseudomonas palustris BisA53 | CP000463
Rhodopseudomonas palustris BisB18 | CP000301
Rhodopseudomonas palustris BisB5 | CP000283
Rhodopseudomonas palustris CGA009 | BX571963–4
Rhodopseudomonas palustris HaA2 | CP000250
Rhodospirillum rubrum ATCC 11170 | CP000230–1
Rickettsia bellii RML369-C | CP000087
Rickettsia conorii | AE006914
Rickettsia felis URRWXCal2 | CP000053–5
Rickettsia prowazekii | AJ235269
Rickettsia typhi wilmington | AE017197
Roseobacter denitrificans OCh 114 | CP000362, CP000464–7
Rubrobacter xylanophilus DSM 9941 | CP000386
Saccharophagus degradans 2-40 | CP000282
Salinibacter ruber DSM 13855 | CP000159–60
Salmonella enterica Choleraesuis | AE017220, AY509003–4
Salmonella enterica Paratypi ATCC 9150 | CP000026
Salmonella typhi | AL513382–4
Salmonella typhi Ty2 | AE014613
Salmonella typhimurium LT2 | AE006468, AE006471
Shewanella ANA-3 | CP000469–70
Shewanella MR-4 | CP000446
Shewanella MR-7 | CP000444–5
Shewanella W3-18-1 | CP000503
Shewanella amazonensis SB2B | CP000507
Shewanella denitrificans OS217 | CP000302
Shewanella frigidimarina NCIMB 400 | CP000447
Shewanella oneidensis | AE014299–300
Shigella boydii Sb227 | CP000036–7
Shigella dysenteriae | CP000034–5
Shigella flexneri 2a | AE005674, AF386526
Shigella flexneri 2a 2457T | AE014073
Shigella flexneri 5 8401 | CP000266
Shigella sonnei Ss046 | CP000038–9
Silicibacter TM1040 | CP000375–7
Silicibacter pomeroyi DSS-3 | CP000031–2
Sinorhizobium meliloti | AE006469, AL591688, AL591985
Sodalis glossinidius morsitans | AP008232–5
Solibacter usitatus Ellin6076 | CP000473
Sphingopyxis alaskensis RB2256 | CP000356–7
Staphylococcus aureus COL | CP000045–6
Staphylococcus aureus MW2 | BA000033
Staphylococcus aureus Mu50 | AP003367, BA000017
Staphylococcus aureus N315 | AP003139, BA000018
Staphylococcus aureus NCTC 8325 | CP000253
Staphylococcus aureus RF122 | AJ938182
Staphylococcus aureus USA300 | CP000255–8
Staphylococcus aureus aureus MRSA252 | BX571856
Staphylococcus aureus aureus MSSA476 | BX571857–8
Staphylococcus epidermidis ATCC 12228 | AE015929–35
Staphylococcus epidermidis RP62A | CP000028–9
Staphylococcus haemolyticus | AP006716
Staphylococcus saprophyticus | AP008934–6
Streptococcus agalactiae 2603 | AE009948
Streptococcus agalactiae A909 | CP000114
Streptococcus agalactiae NEM316 | AL732656
Streptococcus mutans | AE014133
Streptococcus pneumoniae D39 | CP000410
Streptococcus pneumoniae R6 | AE007317
Streptococcus pyogenes M1 GAS | AE004092
Streptococcus pyogenes MGAS10270 | CP000260
Streptococcus pyogenes MGAS10394 | CP000003
Streptococcus pyogenes MGAS10750 | CP000262
Streptococcus pyogenes MGAS2096 | CP000261
Streptococcus pyogenes MGAS315 | AE014074
Streptococcus pyogenes MGAS5005 | CP000017
Streptococcus pyogenes MGAS6180 | CP000056
Streptococcus pyogenes MGAS8232 | AE009949
Streptococcus pyogenes MGAS9429 | CP000259
Streptococcus pyogenes SSI-1 | BA000034
Streptococcus sanguinis SK36 |
Streptococcus thermophilus CNRZ1066 | CP000024
Streptococcus thermophilus LMD-9 | CP000419–21
Streptococcus thermophilus LMG 18311 | CP000023
Streptomyces avermitilis | AP005645, BA000030
Streptomyces coelicolor | AL589148, AL645771, AL645882
Symbiobacterium thermophilum IAM14863 | AP006840
Synechococcus CC9311 | CP000435
Synechococcus CC9605 | CP000110
Synechococcus CC9902 | CP000097
Synechococcus elongatus PCC 6301 | AP008231
Synechococcus elongatus PCC 7942 | CP000100–1
Synechococcus sp WH8102 | BX548020
Synechocystis PCC6803 | AP004310–2, AP006585, BA000022
Syntrophobacter fumaroxidans MPOB | CP000478
Syntrophomonas wolfei Goettingen | CP000448
Syntrophus aciditrophicus SB | CP000252
Thermoanaerobacter tengcongensis | AE008691
Thermobifida fusca YX | CP000088
Thermosynechococcus elongatus | BA000039
Thermotoga maritima | AE000512
Thermus thermophilus HB27 | AE017221–2
Thermus thermophilus HB8 | AP008226–8
Thiobacillus denitrificans ATCC 25259 | CP000116
Thiomicrospira crunogena XCL-2 | CP000109
Thiomicrospira denitrificans ATCC 33889 | CP000153
Treponema denticola ATCC 35405 | AE017226
Treponema pallidum | AE000520
Trichodesmium erythraeum IMS101 | CP000393
Tropheryma whipplei TW08 27 | BX072543
Tropheryma whipplei Twist | AE014184
Verminephrobacter eiseniae EF01-2 | CP000542–3
Vibrio cholerae | AE003852–3
Vibrio fischeri ES114 | CP000020–2
Vibrio parahaemolyticus | BA000031–2
Vibrio vulnificus CMCP6 | AE016795–6
Vibrio vulnificus YJ016 | AP005352, BA000037–8
Wigglesworthia brevipalpis | AB063523, BA000021
Wolbachia endosymbiont of Brugia malayi TRS | AE017321
Wolbachia endosymbiont of Drosophila melanogaster | AE017196
Wolinella succinogenes | BX571656
Xanthomonas campestris | AE008922
Xanthomonas campestris 8004 | CP000050
Xanthomonas campestris vesicatoria 85-10 | AM039948–52
Xanthomonas citri | AE008923–5
Xanthomonas oryzae KACC10331 |
Xanthomonas oryzae MAFF 311018 | AP008229
Xylella fastidiosa | AE003849–51
Xylella fastidiosa Temecula1 | AE009442–3
Yersinia enterocolitica 8081 |
Yersinia pestis Antiqua | CP000308–11
Yersinia pestis CO92 | AL109969, AL117189, AL117211, AL590842
Yersinia pestis KIM | AE009952, AF074611
Yersinia pestis Nepal516 | CP000305–7
Yersinia pestis biovar Mediaevails | AE017042–6
Yersinia pseudotuberculosis IP32953 | BX936398–400
Zymomonas mobilis ZM4 | AE008692
Negative genome examples (17)
Anaplasma marginale St Maries | CP000030
Anaplasma phagocytophilum HZ | CP000235
Aster yellows witches-broom phytoplasma AYWB | CP000061–5
Mesoplasma florum L1 | AE017263
Mycoplasma capricolum ATCC 27343 | CP000123
Mycoplasma gallisepticum | AE015450
Mycoplasma hyopneumoniae 232 | AE017332
Mycoplasma hyopneumoniae 7448 | AE017244
Mycoplasma hyopneumoniae J | AE017243
Mycoplasma mobile 163K | AE017308
Mycoplasma mycoides | BX293980
Mycoplasma penetrans | BA000026
Mycoplasma pneumoniae | U000089
Mycoplasma pulmonis | AL445566
Mycoplasma synoviae 53 | AE017245
Onion yellows phytoplasma | AP006628
Ureaplasma urealyticum | AF222894
(End of table)
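The accession column of Table A.1 abbreviates runs of consecutive GenBank accessions, for example CP000539–41 or AE014299–300. The following is a minimal Python sketch of how such a token could be expanded into individual accessions. It assumes that the digits after the dash replace the trailing digits of the first accession number; the function name `expand_accessions` and the token pattern are illustrative and are not part of the thesis.

```python
import re
from typing import List

# Assumed token shape: letter prefix, digits, optional dash plus replacement suffix.
_TOKEN = re.compile(r"([A-Z]+)(\d+)(?:[–-](\d+))?")


def expand_accessions(token: str) -> List[str]:
    """Expand an abbreviated range such as 'CP000539–41' into single accessions."""
    match = _TOKEN.fullmatch(token)
    if match is None:
        raise ValueError(f"unrecognised accession token: {token!r}")
    prefix, first, suffix = match.groups()
    if suffix is None:
        # A single accession, nothing to expand.
        return [prefix + first]
    if len(suffix) > len(first):
        raise ValueError(f"range suffix longer than accession number: {token!r}")
    # Assumption: the suffix overwrites the trailing digits of the first number.
    last = first[: len(first) - len(suffix)] + suffix
    if int(last) < int(first):
        raise ValueError(f"descending range: {token!r}")
    width = len(first)  # keep the zero padding of the original accession
    return [f"{prefix}{number:0{width}d}" for number in range(int(first), int(last) + 1)]


if __name__ == "__main__":
    print(expand_accessions("CP000539–41"))
```

Under this reading, a genome listed with several tokens (for instance CP000117 and CP000119–21) would be handled by expanding each token separately and concatenating the results.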

Appendix B

Genome examples used in the statistical CGP of anaerobic fermentation genes

This table lists the 200 positive genome examples and 142 negative genome examples used in the statistical CGP of anaerobic mixed-acid fermentation genes (Case study 2).

Table B.1: Positive and negative genome examples used in statistical CGP of anaerobic mixed-acid fermentation genes in Chapter 6.3

Genome | GenBank Accession
Positive genome examples (200)
Aeromonas hydrophila ATCC 7966 | CP000462
Alkalilimnicola ehrlichei MLHE-1 | CP000453
Anaeromyxobacter dehalogenans 2CP-C | CP000251
Azoarcus sp EbN1 | CR555306–8
Bacillus anthracis Ames | AE016879
Bacillus anthracis Ames 0581 | AE017334–6
Bacillus anthracis str Sterne | AE017225
Bacillus halodurans | BA000004
Bacillus licheniformis DSM 13 | AE017333
Bacillus subtilis | AL009126
Bacillus thuringiensis Al Hakam | CP000485–6
Bacillus thuringiensis konkukian | AE017355, CP000047
Bacteroides fragilis NCTC 9434 | CR626927–8
Bacteroides fragilis YCH46 | AP006841–2
Bacteroides thetaiotaomicron VPI-5482 | AE015928, AY171301
Bifidobacterium adolescentis ATCC 15703 | AP009256
Bifidobacterium longum | AE014295, AF540971
Brucella abortus 9-941 | AE017223–4
Brucella melitensis biovar Abortus | AM040264–5
Burkholderia 383 | CP000150–2
Carboxydothermus hydrogenoformans Z-2901 | CP000141
Chlorobium phaeobacteroides DSM 266 | CP000492
Chlorobium tepidum TLS | AE006470
Chromobacterium violaceum | AE016825
Clostridium acetobutylicum | AE001437–8
Clostridium novyi NT | CP000382
Clostridium perfringens | AP003515, BA000016
Clostridium perfringens ATCC 13124 | CP000246
Clostridium perfringens SM101 | CP000312–5
Clostridium tetani E88 | AE015927, AF528097
Clostridium thermocellum ATCC 27405 |
Colwellia psychrerythraea 34H | CP000083
Corynebacterium efficiens YS-314 | BA000035
Corynebacterium glutamicum ATCC 13032 Bielefeld | BX927147
Corynebacterium jeikeium K411 | AF401314, CR931997
Coxiella burnetii | AE016828–9
Cyanobacteria bacterium Yellowstone A-Prime | CP000239
Cyanobacteria bacterium Yellowstone B-Prime | CP000240
Dechloromonas aromatica RCB | CP000089
Dehalococcoides CBDB1 | AJ965256
Dehalococcoides ethenogenes 195 | CP000027
Desulfitobacterium hafniense Y51 | AP008230
Desulfotalea psychrophila LSv54 | CR522870–2
Desulfovibrio desulfuricans G20 | CP000112
Desulfovibrio vulgaris DP4 | CP000527–8
Desulfovibrio vulgaris Hildenborough | AE017285–6
Enterococcus faecalis V583 | AE016830–3
Erwinia carotovora atroseptica SCRI1043 | BX950851
Escherichia coli 536 | CP000247
Escherichia coli APEC O1 | CP000468
Escherichia coli CFT073 | AE014075
Escherichia coli K12 | U000096
Escherichia coli O157H7 | AB011548–9, BA000007
Escherichia coli O157H7 EDL933 | AE005174, AF074613
Escherichia coli UTI89 | CP000243–4
Escherichia coli W3110 |
Fusobacterium nucleatum | AE009951
Geobacter metallireducens GS-15 | CP000148–9
Geobacter sulfurreducens | AE017180
Haemophilus ducreyi 35000HP | AE017143
Haemophilus influenzae |
Haemophilus influenzae 86 028NP | CP000057
Haemophilus somnus 129PT | CP000019, CP000436
Hahella chejuensis KCTC 2396 | CP000155
Lactobacillus acidophilus NCFM | CP000033
Lactobacillus brevis ATCC 367 | CP000416–8
Lactobacillus casei ATCC 334 | CP000423–4
Lactobacillus delbrueckii bulgaricus | CR954253
Lactobacillus delbrueckii bulgaricus ATCC BAA-365 | CP000412
Lactobacillus gasseri ATCC 33323 | CP000413
Lactobacillus johnsonii NCC 533 | AE017198
Lactobacillus plantarum | AL935263, CR377164–6
Lactobacillus sakei 23K | CR936503
Lactobacillus salivarius UCC118 | AF488831–2, CP000233–4
Lactococcus lactis | AE005176
Lactococcus lactis cremoris MG1363 | AM406671
Lactococcus lactis cremoris SK11 | CP000425–30
Lawsonia intracellularis PHE MN1-00 | AM180252–5
Leuconostoc mesenteroides ATCC 8293 | CP000414–5
Listeria innocua | AL592022, AL592102
Listeria monocytogenes | AL591824
Listeria monocytogenes 4b F2365 | AE017262
Listeria welshimeri serovar 6b SLCC5334 | AM263198
Magnetococcus MC-1 | CP000471
Mannheimia succiniciproducens MBEL55E | AE016827
Maricaulis maris MCS10 | CP000449
Marinobacter aquaeolei VT8 |
Mesoplasma florum L1 | AE017263
Methylibium petroleiphilum PM1 | CP000555–6
Moorella thermoacetica ATCC 39073 | CP000232
Mycoplasma capricolum ATCC 27343 | CP000123
Mycoplasma gallisepticum | AE015450
Mycoplasma hyopneumoniae 232 | AE017332
Mycoplasma hyopneumoniae 7448 | AE017244
Mycoplasma hyopneumoniae J | AE017243
Mycoplasma mobile 163K | AE017308
Mycoplasma mycoides | BX293980
Mycoplasma penetrans | BA000026
Mycoplasma pneumoniae | U000089
Mycoplasma pulmonis | AL445566
Mycoplasma synoviae 53 | AE017245
Nitrobacter winogradskyi Nb-255 | CP000115
Oenococcus oeni PSU-1 | CP000411
Pasteurella multocida | AE004439
Pediococcus pentosaceus ATCC 25745 | CP000422
Pelobacter carbinolicus | CP000142
Pelobacter propionicus DSM 2379 | CP000482–4
Pelodictyon luteolum DSM 273 | CP000096
Photobacterium profundum SS9 | CR354531–2, CR377818
Photorhabdus luminescens | BX470251
Porphyromonas gingivalis W83 | AE015924
Propionibacterium acnes KPA171202 | AE017283
Psychromonas ingrahamii 37 | CP000510
Ralstonia eutropha H16 | AM260479–80
Ralstonia eutropha JMP134 | CP000090–3
Ralstonia metallidurans CH34 | CP000352–5
Rhodobacter sphaeroides 2 4 1 | CP000143–7, DQ232586–7
Rhodoferax ferrireducens T118 | CP000267–8
Rhodopseudomonas palustris BisA53 | CP000463
Rhodopseudomonas palustris BisB18 | CP000301
Rhodopseudomonas palustris BisB5 | CP000283
Rhodopseudomonas palustris CGA009 | BX571963–4
Rhodopseudomonas palustris HaA2 | CP000250
Rhodospirillum rubrum ATCC 11170 | CP000230–1
Salmonella enterica Choleraesuis | AE017220, AY509003–4
Salmonella enterica Paratypi ATCC 9150 | CP000026
Salmonella typhi | AL513382–4
Salmonella typhi Ty2 | AE014613
Salmonella typhimurium LT2 | AE006468, AE006471
Shewanella ANA-3 | CP000469–70
Shewanella MR-4 | CP000446
Shewanella MR-7 | CP000444–5
Shewanella W3-18-1 | CP000503
Shewanella amazonensis SB2B | CP000507
Shewanella denitrificans OS217 | CP000302
Shewanella frigidimarina NCIMB 400 | CP000447
Shewanella oneidensis | AE014299–300
Shigella boydii Sb227 | CP000036–7
Shigella dysenteriae | CP000034–5
Shigella flexneri 2a | AE005674, AF386526
Shigella flexneri 2a 2457T | AE014073
Shigella flexneri 5 8401 | CP000266
Shigella sonnei Ss046 | CP000038–9
Staphylococcus aureus COL | CP000045–6
Staphylococcus aureus MW2 | BA000033
Staphylococcus aureus Mu50 | AP003367, BA000017
Staphylococcus aureus N315 | AP003139, BA000018
Staphylococcus aureus NCTC 8325 | CP000253
Staphylococcus aureus RF122 | AJ938182
Staphylococcus aureus USA300 | CP000255–8
Staphylococcus aureus aureus MRSA252 | BX571856
Staphylococcus aureus aureus MSSA476 | BX571857–8
Staphylococcus epidermidis ATCC 12228 | AE015929–35
Staphylococcus epidermidis RP62A | CP000028–9
Staphylococcus haemolyticus | AP006716
Staphylococcus saprophyticus | AP008934–6
Streptococcus agalactiae 2603 | AE009948
Streptococcus agalactiae A909 | CP000114
Streptococcus agalactiae NEM316 | AL732656
Streptococcus mutans | AE014133
Streptococcus pneumoniae D39 | CP000410
Streptococcus pneumoniae R6 | AE007317
Streptococcus pyogenes M1 GAS | AE004092
Streptococcus pyogenes MGAS10270 | CP000260
Streptococcus pyogenes MGAS10394 | CP000003
Streptococcus pyogenes MGAS10750 | CP000262
Streptococcus pyogenes MGAS2096 | CP000261
Streptococcus pyogenes MGAS315 | AE014074
Streptococcus pyogenes MGAS5005 | CP000017
Streptococcus pyogenes MGAS6180 | CP000056
Streptococcus pyogenes MGAS8232 | AE009949
Streptococcus pyogenes MGAS9429 | CP000259
Streptococcus pyogenes SSI-1 | BA000034
Streptococcus sanguinis SK36 |
Streptococcus thermophilus CNRZ1066 | CP000024
Streptococcus thermophilus LMD-9 | CP000419–21
Streptococcus thermophilus LMG 18311 | CP000023
Syntrophobacter fumaroxidans MPOB | CP000478
Syntrophomonas wolfei Goettingen | CP000448
Syntrophus aciditrophicus SB | CP000252
Thermoanaerobacter tengcongensis | AE008691
Thermotoga maritima | AE000512
Thiobacillus denitrificans ATCC 25259 | CP000116
Thiomicrospira crunogena XCL-2 | CP000109
Thiomicrospira denitrificans ATCC 33889 | CP000153
Treponema denticola ATCC 35405 | AE017226
Treponema pallidum | AE000520
Ureaplasma urealyticum | AF222894
Vibrio cholerae | AE003852–3
Vibrio parahaemolyticus | BA000031–2
Vibrio vulnificus CMCP6 | AE016795–6
Vibrio vulnificus YJ016 | AP005352, BA000037–8
Yersinia enterocolitica 8081 |
Yersinia pestis Antiqua | CP000308–11
Yersinia pestis CO92 | AL109969, AL117189, AL117211, AL590842
Yersinia pestis KIM | AE009952, AF074611
Yersinia pestis Nepal516 | CP000305–7
Yersinia pestis biovar Mediaevails | AE017042–6
Yersinia pseudotuberculosis IP32953 | BX936398–400
Zymomonas mobilis ZM4 | AE008692
Negative genome examples (142)
Acidothermus cellulolyticus 11B | CP000481
Acidovorax JS42 | CP000539–41
Acidovorax avenae citrulli AAC00-1 | CP000512
Acinetobacter sp ADP1 | CR543861
Agrobacterium tumefaciens C58 UWash | AE008687–90
Alcanivorax borkumensis SK2 | AM286690
Anabaena variabilis ATCC 29413 | CP000117, CP000119–21
Anaplasma marginale St Maries | CP000030
Anaplasma phagocytophilum HZ | CP000235
Aquifex aeolicus | AE000657, AE000667
Arthrobacter aurescens TC1 | CP000474–6
Aster yellows witches-broom phytoplasma AYWB | CP000061–5
Azoarcus BH72 | AM406670
Bacillus cereus ATCC14579 | AE016877–8
Bacillus cereus ATCC 10987 | AE017194–5
Bacillus cereus ZK | CP000001, CP000040–4
Bartonella bacilliformis KC583 | CP000524
Bartonella henselae Houston-1 | BX897699
Bartonella quintana Toulouse | BX897700
Bdellovibrio bacteriovorus | BX842601
Bordetella bronchiseptica | BX470250
Bordetella parapertussis | BX470249
Bordetella pertussis | BX470248
Borrelia afzelii PKo | CP000395–403
Borrelia burgdorferi | AE000784–94, AE001115, AE001575–84
Bradyrhizobium japonicum | BA000040
Brucella melitensis | AE008917–8
Brucella suis 1330 | AE014291–2
Burkholderia pseudomallei 1710b | CP000124–5
Burkholderia pseudomallei K96243 | BX571965–6
Burkholderia thailandensis E264 | CP000085–6
Campylobacter fetus 82-40 | CP000487
Campylobacter jejuni | AL111168
Campylobacter jejuni RM1221 | CP000025
Candidatus Pelagibacter ubique HTCC1062 | CP000084
Caulobacter crescentus | AE005673
Chromohalobacter salexigens DSM 3043 | CP000285
Corynebacterium diphtheriae | BX248353
Cytophaga hutchinsonii ATCC 33406 | CP000383
Deinococcus geothermalis DSM 11300 | CP000358–9
Deinococcus radiodurans | AE000513, AE001825–7
Erythrobacter litoralis HTCC2594 | CP000157
Francisella tularensis FSC 198 | AM286280
Francisella tularensis holarctica | AM233362
Francisella tularensis holarctica OSU18 | CP000437
Francisella tularensis novicida U112 | CP000439
Francisella tularensis tularensis | AJ749949
Frankia CcI3 | CP000249
Geobacillus kaustophilus HTA426 | AP006520, BA000043
Gluconobacter oxydans 621H | CP000004–9
Gramella forsetii KT0803 | CU207366
Helicobacter acinonychis Sheeba | AM260522–3
Helicobacter hepaticus | AE017125
Helicobacter pylori 26695 | AE000511
Helicobacter pylori HPAG1 | CP000241–2
Helicobacter pylori J99 | AE001439
Hyphomonas neptunium ATCC 15444 | CP000158
Idiomarina loihiensis L2TR | AE017340
Jannaschia CCS1 | CP000264–5
Legionella pneumophila Lens | CR628337, CR628339
Legionella pneumophila Paris | CR628336, CR628338
Legionella pneumophila Philadelphia 1 | AE017354
Leifsonia xyli xyli CTCB0 | AE016822
Leptospira borgpetersenii serovar Hardjo-bovis JB197 | CP000350–1
Leptospira borgpetersenii serovar Hardjo-bovis L550 | CP000348–9
Leptospira interrogans serovar Copenhageni | AE016823–4
Leptospira interrogans serovar Lai | AE010300–1
Magnetospirillum magneticum AMB-1 | AP007255
Mesorhizobium BNC1 | CP000389–92
Mesorhizobium loti | AP003017, BA000012–3
Methylobacillus flagellatus KT | CP000284
Methylococcus capsulatus Bath | AE017282
Mycobacterium avium 104 | CP000479
Mycobacterium avium paratuberculosis | AE016958
272 Table B.1: Positive and negative genome examples used in statistical CGP of anaerobic mixed-acid fermentation genes in Chapter 6.3 (Cont’d)

Genome GenBank Accession Mycobacterium bovis BX248333 Mycobacterium bovis BCG Pasteur 1173P2 AM408590 Mycobacterium leprae AL450380 Mycobacterium smegmatis MC2 155 CP000480 Mycobacterium tuberculosis CDC1551 AE000516 Mycobacterium tuberculosis H37Rv AL123456 Mycobacterium ulcerans Agy99 CP000325 Mycobacterium vanbaalenii PYR-1 CP000511 Myxococcus xanthus DK 1622 CP000113 Neisseria gonorrhoeae FA 1090 AE004969 Neisseria meningitidis FAM18 AM421808 Neisseria meningitidis MC58 AE002098 Neisseria meningitidis Z2491 AL157959 Nitrobacter hamburgensis X14 CP000319–22 Nitrosomonas europaea AL954747 Nitrosospira multiformis ATCC 25196 CP000103–6 Nocardia farcinica IFM10152 AP006618–20 Nocardioides JS614 CP000508–9 Nostoc sp AP003602–6 BA000019–20 Novosphingobium aromaticivorans DSM 12444 CP000248 Oceanobacillus iheyensis BA000028 Onion yellows phytoplasma AP006628 Paracoccus denitrificans PD1222 CP000489–91 Pirellula sp BX119912 Polaromonas JS666 CP000316–8 Polaromonas naphthalenivorans CJ2 CP000529–37 Pseudoalteromonas atlantica T6c CP000388 Pseudoalteromonas haloplanktis TAC125 CR954246–7 Pseudomonas aeruginosa AE004091 Pseudomonas aeruginosa UCBPP-PA14 CP000438 Pseudomonas fluorescens Pf-5 CP000076 Pseudomonas fluorescens PfO-1 CP000094 Pseudomonas putida KT2440 AE015451 Pseudomonas syringae phaseolicola 1448A CP000058–60 Pseudomonas syringae pv B728a CP000075 Pseudomonas syringae tomato DC3000 AE016853–5 Ralstonia solanacearum AL646052–3 Rhizobium leguminosarum bv viciae 3841 AM236080–6 Rhodococcus RHA1 CP000431–4 Rickettsia conorii AE006914 Rickettsia prowazekii AJ235269 Rickettsia typhi wilmington AE017197 Rubrobacter xylanophilus DSM 9941 CP000386 Saccharophagus degradans 2-40 CP000282 Salinibacter ruber DSM 13855 CP000159–60 Silicibacter pomeroyi DSS-3 CP000031–2 Sinorhizobium meliloti AE006469 (Continue on next page)

273 Table B.1: Positive and negative genome examples used in statistical CGP of anaerobic mixed-acid fermentation genes in Chapter 6.3 (Cont’d)

Genome GenBank Accession AL591688 AL591985 Sodalis glossinidius morsitans AP008232–5 Solibacter usitatus Ellin6076 CP000473 Sphingopyxis alaskensis RB2256 CP000356–7 Streptomyces avermitilis AP005645 BA000030 Streptomyces coelicolor AL589148 AL645771 AL645882 Symbiobacterium thermophilum IAM14863 AP006840 Thermobifida fusca YX CP000088 Thermus thermophilus HB27 AE017221–2 Thermus thermophilus HB8 AP008226–8 Trichodesmium erythraeum IMS101 CP000393 Tropheryma whipplei TW08 27 BX072543 Tropheryma whipplei Twist AE014184 Wolinella succinogenes BX571656 Xanthomonas campestris AE008922 Xanthomonas campestris 8004 CP000050 Xanthomonas campestris vesicatoria 85-10 AM039948–52 Xanthomonas citri AE008923–5 Xanthomonas oryzae KACC10331 Xanthomonas oryzae MAFF 311018 AP008229 Xylella fastidiosa AE003849–51 Xylella fastidiosa Temecula1 AE009442–3 (End of table)

Appendix C

Using KEGG pathways as validation sets

In case study 3, there were significant variations in inductive CGP performance between different KEGG pathways. Apart from true indiscriminability of functional categories by phylogenetic profiles, these variations could be attributed to sampling bias in the validation data sets. KEGG is a useful resource for illustrating the transformation of enzymatic substrates. To depict complex biochemical interactions, it is often necessary to include neighbouring metabolic pathways in the same pathway map. This may result in mixed gene occurrence profiles, which can adversely influence inductive CGP performance.

To investigate the possibility of sampling bias, the worst performing validation set (phenylalanine, tyrosine, and tryptophan biosynthesis pathway, KEGG pathway Id: sag00400) was manually inspected. It was found that genes were shared with 7 (out of 81) other functional categories of KEGG. A closer examination revealed that genes encoding two tRNA synthetases and alanine/aspartate aminotransferase genes were included in the validation set. Repeating the experiment with these genes removed resulted in a significantly improved overall performance (with the best AUC improving from 0.729 to 0.852, Tables C.1 and C.2).

Table C.1: Comparison of AUC between the original and processed sag00400 validation sets

                         Algorithm
Dataset                  NB     LR     ADTree  IBk    J48    SVM/P  SVM/R
Original (sag00400)      0.709  0.663  0.668   0.715  0.621  0.682  0.729
Processed (sag00400m)    0.684  0.652  0.664   0.773  0.669  0.851  0.759

The values shown in this table are areas under the ROC curve. The generalisation performance was estimated by stratified cross-validation for each algorithm (10-fold cross-validation for sag00400, 9-fold for sag00400m).
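The per-algorithm change between the two rows of Table C.1 can be tabulated with a short script. This is only an illustrative arithmetic sketch: the AUC values are transcribed from the table, while the function names `auc_changes` and `best` are hypothetical helpers introduced here and are not part of the original analysis.

```python
# AUC values transcribed from Table C.1 (original vs processed validation set).
ALGORITHMS = ["NB", "LR", "ADTree", "IBk", "J48", "SVM/P", "SVM/R"]
AUC_ORIGINAL = [0.709, 0.663, 0.668, 0.715, 0.621, 0.682, 0.729]
AUC_PROCESSED = [0.684, 0.652, 0.664, 0.773, 0.669, 0.851, 0.759]


def auc_changes(names, before, after):
    """Return {algorithm: after - before}, rounded to three decimals."""
    return {n: round(a - b, 3) for n, b, a in zip(names, before, after)}


def best(names, values):
    """Return (algorithm, AUC) for the highest AUC in `values`."""
    i = max(range(len(values)), key=values.__getitem__)
    return names[i], values[i]


changes = auc_changes(ALGORITHMS, AUC_ORIGINAL, AUC_PROCESSED)
for name in ALGORITHMS:
    print(f"{name:7s} {changes[name]:+.3f}")
print("best original :", best(ALGORITHMS, AUC_ORIGINAL))
print("best processed:", best(ALGORITHMS, AUC_PROCESSED))
```

The script only restates the tabulated values; it does not re-run the cross-validation.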

Table C.2: Genes in the original (sag00400) and the processed (sag00400m) phenylalanine, tyrosine and tryptophan biosynthesis KEGG pathway

                                                                Pathway
Gene locus  Gene  Annotation                                    sag00400  sag00400m
SAG0158     tyrS  tyrosyl-tRNA synthetase                       yes       no
SAG0462     trpG  anthranilate synthase component II            yes       yes
SAG0525     aspC  aspartate aminotransferase                    yes       no
SAG0540           hypothetical protein                          yes       yes
SAG0630     aroA  3-phosphoshikimate 1-carboxyvinyltransferase  yes       yes
SAG0631     aroK  shikimate kinase                              yes       yes
SAG0869     pheS  phenylalanyl-tRNA synthetase subunit α        yes       no
SAG0871     pheT  phenylalanyl-tRNA synthetase subunit β        yes       no
SAG1377     aroC  chorismate synthase                           yes       yes
SAG1378     aroB  3-dehydroquinate synthase                     yes       yes
SAG1379     aroD  3-dehydroquinate dehydratase                  yes       yes
SAG1676     alaT  aminotransferase AlaT                         yes       no
SAG1680     aroE  shikimate 5-dehydrogenase                     yes       yes
SAG1686           phospho-2-dehydro-3-deoxyheptonate aldolase   yes       yes
n                                                               14        9

To investigate this effect further, the extent of gene sharing between different KEGG categories was plotted (Figure C.1). It was observed that validation sets with few overlaps tend to have better performance (as shown in the red, yellow, and green groups). The validation sets with good AUC [for example, fatty acid biosynthesis (sag00061, AUC: 0.994) and ribosomal genes (sag03010, AUC: 0.930)] have relatively few connections, indicating that only the genes specific to the function were included. Denser connections were found in pathways that achieved worse AUC (blue and grey groups), although some heavily connected pathways (for example, the aminoacyl-tRNA biosynthesis pathway, sag00970, AUC: 0.960) have distinct molecular functional properties (for example, synthesis of aminoacyl-tRNAs). In the above example with the sag00400 validation set, removal of the tRNA synthetase and aminotransferase genes (sag00400m) resulted in a reduction of connections with neighbouring pathways. This “purification” step led to an improvement in CGP performance.
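The notion of "connections" between validation sets can be made concrete with a minimal sketch. It assumes each pathway is represented as a set of gene identifiers and counts, for each pathway, how many other pathways share at least one gene with it. The pathway and gene names below are hypothetical toy values, not the real KEGG annotation, and the function names are introduced here for illustration only.

```python
from itertools import combinations


def shared_gene_edges(pathway_genes):
    """Return {(pathway_a, pathway_b): shared genes} for every pair of
    pathways that has at least one gene in common."""
    edges = {}
    for a, b in combinations(sorted(pathway_genes), 2):
        common = pathway_genes[a] & pathway_genes[b]
        if common:
            edges[(a, b)] = common
    return edges


def connection_counts(pathway_genes):
    """Number of other pathways with which each pathway shares genes."""
    counts = {p: 0 for p in pathway_genes}
    for a, b in shared_gene_edges(pathway_genes):
        counts[a] += 1
        counts[b] += 1
    return counts


# Toy data (hypothetical): pathway "X" contains two genes, g3 and g4,
# that also belong to neighbouring pathways "B" and "C".
original = {
    "X": {"g1", "g2", "g3", "g4"},
    "B": {"g3", "g5"},
    "C": {"g4", "g6"},
    "D": {"g7"},
}
# "Purified" version of X with the shared genes removed.
purified = dict(original, X={"g1", "g2"})

print(connection_counts(original))
print(connection_counts(purified))
```

In this toy example, removing the two shared genes reduces the number of connections of pathway "X" from two to zero, which mirrors the effect described for sag00400 and sag00400m.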

Summary

KEGG is a standardised validation source that is frequently used for benchmarking in silico methods of gene function discovery. The map-style gene curation in KEGG implies the possibility of mixing functionally distinct genes in the same validation set. In the case of CGP by phylogenetic profiles, inclusion of genes from other functional groups may degrade the quality of the training data and result in a loss of performance. Evaluations of functional prediction methods must thus consider the effect of such sampling bias in the data set.

[Figure C.1: network diagram of KEGG pathways. Each node is labelled with the pathway Id, the number of genes in parentheses and, where evaluated, the AUC.]

Figure C.1: This diagram illustrates the pathways that have shared genes in KEGG. Each node denotes a KEGG functional category, and an edge between nodes indicates that common genes are present in both categories. Pathways with fewer connections are generally associated with better evaluation results, while pathways with poor AUCs are likely to be more heavily interconnected. The original validation set composed of the phenylalanine, tyrosine, and tryptophan biosynthesis pathway in KEGG (sag00400, the grey node) shares genes with more neighbouring pathways than the processed validation set (sag00400m, rectangular green node). The pathway names are listed in Table C.3. Coloured nodes: red: AUC >0.95; yellow: AUC = 0.90–0.95; green: AUC = 0.85–0.90; blue: AUC = 0.75–0.85; grey: AUC <0.75. White nodes: KEGG pathways with <10 genes (not included in the inductive CGP evaluation).

Table C.3: KEGG pathways listed in Figure C.1

Id    Name
sag00010    Glycolysis / Gluconeogenesis
sag00030    Pentose phosphate pathway
sag00040    Pentose and glucuronate interconversions
sag00051    Fructose and mannose metabolism
sag00052    Galactose metabolism
sag00053    Ascorbate and aldarate metabolism
sag00061    Fatty acid biosynthesis
sag00071    Fatty acid metabolism
sag00100    Biosynthesis of steroids
sag00190    Oxidative phosphorylation
sag00220    Urea cycle and metabolism of amino groups
sag00230    Purine metabolism
sag00240    Pyrimidine metabolism
sag00251    Glutamate metabolism
sag00252    Alanine and aspartate metabolism
sag00260    Glycine, serine and threonine metabolism
sag00271    Methionine metabolism
sag00272    Cysteine metabolism
sag00280    Valine, leucine and isoleucine degradation
sag00290    Valine, leucine and isoleucine biosynthesis
sag00300    Lysine biosynthesis
sag00330    Arginine and proline metabolism
sag00340    Histidine metabolism
sag00350    Tyrosine metabolism
sag00360    Phenylalanine metabolism
sag00380    Tryptophan metabolism
sag00400    Phenylalanine, tyrosine and tryptophan biosynthesis
sag00430    Taurine and hypotaurine metabolism
sag00440    Aminophosphonate metabolism
sag00450    Selenoamino acid metabolism
sag00460    Cyanoamino acid metabolism
sag00471    D-Glutamine and D-glutamate metabolism
sag00473    D-Alanine metabolism
sag00480    Glutathione metabolism
sag00500    Starch and sucrose metabolism
sag00511    N-Glycan degradation
sag00520    Nucleotide sugars metabolism
sag00521    Streptomycin biosynthesis
sag00523    Polyketide sugar unit biosynthesis
sag00530    Aminosugars metabolism
sag00531    Glycosaminoglycan degradation
sag00550    Peptidoglycan biosynthesis
sag00561    Glycerolipid metabolism
sag00562    Inositol phosphate metabolism
sag00564    Glycerophospholipid metabolism
sag00620    Pyruvate metabolism
sag00624    1- and 2-Methylnaphthalene degradation
sag00630    Glyoxylate and dicarboxylate metabolism
sag00632    Benzoate degradation via CoA ligation
sag00640    Propanoate metabolism
sag00641    3-Chloroacrylic acid degradation
sag00643    Styrene degradation
sag00650    Butanoate metabolism
sag00660    C5-Branched dibasic acid metabolism
sag00670    One carbon pool by folate
sag00680    Methane metabolism
sag00710    Carbon fixation
sag00730    Thiamine metabolism
sag00740    Riboflavin metabolism
sag00760    Nicotinate and nicotinamide metabolism
sag00770    Pantothenate and CoA biosynthesis
sag00780    Biotin metabolism
sag00785    Lipoic acid metabolism
sag00790    Folate biosynthesis
sag00860    Porphyrin and chlorophyll metabolism
sag00900    Terpenoid biosynthesis
sag00903    Limonene and pinene degradation
sag00910    Nitrogen metabolism
sag00920    Sulfur metabolism
sag00960    Alkaloid biosynthesis II
sag00970    Aminoacyl-tRNA biosynthesis
sag01032    Glycan structures - degradation
sag01040    Polyunsaturated fatty acid biosynthesis
sag02010    ABC transporters - General
sag02020    Two-component system - General
sag02060    Phosphotransferase system (PTS)
sag03010    Ribosome
sag03020    RNA polymerase
sag03030    DNA replication
sag03060    Protein export
sag03090    Type II secretion system
(End of table)

Appendix D

Bacterial pathogens causing neonatal sepsis

D.1 Introduction

Neonatal sepsis causes high rates of morbidity and mortality in newborns and is a significant clinical problem. The propensity of certain bacterial species to cause severe neonatal infection has prompted active epidemiological surveillance, culture-based screening, and vaccine research and development. Neonatal sepsis arises from complex interactions between the immature host immune system and a constellation of bacterial virulence factors. In Chapter 3, we attempted to predict the virulence of an important neonatal pathogen, group B streptococcus (GBS), using bacterial genotyping data, but failed to achieve sufficient discriminatory power for clinical utility. It was concluded that a search for pathogen markers with higher discriminatory power is required to successfully predict the virulence of GBS.

In Chapters 5 and 6, two in silico CGP methods were developed and evaluated to discover genes associated with a phenotype of interest. One of the two methods, statistical CGP, was successful in identifying genes responsible for peptidoglycan and anaerobic metabolism. Statistical CGP scores the relevant genes highly by finding genes that are present in genomes displaying a particular functional characteristic and absent in genomes that do not. An analogous concept was proposed by Falkow as molecular Koch’s postulates, which state that virulence genes should be present in pathogens and absent in non-pathogens. Since several neonatal pathogens can cause bacteraemia in neonates, it is possible that these bacterial pathogens share certain virulence factors responsible for causing sepsis syndrome. The use of statistical CGP may thus reveal interesting genes and assist with our exploration of genetic determinants associated with molecular pathogenesis.
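The presence/absence idea behind statistical CGP can be illustrated with a minimal sketch. It is not the scoring statistic developed in Chapter 6: as a stand-in, each gene is scored by the difference between the fraction of positive genomes and the fraction of negative genomes in which it occurs. All gene and genome names are hypothetical, as are the function names `presence_absence_scores` and `rank_genes`.

```python
def presence_absence_scores(profiles, positives, negatives):
    """Score each gene by how well it separates the two genome groups.

    profiles  : {gene: set of genomes in which the gene is present}
    positives : set of genomes displaying the phenotype
    negatives : set of genomes lacking the phenotype

    Illustrative score (an assumption, not the thesis statistic):
    fraction of positives with the gene minus fraction of negatives
    with the gene. +1 means present in every positive genome and
    absent from every negative genome.
    """
    if not positives or not negatives:
        raise ValueError("both genome groups must be non-empty")
    scores = {}
    for gene, genomes in profiles.items():
        pos_rate = len(genomes & positives) / len(positives)
        neg_rate = len(genomes & negatives) / len(negatives)
        scores[gene] = pos_rate - neg_rate
    return scores


def rank_genes(scores):
    """Genes ordered from highest to lowest score (ties broken by name)."""
    return sorted(scores, key=lambda g: (-scores[g], g))


# Toy phylogenetic profiles (hypothetical genes and genomes).
positives = {"P1", "P2", "P3", "P4"}
negatives = {"N1", "N2"}
profiles = {
    "geneA": {"P1", "P2", "P3", "P4"},              # only in positives
    "geneB": {"P1", "P2", "P3", "P4", "N1", "N2"},  # present everywhere
    "geneC": {"P1", "P2", "N1"},                    # mixed occurrence
    "geneD": {"N1", "N2"},                          # only in negatives
}

scores = presence_absence_scores(profiles, positives, negatives)
print(rank_genes(scores))
```

A gene that occurs in every genome (such as the toy "geneB") receives a score of zero under this scheme, reflecting that ubiquitous genes carry no information about the phenotype.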

As stated in Chapter 6, the step of genome example selection is crucial to a successful statistical CGP task. The aim of this appendix is to review the literature and summarise which bacterial pathogens are likely to cause neonatal septicaemia. The review will serve as the basis for evidence-based selection of genome examples for the statistical CGP task described in Chapter 8.

D.2 Gram-positive pathogens

D.2.1 Group B streptococcus (GBS)

The clinical manifestations of GBS diseases were reviewed in Chapter 2. This appendix will focus on the epidemiological aspect of GBS as a neonatal pathogen.

GBS is the most common pathogen in developed countries

GBS is the most common neonatal pathogen in developed countries. GBS was found to be among the leading causes of neonatal infection in a 1995–1998 North American study of 3066 infants aged less than 3 months [304]. The mortality of GBS early-onset disease (EOD) was reported to be up to 21% in an 8-year survey [305]. GBS is the most common causal pathogen in neonatal meningitis associated with late-onset disease (LOD; 37%) [306, 307]. Compared with E. coli, EOD caused by GBS is associated with several risk factors, including a higher number of vaginal examinations, prolonged rupture of membranes, chorioamnionitis, and a higher proportion of pneumonia [308].

Outside Westernised countries, GBS constitutes only a small proportion of neonatal diseases [262]. However, despite the lower overall prevalence of GBS infections, GBS infection was found to be associated with higher mortality rates in both early- and late-onset diseases in developing countries [309].

As discussed in Chapter 2.2.2, universal gestational screening for GBS has been recommended by the Centers for Disease Control in the United States after studies demonstrated a reduction in the incidence of EOD caused by GBS. Despite these preventative efforts, invasive EOD can still occur in pregnancies screened negative for GBS colonisation [310]. In addition, the incidence of LOD and non-GBS infections was not affected by the institution of intrapartum antibiotics [311]. These unsolved challenges mean that GBS continues to be an important perinatal pathogen.

D.2.2 Staphylococcus spp.

Coagulase-negative staphylococci are the leading cause of LOD

Staphylococcus species, in particular coagulase-negative staphylococcus (CoNS), emerged to become important pathogens causing EOD after the introduction of culture-based screening for GBS [312]. CoNS is a nosocomial pathogen, and its infections manifest as bacteraemia, urinary tract infection, and infection associated with intravenous catheters [313]. Late-onset diseases caused by CoNS primarily affect premature infants born at <30 weeks of gestation, a susceptible group of neonates who usually present with necrotising enterocolitis and sepsis originating from skin infections [314]. An Australian study in the late 1990s revealed that Staphylococcus spp., including CoNS, methicillin-resistant S. aureus (MRSA), and methicillin-sensitive S. aureus (MSSA), were among the most frequent bacterial pathogens isolated [307]. This is in contrast to pathogens causing EOD, which were dominated by GBS and E. coli [307]. Of note, both CoNS (50%) and S. aureus (34%) were found to be strongly associated with LOD developed in neonatal intensive care units (NICU) [306, 307].

Coagulase-negative staphylococci are also important neonatal pathogens in developing countries. CoNS was found to be the predominant bacterial pathogen isolated from blood cultures in countries including Iran (63%) [315], South Korea [316], and others [262].

S. aureus infections carry a high mortality in neonates

Neonatal sepsis attributed to S. aureus mostly manifests as LOD with a probable nosocomial link [317]. The incidence of neonatal diseases caused by S. aureus was approximately 5% in the late 1990s in the U.S. [311]. In Australia, S. aureus is an important causal pathogen in LOD, with up to 25% of late-onset sepsis being attributable to MRSA or MSSA [307]. Neonates with late-onset MRSA diseases usually present with skin abscesses and cellulitis, and subsequent septicaemia carries a high mortality [318]. This is different from EOD, where high mortality rates are caused by MSSA infections instead of MRSA [318].

D.2.3 Other Gram-positive pathogens

A small proportion of fatalities associated with neonatal sepsis is caused by non-GBS, non-staphylococcal Gram-positive bacteria [305]. Streptococcus pneumoniae is a less common pathogen in newborns compared with GBS, but the morbidities associated with pneumococcal infection are also significant; mortality of up to 27% among neonatal sepsis cases caused by S. pneumoniae has been reported [319, 320]. S. pneumoniae infections frequently manifest as pneumonia or meningitis and are more likely to occur in full-term newborns [319]. Other Gram-positive cocci have been inconsistently reported as neonatal pathogens. For example, Enterococcus faecalis constituted up to 11% of Gram-positive pathogens in both late- and late-late-onset sepsis in one survey [321]. Neonatal infections with viridans streptococci were found to have an incidence of 5.9% in EOD [311]. Infrequent cases of neonatal sepsis with Streptococcus pyogenes and Listeria monocytogenes have also been reported [262, 305, 311]. In particular, although Listeria monocytogenes was rarely reported as a causal pathogen in surveys, listeriosis is a specific neonatal infection commonly attributed to ingestion of contaminated food during pregnancy [322].

D.3 Gram-negative pathogens

Gram-negative bacteria, particularly Klebsiella pneumoniae, E. coli, Pseudomonas aeruginosa, and Salmonella spp., are the predominant neonatal pathogens in developing countries [262]. Sepsis caused by Gram-negative pathogens carries a higher mortality than that caused by Gram-positive pathogens; early-onset sepsis caused by Gram-negative bacteria has up to 38% higher mortality than its Gram-positive counterpart [305]. In NICU, Gram-negative pathogens demonstrated patterns of regular nosocomial transmission in a closed hospital environment [323]. In addition, the emergence of multidrug-resistant Gram-negative bacteria is of particular concern, as they are frequently associated with increased virulence and a higher incidence of meningitis in newborns [324].

D.3.1 E. coli

E. coli infections have high mortality rates in EOD

E. coli is the most common Gram-negative neonatal pathogen in developed countries. Between 3% and 13% of all neonatal sepsis cases are caused by E. coli, with most cases presenting as early-onset sepsis [307, 311, 317].

Septicaemia is the major presentation of E. coli infection in the newborn, particularly in infants with low gestational age and birth weight [308]. E. coli infection is associated with poorer prognostic indicators, including poor 5-minute Apgar scores, a higher incidence of septicaemia, and a longer length of stay in NICU [308]. Mechanical ventilation is usually required in E. coli sepsis, and admissions to NICU are associated with poorer outcomes [308]. Mortality as high as 61% in EOD caused by E. coli has been reported [306, 325]. Neurological impairments have been found in infants recovering from E. coli sepsis, a lasting sequela of infection that may lead to poor neurological development in early childhood [325].

D.3.2 Klebsiella pneumoniae

K. pneumoniae is the most common neonatal pathogen in developing countries

Klebsiella infection constitutes a significant proportion of neonatal sepsis, and high mortality (25%) in infected neonates has been reported [316]. K. pneumoniae was found to be the most common Gram-negative pathogen (53%) in a 2002–2005 survey in Iran [315]. K. pneumoniae is frequently associated with LOD arising from nosocomial infection. In an early 1990s survey in Southern Israel, 20% of the neonatal infections in a population of Jews and Bedouins were attributable to K. pneumoniae [317]. Up to 25% of all neonatal sepsis cases can be attributed to Klebsiella spp. in developed countries [311, 326].

D.3.3 Haemophilus influenzae

H. influenzae is an important childhood pathogen causing severe pneumonia and meningitis. Vaccination against Haemophilus influenzae type b (Hib) has significantly reduced the incidence of Hib infections [327]. However, isolated cases of Hib have been reported in the post-vaccination era over the last decade [328]. Several recently reported clusters involved non-typeable H. influenzae strains, which are also associated with prematurity and early-onset septicaemia [326, 329].

D.4 Other pathogens

Pseudomonas aeruginosa is an infrequent but lethal Gram-negative pathogen in neonates. Very high overall mortalities were observed in both EOD (80%) and LOD (47%) associated with P. aeruginosa infections [306].

Infections caused by gastrointestinal anaerobes are usually only associated with infants with pre-existing neonatal comorbidities and are rare.

Appendix E

Genome examples used for statistical CGP of neonatal virulence genes

This table lists the 331 negative genome examples used in the statistical CGP of virulence genes in GBS (Chapter 8).

Table E.1: Negative genome examples used in statistical CGP of GBS virulence genes in Chapter 8

Genome    GenBank Accession

Negative genome examples (331)
Acidobacteria bacterium Ellin345    CP000360
Acidothermus cellulolyticus 11B    CP000481
Acidovorax JS42    CP000539–41
Acidovorax avenae citrulli AAC00-1    CP000512
Actinobacillus pleuropneumoniae L20
Aeropyrum pernix    BA000002
Alcanivorax borkumensis SK2    AM286690
Alkalilimnicola ehrlichei MLHE-1    CP000453
Anabaena variabilis ATCC 29413    CP000117 CP000119–21
Anaeromyxobacter dehalogenans 2CP-C    CP000251
Anaplasma marginale St Maries    CP000030
Anaplasma phagocytophilum HZ    CP000235
Aquifex aeolicus    AE000657 AE000667
Archaeoglobus fulgidus    AE000782
Arthrobacter FB24    CP000454
Arthrobacter aurescens TC1    CP000474–6
Aster yellows witches-broom phytoplasma AYWB    CP000061–5
Azoarcus BH72    AM406670
Azoarcus sp EbN1    CR555306–8
Bacillus clausii KSM-K16    AP006627
Bacillus halodurans    BA000004
Bacillus licheniformis ATCC 14580    CP000002
Bacillus licheniformis DSM 13    AE017333
Bacillus thuringiensis Al Hakam    CP000485–6
Bacillus thuringiensis konkukian    AE017355 CP000047
Bacteroides fragilis NCTC 9434    CR626927–8
Bacteroides fragilis YCH46    AP006841–2
Bartonella henselae Houston-1    BX897699
Bartonella quintana Toulouse    BX897700
Baumannia cicadellinicola Homalodisca coagulata    CP000238
Bdellovibrio bacteriovorus    BX842601
Bifidobacterium adolescentis ATCC 15703    AP009256
Bifidobacterium longum    AE014295 AF540971
Bordetella bronchiseptica    BX470250
Bordetella parapertussis    BX470249
Borrelia afzelii PKo    CP000395–403
Borrelia burgdorferi    AE000784–94 AE001115 AE001575–84
Borrelia garinii PBi    CP000013–5
Bradyrhizobium japonicum    BA000040
Brucella abortus 9-941    AE017223–4
Brucella suis 1330    AE014291–2
Buchnera aphidicola    AE016826 AF492591
Buchnera aphidicola Cc Cinara cedri    CP000263
Buchnera aphidicola Sg    AE013218
Buchnera sp    AP001070–1 BA000003
Burkholderia mallei ATCC 23344    CP000010–1
Burkholderia mallei NCTC 10229    CP000545–6
Burkholderia mallei NCTC 10247
Burkholderia mallei SAVP1    CP000525–6
Burkholderia thailandensis E264    CP000085–6
Burkholderia xenovorans LB400    CP000270–2
Campylobacter fetus 82-40    CP000487
Candidatus Blochmannia floridanus    BX248583
Candidatus Blochmannia pennsylvanicus BPEN    CP000016
Candidatus Carsonella ruddii    AP009180
Candidatus Carsonella ruddii PV    AP009180
Candidatus Pelagibacter ubique HTCC1062    CP000084
Candidatus Ruthia magnifica Cm Calyptogena magnifica    CP000488
Carboxydothermus hydrogenoformans Z-2901    CP000141
Caulobacter crescentus    AE005673
Chlamydia muridarum    AE002160 AE002162
Chlamydophila abortus S26 3    CR848038
Chlamydophila caviae    AE015925–6
Chlamydophila felis Fe C-56    AP006861–2
Chlamydophila pneumoniae AR39    AE002161
Chlamydophila pneumoniae CWL029    AE001363
Chlamydophila pneumoniae J138    BA000008
Chlamydophila pneumoniae TW 183    AE009440
Chlorobium chlorochromatii CaD3    CP000108
Chlorobium phaeobacteroides DSM 266    CP000492
Chlorobium tepidum TLS    AE006470
Chromohalobacter salexigens DSM 3043    CP000285
Clostridium acetobutylicum    AE001437–8
Clostridium novyi NT    CP000382
Clostridium thermocellum ATCC 27405
Colwellia psychrerythraea 34H    CP000083
Corynebacterium efficiens YS-314    BA000035
Corynebacterium glutamicum ATCC 13032 Bielefeld    BX927147
Corynebacterium glutamicum ATCC 13032 Kitasato    BA000036
Corynebacterium jeikeium K411    AF401314 CR931997
Coxiella burnetii    AE016828–9
Cyanobacteria bacterium Yellowstone A-Prime    CP000239
Cyanobacteria bacterium Yellowstone B-Prime    CP000240
Cytophaga hutchinsonii ATCC 33406    CP000383
Dechloromonas aromatica RCB    CP000089
Dehalococcoides CBDB1    AJ965256
Dehalococcoides ethenogenes 195    CP000027
Deinococcus geothermalis DSM 11300    CP000358–9
Deinococcus radiodurans    AE000513 AE001825–7
Desulfitobacterium hafniense Y51    AP008230
Desulfotalea psychrophila LSv54    CR522870–2
Desulfovibrio desulfuricans G20    CP000112
Desulfovibrio vulgaris DP4    CP000527–8
Desulfovibrio vulgaris Hildenborough    AE017285–6
Ehrlichia canis Jake    CP000107
Ehrlichia chaffeensis Arkansas    CP000236
Ehrlichia ruminantium Gardel    CR925677
Ehrlichia ruminantium Welgevonden    CR767821
Ehrlichia ruminantium str. Welgevonden    CR925678
Erwinia carotovora atroseptica SCRI1043    BX950851
Erythrobacter litoralis HTCC2594    CP000157
Frankia CcI3    CP000249
Frankia alni ACN14a    CT573213
Geobacillus kaustophilus HTA426    AP006520 BA000043
Geobacter metallireducens GS-15    CP000148–9
Geobacter sulfurreducens    AE017180
Gloeobacter violaceus    BA000045
Gluconobacter oxydans 621H    CP000004–9
Gramella forsetii KT0803    CU207366
Granulobacter bethesdensis CGDNIH1    CP000394
Haemophilus ducreyi 35000HP    AE017143
Haemophilus somnus 129PT    CP000019 CP000436
Hahella chejuensis KCTC 2396    CP000155
Haloarcula marismortui ATCC 43049    AY596290–8
Halobacterium sp    AE004437–8 AF016485
Haloquadratum walsbyi    AM180088–9
Halorhodospira halophila SL1    CP000544
Helicobacter acinonychis Sheeba    AM260522–3
Helicobacter hepaticus    AE017125
Helicobacter pylori 26695    AE000511
Helicobacter pylori HPAG1    CP000241–2
Helicobacter pylori J99    AE001439
Hyperthermus butylicus    CP000493
Hyphomonas neptunium ATCC 15444    CP000158
Idiomarina loihiensis L2TR    AE017340
Jannaschia CCS1    CP000264–5
Lactobacillus brevis ATCC 367    CP000416–8
Lactobacillus delbrueckii bulgaricus    CR954253
Lactobacillus delbrueckii bulgaricus ATCC BAA-365    CP000412
Lactobacillus gasseri ATCC 33323    CP000413
Lactobacillus johnsonii NCC 533    AE017198
Lactobacillus plantarum    AL935263 CR377164–6
Lactobacillus sakei 23K    CR936503
Lactobacillus salivarius UCC118    AF488831–2 CP000233–4
Lactococcus lactis    AE005176
Lactococcus lactis cremoris MG1363    AM406671
Lactococcus lactis cremoris SK11    CP000425–30
Lawsonia intracellularis PHE MN1-00    AM180252–5
Leifsonia xyli xyli CTCB0    AE016822
Leptospira borgpetersenii serovar Hardjo-bovis JB197    CP000350–1
Leptospira borgpetersenii serovar Hardjo-bovis L550    CP000348–9
Leptospira interrogans serovar Copenhageni    AE016823–4
Leptospira interrogans serovar Lai    AE010300–1
Leuconostoc mesenteroides ATCC 8293    CP000414–5
Magnetococcus MC-1    CP000471
Magnetospirillum magneticum AMB-1    AP007255
Mannheimia succiniciproducens MBEL55E    AE016827
Maricaulis maris MCS10    CP000449
Marinobacter aquaeolei VT8
Mesoplasma florum L1    AE017263
Mesorhizobium BNC1    CP000389–92
Mesorhizobium loti    AP003017 BA000012–3
Methanobacterium thermoautotrophicum    AE000666
Methanococcoides burtonii DSM 6242    CP000300
Methanococcus jannaschii    L077117–9
Methanococcus maripaludis S2    BX950229
Methanocorpusculum labreanum Z    CP000559
Methanoculleus marisnigri JR1
Methanopyrus kandleri    AE009439
Methanosaeta thermophila PT    CP000477
Methanosarcina acetivorans    AE010299
Methanosarcina barkeri fusaro    CP000098–9
Methanosarcina mazei    AE008384
Methanosphaera stadtmanae    CP000102
Methanospirillum hungatei JF-1    CP000254
Methylibium petroleiphilum PM1    CP000555–6
Methylobacillus flagellatus KT    CP000284
Methylococcus capsulatus Bath    AE017282
Moorella thermoacetica ATCC 39073    CP000232
Mycobacterium avium 104    CP000479
Mycobacterium avium paratuberculosis    AE016958
Mycobacterium leprae    AL450380
Mycobacterium smegmatis MC2 155    CP000480
Mycobacterium ulcerans Agy99    CP000325
Mycobacterium vanbaalenii PYR-1    CP000511
Mycoplasma capricolum ATCC 27343    CP000123
Mycoplasma gallisepticum    AE015450
Mycoplasma genitalium    L043967
Mycoplasma hyopneumoniae 232    AE017332
Mycoplasma hyopneumoniae 7448    AE017244
Mycoplasma hyopneumoniae J    AE017243
Mycoplasma mycoides    BX293980
Mycoplasma penetrans    BA000026
Mycoplasma pulmonis    AL445566
Mycoplasma synoviae 53    AE017245
Myxococcus xanthus DK 1622    CP000113
Nanoarchaeum equitans    AE017199
Natronomonas pharaonis    CR936257–9
Neisseria meningitidis FAM18    AM421808
Neisseria meningitidis MC58    AE002098
Neisseria meningitidis Z2491    AL157959
Neorickettsia sennetsu Miyayama    CP000237
Nitrobacter hamburgensis X14    CP000319–22
Nitrobacter winogradskyi Nb-255    CP000115
Nitrosococcus oceani ATCC 19707    CP000126–7
Nitrosomonas europaea    AL954747
Nitrosomonas eutropha C71    CP000450–2
Nitrosospira multiformis ATCC 25196    CP000103–6
Nocardia farcinica IFM10152    AP006618–20
Nocardioides JS614    CP000508–9
Nostoc sp    AP003602–6 BA000019–20
Novosphingobium aromaticivorans DSM 12444    CP000248
Oceanobacillus iheyensis    BA000028
Oenococcus oeni PSU-1    CP000411
Onion yellows phytoplasma    AP006628
Parachlamydia sp UWE25    BX908798
Paracoccus denitrificans PD1222    CP000489–91
Pediococcus pentosaceus ATCC 25745    CP000422
Pelobacter carbinolicus    CP000142
Pelobacter propionicus DSM 2379    CP000482–4
Pelodictyon luteolum DSM 273    CP000096
Photobacterium profundum SS9    CR354531–2 CR377818
Photorhabdus luminescens    BX470251
Picrophilus torridus DSM 9790    AE017261
Pirellula sp    BX119912
Polaromonas JS666    CP000316–8
Polaromonas naphthalenivorans CJ2    CP000529–37
Prochlorococcus marinus AS9601    CP000551
Prochlorococcus marinus CCMP1375    AE017126
Prochlorococcus marinus MED4    BX548174
Prochlorococcus marinus MIT9313    BX548175
Prochlorococcus marinus MIT 9301
Prochlorococcus marinus MIT 9303    CP000554
Prochlorococcus marinus MIT 9312    CP000111
Prochlorococcus marinus MIT 9515    CP000552
Prochlorococcus marinus NATL1A    CP000553
Prochlorococcus marinus NATL2A    CP000095
Pseudoalteromonas atlantica T6c    CP000388
Pseudoalteromonas haloplanktis TAC125    CR954246–7
Pseudomonas entomophila L48    CT573326
Pseudomonas syringae phaseolicola 1448A    CP000058–60
Pseudomonas syringae pv B728a    CP000075
Pseudomonas syringae tomato DC3000    AE016853–5
Psychrobacter arcticum 273-4    CP000082
Psychrobacter cryohalolentis K5    CP000323–4
Psychromonas ingrahamii 37    CP000510
Pyrobaculum aerophilum    AE009441
Pyrobaculum calidifontis JCM 11548
Pyrobaculum islandicum DSM 4184    CP000504
Pyrococcus abyssi    AL096836 U049503
Pyrococcus furiosus    AE009950
Pyrococcus horikoshii    BA000001
Ralstonia eutropha H16    AM260479–80
Ralstonia eutropha JMP134    CP000090–3
Ralstonia metallidurans CH34    CP000352–5
Ralstonia solanacearum    AL646052–3
Rhizobium etli CFN 42    CP000133–8


Rhizobium leguminosarum bv viciae 3841    AM236080–6
Rhodoferax ferrireducens T118    CP000267–8
Rhodopseudomonas palustris BisA53    CP000463
Rhodopseudomonas palustris BisB18    CP000301
Rhodopseudomonas palustris BisB5    CP000283
Rhodopseudomonas palustris CGA009    BX571963–4
Rhodopseudomonas palustris HaA2    CP000250
Rhodospirillum rubrum ATCC 11170    CP000230–1
Rickettsia bellii RML369-C    CP000087
Rickettsia conorii    AE006914
Rickettsia prowazekii    AJ235269
Roseobacter denitrificans OCh 114    CP000362 CP000464–7
Rubrobacter xylanophilus DSM 9941    CP000386
Saccharophagus degradans 2-40    CP000282
Salinibacter ruber DSM 13855    CP000159–60
Shewanella amazonensis SB2B    CP000507
Shewanella baltica OS155
Shewanella denitrificans OS217    CP000302
Shewanella frigidimarina NCIMB 400    CP000447
Shewanella loihica PV-4
Shewanella oneidensis    AE014299–300
Silicibacter TM1040    CP000375–7
Silicibacter pomeroyi DSS-3    CP000031–2
Sinorhizobium meliloti    AE006469 AL591688 AL591985
Sodalis glossinidius morsitans    AP008232–5
Solibacter usitatus Ellin6076    CP000473
Sphingopyxis alaskensis RB2256    CP000356–7
Staphylococcus saprophyticus    AP008934–6
Staphylothermus marinus F1
Streptococcus mutans    AE014133
Streptococcus sanguinis SK36
Streptococcus thermophilus CNRZ1066    CP000024
Streptococcus thermophilus LMD-9    CP000419–21
Streptococcus thermophilus LMG 18311    CP000023
Streptomyces avermitilis    AP005645 BA000030
Streptomyces coelicolor    AL589148 AL645771 AL645882
Sulfolobus acidocaldarius DSM 639    CP000077
Sulfolobus solfataricus    AE006641
Sulfolobus tokodaii    BA000023
Symbiobacterium thermophilum IAM14863    AP006840
Synechococcus CC9311    CP000435
Synechococcus CC9605    CP000110
Synechococcus CC9902    CP000097


Synechococcus elongatus PCC 6301    AP008231
Synechococcus elongatus PCC 7942    CP000100–1
Synechococcus sp WH8102    BX548020
Synechocystis PCC6803    AP004310–2 AP006585 BA000022
Syntrophobacter fumaroxidans MPOB    CP000478
Syntrophomonas wolfei Goettingen    CP000448
Syntrophus aciditrophicus SB    CP000252
Thermoanaerobacter tengcongensis    AE008691
Thermobifida fusca YX    CP000088
Thermococcus kodakaraensis KOD1    AP006878
Thermofilum pendens Hrk 5    CP000505–6
Thermoplasma acidophilum    AL139299
Thermoplasma volcanium    BA000011
Thermosynechococcus elongatus    BA000039
Thermotoga maritima    AE000512
Thermus thermophilus HB27    AE017221–2
Thermus thermophilus HB8    AP008226–8
Thiobacillus denitrificans ATCC 25259    CP000116
Thiomicrospira crunogena XCL-2    CP000109
Thiomicrospira denitrificans ATCC 33889    CP000153
Treponema denticola ATCC 35405    AE017226
Trichodesmium erythraeum IMS101    CP000393
Tropheryma whipplei TW08 27    BX072543
Tropheryma whipplei Twist    AE014184
Ureaplasma urealyticum    AF222894
Verminephrobacter eiseniae EF01-2    CP000542–3
Vibrio fischeri ES114    CP000020–2
Wigglesworthia brevipalpis    AB063523 BA000021
Wolbachia endosymbiont of Brugia malayi TRS    AE017321
Wolbachia endosymbiont of Drosophila melanogaster    AE017196
Wolinella succinogenes    BX571656
Xanthomonas campestris    AE008922
Xanthomonas campestris 8004    CP000050
Xanthomonas campestris vesicatoria 85-10    AM039948–52
Xanthomonas citri    AE008923–5
Xanthomonas oryzae KACC10331
Xanthomonas oryzae MAFF 311018    AP008229
Xylella fastidiosa    AE003849–51
Xylella fastidiosa Temecula1    AE009442–3
Yersinia pseudotuberculosis IP32953    BX936398–400
Zymomonas mobilis ZM4    AE008692
(End of table)


