In silico virulence prediction and virulence gene
discovery of Streptococcus agalactiae
Frank Po-Yen LIN
Centre for Health Informatics
School of Public Health and Community Medicine
University of New South Wales
A thesis submitted in fulfilment of requirements for the degree
of Doctor of Philosophy
October 2009
Declaration of originality
I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, nor material which to a substantial extent has been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in this thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis.
I also declare that the intellectual content of the thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation, and linguistic expression is acknowledged.
Frank Po-Yen LIN
October 2009
To my parents
and
my aunt
Abstract
Physicians frequently face challenges in predicting which bacterial subpopulations are likely to cause severe infections. A more accurate prediction of virulence would improve diagnostics and limit the extent of antibiotic resistance. Nowadays, bacterial pathogens can be typed with high accuracy with advanced genotyping technologies. However, effective translation of bacterial genotyping data into assessments of clinical risk remains largely unexplored.
The discovery of unknown virulence genes is another key determinant of successful prediction of infectious disease outcomes. The trial-and-error method for virulence gene discovery is time-consuming and resource-intensive. Selecting candidate genes with higher precision can thus reduce the number of futile trials. Several in silico candidate gene prioritisation (CGP) methods have been proposed to aid the search for genes responsible for inherited diseases in humans. How the CGP concept can assist with virulence gene discovery in bacterial pathogens has not yet been investigated.
The main contribution of this thesis is to demonstrate the value of translational bioinformatics methods in addressing challenges in virulence prediction and virulence gene discovery. This thesis studied an important perinatal bacterial pathogen, group B streptococcus (GBS), the leading cause of neonatal sepsis and meningitis in developed countries. While several antibiotic prophylactic programs have successfully reduced the number of early-onset neonatal diseases (infections that occur within 7 days of life), the prevalence of late-onset infections (infections that occur between 7–30 days of life) has remained constant. In addition, the widespread use of intrapartum prophylactic antibiotics may introduce undue risk of penicillin allergy and may trigger the development of antibiotic-resistant microorganisms. To minimise such potential harm, a more targeted approach to antibiotic use is required. Distinguishing virulent GBS strains from colonising counterparts is thus the cornerstone of tailored therapy.
There are three aims of this thesis:
1. Prediction of virulence by analysis of bacterial genotype data:
To identify markers that may be associated with GBS virulence, statistical analysis was performed on GBS genotype data from 780 invasive and 132 colonising S. agalactiae isolates. Of the panel of 18 molecular markers studied, only the alp3 gene (which encodes a surface protein antigen commonly associated with serotype V) showed an increased association with invasive disease (OR=2.93, p=0.0003, Fisher’s exact test). Molecular serotype II (OR=10.0, p=0.0007) was found to have a significant association with early-onset neonatal disease when compared with late-onset disease.
To investigate whether clinical outcomes can be predicted from the panel of genotype markers, logistic regression and machine learning algorithms were applied to distinguish invasive isolates from colonising isolates. However, the analysis yielded only weak predictive power (area under ROC curve, AUC: 0.56–0.71, stratified 10-fold cross-validation). It was concluded that a definitive predictive relationship between the molecular markers and clinical outcomes may be lacking, and that more discriminative markers of GBS virulence need to be investigated.
2. Development of two computational CGP methods to assist with functional discovery of prokaryotic genes:
Two in silico CGP methods were developed based on comparative genomics: statistical CGP exploits the differences in gene frequency between phenotypic groups, while inductive CGP applies supervised machine learning to identify genes with similar occurrence patterns across a range of bacterial genomes. Three rediscovery experiments were carried out to evaluate the CGP methods:
• Rediscovery of peptidoglycan genes was attempted with 417 published bacterial genome sequences. Both CGP methods achieved their best AUC of >0.911 in the Escherichia coli K-12 genome and >0.978 in the Streptococcus agalactiae 2603 (SA-2603) genome, with an average improvement in precision of >3.2-fold and a maximum of >27-fold using statistical CGP. A median AUC of >0.95 could still be achieved with as few as 10 genome examples in each group in the rediscovery of the peptidoglycan metabolism genes.
• A maximum of 109-fold improvement in precision was achieved in the rediscovery of anaerobic fermentation genes.
• In the rediscovery experiment with genes of 31 metabolic pathways in SA-2603, 14 pathways achieved an AUC >0.9 and 28 pathways achieved an AUC >0.8 with the best inductive CGP algorithms.
The results from the rediscovery experiments demonstrated that the two CGP methods can assist with the study of functionally uncategorised genomic regions and the discovery of bacterial gene-function relationships.
3. Application of the CGP methods to discover GBS virulence genes:
Both statistical and inductive CGP were applied to assist with the discovery of unknown GBS virulence factors. Among a list of hypothetical protein genes, several highly-ranked genes were plausibly involved in the molecular mechanisms of GBS pathogenesis, including genes encoding family 8 glycosyltransferase, family 1 and family 2 glycosyltransferases, multiple adhesins, streptococcal neuraminidase, staphylokinase, and other factors that may contribute to GBS virulence. Such genes may be candidates for further biological validation. In addition, the co-occurrence of these genes with currently known virulence factors suggested that the virulence mechanisms of GBS in causing perinatal diseases are multifactorial. The procedure demonstrated in this prioritisation task should assist with the discovery of virulence genes in other pathogenic bacteria.
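The statistical CGP idea summarised under aim 2 can be illustrated with a minimal sketch. It is not the thesis implementation: the genome and gene identifiers are invented toy data, the function names are hypothetical, and the 0.5 continuity correction is an assumption added here to avoid division by zero. For each gene, a 2 × 2 contingency table is built from its presence or absence in genomes with and without the phenotype of interest, and genes are ranked by odds ratio, one of the scoring functions listed in the table of abbreviations.

```python
from typing import Dict, List, Set, Tuple


def contingency(present_in: Set[str], positive: Set[str],
                negative: Set[str]) -> Tuple[int, int, int, int]:
    """Build the 2 x 2 table for one gene.

    a: gene present in phenotype-positive genomes
    b: gene present in phenotype-negative genomes
    c: gene absent from phenotype-positive genomes
    d: gene absent from phenotype-negative genomes
    """
    a = len(present_in & positive)
    b = len(present_in & negative)
    c = len(positive) - a
    d = len(negative) - b
    return a, b, c, d


def odds_ratio(a: int, b: int, c: int, d: int, k: float = 0.5) -> float:
    # Assumption: add k to every cell so that empty cells do not
    # cause division by zero (a common continuity correction).
    return ((a + k) * (d + k)) / ((b + k) * (c + k))


def rank_genes(occurrence: Dict[str, Set[str]], positive: Set[str],
               negative: Set[str]) -> List[Tuple[str, float]]:
    """Score every gene and sort from most to least phenotype-associated."""
    scored = [
        (gene, odds_ratio(*contingency(genomes, positive, negative)))
        for gene, genomes in occurrence.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): three genomes with the phenotype, three without.
positive = {"g1", "g2", "g3"}
negative = {"g4", "g5", "g6"}

# Occurrence matrix stored sparsely: gene -> genomes in which it occurs.
occurrence = {
    "geneA": {"g1", "g2", "g3"},        # only in positive genomes
    "geneB": {"g1", "g4", "g5", "g6"},  # mostly in negative genomes
    "geneC": {"g1", "g2", "g4"},        # weakly associated
}

ranking = rank_genes(occurrence, positive, negative)
for gene, score in ranking:
    print(f"{gene}\t{score:.2f}")
```

Other scoring functions, such as chi-square or the F-measure, could be substituted for `odds_ratio` without changing the ranking step.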
Acknowledgements
First of all, I wish to express my gratitude to my supervisor Professor Enrico Coiera for his guidance and encouragement over the last three years. In particular, I could not have finished my work without his constant optimism, experience, and striving for perfection. My gratitude goes equally to Dr Vitali Sintchenko, my co-supervisor, whose vision and remarkable attention to detail have truly inspired me. Both Enrico and Vitali have strengthened my interest in the fields of clinical decision support and genomics. Their encouragement has been an essential element of my candidature.
Coming from a non-technical, non-laboratory background, I could not have completed this thesis without the expertise and assistance of the following people: Professor Lyn Gilbert and Dr Fanrong Kong for leading me into the fascinating fields of clinical microbiology and molecular epidemiology; Heather Hiddings for her helpful discussions in biostatistics; Danny Ko for collecting and curating GBS genotyping data; Dr. Ruiting Lan for his expert knowledge of microbial genetics; and Drs. Mike Bain, Ashwin Srinivasan and Guy Tsafnat for sharing their knowledge of machine learning and for their generous comments on experimental design. I am also greatly indebted to Enrico, Vitali, Lyn, Kong, and Ruiting for their assistance in editing the earlier drafts of this thesis.
I would also like to thank the many anonymous reviewers and editors of BMC Bioinformatics, Journal of Infectious Diseases, Clinical Microbiology and Infection, and Pathology for their invaluable insights on my work. Their constructive criticisms constituted a substantial part of my learning in conducting rigorous scientific research (albeit sometimes the hard way!).
Over the years, I received much useful advice and had many helpful discussions with my colleagues and seniors at the Centre for Health Informatics, including (in alphabetical order) Stephen Anthony, Farshid Anvari, Grace Chung, Adam Dunn, Blanca Gallego Luxan, Andrew Georgiou, Simon Li, Annie Lau, Farah Magrabi, Geoff McDonnell, Hieu Phan, Victor Vickland, Prof. Johanna Westbrook, Tatjana Zrimec, and with my fellow students past and present: Afra, David, Jaron, MeiSing, Nerida, Rosie, Torsten, and Zafar. Also, I could not have done without the assistance of the dedicated admin team: Sarah Behman, Keri Bell, Danielle Del Pizzo, Janice Ooi, Samantha Sheridan, Denise Tsiros, and Gerard Viswasam.
Financially, I would like to thank the National Health and Medical Research Council of Australia for funding my scholarship.
I wish to thank my family for their constant encouragement during my candidature. I thank my parents for supporting my decisions on what I wanted to do. Finally, I could not have completed my work without the support of Jerlyna, my long-term girlfriend (now fiancée), who shared many of my frustrations and emotional upheavals over the last few years. Maintaining a long-distance relationship was challenging, and I am extremely grateful to have had her stand by my side, with much understanding and patience, throughout the journey in pursuing my goal.
Table of abbreviations
ADTree      Alternating decision tree algorithm
amss        Arithmetic mean of sensitivity and specificity (scoring function)
AUC         Area under receiver operating characteristic curve
bchisq      Directional chi-square scoring function
BLAST       Basic Local Alignment Search Tool
CG          Cumulative gain function
CGP         Candidate gene prioritisation
chisq       Chi-square scoring function
CMP-NeuNAc  Cytidine 5’-monophospho-N-acetylneuraminic acid
COG         Clusters of orthologous groups
CoNS        Coagulase-negative staphylococci
CSF         Cerebrospinal fluid
DNA         Deoxyribonucleic acid
ECM         Extracellular matrix
EOD         Early-onset (neonatal) disease
EST         Expressed sequence tag
F           F-measure scoring function
GAG         Glycosaminoglycan
GBS         Group B streptococcus
GC content  Guanine-Cytosine content
GT1/2/8     Glycosyltransferase family 1/2/8
HGT         Horizontal gene transfer
Hib         Haemophilus influenzae type b
hmss        Harmonic mean of sensitivity and specificity (scoring function)
IBk         k-nearest neighbour algorithm
IgA/G/M     Immunoglobulin A/G/M
IR          Information retrieval
KEGG        Kyoto Encyclopaedia of Genes and Genomes
LOD         Late-onset (neonatal) disease
LPS         Lipopolysaccharide
LR          Logistic regression
MGE         Mobile genetic elements
MLP         Multilayer perceptrons
MLST        Multilocus sequence typing
mPCR/RLB    Multiplex PCR-based reverse line blot
MRSA        Methicillin-resistant Staphylococcus aureus
MS          Molecular serotype
MSST        Molecular serosubtype (of MS type III)
MSSA        Methicillin-sensitive Staphylococcus aureus
NB          Naïve Bayes algorithm
NCBI        National Center for Biotechnology Information
NeuNAc      N-acetylneuraminic acid
NICU        Neonatal intensive care unit
npv         Negative predictive value (scoring function)
OMIM        Online Mendelian Inheritance in Man (database)
OR          Odds ratio
OR          Odds ratio scoring function
orf         Open reading frame(s)
PCR         Polymerase chain reaction
pct         Percentile
PE          Probability enrichment
PGP         Protein genetic profiles
ppv         Positive predictive value (scoring function)
PROM        Premature rupture of the membranes
PRU         Polysaccharide repeating units
RFLP        Restriction fragment length polymorphism
ROC         Receiver operating characteristic
sens        Sensitivity (scoring function)
SMO         Sequential minimal optimization algorithm
spec        Specificity (scoring function)
ST          Sequence type (of MLST)
SVM         Support vector machines
SVM/RBF     SVM with radial basis function kernel
SVM/Poly    SVM with polynomial kernel
VSM         Vector-space model
WEKA        Waikato Environment for Knowledge Analysis
ZeroR       ZeroR majority-class classifier
List of Publications
1. F Lin, V Sintchenko, F Kong, GL Gilbert, and E Coiera. Commonly-used molecular epidemiology markers of Streptococcus agalactiae do not appear to predict virulence. Pathology (Accepted 29 October 2008).
2. F Lin, E Coiera, RT Lan, and V Sintchenko. In silico prioritisation of candidate genes for prokaryotic gene function discovery. BMC Bioinformatics 2009, 10:86.
3. F Lin. The role of data mining in clinical predictive medicine: a narrative review. In: Westbrook, J. and Callen, J., eds. Bridging the Digital Divide: Clinician, Consumer and Computer: Proceedings of 14th Annual Conference of Health Informatics Society Australia (HIC 2006); Sydney, Australia, August 2006.
4. F Lin. Factors affecting the classification performance of machine learning algorithms in clinical genomics. (poster) presented at Bioinformatics Australia 2006 Conference; Sydney, Australia, 21 November 2006.
5. F Lin, V Sintchenko, and E Coiera. A comparative genomic approach for suggesting candidate virulence genes in Streptococcus pneumoniae. (poster) presented at Genetics and Genomics of Infectious Diseases 2009 (Nature conference); Singapore, 21–24 March 2009.
This work applied the inductive candidate gene prioritisation method developed in Chapter 5 to S. pneumoniae, an important respiratory pathogen, in the same fashion as the methods described in Chapter 9.
Contents
Abstract
Acknowledgements
Table of abbreviations
List of Publications
1 Introduction
1.1 Explosion of genetic information
1.2 Translational bioinformatics
1.3 Bacterial pathogens
1.3.1 Bacterial classification and typing
1.3.2 Virulence mechanisms and virulence genes
1.4 Antibiotics and their optimal use
1.5 Virulence gene discovery
1.5.1 Classical approach
1.5.2 Bacterial comparative genomics
1.5.3 Tools for comparing different bacterial genomes
1.6 Candidate gene prioritisation (CGP)
1.7 Guide to thesis
I Virulence prediction in Streptococcus agalactiae
2 Group B streptococcal diseases and typing methods
2.1 Introduction
2.2 Clinical manifestations of GBS diseases
2.2.1 Maternal carriage and disease
2.2.2 Early-onset neonatal disease
2.2.3 Late-onset disease
2.3 Typing of GBS strains
2.3.1 Serotyping
2.3.2 Molecular characterisation
2.3.3 Genome sequences
3 GBS virulence classification
3.1 Introduction
3.2 Descriptive analysis of GBS markers
3.2.1 Material and Methods
3.2.2 Results
3.2.3 Discussion
3.3 Predictive analysis by machine learning
3.3.1 Material and methods
3.3.2 Results
3.3.3 Discussion
3.4 Potential limitations
3.5 Selection of discriminatory features
3.5.1 Methods
3.5.2 Results
3.5.3 Discussion
3.6 Evolutionary considerations
3.6.1 Horizontal gene transfer
3.6.2 Positive selection of virulence genes
3.6.3 Virulence gene typing
3.7 Chapter summary
II Co-occurrence-based candidate gene prioritisation
4 Candidate gene prioritisation: literature review
4.1 Introduction
4.1.1 Overview of the in silico CGP process
4.2 Types of CGP
4.2.1 Characteristics-based prioritisation
4.2.2 Inductive prioritisation
4.3 Data sources
4.3.1 Primary data source
4.3.2 Secondary data sources
4.3.3 Primary versus secondary data sources
4.4 Gene features
4.4.1 Co-occurrence
4.4.2 Similarity
4.5 Feature integration
4.5.1 Ad hoc sorting and filtering
4.5.2 Vector-space model
4.5.3 Statistical models
4.5.4 Machine learning models
4.6 Methods of evaluation
4.6.1 Validation by rediscovery experiments
4.6.2 Threshold-dependent measures
4.6.3 Area under ROC curve
4.6.4 External validation
4.7 Discussion
4.7.1 Conservation hypothesis and biases
4.7.2 Challenges in GBS virulence gene discovery
5 CGP methods for prokaryotic genes
5.1 Gene-function co-occurrence
5.2 Formal Definitions
5.3 Statistical CGP
5.3.1 Occurrence matrix
5.3.2 The 2 × 2 contingency tables
5.3.3 Scoring functions
5.4 Inductive CGP
5.5 Evaluation by rediscovery experiments
5.5.1 Threshold-dependent measures
5.5.2 Area under ROC curve
5.5.3 Evaluation measures used in this thesis
6 Co-occurrence-based CGP: case studies
6.1 Case study 1: Rediscovery of peptidoglycan genes
6.1.1 Methods
6.1.2 Results
6.1.3 Discussion
6.2 How many genome examples are needed?
6.2.1 Methods
6.2.2 Results
6.2.3 Discussion
6.3 Case study 2: Anaerobic mixed-acid fermentation genes
6.3.1 Methods
6.3.2 Results
6.3.3 Discussion
6.4 Case study 3: Rediscovery of KEGG pathways
6.4.1 Methods
6.4.2 Results
6.4.3 Discussion
6.5 Potential limitations
6.6 Summary
III Virulence gene discovery in S. agalactiae
7 Review: virulence genes of Streptococcus agalactiae
7.1 Introduction
7.2 Adhesins
7.2.1 Fibrinogen-binding protein genes
7.2.2 Fibronectin-binding protein gene
7.2.3 Streptococcal C5a-peptidase
7.2.4 Laminin-binding protein gene
7.2.5 Minor pilin gene cluster
7.3 Invasins
7.3.1 Cα-protein and α-like protein family genes
7.3.2 β-haemolysin/cytolysin gene cluster
7.3.3 Hyaluronate lyase gene
7.3.4 Other invasins
7.4 Immune system evasion
7.4.1 Cβ protein gene
7.4.2 Streptococcal C5a-peptidase
7.4.3 Capsular polysaccharide cps gene cluster
7.4.4 Sialic acid synthases gene cluster
7.4.5 Serine protease CspA
7.4.6 Penicillin-binding protein 1A gene
7.5 Summary
8 GBS virulence gene discovery by statistical CGP
8.1 Material and methods
8.1.1 Genome example selection
8.1.2 Statistical CGP
8.1.3 Comparison with currently known virulence factors
8.1.4 Clustering of orthologous genes
8.2 Results
8.3 Discussion
8.3.1 Possible significance of top-ranked genes
8.3.2 Few overlaps with currently known virulence factors
8.3.3 Sampling bias involved in genome example selection
8.3.4 Overlapping with anaerobic-specific genes
9 GBS virulence gene discovery by inductive CGP
9.1 Methods
9.1.1 Rediscovery of training genes
9.1.2 Sub-sampling of negative examples
9.2 Results
9.2.1 Rediscovery of GBS virulence genes or gene categories
9.2.2 De novo discovery of GBS virulence genes
9.2.3 Adhesion genes fbsA, fbsB, and pavA (Table 9.3)
9.2.4 lmb and scpB genes (Table 9.4)
9.2.5 GBS minor pilin genes (Table 9.5)
9.2.6 Genes encoding invasins spb1 and bca (Table 9.6)
9.2.7 Cytolysins: the cyl gene cluster and CAMP factor
9.2.8 cspA and hylB genes
9.2.9 cps and neu gene clusters
9.3 Discussion
10 Biological significance of prioritised genes
10.1 Glycosyltransferases
10.1.1 Family 8 glycosyltransferases
10.1.2 Family 1 and 2 glycosyltransferases
10.2 Adhesins
10.2.1 Adherence to ECM and epithelial cells
10.2.2 Adherence to collagen
10.3 Degradation of ECM
10.3.1 Metalloproteases
10.4 Evasion of host immune system
10.4.1 Neuraminidase
10.4.2 Staphylokinase homologue
10.5 Other genes
11 Conclusions
11.1 Summary of contributions
11.1.1 “Typeable” versus “Predictive”
11.1.2 Development of two CGP methods
11.1.3 Used the CGP methods in virulence gene discovery
11.1.4 Some prioritised genes have biological plausibility
11.2 Future directions
11.2.1 Bench-side validation
11.2.2 Virulence prediction studies
11.2.3 Applying other data sources and algorithms
11.2.4 Practical CGP tools
11.2.5 Virulence gene occurrence patterns
11.3 Concluding remarks
Bibliography
Appendices
A Genomes used in the sCGP of peptidoglycan genes
B Genomes used in the sCGP of anaerobic genes
C Using KEGG pathways as validation sets
D Bacterial pathogens causing neonatal sepsis
D.1 Introduction
D.2 Gram-positive pathogens
D.2.1 Group B streptococcus (GBS)
D.2.2 Staphylococcus spp.
D.2.3 Other gram-positive pathogens
D.3 Gram-negative pathogens
D.3.1 E. coli
D.3.2 Klebsiella pneumoniae
D.3.3 Haemophilus influenzae
D.4 Other pathogens
E Genomes used in the sCGP of GBS virulence genes
List of Figures
1.1 Number of bacterial whole-genome sequences (1995–2007)
1.2 Translational research
1.3 Tracking of bacterial clonal lineages by typing
1.4 The classical approach of virulence gene discovery
3.1 Virulence classification
3.2 The important GBS clinical subgroups
3.3 Distributions of MS and PGP in 912 GBS isolates
3.4 Predictive analysis by machine learning
3.5 Clustering versus predictive classifications
3.6 Pseudocode: modified cross-validation procedure
3.7 AUC versus number of features post-feature selection
3.8 Horizontal gene transfers and positive gene selections
4.1 The workflow of in silico candidate gene prioritisation
4.2 Characteristics-based prioritisation
4.3 Inductive candidate gene prioritisation
4.4 Co-occurrence and similarity
4.5 Statistical feature integration
5.1 Gene-function co-occurrence in bacterial genomes
5.2 Using gene-function co-occurrence for gene prioritisation
5.3 The workflow of statistical CGP
5.4 The occurrence matrix
5.5 2 × 2 contingency table
6.1 Case study 1: peptidoglycan genes
6.2 Case study 1: statistical CGP performance
6.3 Case study 1: partial precision and P-R graphs
6.4 Case study 1: the ADTree model
6.5 Case study 1, simulation 1
6.6 Case study 1, simulation 2
6.7 Case study 1, simulation 3
6.8 Case study 3: performance on KEGG pathways
6.9 Mechanisms of gene occurrence in genomes
7.1 Currently known GBS virulence factors
9.1 Sub-sampling of training data
10.1 Potential GBS virulence factors
11.1 A prototype web-based CGP tool
C.1 Variations in CGP performance using KEGG pathways
List of Tables
2.1 Clinical manifestations of perinatal GBS diseases
2.2 GBS clinical risk factors
3.1 Characteristics of 912 GBS isolates
3.2 Frequencies of GBS markers in invasive vs. colonising isolates
3.3 Frequencies of GBS markers in clinical subgroups (a)
3.4 Frequencies of GBS markers in clinical subgroups (b)
3.5 Predictive analysis with machine learning classifiers
3.6 (Nopt, A) of classifier-feature-ranking algo. combinations
3.7 Top GBS markers ranked by feature-ranking algorithms
4.1 Comparison of CGP and IR systems
6.1 Case study 1: List of peptidoglycan-related genes
6.2 Case study 1: SA-2603 genes ranked by statistical CGP
6.3 Case study 1: CGP performance on the SA-2603 genome
6.4 Case study 1: CGP performance on the EC-K12 genome
6.5 Case study 1: CGP performance on the control validation set
6.6 Case study 2: CGP performance
6.7 Case study 2: list of genes in the validation set
6.8 Case study 2: statistical CGP prioritised genes
6.9 Case study 3: evaluated KEGG pathways
6.10 Case study 3: inductive CGP of KEGG pathway genes
7.1 List of GBS virulence factors
7.2 The β-haemolysin/cytolysin gene cluster
7.3 The GBS cps–neu gene cluster
8.1 Positive genome examples used in statistical CGP
8.2 Highly-ranked gene clusters from statistical CGP
8.3 Known virulence genes versus the sCGP prioritised rank
9.1 Training genes for virulence gene discovery by inductive CGP
9.2 Rediscovery performance of training genes by inductive CGP
9.3 Highly-ranked genes (training set: fbsAB and pavA genes)
9.4 Highly-ranked genes (training set: scpB and lmb genes)
9.5 Highly-ranked genes (training set: GBS minor pilin genes)
9.6 Highly-ranked genes (training set: spb1 and bca genes)
9.7 Highly-ranked genes (training set: cyl and cfb genes)
9.8 Highly-ranked genes (training set: cspA and hylB genes)
9.9 Highly-ranked genes (training set: cps and neu gene clusters)
9.10 Highly-ranked genes (training set: pbp1A gene)
A.1 Genome examples used in Chapter 6.1
B.1 Genome examples used in Chapter 6.3
C.1 AUC of original vs. processed sag00400 validation sets
C.2 Genes in original vs. processed sag00400 validation sets
C.3 KEGG pathway listed in Figure C.1
E.1 Negative genome examples in GBS virulence gene discovery
Chapter 1
Introduction
Accurate prediction of pathogen virulence (the ability of a pathogen to cause infection) remains a challenge in clinical practice and the basic sciences. The ability to discriminate pathogenic microorganisms from non-pathogenic counterparts can improve therapeutic decisions in infectious diseases medicine. Improved virulence prediction offers health economic benefits and, through more precise antibiotic prescribing, can limit the extent of antibiotic resistance. In the last two decades, a variety of molecular genetic techniques have been developed that can identify different subpopulations of bacterial pathogens. Using such data to predict virulence, however, remains largely unexplored.
A second challenge lies in the discovery of virulence genes, the genetic determinants that confer on pathogenic microbes the ability to cause infections. Identification of such genes can lead to improvements in diagnostics, therapeutics, and vaccine development. The traditional method for finding such genes can be characterised as a “fishing expedition” involving years of painstaking laboratory screening. The use of comparative genomics with high-throughput experimental methods, such as microarrays and whole-genome sequencing, is gradually accelerating this process, although such methods are considerably more expensive to conduct.
The number of whole-genome sequences of bacteria has been increasing over the last decade. Effective bioinformatic methods of genomic data analysis can provide a low-cost alternative in assisting with virulence gene discovery in the laboratory.
This thesis focuses on the topics of virulence prediction and virulence gene discovery. The two challenges were addressed in the analysis of virulence of an important perinatal pathogen, Streptococcus agalactiae or group B streptococcus (GBS). GBS, normally a part of the microflora of the colon and female urogenital tract, is the leading cause of neonatal sepsis in developed countries. The first part of this thesis will describe using machine learning models to predict clinical outcomes by analysing GBS molecular epidemiology data. Subsequent chapters will focus on developing two computational candidate gene prioritisation (CGP) methods to assist with virulence gene discovery. The later part of this thesis will apply the CGP methods to the task of identifying unknown virulence factors in GBS. Lastly, the potential virulence genes identified by the CGP methods will be discussed in detail.
1.1 Explosion of genetic information
The modern foundation of genetics was established with Mendel’s experiments with garden peas in 1865 [1, 2]. Decades later, the molecular mystery of genetic inheritance was unravelled by the discovery of deoxyribonucleic acid (DNA), a scientific breakthrough that marked the beginning of the era of molecular biology and genetics [3]. In 1977, Frederick Sanger et al. first published the whole-genome sequence of bacteriophage Φ-X174 with 5,375 nucleotides [4]. A decade later, the development of the polymerase chain reaction (PCR) by Mullis et al. permitted large-scale studies of molecular genetics through sequencing projects [5]. In 1995, the first genome sequence of a free-living organism, Haemophilus influenzae Rd, was determined by using the whole-genome shotgun sequencing method [6]. Since then, the number of sequenced genomes has grown exponentially (Figure 1.1). In 2001, an international collaborative effort sequenced the complete haploid genome of Homo sapiens and yielded approximately three billion base-pairs of genetic information [7]. In 2008, a massively parallel sequencing project produced the complete sequence of the diploid genome of a single individual (Dr. James Watson) within two months at a fraction of the cost [8].
These rapid advancements in modern genomics have now led to the anticipation of how a “thousand-dollar” genome sequence will revolutionise healthcare and medical sciences [9]. Besides our interest in understanding the biological world, the potential for using genomic data in medicine is driving speculation about the promise of individualised care. Developing effective methods for translating bench-side genetic and genomic data into useful clinical information is thus an important task in the decades to come.
1.2 Translational bioinformatics
The novel field of translational medicine bridges the “research-clinical gap” by achieving a close collaboration between laboratory science and clinical practice. As shown in Figure 1.2, translational medicine is a two-way process that connects progress in both basic and clinical research [12]. The expansion of genetic information demands increasingly rapid and accurate data analysis. To serve this need, the field of translational bioinformatics develops effective analytic methods. The cornerstone resources and tools required to achieve an effective translation process include data mining and advanced statistical methods.
Figure 1.1: Number of completely sequenced bacterial genomes (n, logarithmic scale) in NCBI databases by year, 1995 to 2007 [10]. Abbreviation: H. inf. Rd: Haemophilus influenzae Rd.
The high-level theme in this thesis is to apply the translational bioinformatic approach to the domain of clinical microbiology and infectious disease medicine. The main contribution of this thesis is to demonstrate how biomedical informatics methods are able to assist clinical sciences in the formulation of new scientific hypotheses. Specifically, within the “translational” framework, this thesis will first study how to apply bacterial genetics to predict clinical outcome by analysing bacterial genetic markers with data mining tools. The later part of this thesis will explore methods based on comparative genomics and bacterial sequence data to discover potential virulence genes.
[Figure 1.2 labels: bench-side, bed-side; exploration, application; biomarker discovery, biostatistics methods, phase I clinical trials, rapid diagnostics, predictive algorithms.]
Figure 1.2: Translational research [11]. Developing systematic scientific methodology in both directions is required to achieve effective translation. In molecular genetics, methods for bridging the clinical–research gap include screening for de novo biomarkers and the development of a systematic approach to the discovery of disease-related genes. Rigorous evaluation of research methodology is an important step in the exploration and the application of translational efforts. In translational research on bacterial virulence, the exploration phase of the translational cycle may involve the identification of virulent strains via descriptive molecular epidemiology studies. In the opposite direction, the application phase may utilise the bacterial genetic data to predict clinical outcomes.
1.3 Bacterial pathogens
The observation of animalcules in 1684 by Antonie van Leeuwenhoek marked the first discovery of bacteria in human history. Bacteria are the simplest free-living unicellular microorganisms organised in a variety of shapes and arrangements and contain a nucleoid with circular DNA [13]. Most bacteria possess a rigid layer of peptidoglycan with various surface components such as flagella or fimbriae.
1.3.1 Bacterial classification and typing
Bacterial classification
Bacteria have been classified by phenotypic features such as morphology, metabolism, staining results, and specific enzymatic activities. Bacterial classification not only has implications for taxonomical studies but also has clinical significance in empirically distinguishing potential pathogens from non-virulent commensals [14].
Figure 1.3: Tracking of bacterial clonal lineages by typing. Classification of bacterial subpopulations (for example types A, B, and C) can be made by identifying features that distinguish individual clonal lineages (depicted by different node colours in the diagram).
In clinical settings, bacterial classification is important in identifying causal agents and in guiding appropriate drug therapies [14]. Nevertheless, these broad classification schemes cannot explain the wide range of pathology caused by species within a single taxonomical group; finer discrimination is required to differentiate subpopulations within a bacterial species.
Bacterial typing
In order to distinguish different bacterial subpopulations (for example, strains within a species) from one another, many methods for bacterial typing have been developed. Typing establishes the clonal lineage of a particular bacterial strain, which enables tracking of its ancestry and origin. Methods of bacterial typing are important in defining the molecular epidemiology of a bacterial species, a discipline that focuses on the investigation of the spread of bacterial clusters in host populations [15,16]. The goal of typing is to determine which bacterial strains are colloquially the “same” or indistinguishable (that is, whether they are derived from the same ancestor, Figure 1.3), which enables the determination of pathogen origins and the tracking of transmission routes, and aids the discovery of virulence genes [17].
Typing provides the ability to track the clonal expansion of a bacterial population and has application in infection control. For example, pulsed-field gel electrophoresis (PFGE) typing of enterohaemorrhagic strains of Escherichia coli (O157:H7) is a well-established method in public health surveillance in the United States, as it provides the ability to track the transmission of this pathogenic strain during sporadic food-borne outbreaks [18]. In addition, bacterial typing has been applied to clinical diagnostics and the study of population genetics [19].
Advances in biotechnology have led to better typing techniques and improved clonal discrimination. Typing can be either phenotype-based or based on molecular targets (gene-based). The classical typing methods are phenotype-based and require differential expression of bacterial phenotypes, such as patterns of electrophoretic mobility of soluble cellular enzymes (for example, multilocus enzyme electrophoresis or MLEE [20]), susceptibility to lytic bacteriophages (for example, Staphylococcus aureus phage typing [21]), and serotyping (for example, Streptococcus pneumoniae [22]). Gene-based typing methods involve the characterisation of genetic variations in the bacterial chromosome or plasmids. Several gene fingerprinting methods have been investigated, including restriction fragment length polymorphism (RFLP) [23], ribotyping [24], variable number tandem repeat analysis, single nucleotide polymorphisms, and multilocus sequence typing [25]. In general, gene-based typing methods offer better reproducibility and discriminatory power compared to the phenotypic methods [25].
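The notion of discriminatory power can be made concrete with a small calculation. The sketch below uses Simpson's index of diversity, a measure commonly applied to typing schemes; the text above does not state which measure is meant, so its use here is an assumption. The function name and the isolate labels are hypothetical.

```python
from collections import Counter


def discriminatory_index(type_labels):
    """Simpson's index of diversity for one typing method.

    type_labels holds one type label per isolate. The value is the
    probability that two isolates drawn at random without replacement
    are assigned to different types.
    """
    n = len(type_labels)
    if n < 2:
        raise ValueError("at least two isolates are required")
    counts = Counter(type_labels)
    # Number of ordered pairs of isolates that share a type.
    same_type_pairs = sum(c * (c - 1) for c in counts.values())
    return 1.0 - same_type_pairs / (n * (n - 1))


# Hypothetical data: the same six isolates typed by two methods.
phenotype_based = ["Ia", "Ia", "III", "III", "III", "V"]
gene_based = ["g1", "g2", "g3", "g3", "g4", "g5"]

print(round(discriminatory_index(phenotype_based), 3))  # 0.733
print(round(discriminatory_index(gene_based), 3))       # 0.933
```

In this toy example the gene-based scheme separates more isolates into distinct types and therefore scores closer to 1.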
1.3.2 Virulence mechanisms and virulence genes
Definition of virulence
Broadly speaking, the virulence of a microorganism is defined as the degree of pathogenicity or the relative capability of a microbe to cause host damage. To define the term “virulence” sensu stricto, however, is non-trivial. Many attributes of virulence have been proposed, including the toxicity of a pathogen (the “dosage” of bacteria required to cause disease), the aggressiveness of the pathogen (the timeliness or severity of disease), the ability to colonise and multiply in the host, the ability of the pathogen to spread from one host to another (pathogen transmission), and the ability of the pathogen to evade or elicit inappropriate immune responses in the host [26, 27]. Because virulence encompasses complex interactions with the host, several definitions of virulence have been proposed that include host factors and host immune status [26, 28].
Virulence genes
Virulence genes are the genes that encode a constellation of biochemical mechanisms to confer pathogenicity in a microorganism. In bacterial pathogens, the virulence gene products may have roles in attachment, colonisation, multiplication, spreading, invasion, and self-protection [29]. Wassenaar and Gaastra classified virulence genes into three broad categories, namely the true virulence genes (protein invasins or genes that directly interact with the host), virulence-associated genes (genetic factors required to activate true virulence genes), and virulence-lifestyle genes (genes assisting with the colonisation and replication of the pathogen) [30].
Characteristics of virulence genes
Identification of virulence genes is important in assisting with the scientific development of clinical diagnosis and therapeutics of infectious diseases. In the 1880s, Koch proposed a set of criteria (known as Koch’s postulates) that must be fulfilled in establishing an aetiological role of a microbe in causing human infection [31]. Analogous criteria were proposed by Falkow to cover the contributions of bacterial genetic factors in human infections (the molecular Koch’s postulates [32]):
1. The virulence genes should be present in all pathogens and absent in non-pathogens.
2. The presentation of disease should be associated with the pathogenic microbes and vice versa.
3. The inactivation of the virulence gene(s) should demonstrate an attenuated virulence in the pathogen and its clones.
4. The reversal of virulence gene functions, with allelic replacement, should restore pathogenicity.
Alternative frameworks to Falkow’s postulates have been proposed, for example, Fredericks and Relman’s adoption of Hill’s criteria of causation in inferring a potential pathogenic role of a putative pathogenicity gene [33]. In the second part of this thesis, the first rule of molecular Koch’s postulates will be explored for developing statistical candidate gene prioritisation (CGP), a bioinformatic method that assists with the task of virulence gene discovery.
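The first postulate (present in pathogens, absent in non-pathogens) lends itself to a simple numerical score. The following is a minimal sketch under stated assumptions, not the statistical CGP method developed later in the thesis: each genome is modelled as a set of gene identifiers, and a gene is scored by the difference between its carriage frequency in pathogenic and in non-pathogenic genomes. All function names, gene names, and genomes are hypothetical.

```python
def presence_absence_score(gene, pathogens, nonpathogens):
    """Carriage frequency in pathogens minus carriage frequency in non-pathogens.

    A score of 1.0 means the gene satisfies the first postulate exactly:
    present in every pathogen genome and absent from every non-pathogen genome.
    """
    freq_pathogen = sum(gene in genome for genome in pathogens) / len(pathogens)
    freq_nonpathogen = sum(gene in genome for genome in nonpathogens) / len(nonpathogens)
    return freq_pathogen - freq_nonpathogen


def rank_candidates(pathogens, nonpathogens):
    """Return (gene, score) pairs, best candidates first; ties broken by name."""
    all_genes = set().union(*pathogens, *nonpathogens)
    scored = [(g, presence_absence_score(g, pathogens, nonpathogens)) for g in all_genes]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical gene content of three pathogenic and two non-pathogenic genomes.
pathogens = [{"geneA", "geneB", "geneC"}, {"geneA", "geneB"}, {"geneA", "geneC", "geneD"}]
nonpathogens = [{"geneB", "geneD"}, {"geneB"}]

for gene, score in rank_candidates(pathogens, nonpathogens):
    print(gene, round(score, 3))
```

Here geneA ranks first because it is carried by every pathogenic genome and by no non-pathogenic genome; a real analysis would also need to account for sampling error with so few genomes.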
1.4 Antibiotics and their optimal use
Antibiotics
The discovery of the antibacterial action of the Penicillium fungus by Fleming revolutionised the treatment of bacterial infections [34]. Antibiotics (antibacterial drugs used for the treatment of infectious diseases) selectively target pathogenic bacteria either by directly damaging bacterial cells (bactericidal) or by inhibiting their multiplication (bacteriostatic). Advances in bacteriology have led to the development of antibacterials targeting different components of bacterial cells, including inhibiting cell wall synthesis (β-lactams, vancomycin) or protein synthesis (macrolides, tetracycline, aminoglycosides, linezolid), interfering with metabolism (sulphonamides and trimethoprim), and inhibiting nucleic acid synthesis and replication (metronidazole, rifampin, and quinolones) [35].
Antibiotic resistance
Despite the success of antibiotics in treating bacterial infections, it was realised as early as 1948 that overuse of penicillin could lead to a phenomenon, known as drug indifference, allowing bacteria to survive under constant antibiotic treatments [36].
The decrease in antibiotic sensitivity is a significant clinical problem as it limits the choices for clinicians in managing infectious diseases effectively. Bacteria possess an array of versatile mechanisms to combat antibiotics, including efflux of antibiotic molecules (macrolide and tetracycline efflux pumps), inactivation of drug molecules (β-lactamase), and modification of enzymatic drug targets (altered penicillin-binding proteins) or genes responsible for replication, transcription and translation (e.g. quinolone, linezolid, sulphonamide, trimethoprim, and rifampin resistance) [35].
Resistance to antibiotics plays an important part in pathogen colonisation [37]. Administration of antibiotics exerts a positive selection pressure on heterogeneous bacterial populations and favours the survival of resistance phenotypes, which can then undergo clonal expansion and dissemination, leading to the expansion of a resistant bacterial population [37, 38]. The emergence of antibiotic resistance in vivo is evident in a recent clinical trial, where resistance of oral streptococcal flora to macrolides was demonstrated to be inducible by only a short-term administration of azithromycin and clarithromycin [39].
Mechanisms for acquiring antibiotic resistance
The genes encoding resistance phenotypes can be acquired either through de novo mutations or horizontal gene transfer (HGT). HGT allows bacteria to share their resistance genes through mechanisms of DNA uptake (transformation), bacteriophage transfection (transduction), or sharing via transposon-plasmid mechanism (conjugation) [38]. HGT facilitates the spread of resistant phenotypes among pathogen strains or across bacterial species. For example, genes contributing to tetracycline resistance (tet) are almost invariably associated with mobile genetic elements in the form of plasmids, conjugative transposons, or gene cassettes (integrons), allowing their spread between different pathogen strains and species [40].
Optimal prescribing of empirical antibiotic therapies
Empirical antimicrobial therapy is the “blind” treatment of infectious diseases with potentially virulent pathogens in mind. An empirical regimen is a “cover-all” approach to infectious disease management when pathogen virulence is unknown. The imprecision of such antibiotic use poses a risk of accelerated emergence of antibiotic resistance.
To minimise this risk of emerging resistance phenotypes in bacteria, antibiotic guidelines frequently advocate “prudent prescribing”. However, such “prudence” cannot be efficiently achieved without precise identification of invasive bacterial types. Optimal antibiotic treatment thus requires accurate differentiation of virulent pathogens from their non-virulent counterparts.
Translating bacterial genetics to optimise antibiotic prescribing
The study of bacterial genetics may help us in identifying pathogens in clinical settings. There are two main types of methods that can be applied to study the association between bacterial genetics and clinical outcomes. Descriptive studies with molecular markers investigate the specific distribution of bacterial strains in host populations. This type of study is useful in identifying pathogen transmission patterns through the tracking of molecular patterns for outbreak detection [15, 19].
As an alternative to descriptive studies, predictive studies aim to build models using multiple biological markers to forecast clinical outcomes [41,42]. Predicting clinical outcomes of infectious diseases with pathogen characteristics, in the form of “pathogen profiles” comprising genomic, transcriptomic, proteomic, and metabolomic data, may contribute to early diagnosis and thus to better risk stratification and improved antibiotic therapy [43].
1.5 Virulence gene discovery
1.5.1 Classical approach
As illustrated in Figure 1.4, searching for virulence genes in bacteria requires a two-stage experimental approach. The first stage involves the confirmation of the biological role of a candidate gene by inactivation, followed by its isolation from invasive isolates and insertion into a non-virulent strain to reproduce virulent behaviour [44]. Mutants with inserted virulence genes then demonstrate their pathogenicity in an animal model [44]. The second stage involves methods such as a transgenic animal model (for example, a knockout mouse) or positional cloning to investigate the host response to the virulence gene [32]. Overall, such experimental procedures are time-consuming, requiring years of laboratory work.
The use of comparative genomics with computational analysis or other high-throughput methods is gradually gaining wider acceptance and replacing the traditional approach.
Figure 1.4: The classical approach of virulence gene discovery. The traditional method of virulence gene discovery necessitates the demonstration of the gene’s pathogenicity by gene-knockout studies. The pathogenic mechanism of the virulence gene is subsequently investigated by using a transgenic animal model.
1.5.2 Bacterial comparative genomics
Aspects of genomes that can be compared
Many aspects of bacterial genomes can be compared. Crude comparisons of bacterial genomes, as summarised by Binnewies et al., may involve comparison of chromosome alignment, genome size, guanine-cytosine (GC) content, tRNA and codon usage, analysis of transcription factors, and BLAST atlas and matrices [45].
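Two of the crude measures listed above, genome size and GC content, can be computed directly from a nucleotide sequence. The sketch below is illustrative only: the function name is hypothetical, the sequences are toy strings rather than real genomes, and ambiguous bases such as N are simply excluded from the denominator, which is one of several possible conventions.

```python
def gc_content(sequence):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    if unambiguous == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return (seq.count("G") + seq.count("C")) / unambiguous


# Toy sequences standing in for two genomes (hypothetical data).
genomes = {
    "genome_1": "ATGCGCGATA",
    "genome_2": "ATATATGCAT",
}

for name, seq in genomes.items():
    print(name, "size:", len(seq), "GC content:", gc_content(seq))
```

For real data the sequences would be read from FASTA files, and the per-genome values compared across pathogenic and non-pathogenic strains.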
In-depth analysis with comparative genomics aims to ask the following fundamental questions, which are important in understanding microbial evolution and assisting in the identification of gene functions [46]:
1. Which genes are unique to a bacterial species or genome? (Unique genes)
2. Which genes are required for a normally functioning bacterium? (Core genes)
3. Which genes undergo selection, thus conferring a survival advantage to a bacterium in the host environment?
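The first two questions above reduce to set operations once each genome is represented as a set of gene identifiers. The sketch below makes that assumption and uses hypothetical strain and gene names; the third question (selection) needs sequence-level analysis and is not covered here.

```python
def core_genes(genomes):
    """Genes present in every genome (set intersection)."""
    if not genomes:
        raise ValueError("at least one genome is required")
    return set.intersection(*genomes.values())


def unique_genes(genomes, name):
    """Genes found in the named genome and in no other genome."""
    others = set().union(*(genes for other, genes in genomes.items() if other != name))
    return genomes[name] - others


# Hypothetical gene content of three strains.
genomes = {
    "strain_1": {"gene_1", "gene_2", "gene_3", "gene_4"},
    "strain_2": {"gene_1", "gene_2", "gene_3"},
    "strain_3": {"gene_1", "gene_2", "gene_5"},
}

print("core:", sorted(core_genes(genomes)))
for strain in genomes:
    print(strain, "unique:", sorted(unique_genes(genomes, strain)))
```

In practice, deciding whether two genes in different genomes count as the same gene requires an orthology criterion (for example a sequence similarity threshold), which this sketch takes as given.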
Unique genes
The central dogma of molecular biology is the unidirectional flow of information from DNA to protein to phenotype. As stated by Falkow’s molecular Koch’s postulates, genes encoding a particular function are expected to be present in the strains displaying the phenotype and absent in the genomes that do not. In virulence gene research, comparative genomics can reveal different gene compositions between pathogenic and non-pathogenic organisms, and has led to the identification of novel diagnostic markers and vaccine targets in several pathogenic bacterial species including E. coli, S. aureus, and Helicobacter pylori [44, 47].
Core genes
Genes required for the basic functioning of a bacterium are usually well-conserved across a large number of species. These genes, or core set of genes, can be identified by comparing multiple genomes. For example, Arigoni et al. compared an Escherichia coli genome with a Mycoplasma genitalium genome to reveal essential genes that are needed for bacterial survival [48].
Genes under natural selection pressure
Virulence genes are constantly under natural selection pressure in hostile environments. Comparative genomics can be used to identify genes undergoing positive selection. Chen et al. compared uropathogenic E. coli with other E. coli strains by phylogenetic analysis, identified positively-selected genes which may contribute to urinary tract infection, and subsequently validated these in a case-control study [49].
1.5.3 Tools for comparing different bacterial genomes
DNA microarrays
The use of DNA microarrays to search for virulence genes allows population-based comparison of gene composition between pathogenic and non-pathogenic genomes [50]. Earlier studies with DNA microarrays included H. pylori and N. meningitidis. Analysis of H. pylori has identified a pathogenicity island (cag gene cluster) associated with increased virulence, and several candidate virulence genes were also suggested [51]. DNA microarrays have also been applied to Neisseria spp. to identify specific genes of N. meningitidis and have identified several potential genes on laterally-acquired pathogenicity islands associated with virulence [52, 53].
Genome sequence comparison
The increasing number of publicly available whole-genome sequences should now allow in-depth analysis of pathogen biology. Genome sequences can help both gene screening and DNA microarray construction. Genome sequence analysis has several advantages over DNA microarrays in virulence gene discovery:
1. DNA microarrays are unable to detect unknown or uncharacterised genes.
2. The intergenic regions of bacterial chromosome are typically not printed on DNA microarrays and so these areas are usually missed.
3. The highly specific process of microarray hybridisation means that unknown mutations are less likely to be detected.
In S. agalactiae, genome sequence comparison has led to the discovery and rediscovery of potential virulence candidates, including the Sip protein, CAMP factor, hyaluronidase, and β-haemolysin [54].
Reverse vaccinology is a novel application of comparative genomics to identify likely vaccine candidates in bacteria. Classical “forward” vaccine development involves using a live-attenuated pathogen to test for immunogenicity, or using protein screening to identify immunogenic bacterial components. The process of vaccine development may take decades of lengthy research. With reverse vaccinology, vaccine candidates are first screened by in silico bioinformatic analysis, followed by concurrent immunogenicity testing using parallel recombinant DNA expression techniques [55]. This approach drastically shortened the vaccine development time to 1–2 years [56], and has successfully been applied to develop a vaccine for N. meningitidis serogroup B infections [57].
Box 1.1: Reverse vaccinology – an example of in silico comparative genomics
1.6 Candidate gene prioritisation (CGP)
Biological validation with wet-lab experiments is the rate-determining step in virulence gene discovery. Thus, each incorrect candidate gene adds time to the discovery cycle. An improvement in gene selection can thus accelerate the rate of discovery.
One approach to improve the “hit-rate” of gene search is to rank the candidate genes according to an objective likelihood measure, thus increasing the chance of finding “correct” genes with the least number of trials. Several in silico CGP methods have been reported for ranking human genes in the search for genes underlying heritable diseases. CGP methods will be reviewed in detail in Chapter 4.
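The benefit of ranking can be illustrated by counting how many wet-lab trials are needed before the first true gene is reached. The sketch below compares a hypothetical ranked list with the expected outcome of testing candidates in random order; for N candidates containing K true genes, the expected position of the first true gene in a random ordering is (N + 1)/(K + 1). The function names and gene labels are assumptions made for illustration.

```python
def trials_until_first_hit(ranked_genes, true_genes):
    """Number of candidates tested, in rank order, until the first true gene.

    Returns None when the ranked list contains no true gene.
    """
    for trials, gene in enumerate(ranked_genes, start=1):
        if gene in true_genes:
            return trials
    return None


def expected_trials_random(n_candidates, n_true):
    """Expected number of trials to the first true gene under random ordering."""
    if n_true < 1 or n_true > n_candidates:
        raise ValueError("n_true must be between 1 and n_candidates")
    return (n_candidates + 1) / (n_true + 1)


# Hypothetical ranked candidate list with two true virulence genes.
ranked = ["g7", "g2", "g9", "g1", "g5", "g3", "g8", "g4", "g6"]
true_genes = {"g2", "g4"}

print("ranked order:", trials_until_first_hit(ranked, true_genes))
print("random order (expected):", round(expected_trials_random(len(ranked), len(true_genes)), 2))
```

In this toy case the ranked list reaches a true gene on the second trial, against an expectation of about 3.33 trials for an unranked search.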
1.7 Guide to thesis
This thesis explores several research questions around in silico virulence prediction and virulence gene discovery in three parts:
Part I examines the clinical and molecular epidemiology of a perinatal pathogen, group B streptococcus (GBS), with an aim to translate bacterial genetic data into useful clinical information by performing descriptive and predictive analyses.
Chapter 2 reviews the clinical manifestations, epidemiology, and methods of typing of GBS.
Chapter 3 contains a descriptive study of 912 GBS clinical isolates using 18 distinct molecular markers to investigate potential markers associated with GBS virulence. The second part of this chapter investigates the hypothesis that GBS virulence can be predicted by using molecular markers with supervised machine learning models.
Part II of the thesis develops and evaluates two CGP methods, based on phylogenetic profiles, to prioritise bacterial genes associated with particular phenotypic traits. The purpose of developing such bioinformatic methods is to assist with the labour-intensive task of virulence gene discovery in bacterial pathogens.
Chapter 4 reports a literature review of methods of in silico CGP that are currently used for discovering genes related to heritable diseases in humans.
Chapter 5 develops two methods for prioritising bacterial genes in functional discovery. Statistical CGP extends the concept of Falkow’s molecular Koch’s postulates by ranking candidate genes based on their frequency of occurrence in genomes of known phenotypes. The second method, inductive CGP, aims to find genes of similar function by applying supervised machine learning to patterns of gene occurrence across multiple genomes.
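The idea behind inductive CGP, finding genes whose occurrence pattern across genomes resembles that of genes with a known function, can be sketched with a very simple similarity ranking. The example below is not the method of Chapter 5: a nearest-neighbour score based on Jaccard similarity stands in for the supervised learner, and all profile data and names are hypothetical. Each phylogenetic profile is a tuple of 0/1 values, one per genome.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity between two presence/absence profiles."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


def rank_by_profile_similarity(profiles, known_genes):
    """Rank unlabelled genes by their best similarity to any known gene."""
    scores = {}
    for gene, profile in profiles.items():
        if gene in known_genes:
            continue
        scores[gene] = max(jaccard(profile, profiles[k]) for k in known_genes)
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical profiles across five genomes (1 = gene present).
profiles = {
    "known_1": (1, 1, 0, 1, 0),
    "known_2": (1, 1, 0, 0, 0),
    "cand_a": (1, 1, 0, 1, 0),
    "cand_b": (0, 0, 1, 0, 1),
    "cand_c": (1, 0, 0, 1, 1),
}
known = {"known_1", "known_2"}

for gene, score in rank_by_profile_similarity(profiles, known):
    print(gene, score)
```

A candidate that co-occurs with the known genes in the same genomes (cand_a) ranks first, while one with a complementary pattern (cand_b) ranks last.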
Chapter 6 benchmarks both CGP methods in three systematic evaluations: the rediscovery of peptidoglycan-related genes, of genes responsible for anaerobic mixed-acid fermentation, and of genes in the pathways curated in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database.
Part III of this thesis applies both CGP methods to find unknown virulence genes in S. agalactiae.
Chapter 7 reviews the currently known virulence genes in GBS, in preparation for discovering genes with inductive CGP in Chapter 9.
Chapters 8 and 9 perform CGP experiments to discover GBS virulence factors yet to be characterised.
Chapter 10 discusses the biological significance of the prioritised genes in Chapters 8 and 9.
Part I
Virulence prediction in Streptococcus agalactiae
Chapter 2
Group B streptococcal diseases and typing methods
2.1 Introduction
Streptococcus agalactiae, or group B streptococcus (GBS), is a Gram-positive β-haemolytic bacterium which emerged as an important human pathogen in the early 1960s [58,59]. GBS is normally part of the microflora in the human female urogenital tract, with a sporadic ability to cause serious infection around the pregnancy period. In recent years, GBS infections have also been increasingly reported in non-pregnant adults with underlying comorbidities, such as in immunocompromised individuals or in patients with diabetes mellitus or neoplastic disorders [60].
The significant morbidity associated with neonatal GBS infections poses a challenge for obstetricians and paediatricians. The manifestations of GBS neonatal infections may include life-threatening pneumonia, meningitis, or sepsis, a range of serious clinical consequences that prompted scientific investigations into effective public health measures to reduce its incidence. The institution of prophylactic antibiotic therapy has been effective against an important subgroup of neonatal infections, known as early-onset disease (EOD). However, the other subgroup of GBS neonatal disease, late-onset disease (LOD), has not been reduced by the preventative measures.
Table 2.1: Clinical manifestations of perinatal GBS diseases
Neonatal diseases                      Maternal diseases
Sepsis                                 Urinary tract infection
Pneumonia                              Chorioamnionitis
Acute respiratory distress syndrome    Endometritis
Meningitis                             Bacteraemia
Septic arthritis
Osteomyelitis
Cellulitis
As discussed in Chapter 1, improvements in microbial typing technology have enabled the tracking of bacterial subpopulations and identification of individual strains of bacteria. Evidence suggests that some virulent GBS strains are associated with certain typing markers. The later part of this chapter will discuss different typing methods used in distinguishing different GBS strains. With our ability to track GBS strains improving, it is anticipated that these typing methods can assist with risk stratification in clinical settings. In Chapter 3, methods for building predictive models with molecular markers will be explored in more detail.
2.2 Clinical manifestations of GBS diseases
Most GBS strains isolated from humans can be found colonising the female urogenital tract. However, a small proportion of GBS can display virulent behaviour and cause a spectrum of diseases in pregnant women or in neonates.
2.2.1 Maternal carriage and disease
GBS is found to colonise 15–35% of women during the normal course of pregnancy [61, 62]. Although most individuals colonised with GBS are clinically asymptomatic, up to 1% of these pregnant women can develop clinical or sub-clinical presentation of bacteraemia with leukocytosis, bacteriuria, concurrent urinary tract infection, chorioamnionitis, and endometritis [63]. The development of signs and symptoms of maternal infection is closely associated with, although not a prerequisite for, early-onset neonatal infection (Table 2.2) and stillbirth [64].
2.2.2 Early-onset neonatal disease
EOD is defined as the clinical manifestation of bacterial infection within the first seven days of life. Data from 1970–1990 indicated that the incidence of EOD caused by GBS was up to 2.09 cases per 1000 live births [65]. EOD constituted up to 73% of all neonatal GBS diseases [66]. The pathogenesis of EOD caused by GBS is believed to be via vertical transmission (from mother to baby). The development of chorioamnionitis in pregnancy indicates an early infectious process, which leads to ascending spread of bacteria from the female genital tract to uterus and causes uteritis [67]. Foetal contact with GBS is believed to be caused by subsequent aspiration of contaminated fluids during delivery [67].
The mortality of EOD caused by GBS is considerably higher in premature infants than in term infants. The overall mortality of EOD in term infants was found to be up to 7.4%, whereas mortality in infants born before 37 weeks of gestation was 19–25% [68].
Antibiotic prophylaxis in preventing GBS EOD
A risk-based prophylactic strategy has been used to treat the high-risk pregnancy group (pregnancies with clinical risk factors) with intrapartum penicillin, and evidence has demonstrated that such strategies have effectively reduced the incidence of GBS EOD by up to 60–75% [71–74]. Several surveys conducted in 1990–2003 suggested that the incidence of EOD caused by GBS was reduced to 0.4–0.8 per 1000 live births in developed countries [68, 74, 75].
Table 2.2: Maternal and obstetric risk factors associated with early-onset GBS diseases [61]
Risk factors                                               Ref.
Chorioamnionitis                                           [69]
Maternal GBS bacteriuria or urinary tract infection        [69]
Heavy maternal colonisation                                [61]
Fever (> 38.0 °C)                                          [61]
Pre-term labour (< 37 weeks)                               [61]
Premature rupture of membrane (PROM)                       [61]
Prolonged PROM (> 12 hours)                                [69, 70]
Previous stillbirth                                        [64]
Low birth weight                                           [70]
Prolonged obstetric monitoring or vaginal examinations     [61]
Gestational diabetes                                       [68]
Twin with early-onset GBS disease                          [69]
A screening-based strategy has been proposed as an alternative approach to treating high-risk groups [76], based on evidence that heavy maternal GBS colonisation is strongly associated with EOD. The strategy recommends that intrapartum antibiotics should be given if GBS is present in cultures at the late stage of pregnancy (35–37 weeks) [77]. Studies have found that universal screening prevents more cases of EOD than the risk-based method [78]. The benefits of antenatal screening were demonstrated in several population cohorts, in which the incidence of EOD caused by GBS was further reduced to 0.34 per 1000 live births in the United States [79, 80]. On this basis, the screening-based approach was recommended in the revised guidelines published by the Centers for Disease Control and Prevention of the United States [77].
Potential concerns over the emergence of antibiotic resistance
Current GBS guidelines recommend a risk-based or screening-based approach in preventing EOD in neonates. The high carriage rate suggests that these approaches grossly over-treat many GBS types that are part of the commensal microflora. As discussed in Chapter 1, antibiotic resistance is a potential concern with any antimicrobial therapy. Although there is currently no evidence to suggest an increase in antibiotic resistance among GBS and non-GBS organisms after the introduction of chemoprophylactic regimes [81, 82], there remains a theoretical threat of the emergence of resistant pathogens due to the high prevalence of GBS carriage among pregnant women. In addition, indiscriminate exposure of mother and baby to penicillin may also increase the risk of replacing penicillin-sensitive anaerobes with penicillin-resistant counterparts. Infants who are exposed to penicillin during delivery may experience delayed or distorted colonisation by normal flora.
To reduce the over-treatment of GBS carriers, a more targeted approach of antibiotic prescribing than the empirical recommendations is needed. Accurate classification of invasive GBS strains is essential in the identification of high-risk pregnancies. At present, there has been only limited work attempting to predict GBS virulence. The next chapter (Chapter 3) will investigate prediction of GBS virulence using GBS genetic markers and supervised machine learning.
2.2.3 Late-onset disease
A distinct group of GBS infections in neonates, late-onset disease (LOD), occurs between 7 and 30 days of life. LOD was reported to have an incidence of 0.2 to 0.5 per 1000 live births and carried a mortality of up to 6% [61,80]. As with EOD, major risk factors associated with GBS LOD include prematurity and heavy maternal colonisation as evident in antenatal cultures [83]. Although the exact pathogenesis of LOD remains elusive, a nosocomial mode of transmission has been suggested, as there were case reports of GBS dissemination within neonatal intensive care units (NICU) [67]. Infants with LOD present with bacteraemia, meningitis, osteoarthritis, and cellulitis, in contrast to EOD where pneumonia is the most common presentation [84].
Antibiotic prophylaxis for EOD has not affected the overall incidence of LOD caused by either GBS or other non-GBS microorganisms [85], an observation that further supports the hypothesis that LOD has a different aetiology from EOD [73].
2.3 Typing of GBS strains
2.3.1 Serotyping
Methods of GBS typing allow GBS populations to be grouped and tracked. Typing of GBS strains is traditionally performed by immunophenotypical methods. Recent DNA-based typing methods allow more accurate and rapid detection.
Group B antigen
All GBS strains possess the Lancefield group B antigen in the classification of β-haemolytic streptococci [86]. The group B specific antigen is a complex polysaccharide made up of four distinct oligosaccharides containing L-rhamnose, D-galactose, D-glucitol, and N-glucosamine forming a multiantennary structure [87]. The group B antigen, however, does not confer protective immunity in human and animal models, nor is it associated with the invasiveness of GBS [84].
Polysaccharide capsule
The serological sub-classification of group B haemolytic streptococci was originally performed by Lancefield in 1934 [86]. Nevertheless, the long duration and poor sensitivity of capillary precipitation prompted the development of alternative methods to improve efficiency and accuracy, including latex agglutination [88, 89], microimmunodiffusion [90], co-agglutination patterns [91], and inhibition enzyme-linked immunosorbent assay [92]. To date, nine distinct serotypes have been described (Ia, Ib, II to VIII) and a further serotype has been proposed [93]. Variations in serotype originate from the genetic polymorphism of the cps gene cluster [94]. Genetics of the cps operon will be reviewed in detail in Chapter 7.4.3.
Distribution of GBS serotypes in clinical populations
Serotypes Ia (16–32%) and III (20–59%) are the predominant serotypes in EOD [61, 62, 66, 75, 80]. Between 50% and 71% of LOD isolates belong to serotype III [80, 95], which is particularly associated with meningitis [75, 95]. Serotype V is the most common cause of infections in non-pregnant adults [80, 96, 97], although this serotype is also responsible for up to 23% of neonatal invasive infections [75]. Serotypes VIII (25%) and VI (36%) are the most common serotypes colonising healthy Japanese women [98].
2.3.2 Molecular characterisation
Characterising GBS strains by serotyping can be subjective and up to 20% of isolates are non-typeable [99, 100]. Reasons for failure of serotyping methods include non-expression of polysaccharide capsule or inadequacy of bacterial count [15]. Such poor discrimination has prompted the development of gene-based typing methods to improve the efficiency and accuracy in characterising GBS isolates [101].
Genotyping of the cps gene cluster
The GBS capsular exopolysaccharide gene cluster (cps cluster) is responsible for the serotype diversity of GBS (see Chapter 7.4.3). Several DNA-based methods have been proposed to characterise GBS genotypes in concordance with serotypes, including the analysis of restriction fragment length polymorphism (RFLP) patterns [102, 103] and DNA microarrays [104]. Kong et al. developed a multiplex PCR-based reverse line blot (mPCR/RLB) method, based on the mapping of sequence polymorphism in the cpsE–G region to individual serotype groups (molecular serotyping, MS) [105, 106]. MS is included in the panel of markers for the analysis of GBS virulence in the next chapter.
Multilocus sequence typing
Multilocus sequence typing (MLST) distinguishes different clonal lineages of bacteria by examining the genetic polymorphism of housekeeping genes. Jones et al. developed a GBS MLST system consisting of seven housekeeping genes [107]. The MLST system was used to characterise 158 clinical GBS isolates, in which two clonal complexes (CC), CC17 and CC10, were found to cosegregate with the mobile genetic elements GBSi1 and IS1548 respectively [108].
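The idea behind MLST can be illustrated with a minimal sketch: each isolate is reduced to a profile of allele numbers, one per housekeeping locus, and an exact match of the whole profile against a database gives the sequence type. The locus names, allele numbers, and ST labels below are hypothetical placeholders and are not taken from the GBS scheme of Jones et al.; the helper `shared_loci` only hints at how related profiles might be grouped, which is an assumption rather than the method used in the cited studies.

```python
from typing import Dict, Optional, Tuple

# Hypothetical locus names; a real scheme names its seven housekeeping genes.
LOCI: Tuple[str, ...] = tuple(f"locus{i}" for i in range(1, 8))

# Hypothetical allele profiles and sequence-type labels, for illustration only.
ST_DATABASE: Dict[Tuple[int, ...], str] = {
    (1, 1, 1, 1, 1, 1, 1): "ST-A",
    (2, 1, 1, 2, 1, 1, 1): "ST-B",
    (2, 1, 1, 2, 1, 1, 3): "ST-C",
}


def assign_sequence_type(alleles: Dict[str, int]) -> Optional[str]:
    """Return the ST whose allele profile matches exactly, or None if unknown."""
    profile = tuple(alleles[locus] for locus in LOCI)
    return ST_DATABASE.get(profile)


def shared_loci(p: Tuple[int, ...], q: Tuple[int, ...]) -> int:
    """Count the loci at which two allele profiles carry the same allele."""
    return sum(a == b for a, b in zip(p, q))


# A hypothetical isolate typed at all seven loci.
isolate = dict(zip(LOCI, (2, 1, 1, 2, 1, 1, 1)))
isolate_st = assign_sequence_type(isolate)
print(isolate_st)
```

Profiles that agree at most loci (for example, "ST-B" and "ST-C" above share six of seven) would plausibly fall in the same clonal complex under a single-locus-variant convention, although the exact grouping rule is left open here.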
Single nucleotide polymorphisms
Single nucleotide polymorphisms (SNPs) have also been used to discriminate different clonal complexes of S. agalactiae. Honsa et al. described a five-SNP typing method, together with the Not-N bioinformatic algorithm, capable of subclassifying GBS isolates into MLST-assigned clonal complexes, including the clinically significant strain CC-17 [109, 110].
Surface proteins C
The surface proteins Cα and Cβ were among the first antigenic proteins characterised in GBS. Cα is a protein invasin (a protein that facilitates invasion into the host) of which there are several allelic variants (bca, alp1–5, and rib), which can be detected by PCR-based methods to achieve surrogate serotyping [111]. The presence of the rib gene is associated with multilocus sequence typing (MLST) sequence types (ST) ST-17 and ST-19 [112, 113], whereas ST-22 was found to be associated with bca [113]. Multiplex PCR-based characterisation of the second C protein antigen, Cβ (bac, Chapter 7.4.1), has also been reported [114]. Invasiveness was found to be associated with shorter tandem repeats in serotype Ib [114].
Mobile genetic elements
Mobile genetic elements (MGE) play important roles in horizontal gene transfer and alteration of virulence in pathogenic bacteria. Several MGEs are well characterised in GBS. IS861 (1,442 bp), an insertion sequence sharing homology with IS3 and IS150 of S. pneumoniae, was reported to be present in the hypervirulent strain COH-1 (serotype III) [115]. ISSa4 (963 bp) was found in some isolates to be inserted into cylB (β-haemolysin/cytolysin gene cluster, Chapter 7.3.2), producing non-haemolytic GBS strains [116]. The presence of IS1548 (1,317 bp) has been reported in some strains, and insertion of IS1548 in the hylB gene causes the inactivation of streptococcal hyaluronan lyase [117]. The group II intron GBSi1 was found to be inserted between the C5a peptidase gene (scpB) and the laminin-binding protein gene (lmb) in GBS strains not containing IS1548 [118]. The presence of GBSi1 is associated with a higher proportion of meningitis type III isolates [119]. Two copies of ISSag2 flank a composite transposon containing the scpB–lmb gene region, which is present in nearly all human GBS isolates [120]. In addition to the association of MGEs with virulence, the acquisition of new mobile elements can be used to track the evolutionary lineages of GBS isolates when compared with MLST studies [121], making MGEs candidates for molecular classification.
The distribution of the MGEs in Australasian GBS strains is: IS1381 (87%), IS861 (52%), IS1548 (17%), ISSa4 (6%), and GBSi1 (18%) [122].
Antibiotic resistance genes
The phenotypic traits of antibiotic resistance may assist in pathogen colonisation and are important factors associated with virulence. In particular, genes encoding mechanisms of antibiotic resistance are frequently associated with horizontal gene transfer. For example, the conjugative transposon Tn916 is found in several streptococcal species and is associated with the horizontal transfer of antibiotic resistance mediated by the tetracycline resistance gene tetM [123]. Zeng et al. described a multiplex PCR-based reverse line blot method for detecting a panel of antibiotic resistance genes simultaneously [124]. The gene associated with tetracycline resistance (tetM, 77–82%) and the integrase of Tn916 (int-Tn, 57–84%) were the most prevalent markers in the 512 Australasian isolates studied [124].
2.3.3 Genome sequences
The genome sequences of several virulent strains of GBS have been determined. In 2002, Glaser et al. and Tettelin et al. sequenced the genomes of an invasive serotype III strain (NEM316, MLST ST-23) and a serotype V strain (2603 V/R), respectively [54, 125]. The analysis of the NEM316 genome revealed 14 pathogenicity islands containing surface proteins [125]. The whole-genome sequences of six further pathogenic GBS reference strains (A909, H36B, 18RS21, 515, COH1, and CJB111) were determined in 2005 [126]. Comparative genomic analysis of the eight GBS genomes suggested that the number of dispensable genes in the GBS “pan-genome” is potentially unlimited, a phenomenon that greatly contributes to the genetic diversity of S. agalactiae [126].
The later part of this thesis analyses whole-genome sequences to search for currently uncharacterised virulence genes. At the time the experiments were conducted (April 2007), three GBS whole-genome sequences (NEM316, 2603V/R, A909) were available from the National Center for Biotechnology Information (NCBI) database.
Chapter 3
Classification of GBS virulence with bacterial genetic markers
3.1 Introduction
As discussed in Chapter 2, the balance between benefit and cost needs to be considered in any practice of antimicrobial chemotherapy. Figure 3.1 illustrates that a more accurate identification of invasive microorganisms among the colonising microflora should lead to clinical benefits, as the potential emergence of antibiotic resistance will be minimised by more precise targeting of pathogenic strains of S. agalactiae.
Several molecular methods have been described to characterise the genetics of invasive GBS strains [97, 127–129]. Despite advances in GBS genotyping methods, it remains largely unknown whether molecular markers can be applied to predict the spectrum of GBS diseases. The hypothesis examined here is that clinical outcomes can be accurately predicted using bacterial genetic markers. This chapter explores the predictive relationships between GBS genotypes and virulence. Both statistical and supervised machine learning methods are applied to test this hypothesis.
[Figure 3.1: flow diagram. Bacterial isolates are genotyped and the bacterial genotype is used for virulence prediction. Isolates predicted to be non-invasive (commensals, low-risk) receive no therapy; isolates predicted to be invasive (pathogens, high-risk) receive antimicrobial therapy or prophylaxis.]
Figure 3.1: The goals of virulence classification. Accurate identification of invasive pathogens can improve on empirical therapy, reducing imprecise antibiotic usage and thereby the risk of associated complications.
The results in this chapter have been reported in Pathology [130].
3.2 Descriptive analysis of GBS markers
3.2.1 Material and Methods
GBS Isolates
Nine hundred and twelve human GBS isolates from routine laboratory cultures were obtained from Australia (n=331), New Zealand (n=448), North America (n=58), Germany (n=10), and East Asia (n=65) between 1991 and 2005. GBS isolates were collected by collaborators at the Centre for Infectious Diseases and Microbiology, Sydney West Area Health Service, Sydney, Australia. Isolates with unknown sites of isolation were excluded from the analysis. The samples were grouped into two clinical categories (invasive versus colonising). In the invasive group, 780 isolates were acquired from sterile sites including cerebrospinal fluid, blood and joint fluid aspirates (from patients of any age). The colonising group consisted of 132 isolates obtained from vaginal swabs collected from routine antenatal screening according to the screening-based protocol. Characteristics of the two categories of subjects from whom the isolates were obtained are shown in Table 3.1.

Table 3.1: Characteristics of 912 group B streptococcus isolates

Isolate characteristics                         Invasive     Colonising
                                                (n = 780)    (n = 132)
Age group
  Neonate, age <7 days                          182 (23%)    0 (0%)
  Neonate, age 7–90 days                        105 (13%)    0 (0%)
  Adult                                         421 (54%)    132 (100%)
  Unknown or not specified                      72 (9%)      0 (0%)
Gender
  Female                                        351 (45%)    132 (100%)
  Male                                          278 (36%)    0 (0%)
  Unknown or not specified                      12 (2%)      0 (0%)
Site of isolation
  Vaginal swabs from routine gestational
    screening at 37 weeks                       0 (0%)       132 (100%)
  Blood                                         680 (87%)    0 (0%)
  Cerebrospinal fluid                           42 (5%)      0 (0%)
  Joint aspirate                                7 (1%)       0 (0%)
  Other sterile sites                           51 (7%)      0 (0%)
Geographical origin
  Australia                                     230 (29%)    101 (76%)
  Germany                                       10 (1%)      0 (0%)
  Hong Kong                                     54 (7%)      0 (0%)
  Korea                                         11 (1%)      0 (0%)
  New Zealand                                   447 (57%)    1 (1%)
  North America                                 28 (4%)      30 (23%)

[Figure 3.2: tree diagram. GBS isolates (n = 912) are divided into invasive isolates (n = 780) and vaginal colonising isolates (n = 132), which are compared in Table 3.2. Invasive isolates are divided into adult invasive (n = 493) and neonatal invasive (n = 287) isolates; adult invasive isolates include women at childbearing age (n = 100), and neonatal invasive isolates are divided into early-onset (n = 182) and late-onset (n = 105) diseases. The subgroup comparisons (a)–(d) are reported in Tables 3.3 and 3.4.]
Figure 3.2: The important GBS clinical subgroups. Invasive isolates were compared with vaginal colonising isolates. Four clinical subgroup pairs were compared: (a) neonatal invasive versus vaginal colonising isolates; (b) early- versus late-onset neonatal diseases; (c) adult versus neonatal invasive diseases; (d) women at childbearing age versus vaginal colonising isolates.
In addition to the overall comparison of invasive and colonising isolates, four clinically important subgroup pairs were compared (Figure 3.2). Invasive neonatal GBS isolates (n=287) were compared with antenatal vaginal isolates (n=132). Adult invasive isolates (n=493) were compared with neonatal invasive isolates. Isolates from early-onset neonatal diseases (n=182, defined as infections occurring within the first 6 days of life) were compared with isolates from late-onset diseases (n=105, infections occurring after 7 days of life). Invasive isolates from women of childbearing age (WCBA, defined as women aged 15–45 years, n=100) were also compared with colonising isolates.
GBS molecular markers
A panel of GBS genotype markers from several virulence-associated genes, including markers for polysaccharide capsule genes (molecular serotype), variants of the surface antigen/invasin (Cα-like protein family or protein genetic profiles), mobile genetic elements, and antibiotic resistance-related genes, was selected for genotyping. Multiplex PCR-based reverse line blot (mPCR/RLB) assays were also performed by collaborators at the Centre for Infectious Diseases and Microbiology, Sydney West Area Health Service, Australia. The GBS markers studied in this chapter are listed below. These markers were previously described in detail in Chapter 2.3.2:
• molecular serotypes (MS: 9 types; Ia, Ib, II to VIII),
• molecular subtypes of MS-III (MSST: 4 types; III-1 to III-4),
• protein genetic profiles (PGP: 6 types; bca, alp1–3, rib, or none),
• 7 mobile genetic elements (presence or absence of insertion sequences IS1381, IS1548, IS861, ISSag1, ISSag2, ISSa4 and group II intron GBSi1), and
• 8 antibiotic resistance-related genes (presence or absence of the following binary markers: aad-6 and aph-3, aminoglycoside resistance genes; ermB and ermTR, ribosomal methylase genes; int-Tn, encoding the integrase of transposon Tn916; mefA/E, encoding macrolide efflux pumps; tetM and tetO, both associated with tetracycline resistance; and mreA, encoding a flavokinase).
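To make the marker panel concrete, the sketch below shows one way such a genotype could be turned into a numeric feature vector for statistical or machine learning analysis. The one-hot encoding of the multi-valued markers (MS, MSST, PGP) and the function names are assumptions made for illustration; the thesis does not state how the markers were encoded, and the resistance-gene list simply follows the gene names given in the list above. The example isolate is hypothetical.

```python
from typing import Iterable, List, Optional

# Marker names as listed in the text above.
MS_TYPES = ["Ia", "Ib", "II", "III", "IV", "V", "VI", "VII", "VIII"]
MSST_TYPES = ["III-1", "III-2", "III-3", "III-4"]
PGP_TYPES = ["bca", "alp1", "alp2", "alp3", "rib", "none"]
MGE_MARKERS = ["IS1381", "IS1548", "IS861", "ISSag1", "ISSag2", "ISSa4", "GBSi1"]
RESISTANCE_MARKERS = ["aad-6", "aph-3", "ermB", "ermTR", "int-Tn",
                      "mefA/E", "tetM", "tetO", "mreA"]


def one_hot(value: Optional[str], categories: List[str]) -> List[int]:
    """Encode a nominal value as 0/1 indicators (all zeros if value is None)."""
    return [int(value == c) for c in categories]


def encode_isolate(ms: str, msst: Optional[str], pgp: str,
                   present: Iterable[str]) -> List[int]:
    """Assumed encoding: one-hot MS, MSST and PGP, then presence/absence bits."""
    present_set = set(present)
    binary = [int(m in present_set) for m in MGE_MARKERS + RESISTANCE_MARKERS]
    return (one_hot(ms, MS_TYPES) + one_hot(msst, MSST_TYPES)
            + one_hot(pgp, PGP_TYPES) + binary)


# Hypothetical isolate: serotype III, subtype III-2, rib, with five binary markers.
example = encode_isolate("III", "III-2", "rib",
                         ["IS861", "GBSi1", "tetM", "int-Tn", "mreA"])
print(len(example), sum(example))
```

Toolkits that accept nominal attributes directly would not need the one-hot step; it is used here only so that every marker becomes a plain 0/1 column.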
Statistical analysis
The differences in binary marker distributions were analysed by Pearson’s Chi-square statistic, except for groups with fewer than 5 isolates, where Fisher’s exact test was used. For genotypes with more than 2 alleles or types (MS, MSST, and PGP), data were first analysed as n × 2 tables to derive the test statistic. Statistical significance was determined at the level of α = 0.01. The standard errors of the odds ratios were calculated by the method described by Bland and Altman [131].
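The calculations described above can be sketched for a single binary marker as follows. The example uses the alp3 counts reported in Table 3.2 (164 of 780 invasive and 11 of 132 colonising isolates). The confidence interval here uses the standard error of the log odds ratio, sqrt(1/a + 1/b + 1/c + 1/d), which is assumed to correspond to the cited method; it need not coincide with the interval printed in the tables, which may have been obtained differently (for example, by an exact method). The function names are illustrative.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(a: int, b: int, c: int, d: int, level: float = 0.99):
    """Odds ratio with a log-method confidence interval.

    a, b: marker present / absent in group 1 (e.g. invasive)
    c, d: marker present / absent in group 2 (e.g. colonising)
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 2.576 for a 99% interval
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


def pearson_chi2_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square (no continuity correction) and p-value for a 2 x 2 table."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# alp3: present in 164 of 780 invasive and 11 of 132 colonising isolates.
or_alp3, lo_alp3, hi_alp3 = odds_ratio_ci(164, 616, 11, 121)
chi2_alp3, p_alp3 = pearson_chi2_2x2(164, 616, 11, 121)
print(f"OR = {or_alp3:.2f}, chi-square = {chi2_alp3:.2f}, p = {p_alp3:.4f}")
```

For tables with small cell counts, Fisher’s exact test would replace the chi-square statistic, as the text describes; that case is not covered by this sketch.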
3.2.2 Results
[Figure 3.3: pie charts. (a) Molecular serotypes: Ia (201), Ib (111), II (76), III (317), IV (21), V (161), VI (16), VII (6), VIII (3). (b) Protein genetic profiles: rib (309), alp1 (219), alp3 (175), bca (169), alp2 (32), none (8).]
Figure 3.3: Distributions of MS and PGP in 912 GBS isolates. The numbers shown in parentheses are the numbers of isolates in each molecular serotype or protein genetic profile.
The overall distribution of molecular serotypes (MS) was Ia: 22.1%, Ib: 12.2%, II: 8.3%, III: 34.7%, IV: 2.3%, V: 17.7%, VI: 1.8%, VII: 0.7%, VIII: 0.3% (Figure 3.3(a)). The distribution of protein genetic profiles was rib: 33.9%, alp1: 24.0%, alp3: 19.2%, bca: 18.5%, alp2: 3.5% (Figure 3.3(b)). Eight isolates contained neither rib nor any Cα-like protein gene. The frequencies of the other genetic markers are shown in Table 3.2.
Of the 18 markers studied, only alp3 demonstrated an association with invasiveness, as shown by a significantly increased odds ratio (OR: 2.93, 99% CI: 1.29–7.90, p=0.0003). In contrast, MS III, rib, IS1548, IS861, and int-Tn were inversely associated with invasive isolates (also with the WCBA subgroup) when compared with antenatal vaginal isolates. Serotype III (OR: 2.71), III-2 (OR: 6.72), rib (OR: 2.45), GBSi1 (OR: 2.21), and IS861 (OR: 1.59) were associated with neonatal invasive disease when compared with adult infections, in which serotype V (OR: 0.31), alp3 (OR: 0.38) and IS1381 (OR: 0.40) predominated. Molecular serotype II was found to be associated with early-onset diseases when compared with late-onset diseases [132].
Table 3.2: Frequencies of GBS molecular markers in invasive versus colonising groups of isolates

Markers       Invasive    Colonising  OR     99% C.I.      p-value
              (n = 780)   (n = 132)
Molecular serotypes (MS, 9 types)                          < 0.01 ∗
Ia            179 (23%)   22 (17%)    1.49   (0.79–2.02)   0.11
Ib            98 (13%)    13 (10%)    1.31   (0.60–3.32)   0.47
II            63 (8%)     13 (10%)    0.80   (0.35–2.07)   0.50
III           252 (32%)   65 (49%)    0.49   (0.30–0.82)   < 0.01 ‡
  Serosubtypes of MS III (MSST, 4 types)                   0.05
  III-1       150 (19%)   48 (36%)    0.42   (0.24–0.72)   NS
  III-2       65 (8%)     15 (11%)    0.71   (0.32–1.72)   NS
  III-3       7 (1%)      0 (0%)                           NS
  III-4       30 (4%)     2 (2%)      2.60   (0.48–53.3)   NS
IV            21 (3%)     0 (0%)                           0.06
V             145 (19%)   16 (12%)    1.65   (0.81–3.78)   0.08
VI            15 (2%)     1 (1%)      2.57   (0.27–548.)   0.49
VII           4 (1%)      2 (2%)      0.34   (0.03–8.89)   0.21
VIII          3 (0%)      0 (0%)                           1.00
Protein genetic profiles (PGP)                             < 0.01 ∗
A (bca)       152 (20%)   17 (13%)    1.64   (0.81–3.64)   0.09
alp1          195 (25%)   24 (18%)    1.50   (0.81–2.96)   0.10
alp2          29 (4%)     3 (2%)      1.66   (0.38–15.9)   0.61
alp3          164 (21%)   11 (8%)     2.93   (1.29–7.90)   < 0.01 †
R (rib)       234 (30%)   75 (57%)    0.33   (0.19–0.54)   < 0.01 ‡
None          6 (1%)      2 (2%)      0.50   (0.06–12.2)   0.33
Mobile genetic elements (MGE)
GBSi1         145 (19%)   37 (28%)    0.59   (0.33–1.06)   0.02
IS1381        661 (85%)   105 (80%)   1.43   (0.73–2.65)   0.16
IS1548        176 (23%)   51 (39%)    0.46   (0.27–0.79)   < 0.01 ‡
IS861         381 (49%)   91 (69%)    0.43   (0.25–0.73)   < 0.01 ‡
ISSag1        757 (97%)   131 (99%)   0.25   (0.00–2.20)   0.24
ISSag2        763 (98%)   131 (99%)   0.34   (0.00–3.16)   0.50
ISSa4         39 (5%)     2 (2%)      3.42   (0.65–69.3)   0.11
Antibiotic resistance-related genes
aad-6         16 (2%)     1 (1%)      2.74   (0.29–583.)   0.49
aph-3         12 (2%)     1 (1%)      2.05   (0.20–444.)   0.71
ermB          26 (3%)     2 (2%)      2.24   (0.41–46.3)   0.41
ermTR         24 (3%)     8 (6%)      0.49   (0.17–1.78)   0.12
int-Tn        461 (59%)   95 (72%)    0.56   (0.32–0.97)   < 0.01 ‡
mef           19 (2%)     1 (1%)      3.27   (0.36–687.)   0.34
mre           780 (100%)  131 (99%)                        0.14
tetM          651 (84%)   114 (86%)   0.80   (0.36–1.60)   0.44
tetO          23 (3%)     3 (2%)      1.31   (0.29–12.7)   1.00

Note: The groups MS, MSST and PGP were first analysed as 9×2, 4×2, and 6×2 contingency tables respectively. The significance of subgroups was only reported if statistical non-independence was found in the overall group. The remaining binary markers were analysed as 2×2 tables. Statistical significance was determined by using the Chi-square test or Fisher’s exact test (for groups with fewer than 5 isolates). Groups marked with (∗) were found statistically significant towards invasive (†) or colonising (‡) at α = 0.01. Abbreviations: OR: odds ratio; 99% C.I.: 99% confidence interval.
Table 3.3: Frequencies of GBS molecular serotypes and Cα-like protein family genes in 4 clinical subgroups

            a) N. inv. vs. col.       b) EOD vs. LOD            c) N. inv. vs. A. inv.    d) WCBA vs. col.
Markers     N. inv.   Col.      Sig.  EOD       LOD       Sig.  N. inv.   A. inv.   Sig.  WCBA      Col.      Sig.
            (n = 287) (n = 132)       (n = 182) (n = 105)       (n = 287) (n = 493)       (n = 100) (n = 132)
Molecular serotypes (MS, 9 types)
Ia          71 (25)   22 (17)         46 (25)   25 (24)         71 (25)   108 (22)        31 (31)   22 (17)
Ib          33 (12)   13 (10)         19 (10)   14 (13)         33 (12)   65 (13)         15 (15)   13 (10)
II          17 (6)    13 (10)         16 (9)    1 (1)     †     17 (6)    46 (9)          8 (8)     13 (10)
III         133 (46)  65 (49)         82 (45)   51 (49)         133 (46)  119 (24)  ∗     22 (22)   65 (49)   ‡
  III-1     68 (24)   48 (36)         48 (26)   20 (19)         68 (24)   82 (17)         15 (15)   48 (36)
  III-2     50 (17)   15 (11)         24 (13)   26 (25)         50 (17)   15 (3)    †     4 (4)     15 (11)
  III-3     1 (0)     0 (0)           1 (1)     0 (0)           1 (0)     6 (1)           1 (1)     0 (0)
  III-4     14 (5)    2 (2)           9 (5)     5 (5)           14 (5)    16 (3)          2 (2)     2 (2)
IV          4 (1)     0 (0)           3 (2)     1 (1)           4 (1)     17 (3)          1 (1)     0 (0)
V           26 (9)    16 (12)         13 (7)    13 (12)         26 (9)    119 (24)  ‡     19 (19)   16 (12)
VI          2 (1)     1 (1)           2 (1)     0 (0)           2 (1)     13 (3)          3 (3)     1 (1)
VII         0 (0)     2 (2)                                     0 (0)     4 (1)           1 (1)     2 (2)
VIII        1 (0)     0 (0)           1 (1)     0 (0)           1 (0)     2 (0)
NT          0 (0)     1 (1)                                                               0 (0)     1 (1)
Protein genetic profiles (PGP)
A (bca)     46 (16)   17 (13)         30 (17)   16 (15)         46 (16)   106 (22)        20 (20)   17 (13)
alp1        76 (27)   24 (18)         52 (29)   24 (23)         76 (27)   119 (24)        32 (32)   24 (18)
alp2        10 (4)    3 (2)           7 (4)     3 (3)           10 (4)    19 (4)          5 (5)     3 (2)
alp3        34 (12)   11 (8)          19 (10)   15 (14)         34 (12)   130 (26)  ‡     22 (22)   11 (8)    †
R (rib)     121 (42)  75 (57)         74 (41)   47 (45)         121 (42)  113 (23)  †     21 (21)   75 (57)   ‡
None        0 (0)     2 (2)                                     0 (0)     6 (1)

Note: all values shown in the table are in number (percent) of isolates. The statistical significance was determined by the Chi-square test or Fisher’s exact test (for groups with fewer than 5 isolates). Groups marked with (∗) were found statistically significant towards the left (†) or right (‡) group at α = 0.01. MS, MSST and PGP were first analysed as 9×2, 4×2, and 6×2 contingency tables respectively. The significance of subgroups was only reported if statistical non-independence was found in the overall group. Abbreviations: N. Inv.: neonatal invasive isolates; A. Inv.: adult invasive isolates; Col.: colonising isolates from routine gestational screening; EOD: early-onset diseases; LOD: late-onset diseases; WCBA: women at childbearing age; NT: isolates that were not typeable.

Table 3.4: Frequencies of GBS mobile genetic elements and antibiotic-resistance genes in 4 clinical subgroups

            a) N. inv. vs. col.       b) EOD vs. LOD            c) N. inv. vs. A. inv.    d) WCBA vs. col.
Markers     N. inv.   Col.      Sig.  EOD       LOD       Sig.  N. inv.   A. inv.   Sig.  WCBA      Col.      Sig.
            (n = 287) (n = 132)       (n = 182) (n = 105)       (n = 287) (n = 493)       (n = 100) (n = 132)
Mobile genetic elements
GBSi1       76 (27)   37 (28)         43 (24)   33 (31)         76 (27)   69 (14)   †     11 (11)   37 (28)   ‡
IS1381      221 (77)  105 (80)        147 (81)  74 (71)         221 (77)  440 (89)  ‡     88 (88)   105 (80)
IS1548      72 (25)   51 (39)   ‡     52 (29)   20 (19)         72 (25)   104 (21)        20 (20)   51 (39)   ‡
IS861       161 (56)  91 (69)   ‡     98 (54)   63 (60)         161 (56)  220 (45)  †     42 (42)   91 (69)
ISSag1      279 (97)  131 (99)        177 (97)  102 (97)        279 (97)  478 (97)        99 (99)   131 (99)
ISSag2      284 (99)  131 (99)        180 (99)  104 (99)        284 (99)  479 (97)        98 (98)   131 (99)
ISSa4       13 (5)    2 (2)           9 (5)     4 (4)           13 (5)    26 (5)          4 (4)     2 (2)
Antibiotic resistance-related genes
aad-6       2 (1)     1 (1)           1 (1)     1 (1)           2 (1)     14 (3)          4 (4)     1 (1)
aph-3       2 (1)     1 (1)           1 (1)     1 (1)           2 (1)     10 (2)          2 (2)     1 (1)
ermB        8 (3)     2 (2)           5 (3)     3 (3)           8 (3)     18 (4)          3 (3)     2 (2)
ermTR       9 (3)     8 (6)           9 (5)     0 (0)           9 (3)     15 (3)          1 (1)     8 (6)
int-Tn      175 (61)  95 (72)   ‡     114 (63)  61 (58)         175 (61)  286 (58)        52 (52)   95 (72)   ‡
mef         4 (1)     1 (1)           3 (2)     1 (1)           4 (1)     15 (3)          1 (1)     1 (1)
mre         287 (100) 131 (99)        182 (100) 105 (100)       287 (100) 493 (100)       100 (100) 131 (99)
tetM        248 (86)  114 (86)        160 (88)  88 (84)         248 (86)  403 (82)        85 (86)   114 (86)
tetO        5 (2)     3 (2)           5 (3)     0 (0)           5 (2)     18 (4)          3 (3)     3 (2)

Note: all values shown in the table are in number (percent) of isolates. The statistical significance was determined by the Chi-square test or Fisher’s exact test (for groups with fewer than 5 isolates) at the significance level of α = 0.01. Groups marked with (∗) were found statistically significant towards the left (†) or right (‡) group at α = 0.01. Abbreviations: N. Inv.: neonatal invasive isolates; A. Inv.: adult invasive isolates; Col.: colonising isolates from routine gestational screening; EOD: early-onset diseases; LOD: late-onset diseases; WCBA: women at childbearing age.

3.2.3 Discussion
GBS molecular markers associated with virulence
In the single marker analysis, only alp3 was significantly associated with invasive disease, and serotype II was associated with early-onset invasive disease (p<0.01). MS III and rib (which were previously known to be associated with each other [133]) were both associated with antenatal vaginal isolates. Serotype III GBS isolates have been frequently reported to be associated with neonatal invasive disease [134]. In our comparison, the results suggested an inverse relationship (i.e., an association with colonising rather than invasive isolates) in the aggregate analysis (Table 3.2). This may be attributable to the inclusion of a relatively large number of adult invasive isolates, which include a smaller proportion of serotype III than neonatal disease isolates. Further subgroup comparison, however, revealed no significant difference between vaginal colonising and neonatal invasive isolates (Table 3.3). This result suggests that there may only be a limited association between serotype III and neonatal diseases. A similar association with the colonising group was found for the protein rib. Thus, while certain markers may be present more frequently in certain populations (for example, serotype III and protein rib in the neonatal period, and serotype V in adults [96, 97]), there is a lack of evidence for a definitive association between these molecular epidemiological markers and overall GBS invasive capability. The differences in serotype and genotype distribution may also illustrate the degree of genetic heterogeneity in infective GBS diseases. This diversity highlights the difficulty in ascertaining the relationships between specific GBS genotypes and their clinical manifestations.
[Figure 3.4: workflow diagram. GBS isolates (n = 912) are grouped into invasive (n = 780) and non-invasive (n = 132) isolates and genotyped by mPCR/RLB. The genotype data are used to train ML models (algorithms: NB, LR, SVM, J48, MLP, IBk, ...), whose generalisation performance is evaluated by 10-fold cross-validation.]
Figure 3.4: Predictive analysis of GBS genotyping data by supervised machine learning. Both invasive and non-invasive GBS isolates were typed by using mPCR/RLB. The genotype data were then used to train machine learning models. Predictive accuracies were estimated by 10-fold cross-validation. mPCR/RLB: multiplex PCR-based reverse line blot; ML: machine learning.
3.3 Predictive analysis by machine learning
In this section, machine learning classifiers are used to construct predictive models that distinguish isolates according to clinical outcomes, using experimental genotype data.
3.3.1 Material and methods
Machine learning algorithms
Six machine learning algorithms were selected from the Waikato Environment for Knowledge Analysis (WEKA, version 3.5.2) [135]. Logistic regression (LR) was applied with a ridge value of 10^-8 in the log-likelihood. The k-nearest neighbour classifier (IBk) was applied with an inverse-distance weighting function, and the number of neighbours was determined by leave-one-out cross-validation. A feed-forward multilayer perceptron (MLP) network with one hidden layer consisting of 16 nodes was trained by the backpropagation algorithm. The information-theoretic decision tree J48 was studied with the pruning confidence level set at 0.70. A support vector machine (SVM) with a second-degree polynomial kernel was trained by the sequential minimal optimisation (SMO) algorithm. Logistic models were built to allow proper estimation of posterior probabilities in trained SVM models [136]. The naïve Bayes classifier (NB) was also used. A majority class predictor (ZeroR), which always predicts the isolates as invasive, was used as a control against which to compare the performance of the above classifiers.
All markers described in Section 3.2.1 were included in classifier training and virulence prediction.
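The role of the majority class control can be sketched in a few lines: ZeroR ignores the markers and always predicts the most frequent class, so its accuracy equals the proportion of that class. The sketch below is not WEKA code; the function name is illustrative, and the group sizes are those given in Section 3.2.1. Note that the majority class depends on the comparison (in the WCBA versus colonising comparison it is the colonising group).

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def zero_r(labels: Sequence[str]) -> Tuple[str, float]:
    """Return the majority class and the accuracy obtained by always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Group sizes taken from Section 3.2.1 (class name -> number of isolates).
comparisons: Dict[str, Dict[str, int]] = {
    "overall": {"invasive": 780, "colonising": 132},
    "(a)": {"neonatal invasive": 287, "colonising": 132},
    "(b)": {"early-onset": 182, "late-onset": 105},
    "(c)": {"adult invasive": 493, "neonatal invasive": 287},
    "(d)": {"WCBA": 100, "colonising": 132},
}

baselines = {}
for name, groups in comparisons.items():
    labels = [cls for cls, n in groups.items() for _ in range(n)]
    baselines[name] = zero_r(labels)
    majority, accuracy = baselines[name]
    print(f"{name}: always predict '{majority}' -> accuracy {accuracy:.1%}")
```

Any classifier that uses the genetic markers has to exceed these proportions before its accuracy can be regarded as an improvement over the control.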
Evaluation
Performance of classifiers was assessed by both classification accuracy and area under the ROC curve (AUC). AUC measures the discriminative power of a classifier. An AUC of 0.5 indicates that the classifier is no better than chance in discriminating clinical groups of isolates, while an AUC of 1 denotes perfect discrimination; an AUC greater than 0.8 is usually expected for clinical applications. Classification accuracy is defined as the proportion of cases correctly classified as invasive or colonising at the default threshold, and its standard error was estimated from the binomial distribution as $\hat{s} = \sqrt{p(1 - p)/n}$, where $p$ is the accuracy and $n$ is the number of isolates. AUC can be interpreted as the probability that the classifier ranks a randomly chosen isolate of one group above a randomly chosen isolate of the other, independent of prior probabilities or test threshold [137, 138]. In this analysis, AUC was estimated non-parametrically by using the trapezoidal rule and the standard error of AUC was estimated by the Hanley-McNeil method [139]. The generalisation performance of all classifiers was evaluated by using stratified 10-fold cross-validation. GBS subgroups were compared in the same pairs as described in Section 3.2.1 and shown in Figure 3.2.
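The evaluation measures above can be sketched as follows. The AUC is computed from pairwise comparisons of scores (the Mann-Whitney form, which gives the same value as the trapezoidal area under the empirical ROC curve), and the standard error follows the formula commonly attributed to Hanley and McNeil. The function names and the classifier scores are hypothetical, chosen only to make the sketch runnable; they are not outputs from the thesis experiments.

```python
import math
from typing import Sequence


def auc_pairwise(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """Non-parametric AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half. This equals the trapezoidal area under the empirical ROC.
    """
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def hanley_mcneil_se(auc: float, n_pos: int, n_neg: int) -> float:
    """Standard error of the AUC using the Hanley-McNeil approximation."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (auc * (1 - auc)
                + (n_pos - 1) * (q1 - auc ** 2)
                + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(variance)


def binomial_se(p: float, n: int) -> float:
    """Standard error of a proportion (e.g. accuracy): sqrt(p(1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


# Hypothetical predicted probabilities of "invasive" for eight isolates.
invasive_scores = [0.9, 0.8, 0.6, 0.55]
colonising_scores = [0.7, 0.5, 0.4, 0.3]

auc = auc_pairwise(invasive_scores, colonising_scores)
se_auc = hanley_mcneil_se(auc, len(invasive_scores), len(colonising_scores))
print(f"AUC = {auc:.3f}, SE = {se_auc:.3f}")
```

In a cross-validated setting, one option would be to pool the held-out predictions from all folds before computing these quantities; whether the thesis pooled predictions or averaged per-fold values is not stated, so that choice is left open here.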
3.3.2 Results
In both the aggregated and subgroup predictive analyses, most classifiers were not significantly more accurate than the majority class predictor ZeroR; the exceptions were LR in group comparisons (a) and (c), SVM in (c), and NB in (c) and (d) (Table 3.5). Overall, machine learning algorithms separated invasive and colonising groups of GBS isolates with suboptimal accuracy but achieved AUCs that were statistically significantly better than chance (0.5). The AUCs under 10-fold cross-validation were between 0.64 and 0.67. In subgroup analyses, classifiers trained to distinguish neonatal invasive from colonising isolates produced AUCs of 0.57–0.63. AUCs of 0.57–0.60 were found when comparing early-onset versus late-onset isolates, 0.58–0.66 for adult versus neonatal invasive disease isolates, and 0.63–0.68 when comparing invasive strains among women at childbearing age with colonising isolates around the perinatal period.
3.3.3 Discussion
Predictive versus descriptive analysis of virulence by GBS molecular markers
This analysis applied machine learning algorithms to predict clinical outcomes from GBS genotypes. GBS genotypes have traditionally been assigned based on combinations of genetic markers (for example, the genotyping system developed by Kong et al. studied in this chapter [112]) or by studying sequence polymorphisms (e.g. MLST developed by Jones et al. [107]). Based on genotypes, virulence clusters are then assigned by a phylogenetic dendrogram generated by clustering algorithms such as the Unweighted Pair Group Method with Arithmetic Mean algorithm.
Table 3.5: Predictive analysis with machine learning classifiers: performance measured in classification accuracy and AUC with stratified 10-fold cross-validation

             Accuracy                           AUC
Classifier   Accuracy  95% C.I.   p-value∗      AUC    95% C.I.      p-value†
Overall comparison: Invasive versus colonising isolates
ZeroR        85.5%     (84.7%–)   control       0.49   (0.48–0.51)   control
LR           84.9%     (84.0%–)   0.88          0.67   (0.66–0.69)   < 0.05
IBk          84.2%     (83.4%–)   0.98          0.66   (0.64–0.68)   < 0.05
MLP          83.9%     (83.0%–)   0.99          0.65   (0.63–0.67)   < 0.05
J48          84.0%     (83.1%–)   0.99          0.65   (0.63–0.66)   < 0.05
SVM          85.6%     (84.8%–)   0.42          0.64   (0.62–0.67)   < 0.05
NB           75.3%     (74.3%–)   1             0.64   (0.63–0.66)   < 0.05
(a) Neonatal invasive versus colonising isolates
ZeroR        68.5%     (66.9%–)   control       0.49   (0.47–0.51)   control
LR           71.1%     (69.5%–)   < 0.05        0.60   (0.58–0.62)   < 0.05
IBk          69.5%     (67.8%–)   0.19          0.60   (0.58–0.62)   < 0.05
MLP          67.1%     (65.4%–)   0.90          0.59   (0.55–0.61)   < 0.05
J48          69.0%     (67.4%–)   0.33          0.58   (0.55–0.60)   < 0.05
SVM          69.9%     (68.3%–)   0.10          0.57   (0.55–0.59)   < 0.05
NB           63.3%     (61.6%–)   1             0.57   (0.55–0.59)   < 0.05
(b) Early-onset versus late-onset diseases
ZeroR        63.4%     (61.4%–)   control       0.48   (0.46–0.51)   control
LR           59.9%     (57.9%–)   0.99          0.55   (0.52–0.57)   < 0.05
IBk          63.8%     (61.7%–)   0.40          0.57   (0.55–0.60)   < 0.05
MLP          58.5%     (56.5%–)   1             0.58   (0.56–0.61)   < 0.05
J48          62.7%     (60.7%–)   0.70          0.59   (0.56–0.61)   < 0.05
SVM          65.2%     (63.2%–)   0.10          0.57   (0.54–0.59)   < 0.05
NB           60.3%     (58.2%–)   0.98          0.58   (0.56–0.61)   < 0.05
(c) Adult versus neonatal invasive isolates
ZeroR        63.2%     (62.0%–)   control       0.49   (0.47–0.52)   control
LR           66.5%     (65.3%–)   < 0.05        0.66   (0.64–0.69)   < 0.05
IBk          64.4%     (63.2%–)   0.08          0.64   (0.62–0.67)   < 0.05
MLP          64.0%     (62.7%–)   0.17          0.63   (0.61–0.66)   < 0.05
J48          63.1%     (61.8%–)   0.57          0.63   (0.61–0.66)   < 0.05
SVM          67.7%     (66.5%–)   < 0.05        0.58   (0.55–0.60)   < 0.05
NB           65.6%     (64.4%–)   < 0.05        0.65   (0.62–0.67)   < 0.05
(d) WCBA versus colonising isolates
ZeroR        56.9%     (51.5%–)   control       0.49   (0.42–0.57)   control
LR           63.4%     (58.1%–)   0.75          0.63   (0.55–0.70)   < 0.05
IBk          63.8%     (58.6%–)   0.34          0.64   (0.56–0.71)   < 0.05
MLP          62.9%     (57.7%–)   0.25          0.68   (0.61–0.75)   < 0.05
J48          65.5%     (60.4%–)   0.29          0.65   (0.58–0.72)   < 0.05
SVM          60.8%     (55.5%–)   0.79          0.64   (0.57–0.71)   < 0.05
NB           62.5%     (57.3%–)   < 0.05        0.66   (0.59–0.73)   < 0.05

∗ one-tailed t-test compared with the majority class classifier (ZeroR), df=9
† two-tailed t-test compared with chance (0.5), df=9
[Figure 3.5: two parallel flow charts starting from the study isolates. Clustering: isolate characterisation, clustering, assign genotypes, match site of isolation to genotypes, label genotype clusters; isolates for prediction are then characterised and their genotypes matched to a cluster to give the predicted outcome. Machine learning: group isolates by site of isolation, select predictive models, train models using the isolates; the trained models then predict the outcome of characterised isolates.]
Figure 3.5: The traditional clustering technique versus predictive classifications using supervised machine learning
There are several differences in using machine learning to classify isolates compared to genotype assignments using clustering methods (Figure 3.5):
1. With supervised machine learning, GBS isolates are first grouped and assigned a class label, which allows the training of models that best separate these classes. This is different from unsupervised clustering methods, where the genotypes are first described before the corresponding phenotypes are matched onto the genotype groups.
2. Clustering is prone to model overfitting, which reduces the generalisability of genotype assignment in virulence prediction. In theory, each clinical group can be perfectly matched to at least one genotype, given infinite combinations of genetic markers for characterising a bacterial strain. However, such many-to-one (genotype to clinical group) relationships can be overly optimistic, as newly discovered strains may not have the exact genotypes described in the study samples. Most modern machine learning algorithms circumvent the problem of indiscriminately adding genotype markers by methods such as early stopping (e.g. in artificial neural networks) or pruning of trained models (e.g. the J48 decision tree).
3. Clustering methods focus on maximising the descriptive power of bacterial epidemiology, whereas predictive methods focus on translating the genotypes into forecastable results, such as clinical outcomes.
GBS molecular markers are poor predictors of clinical outcome
The molecular markers chosen in this study have previously been found to discriminate between different GBS clonal lineages [122]. It was anticipated that these markers could be “linked” with the virulence phenotypes, which would enable us to forecast the risks of developing invasive diseases. The predictive powers of the chosen classification algorithms, however, were poor in both aggregate (Table 3.5) and subgroup data (Table 3.5(a)–(d)). Given that similar performance was obtained across all classifiers, it seems unlikely that all classification methods would fail, by chance, to discriminate between the different groups. Our observations suggest that the overall predictive power of the individual molecular markers and combinations studied here may be weak, which may contradict the positive associations suggested by previous studies (for example, GBSi1 and serotype III). A similar machine learning study of group A streptococcus also failed to identify significant links between genotypes and phenotypes, despite occasional previous studies reporting associations of individual genes with virulence [140]. These findings highlight the need to discover new genes and to assemble a more comprehensive and potentially more discriminatory collection of virulence markers, which, together with host and environmental factors, may improve disease risk stratification.
3.4 Potential limitations
The poor classification performance could be attributed to several factors apart from the poor discriminability of the molecular markers investigated:
Lack of suitable comparators for adult invasive diseases
To delineate the genetic characteristics of invasive GBS isolates, isolates from routine antenatal screening were collected for comparison. However, the bacterial populations found in non-pregnant adults may differ from those found in perinatal isolates. Colonising isolates from non-pregnant adults were unavailable due to practical constraints. Future studies of infections in non-pregnant adults should consider a systematic collection of colonising isolates (for example, rectal swabs) for comparison. Despite the lack of a suitable comparative group, the analytic approach presented in this chapter should reveal markers associated with the “general invasiveness” of GBS. The selection of vaginal isolates was also justified, as vaginal colonisation isolates are major sources of neonatal invasion.
Overlapping of clinical groups
It is common practice to analyse microbial virulence by dichotomising isolates by case-control assignment (“colonising” versus “invasive”), but separation between the two groups cannot be perfect using this method. It seems unlikely that samples from sterile sites such as CSF, or obtained in neonatal infections, would have been incorrectly labelled as invasive. On the other hand, a small proportion of colonising isolates from mucosal sites are clearly capable of displaying virulent behaviour, either in different circumstances or in different hosts, since these sites are the sources of invasive isolates. Consequently, the two classes overlap when site of isolation is used as the class assignment rule, and it will not be possible to build completely discriminating classifiers in such a circumstance. The best that could be achieved would be to rank colonising strains as more or less likely to cause invasive disease than would be explained by their prevalence at mucosal sites. This could partially explain why serotype III and rib isolates were overrepresented among antenatal colonising isolates: they are common among vaginal colonisers and are therefore the isolates to which susceptible neonates are most commonly exposed.
Other practical factors
Several practical factors could also have contributed to the poor classification performance. For example, the choice of GBS isolates was limited, such that most colonising isolates were collected from Australia and New Zealand. Clinical variables, such as obstetric risk factors associated with the pregnancies of neonatal isolates or comorbidities associated with adult invasive isolates (for example, diabetes mellitus), were also unavailable for analysis. Integration of additional clinical data may improve the predictive power of the models and should be considered in future investigations.
3.5 Selection of discriminatory features
Reducing the number of non-discriminative features, especially in data with high dimensionality, may improve the performance of machine learning algorithms [141].
The poor classification results in Section 3.3 could in part be attributed to poorly discriminating genetic markers. Table 3.2 shows that only a few binary markers were significantly associated with GBS invasion at α=0.01. Therefore, it can be hypothesised that the use of a discriminative subset of molecular markers (from the panel of 18 markers) may improve classification performance. Several research questions can be posed:
1. Can a reduction in the number of markers improve classification performance?
2. What is the minimum set of markers needed to achieve the best classifier performance?
3. Which markers are associated with the best classification performance?
4. What feature selection algorithm is best suited to perform this task?
3.5.1 Methods
Four feature ranking (FR) algorithms (ReliefF [142], symmetrical uncertainty [143], information gain [144], and chi-square feature selection [145]) were studied in combination with the six classifiers listed in Section 3.3 to ensure reproducibility. The ranks produced by the FR algorithms were determined by a modified 10-fold cross-validation, as the standard cross-validation technique could overestimate classifier performance when the subset of features is also selected from the same data set [146].
Figure 3.6 gives the pseudocode for obtaining the optimal number of features (N_{opt,A}) for each feature ranking algorithm A when combined with each classifier C.
X = GBS genotype data (18 markers, 912 isolates)
y = the corresponding clinical outcome of X (invasive or colonising)

for each classifier ŷ = C(y, X) do
    for each feature ranking algorithm A do
        divide X into 10 folds
        for fold i = 1 to 10 do
            X_{i,ts} = test set of fold i (10% of data)
            X_{i,tr} = training set of fold i (the remaining 90% of data)
            apply A on (y, X_{i,tr}) to obtain the feature rank R_i
            for N = 1 to 18 do
                X_{i,N,tr} = X_{i,tr}, with only the top N features ranked in R_i
                X_{i,N,ts} = X_{i,ts}, with only the top N features ranked in R_i
                train C using (y, X_{i,N,tr})
                test C on (y, X_{i,N,ts}) to obtain AUC_i(C(y, X_{i,N,ts}))
            end for
        end for
        calculate the mean AUC for each N, where
            $\mathrm{AUC}\big(C(X_N), A\big) = \frac{1}{10} \sum_{i=1}^{10} \mathrm{AUC}_i\big(C(y, X_{i,N,ts})\big)$
        obtain $N_{opt,A} \leftarrow \arg\max_N \mathrm{AUC}\big(C(X_N), A\big)$
    end for
end for
Figure 3.6: Pseudocode: the modified cross-validation procedure for evaluating classifier performance after feature selection
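The procedure in Figure 3.6 can be sketched in runnable form for a single classifier and a single ranking algorithm. This is a minimal illustration under stated assumptions, not the software used in the study: the ranker (difference in marker frequency between classes) and the classifier (a Bernoulli naive Bayes with add-one smoothing) are simple stand-ins, the fold assignment is a basic round-robin stratification, and the data are synthetic. All function names are hypothetical.

```python
import math
import random

def auc(labels, scores):
    """Rank-based AUC with ties counted as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_features(X, y):
    """Stand-in ranker: order markers by the absolute difference in
    marker frequency between the two classes (largest first)."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    def gap(j):
        f_pos = sum(row[j] for row, label in zip(X, y) if label == 1) / n_pos
        f_neg = sum(row[j] for row, label in zip(X, y) if label == 0) / n_neg
        return abs(f_pos - f_neg)
    return sorted(range(len(X[0])), key=gap, reverse=True)

def train_nb(X, y, features):
    """Stand-in classifier: Bernoulli naive Bayes, add-one smoothing."""
    model = {}
    for c in (0, 1):
        rows = [row for row, label in zip(X, y) if label == c]
        prior = math.log(len(rows) / len(y))
        probs = {j: (sum(r[j] for r in rows) + 1) / (len(rows) + 2) for j in features}
        model[c] = (prior, probs)
    return model

def score_nb(model, row):
    """Log-odds in favour of class 1 (higher means more likely invasive)."""
    def loglik(c):
        prior, probs = model[c]
        return prior + sum(math.log(p if row[j] else 1 - p) for j, p in probs.items())
    return loglik(1) - loglik(0)

def stratified_folds(y, k):
    """Deal the indices of each class round-robin into k folds."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    return [order[f::k] for f in range(k)]

def optimal_n(X, y, k=10):
    """Return (N_opt, mean AUC per N), ranking features inside each fold."""
    n_features = len(X[0])
    total = [0.0] * (n_features + 1)
    for test_idx in stratified_folds(y, k):
        test = set(test_idx)
        X_tr = [X[i] for i in range(len(y)) if i not in test]
        y_tr = [y[i] for i in range(len(y)) if i not in test]
        ranking = rank_features(X_tr, y_tr)  # the test fold is never seen here
        for n in range(1, n_features + 1):
            model = train_nb(X_tr, y_tr, ranking[:n])
            scores = [score_nb(model, X[i]) for i in test_idx]
            total[n] += auc([y[i] for i in test_idx], scores)
    mean_auc = {n: total[n] / k for n in range(1, n_features + 1)}
    return max(mean_auc, key=mean_auc.get), mean_auc

# Synthetic data (an assumption for illustration): marker 0 is informative,
# the other five markers are noise.
rng = random.Random(0)
y = [1 if rng.random() < 0.3 else 0 for _ in range(200)]
X = [[int(rng.random() < (0.8 if label else 0.2))]
     + [int(rng.random() < 0.5) for _ in range(5)] for label in y]
n_opt, mean_auc = optimal_n(X, y)
```

The key design point is that `rank_features` is called on the training portion of each fold only, so the test fold plays no part in choosing the features on which it is later evaluated.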
3.5.2 Results
Classification performance
Classifier performance after feature selection is shown in Figure 3.7.
Compared with the AUCs obtained without feature selection, no statistically significant improvement was achieved at the p<0.05 significance level (Table 3.6). The feature ranking algorithms achieved the best AUC with between 1 and 11 markers (overall median = 2.5, Table 3.6). The best AUCs achieved after feature selection were between 0.62 and 0.70 across the 6 classifiers. The naïve Bayes classifier achieved the best overall AUC (0.70) when combined with the ReliefF algorithm.
Selected markers
Most feature ranking algorithms identified MS and PGP as the top molecular markers for classification. Within the top third of the ranking of the 18 markers, int-Tn (ranked by 4 algorithms), IS861 (ranked by 3 algorithms), IS1548 (ranked by 3 algorithms), and GBSi1 (ranked by 3 algorithms) were consistently placed near the top of the list (Table 3.7).
Table 3.6: Number of features (N_{opt,A}) required to achieve the best AUC for each classifier.

              Feature ranking algorithm
Classifier∗   ReliefF   Sym.Unc.   Inf.Gain   ChiSq.   Median   Best AUC   p-value†
NB            7         4          3          3        3.5      0.70       0.09
IBk           2         1          11         5        3.5      0.64       0.64
J48           3         3          3          3        3        0.62       0.25
MLP           3         1          2          2        2        0.64       0.69
LR            5         1          2          2        2        0.66       0.63
SVM           2         1          1          1        1        0.65       0.79

∗ Abbreviations: machine learning algorithms: NB: naïve Bayes classifier; IBk: k-nearest neighbour classifier; J48: J48 decision tree; MLP: multilayer perceptron; LR: logistic regression; SVM: support vector machine with linear kernel. The parameters of the classifiers are identical to those described in Section 3.3.1. Feature selection algorithms: ReliefF: ReliefF algorithm; Sym. Unc.: symmetric uncertainty algorithm; Inf. Gain: information gain feature selection algorithm; ChiSq.: chi-square feature selection algorithm.
† The null hypothesis was defined as no difference from the AUC produced by the same algorithm without feature selection (Table 3.5). A paired t-test with df = 18 was used to compare the differences in AUCs. Standard error was estimated by the Hanley-McNeil method [139].

Table 3.7: Top one-third of the GBS markers ranked by feature ranking algorithms
        Feature ranking algorithm
Rank∗   ReliefF   Sym.Unc.   Inf.Gain   ChiSq.
1       MS        PGP        PGP        PGP
2       PGP       IS861      MS         MS
3       int-Tn    MS         IS861      IS861
4       tetM      IS1548     IS1548     IS1548
5       IS1381    mre        int-Tn     int-Tn
6       GBSi1     int-Tn     GBSi1      GBSi1

∗ The ranks were determined by 10-fold cross-validation. Feature selection algorithms: ReliefF: ReliefF algorithm; Sym. Unc.: symmetric uncertainty algorithm; Inf. Gain: information gain feature selection algorithm; ChiSq.: chi-square feature selection algorithm.
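The footnote to Table 3.6 states that the standard error of each AUC was estimated by the Hanley-McNeil method [139]. A minimal sketch of that estimate is given below, using the commonly cited closed-form approximation based on the AUC and the two class sizes. The function name and the example class sizes are illustrative assumptions and do not correspond to the study data.

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an AUC (Hanley-McNeil closed form).

    auc   : observed area under the ROC curve
    n_pos : number of positive (e.g. invasive) cases
    n_neg : number of negative (e.g. colonising) cases
    """
    q1 = auc / (2 - auc)            # two positives both outrank one negative
    q2 = 2 * auc ** 2 / (1 + auc)   # one positive outranks two negatives
    variance = (auc * (1 - auc)
                + (n_pos - 1) * (q1 - auc ** 2)
                + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(variance)

# Hypothetical class sizes, chosen only to show how the estimate is used.
se_example = hanley_mcneil_se(0.70, n_pos=100, n_neg=500)
```

The estimate shrinks as either class grows, so an AUC measured on a small subgroup carries a wider margin of uncertainty than the same AUC measured on the full data set.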
3.5.3 Discussion
The results after feature selection suggest that reducing the number of markers did not improve classification performance. In addition, the median number of optimal features (N_{opt,A}) was between 1 and 3.5 markers, and these consistently included MS and PGP, indicating that only these markers are important predictors of GBS virulence. This finding also suggests that most of the remaining markers have weak predictive power in virulence classification.
3.6 Evolutionary considerations
So far we have investigated the prediction of GBS clinical outcomes by bacterial genotypes using both statistical and machine learning methods. It was found that although different GBS strains could be distinguished with genotyping systems, no markers or their combinations yielded sufficient power to predict clinical outcomes robustly. Apart from the limitations discussed in Section 3.4, several evolutionary aspects of bacterial genetics may have roles in affecting the predictability of the molecular markers.
Figure 3.7: Performance of machine learning classifiers (AUC) versus number of features as selected by feature ranking algorithms. Each point indicates an AUC obtained from one run of 10-fold cross-validation by a given classifier trained with the top-N attributes. [Plots not reproduced: six panels (NB, IBk, J48, MLP, LR, SMO), each showing AUC (0.4 to 0.7) against the number of features (N, 0 to 20), with one series per feature ranking algorithm.] Abbreviations: machine learning algorithms: NB: naïve Bayes; J48: J48 decision tree; LR: logistic regression; IBk: k-nearest neighbour algorithm; SMO: support vector machine (linear kernel); MLP: multilayer perceptron. Feature ranking algorithms: ReliefF: ReliefF algorithm; SymmetricalUncert: symmetrical uncertainty algorithm; InfoGain: information gain algorithm; ChiSquared: chi-squared algorithm.
Figure 3.8: Evolutionary mechanisms for non-cosegregation of markers with the virulence traits. As illustrated by the diagrams, there are two potential scenarios that may cause disruption of linkage between pathogen subtypes and virulence. [Diagrams not reproduced.]
(a) Horizontal gene transfers (HGT). Assume the virulence trait is contributed by the true virulence genes (marked in the diagram). HGT of the virulence gene from B to C invalidates the ability to track invasive behaviour by the markers originally capable of distinguishing subpopulations of type A, B, and C.
(b) De novo mutations under positive selection pressure. In this example, a new virulence gene arising from type C degrades the predictive power of the typing system over the bacterial generations.

3.6.1 The clonal assumption of genotyping is defied by mechanisms of horizontal gene transfer
A fundamental assumption of bacterial typing necessitates a perfect clonal relationship between a bacterial type and its precursors. Thus, an ideal scenario would be that virulence properties are genetically linked with molecular markers (Figure 3.8(a)), and the cosegregation of virulence mechanisms with the markers would enable tracking of the invasive phenotype. Realistically, however, bacteria readily share their genes by mechanisms of horizontal gene transfer (HGT), which would allow the spread of true virulence factors without the cosegregation of molecular markers. The weak linkage between marker and true virulence factor may cause a predictive marker to gradually lose its predictive power over generations.
3.6.2 Positive selection of virulence genes
Positive selection pressure applied to true virulence genes could also contribute to the poor predictive power. Bacterial pathogens constantly undergo selection pressures to survive the hostile environment created by the host immune system and antimicrobials. Fitter virulence genes that confer a survival advantage are rapidly positively selected (Figure 3.8(b)). The evolutionary rates of virulence genes are faster than those of the rest of the genome [49]. Thus, labelling a “hypervirulent clone” with markers that are more stable than the virulence genes may disrupt the linkage between the virulence trait and the markers selected for typing, resulting in a gradual decay of predictive power. In support of this hypothesis, there is evidence that the GBS virulence trait can segregate independently of commonly used markers, including serotypes and MLST, a phenomenon that may be explained by the process of positive selection [126].
3.6.3 Virulence gene typing
To aid in the effective selection of molecular markers to predict virulence, two empirical criteria are suggested:
1. An ideal molecular marker should be closely-linked with true virulence genes
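2. An ideal molecular marker should have a rate of evolution equal to or faster than that of the virulence genes, such that

$$\mathrm{rate}_{\mathrm{clone}} \le \mathrm{rate}_{\mathrm{virulence\ gene}} \le \mathrm{rate}_{\mathrm{marker}} \le \mathrm{rate}_{\mathrm{strain}} \tag{3.1}$$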
Intuitively, the perfect typing method would be whole-genome sequencing (perfect linkage and $\mathrm{rate}_{\mathrm{marker}} = \mathrm{rate}_{\mathrm{strain}}$). Nevertheless, genome sequencing for individual pathogens is impractical, as the associated cost and time do not match the immediacy required in clinical decision making. A practical alternative is to perform genetic characterisation on true virulence genes, or virulence gene typing, which would yield a perfect linkage with $\mathrm{rate}_{\mathrm{virulence\ gene}} = \mathrm{rate}_{\mathrm{marker}}$. One challenge in identifying true virulence genes, as stated in Chapter 1, is the time and resource constraints associated with experimental failures from incorrect selection of candidate genes. In GBS, several virulence genes have been experimentally verified (Chapter 7). However, it is likely that our knowledge of virulence genes is non-exhaustive and that more virulence genes are yet to be discovered. To assist with the functional discovery of genes that may help to explain GBS virulence, the later part of this thesis develops an in silico candidate gene prioritisation (CGP) method by adapting comparative genome analysis and current CGP methods (which have been studied in the discovery of human genetic diseases) to bacteriology.
3.7 Chapter summary
This chapter examined the predictive relationship between GBS virulence and the eighteen molecular markers. The markers commonly attributed to virulence, such as molecular serotypes and protein genetic profiles, were analysed by both machine learning and statistical methods. Some markers were associated with invasive diseases (alp3) and others with antenatal colonising isolates (rib, GBSi1, IS1548, IS861, and int-Tn). Other markers were also associated with important subgroups.
The molecular epidemiology markers studied in this chapter were previously found to be discriminative in differentiating GBS strains and defining important GBS clusters (for example, the association of III-2 and MLST ST-17 with neonatal invasive diseases). Based on machine learning analysis, however, these markers alone were inadequate in predicting clinical outcomes. Feature selection algorithms did not further improve classification performance. Although several limitations, such as overlapping groups and other practical constraints, could have affected the results, better classification results would be expected if the molecular markers were more discriminative. Horizontal gene transfers and positive selective pressure on virulence genes are possible mechanisms that could also have contributed to the disruption of linkage between the markers and the virulence factors. Thus, characterisation of virulence genes, or virulence gene typing, is postulated to be the key to achieving better discriminative power in virulence prediction. Developing effective methods of virulence gene discovery is therefore an important step in achieving this goal.
Part II
Co-occurrence-based candidate gene prioritisation
Chapter 4
Candidate gene prioritisation
4.1 Introduction
Challenges in identifying bacterial virulence genes
The importance of virulence gene identification for predicting infectious disease outcomes was demonstrated in Chapter 3. In the discovery of virulence genes in pathogenic bacteria, Raskin et al. observed a trend toward increasing use of high-throughput genomic technologies with bioinformatic analysis compared with the traditional gene screening methods [44]. However, biological experiments to discover virulence genes are resource intensive and time consuming. Thus, improving pre-experimental selection of candidate genes with computational methods could reduce the number of cycles of negative trials and thus increase the likelihood of gene discovery.
Genomic screening for positively-selected genes as virulence candidates
A computational comparative genomic approach has been described by Chen et al., in which a uropathogenic Escherichia coli genome (E. coli UTI89) was compared with non-uropathogenic E. coli genomes to identify the positively selected genes using phylogenetic analysis by maximum likelihood [49]. These genes were subsequently validated in 50 clinical isolates, where significant variations in several genes (fepE, ompC, and amiA) were found. However, the approach proposed by Chen et al. requires multiple genomes of the same species with phenotypic variations. Therefore, the method is not directly applicable to our task of searching for GBS virulence genes, as there is only a limited subset of phenotypic variations in the current database of published genomes (2603 V/R, A909 and NEM316 were all invasive isolates). Alternative methods of in silico comparative genomic analysis are thus needed for this task of virulence gene discovery.
Methods of in silico candidate gene prioritisation (CGP) can assist with discovering genes responsible for inherited diseases in humans
Several bioinformatic methods for prioritising candidate genes have been described in the search for human disease genes. Selecting appropriate genes for biological validation remains a key constraint in gene function discovery, given that there are more than 20,000 genes in the human genome [147]. CGP covers a wide variety of methods that automate the initial gene selection task for researchers. In this chapter, methods for computational prioritisation of candidate genes are reviewed and practical constraints in their application are discussed. The purpose of this review is to explore the possibility of adopting the concept of CGP, and then to construct such a system for discovering bacterial gene functions and genes associated with virulence. Conceptual analogies between CGP and information retrieval (IR) methods will be drawn. Currently, there are no similar gene-ranking tools available in the field of microbiology.
Figure 4.1: The workflow of in silico candidate gene prioritisation. For each candidate gene to be ranked, databases are queried to retrieve information associated with the genes. Such information is then transformed into various gene features, which are quantities suggestive of the degree of relevance to the function of interest. Subsequently, individual gene features are integrated (by feature integration algorithms) to produce an overall rank. DS: data source; f1, f2, ..., fn: gene features. [Diagram not reproduced: candidate genes, database query, data sources (DS1 to DSn), feature processing, features (f1 to fn), feature integration (Σ), prioritised ranks.]
4.1.1 Overview of the in silico CGP process
Experimental biologists frequently encounter the problem of having a large set of candidate genes needing functional validation. Computational CGP methods aim to automate this gene selection process by ranking the list of candidate genes by a relevance score. The process of gene prioritisation begins with the user providing a list of candidate genes, together with either certain search criteria (for example, disease names, keywords, numerical criteria, or sequence features) or a list of genes with a known relationship to the disease of interest (the training genes). The prioritiser then retrieves information related to each candidate gene from different data sources and derives gene features associated with each candidate gene (see Section 4.4). A feature integration algorithm then aggregates the scores from the individual features of each gene to derive a final relevance score, which is used to assign each candidate gene a rank. An ideal prioritiser would place relevant genes (genes that are relevant to the disease of interest) higher on the candidate list. The workflow of CGP is illustrated in Figure 4.1.

A recent case of using the in silico inductive gene prioritiser ENDEAVOUR to rank disease genes was described by Aerts et al. The authors performed a CGP to identify the potential role of the YPEL1 gene in DiGeorge syndrome, a rare congenital condition caused by a deletion in human chromosome 22. Features derived from 11 different data sources and four different ranks related to each of the disease characteristics were used. In their experiments, the YPEL1 gene was ranked within the top 3 of the 58 candidate genes in the 2-million base pair deletion region on the chromosome. The role of YPEL1 was subsequently confirmed with a gene-knockdown zebrafish model [148].
Box 4.1: A case study of candidate gene prioritisation
4.2 Types of CGP
A computational gene prioritiser can be viewed as a specialised IR system (Table 4.1). A fundamental assumption in IR posits that documents relevant to a given query share similar attributes, such as the occurrence of keywords (the cluster hypothesis) [149]. By analogy, the process of gene prioritisation attempts to retrieve the most relevant genes from a gene collection (analogous to finding the most relevant documents in a document corpus). A corresponding “gene cluster” hypothesis holds for identifying genes related to inherited diseases, such that the disease-related genes tend to be well conserved and have important biological roles [150].
Exploiting these conservation properties, candidate genes with similar characteristics could be identified through two well-understood inference mechanisms: characteristics-based or inductive prioritisation.
Table 4.1: Comparison of CGP and IR systems

System feature            Candidate gene prioritiser          Information retrieval system
Search space              Candidate genes                     Document corpus
Primary task              Prioritisation                      Best-match retrieval
Target object             Disease-relevant genes              Document of interest
Input (search criteria)
  Characteristics-based   Gene features of interest           ad hoc queries
                          Name of disease-related genes       Search terms
  Inductive               Training genes                      Documents (training sets in document
                                                              classification and clustering)
Output                    Prioritised gene list               Search results sorted by relevance
Relevance model           Conservation hypothesis             Clustering hypothesis
Data sources
  Primary                 Crude DNA or protein sequences      Document text and structure
                          Gene expression data
  Secondary               Biomedical literature               Metadata (for example, tags of image
                          Gene ontology terms                 or audio multimedia documents)
                          Gene annotations
Example of features
  Co-occurrence           Term co-existence in abstracts      Query term frequencies in a document
                          Proximity of genes in genome        Frequency of term co-occurrence
                          Gene co-expression in tissues       Semantic relations
                          Protein-protein interactions
  Similarity              Sequence similarity                 Similarity between documents
4.2.1 Characteristics-based prioritisation – ranking candidate genes by ad hoc criteria
A characteristics-based prioritiser ranks candidate genes based on a set of user-defined gene features. Such features are usually positively correlated with both the disease of interest and the candidate genes. For example, prioritisers may use vocabulary or literature data, as the co-existence of disease and gene vocabulary in the same biomedical document (e.g. the abstract of a paper) may signify that the gene is associated with the disease of interest [151]. The traditional reciprocal BLAST search for orthology can also be viewed as characteristics-based prioritisation, as genes sharing similar sequences (i.e. lower E-values with higher identity) are generally assumed to have a higher likelihood of sharing similar biological functions [152]. Such sequence-function relationships form the basis by which BLAST extracts relevant genes from a list of candidates.

Figure 4.2: In characteristics-based prioritisation, scores from different gene features are integrated by using an ab initio scoring function. [Diagram not reproduced: candidate genes with feature values (A, B, C, D, ...) pass through feature association and a scoring function ϕ(A, B, C, D, ...) to yield a prioritised list of scored genes.]
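The idea of an ab initio scoring function can be sketched as a weighted sum of user-chosen feature scores. This is only one simple possibility, shown under the assumption that every feature has already been scaled to the range 0 to 1. The gene names, feature names, and weights are hypothetical.

```python
def prioritise(candidates, weights):
    """Rank genes by a weighted sum of feature scores (highest score first).

    candidates : dict mapping gene name -> dict of feature name -> score in [0, 1]
    weights    : dict mapping feature name -> user-defined weight
    """
    def phi(features):
        # A missing feature is treated as 0, i.e. no supporting evidence.
        return sum(weights[name] * features.get(name, 0.0) for name in weights)

    scored = [(gene, phi(features)) for gene, features in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical candidates and weights, for illustration only.
candidates = {
    "geneA": {"literature_cooccurrence": 1.0, "sequence_identity": 0.25},
    "geneB": {"literature_cooccurrence": 0.0, "sequence_identity": 0.75},
    "geneC": {"literature_cooccurrence": 1.0, "sequence_identity": 1.0},
}
weights = {"literature_cooccurrence": 0.5, "sequence_identity": 0.5}
ranked = prioritise(candidates, weights)
```

Because the weights are set by the user rather than learned, the ranking reflects prior beliefs about which features matter, which is the defining property of the characteristics-based approach.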
4.2.2 Inductive prioritisation – finding genes with similar features
An inductive prioritiser builds inference models using genes known to be associated with the function or disease of interest (training genes). Inductive prioritisers assume that functionally similar genes should share similar biological, ontological, or literature features. Inductive CGP is useful in recovering important genes that would otherwise be neglected. Several recently described gene prioritisers, including DGP [153], ENDEAVOUR [148], and PROSPECTR [154], all use inductive models as the method of inference for in silico disease gene discovery.
Figure 4.3: In inductive prioritisation, the scoring function is replaced by a machine learning model, which is trained by using a list of genes known to be associated with the function of interest (training genes). The trained model is then used to predict genes sharing similar characteristics with the training genes. [Diagram not reproduced: known genes with feature values (A, B, C, D, ...) are used to train a model M(A, B, C, D, ...); the trained model M∗ is then applied to the candidate genes to yield a prioritised list of scored genes.]
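The inductive approach can be sketched with a deliberately simple stand-in for the machine learning model: candidates are ranked by how close their feature vectors lie to the centroid of the training genes. Real prioritisers use richer models, so the nearest-centroid rule, the gene names, and the feature values below are all assumptions made for illustration.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def rank_by_training_genes(training, candidates):
    """Order candidate genes from most to least similar to the training genes.

    training, candidates : dicts mapping gene name -> list of feature scores
    """
    centre = centroid(list(training.values()))

    def distance(vector):
        # Euclidean distance to the centroid of the training genes.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(vector, centre)))

    return sorted(candidates, key=lambda gene: distance(candidates[gene]))

# Hypothetical feature vectors, for illustration only.
training = {"known1": [1.0, 0.9, 1.0], "known2": [1.0, 0.7, 1.0]}
candidates = {
    "cand1": [0.0, 0.1, 0.0],
    "cand2": [1.0, 0.8, 1.0],
    "cand3": [1.0, 0.2, 0.0],
}
ranking = rank_by_training_genes(training, candidates)
```

Unlike the user-defined criteria of a characteristics-based prioritiser, the notion of relevance here is derived entirely from the training genes, so the quality of the ranking depends on how representative those genes are.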
4.3 Data sources
4.3.1 Primary data source – raw gene or protein sequences
Gene prioritisers use many sources of knowledge to assess how relevant genes could be. Primary data sources provide the raw nucleotide or polypeptide sequences of a gene. Gene features derived from the primary data sources may include gene length, untranslated terminal regions, intergenic distances, number of exons, or GC content [154]. Gene expression profiles from expressed sequence tag (EST) databases (dbEST¹), microarray expressions, and transcription factor databases (for example, TRANSFAC²) have also been employed [148].
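Two of the primary features named above, gene length and GC content, can be computed directly from a nucleotide sequence. The short sketch below is an illustration with a made-up sequence. The function names are hypothetical, and ambiguous bases are simply counted towards the length.

```python
def gc_content(sequence):
    """Fraction of bases in the sequence that are G or C."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def primary_features(sequence):
    """Derive simple gene features from a raw nucleotide sequence."""
    return {"length": len(sequence), "gc_content": gc_content(sequence)}

# A made-up 12-base sequence, used only to demonstrate the calculation.
features = primary_features("ATGGCGCATTAA")
```

Features of this kind are always available once a sequence is known, which is why primary data sources rarely suffer from missing values.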
4.3.2 Secondary data sources – meta-knowledge about a gene
The use of external knowledge associated with a gene, or secondary data sources, provides information that is otherwise not available in primary data sources. Secondary data can be viewed as the “meta-data” of a gene (by analogy with the meta-data of a document or a multimedia object). Biomedical literature data (for example, MEDLINE abstracts³), ontological relations (for example, Gene Ontology⁴), gene-gene interactions (for example, Biomolecular Interaction Network Database⁵), functional databases (for example, Kyoto Encyclopaedia of Genes and Genomes⁶), and gene homology (for example, BLAST databases⁷) all belong to this category. Intuitively, secondary data sources reflect the state of knowledge about a candidate gene and its relevance to the disease of interest.
4.3.3 Differences between primary and secondary data sources
Primary data sources are generally well determined, but the gene-disease relationships are less obvious compared with those found in secondary data sources. For instance, it is difficult to predict gene function by examining raw polypeptide sequences or gene expression levels. In contrast, secondary data sources provide links with other biological entities and thus present a stronger gene-disease relationship compared with crude sequence data. However, missing values are frequently found in secondary data sources, representing “gaps” in our knowledge. These missing values may introduce potential biases, which are discussed in Section 4.7.1.

¹ dbEST: http://www.ncbi.nlm.nih.gov/dbEST/
² TRANSFAC: http://www.gene-regulation.com/pub/databases.html
³ PubMed: http://www.ncbi.nlm.nih.gov/pubmed/
⁴ Gene Ontology: http://www.geneontology.org/
⁵ BIND: http://www.bind.ca/
⁶ KEGG: http://www.genome.jp/kegg/
⁷ BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
4.4 Gene features
Raw data from primary or secondary data sources need to be refined into meaningful entities, or gene features, suitable for the prioritisation tasks. A gene feature is a gene-specific characteristic that correlates with a phenotype of interest. To facilitate the ranking process, the strengths of gene features are frequently described as Boolean or numerical values (feature scores). A characteristics-based prioritiser combines the feature scores with user-specified criteria, similar to the “search terms” or “queries” used in database searches in IR systems. In an inductive prioritiser, the features form different “attributes” around which the inference models are trained.
Gene features can be broadly classified into two categories: co-occurrence-based or similarity-based.
(a) Co-occurrence: similar to gene 1a, gene 1b may also be a contributing factor to disease A because the gene name “gene 1b” is also co-present with the keyword “disease A” in a biomedical article.

(b) Similarity: genes sharing similar features (for example, nucleotide sequences) are likely to have contributing roles to similar traits.

Figure 4.4: Co-occurrence and similarity are common relevance measures of a gene to a function of interest.

4.4.1 Co-occurrence
Co-occurrence of vocabularies in a biomedical text may suggest functional association between gene and disease
Co-occurrence is a frequently used concept in information retrieval tasks. In CGP, the co-occurrence of gene and disease names in a biomedical text may suggest a possible association between the two, as both terms need to coexist to describe a causal relationship within a single document. Previous CGP studies have quantified the degree of co-occurrence by measuring frequencies or fuzzy membership [151, 155, 156].
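A frequency-based co-occurrence score can be sketched as follows. This is only one simple possibility, not the scoring used in the cited studies: it reports the fraction of documents mentioning a gene name that also mention the disease name. The function `cooccurrence_frequency` and the toy abstracts are assumptions made for illustration; the texts are invented and are not real MEDLINE records.

```python
import re


def cooccurrence_frequency(documents, gene, disease):
    """Fraction of documents mentioning the gene that also mention the disease."""
    gene_pat = re.compile(r"\b" + re.escape(gene) + r"\b", re.IGNORECASE)
    disease_pat = re.compile(r"\b" + re.escape(disease) + r"\b", re.IGNORECASE)
    with_gene = [doc for doc in documents if gene_pat.search(doc)]
    if not with_gene:
        return 0.0  # gene never mentioned, so there is no co-occurrence evidence
    both = [doc for doc in with_gene if disease_pat.search(doc)]
    return len(both) / len(with_gene)


# Toy abstracts (invented text).
abstracts = [
    "Gene1a mutations are associated with disease A in patients.",
    "Expression of gene1b is altered in disease A.",
    "gene1b encodes a membrane protein.",
    "Disease A has a complex aetiology.",
]

score_1a = cooccurrence_frequency(abstracts, "gene1a", "disease A")
score_1b = cooccurrence_frequency(abstracts, "gene1b", "disease A")
print(score_1a, score_1b)
```

Exact string matching of this kind ignores gene synonyms and spelling variants, so a practical system would need name normalisation before counting.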
Co-occurrence of biological entities within a higher structure may suggest a functional association
The concept of co-occurrence can be extended to entities other than biomedical texts, such as gene ontologies or metabolic pathways. In these cases, genes sharing the same ontology terms or metabolic pathway could be postulated to have similar functions or similar pathogenic potential. In addition, interactions between genes or gene products can also be viewed as co-occurrences. For example, the use of the protein-protein interaction database BIND may assist the discovery of genes participating in the same functional unit [157]. Expression data, such as expressed sequence tag (EST) databases, have also been used to identify genes sharing similar co-expression patterns in the same tissue, as these are more likely to have similar functional roles. Finally, genes located close to a known disease gene or genomic region may be likely candidates, as they may lie in the same gene cluster and be expressed together in vivo [148, 155].
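The idea that genes sharing ontology terms or pathways may share function can be illustrated with a set-overlap score. The sketch below uses the Jaccard index, which is a choice made here for illustration and is not stated in the text above. The annotation identifiers are placeholders, not real Gene Ontology or KEGG identifiers.

```python
def jaccard(a, b):
    """Jaccard index of two sets: |a ∩ b| / |a ∪ b| (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical annotation sets (placeholder term and pathway identifiers).
annotations = {
    "known_gene": {"TERM:1", "TERM:2", "PATH:7"},
    "candidate_x": {"TERM:1", "PATH:7"},
    "candidate_y": {"TERM:9"},
}

known = annotations["known_gene"]
overlap = {
    gene: jaccard(terms, known)
    for gene, terms in annotations.items()
    if gene != "known_gene"
}
print(overlap)
```

Under this score, a candidate sharing most of its annotations with a known disease gene ranks above a candidate with no shared annotations.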
4.4.2 Similarity
Similarity between sequences may suggest similar function
The degree of similarity between the features of a candidate gene and a known gene may suggest functional similarity. For example, sequence comparison algorithms may determine gene similarity. Figure 4.4(b) illustrates how similarity measures (e.g. identity or BLAST E-value) may suggest that a candidate gene contributes to the disease of interest through homology. In this example, gene 2a is known to be associated with disease B in a mouse model. Another gene (gene 2b), whose correlation with the disease is unknown, shares a similar sequence with gene 2a. Thus, it is reasonable to postulate that a defect in gene 2b may also contribute to disease B in human.
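To make the identity measure concrete, the sketch below computes percent identity between two sequences that are assumed to be already aligned. It is a simplified stand-in: it performs no alignment and does not compute an E-value, both of which BLAST does. The function `percent_identity` and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of aligned columns with identical residues.

    Columns containing a gap ('-') are counted as mismatches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


# Toy pre-aligned nucleotide sequences ('-' marks a gap); not real gene sequences.
gene_2a = "ATGGCCTTAG"
gene_2b = "ATGGCATT-G"

identity = percent_identity(gene_2a, gene_2b)
print(identity)
```

A high identity between a candidate and a gene with a known disease association would then serve as a numerical feature score for the candidate.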
Similar biomedical texts referring to a target disease name may suggest likely gene candidates
Similarity measures can also be derived from the content of biomedical texts or ontology terms. The concept of similarity may be represented by functions of two text vectors, such as the Euclidean distance (smaller for more similar texts) or the cosine similarity (higher for more similar texts), to indicate how similar a document with a known disease-gene relationship is to another document [148, 155].
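The cosine similarity between two text vectors can be sketched as below, assuming a plain bag-of-words representation with raw term counts. Tokenisation by whitespace, the absence of term weighting, and the example texts are simplifying assumptions made here; the texts are invented.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between bag-of-words count vectors of two texts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(count * vec_b[word] for word, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Invented example texts.
known_doc = "capsule synthesis gene linked to invasive disease"
candidate_doc = "capsule synthesis gene of unknown function"
unrelated_doc = "ribosomal protein assembly"

sim_related = cosine_similarity(known_doc, candidate_doc)
sim_unrelated = cosine_similarity(known_doc, unrelated_doc)
print(sim_related, sim_unrelated)
```

With count vectors the score lies between 0 and 1, and a document sharing more vocabulary with the known disease-gene document scores higher.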
4.5 Feature integration
Using a single feature to rank a list of candidate genes may be insufficient to discriminate relevant genes, so multiple features are usually required. The process of feature integration aggregates individual features into a single score that generates an overall gene rank.
In general, the relevance score function can be written as: