THE UNIVERSITY OF CHICAGO

MACHINE LEARNING FOR THE GENOTYPE TO PHENOTYPE PROBLEM

A DISSERTATION SUBMITTED TO
THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCES
IN CANDIDACY FOR THE DEGREE OF
DOCTORATE OF PHILOSOPHY

DEPARTMENT OF COMPUTER SCIENCE

BY
JOHN WILLIAM SANTERRE

CHICAGO, ILLINOIS
DRAFT
DECEMBER 2017

Copyright © 2017 by John William Santerre
All Rights Reserved

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
ABSTRACT
INTRODUCTION

I GENOTYPE TO PHENOTYPE: ANTIMICROBIAL RESISTANCE

1 OVERVIEW

2 DETECTING ANTIMICROBIAL RESISTANCE IN THE LAB
    2.1 Biological concepts
    2.2 AMR Protocol

3 ANTIMICROBIAL RESISTANCE AS A SUPERVISED MACHINE LEARNING PROBLEM
    3.1 k-mer Notation
        3.1.1 Classification Using k-mer Matrices
    3.2 Model Selection and Tuning
    3.3 Model Considerations
        3.3.1 Hyperparameter Selection
    3.4 Metrics
        3.4.1 Classification performance
        3.4.2 Feature Importance Calculation
        3.4.3 Gene Region Identification
    3.5 Model Comparison
        3.5.1 Naive Bayes
        3.5.2 AdaBoost
        3.5.3 Logistic Regression
        3.5.4 Support Vector Machines
        3.5.5 Random Forest

4 ANALYSIS
    4.1 Data Sets
        4.1.1 Acinetobacter baumannii
        4.1.2 Staphylococcus aureus
        4.1.3 Mycobacterium tuberculosis
        4.1.4 Klebsiella pneumoniae
    4.2 Classifier Comparison
        4.2.1 Classifier Performance
        4.2.2 Speed
    4.3 Random Forest
        4.3.1 RF subtrees
        4.3.2 RF on subsets of the data
    4.4 Feature Importance Calculation
        4.4.1 Biological relevance
        4.4.2 Feature Importance Stability

5 COMPRESSED MATRIX FORMULATION
    5.1 Compressed Matrix Construction
    5.2 Experiments

A SUPPLEMENTARY MATERIAL
    A.1 AMR: Classification
    A.2 AMR: Gene Stability
    A.3 AMR: Code
    A.4 PGR

LIST OF FIGURES

3.1 Example dataset from which the k-mer matrix in (3.1) is constructed. SUS genomes are labelled 0. RES genomes are labelled 1 to denote the presence of a mutation conferring resistance to a specific antibiotic.

3.2 Schematic of the AMR phenotype classification workflow using k-mers.

3.3 An example RF tree.

4.1 Overview of the A. baumannii dataset. The k-mer size (shown on the x axis) affects different metrics of the k-mer matrix identically.

4.2 Histogram of the entries in the k-mer matrix for the (a) A. baumannii and (b) S. aureus datasets. Note that increasing the size k can have a dramatic effect on the distribution of entries of the k-mer matrix.

4.3 Overview of the A. baumannii dataset. The k-mer size (shown on the x axis) affects different metrics of the k-mer matrix identically.

4.4 ROC curves for classifiers trained on each of the A. baumannii and S. aureus datasets (k = 15).

4.5 ROC curves for classifiers trained on each of the M. tuberculosis datasets listed in Table 4.1.

4.6 ROC curves for different classifiers trained on the A. baumannii dataset. Each ROC curve corresponds to a different size k.

4.7 ROC curves for different classifiers trained on the S. aureus dataset. Each ROC curve corresponds to a different size k.

4.8 ROC curves for different classifiers trained on the binary A. baumannii dataset. Each ROC curve corresponds to a different size k.

4.9 ROC curves for different classifiers trained on the binary S. aureus dataset. Each ROC curve corresponds to a different size k.

4.10 Accuracy of different classifiers on the (binary) A. baumannii and S. aureus datasets. Additional performance metrics can be found in the supplementary Tables A.1, A.2, A.3 and A.4 in Appendix A.1.

4.11 Execution times for different classifiers run on the A. baumannii dataset. NB denotes Naive Bayes.

4.12 ROC curves for the RF classifier trained on the K. pneumoniae dataset.

4.13 ROC curves for the RF classifier trained with different numbers of trees on each of the A. baumannii and S. aureus datasets.

4.14 ROC curves for the RF classifier trained on training sets of increasing sizes. In particular, RF was trained on subsets of the M. tuberculosis rifampicin and streptomycin datasets from Table 4.1. The subset size n denotes the total number of isolates subsampled from each dataset in each experiment (with the number of RES and SUS isolates equal to n/2 in each case).

4.15 Plot of the ranking (y axis) of each PEG (x axis) by the aggregate score computed by A1, as described in the text. Additional information about the PEGs can be found in supplementary Tables A.12 and A.11 in Appendix A.2.

4.16 Plot of the ranking (y axis) of each PEG (x axis) by the aggregate score computed by A2, as described in the text. Additional information about the PEGs can be found in supplementary Tables A.13 and A.14 in Appendix A.2.

4.17 k-mer ranking according to the feature importance computed by RF on the full or the compressed k-mer matrix. Collapsing by column ID refers to calculating the feature importance separately on the full k-mer matrix and then summing the feature importance for all columns that have an identical column identity in the k-mer matrix.

5.1 Python pseudocode for constructing a k-mer matrix.

5.2 Pseudocode for constructing a compressed k-mer matrix.

5.3 Overview of the compressed A. baumannii and S. aureus datasets. Similarly to the full k-mer matrix case (see Figures 4.1 and 4.3), the k-mer size (shown on the x axis) affects different metrics of the k-mer matrix identically.

5.4 Accuracy of different classifiers on the compressed (binary) A. baumannii and S. aureus datasets. Additional performance metrics can be found in the supplementary Tables A.5, A.6, A.7 and A.8 in Appendix A.1.

5.5 ROC curves for different classifiers trained on the compressed A. baumannii dataset. Each ROC curve corresponds to a different size k.

5.6 ROC curves for different classifiers trained on the compressed S. aureus dataset. Each ROC curve corresponds to a different size k.

5.7 ROC curves for different classifiers trained on the compressed binary A. baumannii dataset. Each ROC curve corresponds to a different size k.

5.8 ROC curves for different classifiers trained on the compressed binary S. aureus dataset. Each ROC curve corresponds to a different size k.

LIST OF TABLES

4.1 List of the antibiotics used for the M. tuberculosis isolates. For each antibiotic we list the total number of isolates used for classification, with the specific numbers of resistant and susceptible isolates listed in the fourth and fifth columns, respectively. Additional details about the dataset can be found in (Davis et al., 2016). The last column shows the number of features when k = 15.

4.2 List of antibiotics used for the K. pneumoniae isolates. For each antibiotic we list the total number of isolates which have a label (i.e., resistant or susceptible) for that antibiotic. The specific numbers of resistant and susceptible isolates are listed in the fourth and fifth columns, respectively.

4.3 Classifier comparison on the A. baumannii dataset (classification of resistance to carbapenem) and the S. aureus dataset (classification of resistance to methicillin).

4.4 Classifier comparison on each of the seven M. tuberculosis datasets listed in Table 4.1.

4.5 Comparison of RF performance on the K. pneumoniae dataset.

4.6 Comparison of the RF classifier trained with different numbers of trees on the A. baumannii dataset (classification of resistance to carbapenem) and the S. aureus dataset (classification of resistance to methicillin).

4.7 Comparison of RF performance on training sets of increasing sizes. In particular, we trained RF on subsets of the M. tuberculosis rifampicin and streptomycin datasets from Table 4.1. The subset size n denotes the total number of isolates subsampled from each dataset in each experiment (with the number of RES and SUS isolates equal to n/2 in each case). All statistics are averaged over three independent runs, with 5-fold cross-validation performed in each run.

4.8 The top 20 k-mer matrix features computed by training RF on the k-mer matrix of the M. tuberculosis rifampicin dataset. Each classifier consists of 10 trees. FI denotes feature importance. [Change col index?]

4.9 The top 20 k-mer matrix features computed by training RF on the k-mer matrix of the M. tuberculosis rifampicin dataset. Each classifier consists of 1000 trees. FI denotes feature importance. [Change col index?]

4.10 The top 20 k-mer matrix features computed by training RF on the compressed k-mer matrix of the M. tuberculosis rifampicin dataset. Each classifier consists of 10 trees. FI denotes feature importance.

4.11 The top 20 k-mer matrix features computed by training RF on the compressed k-mer matrix of the M. tuberculosis rifampicin dataset. Each classifier consists of 1000 trees. FI denotes feature importance.

A.1 Performance of different classifiers on the A. baumannii dataset for different k-mer sizes.