Classifying Peroxiredoxin Subgroups and Identifying
CLASSIFYING PEROXIREDOXIN SUBGROUPS AND IDENTIFYING DISCRIMINATING MOTIFS VIA MACHINE LEARNING

BY JIAJIE XIAO

A Thesis Submitted to the Graduate Faculty of WAKE FOREST UNIVERSITY GRADUATE SCHOOL OF ARTS AND SCIENCES in Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE
Computer Science

May, 2018
Winston-Salem, North Carolina

Approved By:
William Turkett, Jr., Ph.D., Advisor
David John, Ph.D., Chair
Grey Ballard, Ph.D.
James Pease, Ph.D.

Dedication

This work is dedicated to my parents, Mingfu Xiao and Lizhu Xu, and my sister, Wanling Xiao. Because of their hard work, I get to be exactly who I want to be.

Acknowledgments

First, I would like to express my gratitude to the Center for Molecular Communication and Signaling for supporting me as a Research Fellow for the fall semester of the 2017-2018 academic year. The extra time and focus have helped me expand my research project in directions I would not have been able to explore otherwise.

I would also like to thank the Department of Computer Science at Wake Forest University. I appreciate all of the resources and support that the department has provided to help me complete my thesis. I would also like to thank the professors who have been very helpful and welcoming during coursework, seminars, and hallway passings.

I want to thank Dr. John, Dr. Ballard, and Dr. Pease for being on my committee. I appreciate your input and the time you are taking to help me complete my thesis. I want to thank Dr. Poole for the interactions in the Prx subgroup classification project. I also want to acknowledge Dr. Colyer and Dr. Bonin for giving me permission to use the thrombin aptamer sequencing data. In addition, I want to thank Dr. Salsbury for his support of my graduate study and career development in general.

Most importantly, I would like to thank Dr. Turkett. Taking your class and doing research with you was the most memorable and pleasant experience of my time at Wake.
Your welcoming, friendly, and encouraging attitude has inspired how I get along with people and face difficulties. Thank you for all your help and support.

Lastly, I would like to thank my family for their emotional support and encouraging discussions on the prospects of this project. Thank you for being there for me.

Table of Contents

Acknowledgments
List of Figures
List of Tables
List of Abbreviations
Abstract
Chapter 1 Decoding biological sequences through bioinformatics and machine learning
1.1 Biological introduction
1.2 Bioinformatics and machine learning introduction
1.3 Contribution of this work to bioinformatics and machine learning
References
Chapter 2 Sequence-based classification and sequence motifs in protein and nucleic acids
2.1 Sequence-based classifications
2.1.1 kmer SVM
2.2 Motif discovery
2.2.1 Types of Motifs
2.2.2 Motif recognition and identification
2.2.3 Motif evaluation
References
Chapter 3 Kmer based classifiers extract functionally relevant features to support accurate Peroxiredoxin subgroup distinction
3.1 Introduction
3.2 Materials and Methods
3.2.1 Data acquisition
3.2.2 Model construction
3.3 Results
3.3.1 Classifier performance
3.3.2 Distinguishing kmers
3.4 Discussion
3.4.1 Classification process comparison
3.4.2 Classification performance comparison
3.4.3 Analysis of distinguishing kmers
3.4.4 Limitations in analysis
3.5 Conclusions
References
Chapter 4 Finding gapped and ungapped motifs using alignment-free two-round machine learning
4.1 Introduction
4.2 Materials and Methods
4.2.1 Approach
4.2.2 Implementations
4.2.3 Data acquisition
4.3 Results
4.3.1 Case study 1: CCAXXXGGXG
4.3.2 Case study 2: CTCCC(8X)ATT
4.3.3 Case study 3: SXPWK[AK]XP
4.3.4 Case study 4: Binding motifs in aptamer selection experiments
4.4 Discussion
4.4.1 Sequence classifier comparisons
4.4.2 Limitations
4.4.3 Robustness of the methods
4.5 Conclusions
References
Chapter 5 Future directions
5.1 Algorithm limitations and improvements
5.2 Potential biological applications
5.3 Other potential applications
References
Curriculum Vitae

List of Figures

2.1 An example of a linear SVM. Support vectors, denoted by the dark-colored nodes, define the class margins. Parameters defining these support vectors are tuned during training to maximize the marginal distance between instances of the two classes.
2.2 A linear SVM classifier. Once a linear SVM is constructed, a sequence with kmer-encoded features x (≥ 0) can result in a predicted functional-positive or -negative class.
3.1 For all 63 proteins in the Harper data set where all 3merSVM classifiers returned a negative score, the maximum score for each protein is plotted. Plus shapes indicate the scores for the 60 proteins where the 3merSVM classification matched the Harper et al. classification, while triangle shapes indicate the 3 proteins where there was a mismatch between the two approaches to classification.
3.2 A Weblogo alignment of the regions +/- 5 residues centered on the 3mer FWP, extracted from the sequences in which all four AhpE 3mers shown in Table 3.11 occur.
3.3 A Weblogo alignment of the regions +/- 5 residues centered on the 3mer LPF, extracted from the sequences in which all three Tpx 3mers shown in Table 3.12 occur.
3.4 A Weblogo alignment of the regions +/- 5 residues centered on the 3mer CPA, extracted from the Prx1 sequences that contain the CPA 3mer.
4.1 An example of feature extraction guided by the first-round SVM. Different kmers and their normalized feature weights are stacked according to the original sequence. Accumulated weight aW can be computed for each character accordingly. As a result, motif candidates can be extracted from the positions with high aW.
4.2 A complex of thrombin and aptamer. The structure is visualized using VMD 1.9.2 [15] according to PDB 4I6Y [16]. The thrombin is colored in green. The aptamer is colored in yellow. Bases corresponding to GTCACCCCAAC are highlighted by the red Licorice representation.
4.3 Accuracy comparisons of different sequence classifiers. Accuracies in 10-fold cross validations of each classifier on sequences from case study 1 are displayed as box plots. kmer-SVMs are indicated by the first 9 entries. The next three entries present gkm-SVMs using l=10 and k=4,5,6.
The last entry indicates the second-round SVM using the features chosen via guidance from the first-round SVM learning.

List of Tables

1.1 Common amino acids and their side chain properties and corresponding DNA codons. Polarity also suggests solubility: a polar side chain is hydrophilic and a nonpolar one is hydrophobic. Acidic and basic properties are also reflected by the electrical signs "-" and "+" respectively, while electrically neutral is denoted by "N".
1.2 Common hierarchical structures of proteins and nucleic acids.
1.3 Types of omics-related studies.
3.1 Each column represents the number of examples for a Prx subgroup available in the corresponding data set. The Harper-SFLD data set represents the intersection of the Harper data set with the subgroup-annotated Peroxiredoxins available in SFLD as of December 2017. The 0.95-Harper-SFLD data set represents the representative proteins after clustering each subgroup of the Harper-SFLD data set using the CD-Hit algorithm with a 95% similarity setting.
3.2 The confusion matrix represents results from testing on the Harper data set. For a given protein, the row represents the known annotation (as per Harper et al.) and the column represents the annotation suggested by the 3merSVM classifier. The counts represent how many proteins had each pair of annotations; large values along the diagonal, which represent matching annotations, are ideal.
3.3 Columns represent distinguishing 3mers for the AhpE, Prx1, and Prx5 subgroups, the percentage of corresponding subgroup active site pseudo signatures from the Harper data that each 3mer occurs in, and the location relative