CLASSIFYING PEROXIREDOXIN SUBGROUPS AND IDENTIFYING

DISCRIMINATING MOTIFS VIA MACHINE LEARNING

BY

JIAJIE XIAO

A Thesis Submitted to the Graduate Faculty of

WAKE FOREST UNIVERSITY GRADUATE SCHOOL OF ARTS AND SCIENCES

in Partial Fulfillment of the Requirements

for the Degree of

MASTER OF SCIENCE

Computer Science

May, 2018

Winston-Salem, North Carolina

Approved By:

William Turkett, Jr., Ph.D., Advisor

David John, Ph.D., Chair

Grey Ballard, Ph.D.

James Pease, Ph.D.

Dedication

This work is dedicated to my parents, Mingfu Xiao and Lizhu Xu, and to my sister, Wanling Xiao. Because of their hard work, I get to be exactly who I want to be.

Acknowledgments

First, I would like to show my gratitude to the Center for Molecular Communication and Signaling for supporting me as a Research Fellow during the fall semester of the 2017-2018 academic year. The extra time and focus helped me expand my research project in directions I would not have been able to explore otherwise. I would also like to thank the Department of Computer Science at Wake Forest University; I appreciate all of the resources and support that the department has provided me in order to complete my thesis. I would also like to thank the professors who have been so helpful and welcoming during coursework, seminars, and hallway passings. I want to thank Dr. John, Dr. Ballard, and Dr. Pease for serving on my committee; I appreciate your input and the time you have taken to help me complete my thesis. I want to thank Dr. Poole for the interactions in the Prx subgroup classification project. I also want to acknowledge Dr. Colyer and Dr. Bonin for granting me permission to use the thrombin aptamer sequencing data. In addition, I want to thank Dr. Salsbury for his support of my graduate study and career development in general. Most importantly, I would like to thank Dr. Turkett. Taking your class and doing research with you was the most memorable and pleasant experience during my time at Wake. Your welcoming, friendly, and encouraging attitude has shown me how to get along with people and face difficulties. Thank you for all your help and support. Lastly, I would like to thank my family for their emotional support and encouraging discussions on the prospects of this project. Thank you for being there for me.

Table of Contents

Acknowledgments

List of Figures

List of Tables

List of Abbreviations

Abstract

Chapter 1  Decoding biological sequences through bioinformatics and machine learning
  1.1 Biological introduction
  1.2 Bioinformatics and machine learning introduction
  1.3 Contribution of this work to bioinformatics and machine learning
  References

Chapter 2  Sequence-based classification and sequence motifs in proteins and nucleic acids
  2.1 Sequence-based classifications
    2.1.1 kmer SVM
  2.2 Motif discovery
    2.2.1 Types of Motifs
    2.2.2 Motif recognition and identification
    2.2.3 Motif evaluation
  References

Chapter 3  Kmer-based classifiers extract functionally relevant features to support accurate Peroxiredoxin subgroup distinction
  3.1 Introduction
  3.2 Materials and Methods
    3.2.1 Data acquisition
    3.2.2 Model construction
  3.3 Results
    3.3.1 Classifier performance
    3.3.2 Distinguishing kmers
  3.4 Discussion
    3.4.1 Classification process comparison
    3.4.2 Classification performance comparison
    3.4.3 Analysis of distinguishing kmers
    3.4.4 Limitations in analysis
  3.5 Conclusions
  References

Chapter 4  Finding gapped and ungapped motifs using alignment-free two-round machine learning
  4.1 Introduction
  4.2 Materials and Methods
    4.2.1 Approach
    4.2.2 Implementations
    4.2.3 Data acquisition
  4.3 Results
    4.3.1 Case study 1: CCAXXXGGXG
    4.3.2 Case study 2: CTCCC(8X)ATT
    4.3.3 Case study 3: SXPWK[AK]XP
    4.3.4 Case study 4: Binding motifs in aptamer selection experiments
  4.4 Discussion
    4.4.1 Sequence classifier comparisons
    4.4.2 Limitations
    4.4.3 Robustness of the methods
  4.5 Conclusions
  References

Chapter 5  Future directions
  5.1 Algorithm limitations and improvements
  5.2 Potential biological applications
  5.3 Other potential applications
  References

Curriculum Vitae

List of Figures

2.1  An example of a linear SVM. Support vectors, denoted by the dark-colored nodes, define the class margins. Parameters defining these support vectors are tuned during training to maximize the margin between instances of the two classes.

2.2  A linear SVM classifier. Once a linear SVM is constructed, a sequence with kmer-encoded features x can be assigned to the predicted functional-positive (score ≥ 0) or functional-negative (score < 0) class.

3.1  For all 63 proteins in the Harper data set where all 3merSVM classifiers returned a negative score, the maximum score for each protein is plotted. Plus shapes indicate the scores for the 60 proteins where the 3merSVM classification matched the Harper et al. classification, while triangle shapes indicate the 3 proteins where there was a mismatch between the two approaches to classification.

3.2  A Weblogo alignment of the regions +/- 5 residues centered on the 3mer FWP, extracted from the sequences in which all four AhpE 3mers shown in Table 3.11 occur.

3.3  A Weblogo alignment of the regions +/- 5 residues centered on the 3mer LPF, extracted from the sequences in which all three Tpx 3mers shown in Table 3.12 occur.

3.4  A Weblogo alignment of the regions +/- 5 residues centered on the 3mer CPA, extracted from the Prx1 sequences that contain the CPA 3mer.

4.1  An example of feature extraction guided by the first-round SVM. Different kmers and their normalized feature weights are stacked according to the original sequence. An accumulated weight aW can then be computed for each character. As a result, motif candidates can be extracted from the positions with high aW.

4.2  A complex of thrombin and aptamer. The structure is visualized using VMD 1.9.2[15] according to PDB entry 4I6Y[16]. The thrombin is colored in green; the aptamer is colored in yellow. Bases corresponding to GTCACCCCAAC are highlighted by the red Licorice representation.

4.3  Accuracy comparisons of different sequence classifiers. Accuracies in 10-fold cross validation for each classifier on sequences from case study 1 are displayed as box plots. kmer-SVMs are indicated by the first 9 entries. The next three entries present gkm-SVMs using l=10 and k=4,5,6. The last entry indicates the second-round SVM using the features chosen via guidance from the first-round SVM learning.

List of Tables

1.1  Common amino acids, their side chain properties, and corresponding DNA codons. Polar and nonpolar character can also suggest solubility: a polar side chain indicates a hydrophilic residue and a nonpolar one a hydrophobic residue. Acidic and basic properties are also reflected by the electrical signs "-" and "+" respectively, while electrically neutral is denoted by "N".

1.2  Common hierarchical structures of proteins and nucleic acids.

1.3  Types of omics-related studies.

3.1  Each column represents the number of examples for a Prx subgroup available in the corresponding data set. The Harper-SFLD data set represents the intersection of the Harper data set with the subgroup-annotated Peroxiredoxins available in SFLD as of December 2017. The 0.95-Harper-SFLD data set represents the representative proteins after clustering each subgroup of the Harper-SFLD data set using the CD-Hit algorithm with a 95% similarity setting.

3.2  The confusion matrix represents results from testing on the Harper data set. For a given protein, the row represents the known annotation (as per Harper et al.) and the column represents the annotation suggested by the 3merSVM classifier. The counts represent how many proteins had each pair of annotations, with large values along the diagonal, representing matching annotations, being ideal.

3.3  Columns represent distinguishing 3mers for the AhpE, Prx1, and Prx5 subgroups, the percentage of corresponding subgroup active site pseudo-signatures from the Harper data that each 3mer occurs in, and the location relative to the active site for 3mers with low occurrence in the active site profile. For the location data, Ext indicates a location that is an extension of published active site fragments, while Out indicates a location distinct from published active site fragments. The rank ordering is based on weights extracted from the learned subgroup models.

3.4  Columns represent distinguishing 3mers for the Prx6, PrxQ, and Tpx subgroups, the percentage of corresponding subgroup active site pseudo-signatures from the Harper data that each 3mer occurs in, and the location relative to the active site for 3mers with low occurrence in the active site profile. For the location data, Ext indicates a location that is an extension of published active site fragments, while Out indicates a location distinct from published active site fragments. The rank ordering is based on weights extracted from the learned subgroup models.

3.5  Each row represents a method that can be employed to annotate Prx proteins to the subgroup level. Each column beyond the first represents a feature of a given method. Web represents whether or not a method is available via a web interface. Batch indicates whether more than one protein sequence can be processed at a time. Prx-specific indicates whether an annotation method is specific only to Prx subfamilies. Hierarchical indicates whether more generalized annotations can be returned.

3.6  Each row represents the canonical active site motif for a Prx subgroup as defined by Harper et al. A lowercase 'x' represents any residue at that position. Uppercase values in parentheses represent a set of values possible at a given position.

3.7  Each row represents, for each Prx subgroup, percentages of the sequences in the Harper data set associated with that subgroup that match just the canonical active site motif for that subgroup (Exact Match %), that match one or more additional subgroup motifs beyond the canonical motif for that subgroup (Non-Unique Match %), and that don't match the canonical motif for that subgroup (No Match %). Summed values across a row equal 100%, barring small rounding issues.

3.8  Each labeled entry represents a protein for which the 3merSVM approach returned a label different from that provided by Harper et al.[3]. The first column represents the Genbank id for a protein. The second and third columns represent the Harper and 3merSVM annotations respectively. The fourth column shows which canonical motifs exist in the protein sequence, with multiple possible motifs shown one per row. The fifth column indicates the annotation provided by a search against SFLD using the SFLD HMM tool. The sixth column indicates the annotation provided by a search against the PREX database, taking the annotation of the highest scoring match. The seventh column indicates the conserved domain provided by an NCBI CDD search on the sequence. A - symbol indicates no match or value was returned by a given technique.

3.9  Each row represents a protein for which the 3merSVM approach returned a label different from that provided in Harper et al.[3]. The first column represents the Genbank id for a protein. The remaining columns provide the score returned from each subgroup-specific model. The maximum score for each protein is in bold font.

3.10  The matrix represents the distribution of subgroup annotations for proteins from the Harper data set which received all negative scores from the 3merSVM models. For a given protein, the row represents the known annotation (as per Harper et al.) and the column represents the annotation suggested by the developed classifier. The counts represent how many proteins had each pair of annotations.

3.11  Each row represents a 3mer of interest or the set of all listed 3mers. Each column represents a data set of interest. The counts represent in how many proteins of the data set a given 3mer or set of 3mers occurs. The title of each column indicates both the name of and, in parentheses, the total number of proteins in a given data set.

3.12  Each row represents a 3mer of interest or the set of all listed 3mers. Each column represents a data set of interest. The counts represent in how many proteins of the data set a given 3mer or set of 3mers occurs. The title of each column indicates both the name of and, in parentheses, the total number of proteins in a given data set.

4.1  Evaluation of motif candidates in case study 1. Motif candidates that have the top 5 greatest SVM weights in the second-round SVM learning are listed.

4.2  Evaluation of motif candidates in case study 2. Motif candidates that have the top 5 greatest SVM weights in the second-round SVM learning are listed.

4.3  Evaluation of motif candidates in case study 3. Motif candidates that have the top 10 greatest SVM weights in the second-round SVM learning are listed.

4.4  Evaluation of motif candidates in case study 4. Motif candidates with the top 10 greatest enrichment factors are listed.

4.5  Evaluation of motif candidates in case study 1 (shuffled negative sequences). Motif candidates that have the top 5 greatest SVM weights in the second-round SVM learning are listed.

4.6  Evaluation of motif candidates in case study 2 (shuffled negative sequences). Motif candidates that have the top 5 greatest SVM weights in the second-round SVM learning are listed.

4.7  Evaluation of motif candidates in case study 3 (shuffled negative sequences). Motif candidates that have the top 10 greatest SVM weights in the second-round SVM learning are listed.

List of Abbreviations

Abbreviation  Meaning
DNA     Deoxyribonucleic Acid
RNA     Ribonucleic Acid
VMD     Visual Molecular Dynamics
WFU     Wake Forest University
MEME    Multiple EM for Motif Elicitation
EM      Expectation Maximization
PWM     Position Weight Matrix
SVM     Support Vector Machine
BLAST   Basic Local Alignment Search Tool
SFLD    Structure Function Linkage Database
PDB     Protein Data Bank
GWAS    Genome-Wide Association Study
TBA     Thrombin Binding Aptamer
TP      True Positive
FP      False Positive
TN      True Negative
FN      False Negative
PRX     Peroxiredoxin
PAM     Point Accepted Mutation
HTS     High-Throughput Sequencing

Abstract

Accurate and automated functional annotation is a pressing open problem, with functional characterizations lagging far behind the exponential growth in biological sequence databases. In this thesis, I present our recent development of machine learning methods for high-throughput, accurate, sequence-based functional annotation. Chapter 1 describes the biological and computational background of this study. Chapter 2 defines the specific problem we try to solve. Chapter 3 demonstrates that our 3mer-SVM, which accurately classifies Peroxiredoxin subgroups, can provide meaningful additional insight into the functionally conserved sites in Peroxiredoxin proteins. Moreover, in Chapter 4, we propose a two-round learning algorithm that can capture gapped-kmer features in sequences and lead to more accurate classifications than the kmer-SVM approach. We illustrate that this learning algorithm can be useful as a de novo motif finder for uncovering discriminating motifs among sequences associated with particular activities and functions. After a brief discussion of the advantages and limitations of our kmer-based sequence classification and de novo motif identification, in Chapter 5 we propose several potential applications as future directions.


Chapter 1

Decoding biological sequences through bioinformatics and machine learning

1.1 Biological introduction

Proteins and nucleic acids are important macromolecules in organisms. They carry out essential functions to maintain the characteristics of life, including metabolism, homeostasis, reproduction, etc.[1,2] Both of these molecules are biopolymers composed of fundamental monomers[2]. For proteins, there are twenty amino acids commonly utilized in protein synthesis. For nucleic acids, five standard nucleobases (i.e., adenine, guanine, cytosine, thymine, and uracil) form the nucleic acid polymers deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). These subunits exhibit different biophysical properties (Table 1.1) and ultimately give rise to biological diversity. A succession of characters indicating the order of the amino acids or nucleotides in a protein or nucleic acid chain is referred to as the sequence of that protein or nucleic acid. As an essential property of these biopolymers, the sequence is not just a unique signature indicating their biochemical composition; it also encodes their biological features and functions[2].

One famous example of how a protein's sequence regulates its biological behavior is protein folding. In 1973, Anfinsen demonstrated that, under the same experimental conditions, a protein folds into a particular native structure that depends only on its amino acid sequence[3]. Although many local minima can exist in the free energy surface – a statistical physics concept that describes the likelihood of biomolecular

Table 1.1: Common amino acids, their side chain properties, and corresponding DNA codons. Polar and nonpolar character can also suggest solubility: a polar side chain indicates a hydrophilic residue and a nonpolar one a hydrophobic residue. Acidic and basic properties are also reflected by the electrical signs "-" and "+" respectively, while electrically neutral is denoted by "N".

Amino acid (1/3-letter codes) | Properties | DNA codons
I/Ile | Nonpolar, N | ATT, ATC, ATA
L/Leu | Nonpolar, N | CTT, CTC, CTA, CTG, TTA, TTG
V/Val | Nonpolar, N | GTT, GTC, GTA, GTG
F/Phe | Nonpolar, N | TTT, TTC
M/Met | Nonpolar, N | ATG
C/Cys | Polar, N | TGT, TGC
A/Ala | Nonpolar, N | GCT, GCC, GCA, GCG
G/Gly | Nonpolar, N | GGT, GGC, GGA, GGG
P/Pro | Nonpolar, N | CCT, CCC, CCA, CCG
T/Thr | Polar, N | ACT, ACC, ACA, ACG
S/Ser | Polar, N | TCT, TCC, TCA, TCG, AGT, AGC
Y/Tyr | Nonpolar, N | TAT, TAC
W/Trp | Nonpolar, N | TGG
Q/Gln | Polar, N | CAA, CAG
N/Asn | Polar, N | AAT, AAC
H/His | Polar, + | CAT, CAC
E/Glu | Polar, - | GAA, GAG
D/Asp | Polar, - | GAT, GAC
K/Lys | Polar, + | AAA, AAG
R/Arg | Polar, + | CGT, CGC, CGA, CGG, AGA, AGG
Start codon | NA | ATG
Stop codons | NA | TAA, TAG, TGA

conformations – can exist, the specific interactions among amino acids lead to specific hierarchical spatial organizations (Table 1.2) with a global free energy minimum. Moreover, particular combinations of amino acids may further result in important biological functions. For instance, a catalytic triad, a group of three spatially adjacent amino acids in an enzyme, is responsible for catalytically cleaving the linkage (i.e., the peptide bond) between particular amino acids in the interacting substrates[2]. In addition, even a single mutation in a protein sequence may sometimes significantly perturb the thermodynamic and kinetic properties of the protein and cause fatal diseases[4].

Table 1.2: Common hierarchical structures of proteins and nucleic acids

Level | Protein | Nucleic acids
Primary | Amino acid sequence | Nucleic acid sequence
Secondary | Alpha helix, beta sheet | Helix, stem loop, pseudoknot, quadruplex
Tertiary | Functional domain | A/B/Z forms of the double helix
Quaternary | Higher-level organization of proteins or nucleic acids

A nucleic acid sequence has similar functional importance to a protein sequence. The central dogma of molecular biology[5,6] illustrates that genetic information is encoded in DNA sequences. Particular combinations of three nucleotides (see Table 1.1) in a gene can encode an amino acid of the final translation product, a protein. During this process, called gene expression, some DNA nucleotides near the coding region exert specific functional regulation on transcription levels. The sequences of these nucleotides ensure sequence-specific interactions with regulatory proteins such as transcription factors. Moreover, nowadays, oligonucleotides, short single-stranded sequences, are being actively developed as a potential new type of drug due to their smaller size, easier synthetic accessibility, and fewer concerns about unwanted immune responses (called immunogenicity) compared with therapeutic proteins[7]. Such oligonucleotides are referred to as aptamers, indicating that these polynucleotides can selectively interact with a target protein[8]. Therefore, figuring out the association between biological sequences and their corresponding functions is a critical task for understanding biological activities, causes of diseases, and the development of therapeutics. How to identify the functional regulatory components in protein and nucleic acid sequences remains a pressing open question[9,10,11,12].
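The codon-to-amino-acid mapping described above (and listed in Table 1.1) can be sketched as a simple lookup table. This is an illustrative sketch, not code from this thesis: the `CODON_TABLE` below is only a small excerpt of the 64 codons, and the function and variable names are my own choices.

```python
# A minimal sketch of translating an in-frame DNA coding sequence into
# amino acids using codon assignments from Table 1.1. The dictionary is a
# small excerpt; a full translator would include all 64 codons.
CODON_TABLE = {
    "ATG": "M",                                      # Met (also the start codon)
    "TTT": "F", "TTC": "F",                          # Phe
    "TGT": "C", "TGC": "C",                          # Cys
    "CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",  # Pro
    "TAA": "*", "TAG": "*", "TGA": "*",              # stop codons
}

def translate(dna: str) -> str:
    """Translate an in-frame DNA sequence, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # "X" marks codons not in the excerpt
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGTTTTGTCCATAA"))  # start, Phe, Cys, Pro, stop -> "MFCP"
```

The table-driven design mirrors how the genetic code itself works: translation is a pure lookup over non-overlapping triplets.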

1.2 Bioinformatics and machine learning introduction

Bioinformatics arose as an interdisciplinary subject that aids in uncovering the connections between biological sequences and biological behaviors[13]. In retrospect, the pioneering work in bioinformatics was done soon after protein sequences became available. In the early 1960s, Linus Pauling and Emile Zuckerkandl noticed that two hemoglobins with homologous amino acid sequences have similar functions[14], which illustrated that sequence similarity in proteins from different organisms can reflect evolutionary relationships. From this work they developed a theory of molecular evolution[15,16], which offers a

theoretical basis for sequence homology-based protein function inference. Building on Pauling and Zuckerkandl's molecular evolution theory, the second body of pioneering work was done by Margaret Oakley Dayhoff in the 1960s-1970s. Dayhoff first used computers to compare protein sequences and derive evolutionary histories from sequence alignment[17]. This computational application resulted in the first reconstruction of a phylogeny – an evolutionary tree – from molecular sequences. She also defined the PAM matrices[18], which are still widely used for protein homology alignment today. After these successful pioneering works, bioinformatics has progressed rapidly since the 1970s due to advances in computer hardware and algorithms. The application of computers made homology alignment more feasible. In 1970, Saul Needleman and Christian Wunsch published the first sequence alignment algorithm (the Needleman-Wunsch algorithm)[19]. This algorithm optimally aligns two sequences globally using a dynamic programming approach. In 1981, Temple Smith and Michael Waterman proposed a local alignment algorithm (the Smith-Waterman algorithm) to achieve insightful alignments for sequences with low similarity[20]. The FASTA ("fast-all", representing the FASTA suite of programs for both peptide and nucleotide sequence alignment)[21,22] and Basic Local Alignment Search Tool (BLAST)[23] algorithms have over time replaced the use of the Smith-Waterman algorithm because of their improvements in speed and memory requirements. Beyond these pairwise alignment approaches, multiple sequence alignment algorithms[24,25] – progressive alignment methods using heuristic search similar to FASTA[21,22] – began to be utilized to compare the similarity of multiple sequences and identify functional regions. Furthermore, the development of de novo – reference-free – assembly algorithms in the late 1990s has also facilitated obtaining long sequences in the human genome[26].
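The dynamic programming idea behind the Needleman-Wunsch algorithm can be sketched in a few lines. This is a minimal illustration under assumed scoring values (match=1, mismatch=-1, gap=-1, which are my illustrative choices, not the values of any particular tool), and it returns only the optimal score; a full implementation would also trace back the alignment itself.

```python
# A minimal sketch of Needleman-Wunsch global alignment scoring via
# dynamic programming. dp[i][j] holds the best score for aligning the
# first i characters of a with the first j characters of b.
def nw_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Return the optimal global alignment score of sequences a and b."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: substitute/match, gap in b, gap in a.
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]

print(nw_score("AGT", "AGT"))  # 3: three matches
print(nw_score("AG", "AAG"))   # 1: two matches plus one gap
```

Smith-Waterman differs mainly in clamping each cell at zero and taking the maximum over the whole matrix, which is what turns the global score into a local one.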
Due to the development of high-throughput sequencing (HTS) techniques, the number of biological sequences is growing exponentially. From 1982 to the present (March 2018), the number of bases in the GenBank database has doubled approximately every 18 months. To date, such explosive growth has resulted in over 200 million sequence records and 250 billion base pairs in GenBank. Experimental functional characterizations of these sequences now lag far behind the growth in sequences. Given such a large amount of sequence data, bioinformatics plays an increasingly profound role in inferring the biological properties and functions of a biopolymer. An example is homology modeling. It has been observed that evolutionarily related proteins that belong to the same protein family typically share similar sequences, tertiary structures, and biological functions. As a result, the unknown

structure of a protein can be modeled based on the known structures of its homologous proteins – proteins with similar sequences. Such homology modeling is now widely employed in structure-based drug design[27,28,29,30]. Today, bioinformatics has been extended to study the vast amounts of biological data generated by various biological activities. These activities cover biological processes at levels up to cells and organisms. Therefore, the research fields of bioinformatics range over many "omics", which currently include genomics, transcriptomics, lipidomics, proteomics, metabolomics, pharmacogenomics, physiomics, nutrigenomics, phylogenomics, etc.[31] Table 1.3 summarizes the directions of these omics-related studies. In these research areas, different quantitative analysis methods have been employed and developed to study the connections within the systems of living organisms.

Table 1.3: Types of omics-related studies

Types | Research focuses
Genomics | Coding and noncoding regions in genes, regulatory elements.
Transcriptomics | RNA and gene expression.
Lipidomics | Large-scale pathways and networks of lipids.
Proteomics | Protein abundance and their structures and functions.
Interactomics | Interaction networks of biomolecules.
Metabolomics | Metabolites and metabolic networks.
Pharmacogenomics | Effects of genetic factors on a host's response to drugs.
Physiomics | Physiological dynamics and functions of whole organisms.
Nutrigenomics | The body's response to diet and effects of food constituents on gene expression.
Phylogenomics | Evolutionary trees.

Machine learning is one of the computational methods that furthers the development of bioinformatics. As a powerful methodology in artificial intelligence, machine learning has been widely employed in products in our daily life. Through a variety of statistical modeling strategies, machine learning performs inference and provides testable predictions. It has been utilized in a broad range of subjects, including computer vision[32], natural language processing[33], finance[34], materials science[35], and molecular biology[4,36,37]. Cluster analysis was first reported to be employed in microarray experiments on the budding yeast Saccharomyces cerevisiae in 1998[38]. As an unsupervised machine learning method, cluster analysis in this context groups genes according to their expression levels and helps identify patterns in the kinetics of gene expression. This machine learning method was later used to construct predictive models for functional predictions[39,40,41,42].
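In the spirit of the expression-clustering work described above, the toy sketch below groups expression profiles with plain k-means. The profiles, cluster count, and function are my own illustrative assumptions (the cited microarray studies used hierarchical clustering over thousands of genes), shown only to make the unsupervised grouping step concrete.

```python
# A minimal sketch of clustering gene-expression profiles with k-means.
def kmeans(points, k, iters=20):
    """Plain k-means using the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared distance).
        for idx, p in enumerate(points):
            assign[idx] = min(
                range(k),
                key=lambda c: sum((pi - ci) ** 2 for pi, ci in zip(p, centroids[c])),
            )
        # Recompute each centroid as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Made-up expression profiles across 3 conditions: two "induced" genes and
# two "repressed" genes should land in separate clusters.
profiles = [(5.1, 4.8, 5.2), (0.2, 0.1, 0.3), (4.9, 5.0, 5.1), (0.1, 0.3, 0.2)]
print(kmeans(profiles, k=2))  # [0, 1, 0, 1]
```

Genes that end up in the same cluster share an expression pattern, which is exactly the signal the cited studies used to hypothesize shared function.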

Supervised learning algorithms have also been applied with great success to bioinformatics problems. Given biological data sets with different known phenotypes, supervised learning produces predictive models by quantitatively evaluating features for each phenotype class. Support vector machines[43], decision-tree-based regression[44], and artificial neural networks[45] have all provided profound insights into genome-wide association studies (GWAS)[46,47,48,49] and sequence annotation[50,51,52]. For example, deep neural network learning has recently started being used commercially to detect single nucleotide polymorphisms[53] – single-base variants, some of which confer genetic susceptibility to disease. Therefore, leveraging developments in computer hardware and the explosion in biological sequence data, it is increasingly necessary to adapt and develop machine learning techniques to uncover the functional sequence components in genes and proteins and decode the millions of bases in genomes.

1.3 Contribution of this work to bioinformatics and machine learning

In this thesis, we focus on the questions of sequence classification and motif identification, which are pressing open questions in bioinformatics and machine learning. Chapter 2 defines the problem being studied in this work. It also provides a brief introduction to the standard concepts and methods involved, to help readers understand the following chapters. Chapter 3 demonstrates the usefulness of kmer-based support vector machines (SVMs) in sequence classification. In a proof-of-principle study of Peroxiredoxin (Prx) subgroup classification, we show that a 3mer-SVM results in very accurate classification. In addition, without assuming that the most representative region of a Prx is its active site, our constructed 3mer-SVM provides mechanistic insight into the distinct functions of Prx subgroups and corresponding families. In Chapter 4, we propose a two-round learning algorithm that achieves more accurate sequence classification than the kmer-SVM. This novel learning algorithm identifies ungapped and gapped motif candidates and thus overcomes the potential over-fitting problem in kmer-SVM classification. We present several case studies using synthetic and real data to validate the proposed method. In the last chapter, we discuss limitations of this work and point out several potential future directions.
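The kmer-SVM idea that Chapters 3 and 4 build on can be sketched in miniature: encode each sequence as a vector of kmer counts, then score it with a linear decision function. This is a hedged sketch, not the thesis implementation; the weights below are hypothetical stand-ins (in practice they come from training subgroup-specific SVMs on labeled sequences), though CPA and FWP are among the distinguishing 3mers discussed in Chapter 3.

```python
# A minimal sketch of the kmer encoding behind a kmer-SVM: sequences become
# kmer-count vectors, and a trained linear SVM scores them as w . x + b.
from collections import Counter

def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count every overlapping kmer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def svm_score(seq: str, weights: dict, bias: float = 0.0, k: int = 3) -> float:
    """Linear decision function: positive class when the score is >= 0."""
    x = kmer_counts(seq, k)
    return sum(weights.get(kmer, 0.0) * count for kmer, count in x.items()) + bias

print(kmer_counts("ACPAC"))             # each of ACP, CPA, PAC counted once
toy_weights = {"CPA": 2.0, "FWP": 1.5}  # hypothetical 3mer weights for illustration
print(svm_score("ACPAC", toy_weights))  # 2.0, so predicted positive
```

Because the encoding discards kmer positions, the classifier is alignment-free; the two-round algorithm of Chapter 4 recovers positional motif information from the learned weights afterward.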

References

[1] Lin Chao. The meaning of life. BioScience, 50(3):245–250, 2000.

[2] K E Van Holde, W C Johnson, and P S Ho. Principles of Physical Biochemistry. Pearson/Prentice Hall, 2006. ISBN 9780130464279.

[3] C B Anfinsen. Principles that govern the folding of protein chains. Science (New York, N.Y.), 181(96):223–230, 1973. ISSN 0036-8075.

[4] Jiajie Xiao, Ryan L. Melvin, and Freddie R. Salsbury Jr. Probing light chain mutation effects on thrombin via molecular dynamics simulations and machine learning. Journal of Biomolecular Structure and Dynamics, 0(0):1–18, 2018.

[5] Francis Crick. Central dogma of molecular biology. Nature, 227(5258):561, 1970.

[6] Denis Thieffry and Sahotra Sarkar. Forty years under the central dogma. Trends in biochemical sciences, 23(8):312–316, 1998.

[7] Anthony D Keefe, Supriya Pai, and Andrew Ellington. Aptamers as therapeutics. Nature Reviews Drug Discovery, 9(7):537–550, July 2010.

[8] L C Bock, L C Griffin, J A Latham, E H Vermaas, and J J Toole. Selection of single-stranded DNA molecules that bind and inhibit human thrombin. Nature, 355(6360):564–566, 1992. ISSN 0028-0836.

[9] Jianjun Hu, Bin Li, and Daisuke Kihara. Limitations and potentials of current motif discovery algorithms. Nucleic acids research, 33(15):4899–4913, 2005.

[10] Modan K Das and Ho-Kwok Dai. A survey of motif finding algorithms. In BMC bioinformatics, volume 8, page S21. BioMed Central, 2007.

[11] Eran Eden, Doron Lipson, Sivan Yogev, and Zohar Yakhini. Discovering motifs in ranked lists of DNA sequences. PLoS Computational Biology, 3(3):e39, 2007.

[12] Robert D Finn, Teresa K Attwood, Patricia C Babbitt, Alex Bateman, Peer Bork, Alan J Bridge, Hsin-Yu Chang, Zsuzsanna Dosztányi, Sara El-Gebali, Matthew Fraser, et al. InterPro in 2017: beyond protein family and domain annotations. Nucleic Acids Research, 45(D1):D190–D199, 2016.

[13] David B. Searls. The roots of bioinformatics. PLOS Computational Biology, 6 (6):1–7, 06 2010.

[14] Emile Zuckerkandl, Richard T Jones, and Linus Pauling. A comparison of animal hemoglobins by tryptic peptide pattern analysis. Proceedings of the National Academy of Sciences, 46(10):1349–1360, 1960.

[15] Emile Zuckerkandl and Linus Pauling. Evolutionary divergence and convergence in proteins. In Evolving genes and proteins, pages 97–166. Elsevier, 1965.

[16] Emile Zuckerkandl and Linus Pauling. Molecules as documents of evolutionary history. Journal of theoretical biology, 8(2):357–366, 1965.

[17] Margaret Oakley Dayhoff and Robert S Ledley. Comprotein: a computer pro- gram to aid primary determination. In Proceedings of the De- cember 4-6, 1962, fall joint computer conference, pages 262–274. ACM, 1962.

[18] Margaret O Dayhoff. A model of evolutionary change in proteins. Atlas of protein sequence and structure, 5:89–99, 1972.

[19] Saul B Needleman and Christian D Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology, 48(3):443–453, 1970.

[20] Temple F Smith and Michael S Waterman. Comparison of biosequences. Ad- vances in applied mathematics, 2(4):482–489, 1981.

[21] David J Lipman and William R Pearson. Rapid and sensitive protein similarity searches. Science, 227(4693):1435–1441, 1985.

[22] William R Pearson and David J Lipman. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, 85(8):2444–2448, 1988.

8 [23] Stephen F Altschul, Thomas L Madden, Alejandto A Sch¨affer,Jinghui Zhang, Zheng Zhang, , and David J Lipman. Gapped BLAST and PSI- BLAST:a new generation of protein database search programs. Nucleic Acids Res, 25(17):3389–3402, 1997. ISSN 13624962.

[24] Desmond G Higgins and Paul M Sharp. Clustal: a package for performing multiple sequence alignment on a microcomputer. Gene, 73(1):237–244, 1988.

[25] Ramu Chenna, Hideaki Sugawara, Tadashi Koike, Rodrigo Lopez, Toby J Gib- son, Desmond G Higgins, and Julie D Thompson. Multiple sequence alignment with the clustal series of programs. Nucleic acids research, 31(13):3497–3500, 2003.

[26] Can Alkan, Saba Sajjadian, and Evan E Eichler. Limitations of next-generation genome sequence assembly. Nature methods, 8(1):61, 2011.

[27] Andreas Evers and Thomas Klabunde. Structure-based drug discovery using gpcr homology modeling: successful virtual screening for antagonists of the alpha1a adrenergic receptor. Journal of medicinal chemistry, 48(4):1088–1097, 2005.

[28] Claudio N Cavasotto and Sharangdhar S Phatak. Homology modeling in drug discovery: current trends and applications. Drug discovery today, 14(13-14): 676–683, 2009.

[29] Tobias Schmidt, Andreas Bergner, and Torsten Schwede. Modelling three- dimensional protein structures for applications in drug design. Drug discovery today, 19(7):890–897, 2014.

[30] Katherine Lansu, Joel Karpiak, Jing Liu, Xi-Ping Huang, John D McCorvy, Wesley K Kroeze, Tao Che, Hiroshi Nagase, Frank I Carroll, Jian Jin, et al. In silico design of novel probes for the atypical opioid receptor mrgprx2. Nature chemical biology, 13(5):529, 2017.

[31] Zheng Rong Yang. Introduction, chapter 1, pages 1–14. World scientific publish- ing, River Edge, NJ, USA, 2011.

[32] Edward Rosten and Tom Drummond. Machine learning for high-speed corner detection. In European conference on computer vision, pages 430–443. Springer, 2006.

9 [33] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008.

[34] Ian H Witten, Eibe Frank, Mark A Hall, and Christopher J Pal. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016.

[35] Tim Mueller, Aaron Gilad Kusne, and Rampi Ramprasad. Machine learning in materials science: Recent progress and emerging applications. Reviews in Computational Chemistry, 29:186–273, 2016.

[36] Adi L Tarca, Vincent J Carey, Xue-wen Chen, Roberto Romero, and Sorin Dr˘aghici.Machine learning and its applications to biology. PLoS computational biology, 3(6):e116, 2007.

[37] Robert Burbidge, Matthew Trotter, B Buxton, and Sl Holden. Drug design by machine learning: support vector machines for pharmaceutical data analysis. Computers & chemistry, 26(1):5–14, 2001.

[38] Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863–14868, 1998. ISSN 0027-8424.

[39] Yonatan Bilu and . The advantage of functional prediction based on clustering of yeast genes and its correlation with non-sequence based classifi- cations. Journal of Computational Biology, 9(2):193–210, 2002.

[40] Kire Trivodaliev, Aleksandra Bogojeska, and Ljupco Kocarev. Exploring func- tion prediction in protein interaction networks via clustering methods. PLOS ONE, 9(6):1–16, 06 2014.

[41] Lani F Wu, Timothy R Hughes, Armaity P Davierwala, Mark D Robinson, Roland Stoughton, and Steven J Altschuler. Large-scale prediction of saccha- romyces cerevisiae gene function using overlapping transcriptional clusters. Na- ture genetics, 31(3):255, 2002.

[42] Angela F Harper, Janelle B Leuthaeuser, Patricia C Babbitt, John H Morris, Thomas E Ferrin, Leslie B Poole, and Jacquelyn S Fetrow. An atlas of peroxire- doxins created using an active site profile-based approach to functionally relevant clustering of proteins. PLoS computational biology, 13(2):e1005284, 2017.

10 [43] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learn- ing, 20(3):273–297, Sep 1995. ISSN 1573-0565.

[44] Laurent Hyafil and Ronald L Rivest. Constructing optimal binary decision trees is np-complete. Information processing letters, 5(1):15–17, 1976.

[45] Robert J Schalkoff. Artificial neural networks, volume 1. McGraw-Hill New York, 1997.

[46] Malgorzata Maciukiewicz, Victoria S Marshe, Anne-Christin Hauschild, Jane A Foster, Susan Rotzinger, James L Kennedy, Sidney H Kennedy, Daniel J M¨uller, and Joseph Geraci. Gwas-based machine learning approach to predict duloxetine response in major depressive disorder. Journal of psychiatric research, 99:62–68, 2018.

[47] Silke Szymczak, Joanna M Biernacka, Heather J Cordell, Oscar Gonz´alez-Recio, Inke R K¨onig,Heping Zhang, and Yan V Sun. Machine learning in genome-wide association studies. Genetic epidemiology, 33(S1), 2009.

[48] Jung Hun Oh, Sarah Kerns, Harry Ostrer, Simon N Powell, Barry Rosenstein, and Joseph O Deasy. Computational methods using genome-wide association studies to predict radiotherapy complications and to identify correlative molec- ular processes. Scientific Reports, 7:43381, 2017.

[49] Brett A McKinney, David M Reif, Marylyn D Ritchie, and Jason H Moore. Machine learning for detecting gene-gene interactions. Applied bioinformatics, 5 (2):77–88, 2006.

[50] Maxwell W Libbrecht and William Stafford Noble. Machine learning applications in genetics and genomics. Nature Reviews Genetics, 16(6):321, 2015.

[51] Christopher Fletez-Brant, Dongwon Lee, Andrew S. McCallion, and Michael A. Beer. kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic data sets. Nucleic acids research, 41(Web Server issue): 544–556, 2013. ISSN 13624962.

[52] Mahmoud Ghandi, Dongwon Lee, Morteza Mohammad-Noori, and Michael A Beer. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS computational biology, 10(7):e1003711, 2014.

11 [53] Ryan Poplin, Pi-Chuan Chang, David Alexander, Scott Schwartz, Thomas Colthurst, Alexander Ku, Dan Newburger, Jojo Dijamco, Nam Nguyen, Pe- gah T. Afshar, Sam S. Gross, Lizzie Dorfman, Cory Y. McLean, and Mark A. DePristo. Creating a universal snp and small indel variant caller with deep neural networks. bioRxiv, 2018.

Chapter 2

Sequence-based classification and sequence motifs in protein and nucleic acids

2.1 Sequence-based classifications

The development of high-throughput sequencing has led to exponential growth in the number of available biological sequences. However, experimental characterization of these sequences still lags far behind the acquisition of sequence information. Because a biopolymer's sequence is the essential signature that not only identifies the molecule but also encodes its properties, understanding the association between sequences and biological phenotypes is an important task. Annotating the functions of a sequence is equivalent to predicting the phenotype of a biopolymer given its sequence. On the foundation of molecular homology, similar sequences should share similar biological functions. Sequence annotation is therefore essentially a classification problem: assigning class labels to new sequences according to knowledge gained from training data. In other words, an uncharacterized sequence is classified into a group of similar sequences; because this group has been previously annotated, the functional properties of the uncharacterized sequence can be inferred accordingly. Based on the assumption that functional similarity is correlated with protein sequence similarity, current functional inference methodologies rely primarily on protein sequences1. Identifying domains with similar sequences (i.e., regions conserved across different sequences) or finding BLAST2 hits among proteins with experimentally determined function is commonly used as a simple approach to functional inference. In addition to these traditional methods, and given the growth in available sequence data over the past two decades, many more automated and systematic annotation methods based on sequence features have been developed1. Most of these methods require aligned sequences as a prerequisite in order to standardize the sequence features across different proteins. This alignment requirement introduces a potential source of error into functional annotation, since the alignments are typically automated based on overall sequence similarity. Because biologically conserved residues are expected to line up across sequences, such residues should receive higher priority during alignment; this suggests that conventional multiple sequence alignment potentially omits a more careful focus on key residue motifs. In addition, multiple sequence alignment can be computationally expensive for large numbers of long sequences.

As a result, other information, including protein structure data3,4, genomic context and inferred evolutionary relationships5,6, and protein-protein interaction networks7,8,9, has been exploited to improve function prediction. Previously, an atlas of the Peroxiredoxin (Prx) protein family was created using active site profile-based clustering10. Moreover, the structure-function linkage database (SFLD)11, which is based on expert curation and hidden Markov models built from active site features, also facilitates functional classification of sequences. However, these methods rely on information beyond the sequences themselves, and the accumulation of such additional information still lags behind the rate of growth in sequence data. Moreover, the use of such knowledge may introduce bias. In the example of Prx family classification, both the SFLD and the Prx atlas focus only on the active site of Prx, assuming that the most discriminating regions among the Prx subgroups are located near the active site. The identification of residue signatures of the Prx active site also depends heavily on the limited number of solved structures for each Prx subgroup. These potentially improper assumptions and foundations suggest the utility of making functional inferences from sequence data alone.

The bottleneck of purely sequence-based classification is capturing representative features. It is desirable to identify the critical sequence components responsible for the phenotype among a group of sequences without using multiple sequence alignments or extra non-sequence information. BLAST searching2 provides an alignment-free heuristic to locate similar sequence regions: in the heuristic BLAST algorithm, before performing local alignment, sliding substrings of length k (referred to as kmers) are searched during the seeding process. The frequency of kmers in a sequence has recently been used as a signature that approximately indicates the identity of the sequence in de novo sequence assembly12, alignment2, and sequence classification13,14,15. In particular, based on k-mer features, Michael Beer's lab has developed very general sequence classification tools: after encoding sequences into k-mer frequencies, a support vector machine (SVM) is constructed as a systematic classifier15. To overcome the computational issues caused by sparse features at large values of k, Beer's lab proposed gapped k-mer features for SVM training16. Along with several other variants of k-mer based sequence classification17, these approaches generally improve the performance of purely sequence-based classification. However, the choice of parameter values, such as k and the total string length l, remains an open question. In addition, the positional information lost during k-mer encoding can hurt classification when the relative positions of different kmers are critical. Therefore, more systematic and intelligent ways are needed to capture the representative features in the sequences.
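The k-mer encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work; the function name and example sequence are chosen for clarity.

```python
from collections import Counter
from itertools import product

def kmer_counts(seq, k, alphabet="ACGT"):
    """Count each length-k sliding substring (kmer) of seq and return a
    fixed-order frequency vector over the full alphabet^k kmer space."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerating every possible kmer places all sequences into the same
    # feature space, which is what an SVM classifier requires.
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]

vec = kmer_counts("ATAATG", 2)  # 16-dimensional 2-mer frequency vector
```

Note that a sequence of length n contributes n - k + 1 kmers, while the feature space grows as |alphabet|^k, which is why the vectors become sparse for large k.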

2.1.1 kmer SVM

To facilitate reading of later chapters, a brief introduction to SVMs and their application in sequence classification is given in this section. The SVM is a widely used supervised machine learning method for classification problems18. By training on data with numerically encoded features and corresponding class labels, an SVM algorithm constructs a model that can assign class labels to new data. The training process can be visualized as finding the point, line, or hyperplane that optimally separates the different classes of data points in the feature space. Figure 2.1 illustrates how an SVM algorithm finds the best line separating two-dimensional data points of classes I and II. The SVM algorithm tunes the weight vector w and bias b of the hyperplane equation to maximize the distance from the hyperplane to the margin of each class. The data points that define the margins are called support vectors. This SVM is also called a linear SVM because the hyperplane is defined by a linear combination of the feature vector components. Once the optimal hyperplane equation is found, the class of new data can be predicted based on the sign of its SVM score

y = w · x + b. (2.1)

When its SVM score is greater than 0, a data point should be classified to the

[Figure 2.1 diagram: two classes in the (x1, x2) feature space separated by the optimal hyperplane w · x + b = 0, with margin hyperplanes w · x + b = 1 and w · x + b = -1 passing through the support vectors; the distance between the margins is the maximum margin.]

Figure 2.1: An example of a linear SVM. Support vectors, denoted by the dark-colored nodes, define the class margins. Parameters defining these support vectors are tuned during data training to maximize the marginal distance between instances of the two classes.

[Figure 2.2 diagram: the decision rule w1·x1 + w2·x2 + w3·x3 + ... + b > 0, applied to a kmer feature vector x (each x >= 0), assigns the positive class (+) when the test holds and the negative class (-) otherwise.]

Figure 2.2: A linear SVM classifier. Once a linear SVM is constructed, a sequence with kmer-encoded features x (≥ 0) can be predicted as belonging to the functional-positive or -negative class.

positive set (i.e., class II in this example). When its SVM score is smaller than 0, it should be classified to the negative set (i.e., class I). A larger SVM score magnitude indicates a more certain classification; an SVM score of 0 indicates that the data point lies on the optimal hyperplane, making it hard to tell which class it belongs to. When the data are not linearly separable, a kernel trick can be employed to map the data features into a linearly separable space using non-linear mathematical functions18. For the sequence classification problem, linear SVMs using kmer representations of sequences have been proposed in the literature15. Although different sequences may not always be linearly separable, the linear SVM provides an insightful estimate of the contribution of each kmer. Because the feature value of each kmer is non-negative, the sign of each feature weight w indicates the class preference of the corresponding kmer: a positive w indicates the kmer is characteristic of the positive set, while a negative w suggests the kmer is more representative of the negative set (Figure 2.2). Meanwhile, the magnitude of w provides an estimate of the relative importance of the corresponding kmer for the class it favors.
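The scoring and sign-based classification of Equation 2.1 can be made concrete with a short sketch. The weights below are hypothetical, invented purely to illustrate how weight signs encode class preference of kmer features.

```python
def svm_score(w, x, b):
    """SVM decision value y = w . x + b (Equation 2.1)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, x, b):
    """The sign of the score gives the predicted class; a larger
    magnitude means a more certain prediction."""
    return "positive" if svm_score(w, x, b) > 0 else "negative"

# Hypothetical trained weights over three kmer features: the positive
# weight marks a kmer characteristic of the positive set, the negative
# weights mark kmers more representative of the negative set.
w, b = [0.8, -0.3, -0.5], -0.1
label = classify(w, [3, 0, 0], b)  # first kmer dominates the score
```

A sequence rich in the first kmer lands on the positive side of the hyperplane, while one dominated by the other two kmers lands on the negative side.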

2.2 Motif discovery

Sequence motifs are recurring short sequence patterns that presumably associate with particular properties and functions. The occurrence of motifs is arguably the most

typical representative feature among sequences in a biologically motivated group. Therefore, identifying motifs is critical for accurate sequence classification. Motif discovery ultimately helps reveal the actual sequence components that relate to the phenotype of interest.

2.2.1 Types of Motifs

There are several ways to define and represent sequence motifs, including (1) regular expressions19, (2) position weight matrices (PWMs)20, and (3) hidden Markov models21. In this work, we choose the regular expression representation for ease of understanding. In particular, we focus on the two types of motifs defined below. Motif type 1: Non-gapped k-mer motif. This type of motif is composed of k characters from the alphabet set, which can be the letter codes for nucleotides {A, T, C, G, U} or for amino acids (see Table 1.1). To be nontrivial, the value of k should be greater than one. Two examples of this motif type are ATAAT and WGKE[EK]VSD, where the square bracket indicates that either of the amino acids E or K can occupy that position. As such, alternative characters are permitted at any position for this motif type, although the set of alternatives must be nontrivial (not every possible character). Motif type 2: Gapped k-mer motif. In contrast to the type 1 motif, the k nontrivial positions in the motif are separated by gaps, which denote arbitrary characters. Examples of this motif type are AXAAT and WXKEXVSD, where X denotes a gap of arbitrary characters; a gap may not appear at the head or tail position of the motif. A gapped k-mer motif with a total length of l can be denoted as a (k, l)-mer. The two motif types defined here can be combined with logical AND and OR operators to form more comprehensive representative features. Logical AND operations preserve relative position information among different short motifs. For example, two motifs separated by several gaps are called a dyad motif in the literature22. If the number of gaps within the dyad motif is large, we may also write M1(nX)M2 to denote n gaps separating motifs M1 and M2.
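Both motif types map directly onto regular expressions, which is one reason the regular expression representation is convenient. The sketch below uses the example motifs from the text; the query sequences are made up for illustration.

```python
import re

# Regular-expression forms of the two motif types defined above.
type1 = re.compile(r"WGKE[EK]VSD")  # non-gapped; [EK] allows E or K
type2 = re.compile(r"W.KE.VSD")     # gapped; '.' plays the role of X
dyad = re.compile(r"ATA.{4}GGC")    # dyad motif M1(4X)M2

hit = type1.search("MAWGKEKVSDL")   # matches the substring WGKEKVSD
```

The `.{4}` quantifier expresses the M1(nX)M2 notation with n = 4, so a single regular expression can recognize a dyad motif while preserving the relative position of its two halves.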

2.2.2 Motif recognition and identification

Given a known motif, a string searching process can tell whether the motif pattern occurs in a query sequence. This process is called motif recognition, and it facilitates sequence classification and functional prediction as discussed previously. Regular

expression searching algorithms can be applied to motif recognition. However, it can be challenging to identify a motif among a large set of sequences. To date, most de novo motif discovery approaches are based on the position weight matrix (PWM)20. Motifs can be identified from the positions and characters with high frequencies. However, when computing a PWM, the input sequences are commonly not initially aligned, so multiple sequence alignment is generally required. As multiple sequence alignment can be computationally expensive and prone to misalignment, the PWM approach may not always provide accurate motif information for long, non-conserved sequences. Fitting mixture models through the expectation maximization technique is another widely used, alignment-free method to obtain the PWM for the motif(s). The most popular implementation of this approach is the tool Multiple EM for Motif Elicitation (MEME)23. Initialized from a highly discriminating ungapped kmer, the PWM of the motif model is fitted via expectation maximization. To overcome poor initialization of the PWM from a fixed kmer, the best width of the PWM can be found by statistical modeling techniques. In practice, however, this method requires an estimate of the number of motifs as user-specified input, and it may fail if the input contains too many noisy sequences. Recently, several studies based on kmer-SVM15 and gkm-SVM24,25 have shown a new way to obtain the PWM model for the motif(s): the significant kmers found by kmer-SVM and gkm-SVM can be used to compute the PWM. However, this approach cannot return a meaningful PWM when multiple distinct motifs exist.
For example, if the length of the kmer features is shorter than the total length of a dyad motif (two motifs separated by many gaps), the representative substrings from the head and the tail of the dyad motif will be stacked together during the calculation of the PWM. This suggests the need for relative position information in motif discovery. Given the enormous number of motif discovery algorithms available, this work will only make comparisons against other kmer-SVM approaches.
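The core PWM computation, given already-aligned motif instances, is simply a column-wise frequency count. The sketch below assumes pre-aligned, equal-length instances (exactly the requirement discussed above); the instances themselves are made up.

```python
from collections import Counter

def pwm(instances):
    """Column-wise character frequencies computed from equal-length,
    pre-aligned motif instances: one frequency dict per position."""
    n = len(instances)
    return [{ch: cnt / n for ch, cnt in Counter(col).items()}
            for col in zip(*instances)]

matrix = pwm(["ATAAT", "ATGAT", "ATAAT", "TTAAT"])
# The consensus motif reads off the most frequent character per column.
consensus = "".join(max(col, key=col.get) for col in matrix)
```

If the instances were misaligned (for example, the two halves of a dyad motif stacked at the same offset), these per-column frequencies would blur together, which is precisely the failure mode described in the text.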

2.2.3 Motif evaluation

The occurrence of a motif should be a representative feature that discriminates sequences in the biologically motivated group (referred to as "foreground" or "positive") from sequences in a random or background set (referred to as "background" or "negative"). If the motif is identified in a sequence from the foreground set, that instance is a true positive (TP); otherwise, it is a false negative (FN). If the

motif is found in a sequence from the background set, it is a false positive (FP); otherwise, it is a true negative (TN). The counts of these instances (denoted TP, TN, FP, FN) are also used to evaluate the performance of predictive models in machine learning. In addition to the confusion matrix (i.e., [TP/P, FP/P; FN/N, TN/N], where P and N are respectively the numbers of positive and negative instances in the data set), several statistical measures, formulated below, provide a straightforward estimate of how discriminating the motif is for the foreground sequence set relative to the background.

• Sensitivity. Also called recall, it measures the true positive rate (TPR) according to TPR = TP/P = TP/(TP + FN).

• Precision. Also called positive predictive value (PPV), it is calculated as the probability that a predicted positive is a true positive: PPV = TP/(TP + FP).

• Specificity. It measures the true negative rate (TNR) according to TNR = TN/N = TN/(TN + FP).

• False discovery rate. It measures the probability that a predicted positive is a false positive: FDR = FP/(TP + FP).

In machine learning, in order to quantify the overall probability of accurate predictions, accuracy (ACC) is computed as (TP + TN)/(TP + TN + FP + FN). However, when the foreground and background data sets contain very different numbers of sequences, the accuracy is dominated by the prediction result on the larger set. For example, if the foreground set has many more sequences than the background, ACC ≈ TP/P = TP/(TP + FN). In order to take advantage of all available data when estimating prediction quality, the F1-score (F1) is usually used rather than the accuracy. This metric is the harmonic mean of precision and sensitivity: F1 = 2 · PPV · TPR/(PPV + TPR). In addition, to evaluate the statistical significance of a motif, the P-value is widely used in a variety of bioinformatics studies. Although there are multiple ways to estimate the P-value26, it essentially indicates the probability of the motif pattern occurring in a randomly generated sequence of the same length. The smaller the P-value of a motif, the less likely it is to occur by chance. In other words, a motif is statistically significant if its P-value is small.
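The measures above follow directly from the four counts. A minimal sketch (the function name and example counts are illustrative):

```python
def motif_metrics(tp, fp, fn, tn):
    """Evaluation measures defined above, from instance counts."""
    tpr = tp / (tp + fn)                # sensitivity (recall)
    ppv = tp / (tp + fp)                # precision
    tnr = tn / (tn + fp)                # specificity
    fdr = fp / (tp + fp)                # false discovery rate
    f1 = 2 * ppv * tpr / (ppv + tpr)    # harmonic mean of PPV and TPR
    acc = (tp + tn) / (tp + tn + fp + fn)
    return {"TPR": tpr, "PPV": ppv, "TNR": tnr,
            "FDR": fdr, "F1": f1, "ACC": acc}

m = motif_metrics(tp=40, fp=10, fn=10, tn=40)
```

On a balanced example such as this one, ACC and F1 agree; their divergence appears when the foreground and background set sizes are very different, as noted above.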

References

[1] Predrag Radivojac, Wyatt T Clark, Tal Ronnen Oron, Alexandra M Schnoes, Tobias Wittkop, Artem Sokolov, Kiley Graim, Christopher Funk, Karin Verspoor, Asa Ben-Hur, et al. A large-scale evaluation of computational protein function prediction. Nature methods, 10(3):221, 2013.

[2] Stephen F Altschul, Thomas L Madden, Alejandro A Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25(17):3389–3402, 1997. ISSN 13624962.

[3] F Pazos and M J Sternberg. Automated prediction of protein function and detection of functional sites from structure. Proceedings of the National Academy of Sciences of the United States of America, 101(41):14754–14759, 2004.

[4] Janelle B. Leuthaeuser, Stacy T. Knutson, Kiran Kumar, Patricia C. Babbitt, and Jacquelyn S. Fetrow. Comparison of topological clustering within protein networks using edge metrics that evaluate full sequence, full structure, and active site microenvironment similarity. Protein Science, 24(9):1423–1439, 2015. ISSN 1469896X. doi: 10.1002/pro.2724.

[5] M Pellegrini, E M Marcotte, M J Thompson, D Eisenberg, and T O Yeates. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proceedings of the National Academy of Sciences of the United States of America, 96(8):4285–8, 1999. ISSN 0027-8424.

[6] Barbara E. Engelhardt, Michael I. Jordan, Kathryn E. Muratore, and Steven E. Brenner. Protein Molecular Function Prediction by Bayesian Phylogenomics. PLoS Computational Biology, 1(5):0432–0445, 2005. ISSN 1553734X.

[7] Minghua Deng, Kui Zhang, Shipra Mehta, Ting Chen, and Fengzhu Sun. Prediction of Protein Function Using Protein-Protein Interaction Data. Journal of Computational Biology, 10(6):947–960, 12 2003. ISSN 1066-5277.

[8] Stanley Letovsky and Simon Kasif. Predicting protein function from protein/protein interaction data: A probabilistic approach. Bioinformatics, 19(SUPPL. 1):197–204, 2003. ISSN 13674803.

[9] Alexei Vazquez, Alessandro Flammini, Amos Maritan, and Alessandro Vespignani. Global protein function prediction from protein-protein interaction networks. Nature Biotechnology, 21(6):697–700, 6 2003. ISSN 10870156.

[10] Angela F Harper, Janelle B Leuthaeuser, Patricia C Babbitt, John H Morris, Thomas E Ferrin, Leslie B Poole, and Jacquelyn S Fetrow. An atlas of peroxire- doxins created using an active site profile-based approach to functionally relevant clustering of proteins. PLoS computational biology, 13(2):e1005284, 2017.

[11] Eyal Akiva, Shoshana Brown, Daniel E Almonacid, Alan E Barber 2nd, Ashley F Custer, Michael A Hicks, Conrad C Huang, Florian Lauck, Susan T Mashiyama, Elaine C Meng, et al. The structure–function linkage database. Nucleic acids research, 42(D1):D521–D530, 2013.

[12] Ruiqiang Li, Hongmei Zhu, Jue Ruan, Wubin Qian, Xiaodong Fang, Zhongbin Shi, Yingrui Li, Shengting Li, Gao Shan, Karsten Kristiansen, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome research, 20(2):265–272, 2010.

[13] Daniel E Newburger and Martha L Bulyk. UniPROBE: an online database of protein binding microarray data on protein-DNA interactions. Nucleic acids research, 37(suppl 1):D77–D82, 2008.

[14] Daniel Navarro-Gomez, Jeremy Leipzig, Lishuang Shen, Marie Lott, Alphons PM Stassen, Douglas C Wallace, Janey L Wiggs, Marni J Falk, Mannis van Oven, and Xiaowu Gai. Phy-Mer: a novel alignment-free and reference-independent mitochondrial haplogroup classifier. Bioinformatics, 31(8):1310–1312, 2014.

[15] Christopher Fletez-Brant, Dongwon Lee, Andrew S. McCallion, and Michael A. Beer. kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic data sets. Nucleic acids research, 41(Web Server issue): 544–556, 2013. ISSN 13624962.

[16] Mahmoud Ghandi, Morteza Mohammad-Noori, and Michael A Beer. Robust k-mer frequency estimation using gapped k-mers. Journal of mathematical biology, 69(2):469–500, 2014.

[17] Lynne Davis, John Hawkins, Stefan Maetschke, and Mikael Bodén. Comparing SVM sequence kernels: A protein subcellular localization theme. In Proceedings of the 2006 workshop on Intelligent systems for bioinformatics-Volume 73, pages 39–47. Australian Computer Society, Inc., 2006.

[18] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, Sep 1995. ISSN 1573-0565.

[19] Cheng Zhou, Boris Cule, and Bart Goethals. Pattern based sequence classifica- tion. IEEE Transactions on Knowledge and Data Engineering, 28(5):1285–1298, 2016.

[20] Rodger Staden. Methods for calculating the probabilities of finding patterns in sequences. Bioinformatics, 5(2):89–96, 1989.

[21] Jing Wu and Jun Xie. Hidden Markov model and its applications in motif findings. In Statistical Methods in Molecular Biology, pages 405–416. Springer, 2010.

[22] Abha Singh Bais, Naftali Kaminski, and Panayiotis V Benos. Finding subtypes of transcription factor motif pairs with distinct regulatory roles. Nucleic acids research, 39(11):e76–e76, 2011.

[23] Timothy L Bailey, Charles Elkan, et al. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pages 28–36. AAAI Press, 1994.

[24] Mahmoud Ghandi, Dongwon Lee, Morteza Mohammad-Noori, and Michael A Beer. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS computational biology, 10(7):e1003711, 2014.

[25] Mahmoud Ghandi, Morteza Mohammad-Noori, Narges Ghareghani, Dongwon Lee, Levi Garraway, and Michael A Beer. gkmSVM: an R package for gapped-kmer SVM. Bioinformatics, 32(14):2205–2207, 2016.

[26] Jing Zhang, Bo Jiang, Ming Li, John Tromp, Xuegong Zhang, and Michael Q Zhang. Computing exact p-values for DNA motifs. Bioinformatics, 23(5):531–537, 2007.

Chapter 3

Kmer based classifiers extract functionally relevant features to support accurate Peroxiredoxin subgroup distinction

Publishing and Author Information

This chapter will be submitted to the journal Protein Science. The primary author is Jiajie Xiao, with William H. Turkett as corresponding author. This chapter contains stylistic variations (e.g., number of columns and citation style) from the submitted manuscript.

Abstract

The Peroxiredoxins (Prx) are a family of proteins that play a major role in antioxidant defense and peroxide-regulated signaling. Six distinct Prx subfamilies have been proposed based on analysis of the structure and sequence of the Prx active site. In this work, a sequence-based classifier, developed using support vector machines and a kmer sequence representation, is presented which attains 100% 10-fold cross-validation accuracy for classifying Prx proteins by subgroup. Analysis of the classifier's automatically derived models demonstrates that the classification decision is based on a combination of conserved features, including a significant number of residue regions outside the active site.

3.1 Introduction

The Peroxiredoxins (Prx) represent a family of enzymes found in a wide range of organisms that act as antioxidant defenses and play a role in managing signaling mediated by peroxide. A highly conserved cysteine plays a primary role in imparting peroxide-sensing functionality. Previous work1,2 has provided evidence for six distinct subgroups of the Prx family - AhpE, AhpC-Prx1, Prx5, Prx6, BCP-PrxQ, and Tpx - and over 38,000 proteins have been annotated to the level of a Prx subgroup3. To discover proteins belonging to Prx subgroups, the state-of-the-art approaches extract sequence fragments containing residues within a 10 angstrom region around the active site of Prxs for which structures are known. Subgroup-specific active site sequence profiles can then be aligned against sequences in biological databases, with high-scoring matches indicating likely membership of a protein in a subgroup. MISST3, which implements a version of this search process that can iteratively expand and split clusters representing subgroups, produced the most recent 38,739 Prx subgroup-specific annotations. These annotation approaches have, to date, limited sequence analysis to the active site region. The idea of a kmer representation, a list of counts of each length-k sliding fragment along a sequence, has been widely adopted for fast approximations of sequence identity4,5,6,7, including applications to protein classification8. Subgroup-distinguishing kmers represent small sequence fragments that are conserved among proteins within a subgroup but are distinct across subgroups. It is shown in this manuscript that training a machine learning classifier on 3mer-encoded subgroup-annotated proteins allows for accurate Prx subgroup annotation. The distinguishing kmers represent both previously known active site sequence fragments and additional sequence regions that are functionally relevant.

3.2 Materials and Methods

3.2.1 Data acquisition

The Structure-Function Linkage Database (SFLD) [9] provides a highly-curated protein database, organizing proteins by shared chemical function and providing a mapping between a given chemical function and associated active site features as represented in available protein sequences and structures. The SFLD annotates 7,345 proteins to the level of one of six Peroxiredoxin subfamilies and annotates 12,239 proteins (including the 7,345 annotated to the subgroup level) as members of the Peroxiredoxin

Superfamily. The recent work of Harper et al. [3], using an iterative approach to search Genbank for proteins that have active sites similar in sequence to those of known Prx structures, leads to a hypothesis of a total of 38,739 Prx proteins, of which 6,909 overlap with proteins annotated in the SFLD. The full data set of 38,739 proteins will be referred to as the Harper dataset, and the overlap set will hereafter be referred to as the Harper-SFLD dataset. The proteins in each subgroup of the Harper-SFLD data set were clustered at 95% sequence similarity using the CD-Hit algorithm [10, 11] to remove instances of proteins with high sequence similarity. The resulting set of 4,751 proteins will be referred to as the 0.95-Harper-SFLD data set. The distributions over subgroups of the proteins in these data sets are shown in Table 3.1. There is imbalance among the subgroups, with up to an order of magnitude difference in the number of examples between subgroups.

                  AhpE  Prx1  Prx5  Prx6   PrxQ   Tpx  Total
Harper            1489  9660  5434  5212  12014  4930  38739
Harper-SFLD        152  2130  1039   942   1786   860   6909
0.95-Harper-SFLD   138  1310   725   702   1330   546   4751

Table 3.1: Each column represents the number of examples for a Prx subgroup available in the corresponding data set. The Harper-SFLD data set represents the intersection of the Harper data set with the subgroup-annotated Peroxiredoxins available in SFLD as of December 2017. The 0.95-Harper-SFLD data set represents the representative proteins after clustering each subgroup of the Harper-SFLD data set using the CD-Hit algorithm with a 95% similarity setting.
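The redundancy-filtering step above can be approximated with a greedy clustering pass: keep a representative set in which no two members exceed the similarity cutoff. This is a simplified stand-in for CD-Hit (which uses short-word counting for speed), using difflib's ratio as a rough stdlib proxy for sequence identity.

```python
# Greedy 95%-similarity clustering, a toy stand-in for CD-Hit.
from difflib import SequenceMatcher

def greedy_cluster(sequences, cutoff=0.95):
    """Keep one representative per cluster of near-identical sequences."""
    representatives = []
    for seq in sorted(sequences, key=len, reverse=True):
        # Add seq only if it is below the cutoff against every representative.
        if all(SequenceMatcher(None, seq, rep).ratio() < cutoff
               for rep in representatives):
            representatives.append(seq)
    return representatives

seqs = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQL", "GGGGGGGGGG"]
reps = greedy_cluster(seqs)  # exact duplicate removed; 3 representatives
```

Real CD-Hit operates on tens of thousands of sequences efficiently; this sketch is quadratic and only meant to illustrate the filtering idea.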

3.2.2 Model construction

3mers were used to encode protein sequences. With 20 amino acid residue options at each position of a 3mer, this leads to 8,000 potential 3mer features. Six one-versus-all classifiers were constructed, one per subgroup. All classifiers were built using support vector machines (SVM) with linear kernels, chosen over other supervised learning methods due to their effectiveness and efficiency in problems with high-dimensional features [12, 13, 14]. The SVM technique identifies the maximum-margin hyperplane that separates the positive and negative classes in feature space [15]. Given that the features for the developed SVM classifier are kmer occurrences, the linear kernel utilized is also referred to in the literature as a spectrum kernel for binary classification of biological sequences [7, 16].

The SVM-Light toolkit [17] was used for training and classification, with default values, automatically chosen by the SVM-Light implementation, used for training parameters. To classify a given protein sequence, the subgroup annotation associated with the maximum of the scores returned from the six classifiers was used. An implementation of the classifier tool is publicly available at http://prxsubfamilyclassif-env.us-east-1.elasticbeanstalk.com/
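The one-versus-all decision rule described above - score against six linear models and take the argmax - can be sketched as follows. The weight vectors below are tiny made-up stand-ins, not the trained SVM-Light models.

```python
# Argmax over six one-versus-all linear scores (w . x + b).
SUBGROUPS = ["AhpE", "Prx1", "Prx5", "Prx6", "PrxQ", "Tpx"]

def linear_score(weights, bias, features):
    """Dot product of a sparse weight dict with sparse kmer counts, plus bias."""
    return sum(w * features.get(kmer, 0) for kmer, w in weights.items()) + bias

def classify(models, features):
    """models maps subgroup -> (weights, bias); return the argmax subgroup."""
    return max(models, key=lambda s: linear_score(*models[s], features))

# Hypothetical toy models over a couple of 3mer features each.
models = {s: ({}, -1.0) for s in SUBGROUPS}
models["Prx1"] = ({"VCP": 2.0, "FVC": 1.5}, -1.0)
models["Tpx"] = ({"DLP": 2.0}, -1.0)

prediction = classify(models, {"VCP": 1, "FVC": 1})  # "Prx1"
```

In the thesis pipeline the six scores come from SVM-Light; only the argmax step is shown here.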

3.3 Results

3.3.1 Classifier performance

10-fold cross validation was performed on the 0.95-Harper-SFLD data set, with 100% accuracy obtained in the cross validation experiment. To allow for a comparison with the work of Harper et al., a classifier built on all sequences from the 0.95-Harper-SFLD data set was then employed to classify the samples in the 38,739-protein Harper data set. Note that this large data set contains the 4,751 examples used for training. None of these 4,751 were classified incorrectly, and their counts have been removed from the rest of the presented results. The confusion matrix comparing annotations generated by the Harper et al. technique to those generated by the 3merSVM approach is shown in Table 3.2.

Harper/3merSVM   AhpE  Prx1  Prx5  Prx6   PrxQ   Tpx
AhpE             1348     0     1     0      2     0
Prx1                0  8350     0     0      0     0
Prx5                0     0  4709     0      0     0
Prx6                0     0     0  4509      1     0
PrxQ                0     0     0     0  10684     0
Tpx                 0     0     0     0      0  4384

Table 3.2: The confusion matrix represents results from testing on the Harper data set. For a given protein, the row represents the known annotation (as per Harper et al.) and the column represents the annotation suggested by the 3merSVM classifier. The counts represent how many proteins had each pair of annotations, with large values along the diagonal, representing matching annotations, being ideal.
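The 10-fold cross-validation protocol described above can be sketched with the standard library alone. The `train` and `predict` arguments are placeholder hooks (here filled by a trivial majority-class learner), not the thesis code.

```python
# 10-fold cross validation: shuffle, split into ten folds, train on nine,
# score the held-out fold, and average.
import random

def ten_fold_accuracy(examples, labels, train, predict, seed=0):
    """Overall held-out accuracy across ten random folds."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::10] for i in range(10)]
    correct = 0
    for fold in folds:
        held = set(fold)
        train_x = [examples[i] for i in indices if i not in held]
        train_y = [labels[i] for i in indices if i not in held]
        model = train(train_x, train_y)
        correct += sum(predict(model, examples[i]) == labels[i] for i in fold)
    return correct / len(examples)

# Majority-class stand-ins for the real train/predict functions:
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)

acc = ten_fold_accuracy(list(range(20)), ["A"] * 15 + ["B"] * 5,
                        train_majority, lambda model, x: model)  # 0.75
```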

3.3.2 Distinguishing kmers

Using the subgroup models constructed from the complete 0.95-Harper-SFLD data set, an exemplar set of fifteen distinguishing 3mers (shown in Tables 3.3 and 3.4)

for each subgroup was extracted. These 3mers were selected based on the ordered weights of the features from the linear-kernel SVMs trained for each subgroup. Complete lists of 3mers ordered by weight for each subgroup are included in the supplementary material. These 3mers were searched for within the active site (pseudo-)signatures for the Harper data set proteins provided in the Supporting Information S2 file of Harper et al. [3]. These signatures represent, for proteins identified to be members of each Prx subgroup, the sequence regions that best align with active site signatures for representatives of the subgroup. Repeated signatures for a subgroup were removed before searching for 3mers. The search checked for whether a complete 3mer was found as part of the signature sequence. The percentage of signatures for a subgroup in which each 3mer is fully found is included next to the 3mers in the tables. A significant proportion of active site residues are represented by the distinguishing kmers. This is particularly true for the Prx1 and Prx6 subgroups, where 8 of the 15 presented 3mers occur at rates greater than 80% in active site regions. These findings are reasonable given the high sequence conservation around the peroxidatic cysteine for these two subgroups, as shown by Harper et al. [3]. However, it is also the case that a number of distinguishing kmers are not contained within the active site regions, either resolving to extensions of or being located outside of the published active site regions. For 3mers with low occurrence (less than 5%) in the pseudo-signature sequences, the location of the 3mers is annotated in Tables 3.3 and 3.4 as Ext if evidence suggests the 3mer is an extension of the active site sequence fragments published by Harper et al. or as Out if the evidence suggests the 3mer is located outside the active site sequence fragments.
The determination of Ext or Out was made by extracting small regions of residues (5-8 residues in both directions) around the 3mers of interest from the sequences containing the 3mer, aligning the regions with Clustal Omega [18], generating a Weblogo [19], and visually inspecting the Weblogo.
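The ranking step described above, reading distinguishing 3mers off the ordered feature weights of each linear model, can be sketched as follows; the weight dictionary is a made-up stand-in for a trained subgroup model.

```python
# Rank kmer features of a linear model by weight and keep the top n.
def top_kmers(weights, n=15):
    """Return the n features with the largest weights, highest first."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [kmer for kmer, _ in ranked[:n]]

# Hypothetical weights for a handful of 3mer features.
toy_weights = {"FFP": 0.9, "ELC": 0.7, "LAF": 0.4, "AAA": -0.2, "WPH": 0.6}
ranked = top_kmers(toy_weights, n=3)  # ["FFP", "ELC", "WPH"]
```

For the real models, `weights` would hold the 8,000 per-3mer weights recovered from each linear-kernel SVM.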

3.4 Discussion

3.4.1 Classification process comparison

With respect to the process of searching, the developed classifier has several advantages compared to other methods for classifying Prx proteins at the subgroup level. Table 3.5 indicates the features of five different methods that can be used to classify Prx proteins to the subgroup level. These methods include HMM search against the SFLD

Rank  AhpE  AS%    Loc   Prx1  AS%    Loc   Prx5  AS%    Loc
1     FFP   44.51        VCP   96.29        PGA   93.20
2     ELC   50.46        CPT   84.70        VND    0.00  Ext
3     LAF   31.08        FVC   94.55        GAF   85.32
4     WPH    0.00  Ext   PTE   84.91        VPG   58.54
5     PHG    0.00  Ext   FTF   89.51        LPG   32.65
6     SDF    2.46  Ext   TFV   89.17        AFT   83.39
7     DFW    0.00  Ext   VGR    3.24  Out   KGV    0.14  Ext
8     FWP    0.00  Ext   FYP   82.98        NDP    0.00  Ext
9     VCT   35.28        DFT   89.46        FVM    0.00  Ext
10    FPL   51.49        CPA    0.07  Out   HLP    0.11  Ext
11    AFT   56.10        STD   44.67        DPF    0.04  Ext
12    CTF    0.00  Ext   GRN    0.00  Out   RYA    0.11  Ext
13    TFR    0.00  Ext   TEI    0.00  Ext   HVP    0.25  Ext
14    FTG   40.31        ALR   37.34        FTP   79.77
15    WVS    0.00  Out   PLD   37.74        GVD    0.04  Ext

Table 3.3: Columns represent distinguishing 3mers for the AhpE, Prx1, and Prx5 subgroups, the percentage of corresponding subgroup active site pseudo-signatures from the Harper data that each 3mer occurs in, and the location relative to the active site for 3mers with low occurrence in the active site profile. For the location data, Ext indicates a location that is an extension of published active site fragments, while Out indicates a location distinct from published active site fragments. The rank ordering is based on weights extracted from the learned subgroup models.

database [9], use of the MISST algorithm [3] which builds on DASP [20], search against the PREX database [2], search against the NCBI Conserved Domains Database (CDD) [21], and the method described in this work, named 3merSVM. All methods except MISST/DASP have a web interface through which sequences can be uploaded to be analyzed. SFLD and PREX allow only one sequence to be analyzed at a time, reducing their utility for high-throughput analyses, while MISST/DASP, CDD, and 3merSVM all support batch processing. The sequence databases, models, and annotation techniques behind SFLD and CDD support the ability to provide annotations outside of the six Prx subfamilies, including generalized annotations such as a Peroxiredoxin or Thioredoxin-fold annotation. The PREX database only provides annotations to one of the Prx subfamilies or indicates no annotation is appropriate.

Rank  Prx6  AS%    Loc   PrxQ  AS%    Loc   Tpx   AS%    Loc
1     SHP   98.28        GCT   90.93        PFA    0.05  Ext
2     FSH   97.32        YFY   78.35        DLP    0.00  Ext
3     FTP   82.89        FYP   95.39        LPF    0.00  Ext
4     TPV   96.94        PGC   76.04        RFC   67.07
5     VCT   95.75        YPK   66.51        FAQ    0.00  Ext
6     PVC   96.88        TPG   75.17        VPS   41.76
7     TTE   92.63        LYF    0.00  Ext   LDT   36.24
8     LFS    0.00  Ext   FRD    3.05  Ext   PNY    0.00  Out
9     CTT   92.48        GVS   49.78        IDT   37.93
10    HPA   53.01        GIS   45.05        PSI   37.97
11    PAD   52.67        EAC   32.32        PDY    0.00  Out
12    DFT   78.18        PKA   15.22        DTP   31.57
13    GRN    0.00  Out   VLY    0.00  Ext   LAR   35.68
14    GLS    0.25  Ext   YPR   11.22        PSL   35.92
15    PII   49.39        QDG    0.15  Out   SLD   36.20

Table 3.4: Columns represent distinguishing 3mers for the Prx6, PrxQ, and Tpx subgroups, the percentage of corresponding subgroup active site pseudo-signatures from the Harper data that each 3mer occurs in, and the location relative to the active site for 3mers with low occurrence in the active site profile. For the location data, Ext indicates a location that is an extension of published active site fragments, while Out indicates a location distinct from published active site fragments. The rank ordering is based on weights extracted from the learned subgroup models.

3.4.2 Classification performance comparison

As a baseline for analysis of classification performance, six subgroup-specific canonical sequence motifs were searched for in the proteins of the Harper data set. These six canonical active site motifs, one for each subgroup, are defined by Harper et al. as extensions of the general Prx active site motif (Pxxx(T/S)xxC) and are shown in Table 3.6. The data in Table 3.7 represent, for the Harper data set, the percentage of proteins in each subgroup containing the representative subgroup motif. These percentages align with the percentages reported in Harper et al. [3], but are further broken down to show the percentage of sequences in each subgroup for which there are matches to more than one subgroup canonical motif. While a search for these canonical motifs alone would allow for accurate classification into some subgroups, there would be significant issues with respect to the AhpE and Prx5 subgroups, as instances of the Prx5 motif are a more concrete instance of the AhpE motif. The full classification process employed by Harper et al. [3] does not just use these canonical motifs, but makes use of alignment against active site profiles (ASP) [22, 20],

Method      Web  Batch  Prx-specific  Hierarchical
SFLD        Yes  No     No            Yes
MISST/DASP  No   Yes    Yes           No
PREX        Yes  No     Yes           No
CDD         Yes  Yes    No            Yes
3merSVM     Yes  Yes    Yes           No

Table 3.5: Each row represents a method that can be employed to annotate Prx proteins to the subgroup level. Each column beyond the first represents a feature of a given method. Web represents whether or not a method is available via web interface. Batch indicates whether more than one protein sequence can be processed at a time. Prx-specific indicates whether an annotation method is specific only to Prx subfamilies. Hierarchical indicates whether more generalized annotations can be returned.
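The canonical motif syntax used in this section (a lowercase x for any residue, slash-separated alternatives in parentheses) maps directly onto regular expressions. A sketch of that translation, assumed from the notation described in the text rather than taken from the thesis code:

```python
# Translate motifs like "Pxxx(T/S)xxC" into regexes like "P...[TS]..C".
import re

def motif_to_regex(motif):
    """Convert the canonical motif notation to a regular expression."""
    # "(A/B/C)" becomes the character class "[ABC]".
    pattern = re.sub(r"\(([^)]*)\)",
                     lambda m: "[" + m.group(1).replace("/", "") + "]",
                     motif)
    # Lowercase x is a wildcard position.
    return pattern.replace("x", ".")

tpx = motif_to_regex("PS(I/L/V)DTx(V/T/I)C")  # "PS[ILV]DT.[VTI]C"
assert re.search(tpx, "AAPSIDTAVCAA")
```

Such regexes make the baseline motif search over the Harper data set a one-line scan per sequence.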

Subgroup  Motif
AhpE      PxAF(T/S)xxC
Prx1      PxDF(T/S)FVC
Prx5      P(G/A)A(F/Y)(T/S)(P/G)xC
Prx6      Px(D/N)(F/Y)TPVC
PrxQ      P(K/A/R)(D/A)xTxGC
Tpx       PS(I/L/V)DTx(V/T/I)C

Table 3.6: Each row represents the canonical active site motif for a Prx subgroup as defined by Harper et al. A lowercase 'x' represents any amino acid residue at that position. Uppercase values in parentheses represent a set of values possible at a given position.

where an active site profile consists of multiple sequence fragments for which the residues are within 10 angstroms in structural space of known active site key residues. The Harper annotations are considered the current gold standard for Prx annotations.

Subgroup  Exact Match %  Non-Unique Match %  No Match %
AhpE      83.48          10.68               5.84
Prx1      93.85           0.00               6.15
Prx5       6.90          90.50               2.60
Prx6      96.68           0.00               3.32
PrxQ      89.33           0.92               9.76
Tpx       98.09           0.08               1.83

Table 3.7: Each row represents, for each Prx subgroup, percentages of the sequences in the Harper data set associated with that subgroup that match just the canonical active site motif for that subgroup (Exact Match %), that match one or more additional subgroup motifs beyond the canonical motif for that subgroup (Non-Unique Match %), and that do not match the canonical motif for that subgroup (No Match %). Summed values across a row equal 100% barring small rounding issues.

In Table 3.8, it is shown that there were four differences between the Harper and the 3merSVM annotations. These are described by providing the Harper annotation, followed by the 3merSVM annotation. There was one instance of labeling an AhpE protein as Prx5, two instances of labeling an AhpE protein as PrxQ, and one instance of labeling a Prx6 protein as PrxQ. Additional annotations - from canonical motif search, the SFLD classifier, the PREX classifier, and the CDD classifier - for those four proteins are also shown in Table 3.8.

The developed classifier determines the subgroup to suggest for an input protein by selecting the annotation associated with the maximal score across the six subgroup classifier scores. For the four proteins with different annotations between the 3merSVM and Harper approaches, the 3merSVM scores from each subgroup classifier are shown in Table 3.9. For the protein WP 055763280.1, considering the positive PrxQ score, the negative scores for the other subgroups, and the results from the PREX and CDD searches shown earlier, it is hypothesized that WP 055763280.1 actually belongs to the PrxQ subgroup. The other three proteins with differing annotations exhibit negative scores from all of the 3merSVM subgroup classifiers. Typically, the sign of the score returned from an SVM classifier can be used to indicate the class to which the given input belongs. A possible interpretation of all negative scores is that the proteins do not have characteristics of any of the subgroups. Reviewing the classifier outputs for the 38,739-protein Harper data set, negative scores were returned from all the subgroup classifiers for 63 of the proteins (including the three discussed previously). Even with negative scores returned by all the subgroup classifiers, most of the 3merSVM annotations match the Harper et al. annotation. The three differing annotations are from some of the lowest scores returned - these are shown as triangles in Figure 3.1. Out of the 63 proteins with all negative 3merSVM scores, 53 are annotated as AhpE by Harper et al. 47 of those 53 are not in SFLD; the other 6 are in SFLD but are not characterized to a subgroup. This is shown in Table 3.10.
The AhpE subgroup has the least training data (an order of magnitude smaller than some of the other subgroups), and the signature conservation graph for AhpE in the work of Harper et al. is noisy relative to the other signature conservation graphs, highlighting increased variability in residues located structurally near the active site. Both of these factors help explain the lower-than-expected maximum scores for these proteins.

Protein         Harper  3mer  Motif  SFLD  PREX  CDD
ELY63016.1      AhpE    Prx5  AhpE   Prx   -     TRX
                              Prx5
WP 051670221.1  AhpE    PrxQ  AhpE   Prx   -     TRX
                              Prx5
                              PrxQ
WP 055763280.1  AhpE    PrxQ  -      Prx   PrxQ  PrxQ
KFD50172.1      Prx6    PrxQ  -      -     -     -

Table 3.8: Each labeled entry represents a protein for which the 3merSVM approach returned a label different from that provided by Harper et al. [3]. The first column represents the Genbank id for a protein. The second and third columns represent the Harper and 3merSVM annotations respectively. The fourth column shows which canonical motifs exist in the protein sequence, with multiple motifs possible shown one per row. The fifth column indicates the annotation provided by a search against SFLD using the SFLD HMM tool. The sixth column indicates the annotation provided by a search against the PREX database, taking the annotation of the highest scoring match. The seventh column indicates the conserved domain provided by an NCBI CDD search on the sequence. A - symbol indicates no match or value was returned by a given technique.

Protein         AhpE    Prx1    Prx5    Prx6    PrxQ    Tpx
ELY63016.1      -0.475  -0.939  -0.431  -0.898  -0.670  -0.915
WP 051670221.1  -0.353  -0.949  -0.774  -1.221  -0.289  -0.891
WP 055763280.1  -0.606  -1.055  -1.137  -1.003   0.498  -1.121
KFD50172.1      -0.883  -0.896  -0.438  -0.715  -0.409  -0.884

Table 3.9: Each row represents a protein for which the 3merSVM approach returned a label different from that provided in Harper et al. [3]. The first column represents the Genbank id for a protein. The remaining columns provide the score returned from each subgroup-specific model. The maximum score for each protein is in bold font.

3.4.3 Analysis of distinguishing kmers

Comparison of the distinguishing kmers to the residues of Prx active sites suggests that a significant proportion of active site residues are represented by the distinguishing kmers. In this work, as described previously, active site residues are those annotated as being within an active site profile per Harper et al. Importantly, however, some distinguishing kmers map in sequence space to functionally-relevant regions that are either extensions of the active site or are in distinct (non-active site) regions. Three exemplar sets of residues are presented below to highlight this type of information that can be extracted and made use of by the 3merSVM approach.

Figure 3.1: For all 63 proteins in the Harper data set where all 3merSVM classifiers returned a negative score, the maximum score for each protein is plotted. Plus shapes indicate the scores for the 60 proteins where the 3merSVM classification matched the Harper et al. classification, while triangle shapes indicate the 3 proteins where there was a mismatch between the two approaches to classification.

For the AhpE subgroup, the set of 3mers DFW, FWP, WPH, and PHG commonly occur together. The information in Table 3.11 represents in how many proteins in the 0.95-Harper-SFLD data set and in the Harper data set each 3mer occurs and how often they all occur in the same protein. These 3mers commonly occur as the residue region DFWPHG, which occurs as an extension of the active site profile region described as (F/A/Y)(P/D)(L/D)(L/F/V)(S/T/E/A) by Harper et al. [3]. The image in Figure 3.2 is a Weblogo [19] representation of the region +/-5 residues centered on the 3mer FWP, extracted from the set of 1055 AhpE sequences in which all four 3mers occur. This set of residues is annotated as a turn in available protein structures for AhpE (1XVW, 4X0X) and has been suggested as playing an important role in the oligomerization interface [23].

For the Tpx subgroup, the set of 3mers DLP, LPF, and FAQ commonly occur together. The information in Table 3.12 represents in how many proteins in the 0.95-

Harper/3merSVM  AhpE  Prx1  Prx5  Prx6  PrxQ  Tpx
AhpE              51     0     1     0     1    0
Prx1               0     3     0     0     0    0
Prx5               0     0     0     0     0    0
Prx6               0     0     0     2     1    0
PrxQ               0     0     0     0     3    0
Tpx                0     0     0     0     0    1

Table 3.10: The matrix represents the distribution of subgroup annotations for proteins from the Harper data set which received all negative scores from the 3merSVM models. For a given protein, the row represents the known annotation (as per Harper et al.) and the column represents the annotation suggested by the developed classifier. The counts represent how many proteins had each pair of annotations.

3mer  0.95-Harper-SFLD (138)  Harper (1489)
WPH   84                      1068
PHG   84                      1067
DFW   84                      1071
FWP   83                      1056
All   83                      1055

Table 3.11: Each row represents a 3mer of interest or the set of all listed 3mers. Each column represents a data set of interest. The counts represent in how many proteins of the data set a given 3mer or set of 3mers occurs. The title of each column indicates both the name of and, in parentheses, the total number of proteins in a given data set.

Harper-SFLD data set and in the Harper data set each 3mer occurs and how often they all occur in the same protein. These 3mers commonly appear as an extension of the Tpx active site profile region described as A(Q/A/L/M)(K/A/S/G)R(F/W)C by Harper et al. [3]. The image in Figure 3.3 is a Weblogo representation of the region +/-5 residues centered on the 3mer LPF, extracted from the set of 3571 Tpx sequences in which all three 3mers occur. The set of residues corresponding with these 3mers is annotated as a turn and the start of the alpha-helix containing the Tpx resolving cysteine in available protein structures for Tpx (1Y25, 3HVS). This region has been suggested as being highly conserved in sequence and playing roles as part of the dimer interface and as loop anchors [24].

For the Prx1 subgroup, the 3mer CPA occurs in 1065 of the 1310 Prx1 samples in the 0.95-Harper-SFLD data set and 7479 of the 9660 Prx1 samples in the Harper data set. The image in Figure 3.4 is a Weblogo representation of the region +/-5 residues centered on the 3mer CPA, extracted from the set of 7479 sequences in which the 3mer occurs. The C in this 3mer represents the resolving cysteine [25].

Figure 3.2: A Weblogo alignment of the regions +/- 5 residues centered on the 3mer FWP extracted from the sequences that all four AhpE 3mers shown in Table 3.11 occur in.

3mer  0.95-Harper-SFLD (546)  Harper (4930)
DLP   540                     4890
LPF   539                     4856
FAQ   356                     3602
All   351                     3571

Table 3.12: Each row represents a 3mer of interest or the set of all listed 3mers. Each column represents a data set of interest. The counts represent in how many proteins of the data set a given 3mer or set of 3mers occurs. The title of each column indicates both the name of and, in parentheses, the total number of proteins in a given data set.
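The +/-5 residue window extraction used to prepare these Weblogo alignments can be sketched as follows; function and variable names are illustrative, not from the thesis code.

```python
# Collect the region flank residues on either side of every occurrence of
# a 3mer, across all sequences that contain it.
def windows_around(sequences, kmer, flank=5):
    """Return the flank+len(kmer)+flank windows centered on each kmer hit."""
    regions = []
    for seq in sequences:
        start = seq.find(kmer)
        while start != -1:
            lo = max(0, start - flank)
            hi = start + len(kmer) + flank
            regions.append(seq[lo:hi])
            start = seq.find(kmer, start + 1)
    return regions

regions = windows_around(["MMMMMCPAMMMMM", "CPAKKKKK"], "CPA")
```

The collected regions would then be aligned (e.g., with Clustal Omega) and rendered as a Weblogo for visual inspection.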

3.4.4 Limitations in analysis

This work demonstrates that the use of 3mers supports high-accuracy subgroup annotation of Prx sequences. The classifiers have been constructed under the assumption that a sequence to be annotated is already known to be a Prx sequence. To remove this constraint, a hierarchical classification mechanism [26] could be developed to first annotate a protein as a Peroxiredoxin or not, and then to annotate it to the subgroup level. A check for the presence of the Prx canonical active site motif Pxxx(T/S)xxC could play this role.

It is possible for a protein to receive negative scores from all six subgroup classifiers. A traditional approach to handling this scenario is to suggest that annotating the protein to one of the six subgroups is inappropriate when none of the scores is 0 or above. However, given the number of correct predictions made on the proteins in the Harper data set using the 3merSVM approach by taking the annotation with the highest score, it may be suitable to adjust the threshold for when to suggest not providing an annotation to a score below 0.

While the discovered 3mers highlight sequence regions that distinguish between Prx subgroups, the use of 3mers is a fairly low-resolution technique. A given 3mer maps to a small portion of a given Prx sequence. The SVM classifier takes into account the presence of multiple 3mers. The use of kmers with larger k-values (4mers, 5mers) and the use of gapped kmers [27], where wildcard ('x') positions are allowed in the kmer, could potentially support accurate prediction with fewer and more interpretable features. To date, only a small number of the highest-weighted 3mers that occur outside of the active site have been explored with respect to the role that the 3mer residue regions play mechanistically. The use of more rigorous feature selection methods, such as recursive feature elimination (RFE) [28], to determine the subset of features to analyze beyond the fifteen exemplars provided per subgroup is important, followed by analysis with respect to biochemical and biophysical features of the involved residues and the location of the 3mers in known protein structures.

Figure 3.3: A Weblogo alignment of the regions +/- 5 residues centered on the 3mer LPF extracted from the sequences that all three Tpx 3mers shown in Table 3.12 occur in.

Figure 3.4: A Weblogo alignment of the regions +/- 5 residues centered on the 3mer CPA extracted from the Prx1 sequences that contain the CPA 3mer.
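The thresholded annotation rule discussed in this section - suggest the argmax subgroup only when its score clears a cutoff, otherwise abstain - can be sketched as follows. The threshold of 0 is the conventional SVM choice; the text suggests a lower value may be appropriate.

```python
# Annotate with the top-scoring subgroup, or abstain below a threshold.
def annotate(scores, threshold=0.0):
    """scores maps subgroup -> SVM score; None means no subgroup suggested."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

# Hypothetical all-negative score vector, as seen for 63 Harper proteins.
scores = {"AhpE": -0.35, "Prx1": -0.95, "PrxQ": -0.29, "Tpx": -0.89}
no_call = annotate(scores)                  # None under the default threshold
relaxed = annotate(scores, threshold=-0.5)  # "PrxQ"
```

Lowering the threshold trades abstentions for annotations; the appropriate cutoff would need to be validated against the Harper annotations.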

3.5 Conclusions

In this work, a new high-accuracy classifier that can annotate Prx proteins to the subgroup level has been developed. The classifier, which encodes sequences as 3mers, is publicly available (http://prxsubfamilyclassif-env.us-east-1.elasticbeanstalk.com/) and supports high-throughput analyses. Comparison to the state-of-the-art approach to Prx subgroup annotation shows only four differences in over 38,000 annotations. Examination of a subset of 3mers that the developed classifier uses to distinguish between Prx subgroups reveals functionally relevant sequence fragments, including sequences that extend or are outside the active site regions used in previous Prx subgroup analyses.

References

[1] Kimberly J Nelson, Stacy T Knutson, Laura Soito, Chananat Klomsiri, Leslie B Poole, and Jacquelyn S Fetrow. Analysis of the peroxiredoxin family: using active-site structure and sequence information for global classification and residue analysis. Proteins: Structure, Function, and Bioinformatics, 79(3):947–964, 2011.

[2] Laura Soito, Chris Williamson, Stacy T Knutson, Jacquelyn S Fetrow, Leslie B Poole, and Kimberly J Nelson. PREX: peroxiredoxin classification index, a database of subfamily assignments across the diverse peroxiredoxin family. Nucleic acids research, 39(suppl 1):D332–D337, 2010.

[3] Angela F Harper, Janelle B Leuthaeuser, Patricia C Babbitt, John H Morris, Thomas E Ferrin, Leslie B Poole, and Jacquelyn S Fetrow. An atlas of peroxiredoxins created using an active site profile-based approach to functionally relevant clustering of proteins. PLoS computational biology, 13(2):e1005284, 2017.

[4] Stephen F Altschul, Thomas L Madden, Alejandro A Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25(17):3389–3402, 1997.

[5] L. J. Jensen, R. Gupta, N. Blom, D. Devos, J. Tamames, C. Kesmir, H. Nielsen, H. H. Stærfeldt, K. Rapacki, C. Workman, C. A. F. Andersen, S. Knudsen, A. Krogh, A. Valencia, and S. Brunak. Prediction of human protein function from post-translational modifications and localization features. Journal of Molecular Biology, 319(5):1257–1265, 2002.

[6] Shweta Bhandare, Debra S. Goldberg, and Robin Dowell. Discriminating between HuR and TTP binding sites using the k-spectrum kernel method. PLOS ONE, 12(3):1–14, 2017.

[7] Christopher Fletez-Brant, Dongwon Lee, Andrew S. McCallion, and Michael A. Beer. kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic data sets. Nucleic acids research, 41(Web Server issue):544–556, 2013.

[8] Christina Leslie, Eleazar Eskin, and William Stafford Noble. The spectrum kernel: A string kernel for SVM protein classification. In Biocomputing 2002, pages 564–575. World Scientific, 2001.

[9] Eyal Akiva, Shoshana Brown, Daniel E Almonacid, Alan E Barber 2nd, Ashley F Custer, Michael A Hicks, Conrad C Huang, Florian Lauck, Susan T Mashiyama, Elaine C Meng, et al. The structure–function linkage database. Nucleic acids research, 42(D1):D521–D530, 2013.

[10] Weizhong Li and Adam Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13): 1658–1659, 2006.

[11] Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu, and Weizhong Li. Cd-hit: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23):3150–3152, 2012.

[12] NLMM Pochet and JAK Suykens. Support vector machines versus logistic regression: improving prospective performance in clinical decision-making. Ultrasound in Obstetrics & Gynecology, 27(6):607–608, 2006.

[13] Diego Alejandro Salazar, Jorge Iván Vélez, and Juan Carlos Salazar. Comparison between SVM and logistic regression: Which one is better to discriminate? Revista Colombiana de Estadística, 35(SPE2):223–237, 2012.

[14] Yang Shao and Ross S Lunetta. Comparison of support vector machine, neural network, and CART algorithms for the land-cover classification using limited training data points. ISPRS Journal of Photogrammetry and Remote Sensing, 70:78–87, 2012.

[15] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[16] Lynne Davis, John Hawkins, Stefan Maetschke, and Mikael Bodén. Comparing SVM sequence kernels: A protein subcellular localization theme. 2006.

[17] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, chapter 11, pages 169–184. MIT Press, Cambridge, MA, 1999.

[18] Fabian Sievers, Andreas Wilm, David Dineen, Toby J Gibson, Kevin Karplus, Weizhong Li, Rodrigo Lopez, Hamish McWilliam, Michael Remmert, Johannes Söding, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular systems biology, 7(1):539, 2011.

[19] Gavin E Crooks, Gary Hon, John-Marc Chandonia, and Steven E Brenner. Weblogo: a sequence logo generator. Genome research, 14(6):1188–1190, 2004.

[20] Janelle B Leuthaeuser, John H Morris, Angela F Harper, Thomas E Ferrin, Patricia C Babbitt, and Jacquelyn S Fetrow. DASP3: identification of protein sequences belonging to functionally relevant groups. BMC bioinformatics, 17(1):458, 2016.

[21] Aron Marchler-Bauer and Stephen H Bryant. CD-Search: protein domain annotations on the fly. Nucleic acids research, 32(suppl 2):W327–W331, 2004.

[22] Stephen A Cammer, Brian T Hoffman, Jeffrey A Speir, Mary A Canady, Melanie R Nelson, Stacy Knutson, Marijo Gallina, Susan M Baxter, and Jacquelyn S Fetrow. Structure-based active site profiles for genome analysis and functional family subclassification. Journal of molecular biology, 334(3):387–401, 2003.

[23] Simon Li, Neil A Peterson, Min-Young Kim, Chang-Yub Kim, Li-Wei Hung, Minmin Yu, Timothy Lekin, Brent W Segelke, J Shaun Lott, and Edward N Baker. Crystal structure of AhpE from Mycobacterium tuberculosis, a 1-Cys peroxiredoxin. Journal of molecular biology, 346(4):1035–1046, 2005.

[24] Andrea Hall, Banumathi Sankaran, Leslie B Poole, and P Andrew Karplus. Structural changes common to catalysis in the Tpx peroxiredoxin subfamily. Journal of molecular biology, 393(4):867–881, 2009.

[25] Derek Parsonage, Kimberly J Nelson, Gerardo Ferrer-Sueta, Samantha Alley, P Andrew Karplus, Cristina M Furdui, and Leslie B Poole. Dissecting peroxiredoxin catalysis: separating binding, peroxidation, and resolution for a bacterial ahpc. Biochemistry, 54(7):1567–1575, 2015.

[26] Carlos N Silla and Alex A Freitas. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22(1-2):31–72, 2011.

[27] Mahmoud Ghandi, Dongwon Lee, Morteza Mohammad-Noori, and Michael A Beer. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS computational biology, 10(7):e1003711, 2014.

[28] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines. Machine learning, 46(1-3):389–422, 2002.

Chapter 4

Finding gapped and ungapped motifs using alignment-free two-round machine learning

Publishing and Author Information

This chapter will be submitted, in part, to Nucleic Acids Research or other relevant journals. The primary author is Jiajie Xiao, with corresponding author William H. Turkett. This chapter contains stylistic variations (e.g., number of columns and citation style) from the submitted manuscript.

Abstract

Sequence motifs are recurring patterns that are often associated with biological significance. Identification of sequence motifs is critical for functional classification and sequence annotation. Here, we propose a machine-learning-based algorithm to uncover the patterns that distinguish two sets of sequences. Using a second round of feature extraction guided by the support vector machines trained in the first round of feature searching, we are able to identify both gapped and continuous patterns within sequence data sets. Three computationally synthesized sequence sets and one experimentally derived set were used to test the proposed method. We illustrate that the proposed method can be utilized for de novo motif discovery and sequence-based classification, which can be useful for annotating biological sequences and for lead optimization in drug discovery.

4.1 Introduction

Sequence motifs are short recurring patterns in the one-dimensional sequences of biopolymers (i.e., DNA, RNA, and protein). Motifs are often associated with biological significance and presumably correspond to specific properties and functions. For example, transcription factor binding sites and telomeric repeats are common motifs in DNA that are respectively responsible for gene regulation1,2 and tumor emergence3. Sequence motifs are also conserved across proteins with similar functions in different subgroups and species. Functional motifs have been used as essential signatures for functional classification4,5 and annotation6. In addition, motif identification can further facilitate drug discovery and lead optimization of encoded molecules such as aptamers7.

However, de novo motif detection remains an open and challenging task8,9,10, as it requires systematically capturing critical and common combinations of characters from noisy datasets of sequences tens to thousands of bases long. Most current methodologies rely on finding an optimal position weight matrix11 (PWM) that represents the motif(s). For example, for short and well-conserved sequences, a PWM can be obtained by computing the frequency of each character at each position after performing multiple sequence alignment. However, for long and diverse sequences, multiple sequence alignment is computationally expensive and prone to misalignment, which hinders finding the actual motif(s) among the sequences. Fitting a mixture model via expectation maximization12 provides an alignment-free way to obtain the PWM for the motif(s): initialized from a most discriminating ungapped kmer, the PWM of the motif model can be fitted via the expectation maximization technique. However, such a method requires an estimate of the number of motifs as user-specified input, and it may fail if the input contains too many noisy sequences.
Recently, several studies have shown a new way to obtain the PWM model for the motif(s): after classifying sequences via kmer-SVM and gkm-SVM, a PWM for the discriminating motif(s) can be constructed by aligning the representative kmer and gapped-kmer features. Inspired by these kmer-SVM and gkm-SVM works, we propose a support vector machine (SVM) guided de novo motif discovery algorithm. The motifs detected by our proposed algorithm can be both ungapped and gapped patterns that differentiate two sets of sequences. These two sequence sets are referred to as the positive and negative sets. Each of these two sequence sets should respectively associate with particular properties, such as protein binding/unbinding, disease-present/free, or protein

subgroup I/II, corresponding to their underlying discriminating motifs. Therefore, our method helps identify critical sequence components related to the functions and properties of interest amid noise. Meanwhile, the gapped and ungapped motif features identified by our motif discovery algorithm serve as discriminating features for a second round of training. The resultant classifier provides the ability to predict whether a new input sequence presents the relevant properties.

4.2 Materials and Methods

4.2.1 Approach

Distinct from previous works on sequence classification and motif detection that require fixing the length of kmers or gapped kmers, our method overcomes this parameter dependence via two rounds of machine learning. The overall algorithm has three components, as described below.

Initial learning

Multiple kmer-SVMs using a range of k values are constructed to roughly locate the likely distinguishing regions in the sequences. Instead of choosing a single fixed kmer length, iterating over a range of kmer lengths adds flexibility to handle deletions and insertions. Each of these kmer-SVMs (i.e., the 3mer-SVM, 4mer-SVM, etc.) returns a list of kmers (i.e., respectively {3mers}, {4mers}, etc.) with a corresponding SVM feature weight for each kmer. Important kmers that differentiate the positive and negative sequence sets can be identified if their normalized SVM feature weights have a magnitude greater than a user-defined significance threshold w_c. The normalized SVM feature weight (denoted w_i) for a kmer (denoted kmer_i) can be obtained according to

\[
w_i = \frac{a_i}{\sum_{j \in \{\, l \,\mid\, a_l \cdot a_i > 0,\; |a_i| > a_c,\; |a_j| > a_c \,\}} a_j} \tag{4.1}
\]

where a_i is the SVM feature weight of kmer_i from the kmer-SVM and a_c is a significance threshold. Note that the linear kernel is used in each kmer-SVM. As the features are nonnegative kmer frequencies, the sign of the normalized SVM feature weight w_i indicates whether the positive or the negative sequence set is more likely to contain occurrences of kmer_i. Meanwhile, the magnitude of w_i suggests the relative importance of kmer_i in distinguishing the two sets.
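As a concrete reading of Eq. 4.1, the sketch below normalizes a dictionary of raw linear-SVM feature weights (the toy weight values and the function name are illustrative, not taken from the thesis). One interpretive choice is made explicit: the denominator is the magnitude of the same-sign significant-weight sum, so that each w_i keeps the sign of a_i, as the surrounding text requires.

```python
def normalize_weights(raw_weights, a_c):
    """Eq. 4.1 sketch: each significant raw weight a_i is divided by the
    magnitude of the summed weights of all features that share its sign and
    also exceed the significance cutoff a_c, so w_i keeps the sign of a_i."""
    normalized = {}
    for kmer_i, a_i in raw_weights.items():
        if abs(a_i) <= a_c:
            continue  # insignificant k-mers are dropped
        denom = abs(sum(a_j for a_j in raw_weights.values()
                        if a_j * a_i > 0 and abs(a_j) > a_c))
        normalized[kmer_i] = a_i / denom
    return normalized

# Hypothetical raw weights: two positive-set k-mers, one noise, one negative-set.
w = normalize_weights({"CCA": 1.2, "GGT": 0.8, "AAA": 0.1, "TTT": -0.5}, a_c=0.3)
```

Under this reading, the positive-set k-mers CCA and GGT receive normalized weights 0.6 and 0.4, the sub-threshold AAA is dropped, and TTT keeps its negative sign.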

Important kmers identified by different kmer-SVMs should overlap substantially when mapped back onto each original sequence. Therefore, the initial learning based on multiple kmer-SVMs roughly indicates which regions of each sequence may contain a discriminating motif for one sequence set against the other.

Motif candidate extraction

After stacking all significant kmers according to the reference of each original sequence, an accumulated weight (denoted aW) for a character (denoted A) at position p can be computed by adding up the normalized SVM feature weights of the kmers overlapping with the character A at position p. This accumulated weight aW estimates the overall contribution of character A to classifying the current sequence into one sequence set against the other. Therefore, the accumulated weight for each character offers a quantitative heuristic for extracting motif candidates. The larger the magnitude of the accumulated weight aW, the more significant the corresponding character A at position p is as a part of the discriminating motif.

Motif candidates can be generated by combining characters with large |aW|. By deciding whether to keep or mask out each character, a region with l characters can yield at most 2^l motif candidates. Although this number is tremendously smaller than the number of all possible combinations of the characters in the original sequence, it can cause some computational burden when the significant region identified by the initial learning is long. As a result, one can choose to start growing the motif candidate from the position with the largest |aW| (Figure 4.1). If adding one more character to the longest motif candidate generated so far causes the user-specified metric (e.g., accuracy, F1 score, etc.) to drop below an acceptance threshold, the expansion of motif candidates stops. This feature extraction process is repeated for all sequences in the training set. As such, a list of discriminating motif candidates can be extracted using the heuristics originating from the initial learning.
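The stacking and masking steps above can be sketched as follows. This is a deliberately simplified version: it masks low-weight positions with a plain |aW| cutoff rather than the metric-based acceptance test described in the text, and the function names and toy weights are ours, not from the thesis implementation.

```python
def accumulated_weights(seq, kmer_weights):
    """aW per position: sum the normalized weights of every significant
    k-mer whose occurrence covers that position."""
    aW = [0.0] * len(seq)
    for kmer, w in kmer_weights.items():
        k = len(kmer)
        for start in range(len(seq) - k + 1):
            if seq[start:start + k] == kmer:
                for p in range(start, start + k):
                    aW[p] += w
    return aW

def extract_candidate(seq, aW, cutoff):
    """Simplified candidate extraction: keep characters whose |aW| clears the
    cutoff; interior low-weight positions become gaps, written as 'X'."""
    keep = [abs(w) >= cutoff for w in aW]
    if not any(keep):
        return ""
    lo = keep.index(True)
    hi = len(keep) - 1 - keep[::-1].index(True)
    return "".join(seq[p] if keep[p] else "X" for p in range(lo, hi + 1))
```

For example, with a single significant k-mer CCA of weight 0.5 mapped onto AACCAGG, the three covered positions accumulate 0.5 each and the extracted candidate is CCA; scattered high-weight positions yield a gapped candidate such as AXXA.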

Motif candidate evaluation

The resultant motif candidates from the previously described step can be both gapped and ungapped. However, the accumulated weights used for extracting candidates are estimated from multiple independent kmer-SVMs, which only consider ungapped kmer features. Noise characters inside kmers can disturb the character weighting. A second round of machine learning using these motif candidates enables a systematic evaluation of all of these candidates and returns a simpler and

[Figure 4.1 graphic: the example sequence PLISWRAKAEHED with the significant kmers' normalized weights stacked beneath it, the resulting per-character accumulated weights, and the motif candidates {AK, WXAK, WXAKXE, ...} grown from the highest-weight positions.]

Figure 4.1: An example of feature extraction guided by the first-round SVM. Different kmers and their normalized feature weights are stacked according to the original sequence. The accumulated weight aW can be computed for each character accordingly. As a result, motif candidates can be extracted from the positions with high |aW|.

more accurate classifier. There are several choices of learning algorithms that can be employed in the second round of learning. In this work, we choose SVM and decision tree learning as examples to demonstrate the usefulness of our proposed approach. The SVM ranks the significance of each motif candidate in differentiating the positive sequence set from the negative one. Decision tree learning yields a simple, interpretable classifier for the two sets of sequences.
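A minimal sketch of how the second-round feature matrix could be assembled, assuming candidates use 'X' as a one-character wildcard as in Figure 4.1; the helper names are ours, not from the thesis implementation, and a real pipeline would feed these 0/1 vectors to the SVM or decision tree learner.

```python
import re

def candidate_to_regex(candidate):
    """Compile a gapped candidate such as 'CCAXXXGG', where 'X' is a
    single-character wildcard, into a regular expression."""
    return re.compile("".join("." if c == "X" else re.escape(c)
                              for c in candidate))

def presence_features(seqs, candidates):
    """One 0/1 presence feature per motif candidate per sequence; these
    vectors are what the second-round learner would train on."""
    patterns = [candidate_to_regex(c) for c in candidates]
    return [[1 if p.search(s) else 0 for p in patterns] for s in seqs]
```

For example, with candidates CCAXXXGG and AAAA, the sequence TTCCATAGGGT matches only the first and AAAACCCC only the second.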

4.2.2 Implementations

In the first-round learning, each kmer-SVM was constructed with a linear kernel using SVC (C-Support Vector Classification) from scikit-learn13's SVM API, with default values for the training parameters. Given a fixed value of k, original sequences in both the positive and negative sequence sets were encoded into a kmer representation, the histogram of every possible sliding kmer in a sequence. The kmer counts were then treated as input features. The SVM feature weights averaged over the models from ten-fold cross-validation were treated as the SVM feature weights in later calculations. Similar to the first-round learning, SVM with ten-fold cross-validation was employed in the second-round learning. To save computational time, features (frequencies of the kmers or motif candidates) with a variance less than 1.0 × 10^-6 were removed in both rounds of SVM learning.
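The thesis used scikit-learn's SVC for the actual training; the dependency-free sketch below illustrates only the two preprocessing steps described here, the sliding k-mer histogram encoding and the low-variance feature filter (the function names and defaults are our own).

```python
from collections import Counter
from itertools import product

def kmer_histogram(seq, k, alphabet="ACGT"):
    """Encode a sequence as counts of every sliding k-mer, in a fixed
    lexicographic feature order (the representation fed to a kmer-SVM)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]

def nonconstant_features(X, tol=1e-6):
    """Indices of feature columns whose population variance is at least tol,
    mirroring the variance filter applied before both SVM rounds."""
    n = len(X)
    kept = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mean = sum(col) / n
        if sum((v - mean) ** 2 for v in col) / n >= tol:
            kept.append(j)
    return kept
```

For DNA with k = 2 this yields a 16-dimensional count vector per sequence; constant columns (variance below the tolerance) are then dropped before training.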

4.2.3 Data acquisition

To demonstrate the usefulness of the proposed methodology, three computationally synthesized data sets and one experimentally collected data set were used in this proof-of-principle study. Each synthesized data set contains positive and negative sets with the same distribution of sequence lengths. Sequences in both the positive and negative sets were initially randomly generated. Then, for each sequence in the positive set, the characters in a random region were replaced by a predefined motif pattern. As such, we obtained two sequence sets that allow us to test whether the proposed method could find the motif we actually embedded. The exact forms of these predefined motif patterns are given in the following sections. In addition, experimentally collected sequences were employed: 2,500 sequences from the aptamer-thrombin complex window and 2,500 negative sequences from the free-DNA window in a capillary electrophoresis selection experiment14 were respectively treated as the positive and negative sets.
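A sketch of the synthetic-set construction described above, under the Gaussian length model used in the case studies; 'X' positions in the motif are left as random background, and the function name, defaults, and seed handling are illustrative assumptions rather than the thesis code.

```python
import random

def make_pair(n, motif, mean_len=100, sd=1, alphabet="ACGT", seed=0):
    """Generate n positive sequences (random background with `motif` written
    over a random region; 'X' positions stay random) and n plain random
    negatives, with Gaussian-distributed lengths."""
    rng = random.Random(seed)

    def rand_seq():
        length = max(len(motif), round(rng.gauss(mean_len, sd)))
        return [rng.choice(alphabet) for _ in range(length)]

    pos, neg = [], []
    for _ in range(n):
        s = rand_seq()
        start = rng.randrange(len(s) - len(motif) + 1)
        for i, c in enumerate(motif):
            if c != "X":
                s[start + i] = c  # overwrite only the fixed motif positions
        pos.append("".join(s))
        neg.append("".join(rand_seq()))
    return pos, neg
```

With a fully specified motif (no X), every positive sequence is guaranteed to contain the embedded pattern, while negatives may contain it only by chance.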

4.3 Results

4.3.1 Case study 1: CCAXXXGGXG

200 DNA sequences with a common embedded pattern of CCAXXXGGXG were generated as the positive set. The character "X" represents any DNA base in {A, C, G, T}. Meanwhile, 200 random DNA sequences were generated as the negative set. During the generation of these sequences, the lengths of the positive and negative sequences were drawn from a Gaussian distribution with a mean of 100 and a standard deviation of 1. In the first round of learning, multiple kmer-SVMs were constructed using different

k values ranging from 3 to 11. By mapping significant kmers (a_c = 0.5) onto each positive sequence, a list of gapped and ungapped motif candidates was extracted. Here, we used the F1 score as the stopping criterion for motif expansion: if adding a base results in a candidate motif with an F1 score below 0.5, the expansion process stops.

Table 4.1: Evaluation of motif candidates in case study 1. Motif candidates that have top 5 greatest SVM weights in the second-round SVM learning are listed.

Motif candidate   Weig.   Acc.   Sen.   Spec.   Prec.   F1
CCAXXXGGXG   2.684   0.993   1.000   0.985   0.985   0.993
CCAXXXGG   2.602   0.975   1.000   0.950   0.952   0.976
CCXXXXGG   2.490   0.875   1.000   0.750   0.800   0.889
CCAXXXG   2.476   0.898   1.000   0.795   0.830   0.907
CAXXXXGG   2.410   0.870   1.000   0.740   0.794   0.885
* The abbreviations on the first row respectively denote weight, accuracy, sensitivity, specificity, and precision.

As seen in Table 4.1, the embedded pattern CCAXXXGGXG was identified as the top-ranked motif candidate. The ranking was based on the mean SVM weight over the 10-fold cross-validation in the second-round learning. Meanwhile, because of the way we employ heuristics for motif candidates, gapped substrings of the embedded string were also found during the SVM-guided feature extraction. In addition, Table 4.1 reveals that the random negative set can also contain these motif candidates by chance, at low rates.
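For reference, the evaluation columns in these tables follow the standard confusion-matrix definitions. The sketch below recomputes the CCAXXXGG row of Table 4.1 from the counts that row implies (200 positives with no false negatives, 200 negatives with 10 false positives); the function name is ours.

```python
def evaluate(tp, fp, tn, fn):
    """Acc., Sen., Spec., Prec., and F1 as reported in the candidate tables."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sen = tp / (tp + fn)    # sensitivity (recall)
    spec = tn / (tn + fp)   # specificity
    prec = tp / (tp + fp)   # precision
    f1 = 2 * prec * sen / (prec + sen)
    return acc, sen, spec, prec, f1

# Counts implied by the CCAXXXGG row of Table 4.1 (200 positives, 200 negatives):
acc, sen, spec, prec, f1 = evaluate(tp=200, fp=10, tn=190, fn=0)
```

Rounded to three decimals, these counts reproduce the tabulated row: accuracy 0.975, sensitivity 1.000, specificity 0.950, precision 0.952, and F1 0.976.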

4.3.2 Case study 2: CTCCC(8X)ATT

In this example, we tested whether the proposed methodology could identify a dyad motif, a motif containing many gaps in the middle. 100 positive random sequences with CTCCCXXXXXXXXATT (denoted CTCCC(8X)ATT) at a random position and 100 negative random sequences were generated as input for testing. Sequence lengths follow a Gaussian distribution with a mean of 100 and a standard deviation of 1. To see whether our proposed method allows us to find long motifs, we constructed multiple kmer-SVMs using k from 3 to 8. These k values cannot cover the whole dyad motif to be found. After testing different SVM feature weight thresholds, a_c = 0.3 was used to select significant kmers to map back onto each positive sequence. This relatively small value preserves more regions of the sequence for extracting motif candidates. The stopping criterion for motif expansion is the same as in case study 1.

Table 4.2: Evaluation of motif candidates in case study 2. Motif candidates that have top 5 greatest SVM weights in the second-round SVM learning are listed.

Motif candidate   Weig.   Acc.   Sen.   Spec.   Prec.   F1
CTCCC(8X)A   1.804   0.990   1.000   0.980   0.980   0.990
TCCC   1.768   0.890   1.000   0.780   0.820   0.901
CTCCC(8X)ATT   1.743   1.000   1.000   1.000   1.000   1.000
CTCCC(9X)TT   1.743   1.000   1.000   1.000   1.000   1.000
CTCCC(10X)T   1.737   0.995   1.000   0.990   0.990   0.995
* The abbreviations on the first row respectively denote weight, accuracy, sensitivity, specificity, and precision.

As shown in Table 4.2, the top 5 ranked motif candidates present similar average SVM weights in the second-round 10-fold cross-validation learning. The embedded pattern was found as the third-ranked motif candidate. The other motif candidates are also substrings of the embedded pattern. As seen in Table 4.2, CTCCC(8X)ATT and CTCCC(9X)TT result in no false positives or false negatives. Moreover, although the kmers used in the first-round learning are much shorter than the embedded dyad motif, the actual motif can still be extracted by our motif expansion algorithm.

4.3.3 Case study 3: SXPWK[AK]XP

In the third case study, we tested long protein sequences with alternative amino acids in the pattern. 100 positive and 100 negative protein sequences with Gaussian-distributed lengths were synthesized. As protein sequences have a larger alphabet than DNA/RNA, a protein sequence is more informative (less likely to occur by chance) than a DNA/RNA sequence of the same length. To increase the difficulty of this case study and demonstrate the ability to find short motifs in long sequences, the mean and standard deviation of the sequence lengths were set to 5000 and 100 respectively. The randomly embedded pattern in the positive sequences is SXPWK[AK]XP, whose sixth amino acid can be either A or K.

Table 4.3: Evaluation of motif candidates in case study 3. Motif candidates that have top 10 greatest SVM weights in the second-round SVM learning are listed.

Motif candidate   Weig.   Acc.   Sen.   Spec.   Prec.   F1
PXK   1.613   0.500   1.000   0.000   0.500   0.667
PWK   1.239   0.785   1.000   0.570   0.699   0.823
PW   1.113   0.500   1.000   0.000   0.500   0.667
SXP   1.003   0.500   1.000   0.000   0.500   0.667
IS   0.652   0.500   1.000   0.000   0.500   0.667
SXPWKA   0.642   0.760   0.520   1.000   1.000   0.684
PWKK   0.618   0.750   0.510   0.990   0.981   0.671
GK   0.605   0.500   1.000   0.000   0.500   0.667
PWKA   0.603   0.750   0.550   0.950   0.917   0.687
SXPWKK   0.594   0.740   0.480   1.000   1.000   0.649
* The abbreviations on the first row respectively denote weight, accuracy, sensitivity, specificity, and precision.

The initial learning, feature extraction, and second-round learning procedures were the same as in case study 2. As seen in Table 4.3, eight of the ten top-ranked motif candidates match a portion of the pattern we set for the positive sequences. The 5th and 8th motif candidates, IS and GK, do not appear to match any portion of SXPWK[AK]XP. Along with the other short motif candidates (i.e., PXK, PWK, PW, and SXP), these candidates present low specificity, which means they can occur by chance in most of the roughly 5000-residue-long random protein sequences in the negative set. However, the four remaining motif candidates in Table 4.3, i.e., SXPWKA, PWKK, PWKA, and SXPWKK, were hardly found in the negative sequences. These four motif candidates match the majority of SXPWK[AK]XP. The alternative residues were respectively identified by two groups of motif candidates, {SXPWKA, PWKA} and {PWKK, SXPWKK}. In particular, SXPWKA and SXPWKK present 100% precision. Meanwhile, these two motif candidates also respectively exhibit 52% and 48% sensitivity. These evaluation scores demonstrate that, even without

the missing P in the tail of the set pattern, the positive and negative sequence sets can still be perfectly distinguished based on the occurrences of SXPWKA and SXPWKK.

4.3.4 Case study 4: Binding motifs in aptamer selection experiments

In the last case study presented in this work, we examined the proposed methodology with data from an aptamer selection experiment. As reported by Riley et al.14, capillary electrophoresis experiments were performed using a mixture of a known 29-base-long thrombin-binding aptamer (29-TBA), random 29mer ssDNA, and thrombin. Two samples of DNA, collected in the thrombin-DNA complex window and the unbound DNA window, were sequenced. To check whether our method could identify the known aptamer de novo and provide insight into the common binding component, we treated 2,500 sequences sampled from the complex window and 2,500 sequences sampled from the unbound DNA window as the positive and negative sequence sets respectively. As suggested by Riley et al.14, the complex window should contain many non-specifically bound randomers in addition to the known 29-TBA.

Table 4.4: Evaluation of motif candidates in case study 4. Motif candidates with the top 10 greatest enrichment factors are listed.

Motif candidate   Complex%   Unbound%   EF
GTCACCCCAAC   0.339   0.004   77.000
GTCACCCCAACC   0.337   0.004   76.545
GTCACCCCXA   0.342   0.005   71.167
GTCACCCCAA   0.341   0.005   71.083
GTCAC(4X)ACC   0.339   0.005   70.667
AGTCACCC   0.366   0.005   70.462
GTCACCCXXA   0.342   0.005   65.769
GTCA(5X)ACC   0.340   0.005   65.385
CCAACC(4X)C   0.338   0.005   65.077
ATT(14X)ACCXCA   0.338   0.005   64.923
* Complex% and Unbound% respectively denote the frequency of the motif candidate in the complex and unbound sequence sets. EF represents the enrichment factor, the ratio of Complex% to Unbound%.

Using the same workflow described in case study 1, we identified a list of discriminating motif candidates in the positive set. To compare with the reported enrichment factor, we evaluated these motif candidates via the ratio of each candidate's frequency in the positive set to its frequency in the negative set.
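A sketch of this enrichment-factor evaluation, assuming the motif notation of Table 4.4 in which '(4X)' abbreviates four arbitrary bases and a bare 'X' one; the helper names are ours, not from the thesis code, and the ratio is undefined when a motif never occurs in the unbound set.

```python
import re

def motif_to_regex(motif):
    """Expand the table's motif notation, e.g. 'GTCAC(4X)ACC' or 'CCAXXXGG',
    into a compiled regex: '(nX)' and 'X' stand for arbitrary characters."""
    pattern = re.sub(r"\((\d+)X\)", lambda m: "." * int(m.group(1)), motif)
    return re.compile(pattern.replace("X", "."))

def enrichment_factor(motif, complex_seqs, unbound_seqs):
    """EF: fraction of complex-window sequences containing the motif divided
    by the fraction of unbound-window sequences containing it."""
    p = motif_to_regex(motif)
    f_complex = sum(bool(p.search(s)) for s in complex_seqs) / len(complex_seqs)
    f_unbound = sum(bool(p.search(s)) for s in unbound_seqs) / len(unbound_seqs)
    return f_complex / f_unbound  # raises ZeroDivisionError if no unbound hits
```

For instance, a motif found in 3 of 4 complex-window sequences and 1 of 2 unbound sequences has EF = 0.75 / 0.5 = 1.5.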

Figure 4.2: A complex of thrombin and aptamer. The structure is visualized using VMD 1.9.215 according to PDB 4I6Y16. The thrombin is colored in green. The aptamer is colored in yellow. Bases corresponding to GTCACCCCAAC are highlighted by the red Licorice representation.

As indicated in Table 4.4, the greatest enrichment factor is consistent with the value reported in the previous literature14. It can also be seen that the most enriched motifs occur in only about 33% of the sequences in the complex set, suggesting a large amount of noise from nonspecific binding activities. Note that the sequencer used in the work by Riley et al.14 reported the reverse complement of the actual sequence, and the reverse complement of the 29-TBA is 3'-AGTCACCCCAACCTGCCCTACCACGGACT-5'. We see that the enriched motif candidates match the 3' end of the reverse complement of the 29-TBA, indicating that the ten most enriched motif sequences correspond to the 3' end of the actual 29-TBA sequence. Structurally, as displayed in Figure 4.2, the most enriched motif GTCACCCCAAC corresponds to the bases highlighted in red, which form the majority of the binding interface between the thrombin (green) and the aptamer (yellow).

Figure 4.3: Accuracy comparisons of different sequence classifiers. Accuracies in 10-fold cross-validation of each classifier on sequences from case study 1 are displayed as box plots. kmer-SVMs are indicated by the first 9 entries. The next three entries present gkm-SVMs using l=10 and k=4,5,6. The last entry indicates the second-round SVM using the features chosen via guidance from the first-round SVM learning.

4.4 Discussion

4.4.1 Sequence classifier comparisons

The feature extraction based on the first-round SVM learning provides a list of discriminating motif candidates. These candidates can be both ungapped and gapped. As seen in all three case studies using synthesized data, the embedded gapped motifs can usually be identified within a list of motif candidates. The number of these motif candidates is usually much smaller than the number of kmers as well as the number of support vectors. As demonstrated in case study 1, the second-round learning using the extracted motif candidates results in a significantly more accurate classifier than any kmer-SVM from the first-round learning (Figure 4.3). Meanwhile, although previous literature reported that gkm-SVM17,18 can provide fairly accurate classification regardless of the choice of l and k (i.e., the total length of a gapped kmer and its

number of non-gap characters)17, we see that our SVM-guided classifier outperforms gkm-SVM with respect to classification accuracy (Figure 4.3). There are two major tunable parameters in our proposed method: the kmers' SVM weight cutoff and the evaluation metric threshold for motif candidates. To preserve potential gapped motifs, we recommend picking small values for these thresholds. Although such a relatively loose strategy can produce more false-positive motif candidates, the second-round learning should eliminate their influence.

4.4.2 Limitations

We have seen cases where the actual embedded motif is not the top-ranked motif candidate. This suggests a need for performing recursive feature elimination19 before training the final classifier model. In addition, the feature extraction algorithm presented in this work may not find the complete actual motif. As seen in case study 3, head and tail characters may be missed, as their accumulated weights tend to be smaller than those of characters in the middle. A robust correction is needed to eliminate this artificial effect caused by summing the stacked kmers' SVM weights. Similar to other de novo motif discovery tools, our proposed method may not be able to find discriminating motifs if the signal-to-noise ratio of the input sequences is too low. For example, if a motif is not informative enough, meaning it occurs frequently by chance in the background set, the method presented in this work may not be able to detect it as a motif candidate.

4.4.3 Robustness of the methods

The motifs identified by our two-round learning algorithm should distinguish the positive sequences from the negative ones. When there is no negative sequence input, in order to identify the representative motifs embedded in the positive sequences, we have to generate a random sequence set as the negative set for training. In this situation, the identified motifs are expected to be independent of the negative set. However, in our tests, the motifs returned by our algorithm can be sensitive to the training negative set. We tested another way of generating the random negative set: shuffling the positive sequences with the Fisher-Yates shuffle algorithm20. This algorithm randomly places each character of a sequence into a new position, producing a random sequence. The motif candidates identified using these negative sets for case studies 1-3 are listed in Tables 4.5, 4.6, and 4.7.
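The shuffle used to build these composition-preserving negatives can be sketched as follows (the function name and the choice of a seeded generator are ours):

```python
import random

def fisher_yates_shuffle(seq, rng=random):
    """Fisher-Yates: walk from the last position down, swapping each element
    with a uniformly chosen position at or before it, so every permutation of
    the input is equally likely. Shuffling a positive sequence yields a
    negative with identical character composition but randomized order."""
    chars = list(seq)
    for i in range(len(chars) - 1, 0, -1):
        j = rng.randrange(i + 1)
        chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)
```

Because only positions are permuted, the shuffled negative keeps exactly the same base counts as its source sequence, which is what makes it a composition-matched background.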

Table 4.5: Evaluation of motif candidates in case study 1 (shuffled negative sequences). Motif candidates that have top 5 greatest SVM weights in the second-round SVM learning are listed.

Motif candidate   Weig.   Acc.   Sen.   Spec.   Prec.   F1
CCA(3X)GG   5.449   0.920   1.000   0.840   0.862   0.926
CCA(6X)G   3.893   0.783   1.000   0.565   0.697   0.821
CCA(4X)G   3.890   0.783   1.000   0.565   0.697   0.821
CCA   2.320   0.563   1.000   0.125   0.533   0.696
T(7X)CCA(6X)G   1.655   0.638   0.345   0.930   0.831   0.488
* The abbreviations on the first row respectively denote weight, accuracy, sensitivity, specificity, and precision.

Table 4.6: Evaluation of motif candidates in case study 2 (shuffled negative sequences). Motif candidates that have top 5 greatest SVM weights in the second-round SVM learning are listed.

Motif candidate   Weig.   Acc.   Sen.   Spec.   Prec.   F1
CTCCC(8X)A   2.149   0.980   1.000   0.960   0.962   0.980
CTCCC(8X)AXT   2.133   1.000   1.000   1.000   1.000   1.000
CTCCC(9X)TT   2.112   1.000   1.000   1.000   1.000   1.000
CTCCC(8X)ATT   2.112   1.000   1.000   1.000   1.000   1.000
CTCCC(8X)AT   2.112   1.000   1.000   1.000   1.000   1.000
* The abbreviations on the first row respectively denote weight, accuracy, sensitivity, specificity, and precision.

The motif candidates and their order in Tables 4.5, 4.6, and 4.7 are not exactly the same as those in Tables 4.1, 4.2, and 4.3, although they match the known motifs we specified. As the motifs we embedded in case studies 1-3 are much shorter than each sequence, the positive and negative sequences should have the same, or approximately the same, character distributions regardless of how we generate the negative sets. Therefore, the differences between these tables (Table 4.5 vs. Table 4.1, Table 4.6 vs. Table 4.2, and Table 4.7 vs. Table 4.3) suggest uncertainty in our machine-learning-based de novo motif identification. Further studies are needed to improve its robustness.

4.5 Conclusions

In this work, we illustrate a new de novo motif identification approach using machine learning. Taking advantage of the heuristics from the first-round kmer-SVM learning, ungapped and gapped kmer features can be obtained. These features not only reveal

Table 4.7: Evaluation of motif candidates in case study 3 (shuffled negative sequences). Motif candidates that have top 10 greatest SVM weights in the second-round SVM learning are listed.

Motif candidate   Weig.   Acc.   Sen.   Spec.   Prec.   F1
PW   1.111   0.500   1.000   0.000   0.500   0.667
CH   1.087   0.500   1.000   0.000   0.500   0.667
PWK   0.939   0.785   1.000   0.570   0.699   0.823
PN   0.819   0.500   1.000   0.000   0.500   0.667
LV   0.685   0.500   1.000   0.000   0.500   0.667
KG   0.674   0.500   1.000   0.000   0.500   0.667
PXK   0.639   0.500   1.000   0.000   0.500   0.667
PWKA   0.554   0.765   0.550   0.980   0.965   0.701
PWKK   0.505   0.740   0.510   0.970   0.944   0.662
IXC   0.485   0.500   1.000   0.000   0.500   0.667
* The abbreviations on the first row respectively denote weight, accuracy, sensitivity, specificity, and precision.

discriminating motifs in the input sequences but also result in a more accurate and more efficient sequence classifier.

References

[1] François Spitz and Eileen E. M. Furlong. Transcription factors: from enhancer binding to developmental control. Nature Reviews Genetics, 13(9):613–626, 2012. ISSN 1471-0056.

[2] G L Semenza and G L Wang. A nuclear factor induced by hypoxia via de novo protein synthesis binds to the human erythropoietin gene enhancer at a site required for transcriptional activation. Molecular and Cellular Biology, 12(12): 5447–5454, 1992. ISSN 0270-7306.

[3] T. Aschacher, B. Wolf, F. Enzmann, P. Kienzl, B. Messner, S. Sampl, M. Svoboda, D. Mechtcheriakova, K. Holzmann, and M. Bergmann. LINE-1 induces hTERT and ensures telomere maintenance in tumour cell lines. Oncogene, 35(1):94–104, 2016. ISSN 14765594. doi: 10.1038/onc.2015.65.

[4] Angela F Harper, Janelle B Leuthaeuser, Patricia C Babbitt, John H Morris, Thomas E Ferrin, Leslie B Poole, and Jacquelyn S Fetrow. An atlas of peroxiredoxins created using an active site profile-based approach to functionally relevant clustering of proteins. PLoS computational biology, 13(2):e1005284, 2017.

[5] Kimberly J Nelson, Stacy T Knutson, Laura Soito, Chananat Klomsiri, Leslie B Poole, and Jacquelyn S Fetrow. Analysis of the peroxiredoxin family: using active-site structure and sequence information for global classification and residue analysis. Proteins: Structure, Function, and Bioinformatics, 79(3):947–964, 2011.

[6] Christian J. A. Sigrist, Edouard de Castro, Lorenzo Cerutti, Béatrice A. Cuche, Nicolas Hulo, Alan Bridge, Lydie Bougueleret, and Ioannis Xenarios. New and continuing developments at prosite. Nucleic Acids Research, 41(D1):D344–D347, 2013.

[7] L C Bock, L C Griffin, J A Latham, E H Vermaas, and J J Toole. Selection of single-stranded DNA molecules that bind and inhibit human thrombin. Nature, 355(6360):564–566, 1992. ISSN 0028-0836.

[8] Jianjun Hu, Bin Li, and Daisuke Kihara. Limitations and potentials of current motif discovery algorithms. Nucleic acids research, 33(15):4899–4913, 2005.

[9] Modan K Das and Ho-Kwok Dai. A survey of dna motif finding algorithms. In BMC bioinformatics, volume 8, page S21. BioMed Central, 2007.

[10] Eran Eden, Doron Lipson, Sivan Yogev, and Zohar Yakhini. Discovering motifs in ranked lists of dna sequences. PLoS computational biology, 3(3):e39, 2007.

[11] Rodger Staden. Methods for calculating the probabilities of finding patterns in sequences. Bioinformatics, 5(2):89–96, 1989.

[12] Timothy L Bailey, Charles Elkan, et al. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. 1994.

[13] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[14] Kathryn R Riley, Jason Gagliano, Jiajie Xiao, Kara Libby, Shingo Saito, Guo Yu, Roger Cubicciotti, Jed Macosko, Christa L Colyer, Martin Guthold, et al. Combining capillary electrophoresis and next-generation sequencing for aptamer selection. Analytical and bioanalytical chemistry, 407(6):1527–1532, 2015.

[15] William Humphrey, Andrew Dalke, and Klaus Schulten. Vmd: visual molecular dynamics. Journal of molecular graphics, 14(1):33–38, 1996.

[16] C Nicklaus Steussy, Chandra J Critchelow, Tim Schmidt, Jung-Ki Min, Louise V Wrensford, John W Burgner, Victor W Rodwell, and Cynthia V Stauffacher. A novel role for coenzyme a during hydride transfer in 3-hydroxy-3-methylglutaryl- coenzyme a reductase. Biochemistry, 52(31):5195–5205, 2013.

[17] Mahmoud Ghandi, Dongwon Lee, Morteza Mohammad-Noori, and Michael A Beer. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS computational biology, 10(7):e1003711, 2014.

59 [18] Mahmoud Ghandi, Morteza Mohammad-Noori, Narges Ghareghani, Dongwon Lee, Levi Garraway, and Michael A Beer. gkmsvm: an r package for gapped- kmer svm. Bioinformatics, 32(14):2205–2207, 2016.

[19] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene se- lection for cancer classification using support vector machines. Machine learning, 46(1-3):389–422, 2002.

[20] Ronald Aylmer Fisher, Frank Yates, et al. Statistical tables for biological, agri- cultural and medical research. Statistical tables for biological, agricultural and medical research., (Ed. 3.), 1949.

Chapter 5

Future directions

5.1 Algorithm limitations and improvements

In this thesis, we have illustrated the effectiveness of the proposed kmer-based machine learning methods on the problems of sequence classification and de novo motif identification. Whether using the naive kmer-SVM approach or the SVM-guided two-round learning, training on protein sequences usually produces a more accurate model than training on nucleic acid sequences. This observation is likely due to the lower signal-to-noise ratio in the nucleic acid sequence sets: nucleic acids draw from a four-character alphabet, whereas proteins draw from twenty. For a given kmer length k, the likelihood of observing a particular combination of k characters by chance is therefore much higher for DNA/RNA sequences than for proteins, and kmer occurrence features may fail to distinguish foreground nucleic acid sequences from the background. Consequently, when dealing with nucleic acid sequences, longer motifs should be sought and larger values of k must be considered to achieve good classification. However, as k increases, the kmer feature space becomes sparse. This sparsity not only increases the space complexity but can also, through over-fitting, lead to improper estimates of the importance of different regions along the nucleic acid sequences. The gapped-kmer kernel introduced by Beer's lab [1, 2, 3] may therefore be needed in the first-round learning when faced with DNA/RNA sequences.

Meanwhile, we have observed that the motif identified in the second-round learning can be only a portion of the exact predefined motif. The missing characters usually occur at the head and/or tail of the actual motif. This result indicates that the way we obtain heuristics for motif candidates may introduce unwanted artifacts. The operation of stacking kmers with large-magnitude SVM weights tends to produce smaller accumulated weights for the head/tail characters than for the middle characters of candidate motifs. Note that a candidate motif is grown from high-weighted to low-weighted residues until adding a new character causes a decrease in a selected quality metric. Such artificially lowered weights at the head and tail positions may cause a critical gapped-kmer feature to be missed. In future work, it will be necessary to devise a robust way to eliminate this unwanted effect and achieve a more accurate estimate of each character's importance. Recursive feature elimination may help remove insignificant features.

In addition, systematic comparisons of the SVM-guided motif discovery approach against probabilistic approaches such as MEME [4] are needed. In initial tests with MEME on our toy case-study data sets, we observe that MEME generates plausible motifs. In our tests with MEME on the Prx data, however, the detected motifs are usually much noisier than the motifs identified by our approach. Moreover, as seen in other studies, MEME may not always detect significant DNA motifs [5], which raises concerns about convergence to a local maximum during the sampling in the expectation-maximization process. Therefore, for the open and pressing problem of de novo motif identification, additional comparisons between our machine-learning-based method and statistical-modeling-based ones need to be made.
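To make the alphabet-size and sparsity arguments above concrete, the following sketch counts exact and gapped kmers for toy sequences. This is a minimal illustration only, not the gkm-SVM implementation: the function names and the "." wildcard encoding are our own choices, and the actual gkm-SVM software uses far more efficient tree-based counting.

```python
from collections import Counter
from itertools import combinations

def kmer_counts(seq, k):
    """Count contiguous k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def gapped_kmer_counts(seq, k, m):
    """Count gapped k-mers: length-k windows with m positions replaced
    by a '.' wildcard, collapsing many exact k-mers onto one feature."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        for gaps in combinations(range(k), m):
            pattern = "".join("." if j in gaps else c
                              for j, c in enumerate(window))
            counts[pattern] += 1
    return counts

# The DNA alphabet (4 letters) yields far fewer distinct exact k-mers than
# the protein alphabet (20 letters), so chance matches are much more likely:
print(4 ** 8, 20 ** 8)  # 65536 vs 25600000000

# Gapped k-mers let k grow without an equally explosive feature space,
# since distinct exact windows can share wildcard patterns:
print(gapped_kmer_counts("ACGTACGT", k=6, m=2).most_common(3))
```

Because one gapped pattern matches many exact kmers, the feature vectors stay denser at large k, which is the property that makes the gapped-kmer kernel attractive here.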

5.2 Potential biological applications

Our kmer-based sequence classification provides a quantitative ranking of kmers along sequences according to their ability to differentiate the foreground sequences from the background. In the Peroxiredoxin subgroup classification problem, we identified important kmers beyond the active sites. Given the conservation of these sites, it is very likely that they serve as regulatory sites. As such, our kmer-based sequence classification may offer a new way to identify functionally conserved sites, which may also be useful inputs for structure-based drug discovery.

Similarly, the proposed methodology should help in understanding gene regulation by identifying binding motifs in various chromatin immunoprecipitation (ChIP) experiments. In high-throughput drug screening against combinatorial drug candidates, the identification of common target-binding motifs may also facilitate further drug design and optimization.

Although the methodology proposed in this thesis is based on supervised learning, the novel feature extraction approach guided by the first-round SVM learning reveals discriminating motif features that differentiate the foreground and background sequences. These motif features need not represent the whole foreground set. Therefore, by constructing the background set from random sequences, our two-round learning approach should be able to extract representative motif features for all potential subgroups within the foreground sequences. In other words, it is possible to identify motif features for subgroups of a given sequence set. These features can serve as useful input for a clustering analysis, so our methodology can also be applied to unsupervised learning tasks such as identifying the family of each Prx subgroup.

5.3 Other potential applications

Most multiple sequence alignment methods nowadays employ a heuristic search known as progressive alignment, in which pairwise alignments over the full sequences are performed first to identify regions of similarity [6, 7]. However, these full-sequence pairwise alignments do not take into account the sequence conservation across all sequences in the set. We therefore propose an SVM-guided progressive alignment method for future study. The kmer-based sequence classification offers a quantitative estimate of the critical regions along each sequence in a set, based on their conservation. These critical regions should be aligned with high priority, and the multiple sequence alignment should then grow progressively. The proposed SVM-guided multiple sequence alignment is similar in spirit to BLAST [8], which uses kmer similarity as a heuristic to achieve efficient local alignment.

Furthermore, the proposed sequence-classification method can be generalized to other types of one-dimensional data. For example, a time series can be treated as a sequence of continuous or discretized values. By defining a time window τ, our proposed two-round learning should help uncover kinetic patterns among such time series.
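A minimal sketch of the time-series idea above: discretize the values into a symbolic alphabet, then slide a window of width τ to obtain kmers that the classification machinery can consume. The bin count, alphabet, and equal-width binning below are illustrative assumptions, not a prescription from the thesis.

```python
import numpy as np

def discretize(series, n_bins=4, alphabet="abcd"):
    """Map each value to a symbol via equal-width binning over the range."""
    lo, hi = min(series), max(series)
    edges = np.linspace(lo, hi, n_bins + 1)
    # Use interior edges only, so digitize yields bin indices 0..n_bins-1.
    idx = np.clip(np.digitize(series, edges[1:-1]), 0, n_bins - 1)
    return "".join(alphabet[i] for i in idx)

def window_kmers(symbols, tau):
    """Extract all length-tau windows (kmers) from the symbolic sequence."""
    return [symbols[i:i + tau] for i in range(len(symbols) - tau + 1)]

symbols = discretize([0.0, 0.1, 0.9, 1.0, 0.2, 0.8])
kmers = window_kmers(symbols, tau=3)  # windows of width tau, as in the text
```

Once the series is symbolic, the resulting kmers can be fed to the same two-round learning pipeline unchanged, which is what makes the generalization straightforward.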

References

[1] Mahmoud Ghandi, Morteza Mohammad-Noori, Narges Ghareghani, Dongwon Lee, Levi Garraway, and Michael A Beer. gkmSVM: an R package for gapped-kmer SVM. Bioinformatics, 32(14):2205–2207, 2016.

[2] Mahmoud Ghandi, Dongwon Lee, Morteza Mohammad-Noori, and Michael A Beer. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS computational biology, 10(7):e1003711, 2014.

[3] Mahmoud Ghandi, Morteza Mohammad-Noori, and Michael A Beer. Robust k-mer frequency estimation using gapped k-mers. Journal of mathematical biology, 69(2):469–500, 2014.

[4] Timothy L Bailey, Charles Elkan, et al. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. 1994.

[5] Jianjun Hu, Bin Li, and Daisuke Kihara. Limitations and potentials of current motif discovery algorithms. Nucleic acids research, 33(15):4899–4913, 2005.

[6] Desmond G Higgins and Paul M Sharp. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73(1):237–244, 1988.

[7] Cédric Notredame, Desmond G Higgins, and Jaap Heringa. T-Coffee: a novel method for fast and accurate multiple sequence alignment. Journal of molecular biology, 302(1):205–217, 2000.

[8] Stephen F Altschul, Thomas L Madden, Alejandro A Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25(17):3389–3402, 1997.

Jiajie Xiao | Curriculum Vitae

2551A Owen Dr, Winston-Salem, NC 27106
(336) 734-8738 • [email protected] • https://www.linkedin.com/in/jiajie-xiao/

Education

Ph.D., Physics, Wake Forest University, Winston-Salem, NC, 2018. Specialized in Computational Biophysics. (Advisor: Freddie R. Salsbury, Jr.)
M.S., Computer Science, Wake Forest University, Winston-Salem, NC, 2018. Specialized in Bioinformatics and Machine Learning. (Advisor: William H. Turkett, Jr.)
Bachelor of Science, Physics, Sun Yat-sen University, Guangzhou, China, 2012.

Computational skills

Python, C/C++, CUDA, R, Matlab, TCL, LaTeX, BASH-like shells, Pandas, Seaborn, Matplotlib, Scikit-learn, TensorFlow, Biopython, Bowtie2, Galaxy, ACEMD, NAMD, CHARMM, HTMD, VMD, AutoDock Vina, PyEmma, MSMBuilder, Cytoscape, clustered computing environments, AWS, Jupyter Notebook, etc.

Related Work Experience

Computational Biophysics research fellow, Wake Forest University, Winston-Salem, NC, 2013–present.
Machine learning and Bioinformatics research fellow, Wake Forest University, Winston-Salem, NC, 2016–present.
Research Assistant in Bioinformatics and Data Science, NanoMedica, LLC, Winston-Salem, NC, 2013–2014.
Teaching Assistant, Wake Forest University, Winston-Salem, NC, 2012–present.

Honors and Awards

2017–2018: Center of Molecular Signaling Fellowship, awarded at Wake Forest Univ.
2017: First place in the University Photography Contest, awarded in Winston-Salem.
2017: Graduate Research Fellowship, awarded at Wake Forest Univ.
2017: Elected to Sigma Pi Sigma (national society), awarded at Wake Forest Univ.
2015: Travel Award for APS March Meeting (international conference), awarded by Wake Forest Univ.

Relevant coursework

Physics: Quantum Mechanics, Statistical Mechanics, Classical Mechanics and Mathematical Methods, Electromagnetism, and Drug Discovery, Design & Development.
CS: GPU Programming, Machine Learning, Artificial Intelligence, Markov Chains & Algorithmic Applications, Theory of Algorithms, Theory of Computation, Software Engineering, Operating Systems, Bioinformatics, Computational Systems Biology.

Publications and selected manuscripts

2018: Xiao, J., Melvin, R. L., & Salsbury Jr., F. R. Probing light chain mutation effects on thrombin via molecular dynamics simulations and machine learning. Journal of Biomolecular Structure and Dynamics. (In press.)
2018: Xiao, J., & Salsbury Jr., F. R. New Na+-binding mode induces significant allosteric regulation on thrombin, as revealed by molecular dynamics simulations, correlation networks and hidden Markov modeling. Journal of Physical Chemistry B. (In submission.)
2018: Xiao, J., & Turkett, W. H. IntelligentMotifFinder: A tool for general sequence classification and motif identification without multiple sequence alignment. Nucleic Acids Research. (In preparation.)
2018: Xiao, J., & Turkett, W. H. K-mer based classifiers extract functionally relevant features to support accurate Peroxiredoxin sub-family distinction. Protein Science. (In preparation.)
2018: Melvin, R. L., Xiao, J., Berenhaut, K. S., Godwin, R. C., & Salsbury Jr., F. R. Using correlated motions to determine sufficient sampling times for molecular dynamics. Physical Review E. (In revision.)
2018: Xiao, J.*, Melvin, R. L.*, Berenhaut, K. S., & Salsbury Jr., F. R. Quantify allosteric sub-community of thrombin via molecular dynamics simulation and network analysis. Journal of Chemical Theory and Computation. (In preparation.)
2018: Melvin, R. L., Xiao, J., Godwin, R. C., Berenhaut, K. S., & Salsbury Jr., F. R. Visualizing correlated motion with HDBSCAN clustering. Protein Science, 27(1), 62–75.
2017: Xiao, J., Melvin, R. L., & Salsbury Jr., F. R. Mechanistic insights into thrombin's switch between "slow" and "fast" forms. Physical Chemistry Chemical Physics, 19(36), 24522–24533.
2017: Xiao, J., & Salsbury Jr., F. R. Molecular dynamics simulations of aptamer-binding reveal generalized allostery in thrombin. Journal of Biomolecular Structure and Dynamics, 35(15), 3354–3369.
2016: Melvin, R. L., Godwin, R. C., Xiao, J., Thompson, W. G., Berenhaut, K. S., & Salsbury Jr., F. R. Uncovering large-scale conformational change in molecular dynamics without prior knowledge. Journal of Chemical Theory and Computation, 12(12), 6130–6146.
2015: Riley, K. R., Gagliano, J., Xiao, J., Libby, K., Saito, S., Yu, G., ... & Bonin, K. Combining capillary electrophoresis and next-generation sequencing for aptamer selection. Analytical and Bioanalytical Chemistry, 407(6), 1527–1532.

Presentations

2018: J. Xiao and W. H. Turkett. "SVM-guided de novo motif identification and sequence classification." Talk at Structural and Computational Biophysics Discussion group, Winston-Salem, NC. March 2018.
2017: J. Xiao and F. R. Salsbury Jr. "Multiple Na+ binding modes revealed by molecular dynamics simulations." Poster at research retreat for the Center for Molecular Signaling, Winston-Salem, NC. December 2017.
2017: J. Xiao and W. H. Turkett. "K-mer Analysis Of Peroxiredoxin Subgroups." Poster at research retreat for the Center for Molecular Signaling, Winston-Salem, NC. December 2017.
2017: R. L. Melvin, R. C. Godwin, J. Xiao, W. G. Thompson, K. S. Berenhaut, and F. R. Salsbury Jr. "A Modern Approach to Determining and Displaying Conformational Ensembles." Poster at Conformational Ensembles from Experimental Data and Computer Simulations, Berlin, Germany. August 2017.
2017: J. Xiao, R. L. Melvin, and F. R. Salsbury Jr. "Probing the mechanism of the switch between the fast and slow forms of thrombin via molecular dynamics simulations." Poster at research retreat for the Center for Molecular Signaling, Winston-Salem, NC. May 2017.
2016: J. Xiao and F. R. Salsbury Jr. "Generalized Allostery in Thrombin." Poster at Biophysical Society 60th Annual Meeting, Los Angeles, CA. February 2016.
2015: J. Xiao and F. R. Salsbury Jr. "Aptamer-binding leads a generalized allosteric effect on thrombin." Poster at Structural and Computational Biophysics Symposium at Wake Forest University, Winston-Salem, NC. September 2015.
2015: J. Xiao, K. Bonin, M. Guthold, and F. R. Salsbury Jr. "Structure and Sequence Search on Aptamer-Protein Docking." Talk at American Physical Society March Meeting 2015, San Antonio, TX. March 2015.
2014: J. Xiao and F. R. Salsbury Jr. "Computational Studies of the Formation of Peroxiredoxin Dimers." Poster at Biophysical Society 58th Annual Meeting, San Francisco, CA. February 2014.