PREDICTING TRANSPORTER PROTEINS AND THEIR

SUBSTRATE SPECIFICITY

Munira Alballa

A Thesis

in

the Department

of

Computer Science and Software Engineering

Presented in Partial Fulfillment of the Requirements

For the Degree of Doctor of Philosophy (Computer Science)

Concordia University

Montréal, Québec, Canada

June, 2020

© Munira Alballa, 2020

Concordia University

School of Graduate Studies

This is to certify that the thesis prepared

By: Munira Alballa

Entitled: Predicting transporter proteins and their substrate specificity

and submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy (Computer Science)

complies with the regulations of this University and meets the accepted standards with respect to originality and quality.

Signed by the final examining committee:

Chair: Dr. Ciprian Alecsandru

External Examiner: Dr. Anthony J. Kusalik

Examiner: Dr. Leila Kosseim

Examiner: Dr. Adam Krzyzak

Examiner: Dr. Jia Yuan Yu

Supervisor: Dr. Gregory Butler

Approved by: Dr. Leila Kosseim, Graduate Program Director

June 5, 2020    Dr. Amir Asif, Dean, Faculty of Engineering and Computer Science

Abstract

Predicting transporter proteins and their substrate specificity

Munira Alballa, Ph.D.
Concordia University, 2020

The publication of numerous genome projects has resulted in an abundance of protein sequences, a significant number of which are still unannotated. Membrane proteins such as transporters, receptors, and enzymes are among the least characterized proteins due to their hydrophobic surfaces and lack of conformational stability. This research aims to build a proteome-wide system to determine transporter substrate specificity, which involves three phases: 1) distinguishing membrane proteins, 2) differentiating transporters from other functional types of membrane proteins, and 3) detecting the substrate specificity of the transporters.

To distinguish membrane from non-membrane proteins, we propose a novel tool, TooT-M, that combines the predictions from transmembrane topology prediction tools and a selective set of classifiers in which protein samples are represented by pseudo position-specific scoring matrix (Pse-PSSM) vectors. The results suggest that the proposed tool outperforms all state-of-the-art methods in terms of overall accuracy and Matthews correlation coefficient (MCC).

To distinguish transporters from other proteins, we propose an ensemble classifier, TooT-T, that is trained to optimally combine the predictions from homology annotation transfer and machine learning methods. The homology annotation transfer components detect transporters by searching against the transporter classification database (TCDB) using different thresholds. The machine learning methods include three models in which the protein sequences are encoded using a novel encoding, psi-composition. The results show that TooT-T outperforms all state-of-the-art de novo transporter predictors in terms of overall accuracy and MCC.

To detect the substrate specificity of a transporter, we propose a novel tool, TooT-SC, that combines compositional, evolutionary, and positional information to represent protein samples. TooT-SC can efficiently classify transport proteins into eleven classes according to their transported substrate, which is the highest number of predicted substrates offered by any de novo prediction tool. Our results indicate that TooT-SC significantly outperforms all of the state-of-the-art methods. Further analysis of the locations of the informative positions reveals that there are more statistically significant informative positions in the transmembrane segments (TMSs) than in the non-TMSs, and that more statistically significant informative positions occur close to the TMSs than in regions far from them.

Acknowledgments

First and foremost, I would like to thank Almighty Allah for His guidance, mercy and grace throughout my life and during this journey.

My deepest gratitude goes to my academic supervisor, Dr. Gregory Butler, for his time, valuable advice, and encouragement during this research project. I have been extremely lucky to work with a supervisor who cared about my work and my well-being as much as he did. He has demonstrated every aspect of what a good supervisor should be, and I aspire to do the same for my students one day.

I am forever indebted to my parents Suliman and Aljowhara, who instilled in me the value of education. I thank them for believing in me and always having my back. This research would not be possible without their prayers, sacrifice and support. I owe them everything.

I am sincerely and heartily grateful to my siblings: my eldest sister Hailah for her invaluable suggestions on my topic and her expertise, my sister Norah for always being there for me, and my younger siblings for their sweetness and humor that helped me get through the tough times.

I would like to thank my partners in this journey, my beloved husband Maan for his encouragement and support and my sweet son Sulaiman for putting up with me being a part-time mommy all these years. Thank you for making my days shine and giving me a reason to smile every day. I love you.

Last but not least, I would like to express my deepest appreciation to King Saud University and the Saudi Cultural Bureau for giving me the opportunity to continue my education and providing me with financial support.

Contents

List of Figures xi

List of Tables xiii

Glossary xv

Abbreviations xviii

1 Introduction 1
  1 Biological background 1
    1.1 Membrane protein functional classes 4
  2 Motivation 5
  3 Problem definition 6
  4 Objectives 9
  5 Outline 10

2 Literature review 13
  1 Identifying membrane proteins 13
    1.1 Transmembrane topology prediction 13
    1.2 Prediction of the membrane protein structural type 15
  2 Identifying transporter proteins 18
  3 Predicting the substrate specificity of transporters 19

3 Background 22
  1 Protein composition 22
    1.1 Amino acid composition (AAC) 23
    1.2 Pair amino acid composition (PAAC) 23
    1.3 Pseudo-amino acid composition (PseAAC) 24
  2 Databases 26
    2.1 Transporter classification database (TCDB) 26
    2.2 Universal Protein Resource Knowledgebase (UniProtKB) 27
  3 Ontologies 28
    3.1 Chemical entities of biological interest (ChEBI) 28
    3.2 Gene Ontology (GO) 30
    3.3 Evidence and Conclusion Ontology (ECO) 30
  4 Substitution matrix 31
  5 Basic Local Alignment Search Tool (BLAST) 33
  6 Multiple sequence alignment (MSA) 34

4 Identifying membrane proteins 38
  1 Introduction 38
  2 Materials and methods 41
    2.1 Dataset 41
    2.2 Topology prediction tools 42
    2.3 Protein sequence encoding 43
      2.3.1 Pseudo position-specific scoring matrix (Pse-PSSM) 43
      2.3.2 Split amino acid composition (SAAC) 44
    2.4 Machine learning algorithms 45
      2.4.1 K-nearest neighbor (KNN) 45
      2.4.2 Optimized evidence-theoretic k-nearest neighbor (OET-KNN) 45
      2.4.3 Support vector machine (SVM) 46
      2.4.4 Gradient-boosting machine (GBM) 46
    2.5 Ensemble classifier 47
      2.5.1 All voting 47
      2.5.2 Selective voting 48
    2.6 Performance measurement 49
  3 Experimental design 50
  4 Results and discussion 51
    4.1 Evaluation of different protein encodings 51
    4.2 Evaluation of the ensemble techniques 53
    4.3 Evaluation of transmembrane topology prediction tools 56
    4.4 Performance of TooT-M 56
    4.5 Comparison with the state-of-the-art methods 57
  5 Conclusion 59

5 Transporter protein detection 60
  1 Introduction 60
  2 TooT-T overview 62
  3 Materials and methods 63
    3.1 Dataset 63
    3.2 Position-specific iterative encodings 65
    3.3 SVM 68
    3.4 Annotation transfer by homology 69
    3.5 TooT-T methodology 69
    3.6 Performance evaluation 71
  4 Results and discussion 72
    4.1 Performance of the different encodings 72
    4.2 Performance of annotation transfer by homology 74
    4.3 TooT-T ensemble performance 75
    4.4 Comparative performance 77
  5 Conclusion 78

6 Ontology-based transporter substrate annotation for benchmark datasets 79
  1 Introduction 80
  2 Challenges 81
    2.1 Different transporter classifications 81
    2.2 Lack of documentation for manual class assignment 81
  3 The proposed tool: Ontoclass 83
    3.1 Overview 83
    3.2 Substrate classes to ChEBI-ID 83
    3.3 Mapping GO terms to substrate classes 88
    3.4 Class assignment with confidence 89
    3.5 Presenting the output 90
  4 Case studies 92
    4.1 Comparison with manually curated datasets 92
    4.2 Building a new dataset 95
  5 Discussion 96
  6 Conclusion 98

7 Predicting substrate specificity 100
  1 Introduction 101
  2 Materials and methods 102
    2.1 Dataset 102
    2.2 Databases 103
    2.3 Algorithm 103
    2.4 Encoding the amino acid composition 105
    2.5 MSA 106
    2.6 Positional information 107
    2.7 Training 107
    2.8 Methods 108
    2.9 Performance evaluation 108
    2.10 Statistical analysis 110
  3 Results and discussion 111
    3.1 Methods evaluation 111
    3.2 Comparison with other published work 114
    3.3 Positional information analysis 114
  4 Conclusion 120

8 Conclusion 122
  1 Contributions 124
    1.1 Improving computational approaches to detect membrane proteins 124
    1.2 Improving computational approaches to detect transporter proteins 125
    1.3 Facilitating the substrate-specific data collection process 126
    1.4 Broadening the scope of substrate-specific prediction 126
  2 Data availability 127
  3 Limitations and future directions 127

Appendices 147

A Detailed performance evaluation of Pse-PSSM where λ ∈ (0,...,49) 148

B Experiments on the psi-composition encodings for the prediction of other functional classes 153
  1 Dataset 153
  2 Overview 154
  3 Comparison among different encodings 154
  4 Comparison with other published work 155
    4.1 Case 1: G protein-coupled receptors (GPCRs) 156
    4.2 Case 2: Enzymes 158

C Detailed cross-validation performance in substrate specificity prediction 159

List of Figures

1 The structure of the cell membrane 2
2 Schematic representation of transmembrane proteins 3
3 Membrane protein functional classes 4
4 Inconsistencies among the gold standard databases 7
5 Uracil binding sites on UraA 8
6 Eight different membrane types 17
7 TCDB entry example 26
8 Example of a ChEBI ontology graph view 29
9 Amino acid chemical relationships 31
10 BLOSUM62 substitution matrix 32
11 Number of entries in the Swiss-Prot database over time 40
12 Membrane functional types 42
13 Membrane structural types 42
14 Choice of the optimal constituent classifiers among 50 classifiers 55
15 Choice of the optimal constituent classifiers among 500 classifiers 55
16 Comparison with other state-of-the-art methods on the DS-M dataset 58
17 TooT-T overview 63
18 Membrane protein curation process 65
19 Simplified view of ChEBI ontology terms 86
20 Ancestor chart for GO:0009055 95
21 Example of steps of the TooT-SC method 104
22 MCCs of the different methods on the substrate classes 112
23 Average Kyte-Doolittle hydropathy of the MSAs with TCSs 119
24 Choice of different models with Pse-PSSM, λ ∈ (0,...,49) 148
25 TooT-REO overview 155
26 Impact of evolutionary information on membrane functional prediction 157

List of Tables

1 Membrane dataset DS-M 42
2 LOOCV performance of the individual models 52
3 Performances of the all voting ensemble classifiers on the training dataset 53
4 Performances of the selective voting ensemble classifiers on the training dataset 54
5 Transmembrane topology prediction performance on the training dataset 56
6 TooT-M LOOCV performance 57
7 Comparison with other state-of-the-art methods on the DS-M dataset 58
8 Comparison with the iMem-2LSAAC predictor on the DS1 dataset 58
9 Comparison with the MemType-2L predictor on the DS2 dataset 59
10 Transporter membrane dataset DS-T 64
11 Different BLAST thresholds on the TCDB 69
12 Transporter detection performances of the different models 74
13 Impact of various factors on performance 74
14 Performance of annotation transfer by homology 75
15 Cross-validation performance of the TooT-T model 76
16 Independent testing performance of the proposed model 76
17 Phi correlation coefficients of the constituent classifiers 76
18 Comparison with methods from other published works 77
19 Saier's classification of transporter substrates 85
20 Classification of transport system substrates using the ChEBI ontology 87
21 Samples of the GO2C lookup table 89
22 Ontoclass substrate assignment output 91
23 Comparison between the TrSSP dataset and the Ontoclass dataset 93
24 Examples of disagreement 94
25 Number of sequences in each substrate class with similarity removed 99
26 Dataset DS-SC 103
27 Overall cross-validation performance of the methods 111
28 TMC-TCS-PAAC performance 113
29 Impact of factors on the performance of the PAAC 113
30 Comparison between TooT-SC and the state-of-art methods 115
31 Positional information 116
32 Examples of informative residues in TMSs and non-TMSs 117
33 Statistical analysis of the positions in the TMSs and non-TMSs 118
34 Statistical analysis of the positions close to TMSs and far from TMSs 118
35 Statistical analysis of the positions in the interior and exterior TMSs 120
36 LOOCV performances of the individual Pse-PSSM models 149
37 Other dataset of functional membrane classes 154
38 Functional class prediction of the different methods 156
39 Membrane functional prediction cross-validation performance 156
40 Comparison with GPCR-2L in GPCR receptor detection 157
41 Comparison with the state-of-the-art enzyme prediction tools 158
42 AAC performance 159
43 PseAAC performance 160
44 PAAC performance 160
45 TMC-AAC performance 160
46 TMC-PseAAC performance 161
47 TMC-PAAC performance 161
48 TMC-TCS-AAC performance 161
49 TMC-TCS-PseAAC performance 162
50 TMC-TCS-PAAC performance 162

Glossary

Cell membrane Biological membrane that surrounds the cytoplasm of living cells, physically separating the intracellular components from the extracellular environment.

ChEBI Chemical Entities of Biological Interest, database and ontology of molecular entities [HOD+15].

DS-M Benchmark dataset of membrane proteins. Collected from the Swiss-Prot database by this research.

DS-SC Benchmark dataset of eleven substrate classes. Collected from the Swiss-Prot database by this research.

DS-T Benchmark dataset of transporter proteins. Collected from the Swiss-Prot database by this research.

Fasttrans Tool developed to predict seven classes of substrate-specific transporter as well as transporter/non-transporter [HPO+19].

GO Gene Ontology, set of three controlled vocabularies (MF, BP, CC) to describe the role of a gene product [ABB+00].

HMMTOP Tool to predict the topology of transmembrane α-helical segments [TS01]. Available online: http://www.enzim.hu/hmmtop/html/adv_submit.html

Homoeostasis Property of a system in which variables are regulated so that internal conditions remain stable and relatively constant.

Homologous The existence of shared ancestry between a pair of structures, or genes, in different species.

Hydrophilic Interacting effectively with water.

Hydrophobic Not interacting effectively with water; in general, poorly soluble or insoluble in water.

iMem-2LSAAC Tool to predict membrane proteins [AHJ18].

MemType-2L Tool to predict membrane proteins [CS07].

Nonpolar A molecule or structure that lacks any net electric charge or asymmetric distribution of positive and negative charges. Nonpolar molecules generally are insoluble in water.

Ontoclass Tool developed by this research to assign a substrate class label to any given transporter using existing databases and ontologies.

Polar A molecule or structure with a net electric charge or asymmetric distribution of positive and negative charges. Polar molecules are usually soluble in water.

PRED-TMBB2 Tool to predict the topology of transmembrane β-barrel segments [TEB16]. Available online: http://www.compgen.org/tools/PRED-TMBB2.

Protein sequence The unique sequence of amino acids that characterizes a given protein.

psiAAC Method of encoding a protein sequence into a numerical vector, proposed by this research. It combines amino acid composition with evolutionary information obtained from PSI-BLAST.

psiPAAC Method of encoding a protein sequence into a numerical vector, proposed by this research. It combines pair amino acid composition with evolutionary information obtained from PSI-BLAST.

psiPseAAC Method of encoding a protein sequence into a numerical vector, proposed by this research. It combines pseudo-amino acid composition with evolutionary information obtained from PSI-BLAST.

R Programming language and environment for statistical computing and graphics.

SCMMTP Tool to predict transporter proteins [LVY+15].

TC IUBMB approved system of nomenclature for transport protein classification.

TMHMM Tool to predict the topology of transmembrane α-helical segments [KLvHS01]. Available online: http://www.cbs.dtu.dk/services/TMHMM/.

TMC-TCS-PAAC Method of encoding a protein sequence into a numerical vector, proposed by this research. It combines pair amino acid composition (PAAC) with evolutionary information using TM-Coffee and filters unreliable columns as determined by TCS.

TooT-M Tool developed by this research to distinguish membrane proteins from non-membrane proteins.

TooT-SC Tool developed by this research to predict eleven classes of substrate-specific transporter.

TooT-T Tool developed by this research to distinguish transporter proteins from non-transporter membrane proteins.

TOPCONS2 Tool to predict the topology of transmembrane α-helical segments [TPS+15]. Available online: http://topcons.cbr.su.se/.

TranCEP Tool developed by this research to predict the same seven classes of substrate-specific transporter as TrSSP. It utilizes the TMC-TCS-PAAC method and is trained and tested using the TrSSP training and testing datasets, respectively.

TrSSP Tool developed by Mishra et al. [MCZ14] to predict seven classes of substrate-specific transporter as well as transporter/non-transporter.

TrSSP dataset Benchmark dataset of seven substrate classes. Collected from the Swiss-Prot database by Mishra and utilized to develop TrSSP [MCZ14]. Available online: http://bioinfo.noble.org/trssp/?dowhat=datasets.

Abbreviations

AAC Amino Acid Composition

ATH Annotation Transfer by Homology

BLAST Basic Local Alignment Search Tool

BLOSUM BLOcks SUbstitution Matrix

BP Biological Process

BMC BioMed Central

CC Cellular Component

CV Cross-Validation

EC Enzyme Commission

ECO Evidence and Conclusion Ontology

GBM Gradient-Boosting Machine

GPCR G Protein-Coupled Receptors

GPI Glycosylphosphatidylinositol

HSP High-Scoring Segment Pairs

IEA Inferred from Electronic Annotation

IMP Integral Membrane Proteins

IUBMB International Union of Biochemistry and Molecular Biology

IUPAC International Union of Pure and Applied Chemistry

KNN K-Nearest Neighbor

LOOCV Leave-One-Out Cross-Validation

MCC Matthews Correlation Coefficient

MF Molecular Function

MI Mutual Information

MID Mutual Information Difference

mRMR minimum Redundancy Maximum Relevance

MSA Multiple Sequence Alignment

MSA-AAC Multiple Sequence Alignment Amino Acid Composition

NC Nomenclature Committee

NLP Natural Language Processing

OET-KNN Optimized Evidence-Theoretic K-Nearest Neighbor

ORF Open Reading Frame

PAAC Pair Amino Acid Composition

PAM Point Accepted Mutations

Pse-PSSM Pseudo Position-Specific Scoring Matrix

PseAAC Pseudo-Amino Acid Composition

PSI-BLAST Position-Specific Iterated BLAST

PSSM Position-Specific Scoring Matrix

RBF Radial Basis Function

SAAC Split Amino Acid Composition

SCM Scoring Card Method

SVM Support Vector Machine

TCDB Transporter Classification Database

TCID Transport Classification IDentifier

TCS Transitive Consistency Score

TMC TM-Coffee

TMS Transmembrane Segments

Chapter 1

Introduction

1 Biological background

Cell membranes are the only cellular structures found in all cells of all organisms on Earth, a reflection of their biological significance. Membranes maintain the integrity of the cell by separating its critical chemicals and structures from the surrounding environment. They serve as gatekeepers, regulating the flow of molecules, energy, and information into and out of the cell. Eukaryotic cells also have internal membranes that enclose their organelles and control the exchange of essential cell components [LBZ+00].

Cell membranes consist of two main components: lipids and proteins (see Figure 1). Each component has clearly defined functions; for example, the lipids form the universally conserved bilayer structure, whose basic barrier properties determine membrane flexibility and how membrane proteins bind to the lipid bilayer. The membrane proteins enable the membrane to perform the distinctive activities underlying the vast diversity of cell membrane functions.

The lipid bilayer consists of two layers of phospholipid molecules whose fatty tails form the hydrophobic interior and whose hydrophilic (polar) heads line both the inside and outside of the cell surface. Membrane proteins are embedded within, or interact with, the lipid bilayer; they adopt different forms based on the cell type and their location.

Figure 1: The structure of the cell membrane. This figure illustrates the two main components of the cell membrane: the lipid bilayer and membrane proteins. Membrane proteins can be either surface-bound (e.g., peripheral) or integral.

Some membrane proteins bind only to the membrane surface; others span the entire lipid bilayer and are exposed to water-soluble domains on both sides of the membrane (see Figure 2). The proteins buried within the lipid bilayer are integral membrane proteins (IMPs), also called "transmembrane proteins". They have one or more transmembrane segments (TMSs) embedded in the lipid bilayer, in addition to extramembranous hydrophilic segments extending into the water-soluble domains on each side of the lipid bilayer.

The embedded segments are distinguishable since they contain residues with hydrophobic properties that interact with the hydrophobic (nonpolar) tails inside the membrane phospholipids.

The TMSs in membrane proteins take two general structural forms: α-helix or β-sheet. Integral proteins with α-helical TMSs are formed by helices connected by extramembranous loops; the extramembranous domains may contain both α-helix and β-sheet structures. Integral proteins with β-sheet TMSs are composed of transmembrane β-strands that stretch across the lipid bilayer and align in an antiparallel fashion into large, self-enclosed β-sheets, with extramembranous loops connecting adjacent β-strands. The extramembranous loops generally lack a secondary structure (random coils), but some longer loops may contain small α-helical regions [Bue15].

Unlike membrane proteins with α-helical TMSs, which are found in abundance in all cellular membranes [vH99], membrane proteins with β-barrel TMSs have been found experimentally only in the outer membranes of Gram-negative bacteria. However, some weak similarities at the sequence level indicate that β-barrel membrane proteins may also be present in the outer membranes of mitochondria and chloroplasts [Sch03].

Conversely, surface-bound proteins do not extend into the hydrophobic interior of the lipid bilayer; they are typically bound to lipid head groups at the membrane surface or attached indirectly to other IMPs. Surface-bound proteins, such as peripheral and lipid-anchored proteins, lack the hydrophobic properties of the IMPs; they are therefore more difficult to distinguish than the IMPs.

Figure 2: Schematic representation of transmembrane proteins. IMPs (transmembrane proteins) have one or more segments embedded within the lipid bilayer (TMSs), connected by extramembranous loops.

Protein structures are described at four distinct levels of hierarchical organization: primary, secondary, tertiary, and quaternary. These levels denote, respectively, the amino acid sequence of the protein, the local regular substructures (e.g., α-helices and β-strands), the 3D structure of a single polypeptide, and the aggregation of two or more individual polypeptide chains into a protein complex. This research specifically examines the primary and secondary structures of proteins.

1.1 Membrane protein functional classes

Membrane proteins control nearly all functions of the membrane, aside from the basic barrier property of the lipid bilayer. The categorization of membrane proteins based on their function was applied long before high-resolution structural methods became available [Bue15]. This makes sense conceptually, because the underlying structure of such proteins can only be identified once their functionality is determined [Bue15]. Membrane proteins can be classified into four different functional groups [Bue15] (Figure 3).

Figure 3: Membrane protein functional classes. This figure shows the four main functional classes of membrane proteins. Transporters allow certain molecules to enter or leave the cell. Receptors bind an extracellular molecule, which activates an intracellular process. Enzymes transform a molecule into another form; they perform this function on the cytoplasmic side of the membrane. Anchor proteins physically link intracellular structures with extracellular structures. The figure is from [OAF10].

• Transporters are responsible for selective permeability. They are highly selective, allowing only certain substrates to enter or leave the cell. Channels and carriers are the two major types of transporters.

• Receptors are responsible for the binding of extracellular signaling molecules and the generation of various intracellular signals on the opposite side of the plasma membrane.

• Enzymes are responsible for various chemical reactions on the interior surface of the plasma membrane.

• Anchor proteins are responsible for cell adhesion and contain cell surface identity markers.

2 Motivation

Membrane proteins are key gatekeepers that control various vital cellular functions, including cell signaling, trafficking, metabolism, and energy production. It is estimated that one in every three proteins found in a cell of an average organism is a membrane protein. Human genome analysis, for example, predicts that 20% to 30% of all open reading frames (ORFs) encode membrane proteins [WvH98]. Transporters are membrane proteins that control the flow of molecules into and out of the cell; they play critical roles in cellular homeostasis and are attractive targets for the pharmaceutical industry [G+06].

For example, neurotransmitter transporters are the targets of drugs used in the treatment of neuropsychiatric disorders [IKY02], and serotonin transporters are the targets of a major class of antidepressants, the selective serotonin reuptake inhibitors [G+06]. In addition, alterations in the function and/or expression of Na+-dependent glutamate transporters have been implicated in a range of psychiatric and neurological disorders, such as epilepsy, Alzheimer's disease, Huntington's disease, HIV-associated dementia, amyotrophic lateral sclerosis (ALS), and malignant gliomas [SSS04].

As a result of numerous recent genome projects, the sequences of many membrane proteins are now known; however, their structure and function remain poorly characterized and understood because of the immense effort required to characterize them. Generally (for all proteins), experimentally identifying the function of a protein is not a trivial task, because the function may be related specifically to the native environment in which a particular organism lives, and such an environment is difficult to simulate in a laboratory. In particular, membrane proteins have a hydrophobic surface, which requires the use of detergents to extract them from the cell membrane. Additionally, their flexibility and instability create challenges at many levels, including crystallization, expression, and structure solution [CBCI08].

The Protein Data Bank (PDB) illustrates how underrepresented membrane proteins are relative to soluble proteins; the PDB is the only worldwide repository of data on the 3D structures of large biological molecules, such as proteins and nucleic acids. As of March 2020, less than 4% of the PDB consists of membrane proteins, of which 2.9% are α-helical structures and 0.6% are β-barrel structures.

Therefore, the detection of membrane proteins and knowledge of transporters and transport mechanisms are fundamental to the advancement of functional and structural genomics. It is thus extremely desirable to make use of membrane protein sequences, along with the available experimental data, in computational tools to detect membrane proteins and determine their function. Such tools can serve as a guide that narrows the search space for researchers determining the function of novel proteins. Current state-of-the-art methods remain far from delivering a complete solution, but initial attempts have been made that require further improvement.

3 Problem definition

In predicting transporters and their functions, not only is finding a solution difficult, but so is asking the right question. This is mainly because the concepts related to membrane transport proteins are poorly defined, and there is no single coherent formulation of the transporter prediction problem that all methods agree upon. Rather, there are different perspectives at different levels of prediction.

Even the gold standard databases are not consistent. The Transporter Classification Database (TCDB) [LBUZ09], which offers the gold standard classification of membrane transporters on the basis of the transporter classification (TC) scheme, and the Swiss-Prot database [ABW+04] do not contain the same transporters. As demonstrated in Figure 4, less than 40% of the entries in the TCDB are in the Swiss-Prot database. Of those, only 50% are annotated with the Gene Ontology (GO) molecular function (MF) transporter activity annotation. Likewise, entries with the same GO annotation in the Swiss-Prot database do not necessarily belong to the same TCDB family. Such inconsistencies complicate the identification of transporters and the prediction of their functions.

Figure 4: Inconsistencies among the gold standard databases. This figure highlights the inconsistencies among the annotations in the gold standard transporter databases TCDB and Swiss-Prot. As of February 2020, the TCDB contains 19,498 entries, but fewer than 40% of these (7,223 entries) are in the Swiss-Prot database. Of those, only 50% (3,618 entries) are annotated with the Gene Ontology (GO) molecular function (MF) transporter activity annotation.

Generally, there are two approaches to predicting transporters: (1) those based on the TC family and (2) those based on the substrate transported across the membrane. Prediction based on the TC family attributes a given protein to a TC functional family. The assignment to a TC family can provide an indication of the transport mechanism but not the substrate specificity of the protein, since proteins belonging to the same TC family can transport different substrates, and proteins belonging to different families can transport the same substrate.

Predicting the function of a given transporter down to the level of substrate specificity is challenging, as the specificity depends on a very small number of sites in the protein sequence, and those sites are not known beforehand. For example, the crystal structure and transport mechanism of the Escherichia coli uracil transporter UraA (UniProt ID: P0AGM7) were published in 2011 [LLJ+11]. UraA contains 14 TMSs and is 429 amino acids long. A short pair of antiparallel β-strands, located in TMS3 and TMS10, provides a shelter for substrate binding and plays an important role in structural organization and substrate recognition. The uracil binding sites are at positions 73, 241, 289, and 290 of the amino acid sequence, as illustrated in Figure 5. These binding sites are far from each other in the primary sequence but close to each other in the 3D structure of the protein; thus, identifying them based merely on the primary structure is extremely challenging. This explains why traditional sequence similarity methods do not perform well in identifying the substrate specificity of a transporter [MCCD13] [BH13] [COLG11].

Figure 5: Uracil binding sites on UraA. Top: Overall structure of UraA with 14 TMSs; two perpendicular views, one from the periplasm and one from the side. Bottom: "Uracil is coordinated by polar contacts. The uracil molecule is shown in yellow in ball-and-stick form. In addition, two Glu residues, Glu 241 and Glu 290, anchor uracil by each making two hydrogen bonds with it. Replacement of either Glu residue by Ala completely abrogated uracil binding [...]. In addition, the two oxygen atoms of uracil form hydrogen bonds with the amide nitrogen atoms of Phe 73 and Gly 289." [LLJ+11]. This figure is reprinted by permission from Springer Nature [LLJ+11].

This thesis focuses on the de novo prediction of the substrate specificity of a transporter. De novo prediction provides insights into the function of novel unannotated proteins. Substrate specificity is essential for determining the role of a transporter, as it provides important information for the annotation of proteins and is a key element of transport reactions in network modeling. In the research on identifying the substrate specificity of transporters, there is no universally defined standard dataset. Researchers use their own defined subsets of substrate classes, where the assignment of a substrate to substrate classes is not explained or defined. This makes the actual problems addressed by each predictor diverse and, thus, the expansion of datasets to encapsulate more substrates almost impossible. To this end, one of the major objectives of this research is to standardize the collection of transporter substrate data so that it is traceable and reproducible. In addition, there are three main questions to be answered regarding a protein of interest:

Q1: Given protein sequence X, is it a membrane protein?

Q2: Given membrane protein sequence X, is it a transporter protein?

Q3: Given transporter protein sequence X, what type of substrates does it transport across the membrane?

Each question needs to be addressed independently because methods for answering one question may not be optimal for the others. Therefore, we established a three-layer structure in which each layer offers a tailored solution to one question. This design allows a substrate to be assigned to a query protein without any prior knowledge, enables obtaining information at the level of interest, and provides the flexibility to skip levels if certain information about the query protein is already known.
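To make the three-layer design concrete, the sketch below wires the layers together in Python. The names toot_m, toot_t, and toot_sc are hypothetical stand-ins for the tools developed in Chapters 4, 5, and 7; they are stubbed with fixed answers purely so the example runs.

```python
# Hypothetical stand-ins for the classifiers of Chapters 4, 5, and 7,
# stubbed with fixed answers so this sketch is runnable.
def toot_m(seq):  return True     # Q1: is the sequence a membrane protein?
def toot_t(seq):  return True     # Q2: is the membrane protein a transporter?
def toot_sc(seq): return "sugar"  # Q3: which substrate class does it transport?

def annotate(seq, is_membrane=None, is_transporter=None):
    """Answer Q1-Q3 in order, skipping any layer whose answer is already known."""
    if is_membrane is None:
        is_membrane = toot_m(seq)           # layer 1
    if not is_membrane:
        return "non-membrane protein"
    if is_transporter is None:
        is_transporter = toot_t(seq)        # layer 2
    if not is_transporter:
        return "non-transporter membrane protein"
    return toot_sc(seq)                     # layer 3: substrate class
```

Passing is_membrane=True, for instance, skips the first layer when the query is already known to be a membrane protein.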

4 Objectives

The objectives of this thesis are summarized below:

O1: To improve the computational approaches for the de novo detection of membrane proteins, relying only on the primary protein sequence.

O2: To improve the computational approaches for the de novo detection of transporter proteins, relying only on the primary protein sequence.

O3: To facilitate the data collection process for the substrate specificity of transporters in a traceable and reproducible manner.

O4: To broaden the scope of state-of-the-art substrate class prediction while maintaining credible predictive performance.

5 Outline

This thesis is organized as follows:

Chapter 2 summarizes the related literature. Section 1 examines the state-of-the-art tools in predicting membrane proteins: Section 1.1 presents the tools for transmembrane topology prediction, and Section 1.2 describes the tools for membrane structural type prediction. Section 2 reviews the efforts to detect transporter proteins, and Section 3 reviews the efforts to predict the substrate specificity of transporters.

Chapter 3 presents important background for this research. Section 1 describes common ways to encode a protein sequence into a numerical vector. Section 2 and Section 3 list important protein databases and ontologies, respectively. Section 4 explains the substitution matrices central to protein comparison. Section 5 describes similarity search using the Basic Local Alignment Search Tool (BLAST). Finally, Section 6 presents different multiple sequence alignment (MSA) algorithms.

Chapter 4 explores the best techniques for predicting membrane proteins. Section 1 offers an introduction to the problem and points out the main contributions we make in the chapter. Section 2 introduces a new membrane dataset (DS-M) and lists the materials and methods utilized in the experiments. Section 3 delineates the experimental design, and Section 4 presents the results, revealing that an integrative approach, which we call TooT-M, outperforms all of the other methods. Section 4.5 compares the results obtained by TooT-M with those obtained by the state-of-the-art membrane predictors. The contents of this chapter have been submitted for publication in BioMed Central (BMC).

Chapter 5 focuses on distinguishing transporter proteins from other membrane proteins. Section 1 provides an introduction to the work and highlights the main contributions we make in the chapter. Section 2 provides an overview of the proposed tool (TooT-T). Section 3 lists the materials and methods utilized in TooT-T and introduces a new method of encoding a protein sequence, psi-composition, which combines the traditional compositions with evolutionary information obtained by a PSI-BLAST search. Section 4 presents and discusses the results, and Section 4.4 compares the results attained by TooT-T with those obtained using the state-of-the-art tools. Finally, Section 5 concludes the chapter. The main components of this chapter have been published in BMC Bioinformatics [AB20]. The main difference is that the experiments in the accepted manuscript were performed using the TrSSP benchmark dataset, whereas in this chapter, the experiments were conducted with a newly constructed benchmark dataset that has the same membrane data as DS-M from Chapter 4. The results are consistent across both datasets.

Chapter 6 addresses the data collection process for substrate-specific transport proteins. Section 1 introduces the chapter. Section 2 delineates the challenges faced when building a substrate-specific transport protein dataset and highlights the inconsistencies among the gold standard databases. Section 3 elucidates the proposed ontology-based tool, Ontoclass. Section 4 presents two case studies: the first case study (Section 4.1) compares Ontoclass annotation with a manually curated dataset, and the second case study (Section 4.2) reports the number of annotated transporters and their substrates in the Swiss-Prot database. Section 5 discusses the findings. The contents of this chapter were presented at the 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) [AB19].

Chapter 7 describes transporter substrate specificity prediction. Section 1 provides an introduction to transporter substrate prediction. Section 2 describes the materials and methods utilized to build the proposed tool, TooT-SC. Section 3 presents and analyzes the results, and Section 3.2 compares the performance of TooT-SC with the state-of-the-art methods. Finally, Section 4 concludes the chapter. Some contents of the chapter have been published in PLoS ONE [AAB20]. The main difference is that the published article proposes TranCEP, which utilizes the same approach as TooT-SC (TMC-TCS-PAAC) but is trained using the TrSSP dataset to predict seven substrate classes. By contrast, TooT-SC is trained using a new dataset, DS-SC, with eleven substrate classes. DS-SC contains the transporter entries from Chapter 5 annotated using Ontoclass.

Finally, Chapter 8 concludes the thesis, highlights the main contributions, and provides insights into future directions.

Chapter 2

Literature review

This chapter reviews the literature related to this research. Section 1 reviews the efforts to predict membrane proteins. Section 2 reviews the efforts to detect transporter proteins, and Section 3 reviews the efforts to predict the substrate specificity of transporters.

1 Identifying membrane proteins

1.1 Transmembrane topology prediction

Transmembrane topology prediction methods predict the number of TMSs and their respective positions in the primary protein sequence. Transmembrane proteins are IMPs that span the lipid bilayer and have exposed portions on both sides of the membrane. The portions that span the membrane are expected to contain hydrophobic (nonpolar) amino acids, while the portions on either side of the membrane consist mostly of hydrophilic (polar) amino acids. The TMSs can have either α-helical or β-barrel structures, so prediction methods are classified into α-helix prediction methods and β-barrel prediction methods.

Early prediction methods depended solely on simple measurements such as the hydrophobicity of the amino acids [KD82]. Major improvements were made after the "positive-inside rule" [vH92] was introduced by von Heijne, based on the observation that positively charged amino acids, such as arginine and lysine, tend to appear on the cytoplasmic side of the lipid bilayer. Current methods combine hydrophobicity analysis and the positive-inside rule with machine learning techniques and evolutionary information.
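As an illustration of the early hydrophobicity-based approach, the following sketch averages the Kyte-Doolittle hydropathy index over a sliding window and reports windows hydrophobic enough to be TMS candidates. The window length of 19 and the cutoff of 1.6 follow the values suggested by Kyte and Doolittle [KD82]; the sketch assumes sequences over the 20 standard residues.

```python
# Kyte-Doolittle hydropathy index [KD82].
KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
      'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
      'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
      'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2}

def candidate_tms(seq, window=19, cutoff=1.6):
    """Start positions of windows whose mean hydropathy suggests a TMS."""
    starts = []
    for i in range(len(seq) - window + 1):
        if sum(KD[aa] for aa in seq[i:i + window]) / window >= cutoff:
            starts.append(i)
    return starts
```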

A representative modern method is the membrane protein structure and topology support vector machine (MEMSATSVM) method [NJ09], introduced in 2009, which uses four support vector machines (SVMs) to predict transmembrane helices, inside and outside loops, re-entrant helices, and signal peptides. In addition, it includes evolutionary information on many homologous protein sequences in the form of a sequence profile. The method outputs predicted topologies ranked by overall likelihood and incorporates signal peptide and re-entrant helix prediction. The reported accuracy is 89% for the correct topology and location of TM helices and 95% for the correct number of TM helices. However, recent studies using experimental data report that MEMSATSVM does not perform as well when evaluated on different datasets [THKE12] [TPS+15].

State-of-the-art methods use consensus algorithms that combine the outputs of different predictors. The consensus prediction of membrane protein topology (TOPCONS2) method [TPS+15] achieved the highest reported prediction accuracy on benchmark datasets [TGB+18]. It successfully distinguishes between globular and transmembrane proteins and between transmembrane regions and signal peptides. In addition, it is highly efficient, making it ideal for proteome-wide analyses. The TOPCONS2 method combines the outputs of different predictors that can also predict signal peptides (namely, Philius [RKR+08], PolyPhobius [KKS05], OCTOPUS [VE08], signal peptide OCTOPUS (SPOCTOPUS) [VBSE08], and SCAMPI [BVF+08]) into a topology profile in which each residue is represented by one of four values: signal peptide (S), membrane region (M), inside the membrane (I), or outside the membrane (O). A hidden Markov model then processes the resulting profile and predicts the final topology as the highest-scoring state path.
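The profile-building step can be sketched as follows, assuming the constituent predictions are available as equal-length strings over the alphabet S, M, I, O. A plain per-residue majority vote stands in for the HMM decoding, which is a simplification of what TOPCONS2 actually does.

```python
from collections import Counter

def consensus_profile(predictions):
    """Per-residue consensus over topology strings from several predictors.

    predictions: equal-length strings over S (signal peptide), M (membrane),
    I (inside), and O (outside). TOPCONS2 decodes the profile with a hidden
    Markov model; a simple majority vote is used here to keep the sketch short.
    """
    profile = []
    for column in zip(*predictions):   # one column of letters per residue
        profile.append(Counter(column).most_common(1)[0][0])
    return "".join(profile)

# consensus_profile(["IIMMMOO", "IIMMMOO", "IIMMOOO"]) -> "IIMMMOO"
```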

Regarding β-barrel membrane protein prediction, a variety of methods have been introduced, including methods that combine statistical propensities [BFJE04], k-nearest neighbor (KNN) methods [HY08], neural networks [JMF+01] [OGCS08], hidden Markov models [BLSH04] [SGW+11] [HE12] [TEB16], SVMs [OCG10], and amino acid compositions (AACs) [GAW05] [Lin08]. Approaches based on hidden Markov models have been found to achieve statistically significant performance when compared to other types of machine learning techniques [BLH05]. Major methods for detecting β-barrel outer membrane proteins are HHomp [RLLS09], β-barrel protein OCTOPUS (BOCTOPUS) [HE12], and PRED-TMBB2 [TEB16], with reported MCCs of 0.98, 0.93, and 0.92, respectively, when applied to the same dataset. The BOCTOPUS and HHomp techniques are much slower than PRED-TMBB2 [TEB16].

1.2 Prediction of the membrane protein structural type

Methods for predicting the membrane type can predict up to eight different membrane protein structural subtypes, categorized as single-pass types I, II, III, and IV; multipass transmembrane; glycosylphosphatidylinositol (GPI)-anchored; lipid-anchored; and peripheral membrane proteins (Figure 6). A comprehensive review by Butt et al. [BRK17] elucidates these methods in detail. Generally, prediction is performed in two stages: the first stage identifies the protein sequence as membrane or non-membrane, while the second stage differentiates among specific membrane protein subtypes. This research focuses on detecting all membrane proteins, regardless of their type (the first stage). The state-of-the-art predictors that have achieved the highest overall performance are MemType-2L [CS07] and iMem-2LSAAC [AHJ18].

The MemType-2L predictor [CS07] was introduced in 2007 by Chou and Shen. It is a two-layer predictor whose first layer identifies a query protein as a membrane or non-membrane protein. Then, if the protein is predicted to be a membrane protein, the second layer identifies its structural type from among the eight categories. The MemType-2L predictor incorporates evolutionary information by representing the protein samples with pseudo position-specific scoring matrix (Pse-PSSM) vectors and combining the results obtained by individual optimized evidence-theoretic KNN (OET-KNN) classifiers. It achieved an overall accuracy of 92.7% in the membrane detection layer. The reported performance of the first layer was obtained by applying the jackknife test to the provided dataset.
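The Pse-PSSM encoding can be sketched as follows: from an L × 20 normalized PSSM, the first 20 features are the column means, and the next 20 are λ-lagged mean squared differences that retain some sequence-order information [CS07]. The sketch assumes the PSSM has already been parsed into a list of 20-value rows with L > λ.

```python
def pse_pssm(pssm, lam=1):
    """Pse-PSSM features from an L x 20 normalized PSSM (assumes L > lam).

    Returns 40 values: 20 per-column means, then 20 lag-lam terms that
    capture local sequence-order information.
    """
    L = len(pssm)
    means = [sum(row[j] for row in pssm) / L for j in range(20)]
    lagged = [sum((pssm[i][j] - pssm[i + lam][j]) ** 2
                  for i in range(L - lam)) / (L - lam)
              for j in range(20)]
    return means + lagged
```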

Butt et al. [BKJ+16] introduced a tool that predicts all types of membrane proteins; it uses statistical moments to extract features from the protein samples and then trains a multilayer neural network with backpropagation to predict the membrane proteins. This tool achieved an overall accuracy of 91.23% when applying the jackknife test to the dataset from Chou and Shen [CS07], a slightly lower performance than that of the MemType-2L predictor.

The iMem-2LSAAC predictor was introduced in 2017 by Arif et al. [AHJ18]. It is a two-layer predictor whose first layer predicts whether a query protein is a membrane protein. Then, for membrane proteins, it continues to the second layer to identify the structural category. It utilizes the split amino acid composition (SAAC) to extract features from the protein samples and then applies an SVM to train the predictor. iMem-2LSAAC achieved an overall accuracy of 94.61% in the first layer when applying the jackknife estimator to their dataset.

Figure 6: Eight different membrane types. This figure shows the eight types of membrane proteins: (a) single-pass type I: spanning the membrane once, with its N-terminus on the extracellular side of the membrane and its signal sequence removed; (b) single-pass type II: spanning the membrane once, with its N-terminus on the cytoplasmic side of the membrane; the transmembrane domain is located close to the N-terminus and functions as an anchor; (c) single-pass type III: spanning the membrane once, with its N-terminus on the extracellular side of the membrane and no signal sequence; (d) single-pass type IV: spanning the membrane once, with its N-terminus on the cytoplasmic side of the membrane; the transmembrane domain is located close to the C-terminus and functions as an anchor; (e) multipass: spanning the membrane more than once; (f) lipid-anchored: bound to the lipid bilayer of a membrane through a posttranslational modification, the attachment of at least one lipid or fatty acid; (g) GPI-anchored: bound to the lipid bilayer of a membrane through a GPI anchor (glycosylphosphatidylinositol anchor), a complex oligoglycan linked to a phosphatidylinositol group, resulting in the attachment of the C-terminus of the protein to the membrane; and (h) peripheral membrane proteins: physically associated with a membrane via interactions with lipid headgroups at the membrane surface or with another membrane protein. This figure is from [BRK17]; definitions are from the UniProtKB subcellular location ontology.

2 Identifying transporter proteins

Earlier efforts in transporter detection applied homology searches against experimentally characterized databases to detect novel transporters, and homology searches are still commonly used by many tools. For example, transporters via annotation transfer by homology (TransATH) [AB17] is a system that automates Saier's protocol [YSJ12, GNY+13, PVL+14] based on sequence similarities. TransATH includes the computation of subcellular localization and improves the computation of TMSs. The parameters of TransATH were chosen for optimal performance on a gold standard set of transporters and non-transporters from Saccharomyces cerevisiae. TransATH achieved an overall accuracy of 71.0%. In addition, Barghash et al. [BH13] annotated transporters at the family and substrate levels in three organisms using sequence similarity and sequence motifs.
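The decision rule at the core of such homology-based systems can be sketched as follows. The tuple format for the hits and the e-value cutoff are illustrative assumptions; the BLAST search itself, its parsing, and the tuned thresholds of the actual tools are omitted.

```python
def transfer_annotation(hits, evalue_cutoff=1e-20):
    """Annotation transfer by homology, in outline.

    hits: (subject_id, evalue, tc_family) tuples parsed from a BLAST search
    of the query against TCDB (parsing omitted; the format is hypothetical).
    """
    best = min(hits, key=lambda h: h[1], default=None)  # lowest e-value
    if best is not None and best[1] <= evalue_cutoff:
        return best[2]    # transfer the TC family of the best hit
    return None           # no confident homolog: leave unannotated
```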

A major limitation of homology-based methods, however, is that they can generate false assignments, because homologous sequences do not always have significant sequence similarities. Likewise, proteins with high sequence similarities do not always share the same function [WL03]. More advanced methods attempt to overcome the limitations of homology-based methods by utilizing features from the protein sequences that better reflect the relation between the sequences and the target function. For example, TrSSP [MCZ14] is a web server for predicting membrane transport proteins and their substrate category. The TrSSP tool applies an SVM in combination with the amino acid index (AAindex) and a position-specific scoring matrix (PSSM) to predict top-level transporters. It achieved transporter prediction accuracies of 78.99% and 80.00% and MCCs of 0.58 and 0.57 during cross-validation and independent testing, respectively.

The scoring card method (SCM) for membrane transport proteins (SCMMTP) [LVY+15] uses a novel SCM that utilizes dipeptide compositions to identify putative membrane transport proteins. The SCMMTP method first builds an initial matrix of 400 dipeptides, using the difference between the positive and negative compositions as the initial dipeptide scoring matrix. This matrix is then optimized using a genetic algorithm. The SCMMTP tool achieved overall accuracies of 81.12% and 76.11% and MCCs of 0.62 and 0.47 in cross-validation and independent testing, respectively.
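A minimal sketch of the initial scoring-card construction described above, assuming training sequences contain only the 20 standard residues; the genetic-algorithm refinement is omitted.

```python
from itertools import product

AMINO = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO, repeat=2)]  # 400 dipeptides

def dipeptide_freq(seq):
    """Normalized dipeptide composition of one sequence."""
    n = max(len(seq) - 1, 1)
    freq = dict.fromkeys(DIPEPTIDES, 0.0)
    for i in range(len(seq) - 1):
        freq[seq[i:i + 2]] += 1.0 / n
    return freq

def initial_scoring_card(positives, negatives):
    """Initial card: mean positive minus mean negative dipeptide composition."""
    pos = [dipeptide_freq(s) for s in positives]
    neg = [dipeptide_freq(s) for s in negatives]
    return {dp: sum(f[dp] for f in pos) / len(pos)
              - sum(f[dp] for f in neg) / len(neg)
            for dp in DIPEPTIDES}

def scm_score(seq, card):
    """Score a query as the composition-weighted sum of its card values."""
    f = dipeptide_freq(seq)
    return sum(f[dp] * card[dp] for dp in DIPEPTIDES)
```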

Li et al. [LLX+16] trained an SVM to predict substrate classes of transmembrane transport proteins by integrating features from PSSMs, AAC, biochemical properties, and GO terms. They achieved an overall accuracy of 98.33% and an MCC of 0.97 on an independent dataset. Their method incorporates GO annotations as features, which are likely missing for unannotated sequences.

Ho et al. [HPO+19] applied a word-embedding approach from the field of natural language processing (NLP) to the protein sequences of transporters. The protein sequences are represented using both the word embeddings and the frequencies of biological words. They report outstanding performance in substrate specificity detection for transporters but not in transporter detection: the accuracy for transporter detection reached only 83.94% during cross-validation and 85.00% on independent datasets.

3 Predicting the substrate specificity of transporters

Studies that classify transporter proteins according to the substrates they transport are quite limited. Additionally, for most tools, no software or source code is available; therefore, it is difficult to compare the results of different tools.

Schaadt et al. [SCH10] used AAC, pair AAC (PAAC), and pseudo-AAC (PseAAC), in addition to amino acid conservation with homologous sequences, called MSA-AAC, to detect different substrate specificities. The MSA-AAC method uses a full MSA of each protein in the dataset, built with ClustalW. For this step, a BLAST search of the nonredundant database nr was conducted to retrieve homologous sequences; then, sequences with an identity below 25% were removed. The occurrence of every amino acid in all sequences of the alignment was normalized to the number of included amino acids, yielding a vector of size 20. The investigations were performed on Arabidopsis thaliana transmembrane proteins and considered four substrate classes (amino acids, oligopeptides, phosphates, and hexoses), with a total of 61 transporters in the positive dataset. The method relies on the Euclidean distance between the composition of the query protein sequence and the mean composition of the protein sequences of each substrate class to compute a score for each query sequence against each substrate class. Their approach reached an accuracy of approximately 90%, compared to 60% on randomized data. Although this performance is promising, the dataset used by Schaadt et al. contains a limited number of transporters from only one organism.
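A minimal sketch of this nearest-mean-composition scheme, using the plain AAC as the composition; the MSA-based variant would substitute alignment-derived compositions for aac below.

```python
import math

RESIDUES = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq):
    """20-dimensional amino acid composition of one sequence."""
    return [seq.count(a) / len(seq) for a in RESIDUES]

def class_means(training):
    """training: {substrate_class: [sequences]} -> {substrate_class: mean AAC}."""
    means = {}
    for label, seqs in training.items():
        vectors = [aac(s) for s in seqs]
        means[label] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return means

def nearest_class(query, means):
    """Assign the query to the class with the closest mean composition."""
    q = aac(query)
    return min(means, key=lambda c: math.dist(q, means[c]))
```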

Chen et al. [COLG11] utilized AAC, PAAC, and biochemical properties from the AAindex database [KPP+07], along with some evolutionary information in the form of PSSMs, to classify a transporter into four substrate classes: electrons, proteins/mRNAs, ions, and others. Their dataset is not tailored to a specific organism and contains a total of 651 transporters. A neural network was employed to construct the classifier, which achieved an accuracy of approximately 80%.

Schaadt et al. [SH12] found that separating TMSs and non-TMSs when calculating AACs yields an improved accuracy of 80%, compared to 76% when the composition is computed for the whole sequence. This method also used Arabidopsis thaliana transmembrane proteins and considered the same four substrate classes (amino acids, oligopeptides, phosphates, and hexoses), with a total of 61 transporters.

Barghash et al. [BH13] applied three different approaches: BLAST [AMS+97], which generates alignments that optimize a measure of local similarity; HMMER [FCE11], which searches sequence databases for sequence homologs using probabilistic methods; and MEME [BE+94], which discovers motifs in protein sequences using expectation maximization. These methods, under different thresholds, were used to evaluate whether annotations of transporter substrates could be transferred from one organism to another. Four substrate classes were considered: metal ion, phosphate, sugar, and amino acid transporters from Escherichia coli (72 transporters), Saccharomyces cerevisiae (79 transporters), and Arabidopsis thaliana (95 transporters). They found that, with these methods, sequences tended to match those from their TC families rather than sequences in the same substrate family. Their reported performance was low for substrate-level classification, with an F-measure of approximately 40-75%.

Mishra et al. [MCZ14] developed a web server, TrSSP, for predicting the substrate specificity of transporters. Protein sequence features such as AAC, PAAC, the physicochemical composition, the biochemical composition, and PSSMs were used to predict the substrate specificity of seven transporter classes: amino acids, anions, cations, electrons, proteins/mRNAs, sugars, and other transporters. A set of 49 selected physical, chemical, energetic, and conformational properties was used to define the biochemical composition of each protein sequence. The 49 values were selected from the AAindex database [KK00] and have been successfully applied in many areas of bioinformatics, such as protein folding and transporter classification [COLG11]. The normalized values of the 49 amino acid properties were used. The SVM model trained using a combination of the biochemical composition and the PSSM achieved the highest performance, with an average MCC of 0.41.

Li et al. [LLX+16] used an SVM to predict substrate classes of transmembrane transport proteins by integrating features from PSSM, AAC, biochemical properties, and GO terms.

They achieved an overall accuracy of 80.00% on the testing set of the dataset from Mishra et al. [MCZ14]. However, their method incorporates GO annotation as a feature, which is typically unavailable for unannotated sequences.

Ho et al. [HPO+19] applied a word-embedding technique to represent protein sequences

of transporters. Their work was the first to use word-embedding representations to classify

transporters according to their substrate specificity; however, the concept of using word

embeddings to represent a protein sequence has been applied in many studies, such

as in [AM15] and [YWBA18]. The idea is motivated by recent advances in natural

language processing, where word embeddings, or word vectors, efficiently represent words

as low-dimensional floating point vectors in a way that captures their meaning [MCCD13].

In the word-embedding representation of a protein, the amino acid sequence is treated as a

sentence of length n, where n is the number of fixed-length words (k-mers) in that sequence.

The k-mer words are derived by dividing the sequence into overlapping or nonoverlapping words of length k. Ho et al. divided the protein sequences into overlapping words with a length of 3 (3-mers). Then, the fastText skip-gram model [BGJM17] was trained to learn the individual word vectors, with the dimensionality of each word vector set to 1. A protein sample was then represented by a fixed-length vector of size z, where z is the number of unique biological words that appear in the corpus; the i-th element of this vector is the 1D word vector of the i-th word in the corpus multiplied by its frequency in the protein sample. They report an impressive substrate specificity accuracy of 95.25% for an SVM model trained on the TrSSP [MCZ14] training set and tested on its testing set.
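To make the representation concrete, the following minimal Python sketch maps a protein sequence to such a frequency-weighted vector. The vocabulary and the 1D word values below are toy placeholders standing in for the output of a trained fastText skip-gram model; all names are ours, not those of any published implementation.

    from collections import Counter

    def kmers(seq, k=3):
        # Split a sequence into overlapping words of length k (3-mers by default).
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    def embed_protein(seq, vocab, word_vec, k=3):
        # Represent a protein as a vector of size z = len(vocab): the i-th element
        # is the 1D embedding of the i-th vocabulary word multiplied by that
        # word's frequency in the protein sample.
        counts = Counter(kmers(seq, k))
        return [word_vec[w] * counts.get(w, 0) for w in vocab]

    # Toy vocabulary and 1D word values (hypothetical, for illustration only).
    vocab = ["MKT", "KTL", "TLV", "LVA"]
    word_vec = {"MKT": 0.12, "KTL": -0.40, "TLV": 0.05, "LVA": 0.31}
    print(embed_protein("MKTLVA", vocab, word_vec))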

Chapter 3

Background

This chapter presents the bioinformatics background related to this research. Section 1 describes common ways to encode a protein sequence into a numerical vector. Section 2 and Section 3 list important protein databases and ontologies, respectively. Section 4 explains substitution matrices that are central to protein comparison. Section 5 describes the similarity search using BLAST. Finally, Section 6 presents different MSA algorithms.

1 Protein composition

Machine learning models generally require their input to be vectors. The conversion from a protein sequence into a numerical vector that encapsulates the protein function is one of the greatest challenges in computational biology. In this section, we describe the baseline features/encodings that are commonly applied in the literature and have been shown to be useful.

The idea of classifying proteins using their AAC was first introduced in 1983 by

Nishikawa et al. [NKT83], who found that there is a significant correlation between a protein’s AAC and its location, such as inside the cell or outside the cell, as well as its functional properties, such as whether the protein is an enzyme or not. Since then, AACs and their different variations have been used to classify proteins according to many different properties, such as protein structure [NNT86] [TS98] [CZ95], subcellular localization

[CAPPQ97], and whether a transmembrane protein acts as a channel/pore, an electrochemical potential-driven transporter, or a primary active transporter [GY08]. Formal definitions of the different variations are presented below.

1.1 Amino acid composition (AAC)

The AAC is the normalized occurrence frequency of each amino acid. The fractions of all 20 natural amino acids are calculated by:

$$c_i = \frac{F_i}{L}, \qquad i = 1, 2, 3, \ldots, 20 \qquad (1)$$

where $F_i$ is the frequency of the $i$-th amino acid and $L$ is the length of the sequence. Each protein's AAC is represented as a vector of size 20 as follows:

$$AAC(P) = [c_1, c_2, c_3, \ldots, c_{20}] \qquad (2)$$

where $c_i$ is the composition of the $i$-th amino acid.
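As a simple illustration, Equations 1 and 2 can be computed in a few lines of Python; this is a minimal sketch of our own, not code from any of the cited tools.

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def aac(seq):
        # Normalized occurrence frequency of each of the 20 natural amino acids
        # (Equations 1 and 2); the result is a vector of size 20.
        return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

    print(aac("MKTLLVAA"))  # the 20 fractions sum to 1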

1.2 Pair amino acid composition (PAAC)

The PAAC has an advantage over the AAC because it encapsulates information about the fraction of the amino acids as well as their order. It is used to quantify the preference of amino acid residue pairs in a sequence. The PAAC is calculated by:

$$d_{i,j} = \frac{F_{i,j}}{L-1}, \qquad i, j = 1, 2, 3, \ldots, 20 \qquad (3)$$

where $F_{i,j}$ is the frequency of the dipeptide formed by the $i$-th and $j$-th amino acids and $L$ is the length of the sequence. Similar to the AAC, the PAAC is represented as a vector of size 400 as follows:

$$PAAC(P) = [d_{1,1}, d_{1,2}, d_{1,3}, \ldots, d_{20,20}] \qquad (4)$$

where $d_{i,j}$ is the dipeptide composition of the $i$-th and $j$-th amino acids.
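Analogously to the AAC, Equations 3 and 4 can be sketched in Python as below; this is our own illustration under the same conventions as the AAC sketch above.

    from itertools import product

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

    def paac(seq):
        # Dipeptide composition (Equations 3 and 4): the frequency of each of
        # the 400 ordered pairs, normalized by the L - 1 pairs in the sequence.
        pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
        return [pairs.count(d) / len(pairs) for d in DIPEPTIDES]

    print(len(paac("MKTLLVAA")))  # 400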

1.3 Pseudo-amino acid composition (PseAAC)

The PseAAC was proposed in 2001 by Chou [Cho01] and showed a remarkable improvement in the prediction quality when compared to the conventional AAC. PseAAC is a combination of the 20 components of the conventional AAC and a set of sequence-order correlation factors that incorporate some biochemical properties. Given a protein sequence of length L,

$$R_1 R_2 R_3 R_4 \ldots R_L \qquad (5)$$

a set of descriptors called sequence-order-correlated factors is defined as follows:

$$\begin{cases}
\theta_1 = \dfrac{1}{L-1} \sum_{i=1}^{L-1} \Theta(R_i, R_{i+1}) \\[1ex]
\theta_2 = \dfrac{1}{L-2} \sum_{i=1}^{L-2} \Theta(R_i, R_{i+2}) \\[1ex]
\theta_3 = \dfrac{1}{L-3} \sum_{i=1}^{L-3} \Theta(R_i, R_{i+3}) \\[1ex]
\quad \vdots \\[1ex]
\theta_\lambda = \dfrac{1}{L-\lambda} \sum_{i=1}^{L-\lambda} \Theta(R_i, R_{i+\lambda})
\end{cases} \qquad (6)$$

The parameter λ is chosen such that λ < L. The correlation function Θ(R_i, R_j) is given by:

$$\Theta(R_i, R_j) = \frac{1}{3} \left\{ \left[H_1(R_j) - H_1(R_i)\right]^2 + \left[H_2(R_j) - H_2(R_i)\right]^2 + \left[M(R_j) - M(R_i)\right]^2 \right\} \qquad (7)$$

where H1(Ri) is the hydrophobicity value, H2(Ri) is the hydrophilicity value, and M(Ri)

is the side-chain mass of the amino acid Ri. These quantities were converted from the original hydrophobicity value, the original hydrophilicity value and the original side-chain

mass by a standard conversion formula as follows:

$$H_1(R_i) = \frac{H_1^\circ(R_i) - \dfrac{1}{20} \sum_{k=1}^{20} H_1^\circ(R_k)}{\sqrt{\dfrac{1}{20} \sum_{y=1}^{20} \left[ H_1^\circ(R_y) - \dfrac{1}{20} \sum_{k=1}^{20} H_1^\circ(R_k) \right]^2}} \qquad (8)$$

where $H_1^\circ(R_i)$ is the original hydrophobicity value of amino acid $R_i$, which can be taken from the work of Tanford [Tan62]; $H_2^\circ(R_i)$ and $M^\circ(R_i)$ are converted to $H_2(R_i)$ and $M(R_i)$, respectively, in the same way. The original hydrophilicity value $H_2^\circ(R_i)$ of amino acid $R_i$ can be obtained from Hopp and Woods [HW81], and the mass $M^\circ(R_i)$ of the $R_i$ side chain can be obtained from any biochemistry textbook. PseAAC is represented as a vector of size (20 + λ) as follows:

$$PseAAC(P) = [s_1, \ldots, s_{20}, s_{21}, \ldots, s_{20+\lambda}] \qquad (9)$$

where the components $s_i$ are defined as follows:

$$s_i = \begin{cases} \dfrac{f_i}{\sum_{r=1}^{20} f_r + \omega \sum_{j=1}^{\lambda} \theta_j} & 1 \le i \le 20 \\[2ex] \dfrac{\omega\, \theta_{i-20}}{\sum_{r=1}^{20} f_r + \omega \sum_{j=1}^{\lambda} \theta_j} & 20 < i \le 20 + \lambda \end{cases} \qquad (10)$$

where $f_i$ is the normalized occurrence frequency of the $i$-th amino acid in the protein sequence, $\theta_j$ is the $j$-th sequence-order-correlated factor calculated from Equation 6, and ω is a weight factor for the sequence-order effect. The weight factor ω weights the additional PseAAC components with respect to the conventional AAC components. The user can select any value from 0.05 to 0.7 for the weight factor. The default value given by

Chou [Cho01] is 0.05.
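The following Python sketch assembles the PseAAC vector from Equations 6, 7, 9, and 10. The property tables are assumed to be already standardized by Equation 8; the toy values below are hypothetical placeholders for the published hydrophobicity, hydrophilicity, and side-chain mass scales.

    def theta_factors(seq, props, lam):
        # Sequence-order-correlated factors theta_1..theta_lambda (Equation 6).
        # props is a list of standardized per-amino-acid property tables.
        def corr(a, b):  # Theta(R_i, R_j), Equation 7, averaged over properties
            return sum((p[b] - p[a]) ** 2 for p in props) / len(props)

        L = len(seq)
        return [sum(corr(seq[i], seq[i + k]) for i in range(L - k)) / (L - k)
                for k in range(1, lam + 1)]

    def pse_aac(seq, props, lam, w=0.05):
        # PseAAC vector of size 20 + lambda (Equations 9 and 10).
        aas = "ACDEFGHIKLMNPQRSTVWY"
        f = [seq.count(a) / len(seq) for a in aas]
        theta = theta_factors(seq, props, lam)
        denom = sum(f) + w * sum(theta)
        return [fi / denom for fi in f] + [w * t / denom for t in theta]

    # Toy standardized property table (hypothetical values, illustration only).
    h1 = {"M": 0.5, "K": -1.2, "T": 0.1, "L": 1.0, "V": 0.9, "A": 0.3}
    print(len(pse_aac("MKTLVVA", [h1], lam=2)))  # 22 = 20 + lambda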

2 Databases

2.1 Transporter classification database (TCDB)

The TCDB [SJTB06] uses a classification system approved by the International Union of Biochemistry and Molecular Biology (IUBMB) for membrane transport proteins; it is known as the TC system. The TCDB is a curated database of accurate and experimentally characterized information collected from over 10,000 published references. As of March

2020, it contains more than 19,500 unique protein sequences classified into more than

1,448 transporter families. Each entry in the database has a TC identifier (TCID) that consists of five components, V.W.X.Y.Z, where V is a number from 1 to 9 that corresponds to the transporter class (e.g., channels, carriers, pumps (active transport)), W refers to a transporter subclass, X is a number that refers to the transporter family, Y is a number that corresponds to the transporter subfamily, and Z refers to the substrate or range of substrates transported. Figure 7 shows an example TCDB entry.

Figure 7: TCDB entry example The TCID consists of five components: V.W.X.Y.Z. V is a number from 1-9 that corresponds to the transporter class (e.g., channels, carrier, pumps (active transport)), W is a letter that refers to the transporter subclass, X is a number that refers to the transporter family, Y is also a number that corresponds to the transporter subfamily, and Z refers to the substrate or range of substrates.

The transporter class TC.9 contains all of the uncharacterized transporters and has approximately 3,266 entries.
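Programmatically, the five components of a TCID are easy to recover; the sketch below is our own illustration, using an identifier of the correct V.W.X.Y.Z form.

    def parse_tcid(tcid):
        # Split a TC identifier of the form V.W.X.Y.Z into its five components.
        v, w, x, y, z = tcid.split(".")
        return {"class": v,      # 1-9: channels, carriers, pumps, ...
                "subclass": w,   # a letter
                "family": x,
                "subfamily": y,
                "substrate": z}  # substrate or range of substrates

    print(parse_tcid("1.A.1.1.1"))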

2.2 Universal Protein Resource Knowledgebase (UniProtKB)

The UniProtKB [ABW+04] is the primary worldwide database of protein sequences and highly annotated functional information. UniProtKB employs GO annotation, which

associates GO terms (see Section 3.2) with UniProtKB records. This association is

accompanied by the reference from which the evidence is derived, in addition to the evidence

code that indicates the degree to which the annotation is supported. The evidence codes

are commonly encoded as three-letter “GO evidence codes”. However, they are now being

replaced by Evidence and Conclusion Ontology (ECO) (see Section 3.3) terms that provide

the ability to capture more in-depth evidence information than traditional GO evidence

codes. Furthermore, each protein record contains a list of keywords that summarizes the

content of a UniProtKB entry and facilitates the search for proteins of interest. The

keywords are a controlled vocabulary with a hierarchical structure and are added during the

manual annotation process. Generally, UniProtKB consists of two sections: Swiss-Prot

and TrEMBL.

Swiss-Prot contains well-annotated, nonredundant proteins that have been manually

inspected. The annotation includes the protein and gene name, keyword assignment,

function, subcellular location, peer-reviewed references, secondary structure elements,

cross-references to other biological databases and information about their function. Most

GO annotations in the Swiss-Prot database are supported by ECO manual curation terms.

TrEMBL contains unrevised and automatically annotated protein sequences. All of the

GO terms are associated with ECO terms based on automatic assertion. In addition, TrEMBL

entries generally contain fewer keywords than Swiss-Prot entries, and the keywords are

assigned automatically according to specific annotation rules.

As of March 2020, Swiss-Prot contains 561,611 sequence entries, and TrEMBL

contains 177,754,527 sequence entries.

3 Ontologies

3.1 Chemical entities of biological interest (ChEBI)

ChEBI [HOD+15] is a database and ontology that contains information about chemical

entities. Each entry in the database is classified within the ontology. ChEBI serves as an

annotation source of unique and reliable identifiers for chemicals. It is commonly used

in many bioinformatics databases and as a chemistry component of several ontologies,

including GO [HAB+13]. The ChEBI ontology contains three subontologies:

• A chemical ontology in which entities are classified based on their structural features

and properties (e.g., monosaccharide, carboxylic acid, or anion);

• A role ontology in which the entities are classified based on their activities in chemical

or biological systems (e.g., vitamin, drug, or enzyme);

• A subatomic particle ontology in which particles smaller than atoms are classified

(e.g., photons or nucleons).

ChEBI uses nomenclature, symbolism, and terminology endorsed by the International Union

of Pure and Applied Chemistry (IUPAC) and the Nomenclature Committee of the IUBMB

(NC-IUBMB). The primary ontology relationships are the “is a” relationship, which is used for classification; the “has part of” relationship, which links composite entities to their component parts; and the “has a role” relationship, which links a chemical entity to a role in the role

ontology. Figure 8 displays a graph view of a hexose (CHEBI:18133).

Figure 8: Example of a ChEBI ontology graph view. This figure illustrates a full graph view of a hexose (CHEBI:18133) in the ChEBI ontology. All of the presented relationships in this case are “is a” relationships between the ontology terms. By clicking on ontology terms within the view, one can see a definition for that specific term.

3.2 Gene Ontology (GO)

The GO Project [ABB+00] is the largest resource available that provides an ontology of

defined terms representing gene product properties. The GO Project describes functions with

respect to three domains: molecular function (MF), biological process (BP) and cellular

component (CC). The MF concerns the activities of a gene product at the molecular

level. The BP term includes the larger processes accomplished by multiple molecular

activities. CC encompasses the locations relative to cellular structures in which a gene

product performs its function. The ontology is structured as a directed acyclic graph in

which each term has a specific relationship to one or more other terms in the same domain.

The GO terms that refer to chemical entities have fully defined semantic relationships

with corresponding chemical terms in ChEBI. This correspondence is intended to facilitate

an accurate and consistent, system-wide chemical view of the biological representation

[HAB+13].

Transporter-related terms include the GO MF term GO:0005215 transporter activity, which is defined as “the function that enables the directed movement of substances (such as macromolecules, small molecules, ions) into, out of, or within a cell, or between cells” and the GO BP term GO:0006810 transport.

3.3 Evidence and Conclusion Ontology (ECO)

ECO is a structured, controlled vocabulary for capturing evidence in biological research.

This ontology is used to document evidence-based conclusions derived from investigations

[CMB+14]. The current version of ECO contains more than 600 terms arranged in a hierarchy with two high-level classes, namely, evidence (ECO:0000000) and assertion method

(ECO:0000217). Evidence is defined as a “type of information that is used to support

an assertion”. The majority of evidence is either experimental (e.g., expression pattern

evidence) or computational (e.g., sequence similarity evidence); other types include author

statements (with or without traceable reference) and curator inferences. In addition to

the evidence, ECO describes the mechanism by which an assertion is made (e.g., manually

by a curator or electronically). The assertion method is defined as “a means by which a

statement is made about an entity”. For example, if an algorithm was used to assign a predicted function to a protein without any curator judgment, ECO expresses this situation

as an automatic assertion. Similarly, if a curator creates the annotation after reading a

result reported in a paper, ECO captures that as a manual curation.

4 Substitution matrix

The amino acid similarity can be thought of in terms of chemical similarity [KYB03].

That is, amino acids that share similar chemical properties are more closely related than those with different properties. Figure 9 delineates a rough qualitative representation of the chemical relationships of the amino acids. From an evolutionary point of view, one would expect mutations that completely change the chemical properties of an amino acid to be less common, as they may alter the protein's 3D structure, while changes between similar amino acids happen more frequently [KYB03].

Figure 9: Chemical relationships among amino acids [KYB03]

The scoring matrix contains a quantitative measure of amino acid similarity.

Establishing a scoring matrix based on the general chemical properties of amino acids is possible; however, it does not account for substitutions that are more likely to happen from an evolutionary standpoint. Thus, most of the widely used scoring matrices are based on the observed conservation of amino acids over time, such as point accepted mutations (PAM)

and blocks substitution matrices (BLOSUMs). Since these matrices represent relative rates of evolutionary substitutions, they are also called substitution matrices. An example of a BLOSUM substitution scoring matrix is shown in Figure 10. The scores in the matrix represent the log odds ratio of the observed probability of a substitution of a pair of amino acids divided by the probability expected purely due to chance, calculated as follows:

$$S_{i,j} = \frac{1}{\lambda} \log \frac{q_{i,j}}{p_i p_j} \qquad (11)$$

where $q_{i,j}$ is the probability of amino acids $i$ and $j$ replacing each other in a homologous sequence; $p_i$ and $p_j$ are the individual background probabilities of amino acids $i$ and $j$, respectively; and $\lambda$ is a scaling factor set such that the matrix contains easily computable integer values.

Figure 10: BLOSUM62 substitution matrix. BLOSUM62 indicates a BLOSUM computed by choosing blocks that are more than 62% identical.

Substitution matrices are key in protein comparison, identifying homologs, and scoring in any sequence alignment algorithm.
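To illustrate how a substitution matrix is used in scoring, the following sketch sums per-column substitution scores over an ungapped alignment. Only a small symmetric excerpt of BLOSUM62 is included; a real implementation would load the full 20 x 20 matrix.

    # Small symmetric excerpt of BLOSUM62 (the full matrix covers all pairs).
    B62 = {("A", "A"): 4, ("V", "V"): 4, ("A", "V"): 0,
           ("L", "L"): 4, ("I", "I"): 4, ("L", "I"): 2,
           ("K", "K"): 5, ("R", "R"): 5, ("K", "R"): 2}

    def sub_score(a, b, matrix=B62):
        # Look up the substitution score for a pair, ignoring the pair's order.
        return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]

    def alignment_score(s1, s2):
        # Sum of per-column log-odds scores for an ungapped alignment; positive
        # totals indicate more similarity than expected by chance.
        return sum(sub_score(a, b) for a, b in zip(s1, s2))

    print(alignment_score("AVLK", "AVIK"))  # 4 + 4 + 2 + 5 = 15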

5 Basic Local Alignment Search Tool (BLAST)

BLAST [AMS+97] is one of the most popular sequence search programs in bioinformatics. It is used to compare primary sequence information and find regions of local similarity between sequences. BLAST uses a heuristic method to identify homologous sequences by searching for short similarity regions and expanding the hits in both directions.

Given a query protein Q, BLAST first compiles a list of overlapping words of length w and finds neighborhood words whose scores are greater than a threshold t. The default setting for w is 11 for DNA alignment and 3 for protein alignment. Next, the database

is scanned for exact matching words with the compiled list of words. The matches are

extended in both the left and right directions until the score drops below a predetermined

threshold x. An alignment with a score above the threshold is called a high-scoring segment pair (HSP). The scores are determined according to a substitution matrix, and the default

is BLOSUM62. Finally, BLAST calculates the statistical significance of the HSP as the

expected value (e-value) and sorts the results according to it. The e-value describes the

likelihood that a given score would occur by chance and is calculated as follows:

$$\text{e-value} = Kmn e^{-\lambda S} \qquad (12)$$

where $S$ is the alignment score of the HSP; $m$ and $n$ are the query length and the length of the database, respectively; $\lambda$ is a parameter that scales the scoring system; and $K$ is a

scaling factor. λ and K are based on the Karlin-Altschul theory; thus, they are often called

Karlin-Altschul statistics.

The e-value in Equation 12 decreases exponentially as the alignment score increases. In

addition, the relationship between the e-value and the search space (mn) is linear; that is, if the search space is doubled, the e-value is also doubled.
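The sketch below evaluates Equation 12 directly. The λ and K values are illustrative gapped BLOSUM62 parameters, not constants of the equation; BLAST reports the values it actually used in its output.

    import math

    def blast_evalue(S, m, n, lam=0.267, K=0.041):
        # Karlin-Altschul expected value (Equation 12): E = K*m*n*e^(-lambda*S).
        return K * m * n * math.exp(-lam * S)

    # Doubling the search space doubles the e-value; raising S shrinks it
    # exponentially.
    print(blast_evalue(S=50, m=300, n=1_000_000))
    print(blast_evalue(S=50, m=300, n=2_000_000))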

6 Multiple sequence alignment (MSA)

MSAs are fundamental tools for protein structure and function prediction, phylogenetic analysis, and other bioinformatics and molecular evolutionary applications. An MSA is a collection of more than two protein sequences that are partially or completely aligned into a rectangular array. The goal of an MSA is to align the sequences in such a way that the residues in a given column are homologous in an evolutionary sense (derived from the same residue of a shared ancestor), homologous in a structural sense (occupying the same positions in the 3D structure), or have a common function. In closely related sequences (40% amino acid identity or more), these three principles are essentially the same. On the other hand, if the protein sequences have diverged over evolutionary time, these principles may result in considerably different alignments, and the MSA becomes extremely difficult to solve [EB06] [Pev09]. MSA development is an active area of research; over the past decade, dozens of algorithms have been introduced. The most popular MSA algorithms are reviewed here.

The exact methods use dynamic programming to find the globally optimal alignment with time complexity $O(L^N)$, where L is the average sequence length and N is the number of

aligned sequences. Since time grows exponentially as N increases, these methods are not

feasible for use unless N is very small [KT08].

ClustalW [THG94], one of the most popular MSA heuristic algorithms, uses a

progressive method. First, the algorithm performs pairwise alignments of all of the sequences and records the similarity of each pair of sequences in a matrix. The

similarity scores are usually converted into distance scores. Second, the algorithm uses the

distance score matrix to construct a rough phylogenetic tree called a guide tree. Finally,

ClustalW progressively aligns the sequences by following the branching order of the guide

tree. Progressive methods are very efficient, and hundreds of sequences can be aligned

rapidly with these methods. However, when an error is introduced in the early stages of

the alignment, it cannot be corrected, which may increase the likelihood of misalignment

due to incorrect conservation signals [Pev09] [DOS13].

Clustal Omega [SWD+11], the latest algorithm from the Clustal family, is highly efficient

and more accurate than ClustalW. Clustal Omega is capable of aligning more than 190,000 sequences on a single processor in a few hours [SWD+11]. Like ClustalW, Clustal Omega

first performs a pairwise alignment. Then, to reduce the number of distance calculations required to build the guide tree, Clustal Omega uses a modified version of mBed [BSS+10], which involves embedding the sequences in a space where the similarities within a set of sequences can be approximated without the need to compute all the pairwise distances.

The sequences can then be clustered extremely quickly to produce the guide tree. Finally, progressive alignments are computed using the HHalign package [Söd05], which aligns the sequences with two hidden Markov model profiles.

Iterative methods overcome the inherited limitation of the progressive method (i.e., the error cannot be removed once introduced). The multiple alignment using fast Fourier transform (MAFFT) algorithm [KMKM02] is an iterative method that uses two-cycle heuristics. Initially, it aligns the sequences using progressive methods and then refines the alignment by calculating and optimizing the sum-of-pairs score. The MAFFT algorithm also identifies homologous regions by a fast Fourier transform, where the amino acid sequence is converted to a sequence with volume and polarity values for each amino acid residue.

The idea behind consistency-based methods is that for sequences $x$, $y$, and $z$, if residue $x_i$ aligns with residue $y_j$ and residue $y_j$ aligns with residue $z_k$, then residue $x_i$ aligns with residue $z_k$. The consistency of each pair of residues with residue pairs from all of the other

alignments is examined and weighted so it reflects the degree to which those residues align

consistently with other residues. The tree-based consistency objective function for alignment

evaluation (T-Coffee) algorithm [NHH00], a consistency-based method, is considered one of

the most accurate programs available based on benchmarking studies. T-Coffee takes into

account both global and local pairwise alignments. Local similarity is used to reveal when

two proteins share part of the sequence, e.g., a domain or motif.

All of the abovementioned algorithms are general-purpose algorithms that can be used

to align any related protein sequences. In other words, they use general scoring schemes

tailored for sequences of soluble proteins. Because the regions of transmembrane proteins

that are inserted into the cell membrane have a profoundly different hydrophobicity pattern

compared with soluble proteins, these algorithms may not produce the optimal alignment

for transmembrane proteins [PFH08].

Few packages have been published to address the problem of aligning transmembrane proteins, such as PRALINE-TM [PFH08], TM-Coffee [CDTTN12] and the simple transmembrane alignment method (STAM) [SG04]. Most of these algorithms use a homology extension technique. In homology extension methods, database searches are used to replace each sequence with the profile of closely related homologs. Consequently, each sequence position becomes a column in the multiple alignments that reveals the pattern of acceptable mutations. TM-Coffee is the most accurate method based on benchmarking studies performed by Notredame et al. [CDTTN12]. The TM-Coffee algorithm can be summarized as follows: for each sequence in need of alignment, a homology search is performed using BLAST [AMS+97], and the hits with an identity between 50% and 90%

and a coverage of more than 70% are retained. Then, the BLAST output is transformed

into a profile where all columns corresponding to unaligned positions (i.e., gaps) in the

query are removed, and the query positions unmatched by BLAST are filled with gaps.

Finally, a T-Coffee library is produced by aligning every pair of profiles. TM-Coffee shows

a 10% improvement over MSAProbs [LSM10], which is the next best method that uses

homology extension. Although homology extension-based methods achieve much more

accurate alignments than standalone methods, performing an alignment takes several orders

of magnitude longer [EB06].

The assessment of MSA has been the subject of research in recent years. Particularly,

efforts have been devoted to answering two main questions: how to obtain alignments

associated with the optimal score and how to evaluate the “goodness” of an alignment. A

reliable way to evaluate the alignment is to compare the alignment result with known 3D

structures established by X-ray crystallography. Because it has been demonstrated that

even proteins with low sequence identity (less than 40%) can share similar 3D structures, a

comparison of the 3D structures makes it possible to align distantly related proteins with

low sequence similarity on the basis of their structural equivalence [Kri07] [Got96].

Several benchmark datasets have been created as reference sets in which alignments are

created from proteins with known structures. In this way, one can evaluate the result of

a proposed MSA algorithm on the basis of studied proteins that are experimentally and

structurally homologous. Many studies devoted to comparing different MSA algorithms on

They can serve as a guide for researchers to choose the appropriate algorithm for a given dataset. The general conclusion is that there is a trade-off between the computational cost and the accuracy; the accuracy can vary greatly if the sequences under study are highly divergent. In addition, there is no available MSA program that outperforms the others in all test cases [PdROC14].

Chapter 4

Identifying membrane proteins

This chapter addresses the first research objective:

O1: To improve the computational approaches for detecting de novo membrane proteins, relying only on the protein primary sequence.

Some of the contents in this chapter were presented at the 2019 Network Tools and

Applications in Biology (NETTAB) international conference in Italy and have been submitted for publication in BMC Bioinformatics: M. Alballa, G. Butler, Integrative approach for detecting membrane proteins, BMC Bioinformatics (under review).

The chapter is organized as follows: Section 1 gives an introduction to the problem and points out the main contributions we make in the chapter. Section 2 introduces a new membrane dataset (DS-M) and lists the materials and methods utilized in the experiments; Section 3 delineates the experimental design; Section 4 presents the results and finds that an integrative approach, which we call TooT-M, outperforms all of the other methods; and Section 4.5 compares the results obtained by TooT-M with those obtained by the state-of-the-art membrane predictors. Finally, Section 5 concludes the chapter.

1 Introduction

Transmembrane proteins (Figure 6 (a)–(e)) constitute a major class of membrane proteins.

These proteins have one or more TMSs embedded in the lipid bilayer in addition to extramembranous hydrophilic segments that extend into the water-soluble domains on each

side of the lipid bilayer. The embedded segments are distinguishable because they contain residues with hydrophobic properties that interact with the hydrophobic (nonpolar) tails of the membrane phospholipids. Other classes of membrane proteins include surface-bound proteins that do not extend into the hydrophobic interior of the lipid bilayer; they are typically bound to the lipid head groups at the membrane surface or attach to other IMPs

(Figure 6 (f), (g), and (h)). Unlike transmembrane proteins, surface-bound proteins such as peripheral and lipid-anchored proteins do not have TMSs; they are therefore more difficult to distinguish from other globular proteins.

Two distinct approaches, namely, transmembrane topology prediction and membrane structure type prediction, are primarily used to detect membrane proteins. While transmembrane topology tools predict only a subset of membrane proteins (transmembrane proteins), they are applied more often than membrane structure type prediction tools due to the vast number of tools available and because transmembrane proteins constitute a major class of membrane proteins. However, by overlooking other classes of membrane proteins, essential information is lost. By contrast, membrane structure type predictions can be used to detect all classes of membrane proteins. A comprehensive review by Butt et al. [BRK17] discussed these methods in detail. Generally, there are two stages of prediction: the first stage identifies the protein sequence as that of a membrane or nonmembrane protein, while the second stage differentiates specific subtypes of membrane proteins. In this work, we focus on detecting membrane proteins of all types, i.e., the first stage. That is, given a protein sequence Q, is it a membrane protein?

The state-of-the-art tools that have achieved the highest overall performance in predicting all types of membrane proteins are MemType-2L [CS07] and iMem-2LSAAC

[AHJ18]. While MemType-2L [CS07] has been in use for over a decade, it has maintained its popularity due to its simple yet effective methodology. MemType-2L incorporates evolutionary information by representing protein samples with Pse-PSSM vectors and combining the results obtained from individual OET-KNN classifiers. By contrast, iMem-2LSAAC uses the split AAC to extract features from protein samples and then SVMs to train the predictor.

MemType-2L is the only accessible tool for the prediction of all types of membrane

proteins. When we tested it on a new set of membrane proteins, the accuracy reached only

80%. This could be because it was trained on the available protein sequences from 2006; however, the protein sequence landscape has drastically changed. As presented in Figure 11, a large surge in protein sequence entries has been recorded since 2006, and the tool may have missed patterns present in more recent data. It is therefore essential to build a new accessible tool that accommodates all membrane data.

Figure 11: Number of entries in the Swiss-Prot database over time. This figure is from the official statistics page on the UniProt website

The main contributions of this work can be summarized as follows:

• We establish a new benchmark dataset for membrane proteins (DS-M ).

• We evaluate the performances of traditional transmembrane topology prediction tools

on DS-M to predict all types of membrane proteins.

• We compare the performances of various machine learning techniques to detect

membrane proteins; this comparison involved applying different feature extraction

techniques to encode protein sequences and choosing the proper machine learning

algorithm to build a model on the extracted vectors.

• We introduce a novel method, TooT-M, which integrates different techniques and achieves superior performance compared to all of the other methods, including the state-of-the-art methods.

2 Materials and methods

2.1 Dataset

The latest publicly available benchmark dataset that contains both membrane and nonmembrane proteins was constructed by Chou and Shen [CS07] and was used to construct the MemType-2L predictor. Their dataset was collected from the Swiss-Prot database version 51.0, released on October 6, 2006. Furthermore, they eliminated proteins with 80% or more similarity in their sequences to reduce homology bias. Chou and Shen’s dataset contains a total of 15,547 proteins, of which 7,582 are membrane proteins and 7,965 are nonmembrane proteins.

Because of the rapidly increasing sizes of biological databases, we built a new, updated dataset, DS-M. This dataset was collected from the June 2018 release of the Swiss-Prot database. The annotated membrane proteins were retrieved by extracting all of the proteins that are located in the membrane, using the following search query:

    locations:(location:membrane) AND reviewed:yes

The remainder of the Swiss-Prot entries were designated as nonmembrane proteins.

The sequences in both classes were filtered by adhering to the following criteria:

• Step 1: Protein sequences that have evidence “inferred from homology” for the

existence of a protein were removed.

• Step 2: Protein sequences less than 50 amino acids long were removed, as they could

be fragments.

• Step 3: Protein sequences that have no GO MF annotation or annotation based only

on computational evidence (inferred from electronic annotation, IEA) were excluded.

• Step 4: Protein sequences with more than 60% pairwise sequence identity were

removed via a CD-HIT [LG06] program to avoid any homology bias.

All sequences from the membrane class and randomly selected sequences from the nonmembrane class were used to form the benchmark dataset. The data were randomly

Figure 12: Membrane functional types

Figure 13: Membrane structural types

divided (stratified by class) into the training (90%) and testing (10%) sets, as illustrated in

Table 1. Since the testing set was kept aside until the final testing phase and was not used during training or model selection, we refer to it as the independent testing set.

Class         Training   Testing   Total
Membrane      7,945      883       8,828
Nonmembrane   8,157      907       9,064
Total         16,102     1,790     17,892

Table 1: Membrane dataset DS-M

The dataset contains samples from different species, with the most sequences coming from Homo sapiens (18%), Arabidopsis thaliana (14%), Mus musculus (11%), Saccharomyces cerevisiae (8%), and Schizosaccharomyces pombe (6%). Enzymes, transporters, receptors, and proteins with other annotations account for 33%, 25%, 13%, and

29%, respectively, of the membrane data collected, as presented in Figure 12.

Approximately 84% of the membrane data collected have a structural type annotation.

Figure 13 indicates that of the annotated proteins, approximately 75% are transmembrane proteins (single or multipass), while the remainder are peripheral, lipid-anchored, or

GPI-anchored proteins.

2.2 Topology prediction tools

A protein is regarded as a membrane protein if at least one TMS is detected. With respect to α-helical transmembrane proteins, three tools were applied. TOPCONS2

[TPS+15] is considered the state-of-the-art method and is known for its ability to distinguish signal peptides from transmembrane regions. TOPCONS2 results were obtained through its available web server. The second tool is HMMTOP [TS01], which is a highly efficient tool commonly used in the literature. HMMTOP results were also obtained through its web server. The third tool is the transmembrane hidden Markov model (TMHMM) [KLvHS01], which is also commonly applied in the literature, and its results were obtained from its web server.

Regarding β-barrel transmembrane proteins, we applied PRED-TMBB2 [TEB16], which shows comparable performance to the state-of-the-art β-barrel predictors but is much more efficient in terms of the runtime [TEB16]. We used the PRED-TMBB2 web server to obtain the results.

2.3 Protein sequence encoding

After establishing the dataset, it is necessary to find the best representation of the protein sequences and use it to train the prediction engine. Generally, there are two options in representing a protein sequence: sequential or discrete representations [CS07].

In sequential representations, a sample protein is represented by its amino acid sequence, which is then used in a similarity search-based tool such as BLAST [AMS+97], where the top hits give an indication of the function of the query protein. A major drawback of relying

on the similarity is that it fails when proteins with the same function share a low sequence

similarity. In discrete representations, a sample protein is represented by a set of discrete

numbers that are usually the result of feature engineering. In this study, we encoded the

protein sequences using the AAC, PAAC, and PseAAC baseline compositions. In addition,

we applied the Pse-PSSM and SAAC as described below.

2.3.1 Pseudo position-specific scoring matrix (Pse-PSSM)

We adopted the Chou and Shen [CS07] protein-encoding strategy, Pse-PSSM. The

Pse-PSSM is built by first performing a Position-Specific Iterative BLAST (PSI-BLAST)

[AMS+97] search on a protein sequence P using the Swiss-Prot database (3 iterations,

e-value cutoff of 0.001) and retrieving the PSSM profile:

$$P_{PSSM} = \begin{bmatrix}
E_{1 \to 1} & E_{1 \to 2} & E_{1 \to 3} & \cdots & E_{1 \to 20} \\
\vdots & \vdots & \vdots & & \vdots \\
E_{i \to 1} & E_{i \to 2} & E_{i \to 3} & \cdots & E_{i \to 20} \\
\vdots & \vdots & \vdots & & \vdots \\
E_{L \to 1} & E_{L \to 2} & E_{L \to 3} & \cdots & E_{L \to 20}
\end{bmatrix} \qquad (13)$$

$P_{PSSM}$ has L rows (a row for each position in protein sequence P) and 20 columns (one for each amino acid). Each element $E_{i \to j}$ represents the score for the substitution of the amino acid in the $i$-th position of the protein sequence by the amino acid of type $j$ during evolution. Since the number of rows in the PSSM depends on the length of the protein sequence P, Pse-PSSM first standardizes the PSSM scores so that they have a mean value of zero over the 20 amino acids and then uses the following uniform-size vector to represent protein sequence P:

$$P_{\text{Pse-PSSM}} = \left[ \bar{E}_1, \bar{E}_2, \ldots, \bar{E}_{20}, G_1^\lambda, G_2^\lambda, \ldots, G_{20}^\lambda \right] \qquad (14)$$

where $\bar{E}_j$ and $G_j^\lambda$ are defined as follows:

$$\bar{E}_j = \frac{1}{L} \sum_{i=1}^{L} E_{i \to j} \qquad (j = 1, 2, \ldots, 20) \qquad (15)$$

$$G_j^\lambda = \frac{1}{L - \lambda} \sum_{i=1}^{L - \lambda} \left[ E_{i \to j} - E_{(i+\lambda) \to j} \right]^2 \qquad (j = 1, 2, \ldots, 20) \qquad (16)$$

The parameter λ is chosen such that λ < L. Since the shortest sequence in our dataset is 50 amino acids long, we considered all λ ∈ (0,...,49), and the performance of each encoding was evaluated separately.
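A minimal sketch of Equations 14-16 is given below, assuming the PSSM has already been standardized to zero mean over the 20 amino acids; retrieving the profile itself requires a PSI-BLAST run, which is omitted here.

    def pse_pssm(pssm, lam):
        # Pse-PSSM vector (Equation 14) from a standardized L x 20 profile,
        # where pssm[i][j] is E_{i->j}. The first 20 components are the column
        # means (Equation 15); the last 20 are the lagged squared differences
        # (Equation 16). With lam = 0 the lagged terms are all zero, leaving
        # the plain averaged profile.
        L = len(pssm)
        assert 0 <= lam < L
        means = [sum(row[j] for row in pssm) / L for j in range(20)]
        lagged = [sum((pssm[i][j] - pssm[i + lam][j]) ** 2
                      for i in range(L - lam)) / (L - lam)
                  for j in range(20)]
        return means + lagged

    # Toy 5 x 20 profile standing in for a real PSSM (hypothetical values).
    toy = [[0.1 * ((i + j) % 5 - 2) for j in range(20)] for i in range(5)]
    print(len(pse_pssm(toy, lam=2)))  # 40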

2.3.2 Split amino acid composition (SAAC)

The concept of SAAC was first reported by Hayat et al. [Hay12]. The motivation behind

this concept is that sometimes the most important information is concealed in fragments,

and when calculating the AAC for the whole sequence, such information may be masked by noisy, irrelevant information. The SAAC is the sequence encoding used by iMem-2LSAAC, a state-of-the-art predictor of membrane proteins [AHJ18].

In SAAC, a protein sequence is divided into segments, and the AAC is computed for each segment separately. Here, we followed the same partitioning described for iMem-2LSAAC

[AHJ18]: the sequence is divided into three sections, namely, the first 25 amino acids of the N-terminus, the last 25 amino acids of the C-terminus, and the region between these sections. Each protein is then represented by a vector of size 60, as follows:

$$SAAC(P) = \left[ c_1^N, c_2^N, \ldots, c_{20}^N, c_1, c_2, \ldots, c_{20}, c_1^C, c_2^C, \ldots, c_{20}^C \right] \qquad (17)$$

where $c_i^N$, $c_i$, and $c_i^C$ are the normalized occurrence frequencies of the $i$-th amino acid in the N-terminus, the region between the two termini, and the C-terminus segments, respectively.
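A sketch of Equation 17 under the iMem-2LSAAC partitioning follows; it assumes the sequence is longer than 50 residues so that the middle segment is nonempty, consistent with the length filter applied to the dataset.

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def _aac(segment):
        return [segment.count(a) / len(segment) for a in AMINO_ACIDS]

    def saac(seq, term=25):
        # Split AAC (Equation 17): AAC of the first 25 residues (N-terminus),
        # the middle region, and the last 25 residues (C-terminus), concatenated
        # into a vector of size 60. Assumes len(seq) > 2 * term.
        return _aac(seq[:term]) + _aac(seq[term:-term]) + _aac(seq[-term:])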

2.4 Machine learning algorithms

2.4.1 K-nearest neighbor (KNN)

The KNN classification algorithm is simple and effective. It is a type of instance-based learning, where all computations are deferred until prediction time. The KNN algorithm assigns a class to an unclassified object X based on the majority class among its K nearest neighbors in the training set. If K = 1, the class of object X will be the class of its nearest neighbor. The choice of K is key to the quality of the KNN prediction engine; we found that the performance started to deteriorate when K > 10. We also found that

fusing the results of 10 individual classifiers, one for each K ∈ (1,...,10), through majority voting

achieved the highest accuracy and was adopted for the KNN models. We applied the KNN

algorithm as implemented by the class library in R (version 7.3-15).
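The fused-K majority vote can be reproduced with off-the-shelf components; since the thesis used the R class library, the scikit-learn sketch below is only an analogous illustration.

    from sklearn.ensemble import VotingClassifier
    from sklearn.neighbors import KNeighborsClassifier

    def fused_knn():
        # Ten KNN members, one per K in 1..10, combined by hard majority voting.
        members = [(f"knn{k}", KNeighborsClassifier(n_neighbors=k))
                   for k in range(1, 11)]
        return VotingClassifier(estimators=members, voting="hard")

    # Usage: fused_knn().fit(X_train, y_train).predict(X_test)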

2.4.2 Optimized evidence-theoretic k-nearest neighbor (OET-KNN)

The OET-KNN algorithm is a modification of the traditional KNN algorithm and has

been shown to be highly powerful in statistical prediction [Den95]. It has been used by

one of the most powerful membrane predictors, MemType-2L. The OET-KNN algorithm is

based on the Dempster-Shafer theory of belief functions [Den95], wherein each neighbor of a pattern to be classified is regarded as an item of evidence supporting certain hypotheses concerning the class membership of that pattern. As with the KNN algorithm, any constructed OET-KNN model is an ensemble of multiple OET-KNN classifiers, one for each value of

K ∈ (1,...,10). The final class was determined through majority voting. We used the

OET-KNN algorithm as implemented in R by the evclass library (version 1.1.1).

2.4.3 Support vector machine (SVM)

SVMs are a powerful machine learning tool used in many biological prediction tools, such as in [MCZ14] and [AHJ18]. SVMs solve classification problems by finding appropriate decision boundaries between different classes. For nonlinearly separable data, the kernel trick can be used to implicitly transform the data into a higher-dimensional space where optimal boundaries can be found, which is less computationally expensive than explicitly computing the coordinates in that space. We used an SVM with a radial basis function (RBF) kernel as implemented by the R e1071 library (version 1.6-8). The best combination of the C and γ parameters was determined by a grid search.

2.4.4 Gradient-boosting machine (GBM)

GBMs are a machine learning technique that produces a strong model by combining weak prediction models, usually decision trees. They use gradient boosting to iteratively train new models that focus on the weak points of the previous models. While not commonly applied in biological predictions, GBMs have been demonstrated to be one of the most powerful techniques on the popular machine learning competition website Kaggle

(kaggle.com). Here, we used the xgboost library (version 0.81.0.1), which is a fast and efficient implementation of the gradient-boosting framework in R.

2.5 Ensemble classifier

2.5.1 All voting

Let $C_i^{ML}$ be a classifier built using the machine learning algorithm ML ∈ {KNN, OET-KNN, SVM, GBM}, in which the protein samples are represented by Pse-PSSM with λ = i and i ∈ (0,...,49); each classifier is constructed as described in Section 2.4. In addition, let $C_{i,k}^{ML}$ be a classifier built using the machine learning algorithm ML ∈ {KNN, OET-KNN}, in which the protein samples are represented by Pse-PSSM with λ = i and i ∈ (0,...,49), and the parameter K that refers to the number of neighbors equals k, with k ∈ (1,...,10).

In all voting, we evaluated the following six different ensembles:

• SVM-based ensemble: obtains the results from 50 SVM-based classifiers ($C_0^{SVM}, C_1^{SVM}, \ldots, C_{49}^{SVM}$) and combines them through a voting mechanism, where the class that receives the most votes is chosen by the ensemble classifier.

• GBM-based ensemble: obtains the results from 50 GBM-based classifiers ($C_0^{GBM}, C_1^{GBM}, \ldots, C_{49}^{GBM}$) and combines them through the same voting mechanism as above.

• KNN V50-based ensemble: obtains the results from 50 KNN-based ensemble classifiers ($C_0^{KNN}, C_1^{KNN}, \ldots, C_{49}^{KNN}$) and combines them through the same voting mechanism.

• KNN V500-based ensemble: obtains the results from 500 KNN-based classifiers (50 different values of λ multiplied by 10 different values of K; $C_{0,1}^{KNN}, C_{0,2}^{KNN}, \ldots, C_{49,10}^{KNN}$) and combines them through the same voting mechanism.

• OET-KNN V50-based ensemble: obtains the results from 50 OET-KNN-based ensemble classifiers ($C_0^{OET\text{-}KNN}, C_1^{OET\text{-}KNN}, \ldots, C_{49}^{OET\text{-}KNN}$) and combines them through the same voting mechanism.

• OET-KNN V500-based ensemble: obtains the results from 500 OET-KNN-based classifiers (50 different values of λ multiplied by 10 different values of K; $C_{0,1}^{OET\text{-}KNN}, C_{0,2}^{OET\text{-}KNN}, \ldots, C_{49,10}^{OET\text{-}KNN}$) and combines them through the same voting mechanism; this is the MemType-2L approach [CS07].

2.5.2 Selective voting

For each ensemble in all voting, rather than fusing the predictions from all of the individual predictors, here, the optimal subset of predictions (i.e., the output of the constituent classifiers) is selected so that they have minimal redundancy and maximal relevance with the target class. To accomplish this task, we first ranked the features using the minimum redundancy maximum relevance (mRMR) algorithm [PLD05], as implemented by the R mRMRe library (version 2.1.0), and then utilized incremental feature selection

[HSW+10] to choose the optimal subset.

To quantify both the relevance and redundancy, mRMRe uses a linear approximation

based on correlation, such that the mutual information (MI) between two variables $c_i$ and $c_j$ is estimated as:

$$MI(c_i, c_j) = -\frac{1}{2} \ln\left(1 - \rho(c_i, c_j)^2\right) \qquad (18)$$

where ρ is the Cramer's V coefficient between $c_i$ and $c_j$.

Let $y$ be the target class and $X = (c_1, c_2, \ldots, c_n)$ be the set of $n$ input features, i.e., the outputs of the constituent classifiers in all voting. The mRMR method ranks the features in X by maximizing the MI with y (maximum relevance) and minimizing the average MI with all of the previously selected variables (minimum redundancy). A list of selected features, denoted by S, is initialized with $c_i$, the feature with the highest MI with the target variable, such that:

$$c_i = \underset{c_i \in X}{\arg\max}\, MI(c_i, y) \qquad (19)$$

Next, another feature, cj, is added to S by choosing the feature that has the highest relevance

with the output variable and the lowest redundancy with the previously selected features,

utilizing the mutual information difference (MID) scheme:

$$c_j = \underset{c_j \in \Omega_S}{\max} \left[ MI(c_j, y) - \frac{1}{|S|} \sum_{c_i \in S} MI(c_j, c_i) \right] \qquad (20)$$

$\Omega_S$ denotes the set of features that have not yet been added to S. This is continued until all of the features in X are added to S:

$$S = (c'_1, c'_2, \ldots, c'_n) \qquad (21)$$

where $c'_i$ denotes the feature with the $i$-th rank. Next, we utilized incremental feature selection [HSW+10] to choose the optimal subset. Incremental feature selection constructs n sets by adding one component at a time in ascending order, with the $i$-th set given as:

$$s_i = \{c'_1, c'_2, \ldots, c'_i\} \qquad (1 \le i \le n) \qquad (22)$$

The set with the highest accuracy is then selected for selective voting.
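The ranking and selection loop can be sketched generically as below. The mi and accuracy callables are assumptions: mi(a, b) returns the estimated MI of Equation 18 (with b = "y" denoting the target class), and accuracy(subset) evaluates an ensemble restricted to that subset; neither mirrors the actual mRMRe API.

    def mrmr_rank(features, mi):
        # Rank features by the MID scheme (Equations 19 and 20): start with the
        # feature most informative about the target, then repeatedly add the
        # feature with maximal relevance minus mean redundancy.
        ranked = [max(features, key=lambda c: mi(c, "y"))]
        remaining = [c for c in features if c != ranked[0]]
        while remaining:
            best = max(remaining,
                       key=lambda c: mi(c, "y")
                       - sum(mi(c, s) for s in ranked) / len(ranked))
            ranked.append(best)
            remaining.remove(best)
        return ranked

    def incremental_selection(ranked, accuracy):
        # Incremental feature selection (Equation 22): evaluate the top-i
        # features for i = 1..n and keep the subset with the highest accuracy.
        subsets = [ranked[:i] for i in range(1, len(ranked) + 1)]
        return max(subsets, key=accuracy)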

2.6 Performance measurement

The performances of the different prediction models were evaluated by leave-one-out cross-validation (LOOCV), in which each sample in the training dataset is predicted based on the rules derived from all of the other samples except the one being predicted; this procedure is repeated so that each sample is used once for validation.

The LOOCV approach was also applied to evaluate the state-of-the-art all-type membrane predictors iMem-2LSAAC [AHJ18] and MemType-2L [CS07]. LOOCV was chosen here because it goes through all of the samples in the training set, and its results do not vary across different runs.

Furthermore, we evaluated the performance of the model that achieved the highest performance during LOOCV using an independent testing set and compared it to those achieved by the models built with the state-of-the-art methods. Four main evaluation metrics were considered: the sensitivity, specificity, accuracy, and MCC. The sensitivity indicates the proportion of positive samples that are correctly identified.

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (23)$$

The specificity measures the proportion of negative samples that are correctly identified.

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (24)$$

The accuracy is the number of correct predictions divided by the total number of predictions.

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP} \qquad (25)$$

The MCC measures the quality of a binary classifier and returns a value in the range from -1 to 1, where 1 indicates a perfect prediction, 0 represents a prediction no better than random, and -1 implies total disagreement between the prediction and the observation.

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}} \qquad (26)$$
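For completeness, Equations 23-26 computed from the confusion matrix counts; a minimal sketch of our own.

    import math

    def confusion_metrics(tp, tn, fp, fn):
        # Sensitivity, specificity, accuracy, and MCC (Equations 23-26).
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        acc = (tp + tn) / (tp + fn + tn + fp)
        denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        mcc = (tp * tn - fp * fn) / denom if denom else 0.0
        return sens, spec, acc, mcc

    print(confusion_metrics(tp=80, tn=90, fp=10, fn=20))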

3 Experimental design

The first experiment encodes protein sequences using different methods and uses the generated vectors as input to train different models based on the KNN, OET-KNN, SVM and GBM algorithms; the performances of different models are evaluated on the training set using LOOCV. The second experiment evaluates the two ensemble approaches, all voting and selective voting, and compares their performances. The third experiment evaluates the performances of the HMMTOP [TS01], TMHMM [KLvHS01], TOPCONS2 [TPS+15]

and PRED-TMBB2 [TEB16] topology prediction tools for detecting all membrane types.

Finally, the last experiment integrates the prediction achieved by the best-performing

topology prediction tool with the best-performing ensemble in the second experiment; we

refer to this integrative approach as TooT-M.

In all the abovementioned experiments, only the training set is used to choose the best model/tool. The best-performing method in all of the experiments is chosen as our membrane predictor, and ultimately, its performance is tested on the independent testing set and compared to that achieved by the state-of-the-art methods.

4 Results and discussion

4.1 Evaluation of different protein encodings

The LOOCV performances of the baseline encodings AAC, PAAC, and PseAAC, in addition to SAAC, which is utilized by iMem-2LSAAC [AHJ18], and the Pse-PSSM utilized by MemType-2L [CS07] on different machine learning algorithms are illustrated in Table 2.

Only the Pse-PSSMs where λ ∈ (0,...,5) are presented here; the rest have comparable performances and can be found in Appendix A.

The encoding extraction techniques can be divided into two primary groups: techniques that extract features solely from a protein sequence, such as AAC, PAAC, PseAAC, and SAAC, and the Pse-PSSM technique that incorporates evolutionary information.

Among those techniques that extract features from the protein sequence alone, PseAAC in combination with GBM achieved the highest performance, with an overall validation accuracy of 80.60%, followed by PAAC and SVM, for which the overall accuracy reached

80.28%. The SAAC encoding method used by iMem-2LSAAC [AHJ18] was not superior to the other feature extractors, and it reached its highest overall accuracy (80.00%) with the

GBM model.

The encoding technique that integrates evolutionary information in the form of

Pse-PSSMs for all λ ∈ (0,...,49) consistently achieved higher accuracy by an average of

11% relative to the methods that rely solely on the protein sequences of individual samples.

The highest accuracy reached 89.70% and was achieved by OET-KNN, where the protein samples were encoded using Pse-PSSM λ = 0. However, when the protein samples were encoded using Pse-PSSM λ ∈ (1,...,49), the SVM-based models outperformed the models based on the OET-KNN, KNN, and GBM algorithms.

Table 2: LOOCV performance of the individual models

Encoding           ML Algorithm   Sensitivity   Specificity   Accuracy   MCC
AAC                OET-KNN        71.34         81.08         76.28      0.5271
                   KNN            75.72         74.87         75.29      0.5058
                   SVM            70.96         83.47         77.30      0.5492
                   GBM            71.86         83.75         77.89      0.5606
PseAAC             OET-KNN        73.05         81.38         77.27      0.5465
                   KNN            74.24         79.38         76.84      0.5370
                   SVM            70.59         83.98         77.37      0.5511
                   GBM            74.99         86.07         80.60      0.6149
PAAC               OET-KNN        68.94         72.09         70.53      0.4105
                   KNN            72.96         66.26         69.57      0.3930
                   SVM            76.15         84.22         80.24      0.6060
                   GBM            71.33         85.01         77.84      0.5661
SAAC               OET-KNN        66.63         72.88         69.80      0.3960
                   KNN            69.75         68.81         69.28      0.3856
                   SVM            72.51         85.85         79.27      0.5895
                   GBM            73.90         85.95         80.00      0.6034
Pse-PSSM, λ = 0    OET-KNN        86.57         92.75         89.70      0.7953
                   KNN            85.22         90.44         87.86      0.7580
                   SVM            83.23         90.05         86.68      0.7350
                   GBM            83.41         90.45         86.98      0.7409
Pse-PSSM, λ = 1    OET-KNN        85.92         91.79         88.89      0.7788
                   KNN            85.89         89.06         87.50      0.7501
                   SVM            86.75         92.22         89.52      0.7912
                   GBM            85.00         92.19         88.64      0.7744
Pse-PSSM, λ = 2    OET-KNN        85.51         91.90         88.75      0.7762
                   KNN            85.65         88.28         86.98      0.7397
                   SVM            86.83         92.06         89.48      0.7904
                   GBM            84.86         91.72         88.34      0.7682
Pse-PSSM, λ = 3    OET-KNN        84.69         91.44         88.11      0.7636
                   KNN            84.97         87.92         86.47      0.7295
                   SVM            86.97         91.61         89.32      0.7871
                   GBM            85.01         91.74         88.42      0.7697
Pse-PSSM, λ = 4    OET-KNN        85.46         91.37         88.45      0.7701
                   KNN            85.44         88.41         86.95      0.7390
                   SVM            86.87         91.85         89.39      0.7886
                   GBM            85.71         92.41         89.11      0.7835
Pse-PSSM, λ = 5    OET-KNN        85.17         91.32         88.29      0.7668
                   KNN            85.60         88.07         86.85      0.7371
                   SVM            86.82         92.26         89.58      0.7925
                   GBM            85.10         92.08         88.63      0.7742

This table shows the microaveraged LOOCV performance of the different protein encodings with different machine learning algorithms. SAAC with SVM reflects the LOOCV performance of the iMem-2LSAAC method [AHJ18] on DS-M. Only the Pse-PSSMs where λ ∈ (0,...,5) are shown here; the complete performance of all the Pse-PSSMs (λ ∈ (0,...,49)) can be found in Appendix A.

The performance of the first ensemble approach, all voting, on the training dataset is presented in Table 3. Since the data are balanced, we focused on the accuracy when comparing the performance of the different models. Among the six ensembles in all voting, the SVM-based ensemble achieved the highest accuracy of 90.15%. The OET-KNN V500 ensemble, which reflects the performance of MemType-2L [CS07] on DS-M, achieved the second highest accuracy of 89.86%.

Table 3: Performances of the all voting ensemble classifiers on the training dataset

Algorithm       Sensitivity   Specificity   Accuracy   MCC
OET-KNN V500    85.10         94.51         89.86      0.8004
OET-KNN V50     85.61         93.57         89.64      0.7950
KNN V500        85.50         91.77         88.68      0.7747
KNN V50         86.19         90.40         88.32      0.7669
SVM             86.48         93.72         90.15      0.8047
GBM             84.52         93.32         88.98      0.7820

All voting with OET-KNN V500 reflects the LOOCV performance of the MemType-2L method on DS-M.

To choose the optimal feature set for selective voting, we tested the mRMR top-ranked c (1≤c≤50) features incrementally by adding one feature at a time to the OET-KNN V50,

KNN V50, SVM, and GBM models, and the top-ranked c (1≤c≤500) features on the

OET-KNN V500 and KNN V500 models. The optimal feature set is the one with the

highest accuracy. As observed from Figure 14, the accuracy peaked when the numbers of top-ranked components were 3, 5, 15, and 11 for the OET-KNN V50-, KNN V50-, SVM-, and GBM-based ensembles, respectively. In addition, the optimal numbers of features for the OET-KNN V500 and KNN V500 ensembles were 20 and 21, respectively, as shown in

Figure 15; the performance started to deteriorate as more votes were counted. The detailed

performances of the optimal feature set are presented in Table 4.

The results show that the ensemble models outperform their constituent classifiers, and

the selective voting ensemble approach outperforms the all voting approach. Generally,

the ensemble works best when the individual classifiers comprising the ensemble are both

accurate and have low correlation [OS96] [KV95]. The superiority of selective voting over all

voting is due to the mRMR method's ability to choose the models that have low correlations among each other and high correlation with the target class (i.e., the most accurate models) and to the incremental feature selection's ability to select the optimal set that reduces the noise and increases the ensemble classifier's distinctive power. An interesting observation to note here is that while the individual SVM and GBM classifiers generally provided higher performances than those of the OET-KNN and KNN classifiers, the latter benefited more from the selective voting ensemble. This suggests that the predictions from the OET-KNN and KNN classifiers are less consistent (i.e., they make errors in different parts of the input space) and are therefore better candidates for the ensemble than the SVM and GBM classifiers.

The best performance among all methods was achieved by selective voting with the OET-KNN V500 ensemble, where the overall accuracy reached 91.53%, which is 1.67% higher than that achieved by the MemType-2L method (OET-KNN V500 with all voting).

Because it achieved the best performance, the selective voting approach with the OET-KNN

V500 method is utilized in the integrative approach TooT-M.

Table 4: Performances of the selective voting ensemble classifiers on the training dataset

Algorithm       Sensitivity   Specificity   Accuracy   MCC
OET-KNN V500    88.99         94.00         91.53      0.8314
OET-KNN V50     86.58         94.43         90.56      0.8133
KNN V500        89.01         93.63         91.35      0.8280
KNN V50         86.55         91.92         89.27      0.7863
SVM             87.12         93.72         90.46      0.8107
GBM             85.30         93.45         89.44      0.7909

Selective voting with OET-KNN V500 achieved the highest MCC and is the method utilized in TooT-M.


Figure 14: Choice of the optimal constituent classifiers among 50 classifiers. In the pair (x, y), x refers to the number of top-ranked components in the optimal feature set, and y refers to the accuracy achieved using those x components. The accuracy peaked when the number of top-ranked components was 3, 5, 15, and 11 for the OET-KNN V50-, KNN V50-, SVM-, and GBM-based ensembles, respectively.

Figure 15: Choice of the optimal constituent classifiers among 500 classifiers. In the pair (x, y), x refers to the number of top-ranked components in the optimal feature set, and y refers to the accuracy achieved using those x components. The optimal numbers of features for the OET-KNN V500 and KNN V500 ensembles were 20 and 21, respectively. The performance started to deteriorate as more votes were counted. Overall, the results suggest that the selective voting approach outperforms the all voting approach.

4.3 Evaluation of transmembrane topology prediction tools

The performances of HMMTOP [TS01], TMHMM [KLvHS01], TOPCONS2 [TPS+15], and PRED-TMBB2 [TEB16] on the DS-M training set are shown in Table 5. Based on our analysis in Section 2.1, we expected the topology prediction tools to fail to predict at least 20% of the membrane proteins because those proteins do not contain TMSs; the results reported here confirm this hypothesis. The transmembrane topology tools reached a maximum sensitivity of 72%. This finding confirms that transmembrane topology tools disregard surface-bound proteins, and thus fail to recognize more than 20% of membrane proteins, and it further highlights the importance of building a model that predicts all membrane types. Nevertheless, a very attractive aspect here is the exceptionally high specificity (true negative rate) of TOPCONS2, which is due to its ability to distinguish signal peptides from transmembrane regions [TGB+18]. This property means that the confidence in its positive predictions is high; thus, this aspect is exploited in TooT-M.

Table 5: Transmembrane topology prediction performance on the training dataset

Tool         Sensitivity  Specificity  Accuracy  MCC
HMMTOP       72.71        84.60        78.73      0.5777
TOPCONS2     69.86        99.77        85.01      0.7318
TMHMM        68.61        97.14        83.06      0.6878
PRED-TMBB2   41.73        55.48        48.70     -0.0281

TOPCONS2, highlighted in bold, is the tool that achieved the highest MCC and is the method utilized in TooT-M.

4.4 Performance of TooT-M

The integrative approach TooT-M combines the best models from both the transmembrane topology tools (TOPCONS2) and the all-type membrane predictors (selective voting OET-KNN V500) through weighted voting. In weighted voting, a positive vote from TOPCONS2 is trusted and multiplied by the number of constituent classifiers in the selective voting OET-KNN V500 ensemble minus one; that is, the OET-KNN V500 selective voting prediction is transformed to positive if and only if at least one constituent classifier agrees with the positive prediction of TOPCONS2. Among all the tested weights, this approach helped enhance the sensitivity without negatively impacting the specificity.
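The rule can be stated compactly. The R sketch below mirrors the description above; the function and argument names are hypothetical, and falling back to a simple member majority when TOPCONS2 votes negative is our reading of the text rather than a documented detail.

toot_m_vote <- function(member_votes, topcons2) {
  # member_votes: 0/1 votes of the selective voting OET-KNN V500 members
  # topcons2: TOPCONS2's 0/1 prediction, whose positive vote carries weight k - 1
  if (topcons2 == 1 && sum(member_votes) >= 1) {
    return(1L)  # trusted positive: TOPCONS2 plus at least one agreeing member
  }
  as.integer(mean(member_votes) >= 0.5)  # otherwise fall back to the ensemble majority
}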

Table 6 shows the LOOCV performance of TooT-M. Compared to the selective voting OET-KNN V500 ensemble, the sensitivity (true-positive rate) was enhanced by 2.46% and the specificity by 1.27%. Overall, the accuracy increased by approximately 2%, and the MCC was boosted by approximately 4%.

Table 6: TooT-M LOOCV performance

Method                          Sensitivity  Specificity  Accuracy  MCC
Selective voting OET-KNN V500   89.01        93.63        91.35     0.8280
TOPCONS2                        69.86        99.77        85.01     0.7318
TooT-M                          91.47        94.90        93.21     0.8645

This table shows the LOOCV performance of TooT-M, which integrates the predictions from the constituent classifiers of the selective voting OET-KNN V500 ensemble and TOPCONS2 through weighted voting.

4.5 Comparison with the state-of-the-art methods

Here we compare the performance of TooT-M to the state-of-the-art methods in three settings:

1. All the methods are trained on the DS-M training set, and their performances are evaluated on the DS-M testing set.

2. The TooT-M method is trained on the dataset obtained by the iMem-2LSAAC authors (DS1), and its performance is compared with the reported performance of iMem-2LSAAC [AHJ18] on the same dataset.

3. The TooT-M method is trained on the dataset provided by Chou and Shen [CS07] (DS2), and its performance is compared to the reported performance of MemType-2L [CS07] on the same dataset.

As illustrated in Figure 16 and indicated in Table 7, the integrative approach outperformed all of the other methods in terms of sensitivity, specificity, accuracy, and MCC. Similarly, as shown in Table 8, it outperformed iMem-2LSAAC [AHJ18] in terms of specificity, accuracy, and MCC, while still keeping the sensitivity credible. It also outperformed MemType-2L [CS07] in terms of sensitivity, accuracy, and MCC, while achieving a similar specificity, as shown in Table 9.

Figure 16: Comparison with other state-of-the-art methods on the DS-M dataset

Table 7: Comparison with other state-of-the-art methods on the DS-M dataset

Method                Sensitivity  Specificity  Accuracy  MCC
TooT-M                92.41        92.5         92.46     0.85
MemType-2L [CS07]     88.67        90.19        89.44     0.79
iMem-2LSAAC [AHJ18]   74.52        83.9         79.27     0.59

This table compares the performance of the integrative approach with other state-of-the-art methods on the DS-M dataset. The highest performance in each metric is highlighted in bold. TooT-M outperformed the state-of-the-art methods across all metrics.

Table 8: Comparison with the iMem-2LSAAC predictor on the DS1 dataset

Method        Sensitivity  Specificity  Accuracy  MCC
TooT-M        98.09        96.80        97.43     0.94
iMem-2LSAAC   98.23        91.17        94.61     0.89

This table compares the performance of TooT-M with the state-of-the-art iMem-2LSAAC predictor [AHJ18] on the same dataset, DS1. The best performance for each metric is highlighted in bold. TooT-M achieved a higher specificity, accuracy, and MCC than iMem-2LSAAC.

Table 9: Comparison with the MemType-2L predictor on the DS2 dataset

Method       Sensitivity  Specificity  Accuracy  MCC
TooT-M       92.71        94.4         93.57     0.87
MemType-2L   91.00        94.4         92.7      0.85

This table compares the performance of TooT-M with the state-of-the-art MemType-2L predictor [CS07] on the same dataset, DS2. The best performance for each metric is highlighted in bold. TooT-M achieved a higher sensitivity, accuracy, and MCC than MemType-2L and the same specificity.

5 Conclusion

We curated a new membrane protein benchmark dataset that contains all types of membrane proteins, including surface-bound proteins. We demonstrated the limitation of using only transmembrane topology prediction tools to predict all types of membrane proteins, as they detect only IMPs and miss surface-bound proteins, which account for approximately 20% of membrane protein data. Furthermore, we evaluated the performances of different protein-encoding techniques, including those employed by the state-of-the-art membrane predictors with different machine learning algorithms. The experimental results obtained by cross-validation and independent testing suggest that applying an integrative approach that combines the results of transmembrane topology prediction and Pse-PSSM OET-KNN predictors yields the best performance. TooT-M achieved a 92.46% accuracy in independent testing, compared to the 89.44% and 79.27% accuracies achieved by the state-of-the-art methods MemType-2L [CS07] and iMem-2LSAAC [AHJ18], respectively.

Chapter 5

Transporter protein detection

The main focus of this chapter is the second research objective:

O2: To improve the computational tools for predicting de novo transporters, relying only on the primary protein sequence

Some of the contents of this chapter have been published in BMC Bioinformatics: M. Alballa, G. Butler, TooT-T: Discrimination of transport proteins from non-transport proteins, BMC Bioinformatics, 21.3 (2020): 1-10.

This chapter is organized as follows: Section 1 provides an introduction to the work and highlights the main contributions we make in the chapter. Section 2 gives an overview of the proposed tool (TooT-T). Section 3 lists the material and methods utilized in TooT-T and introduces a new method of encoding a protein sequence, psi-composition, that combines the traditional compositions with evolutionary information obtained from a PSI-BLAST search. Section 4 presents and discusses the results, and Section 4.4 compares the results achieved by TooT-T with those obtained by the state-of-the-art tools. Finally, Section 5 concludes the chapter.

1 Introduction

Transporters control the movement of molecules across the membrane so that essential molecules such as sugars and amino acids enter the cell while waste compounds leave the cell. It is estimated that membrane transport proteins are encoded by 2% to 16% of the ORFs in prokaryotic and eukaryotic genomes, highlighting the importance of transporters in all living species [RP05].

While many membrane protein sequences are known due to the large number of recent genome projects, their structures and functions remain poorly characterized and understood. This deficiency is related to the immense effort necessary to characterize these proteins because of their structural flexibility and instability, which create challenges at many levels, including expression, crystallization, and structure solution. This imbalance between the number of available sequences and the number of experimentally characterized sequences has created many obstacles in the advancement of biology and drug discovery. Therefore, there is a need for advanced computational techniques that can utilize the sequence information alone to distinguish membrane transporter proteins. These novel techniques can then be used to direct new experiments and offer clues about protein function.

The findings from previous studies on transporter prediction can be summarized as follows: SVMs achieved superior performance compared to other machine learning algorithms [LVY+15] [LLX+16] [HPO+19], and PSSM profiles are a highly effective encoding for capturing evolutionary information in protein functional classification [LVY+15] [MCZ14] [HGS+15].

This work focuses on distinguishing membrane transporter proteins from other non-transporter proteins. The main contributions of this work can be summarized as follows:

• We explore the practicality of using traditional homology search techniques to detect transporter proteins.

• We compare the performances of various discriminators/encodings on SVM models and introduce a new encoding, called psi-composition, which shows superior performance compared to all of the other examined encodings.

• We propose a new tool, TooT-T, that employs an ensemble classifier that is trained to optimally combine the predictions obtained from homology annotation transfer and psi-composition-based models to determine the final prediction. The ensemble exploits the low correlation between the predictions obtained by various methods to build a more robust classifier. The proposed model outperforms all of the state-of-the-art methods that rely on the protein sequence alone, with overall accuracies of 97.02% and 97.28% and MCCs of 0.92 and 0.93 in cross-validation and independent testing, respectively.

2 TooT-T overview

TooT-T utilizes an ensemble classifier that combines the results generated by two distinct methods, namely, homology annotation transfer and machine learning, to detect transporter proteins. First, given a query protein Q, a traditional homology search of the TCDB is performed using BLAST. A query is predicted as a transporter if a hit is found using three predetermined sets of thresholds. The three predictions are delivered to the ensemble. Then, three variations of psi-composition encoding (psiAAC, psiPAAC, and psiPseAAC) are computed and input into their respective trained SVM models. Finally, the trained ensemble meta-model predicts the final class as a transporter T or non-transporter NT. Figure 17 delineates an overview of the TooT-T prediction steps. The motivation behind this selection and descriptions of each step are presented in the following sections.

Figure 17: TooT-T overview. When a new query protein is input into TooT-T, the class of the query is predicted by the six base classifiers: three from SVM models based on psiAAC, psiPAAC, and psiPseAAC encodings, and three from annotation transfer by homology utilizing different thresholds (TCDB exact, TCDB high, and TCDB med). The six predictions are then input into the meta-classifier, which outputs the final prediction.

3 Materials and methods

3.1 Dataset

A new benchmark membrane transporter dataset from the Swiss-Prot database (June 2018 release) was collected. The initial dataset was constructed as follows:

Protein sequences that belong to the transporter class were retrieved using the following search query:

locations:(location:membrane) goa:("transporter activity [5215]") AND reviewed:yes

This query searches for proteins that have the GO:0005215 transporter activity GO MF annotation. This GO MF term was chosen here because it is directly related to the actual function of the protein rather than the general process in which it is involved.

Protein sequences that do not belong to the transporter class but are located in the membrane were retrieved as non-transporters using the following search query:

locations:(location:membrane) NOT goa:("transporter activity [5215]") AND reviewed:yes

The initial set was then filtered to attain the best-quality dataset by adhering to the following criteria:

• Step 1: Protein sequences that have evidence “inferred from homology” for the existence of a protein were removed.

• Step 2: Protein sequences that are annotated with multiple functions (e.g., transporters and enzymes) were removed.

• Step 3: Protein sequences that have no GO MF annotation or annotation based only on computational evidence (IEA) were eliminated.

• Step 4: Protein sequences with more than 60% pairwise sequence identity were removed via the CD-HIT [LG06] program to avoid any homology bias (a sketch of this step follows the list).
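As a rough illustration of Step 4, a CD-HIT call applying a 60% cutoff from R could look as follows; the file names are hypothetical, and the exact options used in this work are not recorded here.

system2("cd-hit", args = c(
  "-i", "initial_set.fasta",   # sequences remaining after Steps 1-3
  "-o", "filtered_set.fasta",  # non-redundant output
  "-c", "0.6",                 # 60% pairwise sequence identity cutoff
  "-n", "4"                    # word size suggested for cutoffs around 0.60-0.65
))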

Details about the number of samples attained by each step in the curation process are presented in Figure 18. The final dataset was established after Step 4 and was randomly partitioned (stratified by class) into a training set (90% of the data) and a testing set (10% of the data), as presented in Table 10. Since the testing set was kept aside until the final testing phase and was not used during training or model selection, we refer to it as the independent testing set.

Table 10: Transporter membrane dataset DS-T

Class            Training  Testing  Total
Transporter      2,002     222      2,224
Non-transporter  5,943     661      6,604
Total            7,945     883      8,828

Figure 18: Membrane protein curation process. This figure shows details on the number of samples during each step of the curation process. Step 1: Protein sequences that had evidence for the existence of a protein “inferred from homology” were removed. Step 2: Protein sequences annotated with multiple functions (e.g., as transporters and enzymes) were set aside for further examination. Step 3: Protein sequences with no GO MF annotation or annotation based only on computational evidence (IEA) were eliminated. Step 4: Protein sequences with more than 60% pairwise sequence identity were removed.

3.2 Position-specific iterative encodings

A PSI-BLAST search [AMS+97] (3 iterations, e-value cutoff 0.001) was performed on a sample protein sequence using a modified version of the Swiss-Prot database (June 2018 release) to find homologous sequences. The modified Swiss-Prot database did not include the exact hits of the test sequences. Regions in the database hit sequences that were not aligned with the query protein were discarded. The query protein (Q) and the aligned regions of its hits (h_1, h_2, ..., h_n) were then used to compute the position-specific iterated AAC (psiAAC), PAAC (psiPAAC), and PseAAC (psiPseAAC) as described in the following sections.
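For concreteness, a search of this kind can be launched from R roughly as follows; the database and file names are hypothetical, and the exact PSI-BLAST options used in this work may differ.

system2("psiblast", args = c(
  "-query", "protein.fasta",    # the sample protein sequence Q
  "-db", "swissprot_modified",  # Swiss-Prot with the exact test hits removed
  "-num_iterations", "3",       # three PSI-BLAST iterations
  "-evalue", "0.001",           # e-value cutoff of 0.001
  "-outfmt", "6",               # tabular output for downstream parsing
  "-out", "hits.tsv"
))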

Position-specific iterated amino acid composition (psiAAC)

The AAC of the query protein (Q) and each of its filtered hits (h_1, h_2, ..., h_n) were calculated separately as the fractions of all 20 natural amino acids:

    c_i = \frac{F_i}{L}, \quad i = 1, 2, \ldots, 20    (27)

where F_i is the frequency of the i-th amino acid and L is the length of the sequence. The AAC is represented as a vector of size 20:

    AAC(P_x) = [c_1, c_2, c_3, \ldots, c_{20}], \quad x \in (Q, h_1, h_2, \ldots, h_n)    (28)

where c_i is the composition of the i-th amino acid. The mean of the individual AAC compositions represents the psiAAC of Q and was computed as:

    AAC_{psi}(Q) = \frac{1}{n+1} \sum_{x} AAC(P_x), \quad x \in (Q, h_1, h_2, \ldots, h_n)    (29)
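A minimal R sketch of Equations 27-29 is given below; it assumes seqs is a character vector holding the query sequence followed by the aligned regions of its n hits (a hypothetical representation of the filtered PSI-BLAST output).

AMINO <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

aac <- function(seq) {
  chars <- strsplit(seq, "")[[1]]
  table(factor(chars, levels = AMINO)) / length(chars)  # c_i = F_i / L (Equation 27)
}

psi_aac <- function(seqs) {
  Reduce(`+`, lapply(seqs, aac)) / length(seqs)  # mean over Q and its n hits (Equation 29)
}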

Position-specific iterated pair amino acid composition (psiPAAC)

Similarly, the individual PAAC descriptors for the query protein (Q) and each of its filtered hits (h_1, h_2, ..., h_n) were calculated as follows:

    d_{i,j} = \frac{F_{i,j}}{L-1}, \quad i, j = 1, 2, \ldots, 20    (30)

where F_{i,j} is the frequency of the i-th and j-th amino acids as a pair (dipeptide) and L is the length of the sequence. Similar to the AAC, the PAAC is represented as a vector of size 400 as follows:

    PAAC(P_x) = [d_{1,1}, d_{1,2}, d_{1,3}, \ldots, d_{20,20}], \quad x \in (Q, h_1, h_2, \ldots, h_n)    (31)

where d_{i,j} is the dipeptide composition of the i-th and j-th amino acids. The mean of the individual PAAC compositions represents the psiPAAC of Q and was computed as follows:

    PAAC_{psi}(Q) = \frac{1}{n+1} \sum_{x} PAAC(P_x), \quad x \in (Q, h_1, h_2, \ldots, h_n)    (32)
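The dipeptide variant differs only in the counting step. A sketch under the same hypothetical assumptions as the psiAAC snippet above:

AMINO <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

paac <- function(seq) {
  chars <- strsplit(seq, "")[[1]]
  pairs <- paste0(head(chars, -1), tail(chars, -1))        # overlapping dipeptides
  lv <- as.vector(outer(AMINO, AMINO, paste0))             # all 400 possible pairs
  table(factor(pairs, levels = lv)) / (length(chars) - 1)  # d_ij = F_ij / (L - 1) (Equation 30)
}

psi_paac <- function(seqs) Reduce(`+`, lapply(seqs, paac)) / length(seqs)  # Equation 32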

Position-specific iterated pseudo amino acid composition (psiPseAAC)

The PseAAC is a combination of the 20 components of the conventional AAC and a set of sequence-order correlation factors that incorporate certain biochemical properties, originally proposed by Chou [Cho01]. Given a protein sequence of length L:

    R_1 R_2 R_3 R_4 \ldots R_L    (33)

a set of descriptors called sequence-order-correlated factors are defined as:

    \theta_1 = \frac{1}{L-1} \sum_{i=1}^{L-1} \Theta(R_i, R_{i+1})
    \theta_2 = \frac{1}{L-2} \sum_{i=1}^{L-2} \Theta(R_i, R_{i+2})
    \theta_3 = \frac{1}{L-3} \sum_{i=1}^{L-3} \Theta(R_i, R_{i+3})
    \vdots
    \theta_\lambda = \frac{1}{L-\lambda} \sum_{i=1}^{L-\lambda} \Theta(R_i, R_{i+\lambda})    (34)

The parameter λ is chosen such that λ < L, and the correlation function Θ is given by:

    \Theta(R_i, R_j) = \frac{1}{3} \left\{ [H_1(R_j) - H_1(R_i)]^2 + [H_2(R_j) - H_2(R_i)]^2 + [M(R_j) - M(R_i)]^2 \right\}    (35)

where H_1(R) is the hydrophobicity value, H_2(R) is the hydrophilicity value, and M(R) is the side-chain mass of the amino acid R. These quantities were converted from the original hydrophobicity, original hydrophilicity, and original side-chain mass values by standard conversion as follows:

    H_1(R_i) = \frac{H_1^{\circ}(R_i) - \frac{1}{20} \sum_{k=1}^{20} H_1^{\circ}(R_k)}{\sqrt{\frac{1}{20} \sum_{y=1}^{20} \left[ H_1^{\circ}(R_y) - \frac{1}{20} \sum_{k=1}^{20} H_1^{\circ}(R_k) \right]^2}}    (36)

where H_1^{\circ}(R_i) is the original hydrophobicity value for amino acid R_i, taken from Tanford [Tan62]; H_2^{\circ}(R_i) and M^{\circ}(R_i) are converted to H_2(R_i) and M(R_i) in the same way. The original hydrophilicity value H_2^{\circ}(R_i) for amino acid R_i was taken from the work of Hopp and Woods [HW81]. The mass M^{\circ}(R_i) of the side chain of amino acid R_i can be obtained from any biochemistry textbook. PseAAC is represented as a vector of size (20 + λ) as follows:

    PseAAC(P_x) = [s_1, \ldots, s_{20}, s_{21}, \ldots, s_{20+\lambda}], \quad x \in (Q, h_1, h_2, \ldots, h_n)    (37)

where s_i is the pseudo-AAC such that

    s_i = \begin{cases} \dfrac{f_i}{\sum_{r=1}^{20} f_r + \omega \sum_{j=1}^{\lambda} \theta_j} & 1 \le i \le 20 \\ \dfrac{\omega \theta_{i-20}}{\sum_{r=1}^{20} f_r + \omega \sum_{j=1}^{\lambda} \theta_j} & 20 < i \le 20 + \lambda \end{cases}    (38)

where f_i is the normalized occurrence frequency of the i-th amino acid in the protein sequence, θ_j is the j-th sequence-order-correlated factor calculated from Equation 34, and ω is a weight factor for the sequence-order effect. The weight factor ω weights the additional PseAAC components with respect to the conventional AAC components. The user can select any value from 0.05 to 0.7 for the weight factor; the default value given by Chou [Cho01] is 0.05. The mean of the individual PseAAC compositions represents the psiPseAAC of Q and was computed as follows:

    PseAAC_{psi}(Q) = \frac{1}{n+1} \sum_{x} PseAAC(P_x), \quad x \in (Q, h_1, h_2, \ldots, h_n)    (39)
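The sequence-order-correlated factors of Equation 34 can be computed with a short helper; the sketch below assumes H1, H2, and M are named vectors holding the normalized property values of Equation 36 (all names hypothetical).

theta_factors <- function(chars, lambda, H1, H2, M) {
  L <- length(chars)
  Theta <- function(a, b)  # Equation 35
    ((H1[b] - H1[a])^2 + (H2[b] - H2[a])^2 + (M[b] - M[a])^2) / 3
  sapply(seq_len(lambda), function(k)
    mean(Theta(chars[1:(L - k)], chars[(1 + k):L])))  # theta_k for k = 1..lambda
}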

3.3 SVM

SVMs are powerful machine learning tools used in many biological prediction tools, such as in [MCZ14] and [HPO+19]. We used an SVM with an RBF kernel as implemented in R with the e1071 library (version 1.6-8). The best combination of the C and γ parameters was determined utilizing a grid search approach.
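A sketch of this step with e1071 is shown below; x is the psi-composition feature matrix and y the factor of class labels (hypothetical names), and the candidate grids are illustrative rather than the ones actually searched in this work.

library(e1071)
tuned <- tune.svm(x, y,
                  kernel = "radial",
                  gamma  = 2^(-10:-2),  # candidate RBF widths (assumed grid)
                  cost   = 2^(0:8))     # candidate C values (assumed grid)
model <- tuned$best.model  # SVM refit with the best (C, gamma) pair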

3.4 Annotation transfer by homology

Unlike the discrete representation of a protein sample in the psi-compositions, here, the protein sample was represented by its amino acid sequence and used in a similarity search-based tool (BLAST) to find similar matches in the TCDB [SJRT+15]. The TCDB uses the classification system approved by the IUBMB for membrane transport proteins, known as the TC system. The TCDB is a curated database of accurate and experimentally characterized transporters from over 10,000 published references. If the BLAST search produced a hit, the query was predicted to be a transporter. Since the applied thresholds play an essential role in the quality of prediction, different thresholds were utilized, as shown in Table 11.

Table 11: Different BLAST thresholds on the TCDB

Name         BLAST threshold                                     Motivation
TCDB exact   e-value = 0; percent identity = 100%                exact match
TCDB high    e-value ≤ 1e-20; percent identity ≥ 40%;            thresholds recommended by Butler et al. [AB17]
             query coverage ≥ 70%; subject coverage ≥ 70%;       for TCDB and BLAST
             difference in length of ≤ 10%
TCDB med     e-value ≤ 1e-8                                      threshold recommended by Barghash et al. [BH13]
                                                                 as an acceptable normalized BLAST threshold
                                                                 when dealing with a TC system
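As an illustration, the TCDB high criteria could be applied to parsed tabular BLAST output roughly as follows; the column names assume something like -outfmt "6 std qlen slen" output and are not taken from this work, and the normalization of the length difference is an assumption.

is_transporter_high <- function(hits) {
  qcov <- (hits$qend - hits$qstart + 1) / hits$qlen * 100  # query coverage (%)
  scov <- (hits$send - hits$sstart + 1) / hits$slen * 100  # subject coverage (%)
  len_diff <- abs(hits$qlen - hits$slen) / pmax(hits$qlen, hits$slen) * 100
  any(hits$evalue <= 1e-20 & hits$pident >= 40 &
      qcov >= 70 & scov >= 70 & len_diff <= 10)  # TCDB high thresholds of Table 11
}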

3.5 TooT-T methodology

TooT-T solves the binary classification problem in which a membrane protein is classified as a transporter or non-transporter. Three SVM classifiers were trained with the protein psiAAC, psiPAAC, and psiPseAAC encodings. The predictions from the three SVM models, in addition to the predictions from homology annotation transfer with the three thresholds (TCDB exact, TCDB high, and TCDB med), are combined using an ensemble technique known as stacked generalization or stacking [Wol92]. Instead of combining the predictions from multiple predictors using a simple function (such as voting), stacking trains a new model to aggregate the results.

The stacking framework involves two levels of learning. The first level contains base classifiers, which learn directly from the training data. The second level contains a meta-classifier, which is trained using the predictions from the base classifiers. The training instances of the meta-classifier were generated while performing cross-validation. Algorithm 1 illustrates how the training dataset of the meta-classifier is generated [Agg14].

Algorithm 1 Stacking with K-fold cross-validation

Require: Training data D = {(x_i, y_i)}, x_i ∈ R^n, y_i ∈ {Transporter, Non-transporter}
Ensure: An ensemble classifier H

Step 1: Adopt the cross-validation approach in preparing a training set for the meta-classifier
  Randomly split D into K equal-sized subsets: D = {D_1, D_2, ..., D_K}
  for k ← 1 to K do
    Step 1.1: Learn the base classifiers
    for t ← 1 to T do
      Learn a classifier h_{kt} from D \ D_k
    end for
    Step 1.2: Construct a training set for the meta-classifier
    for x_i ∈ D_k do
      Obtain {x'_i, y_i}, where x'_i = {h_{k1}(x_i), h_{k2}(x_i), ..., h_{kT}(x_i)}
    end for
  end for
Step 2: Learn the meta-classifier
  Learn a new classifier h' from the collection {x'_i, y_i}
Step 3: Relearn the base classifiers using all the data
  for t ← 1 to T do
    Learn a classifier h_t based on D
  end for
  return H(x) = h'(h_1(x), h_2(x), ..., h_T(x))

A GBM, as implemented by the caret package in R, was utilized to develop the meta-classifier.
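A minimal sketch of that final step, assuming meta_train is the data frame of out-of-fold base-classifier predictions produced by Algorithm 1 together with the true class column (names hypothetical):

library(caret)
ctrl <- trainControl(method = "cv", number = 10)
meta <- train(class ~ ., data = meta_train,
              method = "gbm",  # gradient boosting machine meta-classifier
              trControl = ctrl,
              verbose = FALSE)
# predict(meta, newdata = base_predictions) yields the final T/NT call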

3.6 Performance evaluation

The performances of the different models were evaluated on the training dataset using 10-fold cross-validation (10-CV), in which the training dataset was randomly split into ten equally sized sets. A single set was retained as the validation data, and the remaining nine sets were used to train each model. The trained model was then tested using the validation set. The cross-validation process was repeated ten times, where each set was used once as the validation data. The performance evaluation of the 10-CV approach was calculated globally by counting the total true positives, true negatives, false negatives, and false positives in all 10 runs (the microaverage).

The cross-validation performance can vary with different random splits; to obtain a more stable error estimation, the 10-CV approach was repeated ten times, with different random splits. The average performance of the 10 runs was calculated, and the variations between the performances were captured by computing the standard deviation (SD). It has been reported that the repeated version stabilizes the error estimation and therefore reduces the variance in the K-CV estimator [Koh95]. Throughout the rest of this chapter, the cross-validation performance is reported as the means ± SDs for the ten different runs of the 10-CV approach.

Furthermore, the independent testing set was also used to perform a thorough evaluation experiment. The data in the independent testing set were not used during the training process and are completely unknown to our models. Four main evaluation metrics were used to evaluate the performance: the sensitivity, specificity, accuracy, and MCC. The sensitivity calculates the proportion of positive samples (transporters) that are correctly identified:

    Sensitivity = \frac{TP}{TP + FN}    (40)

The specificity calculates the proportion of non-transporters that are correctly identified:

    Specificity = \frac{TN}{TN + FP}    (41)

The accuracy calculates the number of correct predictions divided by the total number of predictions:

    Accuracy = \frac{TP + TN}{TP + FN + TN + FP}    (42)

The MCC is less influenced by imbalanced tests because it takes into account true and false positives and negatives. The MCC values range from 1 to −1, where 1 indicates a perfect prediction, 0 represents no better than random, and −1 implies total disagreement between the prediction and observation. Higher MCC values mean that the predictor has high accuracy with positive and negative classes, as well as low misclassification with the two classes. The MCC is considered the best singular assessment metric when the data are imbalanced [Din11] [WP03] [BDA13].

    MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}    (43)
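These four metrics can be computed from the raw confusion counts with a small helper; for the micro-averaged 10-CV estimate described above, the TP/TN/FP/FN counts are summed over all folds before this function is applied.

metrics <- function(tp, tn, fp, fn) {
  c(sensitivity = tp / (tp + fn),                   # Equation 40
    specificity = tn / (tn + fp),                   # Equation 41
    accuracy    = (tp + tn) / (tp + tn + fp + fn),  # Equation 42
    mcc         = (tp * tn - fp * fn) /
                  sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))  # Equation 43
}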

4 Results and discussion

4.1 Performance of the different encodings

The goal is to find the most discriminative encoding to represent a protein sequence; Table 12 presents the cross-validation performance of the SVM models with various encodings. The examined encodings are the baseline compositions where no evolutionary information is incorporated (AAC, PAAC, PseAAC), the PSSM that is commonly used to encode the evolutionary information (implemented as described in [MCZ14] using the same psi-composition thresholds (3 iterations, e-value cutoff of 0.001)), compositions computed from the sequences retrieved from the BLAST search (blastAAC, blastPAAC, blastPseAAC) (e-value cutoff 0.001), and the proposed encodings (psiAAC, psiPAAC, psiPseAAC). Since the training data of the transporter classifier are imbalanced, we focused on the MCC to evaluate the performances of the different models.

The baseline compositions did not show great variations in performance and had an average MCC of 0.65. The MCC was further boosted when the evolutionary information was incorporated. While the PSSM is most commonly applied in the literature to encode evolutionary information, the results suggest that the other encodings that combine AAC with evolutionary information yield a higher accuracy. Since the PSSM encoding is extracted from the PSI-BLAST output, we expected it to show an improved performance compared to at least the BLAST compositions, but this phenomenon was not what was illustrated by our results. One explanation for this finding could be that the commonly used PSSM encoding is computed from the original PSSM profile output to make it fixed at size 20 × 20, and this PSSM encoding, although superior to the baseline, does not capture properties to the extent shown by the amino acid compositions. Among all of the tested encodings, psiPAAC achieved the highest MCC of approximately 0.90. The use of the psiPAAC encoding also achieved promising results in the detection of other membrane functional classes, such as enzymes and receptors. The experimental details on the psi-composition performance in the other functional classes are available in Appendix B.

The high performance achieved by the psi-composition encodings is a result of incorporating two distinctive approaches, namely, the amino acid composition and evolutionary information. The idea is that multiple homologous sequences can reveal more about the function of a protein than a single sequence. Homologous sequences can be inferred when they share more similarity than would be expected by chance [Pea13]. Similarity tools such as BLAST help to minimize the number of false positives (non-homologs with significant scores; type I errors) but do not necessarily detect remote homologs (homologs with non-significant scores; type II errors) [Pea13]. PSI-BLAST is more sensitive than BLAST in terms of finding such remote homologs and is thus utilized by the proposed encodings. Furthermore, the alignment results of PSI-BLAST contain valuable information about the most conserved regions in the protein, and such conservation can reflect the function of the protein. Computing the average amino acid composition from the aligned homologous sequences thus provides a better indication of the function with less noise than computing the composition from a single sequence.

The impact of incorporating different sources of evolutionary information is presented in Table 13. The compositions computed from a single BLAST search had an average improvement over baseline of 37.83%. The psi-composition further enhanced the baseline MCC by 39.94%. The improved performance of the psi-compositions over the BLAST compositions was expected because, unlike BLAST, which uses only a general scoring matrix, PSI-BLAST uses a PSSM to detect sequences with a conservation pattern similar to the PSSM, thus making PSI-BLAST more sensitive to weak but biologically significant sequence relationships than BLAST [AMS+97].

Table 12: Transporter detection performances of the different models

Encoding      Sensitivity     Specificity     Accuracy        MCC
psiPAAC*      88.45 ± 0.17    98.77 ± 0.05    96.17 ± 0.04    0.8970 ± 0.0012
psiAAC*       87.83 ± 0.23    98.66 ± 0.06    95.93 ± 0.07    0.8905 ± 0.0019
blastPAAC     87.50 ± 0.22    98.60 ± 0.06    95.80 ± 0.05    0.8869 ± 0.0015
blastPseAAC   87.31 ± 0.32    98.53 ± 0.07    95.70 ± 0.11    0.8800 ± 0.0030
psiPseAAC*    87.10 ± 0.25    98.28 ± 0.08    95.47 ± 0.08    0.8800 ± 0.0021
blastAAC      85.21 ± 0.24    98.12 ± 0.08    94.87 ± 0.08    0.8613 ± 0.0023
PSSM          80.16 ± 0.23    97.17 ± 0.10    92.88 ± 0.10    0.8063 ± 0.0027
PAAC          65.01 ± 0.21    95.81 ± 0.10    88.05 ± 0.08    0.6662 ± 0.0024
AAC           59.78 ± 0.37    95.67 ± 0.15    86.62 ± 0.14    0.6225 ± 0.0041
PseAAC        59.91 ± 0.27    95.40 ± 0.09    86.45 ± 0.12    0.6179 ± 0.0034

This table shows the means ± SDs of the ten different 10-CV runs, in descending order of the MCC. The asterisk (*) refers to the models used in TooT-T.

Table 13: Impact of various factors on performance.

                MCC                      blastX vs X        psiX vs X          psiX vs blastX
Encoding   X      blastX   psiX    Delta   Percent    Delta   Percent    Delta   Percent
AAC        0.62   0.86     0.89    0.240   38.71      0.270   43.55      0.030   3.49
PAAC       0.67   0.89     0.90    0.220   32.84      0.230   34.33      0.010   1.12
PseAAC     0.62   0.88     0.88    0.260   41.94      0.260   41.94      0.000   0.00
Average                            0.24    37.83      0.25    39.94      0.01    1.54

This table notes the difference (delta) and percentage improvement in the MCC when incorporating different evolutionary information with the baseline compositions. The highest improvement in the MCC was achieved by the psi-compositions, with an average improvement of 39.94% compared to the baseline.

4.2 Performance of annotation transfer by homology

The performance of annotation transfer by homology to detect transporters against the TCDB under different thresholds is presented in Table 14. The choice of proper similarity thresholds is critical, as shown in Table 14, and there is a trade-off between sensitivity and specificity, where stricter thresholds (TCDB exact) result in a low true transporter detection rate (sensitivity) but more reliable elimination of non-transporters (specificity). However, when the thresholds are set to be more tolerant (TCDB med), the percentage of transporters detected increases but at the cost of more false predictions. A good balance between sensitivity and specificity was achieved using the TCDB high thresholds, where the overall MCC reached 0.68, which is lower than the best machine learning method, psiPAAC, with an MCC of 0.89. Nevertheless, this gives a different viewpoint, which we utilize in the TooT-T ensemble classifier.

Table 14: Performance of annotation transfer by homology

Threshold    Sensitivity  Specificity  Accuracy  MCC
TCDB exact   47.62        99.80        86.72     0.6329
TCDB high    81.56        89.52        87.52     0.6844
TCDB med     95.87        60.08        69.10     0.4873

This table shows the performance of homology annotation transfer with the training dataset using different thresholds. The best prediction power was achieved using the TCDB high threshold. The predicted transporter from TCDB exact was more reliable than that from the other thresholds due to the high specificity.

4.3 TooT-T ensemble performance

The performance of the ensemble classifier and each of its constituent classifiers in the cross-validation and independent testing set is presented in Table 15 and in Table 16, respectively. The ensemble classifier consistently outperformed its constituent classifiers in detecting transporters (sensitivity) while maintaining a credible true negative rate. Overall, it surpassed all of the other models in terms of the accuracy and MCC.

It was previously shown by [OS96] and [KV95] that ensemble classifiers benefited the most when the individual classifiers comprising the ensemble were both accurate and had low correlation (i.e., making errors in different parts of the input space). The constituent classifiers in our ensemble achieved the highest MCCs, and the correlations between them are presented in Table 17. When combining the predictions of only the three models on the machine learning side, we observed no improvement in the overall accuracy. This finding is reasonable, since the predictions from the machine learning models in our case were highly correlated. The obtained performance was mainly achieved by combining a different view, annotation transfer by homology, which has lower correlation than SVM-based classifiers.

Table 15: Cross-validation performance of the TooT-T model

      Name         Sensitivity    Specificity    Accuracy       MCC
SVM   psiAAC       87.83 ± 0.23   98.66 ± 0.06   95.93 ± 0.07   0.8905 ± 0.0019
      psiPAAC      88.45 ± 0.17   98.77 ± 0.05   96.17 ± 0.04   0.8970 ± 0.0012
      psiPseAAC    87.10 ± 0.25   98.28 ± 0.08   95.47 ± 0.08   0.8800 ± 0.0021
ATH   TCDB exact   47.62          99.80          86.72          0.6329
      TCDB high    81.56          89.52          87.52          0.6844
      TCDB med     95.87          60.08          69.10          0.4873
      TooT-T       91.80 ± 0.08   98.79 ± 0.01   97.02 ± 0.02   0.9203 ± 0.000

This table lists the means ± SDs of the ten different runs of the 10-CV of the proposed ensemble. It also shows the performance of each of its constituent classifiers. ATH: annotation transfer by homology.

Table 16: Independent testing performance of the proposed model

      Name         Sensitivity  Specificity  Accuracy  MCC
SVM   psiAAC       88.74        98.18        95.81     0.8872
      psiPAAC      89.64        98.49        96.26     0.8995
      psiPseAAC    87.84        98.18        95.58     0.8809
ATH   TCDB exact   38.74        100          84.6      0.5668
      TCDB high    80.18        88.50        86.41     0.6582
      TCDB med     96.4         59.61        68.86     0.4879
      TooT-T       93.69        98.49        97.28     0.9274

This table shows the performance of the proposed ensemble and each of its constituent classifiers trained on the DS-T training set and tested on the DS-T independent testing set. ATH: annotation transfer by homology.

Table 17: Phi correlation coefficients of the constituent classifiers

model        psiAAC  psiPAAC  psiPseAAC  TCDB exact  TCDB high  TCDB med
psiAAC       1.00    0.97     0.96       0.61        0.64       0.47
psiPAAC      0.97    1.00     0.95       0.61        0.64       0.47
psiPseAAC    0.96    0.95     1.00       0.60        0.64       0.47
TCDB exact   0.61    0.61     0.60       1.00        0.59       0.35
TCDB high    0.64    0.64     0.64       0.59        1.00       0.58
TCDB med     0.47    0.47     0.47       0.35        0.58       1.00

This table shows the correlation between the constituent classifiers of the ensemble. Among them, the homology annotation transfer exhibited a lower correlation than the machine learning models. This lower correlation motivates the use of ensemble techniques and helps to build a more powerful model.

4.4 Comparative performance

The experiments suggest that the TooT-T methodology is effective in detecting transporters. Since most of the state-of-the-art tools are not accessible and their source code is not available, we applied the TooT-T methodology to the same TrSSP dataset [MCZ14] that was used to train and test the other state-of-the-art transporter predictors, so that they can be properly compared with TooT-T.

Table 18 compares the performance of the TooT-T method with the methods from other published works on the same dataset. The highest prediction accuracy was achieved by the method proposed by Li et al. [LLX+16]. The high performance achieved by their model was mainly due to the use of the protein GO annotations as features. Such a high performance is to be expected, considering the fact that all of the sequences in the benchmark dataset were well annotated and extracted from the Swiss-Prot database. The goal of TooT-T, in contrast, is to predict novel and unannotated transporter proteins.

Table 18: Comparison with methods from other published works

                     Sensitivity       Specificity       Accuracy          MCC
Tool                 Ind.     CV       Ind.     CV       Ind.     CV       Ind.   CV
SCMMTP [LVY+15]      80.00    83.76    68.33    77.68    76.11    81.12    0.47   0.62
TrSSP [MCZ14]        76.67    76.67    81.67    78.46    80.00    78.99    0.57   0.58
Ho et al. [HPO+19]   100.00   83.14    77.50    84.48    85.00    83.94    0.73   0.68
TooT-T               94.17    90.15    88.33    89.97    92.22    90.07    0.82   0.80
Li et al. [LLX+16]   96.67    99.50    95.83    97.44    96.11    98.33    0.91   0.97

The other methods did not incorporate protein annotations as features and relied solely on the protein sequence to extract features to distinguish between transporters and non-transporters. They therefore provide a better comparison for the proposed method. The method proposed by Ho et al. [HPO+19] achieved a better sensitivity (100%) than TooT-T (94.17%) on the independent dataset. However, its specificity was 77.50% compared to the 88.33% achieved by TooT-T. The proposed method achieved a higher accuracy (↑ 7%) and a higher MCC (↑ 0.09) than the method proposed by Ho et al. [HPO+19] in transporter detection. Overall, TooT-T achieved a better accuracy, specificity, and MCC than all of the tools reported in published works, in both independent and cross-validation testing.

5 Conclusion

We propose the TooT-T ensemble classifier, which can distinguish transporter membrane proteins from other proteins. The ensemble classifier is trained to optimally combine the predictions obtained from machine learning and homology annotation methods to produce a final prediction. The machine learning components of the ensemble consist of SVM models that incorporate a novel encoding method, psi-composition. The psi-composition method combines traditional AAC with the alignment results of PSI-BLAST and shows superior prediction performance to models built using other features, including the PSSM profile. While the predictions obtained from annotation transfer by homology were not superior to the best machine learning models, they provided a different viewpoint of the solution. The proposed ensemble exploits the fact that different methods misclassify different sequences to build a more credible model. It was demonstrated through repeated 10-CV and independent dataset tests that the proposed ensemble outperformed its constituent classifiers and all other state-of-the-art transporter predictors that rely on the protein sequence alone.

Chapter 6

Ontology-based transporter substrate annotation for benchmark datasets

This chapter addresses the third research objective:

O3: To facilitate the data collection process in a traceable and reproducible manner

Some parts of this chapter have been presented and published at the BIBM conference: Alballa, M., & Butler, G. (2019, November). Ontology-based transporter substrate annotation for benchmark datasets. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (pp. 2613-2619).

This chapter is organized as follows: Section 1 introduces the chapter. Section 2 delineates the challenges faced when building a substrate-specific transport protein dataset and highlights the inconsistencies among the gold standard databases. Section 3 elucidates the proposed ontology-based tool, Ontoclass. Section 4 presents two case studies; the first case study (Section 4.1) compares Ontoclass annotation with a manually curated dataset, and the second case study (Section 4.2) reflects the number of annotated transporters and their substrates in the Swiss-Prot database. Section 5 discusses the findings, and Section 6 concludes the chapter.

1 Introduction

Membrane transport proteins perform a fundamental biological task of controlling the influx of essential nutrients and ions into the cell and the efflux of cellular waste and toxins out of the cell [G+06]. The identification of the substrate specificity of a transporter is essential to understanding its function and to developing drugs [G+06]. Predicting the substrate specificities of transporters has been the focus of many studies [COLG11, SH12, BH13, MCZ14, HPO+19].

In the context of transporters and transported substrates, the ultimate objective is to predict the exact transported substrate (e.g., arginine). However, the limited number of annotated substrates makes the prediction possible only at a high level of abstraction (e.g., amino acids), where all of the transporters that transport a group of substrates are combined together in one class.

Data collection is the backbone of any research; constructing a substrate-specific transport protein dataset for substrate prediction generally follows a manual curation process, in which a class is assigned to sequences that transport substrates with similar chemical properties. Unlike the manual curation of major biological databases such as Swiss-Prot, where the manual curation process is well defined and the entries are handled in a consistent manner, the manual curation of transporter substrate benchmark datasets is generally imprecise; the details behind the class assignment process are rarely described, which makes reproducing the same grouping very difficult.

The goal of this work is to define a reliable method to automate the data collection process and make establishing, updating or expanding a transporter substrate dataset achievable for any user. To automate the data collection process, two main components are needed: a source from which the transporter substrate annotation is found and a consistent method to assign broader classes to more specific substrates (e.g., amino acids to arginine).

We found that the GO MF annotation in the Swiss-Prot database contains information about transported substrates that are carefully curated by the database curators, and the well-structured ChEBI ontology for biochemical entities has already established the relationships between chemical entities. This work therefore proposes an automated tool, Ontoclass, that can assign a substrate class to a transporter in a reliable and consistent manner without an external investigator’s input. Ontoclass takes UniProt identifiers as input and utilizes the GO annotation from Swiss-Prot to detect transporters and assign their substrate specificities. Then, the tool utilizes the ChEBI ontology to determine the substrate class. The tool outputs the assigned substrate classes and related information.

2 Challenges

2.1 Different transporter classifications

There are two main gold standard databases for transporters: the TCDB and the Swiss-Prot database. The TCDB is a curated database of accurate and experimentally characterized information collected from over 10,000 published references; as of March 2020, the TCDB contains over 19,500 entries, which are classified into 1,448 transporter families. The TCDB uses a hierarchical classification system approved by the IUBMB for membrane transport proteins, known as the TC system. The Swiss-Prot database is the primary worldwide database of well-annotated and manually inspected data; as of March 2020, Swiss-Prot contains 561,611 entries. The Swiss-Prot database adopts the GO terms for its curation, and more than 28,103 proteins are characterized with the GO MF term GO:0005215 (transporter activity).

Proteins classified in one database are not necessarily included in the other classification. For example, only 37% (7,223 of the 19,500) of the TCDB annotated transporters are also annotated in the Swiss-Prot database. Of those transporters, only 3,618 were annotated with the transporter-related GO MF term GO:0005215 transporter activity. Such inconsistencies complicate the process of finding transporters and annotating them with a substrate class.

2.2 Lack of documentation for manual class assignment

Several datasets have been proposed and used to predict the substrate specificity of transporters; all of them use manual curation to assign substrate classes to transporters. In 2011, Chen et al. [COLG11] defined four substrate classes: electrons, proteins/mRNAs, ions, and others. Their dataset is not tailored to a specific organism and contains a total of 651 transporters. In 2012, Schaadt et al. [SH12] produced an Arabidopsis thaliana dataset with a total of 61 transporter proteins that belong to four substrate classes: amino acids, oligopeptides, phosphates, and hexoses. In 2013, Barghash et al. [BH13] considered four substrate classes, metal ions, phosphates, sugars, and amino acids, covering transporters from Escherichia coli (72 transporters), Saccharomyces cerevisiae (79 transporters), and Arabidopsis thaliana (95 transporters). In 2014, the goal of a study by Mishra et al. [MCZ14] was to classify transport proteins into the maximum possible number of classes according to their transported substrates. To achieve this goal, a substrate-specific transport protein dataset with a total of 900 transporters was constructed. The dataset consisted of seven transporter classes: amino acids/oligopeptides, anions, cations, electrons, proteins/mRNAs, sugars, and others. In 2019, Ho et al. [HPO+19] created a new dataset that contained 1,197 transporters that belong to seven classes: amino acids, electrons, hydrogen ions, lipids, proteins/mRNAs, sugars, and others.

As illustrated above, there is no specific set of transporter classes used in all of the datasets. Some authors (Chen et al. [COLG11]) group the substrates into four groups with one general class, others, referring to all other substrate types. Other authors (Schaadt et al. in [SCH10] and [SH12]) include oligopeptides (i.e., a few amino acids linked in a polypeptide chain). Other studies (Chen et al. [COLG11] and Mishra et al. [MCZ14]) elect to incorporate proteins/mRNAs, which consist of one or more polypeptides with at least 50 amino acids. Conversely, others (Barghash et al. [BH13]) completely ignore the protein or oligopeptide category. Additionally, the boundaries based on which a substrate is assigned to a class are not clear, which makes expanding a dataset to accumulate new proteins exceptionally challenging. For example, should the transporter protein with UniProt-ID Q10901, which transports L-glutamate, an α-amino acid anion, be assigned to the amino acids class, the anions class, or both?

3 The proposed tool: Ontoclass

3.1 Overview

The tool takes UniProt identifiers of proteins as input. For each protein, it determines whether the protein has transporter-related GO MF annotations in the Swiss-Prot database. If a transporter-related GO MF is found, it looks for the ChEBI identifier of the transported substrates in the GO annotation. This ChEBI-ID and its ancestors in the ChEBI ontology are used to find the most specific substrate class according to a predetermined list of substrate classes and ChEBI-IDs. The tool outputs the final substrate class of each protein along with additional information (Section 3.5). Details regarding the steps are presented in the following subsections.

3.2 Substrate classes to ChEBI-ID

To assign a substrate class to a transporter protein, it is necessary to determine the substrate classes that the tool produces. The first decision we had to make was to choose the substrate categories with respect to the ChEBI-IDs. We initially attempted to follow Saier’s classification system [Sai00] (see Table 19) by mapping each subcategory to its relevant ChEBI term, but we encountered multiple issues.

First, the classification system simultaneously offers role and chemical classifications. For example, category five (vitamins, cofactors, and their precursors) and some of the subcategories of category six belong to ChEBI’s role ontology, while the rest of the categories belong to its chemical ontology. A single compound could have both if its ontology includes both the “is a” and “has a role” relationships. Therefore, all of the compounds have chemical classifications, and some also have role classifications. We would often encounter the issue of multiple classifications for a single substrate. For example, glycine (CHEBI:15428), which is an amino acid, is classified under Saier’s classification as follows: 3.A Amino acids and conjugates; 5.D Signaling molecules; 6.B Specific drugs. Since we are mostly interested in the chemical composition of the transported substrates, and because the existing substrate prediction methods predict the chemical classification, we opted to consider the chemical categories.

The second issue concerns the fact that the groupings from Saier’s classification system categories and the ChEBI ontology are not consistent. For example, there is no single ChEBI term that corresponds to the 2.A Sugars, polyols, and their derivatives subcategory but rather two different terms, polyols (CHEBI:26191) and monosaccharides (CHEBI:63367). The closest common ancestor between these two terms is organic molecular entities (CHEBI:50860). Similarly, monosaccharides (CHEBI:63367) and carbohydrates (CHEBI:16646) share the ancestor carbohydrates and carbohydrate derivatives (CHEBI:78616) in the ChEBI ontology but are not in the same major category in Saier’s classification system.

Since we rely on the ChEBI ontology in our automatic substrate assignment scheme, we modified Saier’s classification system to be consistent with the ChEBI ontology. Figure 19 depicts the categories in Saier’s classification system with respect to the ChEBI ontology, where the edges represent “is a” relationships. Table 20 groups the categories according to the relevant closest ancestor in agreement with the ChEBI ontology. We call this mapping S2C.

Table 19: Saier’s classification of transporter substrates [Sai00]

Category and substrate type                    Subcategories
1. Inorganic molecules                         A. Nonselective; B. Water; C. Cations; D. Anions; E. Others
2. Carbon compounds                            A. Sugars, polyols, and their derivatives; B. Monocarboxylates; C. Di- and tricarboxylates; D. Noncarboxylate organic anions (organophosphates, phosphonates, sulfonates, and sulfates); E. Others
3. Amino acids and their derivatives           A. Amino acids and conjugates; B. Amines, amides, and polyamines; C. Peptides; D. Other related organocations; E. Others
4. Bases and their derivatives                 A. (Nucleo)bases; B. Nucleosides; C. Nucleotides; D. Other nucleobase derivatives; E. Others
5. Vitamins, cofactors, and their precursors   A. Vitamins and vitamin or cofactor precursors; B. Enzyme and redox cofactors; C. Siderophores; siderophore-Fe complexes; D. Signaling molecules; E. Others
6. Drugs, dyes, sterols, and toxics            A. Multiple drugs; B. Specific drugs; C. Bile salts and conjugates; D. Sterols and conjugates
7. Macromolecules                              A. Carbohydrates; B. Proteins; C. Nucleic acids; D. Lipids; E. Others
8. Miscellaneous compounds

Figure 19: Simplified view of ChEBI ontology terms. This figure shows a simplified view of the categories in Saier’s classification system with respect to the ChEBI ontology; the edges represent “is a” relationships in the ChEBI ontology; some edges were omitted to simplify the view. Each node contains the ChEBI term and the relevant ChEBI-ID. The leaves are the categories.

Table 20: Classification of transport system substrates using the ChEBI ontology

Category and type of substrate        Class                                  ChEBI-ID
1. Inorganic molecules                A. Nonselective                        CHEBI:36914; CHEBI:24431
                                      B. Water                               CHEBI:15377
                                      C. Cations                             CHEBI:36915
                                      D. Anions                              CHEBI:24834
2. Organic ions                       A. Organic cations                     CHEBI:25697
                                      B. Organic anions                      CHEBI:25696
3. Carbohydrates and derivatives      A. Monosaccharides and derivatives     CHEBI:35381; CHEBI:63367
                                      B. Oligosaccharides and derivatives    CHEBI:50699; CHEBI:63563
                                      C. Polysaccharides and derivatives     CHEBI:18154; CHEBI:65212
4. Carboxylic acids                   A. Monocarboxylic acids                CHEBI:25384
                                      B. Tricarboxylic acids                 CHEBI:27093
                                      C. Dicarboxylic acids                  CHEBI:35692
5. Organonitrogen compounds           A. Amino acids                         CHEBI:33709
                                      B. Amino acid derivatives              CHEBI:83821
                                      C. Peptides                            CHEBI:16670
                                      D. Amines                              CHEBI:32952
                                      E. Polyamines                          CHEBI:88061
                                      F. Proteins                            CHEBI:36080
                                      G. Other organic amino compounds       CHEBI:50047
6. Organic heterocyclic compounds     A. Nucleobases                         CHEBI:18282
                                      B. Nucleosides                         CHEBI:33838
                                      C. Nucleic acids                       CHEBI:33696
                                      D. Nucleotides                         CHEBI:36976
7. Miscellaneous                      A. Polyols                             CHEBI:26191
                                      B. Organic phosphates                  CHEBI:25703
                                      C. Amides                              CHEBI:32988
                                      D. Other organic molecular entities    CHEBI:50860

This table corresponds to a map from the substrate class to the ChEBI-ID (S2C). The first two columns represent a modified version of Saier’s classification system [Sai00]; the last column shows the corresponding ChEBI-IDs.

3.3 Mapping GO terms to substrate classes

This stage constructs a lookup table, GO2C, which maps transporter-related terms, i.e., descendants of the GO MF term GO:0005215 transporter activity, to the ChEBI-ID of the most specific substrate class. Algorithm 2 constructs the lookup table GO2C. First, all descendants of the GO MF term (GO:0005215 transporter activity) and their corresponding transported substrate ChEBI-ID mappings were obtained from the go-plus.owl ontology file downloaded from http://snapshot.geneontology.org/ontology/extensions/go-plus.owl. Then, related classes in Table 20 that are in, or are ancestors of, the ChEBI-ID in the ChEBI ontology, available at ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi_lite.obo, were mapped to that term. This initial mapping was further filtered to retain only the most specific substrate class. For example, the initial mapping could represent 1.A (Nonselective), 5.A (Amino acids), 5.G (Other organic amino compounds), and 7.D (Other organic molecular entities), and the concise class is 5.A (Amino acids). Table 21 presents samples of the GO2C lookup table.

Algorithm 2 Construction of the GO2C mapping: This algorithm constructs a map from the GO MF term descendants of GO:0005215 (transporter activity) → ChEBI-ID

Require: GO as Gene Ontology
Require: ChEBI as ChEBI ontology
Require: S2C mapping
Ensure: GO2C mapping
  Initiate GO2C as an empty map
  for term t ∈ descendants of GO:0005215 do
    tc ← ChEBI term in t
    tcs ← most specific ChEBI term in ancestors(tc) ∩ range(S2C)
    add item [t → tcs] to map GO2C
  end for
  return GO2C

Table 21: Samples of the GO2C lookup table

GO MF term                                                               ChEBI-ID      Substrate class
GO:0005249 voltage-gated potassium channel activity                      CHEBI:29103   1.C Cations
GO:0015105 arsenite transmembrane transporter activity                   CHEBI:29866   1.D Anions
GO:0005277 acetylcholine transmembrane transporter activity              CHEBI:15355   2.B Organic cations
GO:0015136 sialic acid transmembrane transporter activity                CHEBI:26667   3.A Monosaccharides and derivatives
GO:0071913 citrate secondary active transmembrane transporter activity   CHEBI:30769   4.B Dicarboxylic acids
GO:0015193 L-proline transmembrane transporter activity                  CHEBI:17203   5.A Amino acids
GO:0015638 microcin transmembrane transporter activity                   CHEBI:64627   5.C Peptides
GO:0005340 nucleotide-sulfate transmembrane transporter activity         CHEBI:64702   6.D Nucleotides
GO:0015255 propanediol channel activity                                  CHEBI:26288   7.A Polyols
GO:0031927 pyridoxamine transmembrane transporter activity               CHEBI:16410   7.D Other organic molecular entities

This table shows example entries in the lookup table. Each entry contains the GO MF term, the ChEBI-ID of the transported substrate, and the general class that the substrate belongs to as defined by Table 20.

3.4 Class assignment with confidence

Given the UniProt-ID of a protein, the GO MF annotations of that protein from the Swiss-Prot database are examined. Each GO MF annotation found in the GO2C lookup table conveys that the protein is a transporter with the corresponding transporter substrate class in the lookup table. If the protein contains multiple GO MF terms in the GO2C lookup table, then all of the corresponding classes are assigned to that protein. The assigned classes are then filtered to retain only the most specific class, as presented in Algorithm 3. Furthermore, the evidence of each class assignment is provided, as determined by the GO annotation in Swiss-Prot for that sequence, which provides an impression of the confidence level of the assignment of that substrate.

Algorithm 3 Construction of the U2S mapping: this algorithm constructs a map from the UniProt-ID of a transporter to P(substrate classes)
Require: S2C map
Require: GO2C map
Require: identifier u in UniProt of a transporter
Ensure: s_set is the set of most specific substrate classes of u in the modified Saier's list of substrate classes
    s_set ← empty set
    for each term t of u that is a descendant of GO:0005215 do
        s_set ← s_set ∪ S2C⁻¹(GO2C(t))
    end for
    s_set ← s_set − {t ∈ s_set | ∃ t′ ∈ s_set such that t′ is more specific than t}
    return s_set
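The most-specific filtering in Algorithm 3 can be sketched in Python as follows; this is an illustration under toy assumptions (the "more specific than" relation is hard-coded here), not the Ontoclass implementation.

    # Illustrative "more specific than" relation: class -> classes it refines.
    MORE_SPECIFIC = {
        "5.A": {"5.G", "1.A"},   # amino acids refine organic amino compounds,
        "5.G": {"1.A"},          # and both refine the nonselective class 1.A
    }

    def most_specific(classes):
        # Drop any class for which a strictly more specific class is present.
        classes = set(classes)
        return {c for c in classes
                if not any(c in MORE_SPECIFIC.get(other, set())
                           for other in classes if other != c)}

    print(most_specific({"1.A", "5.A", "5.G"}))   # {'5.A'}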

3.5 Presenting the output

The tool outputs the automatically assigned substrate class of a given protein, along with additional information collected from different sources, as presented in Table 22. Such information gives the user a broader view of the query protein and its updates. If the protein does not have any transporter-related GO MF term, the tool simply outputs "NA" for that protein.

Table 22: Ontoclass substrate assignment output

Name               Description
UniProt-ID         Input UniProt-ID of the protein sequence
Auto.Assignment    Ontology-based automated substrate class assignment
Auto.Confidence    Evidence of the annotation
Date               Last modification date of the entry in the Swiss-Prot database
KW                 List of keywords as in the Swiss-Prot database
GO.annotation      GO MF annotation of the protein
ChEBI.annotation   ChEBI-ID associated with the transporter-related GO annotation
In.TCDB            Cross-reference of this protein with the TCDB. The entry can take one of
                   three letters: Y indicates that the protein has an exact match with a TCDB
                   entry, H indicates that there is a hit by homology, and N indicates that
                   there is no hit
TCDB.fam           Corresponding TCDB family of the protein, if there is a hit in the TCDB
TCDB.substrate     Substrate annotation in the TCDB for the protein hit

This table shows the output of the automated tool. In addition to the substrate class mapping, it includes other information from different sources to give the user a general overview.

4 Case studies

4.1 Comparison with manually curated datasets

To assess how well Ontoclass establishes the different classes relative to other manually annotated datasets, we used the same sequences as in Mishra et al.'s TrSSP dataset [MCZ14] and compared the output classes with their manually assigned classes. Since Ontoclass includes more substrate classes than the dataset from [MCZ14], we grouped some classes in our output into the more general classes of Mishra et al.'s dataset as follows: organic and inorganic cations into cations; organic and inorganic anions into anions; amino acids, amino acid derivatives, and peptides into amino acids; and monosaccharides and oligosaccharides into sugars. Table 23 compares the obtained classes. "No mapping" refers to the sequences that do not have sufficient annotation for our tool to infer the substrate. Those sequences either have a GO MF annotation that is a descendant of transporter activity but no substrate mapping (e.g., GO:0015297 antiporter activity), or they do not have any descendant of transporter activity, such as nearly all of the proteins in Mishra et al.'s electron class with the annotation GO:0009055 electron transfer activity, as presented in Figure 20. There are 641 sequences with sufficient GO annotations for our tool to infer the substrate class. Of these, 576 (90%) agree with Mishra et al.'s classes, while the other 65 disagree. Several examples that illustrate the disagreement are presented in Table 24.


Table 23: Comparison between the TrSSP dataset and the Ontoclass dataset

Transporter class   TrSSP dataset   No mapping   Agreement   Disagreement
Amino acids         85              6            73          6
Anions              72              7            57          8
Electrons           70              67           0           3
Cations             296             36           244         16
Proteins            85              57           20          8
Sugars              72              14           54          4
Others              220             72           128         20
Total               900             259          576         65

This table compares the transporter mapping from the TrSSP dataset in Mishra et al. [MCZ14] with the mapping generated by our automated tool. Because the transporter classes are not the same between the two datasets, we mapped our classes to the TrSSP dataset classes as follows: amino acids and amino acid derivatives (5.A, 5.B) into the amino acids class; organic and inorganic anions (1.D, 2.B) into the anions class; organic and inorganic cations (1.C, 2.A) into the cations class; monosaccharides, oligosaccharides, and derivatives (3.A, 3.B) into the sugars class; and proteins (5.F) into the proteins class. The rest of our classes were mapped to others. The column labeled Agreement indicates that the ontology-based class is the same as the TrSSP dataset class. No mapping indicates that there is no corresponding class. Disagreement indicates that the ontology-based mapping and the TrSSP dataset have different substrate classes.

Table 24: Examples of disagreement

UniProt-ID   GO annotations                                                     Ontoclass assignment      TrSSP dataset
P46133       GO:0015558; F: secondary active p-aminobenzoyl-glutamate           2.A Organic anions        Amino acids
             transmembrane transporter activity
Q12482       GO:0015183; F: L-aspartate transmembrane transporter activity;     2.A Organic anions        Amino acids
             GO:0005313; F: L-glutamate transmembrane transporter activity
Q20106       GO:0015232; F: heme transporter activity                           7.D Other organic         Cations
                                                                                molecular entities
P33941       GO:1904680; F: peptide transmembrane transporter activity;         5.C Peptides              Proteins
             GO:0042626; F: ATPase-coupled transmembrane transporter activity
Q10185       GO:0044604; F: ATPase-coupled phytochelatin transmembrane          5.C Peptides              Others
             transporter activity; GO:0042626; F: ATPase-coupled
             transmembrane transporter activity

This table shows examples where the substrate classes produced by the automated tool conflict with those in the dataset from Mishra et al. [MCZ14]. Only the transporter-related GO MF annotations that have ChEBI mappings are shown.

Figure 20: Ancestor chart for GO:0009055. This figure illustrates the ancestor chart for the GO MF term GO:0009055 electron transfer activity; black edges represent "is a" relationships; blue edges represent "part of" relationships. This term is not a descendant of a transporter activity term, and it is therefore not recognized as a transporter by Ontoclass.

4.2 Building a new dataset

To ascertain the number of available sequences in each substrate class, we extracted all of the sequences from the Swiss-Prot database that are located in the membrane and have GO:0005215 transporter activity annotations. The data were then filtered to obtain the highest-quality dataset by adhering to the following commonly used criteria:

• Step 1: Protein sequences whose evidence for the existence of the protein is "inferred from homology" were removed;

• Step 2: Protein sequences annotated with multiple functions (e.g., transporters and enzymes) were removed;

• Step 3: Protein sequences that have no GO MF annotation, or whose annotations are based solely on computational evidence (IEA), were removed;

• Step 4: Protein sequences with more than 60% pairwise sequence identity were removed using the CD-HIT program [LG06] to avoid homology bias (a typical invocation is shown below).
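For reference, a typical CD-HIT invocation for a 60% identity cutoff looks like the following; the file names are placeholders, and the word size -n 4 is the value CD-HIT requires for thresholds in this range:

t_coffee-style plain command:

    cd-hit -i transporters.fasta -o transporters_nr60.fasta -c 0.6 -n 4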

The dataset contained 2,224 transporter sequences. We then used Ontoclass to find their substrate classes. Of the 2,224 sequences, 1,524 had clear substrate annotations, 379 had no ChEBI mapping, and 321 had multiclass annotations. Table 25 indicates the number of proteins assigned to each class. The class with the largest number of proteins was 1.C (inorganic cations) with 601 transporters. The class with the second highest number was 5.A (amino acids) with 147 proteins, followed by 3.A (monosaccharides and derivatives) with 126 sequences. Classes 5.F (proteins), 2.B (organic anions), and 1.D (inorganic anions) had 113, 107, and 102 proteins, respectively.

5 Discussion

Building a substrate-specific transporter protein dataset requires assigning to each transporter a substrate class that represents the general substrate the transporter moves across the membrane. Most of the established manually curated benchmark datasets extract this information from the Swiss-Prot database. The Swiss-Prot database is manually annotated and reviewed in terms of which GO terms (i.e., MF, BP, or CC) are assigned to the protein records, along with the reference from which each term was derived and the evidence code that indicates the degree to which the annotation is supported. The GO MF terms describe the activities that occur at the molecular level and are thus utilized in our tool; the GO terms are cross-referenced with other ontologies. Of specific interest to us is the ChEBI ontology, which provides information on chemical entities. There are 1,080 descendant terms of the GO:0005215 transporter activity annotation, 775 of which are cross-referenced with the ChEBI ontology. The cross-references of the descendant terms indicate the transported substrate. The advantage of using an ontology to collect the transported substrates is that the relationships between more specific substrates and broader classes are already established and can therefore be delegated to the ontology rather than to the dataset curator. Thus, any user who does not have significant prior knowledge of chemistry can build a new dataset or expand an already established one to accommodate newly added entries.

We used the proposed Ontoclass tool to infer the substrate classes of two datasets. The first benchmark dataset was developed by Mishra et al. [MCZ14], wherein the substrate classes were manually assigned. Ontoclass was not able to infer the classes of approximately 28% of the sequences in this dataset because their GO annotations were insufficient to assign a class. Furthermore, we compared the manually assigned classes with those inferred by Ontoclass and discovered that the majority of the assigned classes (90%) are the same and that the disagreements arise from different interpretations by the dataset curator and the established ontology. Thus, using the standard performance measures to evaluate Ontoclass is not useful: in essence, we would be comparing the manual curation conducted by Mishra et al. to the ChEBI ontology relationships, which are in turn manually annotated by expert annotators. For example, a transporter that transports L-glutamate, an α-amino acid anion, was mapped to the organic anions class via the ChEBI ontology, whereas the benchmark dataset assigned it to the amino acids class. In addition, a transporter that transports phytochelatin was mapped to the peptides class via the ChEBI ontology, whereas the benchmark dataset assigned it to the others class. An advantage of delegating the decision to the ontology is that, unlike the decisions made by a dataset curator, all of the reasoning is included, and thus the decisions are easily reproduced.

The second dataset was extracted from the Swiss-Prot database and includes all of the proteins with transporter activity annotations. The goal of this case study was to obtain an overview of the available substrate classes. As expected, the distribution of substrates is not uniform. The majority of mapped sequences belong to the 1.C (inorganic cations) substrate class, which is expected since ion channels compose a large class of transporters that move ions such as potassium, sodium, and calcium. The other classes include far fewer sequences: even though the tool does not infer specific substrates (e.g., benzoates) but instead groups them into larger categories (e.g., monocarboxylic acids), we still acquired only a small number of proteins for many classes. This phenomenon highlights the fact that membrane proteins are still not well characterized and that there is still insufficient data available to build a predictor of the specific substrate.

6 Conclusion

The construction of benchmark datasets for supervised learning requires a label or class to be assigned to each data point. In cases where the label is not taken directly from a reference source, this is done by the constructor of the dataset. In transporter substrate prediction, building a transporter dataset is commonly conducted through manual curation, in which the rationale behind assigning specific substrates to more general classes is not explained. This lack of documentation creates many challenges when establishing a new dataset or updating an established one. This chapter demonstrates that using transporter-related GO MF terms from the annotations in the Swiss-Prot database, along with their corresponding ChEBI mappings, can help to achieve automation. We have proposed an automated tool (Ontoclass) that exploits the well-defined and consistent annotation by the Swiss-Prot curators and delegates the substrate class assignment to established ontologies without any external curator judgment. The automated tool relieves us of the burden of manual curation when assigning a label; it is consistent with other ontologies and is reproducible. It can adapt to the exponential growth and frequent updates of biological databases with minimal prior knowledge.

Table 25: Number of sequences in each substrate class with similarity removed

Category and substrate type          Class                                    Number of sequences   Total
1. Inorganic molecules               A. Nonselective                          26                    755
                                     B. Water                                 26
                                     C. Cations                               601
                                     D. Anions                                102
2. Organic ions                      A. Organic cations                       13                    120
                                     B. Organic anions                        107
3. Carbohydrates and derivatives     A. Monosaccharides and derivatives       126                   141
                                     B. Oligosaccharides and derivatives      11
                                     C. Polysaccharides and derivatives       4
4. Carboxylic acids                  A. Monocarboxylic acids                  28                    33
                                     B. Tricarboxylic acids                   4
                                     C. Dicarboxylic acids                    1
5. Organonitrogen compounds          A. Amino acids                           147                   317
                                     B. Amino acid derivatives                10
                                     C. Peptides                              27
                                     D. Amines                                4
                                     E. Polyamines                            11
                                     F. Proteins                              113
                                     G. Other organic amino compounds         5
6. Organic heterocyclic compounds    A. Nucleobases                           18                    61
                                     B. Nucleosides                           16
                                     C. Nucleic acids                         3
                                     D. Nucleotides                           24
7. Miscellaneous                     A. Polyols                               4                     97
                                     B. Organic phosphates                    23
                                     C. Amides                                3
                                     D. Other organic molecular entities      67

This table shows the distribution of substrate classes in the Swiss-Prot database for all of the sequences that have GO MF GO:0005215 transporter activity annotations and less than 60% pairwise similarity.

Chapter 7

Predicting substrate specificity

This chapter addresses the fourth research objective:

O4: To broaden the scope of the state-of-the-art for substrate class prediction while maintaining credible predictive performance.

Some parts of this chapter have been published in PLoS ONE: Alballa, M., Aplop, F., &

Butler, G. (2020). TranCEP: Predicting the substrate class of transmembrane transport proteins using compositional, evolutionary, and positional information. PLoS ONE, 15(1), e0227683.

Contributions of the authors are as follows:

• Alballa, M. Conceptualization, Formal analysis, Investigation, Methodology,

Software, Validation, Visualization, Writing – original draft, Review & editing.

• Aplop, F. Conceptualization, Review & editing.

• Butler, G. Conceptualization, Formal analysis, Methodology, Project

administration, Supervision, Writing – original draft, Review & editing.

The chapter is organized as follows: Section 1 provides an introduction to transporter

substrate prediction. Section 2 describes the materials and methods utilized to build

the proposed tool, TooT-SC. Section 3 presents and analyzes the results, and Section 3.2

compares the performance of TooT-SC with the state-of-the-art methods. Finally, Section 4

concludes the chapter.

1 Introduction

Existing tools for the annotation of transporters that predict the substrates of transport reactions lag behind tools for other kinds of proteins, such as for predicting enzymes involved in metabolic reactions.

Many tools rely simply on homology or orthology to predict transporters. These tools include the metabolic network tools merlin [DRFR15], Pantograph [LZS15], and

TransATH [AB17].

Among the tools for de novo prediction of substrate class, FastTrans [HPO+19] claims to be the state-of-the-art. De novo prediction tools predict the type of substrate from a general subset of substrate types, without attempting to predict the specific substrate

[SCH10, COLG11, SH12, BH13, MCZ14]. The main reason for this is that the number of annotated transporters with specific substrates is still quite limited. As discussed in Chapter 6, even when the transporter substrates in the Swiss-Prot database are grouped into a higher level of abstraction, many classes still contain only a small number of samples. This phenomenon hinders the possibility of building tools that predict exact substrates. Tools that classify transporters based on their substrate specificity have reached a maximum of seven substrate types [MCZ14, HPO+19]. For network modeling in systems

biology [TP10,SAJT14], we require tools to process the complete proteome and predict each

transport reaction, which means identifying the transport protein and the specific substrate.

Our laboratory’s previous efforts for the de novo prediction of specific substrates for

sugar transporters in fungi were not successful [Apl16]. However, from these studies, we

learned how much depends on a very few residues of the transporter, often approximately

three residues, and often internal to different helix TMSs of the transporter [FBS+14].

These residues are far apart in the linear protein sequence but close to each other in the 3D structure of the protein when it is integrated in the membrane. In considering how we might improve upon approaches that rely on the amino acid composition of the protein, we developed a protocol whereby the compositional information is combined with evolutionary information, as captured by an MSA, and with positional information on the residues responsible for determining the specificity of the transporter [TWNB12]. This protocol is a schema for a large number of possible algorithms, owing to the many choices of amino acid composition encodings, MSA algorithms, and algorithms for identifying specificity-determining sites [CC14]. We also realized the importance of the alignment in preserving the TMS positions because the important residue positions seem to be located there. There are a number of such MSA algorithms [PFH08, CDTTN12, FTC+16, BGA+17].

We therefore conducted a study utilizing a new benchmark dataset, DS-SC, with additional substrate classes, which indicated that combining information about protein composition, protein evolution, and specificity-determining positions has a significant impact on our ability to predict the transported substrates. We chose the TrSSP methodology [MCZ14] as our baseline and varied it to illustrate the impact of each of the factors: compositional, evolutionary, and positional information. Our best approach, which defines our predictor TooT-SC, utilizes the PAAC encoding scheme, the TM-Coffee MSA algorithm [CDTTN12], and the transitive consistency score (TCS) algorithm [CDTN14] for determining informative positions in the MSA to build a suite of SVM classifiers, one for distinguishing each substrate class.

2 Materials and methods

2.1 Dataset

To construct a high-quality benchmark dataset, the UniProt-IDs from the transporter dataset in Chapter 5 were used as input to our ontology-based tool Ontoclass to assign a substrate class to the transporters. This was then modified in such a way that the sequences that belong to the substrate classes with very small samples were grouped into a higher level of abstraction. The final dataset contains 11 substrate classes, with the largest being the inorganic cations class with 601 samples and the smallest being the nucleotide class with 24 samples, as presented in Table 26. The data were randomly partitioned (stratified by class) into training (90%) and testing (10%) sets. We refer to the data in Table 26 as DS-SC.To the best of our knowledge, these data contain the highest number of substrate classes being used to predict the substrate class of a transporter.

Table 26: Dataset DS-SC

ID      Substrate class                 Training   Testing   Total
C1      Nonselective                    24         2         26
C2      Water                           24         2         26
C3      Inorganic cations               541        60        601
C4      Inorganic anions                92         10        102
C5      Organic anions                  97         10        107
C6      Organo-oxygens                  157        17        174
C7      Amino acids and derivatives     142        15        157
C8      Other organonitrogens           144        16        160
C9      Nucleotides                     22         2         24
C10     Organic heterocyclics           34         3         37
C11     Miscellaneous                   99         11        110
Total                                   1,376      148       1,524

2.2 Databases

We used the Swiss-Prot database when searching for similar sequences. When constructing MSAs, we used TM-Coffee [CDTTN12] with the UniRef50-TM database,

which consists of the entries in UniRef50 that have the keyword transmembrane.

2.3 Algorithm

Figure 21 illustrates the steps of the TooT-SC method. The sequence in (a) has four

TMSs, as shown by the gray shading. The example focuses on the first TMS and abbreviates the middle section of the sequence. Part (b) shows an MSA conserving the TMS structure constructed by TM-Coffee, where the gray shading indicates the TMS location. Part (c) shows the color coding of the reliability index of each column as determined by TCS, and shows how gaps replace unreliable columns in the filtered MSA. Part (d) shows a part of the 400-dimensional vector of dipeptide frequencies (PAAC) from the filtered MSA.

The template for combining evolutionary, positional, and compositional information is presented in Algorithm 4. Note that the use of evolutionary (E) and positional (P) information is optional, and that if positional (P) information is used, it requires evolutionary (E) information in the form of an MSA. Note also that if Step (E) is not performed, then the compositional Step (C) encodes the sequence s, and if Step (E) is performed but Step (P) is not, then Step (C) encodes the unfiltered MSA.

Figure 21: Example of the steps of the TooT-SC method. The figure illustrates the steps of the TooT-SC method; note that the middle section of the sequence is abbreviated. Part (a) shows the sequence with its four TMSs in gray. Part (b) shows an MSA constructed by TM-Coffee; the gray shading indicates a TMS. Part (c) shows the color coding of the reliability index of each column as determined by the TCS, as well as the gaps placed in unreliable columns in the filtered MSA. Part (d) shows a 400-dimensional PAAC vector from the filtered MSA.

Algorithm 4 Template for constructing the composition vector
    function COMP_VEC(seq s)
        // Evolutionary (E) step, optional
        Construct an MSA from s
        // Positional (P) step, optional
        Determine the informative positions (columns) in the MSA
        Filter the uninformative positions from the MSA
        // Compositional (C) step, mandatory
        return vector encoding the composition of the filtered MSA
    end function

In this work, we used TM-Coffee to compute the MSA that conserves the TMSs and the TCS to determine a reliability index for each position (column) in the MSA. We experimented with three composition schemes, AAC, PAAC, and PseAAC, as well as the optional use of TM-Coffee and the TCS. Algorithm 5 shows the composition vectors being used to build a set of classifiers (SVM classifiers in this case), and Algorithm 6 presents the prediction algorithm.

Algorithm 5 Building the SVM classifiers
Require: training set T of sequences labeled with classes C1, ..., Cn
Ensure: set of SVMs svm(i), each distinguishing class Ci from the other classes
    procedure BUILD_SVMS(T: a set of seqs; svm: a set of SVMs)
        for all seq s in T do
            v(s) ← COMP_VEC(s)
        end for
        for all Ci in classes do
            C̄i ← {C1, ..., Cn} − Ci
            svm(i) ← SVM.build({v(s) : s ∈ T ∩ (Ci ∪ C̄i)}, probability = T)
        end for
    end procedure

Algorithm 6 Prediction
Require: test sequence s
Require: set of SVMs svm(i), each distinguishing class Ci from the other classes
Ensure: result is the predicted class Cp
    function PREDICT_CLASS(seq s)
        v ← COMP_VEC(s)
        c ← array of length n
        for all Ci in {C1, ..., Cn} do
            c[i] ← probability of class i (svm(i) applied to v)
        end for
        Cp ← argmax(c)
        return Cp
    end function
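A compact Python rendering of Algorithms 5 and 6 is sketched below, using scikit-learn's SVC as a stand-in for the R e1071 SVMs used in this work; the composition vectors are assumed to come from any of the encodings described next.

    import numpy as np
    from sklearn.svm import SVC

    def build_svms(vectors, labels, classes):
        # Algorithm 5 (sketch): one probabilistic one-vs-rest RBF SVM per class.
        svms = {}
        for c in classes:
            y = np.array([1 if l == c else 0 for l in labels])
            svms[c] = SVC(kernel="rbf", probability=True).fit(vectors, y)
        return svms

    def predict_class(svms, v):
        # Algorithm 6 (sketch): predict the class whose SVM is most confident.
        v = np.asarray(v).reshape(1, -1)
        scores = {c: m.predict_proba(v)[0, list(m.classes_).index(1)]
                  for c, m in svms.items()}
        return max(scores, key=scores.get)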

2.4 Encoding the amino acid composition

The properties of the amino acids at each position in the protein sequence can be encoded into vectors that summarize the overall composition of the protein. Three approaches for encoding the amino acid composition were implemented in this study: AAC, PAAC, and PseAAC.
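As an illustration, the sketch below computes the two simplest of these encodings, the AAC (20 monomer frequencies) and the PAAC as used here (400 dipeptide frequencies; see Figure 21), from the rows of a possibly gap-containing alignment. Pooling counts over rows and skipping the gap symbol are assumptions that match the MSA-based variants described below.

    from itertools import product

    AA = "ACDEFGHIKLMNPQRSTVWY"
    PAIRS = ["".join(p) for p in product(AA, repeat=2)]

    def aac(rows):
        # 20-dimensional amino acid composition pooled over all alignment rows;
        # gaps and nonstandard symbols are skipped.
        counts = dict.fromkeys(AA, 0)
        total = 0
        for row in rows:
            for r in row:
                if r in counts:
                    counts[r] += 1
                    total += 1
        return [counts[a] / total if total else 0.0 for a in AA]

    def paac(rows):
        # 400-dimensional dipeptide (pair) composition; residues separated only
        # by gap columns are treated as adjacent, so that filtered positions do
        # not contribute to the counts.
        counts = dict.fromkeys(PAIRS, 0)
        total = 0
        for row in rows:
            residues = [r for r in row if r in AA]
            for a, b in zip(residues, residues[1:]):
                counts[a + b] += 1
                total += 1
        return [counts[p] / total if total else 0.0 for p in PAIRS]

    print(sum(aac(["MK-LV", "MKGLV"])))   # 1.0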

2.5 MSA

We adopted the MSA-AAC approach [SCH10], which combines the AAC with the evolutionary information available from an MSA. This method is implemented by first retrieving homologous sequences for each protein sequence in the dataset, building an MSA for the corresponding protein, and then computing the composition information from the counts over all of the residues in the MSA. Schaadt et al. [SCH10] utilized only the AAC encoding, whereas we also applied the approach to the PAAC and PseAAC encodings. Another difference is that we used TM-Coffee [CDTTN12] (version 11.00.8cbe486) to compute the alignments, rather than ClustalW [THG94] as used by Schaadt et al. [SCH10], because we felt it was important to align the TMSs.

Other differences included searching the Swiss-Prot database [BBA+03] and retrieving a maximum of 120 homologous sequences, instead of searching the nonredundant database nr and retrieving 1,000 sequences. This was done to keep the computational time manageable, because the TM-Coffee algorithm requires a great deal of memory and a long execution time. Furthermore, all exact hits of the test sequences were removed from the Swiss-Prot and UniRef50-TM databases to maintain a degree of independence between the MSA and the test data. It should be noted that any bias in Swiss-Prot is still inherited by both the training data and the test set; however, this last step reduces the correlation. Our alignment command was the following:

    t_coffee mysequences.fasta -mode psicoffee \
        -protein_db uniref50-TM \
        -template_file PSITM

where mysequences.fasta is the file that contains the (up to) 120 similar sequences retrieved by a BLAST search against the Swiss-Prot database.
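The homolog retrieval step is not spelled out in the text; with the BLAST+ suite, a typical invocation along these lines might be the following, where the database name, E-value cutoff, and file names are illustrative assumptions:

    blastp -query protein.fasta -db swissprot -evalue 1e-5 \
        -max_target_seqs 120 -outfmt "6 sseqid" > homolog_ids.txt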

2.6 Positional information

To focus on the positions in the protein that determine its specificity, we needed a method to determine those positions and then to filter our MSA. The MSA was filtered by setting the entries at all other positions to null, that is, to the gap symbol "-", so that they were ignored when gathering the counts for the amino acid composition.

We applied the TCS algorithm [CDTN14] to the alignment to determine the informative positions. The TCS is a scoring scheme that uses a consistency transformation to assign a reliability index to every pair of aligned residues, to each individual residue in the alignment, to each column, and to the overall alignment. This scoring scheme has been shown to be highly informative with respect to structural predictions based on benchmarking databases.

The reliability index ranges from 0 to 9, where 0 is extremely uncertain and 9 is extremely reliable. Columns with a reliability index less than 4 were removed using the following command:

    t_coffee -infile myMSA.aln -evaluate \
        -output tcs_column_filter4.fasta

where myMSA.aln is the MSA file and tcs_column_filter4.fasta is the filtered file in FASTA format.
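Downstream of this filtering, it can be useful to check how much of each sequence survives (Section 3.3 reports an average of 31% filtered out). A small sketch, assuming the filtered alignment row for a query sequence is in hand:

    def retained_fraction(filtered_row):
        # Fraction of alignment columns kept for one sequence after TCS
        # filtering; filtered (unreliable) columns appear as the gap symbol '-'.
        kept = sum(1 for c in filtered_row if c != "-")
        return kept / len(filtered_row) if filtered_row else 0.0

    print(retained_fraction("MK--LV--AA"))   # 0.6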

2.7 Training

Following TrSSP [MCZ14], we used an SVM with an RBF kernel, as implemented in the R e1071 library (version 1.6-8), in a one-against-the-rest approach in which n binary classifiers are trained, one for each class. Classifier i is trained with all of the samples of class i as the positive class and the rest as the negative class. The final predicted class is the class with the highest probability among the n predictions. Both the cost and γ parameters of the RBF kernel were optimized by a grid search using the tune function in the library (cost range: 2^1, ..., 2^5; γ range: 2^−18, ..., 2^2).
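In Python, the same grid search can be sketched with scikit-learn's GridSearchCV as an analogue of the e1071 tune call; the 10-fold setting here mirrors the cross-validation protocol described below.

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Analogue of e1071::tune over the cost/gamma grid used in this work.
    param_grid = {
        "C": [2.0**i for i in range(1, 6)],        # cost: 2^1 .. 2^5
        "gamma": [2.0**i for i in range(-18, 3)],  # gamma: 2^-18 .. 2^2
    }
    search = GridSearchCV(SVC(kernel="rbf", probability=True),
                          param_grid, cv=10, scoring="accuracy")
    # search.fit(X_train, y_train)  # X_train: composition vectors, y_train: 0/1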

2.8 Methods

We adopted three approaches to encode the amino acid composition: AAC, PAAC (as done by TrSSP [MCZ14]), and PseAAC. Each was followed by training an SVM to form the prediction methods AAC, PAAC, and PseAAC.

By combining the amino acid composition with the evolutionary information obtained using TM-Coffee, followed by an SVM, we implemented the prediction methods TMC-AAC, TMC-PAAC, and TMC-PseAAC.

Filtering was incorporated by applying the TCS after TM-Coffee, computing the amino acid composition vectors, and applying the SVM, to implement the prediction methods TMC-TCS-AAC, TMC-TCS-PAAC, and TMC-TCS-PseAAC.

The method used in TooT-SC is TMC-TCS-PAAC, the method that achieved the best performance during cross-validation.

2.9 Performance evaluation

The performance of each method on the DS-SC training set was determined using 10-fold cross-validation (10-CV), whereby the training dataset was randomly partitioned into ten sets of equal size. A single set was kept as the validation data, and the remaining nine sets were used to train the SVM model, which was then tested on the validation set. This process was repeated until each of the ten sets had been used exactly once as the validation data. The performance across all folds was aggregated to produce a single estimate (microaverage).

Since the 10-CV performance varies with different random splits, to make the error estimation more stable we repeated the 10-CV process ten times with different random partitions and captured the performance variation between runs by computing the standard deviation. It has been reported [Koh95] that this repetition stabilizes the error estimation and therefore reduces the variance of the K-CV estimator. Throughout the rest of this chapter, the cross-validation performance is reported as the mean ± SD of the ten different runs of the 10-CV process.

Four statistical measures were considered to measure the performance. The sensitivity is the proportion of positive samples that are correctly identified:

\[ \mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{44} \]

The specificity is the proportion of negative samples that are correctly identified:

\[ \mathrm{Specificity} = \frac{TN}{TN + FP} \tag{45} \]

The accuracy is the proportion of correct predictions among all predictions:

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP} \tag{46} \]

The MCC is a single measure taking into account the true and false positives and negatives:

\[ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{47} \]

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.

We used the MCC because it is less influenced by imbalanced data and is arguably the best single assessment metric in this case [Din11, WP03, BDA13]. The MCC value ranges from −1 to 1, where 1 indicates a perfect prediction, 0 represents a prediction no better than random, and −1 implies total disagreement between prediction and observation. A high MCC value means that the predictor has high accuracy on both the positive and negative classes and low misclassification in both.

When dealing with multiclass classification, it is often desirable to compute a single aggregate measure that reflects the overall performance. There are two methods to compute the overall performance, namely, microaveraging and macroaveraging [MRS08]. Macroaveraging computes the simple average of the individual class performances, whereas microaveraging computes the overall performance by globally counting the total true positives, false negatives, and false positives. Depending on the class distribution, the difference between the two methods can be large: macroaveraging gives equal weight to each class, whereas microaveraging gives equal weight to each individual classification decision [MRS08]. The overall accuracy of the tool is calculated as the fraction of correct predictions over the total number of predictions:

\[ \mathrm{Accuracy}_{\mathrm{overall}} = \frac{\sum_{k=1}^{K} TP_k}{N} \tag{48} \]

where TP_k is the number of true positives in class k, K is the number of classes, and N is the total number of predictions.

Another way to compute the accuracy is to take the macroaverage of the individual class accuracies:

\[ \mathrm{Accuracy}_{\mathrm{macro}} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{Accuracy}_k \tag{49} \]

where Accuracy_k is the accuracy of class k.

Similarly, the overall MCC is calculated in terms of a K × K confusion matrix C [Gor04]:

\[ \mathrm{MCC}_{\mathrm{overall}} = \frac{\sum_{k}\sum_{l}\sum_{m}\left(C_{kk}C_{lm} - C_{kl}C_{mk}\right)}{\sqrt{\sum_{k}\left(\sum_{l}C_{kl}\right)\left(\sum_{k' \neq k}\sum_{l'}C_{k'l'}\right)}\,\sqrt{\sum_{k}\left(\sum_{l}C_{lk}\right)\left(\sum_{k' \neq k}\sum_{l'}C_{l'k'}\right)}} \tag{50} \]

or as a macroaverage MCC:

\[ \mathrm{MCC}_{\mathrm{macro}} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{MCC}_k \tag{51} \]

where MCC_k is the MCC of class k. Because the number of samples in each class of the dataset is imbalanced, we used the overall accuracy as in Equation 48 and the overall MCC as in Equation 50 to evaluate and compare the different methods; it is explicitly stated whenever the macroaverage is used instead.
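A small Python sketch of Equations 48 and 50, computing the overall accuracy and the multiclass (Gorodkin) MCC from a confusion matrix, using the standard algebraic simplification of Equation 50:

    import numpy as np

    def overall_accuracy(C):
        # Equation 48: correct predictions (the diagonal) over all predictions.
        return np.trace(C) / C.sum()

    def overall_mcc(C):
        # Equation 50 in its equivalent simplified form (Gorodkin, 2004);
        # C[k, l] counts samples of true class k predicted as class l.
        t = C.sum(axis=1).astype(float)   # samples per true class
        p = C.sum(axis=0).astype(float)   # samples per predicted class
        n = C.sum()
        num = np.trace(C) * n - t @ p
        den = np.sqrt(n**2 - p @ p) * np.sqrt(n**2 - t @ t)
        return num / den if den else 0.0

    C = np.array([[50, 2], [5, 43]])
    print(overall_accuracy(C), overall_mcc(C))   # 0.93, ~0.86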

2.10 Statistical analysis

In this analysis, Student’s (two-tailed, paired) t-tests were applied, and the average number of informative residues, as determined by TCSs, in different segments of a protein sequence was computed. For each substrate class, pairwise comparisons between the means

110 of important positions in different segments were performed. The differences were considered statistically significant when the P-value of the Student’s t-test was less than 0.0001.
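For reference, a paired two-tailed t-test of this kind is a one-liner with SciPy; the per-sequence rates below are illustrative placeholders:

    from scipy.stats import ttest_rel

    # Paired, two-tailed t-test comparing per-sequence informative-position
    # rates in two segments (e.g., TMS vs. non-TMS); values are placeholders.
    tms_rates     = [0.95, 0.78, 0.88, 0.80, 0.91]
    non_tms_rates = [0.57, 0.49, 0.60, 0.54, 0.58]
    stat, p_value = ttest_rel(tms_rates, non_tms_rates)
    print(p_value < 0.0001)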

3 Results and discussion

3.1 Methods evaluation

Since the data are imbalanced, we focused on the MCC when comparing the performances of the different models. Table 27 presents the overall accuracy values and

MCCs of the SVM models for the nine methods, sorted from the best to the worst according to the MCC. The details of the performance for each method are available in Appendix C; the comparisons among the different methods for the eleven classes in terms of the MCC are presented in Figure 22.

Table 27: Overall cross-validation performance of the methods

Method            Accuracy       MCC
TMC-TCS-PAAC      82.53 ± 0.12   0.7772 ± 0.0019
TMC-PAAC          81.92 ± 0.12   0.7695 ± 0.0014
TMC-AAC           79.84 ± 0.13   0.7430 ± 0.0014
TMC-PseAAC        79.46 ± 0.30   0.7374 ± 0.0038
TMC-TCS-AAC       79.33 ± 0.24   0.7360 ± 0.0035
TMC-TCS-PseAAC    79.03 ± 0.27   0.7324 ± 0.0037
PAAC              58.93 ± 0.45   0.4610 ± 0.0069
PseAAC            54.80 ± 0.76   0.3999 ± 0.0108
AAC               52.21 ± 0.60   0.3628 ± 0.0091

For each method, the table presents the accuracy and MCC as the mean ± SD across the ten runs of the 10-fold cross-validation.

The SVM model that utilized the PAAC encoding outperformed those that utilized the AAC and PseAAC encodings by 27% and 15%, respectively, in terms of the overall MCC. This model showed exceptionally high performance in the water and nucleotide classes. In addition, all of the SVM models that utilized evolutionary data performed notably better overall than those that did not. The top model, TMC-TCS-PAAC, the method chosen for our predictor TooT-SC, combines the PAAC with evolutionary data in the form of an MSA and with positional information, in which columns that have a reliability index below 4 are filtered out. The performance peaked at this threshold and began to decline when stricter cutoffs were used, i.e., when columns with a reliability index of 4 or greater were also filtered out.

Figure 22: MCCs of the different methods on the substrate classes. This figure shows the cross-validation MCC performance of the different methods on the eleven substrate classes. The dotted line represents the performance of TooT-SC, which is TMC-TCS-PAAC.

The TMC-TCS-PAAC method yielded an overall MCC of 0.77 during cross-validation.

Table 29 shows the impact of evolutionary information and positional information on the composition-encoding PAAC.

The use of evolutionary information in the form of an MSA with the composition-encoding PAAC had a considerable positive impact in most of the substrate classes, improving the MCC by an average of 126.41%, with the highest improvement in the C1 (nonselective) class (347%). The baseline PAAC encoding for the C2 (water) substrate class already showed high discriminatory power, with an MCC of 0.96, and the incorporation of additional information had a slightly negative impact of 1.01%.

The further use of positional information, filtering the unreliable columns out of the MSA, improved the MCC of the composition encodings by an average of 128.57% over the baseline.

Table 28: TMC-TCS-PAAC performance

Class ID   Sensitivity     Specificity      Accuracy        MCC
C1         75.00 ± 0.00    99.78 ± 0.00     99.21 ± 0.00    0.7979 ± 0.0000
C2         95.83 ± 0.00    99.85 ± 0.00     99.74 ± 0.00    0.9376 ± 0.0000
C3         95.19 ± 0.47    86.92 ± 0.28     89.36 ± 0.21    0.7936 ± 0.0046
C4         64.35 ± 1.97    99.24 ± 0.18     96.38 ± 0.19    0.7252 ± 0.0155
C5         68.04 ± 0.49    98.40 ± 0.13     95.66 ± 0.14    0.6974 ± 0.0084
C6         83.44 ± 0.52    98.97 ± 0.12     96.72 ± 0.15    0.8543 ± 0.0066
C7         84.08 ± 0.95    98.55 ± 0.16     96.56 ± 0.18    0.8357 ± 0.0085
C8         71.46 ± 0.95    96.84 ± 0.27     93.42 ± 0.22    0.6830 ± 0.0084
C9         80.91 ± 1.92    99.98 ± 0.04     99.61 ± 0.05    0.8904 ± 0.0132
C10        82.35 ± 0.00    100.00 ± 0.00    99.47 ± 0.00    0.9050 ± 0.0000
C11        55.96 ± 1.09    97.95 ± 0.16     94.21 ± 0.20    0.5858 ± 0.0136
Overall                                     82.53 ± 0.12    0.7772 ± 0.0019

Table 29: Impact of factors on the performance of the PAAC

           MCC                                    TMC-PAAC           TMC-TCS-PAAC       TMC-TCS-PAAC
Class                                             to PAAC            to PAAC            to TMC-PAAC
ID         PAAC    TMC-PAAC   TMC-TCS-PAAC       Delta    %         Delta    %         Delta    %
C1         0.18    0.82       0.80               0.64     347.27    0.61     336.01    -0.02    -2.52
C2         0.96    0.95       0.94               -0.01    -1.01     -0.02    -2.40     -0.01    -1.41
C3         0.47    0.81       0.79               0.33     70.12     0.32     67.25     -0.01    -1.68
C4         0.31    0.69       0.73               0.38     120.32    0.41     131.69    0.04     5.16
C5         0.37    0.66       0.70               0.29     78.83     0.33     90.23     0.04     6.38
C6         0.44    0.84       0.85               0.40     88.82     0.41     92.11     0.01     1.74
C7         0.38    0.83       0.84               0.44     116.44    0.45     119.06    0.01     1.21
C8         0.36    0.64       0.68               0.28     75.77     0.32     87.23     0.04     6.52
C9         0.69    0.91       0.89               0.22     31.72     0.20     29.02     -0.02    -2.05
C10        0.34    0.91       0.91               0.57     168.32    0.57     166.96    0.00     -0.51
C11        0.15    0.58       0.59               0.43     293.97    0.44     297.15    0.00     0.81
Average                                          0.36     126.41    0.37     128.57    0.01     1.24

This table notes, for the cross-validation performance, the difference in the MCC (Delta) and the percentage improvement (%) due to the introduction of evolutionary information using TM-Coffee and of positional information using the TCS. The use of evolutionary information in the form of an MSA with the composition-encoding PAAC improved the MCC by an average of 126.41%. The further use of positional information, filtering the unreliable columns out of the MSA, improved the MCC of the composition encodings by an average of 128.57%.

The impact of positional information, beyond that already achieved by evolutionary information, was positive in most substrate classes; the largest gain was in the C5 (organic anions) class, where the MCC improved by 6.38% with TMC-TCS-PAAC. However, the impact was slightly negative in the C1 (nonselective), C2 (water), C3 (inorganic cations), and C9 (nucleotides) classes.

3.2 Comparison with other published work

The top two tools with the best reported performance are TrSSP [MCZ14] and

FastTrans [HPO+19]. Since the original code was not available for TrSSP or FastTrans,

we reimplemented the methods to the best of our ability. We compared the performance of

the TooT-SC method with our implementation of the TrSSP and FastTrans methods. All

of the methods were trained using the DS-SC training set (Section 2.1) and tested using

its testing set. It should be noted that our implementation of the TrSSP method [MCZ14]

achieved a similar macroaverage MCC to that reported in the original paper (0.41) on

their dataset. However, it was not possible to reproduce the reported performance of the

FastTrans method [HPO+19], for which our implementation on their same dataset achieved a macroaverage MCC of 0.47, while their reported macroaverage MCC was 0.87.

A comparison between the TooT-SC method and our implementations of the other state-of-the-art methods on the DS-SC benchmark dataset is presented in Table 30. TooT-SC scored higher than the other methods for all of the substrate classes in terms of accuracy, sensitivity, and MCC. Overall, TooT-SC achieved an overall MCC of 0.82, outperforming the TrSSP method by 26% and the FastTrans method by 115%.

3.3 Positional information analysis

It is difficult to isolate the exact residues that are key to inferring the substrate class; the results suggest that evolutionary information, obtained via the MSA, is the main source of the high prediction performance. In addition, the TCS informative positions (those with a TCS ≥ 4) help to filter out unnecessary noise and obtain a clearer signal that further improves the prediction. Retaining only the TCS informative positions filtered out an average of 31% ± 19% of each sequence. However, when we attempted to filter out more positions (by using a TCS cutoff stricter than 4), the performance started to deteriorate.

Table 30: Comparison between TooT-SC and the state-of-the-art methods

             Specificity                     Sensitivity                     Accuracy                        MCC
Class ID     TrSSP    FastTrans  TooT-SC     TrSSP    FastTrans  TooT-SC     TrSSP    FastTrans  TooT-SC     TrSSP    FastTrans  TooT-SC
C1           100.00   100.00     100.00      0.00     0.00       50.00       98.18    97.70      99.22       0.00     0.00       0.70
C2           99.32    100.00     100.00      100.00   100.00     100.00      99.08    100.00     100.00      0.81     1.00       1.00
C3           80.68    76.14      88.64       91.67    86.67      96.67       83.08    74.56      91.37       0.68     0.50       0.83
C4           98.55    95.65      100.00      60.00    40.00      70.00       94.74    87.63      97.69       0.64     0.33       0.83
C5           98.55    97.83      97.83       80.00    50.00      90.00       96.43    91.40      96.95       0.78     0.51       0.81
C6           96.95    96.18      97.71       64.71    35.29      76.47       91.53    84.16      94.78       0.64     0.35       0.76
C7           97.74    87.97      100.00      73.33    40.00      86.67       93.91    77.27      98.45       0.72     0.20       0.92
C8           94.70    96.21      96.21       56.25    25.00      87.50       88.52    83.33      94.78       0.50     0.25       0.77
C9           99.32    99.32      100.00      100.00   0.00       100.00      99.08    96.59      100.00      0.81     -0.02      1.00
C10          98.62    100.00     100.00      33.33    66.67      100.00      96.43    98.84      100.00      0.31     0.81       1.00
C11          99.27    95.62      100.00      27.27    36.36      45.45       92.31    86.73      95.49       0.42     0.31       0.66
Overall                                                                      72.97    57.43      85.81       0.65     0.44       0.82
Macroaverage                                                                 93.94    88.93      97.16       0.57     0.39       0.84

This table presents the performance of the proposed tool (TooT-SC), built with the complete training set and run on the independent testing set of DS-SC (see Table 26), and the corresponding results for the TrSSP and FastTrans methods trained and tested on the same dataset. It shows the specificity, sensitivity, accuracy, and MCC for each of the eleven substrate types; the overall accuracy and MCC; and the macroaverage accuracy and MCC. The overall accuracy was calculated as the number of correct predictions divided by the total number of predictions, and the overall MCC was calculated from the confusion matrix as in Equation 50.

To visualize the informative positions relative to the hydropathy of the amino acids, the hydropathy scale proposed by [KD82] was utilized, and the average hydropathy of each column in the MSA was computed. Higher positive scores indicate that the amino acids in that region have hydrophobic properties and are likely located in a transmembrane α-helix segment. The TCS of each column in the alignment is noted on the hydropathy plot through color coding. Figure 23 shows diverse examples. The red shades correspond to the informative columns (TCS ≥ 4), while the gray and white shades correspond to noninformative columns that are filtered out by TooT-SC. In Figure 23 (a) and (b), the regions with high positive average hydropathy values appear to be more informative than those with lower values. However, in Figure 23 (c) and (d), the difference between the informative positions with high and low hydropathy values is not as clear.
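A sketch of the per-column average Kyte-Doolittle hydropathy underlying plots like Figure 23; the scale values are the published [KD82] constants, while the gap handling is an assumption:

    # Kyte-Doolittle hydropathy scale [KD82].
    KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
          "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
          "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
          "Y": -1.3, "V": 4.2}

    def column_hydropathy(msa_rows):
        # Average hydropathy of each alignment column, skipping gaps;
        # columns containing no residues yield None.
        ncols = len(msa_rows[0])
        averages = []
        for j in range(ncols):
            vals = [KD[row[j]] for row in msa_rows if row[j] in KD]
            averages.append(sum(vals) / len(vals) if vals else None)
        return averages

    print(column_hydropathy(["MIL-", "MIV-"]))   # [1.9, 4.5, 4.0, None]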

To measure the informative positions relative to different segments of the protein sequence, we divided the protein sequence positions into those in a TMS and those not in a TMS. The TMS positions were divided into the interior one-third of positions and the remaining exterior positions of the TMS. The non-TMS positions were divided into those near a TMS, that is, within 10 positions of one, and the remaining positions, which were considered far from a TMS. The locations of the TMSs were retrieved from the Swiss-Prot database under the subcellular location topology section. Table 31 shows a breakdown of where the informative positions, as determined by the TCS, are located with respect to the TMS regions.

Table 31: Positional information

                                     Positions         TMS                          Non-TMS
Class   SeqLth   TMS    TMSLth      Num     %Seq      Num   Interior   Exterior    Num   Close   Far
C1      322      4      81          200     64.35     63    22         41          138   35      103
C2      273      6      126         203     74.72     121   42         79          82    57      25
C3      681      7      149         387     57.23     126   45         81          250   65      185
C4      575      8      168         376     62.01     142   50         92          215   73      142
C5      598      10     203         417     70.69     179   62         117         233   91      142
C6      461      10     203         325     70.45     177   62         115         144   70      74
C7      467      10     206         306     67.33     170   59         111         136   83      53
C8      537      4      83          347     39.34     70    24         46          133   37      96
C9      403      6      129         282     71.25     122   43         79          159   79      80
C10     497      12     241         402     79.86     218   76         142         183   95      88
C11     639      7      149         349     47.44     110   38         72          164   54      110

This table presents information on the sites retained by the TCS filtering step. For each class of substrates in the dataset, the table presents the average sequence length (SeqLth), the average number of TMS regions (TMS), and the average total number of residues in the TMS regions (TMSLth). It also presents the average number of positions retained by the filtering step (Positions: Num) and that number as a percentage of the total sequence length (Positions: %Seq). It notes the number of retained sites that occur in the TMS regions (TMS: Num) and in the non-TMS regions (Non-TMS: Num). For the TMS regions, it presents the average number of informative sites that occur in the central one-third of the TMS regions (Interior) and in the remaining exterior portions of the TMS regions (Exterior). For the non-TMS regions, it presents the average number of informative sites that occur close to the TMS regions, within 10 positions of a TMS (Close), and the remaining sites far from the TMS regions (Far).
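As a concrete illustration of this segmentation, a sketch that assigns each sequence position to one of the four segments, given Swiss-Prot-style TMS intervals (1-based, inclusive; the interval data are placeholders):

    def segment_positions(seq_len, tms_intervals, near=10):
        # Label each 1-based position as TMS interior (central third of a TMS),
        # TMS exterior, non-TMS close (within `near` of a TMS), or non-TMS far.
        labels = {p: "far" for p in range(1, seq_len + 1)}
        for start, end in tms_intervals:
            third = (end - start + 1) // 3
            lo, hi = start + third, end - third        # central one-third
            for p in range(max(1, start - near), min(seq_len, end + near) + 1):
                if start <= p <= end:
                    labels[p] = "interior" if lo <= p <= hi else "exterior"
                elif labels[p] == "far":
                    labels[p] = "close"
        return labels

    labels = segment_positions(60, [(10, 30)])
    print(labels[20], labels[10], labels[5], labels[50])
    # interior exterior close far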

For instance, in Figure 23 (a), 41.04% of the residues of the sequence with UniProt-ID Q59NP1 are informative (i.e., correspond to informative columns in the alignment); thus, 58.96% of this sequence is filtered out. In this case, the residues in the TMSs of this protein are indeed more informative than those outside them: 100% of the TMS residues are informative, whereas only 29.19% of the non-TMS residues are. The difference is not as pronounced for the sequence with UniProt-ID Q9NY37 in Figure 23 (c), where the rate of informative positions in the TMSs is similar to that of the non-TMS positions. Details of the sequences in the figure are presented in Table 32.

Table 32: Examples of the informative residue distributions with respect to TMSs and non-TMSs

                                        Positions           TMS                  Non-TMS
UniProt-ID   SeqLth   TMS    TMSLth    Num     %Seq        Num    %Seq          Num    %Seq
Q59NP1       251      2      42        103     41.04       42     100.00        61     29.19
Q8BFW9       622      12     252       386     62.06       246    97.62         140    37.84
Q9NY37       505      2      42        355     70.30       31     73.81         324    69.98
Q9Y584       194      3      63        78      40.21       32     50.79         46     35.11

This table shows the details of the individual sequences in Figure 23. It presents the sequence length (SeqLth), the number of TMS regions (TMS), and the total number of residues in the TMS regions (TMSLth). It also presents the number of informative positions retained by the filtering step (Positions: Num) and that number as a percentage of the total sequence length (Positions: %Seq), the number of informative sites that occur in the TMS regions (TMS: Num) with the corresponding percentage of the total TMS length (TMS: %Seq), and the number of informative sites that occur in the non-TMS regions (Non-TMS: Num) with the corresponding percentage of the total non-TMS length (Non-TMS: %Seq).

Table 33 presents a pairwise comparison between the informative positions in the TMS and non-TMS regions. The sequences in all of the substrate classes except the C1 (nonselective) class have significantly more informative positions in the TMS regions than in the non-TMS regions. Similarly, there is a significant difference between the informative positions close to the TMSs and the positions far from the TMSs in the sequences of all substrate classes except the C1 (nonselective) and C8 (other organonitrogens) classes, as shown in Table 34. In contrast, there is no difference between the informative positions in the central one-third of the TMS regions and the remaining exterior regions in the sequences that belong to the C1 (nonselective), C2 (water), C5 (organic anions), C8 (other organonitrogens), C9 (nucleotides), C10 (organic heterocyclics), and C11 (miscellaneous) classes; the difference is significant in the sequences that belong to the C3 (inorganic cations), C4 (inorganic anions), C6 (organo-oxygens), and C7 (amino acids and derivatives) classes, as presented in Table 35.

Table 33: Statistical analysis of the informative position rates in the TMS and non-TMS regions

Class ID   TMS             non-TMS         P-value
C1         80.74 ± 23.46   58.69 ± 22.43   0.0007
C2         95.58 ± 9.43    57.48 ± 12.14   <0.0001
C3         78.31 ± 28.07   49.57 ± 22.49   <0.0001
C4         79.81 ± 27.38   53.94 ± 25.36   <0.0001
C5         88.74 ± 20.17   60.55 ± 19.79   <0.0001
C6         85.35 ± 15.54   56.20 ± 16.58   <0.0001
C7         81.95 ± 16.90   55.28 ± 17.58   <0.0001
C8         46.18 ± 44.77   34.39 ± 33.03   <0.0001
C9         94.67 ± 6.00    59.84 ± 6.84    <0.0001
C10        90.45 ± 14.48   69.15 ± 17.63   <0.0001
C11        55.77 ± 37.82   41.13 ± 27.80   <0.0001

All of the data are reported as the sample means ± SDs. The locations of the TMS regions are as annotated in the Swiss-Prot database. There are statistically significantly (P-value < 0.0001) more informative positions in the TMS regions than in the non-TMS regions for the sequences from all classes except the nonselective class, where the difference is not significant.

Table 34: Statistical analysis of the informative position rates close to and far from TMS regions

Class ID   Close           Far             P-value
C1         78.24 ± 23.09   53.31 ± 26.22   0.002
C2         76.58 ± 10.97   38.59 ± 15.94   <0.0001
C3         66.82 ± 26.47   43.77 ± 22.95   <0.0001
C4         67.26 ± 26.48   47.89 ± 26.31   <0.0001
C5         78.15 ± 19.54   50.94 ± 21.79   <0.0001
C6         69.96 ± 14.50   45.09 ± 19.63   <0.0001
C7         69.18 ± 17.71   43.39 ± 20.65   <0.0001
C8         38.10 ± 41.33   30.53 ± 30.93   0.001
C9         76.60 ± 6.79    49.55 ± 11.43   <0.0001
C10        80.52 ± 14.54   58.05 ± 23.81   <0.0001
C11        49.91 ± 33.30   34.75 ± 26.89   <0.0001

All of the data are reported as the sample means ± SDs. For the non-TMS regions, there are statistically significantly (P-value < 0.0001) more informative positions close to the TMS regions (within 10 positions of a TMS) than in the regions far from the TMSs for the sequences of most classes, except the C1 (nonselective) and C8 (other organonitrogens) classes, where the differences are not significant.

[Figure 23: four panels, (a) Q59NP1, (b) Q8BFW9, (c) Q9NY37, (d) Q9Y584]

Figure 23: Average Kyte-Doolittle hydropathy of the MSAs with TCSs. Columns highlighted in red are informative and are used by TooT-SC; TooT-SC considers a column to be informative if it has a TCS of at least 4 (shades of red) and filters out the other columns (gray and white). In (a), Q59NP1 contains 251 residues, and the alignment of Q59NP1 with other homologous sequences has 692 columns, only 151 of which are informative (highlighted in shades of red). In (b), Q8BFW9 contains 622 residues, and the alignment with other homologous sequences has 2,414 columns, only 439 of which are informative. In (c), Q9NY37 contains 505 residues, and the alignment with other homologous sequences has 2,568 columns, only 508 of which are informative. In (d), Q9Y584 contains 194 residues, and the alignment with other homologous sequences has 1,644 columns, only 79 of which are informative.

Table 35: Statistical analysis of the informative position rates in the interior and exterior TMS regions

Class ID   Interior        Exterior        P-value
C1         80.66 ± 24.30   80.21 ± 23.55   0.6485
C2         98.44 ± 7.03    94.92 ± 10.54   0.0003
C3         80.92 ± 28.99   77.48 ± 28.05   <0.0001
C4         81.74 ± 28.33   79.10 ± 27.18   <0.0001
C5         90.09 ± 19.91   88.25 ± 20.49   0.0001
C6         87.65 ± 17.15   84.68 ± 15.50   <0.0001
C7         83.93 ± 17.22   81.31 ± 16.97   <0.0001
C8         47.03 ± 45.76   45.86 ± 44.65   0.0641
C9         97.82 ± 4.81    93.32 ± 6.95    0.0001
C10        92.73 ± 14.89   89.75 ± 14.52   0.0002
C11        56.88 ± 39.33   55.45 ± 37.52   0.03335

All of the data are reported as the sample means ± SDs. For the TMS regions, there is no difference between the informative positions in the central one-third of the TMS regions and the remaining exterior regions for the sequences that belong to the C1 (nonselective), C2 (water), C5 (organic anions), C8 (other organonitrogens), C9 (nucleotides), C10 (organic heterocyclics), and C11 (miscellaneous) classes. The difference is significant for the sequences that belong to the C3 (inorganic cations), C4 (inorganic anions), C6 (organo-oxygens), and C7 (amino acids and derivatives) classes.

4 Conclusion

We have developed a novel method, TooT-SC, for the de novo prediction of the substrates of membrane transporter proteins that combines information based on the amino acid composition, evolutionary information, and positional information. TooT-SC efficiently classifies transport proteins into eleven classes according to their transported substrate (i.e., nonselective, water, inorganic cations, inorganic anions, organic anions, organo-oxygens, amino acids and derivatives, other organonitrogens, nucleotides, organic heterocyclics, and miscellaneous); to the best of our knowledge, this is the highest number of classes offered by any de novo prediction tool. The TooT-SC method first incorporates evolutionary information by taking 120 similar sequences and constructing an MSA using TM-Coffee. Next, it uses positional information by filtering out unreliable positions, as determined by the TCS, and then computes the PAAC. TooT-SC achieved an overall MCC of 0.82 on an independent testing set, which is a 26% improvement over the state-of-the-art method. In addition, we evaluated the impact of each factor on the performance by incorporating evolutionary information and filtering out unreliable positions. We observed that the PAAC encoding outperforms the other compositional variations; however, it does not show compelling performance on its own, and the enhanced performance comes mainly from incorporating evolutionary and positional information.

Analysis of the locations of the informative positions reveals that there are more statistically significant informative positions in the TMSs than in the non-TMSs, and more statistically significant informative positions close to the TMSs than in the regions far from them. These findings suggest a direction for future research: to focus on these regions when incorporating evolutionary information.

Chapter 8

Conclusion

Numerous genome projects have resulted in a wealth of protein sequences, many of which are still unannotated. Membrane proteins are among the least characterized proteins in terms of their structure and function due to their hydrophobic surfaces and poor conformational stability. Membrane proteins perform virtually all membrane functions, aside from the basic barrier property of the lipid bilayer. Transporters serve as gatekeepers that control the flow of molecules into and out of the cell and are attractive targets for the pharmaceutical industry. The main objectives of this research are solutions that correspond to major challenges in the annotation of proteins. The first objective is to detect membrane proteins. Detecting all types of membrane proteins is often overlooked: transmembrane topology prediction is commonly used to find TMSs, and when any are detected, the proteins of interest are assumed to be membrane proteins. The experimental results suggest that this approach yields few false-positive assignments, i.e., nonmembrane proteins labeled as membrane proteins, but fails to detect over 25% of membrane proteins. On the other hand, machine learning models trained using all types of membrane proteins help to mitigate this issue. Our results also suggest that encoding a protein based merely on its sequence does not show high discriminatory power compared with encodings that incorporate evolutionary information. This finding could be because, while the protein sequence contains a great deal of important information, it also contains noise, and the use of evolutionary information places more emphasis on conserved residues likely to have functional roles, thus helping to improve the discriminatory power. Finally, combining transmembrane topology prediction with the predictions from machine learning classifiers exploits the strengths of both approaches and boosts the overall accuracy of membrane protein detection. The proposed integrative approach, TooT-M, combines both transmembrane topology prediction and predictions from machine learning classifiers; the comparative performance results indicate that TooT-M outperforms the state-of-the-art methods in terms of accuracy and MCC.

The second objective is to distinguish transporter proteins from non-transporter proteins. To accomplish this aim, there are two typical approaches: annotation transfer by homology and traditional machine learning methods. Our results suggest that annotation transfer by homology requires a trade-off between sensitivity and specificity. A stricter threshold eliminates false-positive assignments but at the cost of fewer true-positive assignments, while a less strict threshold increases the true-positive assignments along with the false-positive assignments. A good balance is achieved by the thresholds suggested by Aplop and Butler [AB17]. Nevertheless, annotation transfer by homology achieves a lower overall

MCC than machine learning models that utilize features with evolutionary information.

Interestingly, the predictions from annotation transfer by homology and the predictions from traditional machine learning methods have a low correlation with each other, which makes them good candidates for an ensemble classifier. The TooT-T ensemble classifier proposed by

this research is trained to optimally combine the predictions from annotation transfer by

homology and traditional machine learning models. The experimental results indicate that

TooT-T achieves a performance superior to all the other approaches, outperforming all of

the state-of-the-art de novo prediction methods in terms of accuracy and MCC.

The third objective is to facilitate the data collection process for the substrate specificity

of transporters. The ultimate goal of transporter substrate specificity methods is to predict

the exact substrate(s) that a transporter transports across the membrane; the analysis of

the number of annotated transporters reveals that this goal is not feasible with the current

number of annotated transporters. This lack of data explains why substrate prediction

researchers predict the general class to which the substrate belongs (e.g., amino acids)

instead of the exact substrate(s) (e.g., arginine). However, the assignment of a general

class in place of exact substrates is not explained or documented, which makes replicating

or expanding a dataset to account for newly annotated transporters extremely difficult

and subject to errors and inconsistencies. The proposed tool (Ontoclass) automates the

data collection process for transporter substrate-specific datasets. Ontoclass applies Saier’s

classification system for substrates and relies on gold standard ontologies and databases to

make the assignment. Ontoclass makes it possible to construct and expand a dataset in a

consistent, explainable, and reproducible manner.

The fourth objective is to broaden the scope of substrate class prediction while still

maintaining a credible predictive performance. The largest number of substrates predicted

by the state-of-the-art methods is seven substrate classes. To increase this number, we

utilized Ontoclass to construct a benchmark transporter dataset from the Swiss-Prot database and modified it so that the classes with a very small number of samples are grouped into a higher level of abstraction. The final dataset contains eleven different substrate classes: nonselective, water, inorganic cations, inorganic anions, organic anions,

organo-oxygens, amino acids and derivatives, other organonitrogens, nucleotides, organic heterocyclics, and miscellaneous. Using this dataset, we developed a novel tool, TooT-SC, that utilizes compositional, evolutionary and positional information. The experimental results suggest that this method outperforms all of the state-of-the-art methods in terms of the overall accuracy and MCC. Furthermore, the analysis of the reliable positions in the alignments reveals that the TMSs contain more statistically significant informative positions than the non-TMSs, and more statistically significant informative positions occur close to the TMSs than in regions far away from them. This finding offers a direction for future work to target these regions, as more important information seems to be located there.

1 Contributions

The main contributions of this thesis are summarized below:

1.1 Improving computational approaches to detect membrane proteins

We have demonstrated the limitation of merely using transmembrane topology prediction to detect all types of membrane proteins. The performances of different feature extraction techniques, including those employed by state-of-the-art predictors, were

examined with the combination of the KNN, OET-KNN, SVM, and GBM machine learning algorithms. The experimental results suggest that features that incorporate evolutionary information outperform features that rely on a single protein sequence. In addition, an ensemble classifier that combines the results from a selected set of Pse-PSSM OET-KNN predictors outperforms all of the other approaches. The selected set was chosen so that it would have low within-set correlation and high correlation with the target class (see the selection sketch at the end of this subsection). This is an improved version of the MemType-2L technique that reduces the number of constituent classifiers by over 90% while enhancing the performance. The proposed approach (TooT-M)

combines the predictions obtained from the selected set of classifiers and the prediction

from the TOPCONS2 transmembrane topology prediction tool. Experiments on multiple

datasets suggest that TooT-M outperforms all of the state-of-the-art methods in terms of

accuracy and MCC.

1.2 Improving computational approaches to detect transporter proteins

We proposed psi-composition, a new way to encode a protein sequence that combines traditional compositions with evolutionary information. The combination of

AAC with evolutionary information from a BLAST output has been performed previously in the literature [SCH10]. To the best of our knowledge, the combination of PAAC and

Pse-AAC with evolutionary information is a novel contribution of this research. The results suggest that among all of the tested encodings, the psiPAAC encoding, which incorporates

PAAC with PSI-BLAST output, achieves the highest discriminatory power.

The proposed tool for transporter detection (TooT-T) is trained to optimally combine the predictions from homology annotation transfer and machine learning methods to determine the final prediction. Homology annotation transfer detects transporters by searching against the TCDB under three different thresholds. The machine learning methods include three SVM models wherein the protein sequences are encoded using the proposed psi-composition encodings. The experimental results obtained by cross-validation and independent testing show that the combination of the two approaches (i.e., homology annotation transfer and machine learning methods) is more beneficial than employing only one method. Further, TooT-T outperforms all of the state-of-the-art methods that rely on

the protein sequence alone in terms of accuracy and MCC.

1.3 Facilitating the substrate-specific data collection process

An obstacle encountered when establishing a substrate-specific transporter dataset is that the number of annotated transporters with specific substrates is limited. This limitation led researchers to assign a general label to more specific substrates to produce a dataset with a sufficient number of samples in each class. This assignment shift from specific substrates to a general class is not documented or explained, which makes reproducing the same dataset or expanding it to include newly annotated transporters both time consuming and subject to inconsistencies.

The proposed tool (Ontoclass) automates the data collection process for transporter substrate-specific datasets. Ontoclass determines if a protein has transporter-related

GO MF annotations in the Swiss-Prot database and obtains the ChEBI-IDs of the transported substrates from the GO annotation. These ChEBI-IDs and their ancestors are used to identify the class according to Saier’s classification system. The tool outputs the final substrate class of each protein along with additional information. The automated tool relieves the burden of manual curation, is consistent with other ontologies, and is reproducible. In addition, it can adapt to the exponential growth of and updates to biological databases with minimal prior knowledge.

1.4 Broadening the scope of substrate-specific prediction

The proposed tool (TooT-SC) is able to efficiently classify transport proteins according

to eleven substrate classes: nonselective, water, inorganic cations, inorganic anions, organic

anions, organo-oxygens, amino acids and derivatives, other organonitrogens, nucleotides,

organic heterocyclics, and miscellaneous. This is the highest number of classes predicted by

any transporter substrate specificity detection tool.

In addition, TooT-SC utilizes the PAAC encoding scheme, the TM-Coffee MSA

algorithm [CDTTN12] and the TCS algorithm [CDTN14] to determine informative positions

in the MSA. The use of evolutionary information in the form of an MSA on the

composition-encoding PAAC shows a considerable positive impact in most substrate

classes, where the average improvement in the MCC is 126.41%. The further use of positional information by filtering out unreliable columns from the MSA shows an average improvement of 128.57% over the baseline PAAC. The impact of positional information over that already achieved by evolutionary information is positive in most substrate classes. The combination of the three sources of information is a novel contribution of this research. Our results suggest that the TooT-SC method outperforms all of the state-of-the-art methods in terms of the overall accuracy and MCC.

Furthermore, the analysis of informative positions in the alignments reveals that there are more statistically significant informative positions located in the TMSs than in the non-TMSs, as well as more statistically significant informative positions close to the TMSs compared to regions far away from them.

2 Data availability

The datasets used in this research are available online at: https://tootsuite.encs.concordia.ca/datasets/

The proposed tools are available at: https://github.com/bioinformatics-group/

3 Limitations and future directions

A major focus of this research is feature engineering (i.e., encoding the protein sequence as a numerical vector in a way that encapsulates its function); from our experiments, we learned that evolutionary information is the driving force behind high performance. In all of the experiments and in all of the tasks, features with evolutionary information outperformed those extracted from a single protein sequence. One desirable aim is to automatically learn features from the data and to discover the representations needed to build a classifier from raw sequences. Word embeddings are a feature learning technique in NLP, where words are represented as dense n-dimensional vectors in a way that preserves their semantic relationships. Many studies have proposed treating the protein sequence as a “sentence” of “words” and applying techniques similar to those in NLP to

learn a word-embedding representation of a protein, such as in [AM15, YWBA18, HPO+19].

Our initial attempts to do so were not as successful as those using features with evolutionary information. This could be because of one fundamental difference between protein sequences and natural language: in proteins, similarity of meaning depends on evolutionary context rather than on the word itself, so two words could have completely different relationships depending on the evolutionary context of a given protein. How can we incorporate such evolutionary information into word embeddings? A straightforward approach is to manually incorporate the evolutionary information into the vectors learned by word embedding; other possible approaches include applying deep learning to extract fixed-length features from sequence-length-dependent features, as done by [LWU+18]. We see potential in these very

interesting directions for future work.

Further, the solution proposed for each problem is functional and the results are promising; however, the tools have not yet been applied to the annotation of a complete genome.

Applying the tools to annotate a complete genome, followed by an analysis of the findings, is an immediate direction for future work.

For substrate specificity prediction, although the proposed tool TooT-SC predicts

eleven substrate classes, which is the highest number of substrate classes offered by

any de novo prediction tool, it still falls short of the ultimate goal of predicting the

exact substrate(s). This goal is mainly obstructed by the limited number of annotated

transporters. Thus, the set of predicted substrate classes should be revisited and expanded periodically as sufficient experimental data become available. In addition, all substrate

specificity detection methods, including TooT-SC, overlook that the relationship between

the transporter and the substrate is not one-to-one. For example, a transporter could

transport more than one type of substrate. Admittedly, tackling overlapping classes before reliable single-class methods are established would be premature. Nevertheless, experiments on transporters with multiple substrate classes are a desirable direction for future research.

The framework of the TooT-SC tool combines PAAC, the TM-Coffee MSA algorithm,

and the TCS [CDTN14] algorithm for determining informative positions in the MSA. The

results indicate that the combination of the three sources (i.e., compositional, evolutionary

and positional information) is advantageous. However, the computational time is not

optimal for a genome-scale annotation, as it takes an average of 15 minutes to compute the encoding vector of a given protein on a MacBook Pro with an Intel Core i7 @ 2.9 GHz processor (16 GB 2133 MHz LPDDR3 memory and 1 TB HD storage). The majority (95%) of the computational cost comes from the TCS filtering step. Thus, exploring efficient alternatives is worthwhile.

Bibliography

[AAB20] Munira Alballa, Faizah Aplop, and Gregory Butler. TranCEP: Predicting

the substrate class of transmembrane transport proteins using compositional,

evolutionary, and positional information. PLoS ONE, 15(1):e0227683, 2020.

[AB17] Faizah Aplop and Greg Butler. TransATH: Transporter prediction via

annotation transfer by homology. ARPN Journal of Engineering and Applied

Sciences, 12(2), 2017.

[AB19] Munira Alballa and Gregory Butler. Ontology-based transporter substrate

annotation for benchmark datasets. In 2019 IEEE International Conference

on Bioinformatics and Biomedicine (BIBM), pages 2613–2619. IEEE, 2019.

[AB20] Munira Alballa and Gregory Butler. TooT-T: discrimination of transport

proteins from non-transport proteins. BMC Bioinformatics, 21:1–10, 2020.

[ABB+00] , Catherine A Ball, Judith A Blake, David Botstein,

Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S

Dwight, Janan T Eppig, et al. Gene Ontology: tool for the unification of

biology. Nature Genetics, 25(1):25, 2000.

[ABW+04] , , Cathy H Wu, Winona C Barker, Brigitte

Boeckmann, Serenella Ferro, Elisabeth Gasteiger, Hongzhan Huang, Rodrigo

Lopez, Michele Magrane, et al. UniProt: the universal protein knowledgebase.

Nucleic Acids Research, 32(suppl 1):D115–D119, 2004.

[Agg14] Charu C Aggarwal. Data Classification: Algorithms and Applications. CRC

Press, 2014.

[AHJ18] Muhammad Arif, Maqsood Hayat, and Zahoor Jan. iMem-2LSAAC: A

two-level model for discrimination of membrane proteins and their types by

extending the notion of SAAC into Chou’s pseudo amino acid composition.

Journal of Theoretical Biology, 442:11–21, 2018.

[AM15] Ehsaneddin Asgari and Mohammad RK Mofrad. Continuous distributed

representation of biological sequences for deep proteomics and genomics. PLoS

ONE, 10(11):e0141287, 2015.

[AMS+97] Stephen F Altschul, Thomas L Madden, Alejandro A Schäffer, Jinghui Zhang,

Zheng Zhang, Webb Miller, and David J Lipman. Gapped BLAST and

PSI-BLAST: a new generation of protein database search programs. Nucleic

Acids Research, 25(17):3389–3402, 1997.

[Apl16] Faizah Aplop. Computational Approaches To Improving The Reconstruction

Of Metabolic Pathway. PhD thesis, Concordia University, 2016.

[BBA+03] Brigitte Boeckmann, Amos Bairoch, Rolf Apweiler, Marie-Claude Blatter,

Anne Estreicher, Elisabeth Gasteiger, Maria J Martin, Karine Michoud, Claire

O’Donovan, Isabelle Phan, et al. The SWISS-PROT protein knowledgebase

and its supplement TrEMBL in 2003. Nucleic Acids Research, 31(1):365–370,

2003.

[BDA13] Mohamed Bekkar, Hassiba Kheliouane Djemaa, and Taklit Akrouf Alitouche.

Evaluation measures for models assessment over imbalanced data sets. Journal

Of Information Engineering and Applications, 3(10), 2013.

[BE+94] Timothy L Bailey, Charles Elkan, et al. Fitting a mixture model by expectation

maximization to discover motifs in biopolymers. 1994.

[BFJE04] Frode S Berven, Kristian Flikka, Harald B Jensen, and Ingvar Eidhammer.

BOMP: a program to predict integral β-barrel outer membrane proteins

encoded within genomes of Gram-negative bacteria. Nucleic Acids Research,

32:W394–W399, 2004.

[BGA+17] Basharat Bhat, Nazir A Ganai, Syed Mudasir Andrabi, Riaz A Shah,

and Ashutosh Singh. TM-Aligner: Multiple sequence alignment tool for

transmembrane proteins with reduced time and improved accuracy. Scientific

Reports, 7(1):12543, 2017.

[BGJM17] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov.

Enriching word vectors with subword information. Transactions of the

Association for Computational Linguistics, 5:135–146, 2017.

[BH13] Ahmad Barghash and Volkhard Helms. Transferring functional annotations

of membrane transporters on the basis of sequence similarity and sequence

motifs. BMC Bioinformatics, 14(1):343, 2013.

[BKJ+16] Ahmad Hassan Butt, Sher Afzal Khan, Hamza Jamil, Nouman Rasool, and

Yaser Daanial Khan. A prediction model for membrane proteins using

moments based features. BioMed Research International, 2016, 2016.

[BLH05] Pantelis G Bagos, Theodore D Liakopoulos, and Stavros J Hamodrakas.

Evaluation of methods for predicting the topology of β-barrel outer membrane

proteins and a consensus prediction method. BMC Bioinformatics, 6(1):7,

2005.

[BLSH04] Pantelis G Bagos, Theodore D Liakopoulos, Ioannis C Spyropoulos, and

Stavros J Hamodrakas. PRED-TMBB: a web server for predicting the

topology of β-barrel outer membrane proteins. Nucleic Acids Research,

32(suppl 2):W400–W404, 2004.

[BRK17] Ahmad Hassan Butt, Nouman Rasool, and Yaser Daanial Khan. A treatise

to computational approaches towards prediction of membrane protein and its

subtypes. The Journal of Membrane Biology, 250(1):55–76, 2017.

[BSS+10] Gordon Blackshields, Fabian Sievers, Weifeng Shi, Andreas Wilm, and

Desmond G Higgins. Sequence embedding for fast construction of

guide trees for multiple sequence alignment. Algorithms for Molecular Biology,

5:21, 2010.

[Bue15] Lukas Buehler. The Structure of Membrane Proteins. In Cell Membranes,

chapter 3. Garland Science, 2015.

[BVF+08] Andreas Bernsel, Håkan Viklund, Jenny Falk, Erik Lindahl, Gunnar von

Heijne, and Arne Elofsson. Prediction of membrane-protein topology

from first principles. Proceedings of the National Academy of Sciences,

105(20):7177–7181, 2008.

[CAPPQ97] Juan Cedano, Patrick Aloy, Josep A Perez-Pons, and Enrique Querol. Relation

between amino acid composition and cellular location of proteins. Journal of

Molecular Biology, 266(3):594–600, 1997.

[CBCI08] Elisabeth P Carpenter, Konstantinos Beis, Alexander D Cameron, and

So Iwata. Overcoming the challenges of membrane protein crystallography.

Current Opinion in Structural Biology, 18(5):581–586, 2008.

[CC14] Abhijit Chakraborty and Saikat Chakrabarti. A survey on prediction

of specificity-determining sites in proteins. Briefings in Bioinformatics,

16(1):71–88, 2014.

[CDTN14] Jia-Ming Chang, Paolo Di Tommaso, and Cedric Notredame. TCS: a

new multiple sequence alignment reliability measure to estimate alignment

accuracy and improve phylogenetic tree reconstruction. Molecular Biology

and Evolution, 31(6):1625–1637, 2014.

[CDTTN12] Jia-Ming Chang, Paolo Di Tommaso, Jean-François Taly, and Cedric

Notredame. Accurate multiple sequence alignment of transmembrane proteins

with PSI-Coffee. BMC Bioinformatics, 13(Suppl 4):S1, 2012.

[Cho01] Kuo-Chen Chou. Prediction of protein cellular attributes using pseudo-amino

acid composition. Proteins: Structure, Function, and Bioinformatics,

43(3):246–255, 2001.

[CMB+14] Marcus C Chibucos, Christopher J Mungall, Rama Balakrishnan, Karen R

Christie, Rachael P Huntley, Owen White, Judith A Blake, Suzanna E Lewis,

and Michelle Giglio. Standardized description of scientific evidence using the

Evidence Ontology (ECO). Database, 2014, 2014.

[COLG11] ShuAn Chen, YuYen Ou, TzongYi Lee, and M Michael Gromiha. Prediction

of transporter targets using efficient RBF networks with PSSM profiles and

biochemical properties. Bioinformatics, 27(15):2062–2067, 2011.

[CS07] Kuo-Chen Chou and Hong-Bin Shen. MemType-2L: a web server for predicting

membrane proteins and their types by incorporating evolution information

through Pse-PSSM. Biochemical and Biophysical Research Communications,

360(2):339–345, 2007.

[CZ95] Kuo-Chen Chou and Chun-Ting Zhang. Prediction of protein structural

classes. Critical Reviews in Biochemistry and Molecular Biology,

30(4):275–349, 1995.

[Den95] Thierry Denoeux. A K-nearest neighbor classification rule based on

Dempster-Shafer theory. IEEE Transactions on Systems, Man, and

Cybernetics, 25(5):804–813, 1995.

[Din11] Zejin Ding. Diversified ensemble classifiers for highly imbalanced data learning

and its application in bioinformatics. PhD thesis, Georgia State University,

2011.

[DOS13] Jurate Daugelaite, Aisling O’Driscoll, and Roy D Sleator. An overview of

multiple sequence alignments and cloud computing in bioinformatics. ISRN

Biomathematics, 2013, 2013.

[DRFR15] Oscar Dias, Miguel Rocha, Eugénio C Ferreira, and Isabel Rocha.

Reconstructing genome-scale metabolic models with merlin. Nucleic Acids

Research, 43(8):3899–3910, 2015.

[EB06] Robert C Edgar and Serafim Batzoglou. Multiple Sequence Alignment.

Current Opinion in Structural Biology, 16(3):368–373, 2006.

[FBS+14] Alexander Farwick, Stefan Bruder, Virginia Schadeweg, Mislav Oreb, and

Eckhard Boles. Engineering of yeast hexose transporters to transport D-xylose

without inhibition by D-glucose. Proceedings of the National Academy of

Sciences, 111(14):5159–5164, 2014.

[FCE11] Robert D Finn, Jody Clements, and Sean R Eddy. HMMER web

server: interactive sequence similarity searching. Nucleic Acids Research,

39(suppl 2):W29–W37, 2011.

[FTC+16] Evan W Floden, Paolo Di Tommaso, Maria Chatzou, Cedrik Magis, Cedric

Notredame, and Jia-Ming Chang. PSI/TM-Coffee: a web server for fast and

accurate multiple sequence alignments of regular and transmembrane proteins

using homology extension on reduced databases. Nucleic Acids Research,

44(W1):W339–W343, 2016.

[G+06] Louis Sanford Goodman et al. Goodman and Gilman’s the pharmacological

basis of therapeutics. McGraw-Hill New York, 11 edition, 2006.

[GAW05] Andrew G Garrow, Alison Agnew, and David R Westhead. TMB-Hunt: an

amino acid composition based method to screen proteomes for beta-barrel

transmembrane proteins. BMC Bioinformatics, 6(1):56, 2005.

[GNY+13] Ilya Getsin, Gina H Nalbandian, Daniel C Yee, Åke Västermark, Philipp CG

Paparoditis, Vamsee S Reddy, and Milton H Saier. Comparative genomics

of transport proteins in developmental bacteria: Myxococcus xanthus and

Streptomyces coelicolor. BMC Microbiology, 13(1):279, 2013.

[Gor04] Jan Gorodkin. Comparing two K-category assignments by a K-category

correlation coefficient. Computational Biology and Chemistry, 28(5):367–374,

2004.

[Got96] Osamu Gotoh. Significant improvement in accuracy of multiple protein

sequence alignments by iterative refinement as assessed by reference to

structural alignments. Journal of Molecular Biology, 264(4):823–838, 1996.

[GY08] M Michael Gromiha and Yukimitsu Yabuki. Functional discrimination of

membrane proteins using machine learning techniques. BMC Bioinformatics,

9(1):135, 2008.

[HAB+13] David P Hill, Nico Adams, Mike Bada, Colin Batchelor, Tanya Z Berardini,

Heiko Dietze, Harold J Drabkin, Marcus Ennis, Rebecca E Foulger, Midori A

Harris, et al. Dovetailing biology and chemistry: integrating the Gene

Ontology with the ChEBI chemical ontology. BMC Genomics, 14(1):513, 2013.

[Hay12] Maqsood Hayat and Asifullah Khan. MemHyb: predicting membrane protein

types by hybridizing SAAC and PSSM. Journal of Theoretical Biology,

292:93–102, 2012.

[HE12] Sikander Hayat and Arne Elofsson. BOCTOPUS: improved topology

prediction of transmembrane β barrel proteins. Bioinformatics, 28(4):516–522,

2012.

[HGS+15] Yayun Hu, Yanzhi Guo, Yinan Shi, Menglong Li, and Xuemei Pu. A

consensus subunit-specific model for annotation of substrate specificity for

ABC transporters. RSC Advances, 5(52):42009–42019, 2015.

[HOD+15] Janna Hastings, Gareth Owen, Adriano Dekker, Marcus Ennis, Namrata Kale,

Venkatesh Muthukrishnan, Steve Turner, Neil Swainston, Pedro Mendes, and

Christoph Steinbeck. ChEBI in 2016: Improved services and an expanding

collection of metabolites. Nucleic Acids Research, 44(D1):D1214–D1219, 2015.

[HPO+19] Quang-Thai Ho, Dinh-Van Phan, Yu-Yen Ou, et al. Using word embedding

technique to efficiently represent protein sequences for identifying substrate

specificities of transporters. Analytical Biochemistry, 577:73–81, 2019.

[HS15] K Harini and Ramanathan Sowdhamini. Computational approaches for

decoding select odorant-olfactory receptor interactions using mini-virtual

screening. PLoS ONE, 10(7), 2015.

[HSW+10] Tao Huang, Xiao-He Shi, Ping Wang, Zhisong He, Kai-Yan Feng, LeLe Hu,

Xiangyin Kong, Yi-Xue Li, Yu-Dong Cai, and Kuo-Chen Chou. Analysis

and prediction of the metabolic stability of proteins based on their sequential

features, subcellular locations and interaction networks. PLoS ONE, 5(6),

2010.

[HW81] Thomas P Hopp and Kenneth R Woods. Prediction of protein antigenic

determinants from amino acid sequences. Proceedings of the National Academy

of Sciences, 78(6):3824–3828, 1981.

[HY08] Jing Hu and Changhui Yan. A method for discovering transmembrane β-barrel

proteins in gram-negative bacterial proteomes. Computational Biology and

Chemistry, 32(4):298–301, 2008.

[IKY02] Takeshi Inoue, Ichiro Kusumi, and Mitsuhiro Yoshioka. Serotonin transporters.

Current Drug Targets-CNS & Neurological Disorders, 1(5):519–529, 2002.

[JMF+01] Irene Jacoboni, Pier Luigi Martelli, Piero Fariselli, Vito De Pinto, and

Rita Casadio. Prediction of the transmembrane regions of β-barrel

membrane proteins with a neural network-based predictor. Protein Science,

10(4):779–787, 2001.

[KD82] Jack Kyte and Russell F Doolittle. A simple method for displaying

the hydropathic character of a protein. Journal of Molecular Biology,

157(1):105–132, 1982.

[KK00] Shuichi Kawashima and Minoru Kanehisa. AAindex: amino acid index

database. Nucleic Acids Research, 28(1):374–374, 2000.

[KKS05] Lukas Käll, Anders Krogh, and Erik LL Sonnhammer. An HMM posterior

decoder for sequence feature prediction that includes homology information.

Bioinformatics, 21(suppl 1):i251–i257, 2005.

[KLvHS01] Anders Krogh, Björn Larsson, Gunnar von Heijne, and Erik LL

Sonnhammer. Predicting transmembrane protein topology with a hidden

Markov model: application to complete genomes. Journal of Molecular

Biology, 305(3):567–580, 2001.

[KMKM02] Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma, and Takashi Miyata.

MAFFT: a novel method for rapid multiple sequence alignment based on fast

fourier transform. Nucleic Acids Research, 30(14):3059–3066, 2002.

[Koh95] Ron Kohavi. Wrappers for performance enhancement and oblivious decision

graphs. PhD thesis, Stanford University, 1995.

[KPP+07] Shuichi Kawashima, Piotr Pokarowski, Maria Pokarowska, Andrzej Kolinski,

Toshiaki Katayama, and Minoru Kanehisa. AAindex: amino acid index

database, progress report 2008. Nucleic Acids Research, 36:D202–D205, 2007.

[Kri07] Evgeny Krissinel. On the relationship between sequence and structure

similarities in proteomics. Bioinformatics, 23(6):717–723, 2007.

[KT08] Kazutaka Katoh and Hiroyuki Toh. Recent developments in the

MAFFT multiple sequence alignment program. Briefings in Bioinformatics,

9(4):286–298, 2008.

[KV95] Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross

validation, and active learning. In Advances in Neural Information Processing

Systems, pages 231–238, 1995.

[KYB03] Ian Korf, Mark Yandell, and Joseph Bedell. BLAST, chapter 4, pages 57–60.

O’Reilly Media, Inc., 2003.

[LBUZ09] Haiquan Li, Vagner A Benedito, Michael K Udvardi, and Patrick X Zhao.

TransportTP: a two-phase classification approach for membrane transporter

prediction and characterization. BMC Bioinformatics, 10(1):418, 2009.

[LBZ+00] Harvey Lodish, Arnold Berk, Lawrence Zipursky, Paul Matsudaira, David

Baltimore, and James Darnell. Transport across cell membranes. In Molecular

Cell Biology, chapter 15. W. H. Freeman, New York, 2000.

[LG06] Weizhong Li and Adam Godzik. CD-HIT: a fast program for clustering

and comparing large sets of protein or nucleotide sequences. Bioinformatics,

22(13):1658–1659, 2006.

[Lin08] Hao Lin. The modified Mahalanobis discriminant for predicting outer

membrane proteins by using Chou’s pseudo amino acid composition. Journal

of Theoretical Biology, 252(2):350–356, 2008.

[LLJ+11] Feiran Lu, Shuo Li, Yang Jiang, Jing Jiang, He Fan, Guifeng Lu, Dong Deng,

Shangyu Dang, Xu Zhang, Jiawei Wang, et al. Structure and mechanism of

the uracil transporter UraA. Nature, 472(7342):243–246, 2011.

[LLX+16] Liqi Li, Jinhui Li, Weidong Xiao, Yongsheng Li, Yufang Qin, Shiwen Zhou,

and Hua Yang. Prediction the substrate specificities of membrane transport

proteins based on support vector machine and hybrid features. IEEE/ACM

Transactions on Computational Biology and Bioinformatics, 13(5):947–953,

2016.

[LSM10] Yongchao Liu, Bertil Schmidt, and Douglas L Maskell. MSAProbs: multiple

sequence alignment based on pair hidden Markov models and partition function

posterior probabilities. Bioinformatics, 26(16):1958–1964, 2010.

[LVY+15] Yi-Fan Liou, Tamara Vasylenko, Chia-Lun Yeh, Wei-Chun Lin, Shih-Hsiang

Chiu, Phasit Charoenkwan, Li-Sun Shu, Shinn-Ying Ho, and Hui-Ling Huang.

SCMMTP: identifying and characterizing membrane transport proteins using

propensity scores of dipeptides. BMC Genomics, 16(12):S6, 2015.

[LWU+18] Yu Li, Sheng Wang, Ramzan Umarov, Bingqing Xie, Ming Fan, Lihua Li, and

Xin Gao. DEEPre: sequence-based enzyme EC number prediction by deep

learning. Bioinformatics, 34(5):760–769, 2018.

[LZS15] Nicolas Loira, Anna Zhukova, and David James Sherman. Pantograph:

A template-based method for genome-scale metabolic model reconstruction.

Journal of Bioinformatics and Computational Biology, 13(02):1550006, 2015.

[MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient

estimation of word representations in vector space. arXiv preprint

arXiv:1301.3781, 2013.

[MCZ14] Nitish K Mishra, Junil Chang, and Patrick X Zhao. Prediction of membrane

transport proteins and their substrate specificities using primary sequence

information. PLoS ONE, 9(6):e100278, 2014.

[MRS08] Christopher D Manning, Prabhakar Raghavan, and Hinrich Sch¨utze.

Introduction to Information Retrieval. Cambridge University Press, 2008.

[NHH00] Cédric Notredame, Desmond G Higgins, and Jaap Heringa. T-Coffee: A

novel method for fast and accurate multiple sequence alignment. Journal of

Molecular Biology, 302(1):205–217, 2000.

[NJ09] Timothy Nugent and David T Jones. Transmembrane protein topology

prediction using support vector machines. BMC Bioinformatics, 10:159,

2009.

[NKT83] Ken Nishikawa, Yasushi Kubota, and Tatsuo Ooi. Classification of proteins

into groups based on amino acid composition and other characters. I. Angular

distribution. Journal of Biochemistry, 94(3):981–995, 1983.

[NNT86] Hiroshi Nakashima, Ken Nishikawa, and Tatsuo Ooi. The folding type of a

protein is relevant to the amino acid composition. Journal of Biochemistry,

99(1):153–162, 1986.

[OAF10] Clare M O’Connor, Jill U Adams, and Jennifer Fairman. Essentials of cell

biology. Cambridge, MA: NPG Education, 1, 2010.

[OCG10] Yu-yen Ou, Shu-an Chen, and M Michael Gromiha. Prediction of membrane

spanning segments and topology in β-barrel membrane proteins at better

accuracy. Journal of Computational Chemistry, 31(1):217–223, 2010.

[OGCS08] YuYen Ou, M Michael Gromiha, ShuAn Chen, and Makiko Suwa.

TMBETADISC-RBF: discrimination of β-barrel membrane proteins using

RBF networks and PSSM profiles. Computational Biology and Chemistry,

32(3):227–231, 2008.

[OS96] David W Opitz and Jude W Shavlik. Generating accurate and diverse members

of a neural-network ensemble. In Advances in Neural Information Processing

Systems, pages 535–541, 1996.

[PdROC14] Fabiano Sviatopolk-Mirsky Pais, Patrícia de Ruy, Guilherme Oliveira, and

Roney Coimbra. Assessing the efficiency of multiple sequence alignment

programs. Algorithms for Molecular Biology, 9(4), 2014.

[Pea13] William R Pearson. An introduction to sequence similarity (“homology”)

searching. Current Protocols in Bioinformatics, 42(1):3–1, 2013.

[Pev09] Jonathan Pevsner. Multiple Sequence Alignment. In Bioinformatics and

Functional Genomics, chapter 6. Wiley-Blackwell, Baltimore, Maryland, 2

edition, 2009.

[PFH08] Walter Pirovano, K Anton Feenstra, and Jaap Heringa. PRALINE: a strategy

for improved multiple alignment of transmembrane proteins. Bioinformatics,

24(4):492–497, 2008.

[PLD05] Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based

on mutual information criteria of max-dependency, max-relevance, and

min-redundancy. IEEE Transactions on Pattern Analysis and Machine

Intelligence, 27(8):1226–1238, 2005.

[PVL+14] Philipp Paparoditis, Åke Västermark, Andrew J Le, John A Fuerst,

and Milton H Saier Jr. Bioinformatic analyses of integral membrane

transport proteins encoded within the genome of the planctomycetes species,

Rhodopirellula baltica. Biochimica et Biophysica Acta (BBA)-Biomembranes,

1838(1):193–215, 2014.

[RKR+08] Sheila M Reynolds, Lukas K¨all, Michael E Riffle, Jeff A Bilmes, and

William Stafford Noble. Transmembrane topology and signal peptide

prediction using dynamic Bayesian networks. PLoS Computational Biology,

4(11):e1000213, 2008.

[RLLS09] Michael Remmert, Dirk Linke, Andrei N Lupas, and Johannes S¨oding.

HHomp: prediction and classification of outer membrane proteins. Nucleic Acids

Research, 37(suppl 2):W446–W451, 2009.

[RP05] Qinghu Ren and Ian T Paulsen. Comparative analyses of fundamental

differences in membrane transport capabilities in prokaryotes and eukaryotes.

PLoS Computational Biology, 1(3):e27, 2005.

[Sai00] Milton H Saier. A functional-phylogenetic classification system for

transmembrane solute transporters. Microbiology and Molecular Biology

Reviews, 64(2):354–411, 2000.

[SAJT14] Swagatika Sahoo, Maike Kathrin Aurich, Jon Johannes Jonsson, and

Ines Thiele. Membrane transporters in a human genome-scale metabolic

knowledgebase and their implications for disease. Frontiers in Physiology,

5:91, 2014.

[SC07] Hong-Bin Shen and Kuo-Chen Chou. EzyPred: a top-down approach

for predicting enzyme functional classes and subclasses. Biochemical and

Biophysical Research Communications, 364(1):53–59, 2007.

[Sch03] Georg E Schulz. Transmembrane β-barrel proteins. Advances in Protein

Chemistry, 63:47–70, 2003.

[SCH10] Nadine S Schaadt, Jan Christoph, and Volkhard Helms. Classifying substrate

specificities of membrane transporters from Arabidopsis thaliana. Journal of

Chemical Information and Modeling, 50(10):1899–1905, 2010.

[SG04] Yinon Shafrir and H Robert Guy. STAM: simple transmembrane alignment

method. Bioinformatics, 20(5):758–769, 2004.

[SGW+11] Nitesh Kumar Singh, Aaron Goodman, Peter Walter, Volkhard Helms, and

Sikander Hayat. TMBHMM: a frequency profile based HMM for predicting

the topology of transmembrane beta barrel proteins and the exposure status

of transmembrane residues. Biochimica et Biophysica Acta (BBA)-Proteins

and Proteomics, 1814(5):664–670, 2011.

[SH12] NS Schaadt and V Helms. Functional classification of membrane transporters

and channels based on filtered TM/non-TM amino acid composition.

Biopolymers, 97(7):558–567, 2012.

[SJRT+15] Milton H Saier Jr, Vamsee S Reddy, Brian V Tsu, Muhammad Saad

Ahmed, Chun Li, and Gabriel Moreno-Hagelsieb. The transporter

classification database (TCDB): recent advances. Nucleic Acids Research,

44(D1):D372–D379, 2015.

[SJTB06] Milton H Saier Jr, Can V Tran, and Ravi D Barabote. TCDB: the

Transporter Classification Database for membrane transport protein analyses

and information. Nucleic Acids Research, 34(suppl 1):D181–D186, 2006.

[Söd05] Johannes Söding. Protein homology detection by HMM–HMM comparison.

Bioinformatics, 21(7):951–960, 2005.

[SSS04] Yasushi Shigeri, Rebecca P Seal, and Keiko Shimamoto. Molecular

pharmacology of glutamate transporters, EAATs and VGLUTs. Brain Research

Reviews, 45(3):250–265, 2004.

[SWD+11] Fabian Sievers, Andreas Wilm, David Dineen, Toby J Gibson, Kevin Karplus,

Weizhong Li, Rodrigo Lopez, Hamish McWilliam, Michael Remmert, Johannes

Söding, et al. Fast, scalable generation of high-quality protein multiple

sequence alignments using Clustal Omega. Molecular Systems Biology,

7:539, 2011.

[Tan62] Charles Tanford. Contribution of hydrophobic interactions to the stability

of the globular conformation of proteins. Journal of the American Chemical

Society, 84(22):4240–4247, 1962.

[TEB16] Konstantinos D Tsirigos, Arne Elofsson, and Pantelis G Bagos.

PRED-TMBB2: improved topology prediction and detection of beta-barrel

outer membrane proteins. Bioinformatics, 32(17):i665–i671, 2016.

[TGB+18] Konstantinos D Tsirigos, Sudha Govindarajan, Claudio Bassot, Åke

Västermark, John Lamb, Nanjiang Shu, and Arne Elofsson. Topology of

membrane proteins: predictions, limitations and variations. Current Opinion

in Structural Biology, 50:9–17, 2018.

[THG94] Julie D Thompson, Desmond G Higgins, and Toby J Gibson. CLUSTAL W:

improving the sensitivity of progressive multiple sequence alignment through

sequence weighting, position-specific gap penalties and weight matrix choice.

Nucleic Acids Research, 22(22):4673–4680, 1994.

[THKE12] Konstantinos D Tsirigos, Aron Hennerdal, Lukas Käll, and Arne Elofsson. A

guideline to proteome-wide α-helical membrane protein topology predictions.

Proteomics, 12(14):2282–2294, 2012.

[TLLP11] Julie D Thompson, Benjamin Linard, Odile Lecompte, and Olivier Poch.

A comprehensive benchmark study of multiple sequence alignment methods:

current challenges and future perspectives. PLoS ONE, 6(3):e18093, 2011.

[TP10] Ines Thiele and Bernhard Ø Palsson. A protocol for generating a high-quality

genome-scale metabolic reconstruction. Nature Protocols, 5(1):93, 2010.

[TPS+15] Konstantinos D Tsirigos, Christoph Peters, Nanjiang Shu, Lukas Käll, and

Arne Elofsson. The TOPCONS web server for consensus prediction of

membrane protein topology and signal peptides. Nucleic Acids Research,

43(W1):W401–W407, 2015.

[TS98] Gabor E Tusnady and Istvan Simon. Principles governing amino acid

composition of integral membrane proteins: application to topology prediction.

Journal of Molecular Biology, 283(2):489–506, 1998.

[TS01] Gabor E Tusnady and Istvan Simon. The HMMTOP transmembrane topology

prediction server. Bioinformatics, 17(9):849–850, 2001.

[TWNB12] Elin Teppa, Angela D Wilkins, Morten Nielsen, and Cristina Marino Buslje.

Disentangling evolutionary signals: conservation, specificity determining

positions and coevolution. Implication for catalytic residue prediction. BMC

Bioinformatics, 13(1):235, 2012.

[VBSE08] Håkan Viklund, Andreas Bernsel, Marcin Skwark, and Arne Elofsson.

SPOCTOPUS: a combined predictor of signal peptides and membrane protein

topology. Bioinformatics, 24(24):2928–2929, 2008.

[VE08] Håkan Viklund and Arne Elofsson. OCTOPUS: improving topology prediction

by two-track ANN-based preference scores and an extended topological

grammar. Bioinformatics, 24(15):1662–1668, 2008.

[vH92] Gunnar von Heijne. Membrane protein structure prediction: hydrophobicity

analysis and the positive-inside rule. Journal of Molecular Biology, 225:487–494,

1992.

[vH99] Gunnar von Heijne. Recent advances in the understanding of membrane

protein assembly and structure. Quarterly Reviews of Biophysics,

32(4):285–307, 1999.

[WL03] James C Whisstock and Arthur M Lesk. Prediction of protein function

from protein sequence and structure. Quarterly Reviews of Biophysics,

36(03):307–340, 2003.

[Wol92] David H Wolpert. Stacked generalization. Neural Networks, 5(2):241–259,

1992.

[WP03] Gary M Weiss and Foster Provost. Learning when training data are costly: The

effect of class distribution on tree induction. Journal of Artificial Intelligence

Research, 19:315–354, 2003.

[WvH98] Erik Wallin and Gunnar von Heijne. Genome-wide analysis of integral

membrane proteins from eubacterial, archaean, and eukaryotic organisms.

Protein Science, 7(4):1029–1038, 1998.

[XWC11] Xuan Xiao, Pu Wang, and Kuo-Chen Chou. GPCR-2L: Predicting G

protein-coupled receptors and their types by hybridizing two different modes

of pseudo amino acid compositions. Molecular Biosystems, 7(3):911–919, 2011.

[YSJ12] Jiwon Youm and Milton H Saier Jr. Comparative analyses of transport

proteins encoded within the genomes of Mycobacterium tuberculosis and

Mycobacterium leprae. Biochimica et Biophysica Acta (BBA)-Biomembranes,

1818(3):776–797, 2012.

[YWBA18] Kevin K Yang, Zachary Wu, Claire N Bedbrook, and Frances H

Arnold. Learned protein embeddings for machine learning. Bioinformatics,

34(15):2642–2648, 2018.

Appendices

Appendix A

Detailed performance evaluation of Pse-PSSM where λ ∈ (0,...,49)

Figure 24 delineates the accuracy of different models trained using the Pse-PSSM encodings (λ ∈ (0,...,49)).

Figure 24: Choice of different models with Pse-PSSM, λ ∈ (0,...,49)

Table 36: LOOCV performances of the individual Pse-PSSM models

Encoding / ML algorithm: Sensitivity, Specificity, Accuracy, MCC
Pse-PSSM, λ=0: OET-KNN 86.57 92.75 89.7 0.7953 | KNN 85.22 90.44 87.86 0.7580 | SVM 83.23 90.05 86.68 0.7350 | GBM 83.41 90.45 86.98 0.7409
Pse-PSSM, λ=1: OET-KNN 85.92 91.79 88.89 0.7788 | KNN 85.89 89.06 87.50 0.7501 | SVM 86.75 92.22 89.52 0.7912 | GBM 85.00 92.19 88.64 0.7744
Pse-PSSM, λ=2: OET-KNN 85.51 91.9 88.75 0.7762 | KNN 85.65 88.28 86.98 0.7397 | SVM 86.83 92.06 89.48 0.7904 | GBM 84.86 91.72 88.34 0.7682
Pse-PSSM, λ=3: OET-KNN 84.69 91.44 88.11 0.7636 | KNN 84.97 87.92 86.47 0.7295 | SVM 86.97 91.61 89.32 0.7871 | GBM 85.01 91.74 88.42 0.7697
Pse-PSSM, λ=4: OET-KNN 85.46 91.37 88.45 0.7701 | KNN 85.44 88.41 86.95 0.7390 | SVM 86.87 91.85 89.39 0.7886 | GBM 85.71 92.41 89.11 0.7835
Pse-PSSM, λ=5: OET-KNN 85.17 91.32 88.29 0.7668 | KNN 85.60 88.07 86.85 0.7371 | SVM 86.82 92.26 89.58 0.7925 | GBM 85.10 92.08 88.63 0.7742
Pse-PSSM, λ=6: OET-KNN 85.26 91.39 88.37 0.7684 | KNN 85.41 87.61 86.52 0.7305 | SVM 86.72 91.85 89.32 0.7871 | GBM 84.96 92.22 88.63 0.7743
Pse-PSSM, λ=7: OET-KNN 84.95 91.06 88.04 0.762 | KNN 85.25 87.53 86.41 0.7281 | SVM 86.80 91.79 89.32 0.7872 | GBM 85.36 92.12 88.78 0.7771
Pse-PSSM, λ=8: OET-KNN 85.3 91.28 88.33 0.7676 | KNN 85.45 87.86 86.67 0.7335 | SVM 86.33 92.03 89.22 0.7853 | GBM 84.57 91.60 88.13 0.7641
Pse-PSSM, λ=9: OET-KNN 84.81 91.25 88.07 0.7626 | KNN 85.40 87.79 86.61 0.7322 | SVM 86.28 91.75 89.05 0.7819 | GBM 83.89 91.95 87.97 0.7614
Pse-PSSM, λ=10: OET-KNN 84.73 90.81 87.81 0.7572 | KNN 85.22 87.08 86.16 0.7232 | SVM 86.36 91.47 88.95 0.7796 | GBM 84.46 91.84 88.19 0.7655
Pse-PSSM, λ=11: OET-KNN 84.92 91.00 88.00 0.7611 | KNN 85.01 87.10 86.07 0.7214 | SVM 86.31 91.75 89.06 0.7821 | GBM 84.69 92.10 88.45 0.7707

LOOCV performances of the individual Pse-PSSM models (cont.)

Encoding / ML algorithm: Sensitivity, Specificity, Accuracy, MCC
Pse-PSSM, λ=12: OET-KNN 84.59 91.11 87.90 0.7591 | KNN 85.00 87.13 86.08 0.7215 | SVM 86.26 91.59 88.96 0.7800 | GBM 84.24 91.77 88.06 0.7629
Pse-PSSM, λ=13: OET-KNN 84.80 90.87 87.87 0.7585 | KNN 84.67 87.57 86.14 0.7229 | SVM 86.42 91.32 88.90 0.7787 | GBM 84.62 91.60 88.16 0.7646
Pse-PSSM, λ=14: OET-KNN 84.54 90.72 87.67 0.7545 | KNN 84.92 86.78 85.87 0.7173 | SVM 86.49 91.27 88.91 0.7789 | GBM 84.33 91.55 87.99 0.7613
Pse-PSSM, λ=15: OET-KNN 84.64 90.79 87.76 0.7562 | KNN 84.56 87.45 86.02 0.7205 | SVM 86.22 91.49 88.89 0.7786 | GBM 84.32 91.42 87.91 0.7598
Pse-PSSM, λ=16: OET-KNN 84.66 90.66 87.70 0.7549 | KNN 85.15 86.87 86.02 0.7204 | SVM 86.03 91.55 88.83 0.7774 | GBM 84.15 91.75 88.00 0.7618
Pse-PSSM, λ=17: OET-KNN 84.82 90.56 87.73 0.7555 | KNN 84.97 86.96 85.98 0.7195 | SVM 86.56 91.46 89.04 0.7814 | GBM 84.51 91.30 87.95 0.7603
Pse-PSSM, λ=18: OET-KNN 84.64 91.30 88.01 0.7616 | KNN 84.72 87.70 86.23 0.7247 | SVM 86.17 91.05 88.64 0.7735 | GBM 84.48 91.68 88.13 0.7641
Pse-PSSM, λ=19: OET-KNN 84.96 90.70 87.86 0.7582 | KNN 85.45 86.91 86.19 0.7237 | SVM 86.07 91.55 88.85 0.7778 | GBM 84.68 91.60 88.19 0.7652
Pse-PSSM, λ=20: OET-KNN 84.46 90.78 87.66 0.7543 | KNN 84.76 87.47 86.13 0.7227 | SVM 86.36 92.23 89.33 0.7876 | GBM 84.17 91.76 88.01 0.762
Pse-PSSM, λ=21: OET-KNN 84.58 90.79 87.73 0.7556 | KNN 84.59 87.48 86.06 0.7212 | SVM 85.85 91.92 88.93 0.7796 | GBM 83.88 91.81 87.90 0.7598
Pse-PSSM, λ=22: OET-KNN 84.12 90.89 87.55 0.7523 | KNN 84.41 87.43 85.94 0.7189 | SVM 86.15 91.85 89.04 0.7817 | GBM 83.50 91.63 87.62 0.7543
Pse-PSSM, λ=23: OET-KNN 84.71 90.76 87.77 0.7565 | KNN 84.72 87.10 85.93 0.7186 | SVM 86.12 92.07 89.13 0.7837 | GBM 83.64 91.27 87.50 0.7518

LOOCV performances of the individual Pse-PSSM models (cont.)

Encoding / ML algorithm: Sensitivity, Specificity, Accuracy, MCC
Pse-PSSM, λ=24: OET-KNN 84.61 90.08 87.38 0.7484 | KNN 84.33 86.69 85.52 0.7105 | SVM 86.34 92.13 89.27 0.7865 | GBM 83.69 91.65 87.72 0.7564
Pse-PSSM, λ=25: OET-KNN 84.15 91.01 87.63 0.7539 | KNN 84.51 87.07 85.80 0.7161 | SVM 86.44 92.04 89.28 0.7865 | GBM 83.79 91.59 87.74 0.7567
Pse-PSSM, λ=26: OET-KNN 84.34 90.47 87.45 0.75 | KNN 84.52 86.92 85.73 0.7147 | SVM 86.33 91.76 89.08 0.7825 | GBM 83.83 91.50 87.72 0.7561
Pse-PSSM, λ=27: OET-KNN 84.32 90.19 87.29 0.7468 | KNN 84.63 87.14 85.90 0.7181 | SVM 86.18 92.15 89.21 0.7852 | GBM 83.79 91.80 87.85 0.7589
Pse-PSSM, λ=28: OET-KNN 84.39 90.49 87.48 0.7506 | KNN 84.47 87.13 85.82 0.7164 | SVM 85.83 91.99 88.95 0.7802 | GBM 83.74 91.59 87.72 0.7562
Pse-PSSM, λ=29: OET-KNN 84.14 90.70 87.46 0.7504 | KNN 84.58 87.04 85.83 0.7166 | SVM 85.97 91.91 88.98 0.7806 | GBM 83.85 91.52 87.73 0.7565
Pse-PSSM, λ=30: OET-KNN 84.19 90.74 87.51 0.7514 | KNN 84.54 87.48 86.03 0.7208 | SVM 86.24 91.86 89.09 0.7827 | GBM 83.70 91.65 87.73 0.7565
Pse-PSSM, λ=31: OET-KNN 84.25 90.34 87.34 0.7477 | KNN 84.56 86.83 85.71 0.7142 | SVM 86.05 92.08 89.11 0.7832 | GBM 83.69 91.44 87.62 0.7541
Pse-PSSM, λ=32: OET-KNN 84.23 90.83 87.57 0.7527 | KNN 84.93 87.25 86.11 0.7222 | SVM 85.99 91.64 88.85 0.778 | GBM 83.75 91.48 87.67 0.7551
Pse-PSSM, λ=33: OET-KNN 84.28 90.84 87.60 0.7533 | KNN 84.32 87.39 85.87 0.7175 | SVM 86.18 92.15 89.21 0.7852 | GBM 84.02 91.50 87.81 0.7579
Pse-PSSM, λ=34: OET-KNN 84.02 90.46 87.28 0.7468 | KNN 84.25 86.78 85.54 0.7108 | SVM 86.17 91.77 89.01 0.7811 | GBM 83.69 91.96 87.88 0.7597
Pse-PSSM, λ=35: OET-KNN 84.30 90.45 87.42 0.7494 | KNN 84.67 86.93 85.82 0.7163 | SVM 85.75 91.84 88.83 0.7778 | GBM 83.74 91.23 87.54 0.7524

LOOCV performances of the individual Pse-PSSM models (cont.)

Encoding / ML algorithm: Sensitivity, Specificity, Accuracy, MCC
Pse-PSSM, λ=36: OET-KNN 84.15 90.67 87.45 0.7503 | KNN 84.96 86.78 85.88 0.7176 | SVM 85.92 91.80 88.90 0.7789 | GBM 83.70 91.50 87.65 0.7549
Pse-PSSM, λ=37: OET-KNN 84.28 90.57 87.47 0.7504 | KNN 84.56 86.89 85.74 0.7148 | SVM 85.58 91.84 88.75 0.7761 | GBM 84.03 91.68 87.90 0.7598
Pse-PSSM, λ=38: OET-KNN 84.18 90.18 87.22 0.7453 | KNN 84.53 86.81 85.69 0.7137 | SVM 85.93 91.72 88.86 0.7783 | GBM 83.83 91.53 87.73 0.7563
Pse-PSSM, λ=39: OET-KNN 84.62 90.40 87.55 0.7519 | KNN 85.00 86.71 85.87 0.7173 | SVM 86.14 92.22 89.22 0.7855 | GBM 83.98 91.76 87.92 0.7603
Pse-PSSM, λ=40: OET-KNN 84.08 90.17 87.16 0.7443 | KNN 84.46 86.94 85.72 0.7144 | SVM 86.03 91.76 88.93 0.7796 | GBM 83.55 91.38 87.52 0.7522
Pse-PSSM, λ=41: OET-KNN 83.85 90.40 87.17 0.7446 | KNN 84.66 87.03 85.86 0.7172 | SVM 85.46 91.77 88.66 0.7744 | GBM 83.85 91.64 87.80 0.7578
Pse-PSSM, λ=42: OET-KNN 84.19 90.24 87.26 0.7461 | KNN 84.14 87.04 85.61 0.7123 | SVM 85.94 91.75 88.88 0.7787 | GBM 83.37 91.81 87.65 0.7551
Pse-PSSM, λ=43: OET-KNN 84.04 90.45 87.29 0.7469 | KNN 84.37 86.69 85.54 0.7109 | SVM 85.73 91.92 88.86 0.7784 | GBM 83.83 91.49 87.71 0.756
Pse-PSSM, λ=44: OET-KNN 84.24 90.50 87.41 0.7493 | KNN 84.78 86.80 85.80 0.7160 | SVM 85.88 91.53 88.74 0.7757 | GBM 83.56 91.55 87.61 0.7541
Pse-PSSM, λ=45: OET-KNN 84.02 90.22 87.16 0.7442 | KNN 84.56 86.89 85.74 0.7148 | SVM 85.80 91.77 88.83 0.7776 | GBM 83.83 91.66 87.80 0.7578
Pse-PSSM, λ=46: OET-KNN 84.32 90.56 87.48 0.7507 | KNN 84.51 87.20 85.87 0.7175 | SVM 85.97 91.99 89.02 0.7815 | GBM 83.73 91.70 87.77 0.7572
Pse-PSSM, λ=47: OET-KNN 84.03 90.47 87.29 0.747 | KNN 84.48 86.93 85.72 0.7145 | SVM 85.71 91.65 88.72 0.7755 | GBM 83.54 91.49 87.57 0.7532
Pse-PSSM, λ=48: OET-KNN 83.85 90.34 87.14 0.7439 | KNN 84.47 86.88 85.69 0.7138 | SVM 85.81 91.97 88.93 0.7798 | GBM 83.73 91.70 87.77 0.7572
Pse-PSSM, λ=49: OET-KNN 84.19 90.82 87.55 0.7522 | KNN 84.43 87.35 85.91 0.7183 | SVM 85.93 91.92 88.96 0.7803 | GBM 83.69 91.58 87.68 0.7556

Appendix B

Experiments on the psi-composition encodings for the prediction of other functional classes

Here, we conduct a preliminary study to determine how well the proposed psi-composition encodings can differentiate between the other functional classes and how this performance compares with other published work.

1 Dataset

The sequences in the non-transporter class of the transporter dataset in Chapter 5 were

partitioned into other functional classes based on their biological roles: enzyme, receptor,

or other. Protein sequences that have the annotation keyword “Receptor [KW-0675]”

in the Swiss-Prot database were categorized into the receptor class, protein sequences

that have an enzyme commission number (EC) annotation in the Swiss-Prot database

were classified into the enzyme class, and all of the other sequences that were in neither

the receptor class nor the enzyme class were labeled as other. The dataset was randomly

partitioned (stratified by class) into a training set (90% of the data) and an independent testing set (10% of the data), as presented in Table 37.

Class     Training  Testing  Total
Enzyme    2,591     287      2,878
Receptor  1,012     111      1,123
Other     2,340     263      2,603
Total     5,943     661      6,604

Table 37: Other dataset of functional membrane classes

2 Overview

We have a multiclass classification problem in which a membrane protein can be classified into a functional class (i.e., receptor, enzyme, or other). Similar to Chapter 5,

an SVM with an RBF kernel was utilized as implemented by the R e1071 library (version

1.6-8). The protein samples were represented using the psiPAAC encoding, as it achieved

the highest performance during cross-validation. Since SVMs inherently deal with the

binary classification problem, we extended the SVM approach using a one-against-the-rest

approach, in which a binary classifier is trained for each individual class; samples of that

class are positive samples and the rest of the samples are negative samples. The final

prediction was determined by comparing the output probabilities of the positive class from

each binary SVM, and the class with the highest probability was the predicted class. The

best combination of the C and γ parameters was determined for each binary classifier

independently by utilizing a grid search approach. Figure 25 delineates an overview of the

prediction steps.

3 Comparison among different encodings

The overall accuracy and MCC of the SVM models for which a protein sample is encoded using different strategies are presented in Table 38. Similar patterns as in transporter detection (Chapter 5) were observed; as illustrated in Figure 26, the proposed psi-compositions outperformed the other variations, including the commonly applied PSSM encoding. The multiclass SVM that utilized the psiPAAC encoding achieved the highest MCC in all of the classes; thus, this model was chosen as our functional membrane predictor.

Figure 25: TooT-REO overview. Given a query protein Q, the psiPAAC encoding is computed and input into three trained SVM classifiers, one for each functional type, namely, receptors, enzymes, and others. The output Pi indicates the probability that the query protein belongs to that class. The predicted class is the class with the highest output probability Pi.

Table 39 shows the details of its cross-validation performance.

4 Comparison with other published work

To the best of our knowledge, there is no other tool that can simultaneously predict all of the functional membrane classes, so we cannot directly compare the performance of the proposed tool with that of a single existing method. Below, we perform two case studies: the first for detecting G protein-coupled receptors (GPCRs) and the second for enzyme detection.

Table 38: Functional class prediction of the different methods

Encoding      Accuracy       MCC
psiPAAC       89.93 ± 0.14   0.8386 ± 0.0022
blastPAAC     89.20 ± 0.20   0.8267 ± 0.0032
psiAAC        89.00 ± 0.06   0.8240 ± 0.0010
psiPseAAC     88.97 ± 0.12   0.8233 ± 0.0019
blastPseAAC   87.81 ± 0.18   0.8043 ± 0.0029
blastAAC      86.90 ± 0.17   0.7898 ± 0.0027
PSSM          86.09 ± 0.24   0.7767 ± 0.0039
PAAC          75.48 ± 0.25   0.6052 ± 0.0042
PseAAC        72.02 ± 0.14   0.5489 ± 0.0024
AAC           70.10 ± 0.26   0.5175 ± 0.0042
For each method, the table presents the accuracy and MCC as the means ± SDs, calculated across the ten runs of the 10-CV, ordered according to the MCC. The first entry (psiPAAC) refers to the best performance, that is, the model chosen for membrane functional prediction.

Table 39: Membrane functional prediction cross-validation performance

Class      Sensitivity    Specificity    Accuracy       MCC
Receptors  87.11 ± 0.32   98.67 ± 0.04   96.47 ± 0.05   0.8793 ± 0.0017
Enzymes    92.10 ± 0.13   93.56 ± 0.12   92.70 ± 0.11   0.8526 ± 0.0023
Others     88.75 ± 0.19   91.20 ± 0.16   90.21 ± 0.14   0.7961 ± 0.0029
Overall                                  89.93 ± 0.14   0.8386 ± 0.0022
Detailed cross-validation performance of the multiclass SVM where the protein samples were encoded using the psiPAAC encoding; this was the best-performing method during the cross-validation procedure.

4.1 Case 1: G protein-coupled receptors (GPCRs)

Previous studies on membrane receptor detection were designed to detect a specific type of receptor, such as whether a membrane protein is a GPCR or a non-GPCR [XWC11] or whether it is an olfactory receptor [HS15]. Since GPCR proteins are frequent targets of therapeutic drugs, we compared the psiPAAC method to the state-of-the-art GPCR-2L predictor [XWC11], which achieved 97.2% accuracy in identifying proteins as GPCRs or non-GPCRs. Considering that our dataset contains a general receptor class and is not specific to GPCRs, we trained a new SVM to distinguish between GPCR and non-GPCR proteins using the psiPAAC encoding on the protein samples from the same dataset as the

GPCR-2L predictor [XWC11]. Table 40 compares the performance. Overall, psiPAAC-SVM outperformed the GPCR-2L method by 0.04 (MCC).

Figure 26: Impact of incorporating different sources of evolutionary information on membrane functional class prediction. Panels: (a) Receptors, (b) Enzymes, (c) Others. The figure shows the improvement in the SVM performance, as measured by the MCC, of each functional class prediction when incorporating different sources of evolutionary information into the baseline encodings (AAC, PAAC, and PseAAC) and compares it to the commonly used encoding that also integrates evolutionary information (PSSM).

Table 40: Comparison with GPCR-2L in GPCR receptor detection

Class: Number of proteins; Correct predictions (GPCR-2L / psiPAAC-SVM); Success rate % (GPCR-2L / psiPAAC-SVM)
GPCRs: 367; 360 / 359; 98.09 / 97.82
Non-GPCRs: 1,101; 1,068 / 1,094; 97.00 / 99.36
Overall: 1,468; 1,428 / 1,453; 97.28 / 98.97
MCC: GPCR-2L 0.93; psiPAAC-SVM 0.97
This table compares the performance of the proposed method (psiPAAC-SVM) trained on the GPCR-2L dataset with the reported performance of the state-of-the-art GPCR-2L predictor [XWC11] trained on the same dataset.

4.2 Case 2: Enzymes

Among the membrane functional classes, enzymes are the most studied, mainly because enzymes are not confined to the membrane and there are many tools available to build powerful prediction models. There are several accessible web servers for general enzyme protein detection; however, none of them are designed for membrane proteins. To see how our model compares to the state-of-the-art tools DEEPre [LWU+18] and EzyPred [SC07] in membrane

enzyme detection, we ran our testing dataset on their web servers and compared the results

to those achieved by our model. Table 41 compares the results and demonstrates that

our psiPAAC-based membrane functional class predictor achieved the highest prediction

accuracy and MCC among the examined enzyme predictors.

Table 41: Comparison with the state-of-the-art enzyme prediction tools

Class        Number of    Correct predictions             Success rate (%)                MCC
             proteins     DEEPre    EzyPred   Proposed     DEEPre    EzyPred   Proposed    DEEPre    EzyPred   Proposed
Enzyme       287          239       233       255          83.28     81.18     88.85       0.75      0.21      0.84
Non-enzyme   374          342       155       355          91.44     41.44     94.92       0.75      0.21      0.84
Overall      661          581       388       610          87.90     58.70     92.28       0.75      0.21      0.84

This table presents the performance of the proposed tool, psiPAAC-SVM, and compares it with the performance of the DEEPre [LWU+18] and EzyPred [SC07] web servers on the independent dataset from Table 10. The proposed tool achieved a higher success rate in both enzyme and non-enzyme detection than either DEEPre or EzyPred.
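As a quick arithmetic check, the success rates in Table 41 follow directly from the counts of correct predictions (correct / number of proteins). The short sketch below, which is not part of the thesis pipeline, recomputes them from the table's own counts.

```python
# Recompute the success rates in Table 41 from the raw prediction counts.
rows = {
    "Enzyme":     (287, {"DEEPre": 239, "EzyPred": 233, "Proposed": 255}),
    "Non-enzyme": (374, {"DEEPre": 342, "EzyPred": 155, "Proposed": 355}),
    "Overall":    (661, {"DEEPre": 581, "EzyPred": 388, "Proposed": 610}),
}
for cls, (n, correct) in rows.items():
    rates = {tool: f"{100.0 * c / n:.2f}%" for tool, c in correct.items()}
    print(cls, rates)
# e.g. Overall -> DEEPre 87.90%, EzyPred 58.70%, Proposed 92.28%
```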

Appendix C

Detailed cross-validation performance in substrate specificity prediction

The following tables show the means ± SDs of the ten different runs of the 10-CV.
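As an illustration of how each table entry is produced, the sketch below aggregates ten hypothetical per-run metric values into a reported mean ± SD. The per-run values and the use of the sample SD (ddof=1) are assumptions for illustration; the thesis's exact SD convention is not restated here.

```python
# Aggregate one metric (e.g., a per-class MCC) over the ten 10-CV runs.
import numpy as np

runs = np.array([0.210, 0.215, 0.208, 0.212, 0.211,
                 0.209, 0.214, 0.207, 0.213, 0.211])  # hypothetical per-run MCCs

mean, sd = runs.mean(), runs.std(ddof=1)  # sample SD across the ten runs
print(f"{mean:.4f} ± {sd:.4f}")           # reported as mean ± SD
```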

Substrate class                Sensitivity     Specificity     Accuracy        MCC
Nonselective                   13.75 ± 3.95    99.59 ± 0.05    96.48 ± 0.20    0.2110 ± 0.0581
Water                          37.92 ± 3.65    99.60 ± 0.06    97.25 ± 0.16    0.4748 ± 0.0372
Inorganic cations              87.08 ± 0.56    61.26 ± 0.88    64.61 ± 0.65    0.3364 ± 0.0124
Inorganic anions               13.80 ± 2.05    98.32 ± 0.31    87.68 ± 0.53    0.1716 ± 0.0349
Organic anions                 32.06 ± 1.57    96.90 ± 0.22    87.18 ± 0.38    0.3060 ± 0.0176
Organo-oxygens                 44.71 ± 2.91    92.42 ± 0.38    80.03 ± 0.61    0.3179 ± 0.0243
Amino acids and derivatives    30.99 ± 2.68    93.32 ± 0.50    79.93 ± 0.79    0.2109 ± 0.0281
Other organonitrogens          33.89 ± 1.79    95.56 ± 0.20    82.73 ± 0.36    0.3022 ± 0.0162
Nucleotides                    41.36 ± 3.98    99.49 ± 0.10    97.32 ± 0.23    0.4724 ± 0.0397
Organic heterocyclics          23.82 ± 4.26    99.37 ± 0.11    95.43 ± 0.25    0.3194 ± 0.0440
Miscellaneous                  10.91 ± 2.01    98.66 ± 0.24    87.22 ± 0.42    0.1539 ± 0.0370
Overall                                                        52.21 ± 0.60    0.3628 ± 0.0091

Table 42: AAC performance

Substrate class                Sensitivity     Specificity     Accuracy        MCC
Nonselective                   12.08 ± 5.36    99.74 ± 0.06    96.84 ± 0.24    0.2176 ± 0.0897
Water                          32.92 ± 4.14    99.51 ± 0.08    97.08 ± 0.19    0.4094 ± 0.0402
Inorganic cations              88.37 ± 0.46    63.65 ± 0.61    67.30 ± 0.53    0.3915 ± 0.0114
Inorganic anions               8.15 ± 1.93     99.21 ± 0.19    88.84 ± 0.44    0.1481 ± 0.0420
Organic anions                 38.76 ± 2.02    97.00 ± 0.38    88.52 ± 0.60    0.3758 ± 0.0233
Organo-oxygens                 48.03 ± 2.41    92.70 ± 0.36    81.55 ± 0.81    0.3578 ± 0.0271
Amino acids and derivatives    32.25 ± 2.25    93.23 ± 0.54    80.75 ± 0.86    0.2259 ± 0.0270
Other organonitrogens          41.94 ± 1.58    96.07 ± 0.25    85.10 ± 0.37    0.3982 ± 0.0133
Nucleotides                    42.73 ± 3.83    99.46 ± 0.06    97.43 ± 0.14    0.4774 ± 0.0313
Organic heterocyclics          22.35 ± 3.16    99.18 ± 0.18    95.27 ± 0.27    0.2804 ± 0.0317
Miscellaneous                  21.62 ± 1.19    98.39 ± 0.17    88.48 ± 0.30    0.2797 ± 0.0149
Overall                                                        54.80 ± 0.76    0.3999 ± 0.0108

Table 43: PseAAC performance

Substrate class                Sensitivity     Specificity     Accuracy        MCC
Nonselective                   10.00 ± 3.51    99.71 ± 0.09    96.95 ± 0.16    0.1830 ± 0.0587
Water                          93.33 ± 2.91    99.99 ± 0.03    99.78 ± 0.10    0.9607 ± 0.0174
Inorganic cations              88.98 ± 0.65    69.21 ± 0.96    71.91 ± 0.56    0.4745 ± 0.0096
Inorganic anions               24.24 ± 2.35    98.46 ± 0.26    90.06 ± 0.43    0.3130 ± 0.0316
Organic anions                 38.76 ± 1.77    96.69 ± 0.40    88.86 ± 0.49    0.3666 ± 0.0193
Organo-oxygens                 54.20 ± 1.51    93.83 ± 0.51    84.65 ± 0.67    0.4447 ± 0.0211
Amino acids and derivatives    47.39 ± 2.25    93.91 ± 0.26    84.40 ± 0.43    0.3815 ± 0.0193
Other organonitrogens          39.58 ± 2.68    95.54 ± 0.33    85.11 ± 0.36    0.3648 ± 0.0194
Nucleotides                    68.64 ± 3.98    99.54 ± 0.10    98.41 ± 0.21    0.6901 ± 0.0371
Organic heterocyclics          27.35 ± 3.41    99.23 ± 0.08    95.85 ± 0.13    0.3390 ± 0.0297
Miscellaneous                  11.01 ± 1.93    98.43 ± 0.20    88.24 ± 0.40    0.1475 ± 0.0358
Overall                                                        58.93 ± 0.45    0.461 ± 0.0069

Table 44: PAAC performance

Substrate class                Sensitivity     Specificity     Accuracy        MCC
Nonselective                   72.08 ± 4.41    99.73 ± 0.04    99.07 ± 0.10    0.7675 ± 0.0300
Water                          95.83 ± 0.00    99.85 ± 0.00    99.73 ± 0.00    0.9376 ± 0.0000
Inorganic cations              93.77 ± 0.28    86.86 ± 0.49    88.45 ± 0.29    0.7748 ± 0.0053
Inorganic anions               59.24 ± 2.31    98.87 ± 0.12    95.48 ± 0.23    0.6610 ± 0.0200
Organic anions                 68.04 ± 2.38    97.98 ± 0.11    95.08 ± 0.22    0.6727 ± 0.0175
Organo-oxygens                 80.83 ± 0.36    98.61 ± 0.22    95.89 ± 0.23    0.8210 ± 0.0095
Amino acids and derivatives    80.07 ± 0.67    97.84 ± 0.17    95.23 ± 0.19    0.7782 ± 0.0081
Other organonitrogens          70.28 ± 2.14    95.81 ± 0.28    92.09 ± 0.26    0.6372 ± 0.0133
Nucleotides                    81.82 ± 0.00    99.90 ± 0.07    99.52 ± 0.09    0.8719 ± 0.0219
Organic heterocyclics          72.35 ± 2.48    99.80 ± 0.13    98.91 ± 0.16    0.8032 ± 0.0277
Miscellaneous                  46.57 ± 1.75    98.24 ± 0.17    93.58 ± 0.23    0.5269 ± 0.0186
Overall                                                        79.84 ± 0.13    0.743 ± 0.0014

Table 45: TMC-AAC performance

Substrate class                Sensitivity     Specificity     Accuracy        MCC
Nonselective                   63.33 ± 5.12    99.78 ± 0.06    98.93 ± 0.14    0.7220 ± 0.0411
Water                          95.83 ± 0.00    99.93 ± 0.00    99.82 ± 0.00    0.9574 ± 0.0000
Inorganic cations              94.49 ± 0.59    84.49 ± 0.49    87.28 ± 0.36    0.7561 ± 0.0072
Inorganic anions               59.57 ± 1.76    99.43 ± 0.09    96.09 ± 0.16    0.7065 ± 0.0143
Organic anions                 64.54 ± 1.55    98.94 ± 0.11    95.80 ± 0.14    0.7070 ± 0.0112
Organo-oxygens                 80.19 ± 0.82    97.92 ± 0.23    95.09 ± 0.21    0.7887 ± 0.0081
Amino acids and derivatives    74.65 ± 1.29    98.55 ± 0.18    95.30 ± 0.23    0.7732 ± 0.0110
Other organonitrogens          72.50 ± 1.40    95.37 ± 0.48    91.88 ± 0.55    0.6390 ± 0.0207
Nucleotides                    32.73 ± 4.18    99.79 ± 0.10    98.41 ± 0.12    0.4780 ± 0.0418
Organic heterocyclics          88.82 ± 4.11    99.87 ± 0.08    99.49 ± 0.15    0.9132 ± 0.0261
Miscellaneous                  53.43 ± 1.30    98.18 ± 0.19    94.03 ± 0.25    0.5781 ± 0.0164
Overall                                                        79.46 ± 0.30    0.7374 ± 0.0038

Table 46: TMC-PseAAC performance

Substrate class                Sensitivity     Specificity     Accuracy        MCC
Nonselective                   75.00 ± 0.00    99.85 ± 0.05    99.30 ± 0.06    0.8185 ± 0.0141
Water                          98.33 ± 2.15    99.85 ± 0.00    99.79 ± 0.05    0.9510 ± 0.0115
Inorganic cations              95.25 ± 0.28    88.26 ± 0.19    90.11 ± 0.14    0.8072 ± 0.0029
Inorganic anions               63.80 ± 1.99    98.80 ± 0.17    95.86 ± 0.23    0.6896 ± 0.0178
Organic anions                 68.04 ± 1.75    97.65 ± 0.06    94.86 ± 0.18    0.6556 ± 0.0146
Organo-oxygens                 83.63 ± 0.67    98.61 ± 0.18    96.35 ± 0.19    0.8397 ± 0.0079
Amino acids and derivatives    82.96 ± 1.28    98.49 ± 0.10    96.34 ± 0.16    0.8257 ± 0.0085
Other organonitrogens          66.39 ± 1.68    96.70 ± 0.27    92.67 ± 0.17    0.6412 ± 0.0084
Nucleotides                    85.45 ± 1.92    99.96 ± 0.06    99.66 ± 0.07    0.9090 ± 0.0185
Organic heterocyclics          83.24 ± 4.81    100.00 ± 0.00   99.50 ± 0.14    0.9096 ± 0.0272
Miscellaneous                  54.34 ± 1.33    98.09 ± 0.16    94.18 ± 0.16    0.5811 ± 0.0110
Overall                                                        81.92 ± 0.12    0.7695 ± 0.0014

Table 47: TMC-PAAC performance

Substrate class                Sensitivity     Specificity     Accuracy        MCC
Nonselective                   70.00 ± 2.64    99.66 ± 0.09    98.93 ± 0.13    0.7365 ± 0.0288
Water                          100.00 ± 0.00   99.85 ± 0.00    99.82 ± 0.00    0.9599 ± 0.0000
Inorganic cations              92.88 ± 0.34    85.96 ± 0.70    87.52 ± 0.40    0.7562 ± 0.0071
Inorganic anions               54.57 ± 1.76    98.98 ± 0.23    95.21 ± 0.35    0.6346 ± 0.0278
Organic anions                 64.43 ± 2.89    97.65 ± 0.17    94.42 ± 0.33    0.6294 ± 0.0245
Organo-oxygens                 83.63 ± 0.31    98.06 ± 0.27    95.67 ± 0.26    0.8168 ± 0.0097
Amino acids and derivatives    82.68 ± 1.34    98.07 ± 0.18    95.75 ± 0.32    0.8049 ± 0.0149
Other organonitrogens          68.54 ± 0.98    96.35 ± 0.31    92.36 ± 0.29    0.6428 ± 0.0115
Nucleotides                    70.45 ± 3.21    99.79 ± 0.11    99.16 ± 0.17    0.7698 ± 0.0415
Organic heterocyclics          70.00 ± 2.32    99.86 ± 0.04    98.90 ± 0.09    0.8000 ± 0.0178
Miscellaneous                  49.39 ± 1.99    98.41 ± 0.17    93.94 ± 0.29    0.5602 ± 0.0229
Overall                                                        79.33 ± 0.24    0.736 ± 0.0035

Table 48: TMC-TCS-AAC performance

Substrate class                Sensitivity     Specificity     Accuracy        MCC
Nonselective                   58.33 ± 0.00    99.65 ± 0.04    98.67 ± 0.04    0.6545 ± 0.0089
Water                          95.83 ± 0.00    99.87 ± 0.04    99.75 ± 0.04    0.9435 ± 0.0096
Inorganic cations              93.84 ± 0.29    85.50 ± 0.48    87.57 ± 0.33    0.7594 ± 0.0061
Inorganic anions               58.59 ± 1.49    98.90 ± 0.15    95.42 ± 0.13    0.6585 ± 0.0092
Organic anions                 62.58 ± 1.46    97.46 ± 0.19    94.05 ± 0.29    0.6061 ± 0.0183
Organo-oxygens                 80.89 ± 0.79    98.67 ± 0.12    95.92 ± 0.11    0.8239 ± 0.0047
Amino acids and derivatives    78.24 ± 1.35    97.84 ± 0.12    94.98 ± 0.18    0.7660 ± 0.0095
Other organonitrogens          68.96 ± 1.39    97.15 ± 0.20    93.16 ± 0.22    0.6752 ± 0.0108
Nucleotides                    76.82 ± 2.58    99.62 ± 0.06    99.06 ± 0.10    0.7618 ± 0.0232
Organic heterocyclics          75.00 ± 2.08    99.85 ± 0.05    99.04 ± 0.08    0.8294 ± 0.0145
Miscellaneous                  48.89 ± 2.44    97.71 ± 0.18    93.16 ± 0.33    0.5158 ± 0.0255
Overall                                                        79.03 ± 0.27    0.7324 ± 0.0037

Table 49: TMC-TCS-PseAAC performance

Substrate class                Sensitivity     Specificity     Accuracy        MCC
Nonselective                   75.00 ± 0.00    99.78 ± 0.00    99.21 ± 0.00    0.7979 ± 0.0000
Water                          95.83 ± 0.00    99.85 ± 0.00    99.74 ± 0.00    0.9376 ± 0.0000
Inorganic cations              95.19 ± 0.47    86.92 ± 0.28    89.36 ± 0.21    0.7936 ± 0.0046
Inorganic anions               64.35 ± 1.97    99.24 ± 0.18    96.38 ± 0.19    0.7252 ± 0.0155
Organic anions                 68.04 ± 0.49    98.40 ± 0.13    95.66 ± 0.14    0.6974 ± 0.0084
Organo-oxygens                 83.44 ± 0.52    98.97 ± 0.12    96.72 ± 0.15    0.8543 ± 0.0066
Amino acids and derivatives    84.08 ± 0.95    98.55 ± 0.16    96.56 ± 0.18    0.8357 ± 0.0085
Other organonitrogens          71.46 ± 0.95    96.84 ± 0.27    93.42 ± 0.22    0.6830 ± 0.0084
Nucleotides                    80.91 ± 1.92    99.98 ± 0.04    99.61 ± 0.05    0.8904 ± 0.0132
Organic heterocyclics          82.35 ± 0.00    100.00 ± 0.00   99.47 ± 0.00    0.9050 ± 0.0000
Miscellaneous                  55.96 ± 1.09    97.95 ± 0.16    94.21 ± 0.20    0.5858 ± 0.0136
Overall                                                        82.53 ± 0.12    0.7772 ± 0.0019

Table 50: TMC-TCS-PAAC performance