Title of Dissertation Goes Here in All Caps
Total Page:16
File Type:pdf, Size:1020Kb
TOWARD STRUCTURAL AND FUNCTIONAL PREDICTIONS FROM BIOLOGICAL SEQUENCES APPROVED BY SUPERVISORY COMMITTEE Nick V. Grishin, Ph.D.: Advisor Zbyszek Otwinowski, Ph.D.: Committee Chair Philip J. Thomas, Ph.D. Daniel Rosenbaum, Ph.D. DEDICATION To those who care TOWARD STRUCTURAL AND FUNCTIONAL PREDICTIONS FROM BIOLOGICAL SEQUENCES by WENLIN LI DISSERTATION Presented to the Faculty of the Graduate School of Biomedical Sciences The University of Texas Southwestern Medical Center at Dallas In Partial Fulfillment of the Requirements For the Degree of DOCTOR OF PHILOSOPHY The University of Texas Southwestern Medical Center at Dallas Dallas, Texas May, 2018 Copyright by Wenlin Li, 2018 All Rights Reserved TOWARD STRUCTURAL AND FUNCTIONAL PREDICTIONS FROM BIOLOGICAL SEQUENCES Publication No. Wenlin Li, Ph.D. The University of Texas Southwestern Medical Center at Dallas, 2018 Supervising Professor: Nick V. Grishin, Ph.D. Biological sequences, including DNA and protein sequences, are believed to encode sufficient information to determine the structure and function of biological molecules, which in turn decide the phenotypic traits of animals. Deciphering the biological sequences is an important and multiscale problem that connecting the information flow from genotypes to phenotypes. Current advances in next-generation sequence technology provided tons of sequencing data, demanding innovations in computational algorithm for better interpretation. I developed computational methodologies to understand the biological sequences in various v levels. In the primary sequence level, I analyzed the evolutionary information encoded in protein families and predicted the function (and active sites) of the proteins. To aid my sequence analysis, I developed a set of computational methodologies and deployed them as public web-servers. In the protein structure level, I studied the plasticity of the 3D structures, as well as demonstrated its effect on the uncertainty of computational scoring algorithms. In the organism level, I innovated the computational methodology to assemble and analyze complete genomes of butterflies and discovered convergence evolution in butterfly wing patterns. In conclusion, I advanced the knowledge of biological sequences in multi-layers by computational approaches. ACKNOWLEDGEMENTS I am so grateful to have Dr. Nick Grishin as my mentor. Nick showed very strong support, which is far more than what I can imagine, for both my research and my personal development. He encouraged and inspired me to become a better man. He was always so patient and tolerant when I made mistakes, and walked me through those dark days. I felt so lucky to be able to do researches based on my personal interests and participated in multiple projects. He is more like the sincerest friend who always be there to have my back. I also thank my committee members, Drs. Phillip Thomas, Denial Rosenbaum, and Zbyszek Otwinowski for kind supports. You guys are so good that I always received more than what I expected. I also thank Dr. Dominika Borek for her patience to explain the naïve problems I asked in great details. Great thanks to everyone in the Grishin lab, particularly Lisa Kinch, Jimin Pei, Qian Cong, Jeremy Semeiks, Bong-Hyun Kim, Jinhui Shen, Jing Zhang and Ming Tang, for their friendship, collaboration, help with many questions, and being a wonderful team of colleagues. I especially acknowledged Dr. Qian Cong, who is the model of successful graduate student I am running after and the closest colleague helping me a lot, Drs. Lisa Kinch and Jimin Pei, who selflessly offer me help in both scientific topics and everyday life. Thanks to everybody in UT Southwestern Medical Center, and particularly, people located in ND10, for maintaining such an open, friendly, and inspiring working environment. Finally, I am grateful to my parents, Zongchi Li and ShuiLi Li, for their constant support and complete trust on my choices in life. Special thanks to my wife, Dr. Fang Zhang, who listens to my words in my sleepless night and embraces me with warm arms. TABLE OF CONTENTS TABLE OF CONTENTS ....................................................................................................... viii PRIOR PUBLICATIONS ......................................................................................................... 1 CHAPTER 1 GENERAL INTRODUCTION .......................................................................... 3 CHAPTER 2 SEQ2REF: A WEB SERVER TO FACILITATE FUNCTIONAL INTERPRETATION............................................................................................................... 11 CHAPTER 3 PCLUST: PROTEIN NETWORK VISUALIZATION HIGHLIGHTING EXPERIMENTAL DATA ...................................................................................................... 29 CHAPTER 4 THE ABC TRANSPORTERS IN CANDIDATUS LIBERIBACTER ASIATICUS ............................................................................................................................ 36 CHAPTER 5 CONSERVED EVOLUTIONARY UNITS IN THE HEME‐COPPER OXIDASE SUPERFAMILY REVEALED BY NOVEL HOMOLOGOUS PROTEIN FAMILIES .............................................................................................................................. 82 CHAPTER 6 ESTIMATION OF UNCERTAINTIES IN THE GLOBAL DISTANCE TEST (GDT_TS) FOR CASP MODELS ........................................................................................ 124 CHAPTER 7 CHSEQ: A DATABASE OF CHAMELEON SEQUENCES ...................... 152 CHAPTER 8 ASSESSMENT OF CASP11 CONTACT‐ASSISTED PREDICTIONS .... 188 CHAPTER 9 GENOMES OF 250 SKIPPER BUTTERFLIES REVEAL RAMPANT CONVERGENCE IN WING PATTERNS .......................................................................... 234 viii PRIOR PUBLICATIONS 1. Li W, Cong Q, Pei J, Kinch LN, Grishin N V. The ABC transporters in Candidatus Liberibacter asiaticus. Proteins Struct Funct Bioinforma 2012;80(11):2614–2628. 2. Cong Q, Kinch LN, Pei J, Shi S, Grishin VN, Li W, Grishin N V. An automatic method for CASP9 free modeling structure prediction assessment. Bioinformatics 2011;27(24):3371–3378. 3. Li W, Kinch LN, Grishin N V. Pclust: protein network visualization highlighting experimental data. Bioinformatics 2013;29(20):2647–2648. 4. Li W, Kinch LN, Karplus PA, Grishin N V. ChSeq: a database of chameleon sequences. Protein Sci 2015;24(7):1075–1086. 5. Pei J*, Li W*, Kinch LN, Grishin N V. Conserved evolutionary units in the heme- copper oxidase superfamily revealed by novel homologous protein families. Protein Sci 2014;23(9):1220–1234. 6. Li W, Schaeffer RD, Otwinowski Z, Grishin N V. Estimation of uncertainties in the global distance test (GDT_TS) for CASP Models. PLoS One 2016;11(5):e0154786. 7. Ji R, Cong Q, Li W, Grishin N V. M2SG: mapping human disease-related genetic variants to protein sequences and genomic loci. Bioinformatics 2013;29(22):2953– 2954. 8. Kinch LN, Li W, Monastyrskyy B, Kryshtafovych A, Grishin N V. Evaluation of free modeling targets in CASP11 and ROLL. Proteins Struct Funct Bioinforma 2016;84(S1):51–66. 1 9. Cong Q, Shen J, Li W, Borek D, Otwinowski Z, Grishin N V. The first complete genomes of Metalmarks and the classification of butterfly families. Genomics 2017. 10. Kinch LN*, Li W*, Monastyrskyy B, Kryshtafovych A, Grishin N V. Assessment of CASP11 Contact-Assisted Predictions. Proteins 2016. 11. Kinch LN, Li W, Schaeffer RD, Dunbrack RL, Monastyrskyy B, Kryshtafovych A, Grishin N V. CASP 11 Target Classification. Proteins 2016. 12. Li W, Cong Q, Kinch LN, Grishin N V. Seq2Ref: a web server to facilitate functional interpretation. BMC Bioinformatics 2013;14(1):30. * Authors contributed equally CHAPTER 1 GENERAL INTRODUCTION Biological sequences, including DNA and protein sequences, are believed to encode all the information needed to determine the phenotype of an organism. Understanding what nature says in such biological words and how the words define the molecular function has drawn board attention in biological sciences. Recent advances in the next-generation sequencing technology generated avalanche of sequencing data and motivated biologists to use computational methods to decipher the sequence information. Achievements have been made in different levels, including predicting protein tertiary structures, predicting the functional site of a protein, and understanding the gene determinants for morphological traits, but yet far from satisfactory. I decoded the information in the biological sequences, both protein and DNA, for structural and functional prediction using computational approaches. To better process the biological sequences, I innovated computational data mining methodologies, which were applied on protein sequences to bring medical insights from evolutionary perspectives. I also studied the conformation ambiguity in the protein structures and was invited to assess the performance of the 11st Critical Assessment of Structure Prediction experiment, the community-wide blind test to evaluate advances in structure prediction. Using the butterfly as the model system for evolution studies, I revolutionized the next- generation sequencing analysis algorithms and discovered the butterfly wing pattern divergence. 3 The size of the protein sequence database has been exponentially increasing due to advances in genome sequencing. However, experimentally characterized proteins only constitute a small portion of the database, such that the majority of sequences have been annotated by computational approaches. Current automatic annotation pipelines