PARAMETER ADVISING FOR MULTIPLE SEQUENCE ALIGNMENT
by
Daniel Frank DeBlasio
Copyright © Daniel Frank DeBlasio 2016
A Dissertation Submitted to the Faculty of the
DEPARTMENT OF COMPUTER SCIENCE
In Partial Fulfillment of the Requirements
For the Degree of
DOCTOR OF PHILOSOPHY
In the Graduate College
THE UNIVERSITY OF ARIZONA
2016
THE UNIVERSITY OF ARIZONA GRADUATE COLLEGE
As members of the Dissertation Committee, we certify that we have read the dissertation prepared by Daniel Frank DeBlasio, entitled Parameter Advising for Multiple Sequence Alignment, and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy.
Date: 15 April 2016 John Kececioglu
Date: 15 April 2016 Alon Efrat
Date: 15 April 2016 Stephen Kobourov
Date: 15 April 2016 Mike Sanderson
Final approval and acceptance of this dissertation is contingent upon the candidate’s submission of the final copies of the dissertation to the Graduate College.
I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement.
Date: 15 April 2016 Dissertation Director: John Kececioglu
STATEMENT BY AUTHOR
This dissertation has been submitted in partial fulfillment of requirements for an advanced degree at the University of Arizona and is deposited in the University Library to be made available to borrowers under rules of the Library.
Brief quotations from this dissertation are allowable without special permission, provided that accurate acknowledgment of the source is made. Requests for permission for extended quotation from or reproduction of this manuscript in whole or in part may be granted by the copyright holder.
SIGNED: Daniel Frank DeBlasio
ACKNOWLEDGEMENTS
The work presented in this dissertation would not have been possible without the influences of so many of the people around me. Most importantly, I would not have been able to do any of the work presented here without the constant support of my advisor, Dr. John Kececioglu. Dr. Kececioglu has supported me since we first met at the RECOMB meeting here in Tucson in 2009. He has both encouraged me in my research and provided emotional (and financial) support along the way. In addition, I would like to thank my dissertation committee: Dr. Alon Efrat, Dr. Stephen Kobourov, and Dr. Michael Sanderson, who was my advisor for my minor in Ecology and Evolutionary Biology. Dr. Kobourov in particular gave me the opportunity to work on projects in his courses that were very much outside the focus of my dissertation, but were nonetheless very helpful to my comprehensive knowledge of computer science.

I would also like to acknowledge the faculty and staff that were part of the Integrative Graduate Education and Research Traineeship (IGERT) in Genomics, in particular Dr. Noah Whiteman and Dr. Matt Sullivan, who allowed me to do rotations in their labs. Additionally, I would like to acknowledge Dr. Mike Barker, who along with Dr. Whiteman advised me in teaching the computational component of their class. I would also like to thank the IGERT students who helped me learn to work very closely with biologists, in particular Dr. Jen Wisecaver and Andy Gloss.

The students of the Department of Computer Science also deserve a large thank you. They have all provided feedback on talks and been open to discussing ongoing problems with projects, even when it was not their specific subfield. Finally, I would like to thank all of my friends and family, who continuously provided support over the years. Most importantly, they helped keep me grounded and reminded me that there is a world outside the walls of Gould-Simpson.

I have received significant financial support throughout my dissertation. My initial funding came from the University of Arizona's National Science Foundation Integrative Graduate Education and Research Traineeship (IGERT) in Genomics, grant number DGE-0654435. Later funding came from NSF grant IIS-1217886 and fellowship funding from the University of Arizona Department of Computer Science. I have received several travel grants from the University of Arizona Graduate and Professional Student Council, the International Society for Computational Biology, the National Science Foundation, and the Association for Computing Machinery. I would also like to acknowledge University of Arizona Research Computing: without an allocation of resources from High Performance Computing (HPC) and High Throughput Computing (HTC), my research would have been more difficult.
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ABSTRACT

CHAPTER 1  Introduction and Background
    1.1  Introduction
    1.2  Parameter advising
    1.3  Survey of related work
        1.3.1  Accuracy estimation
        1.3.2  A priori advising
        1.3.3  Meta-alignment
        1.3.4  Column confidence scoring
    1.4  Review of protein secondary structure
    1.5  Plan of the dissertation

CHAPTER 2  Accuracy Estimation
    2.1  Introduction
    2.2  The estimator
    2.3  Learning the estimator from examples
        2.3.1  Fitting to accuracy values
        2.3.2  Fitting to accuracy differences

CHAPTER 3  The Facet Estimator
    3.1  Introduction
    3.2  Estimator features
        3.2.1  Secondary Structure Blockiness
        3.2.2  Secondary Structure Agreement
        3.2.3  Gap Coil Density
        3.2.4  Gap Extension Density
        3.2.5  Gap Open Density
        3.2.6  Gap Compatibility
        3.2.7  Substitution Compatibility
        3.2.8  Amino Acid Identity
        3.2.9  Secondary Structure Identity
        3.2.10 Average Substitution Score
        3.2.11 Core Column Density
        3.2.12 Information Content
        3.2.13 Results
    3.3  Software

CHAPTER 4  The Optimal Advisor Problem
    4.1  Introduction
    4.2  Learning an optimal advisor
        4.2.1  Optimal Advisor
        4.2.2  Advisor Set
        4.2.3  Advisor Estimator
    4.3  Complexity of learning optimal advisors

CHAPTER 5  Constructing Advisors
    5.1  Introduction
    5.2  Constructing optimal advisors by integer linear programming
        5.2.1  Modeling the Advisor Set Problem
        5.2.2  Finding optimal Oracle Sets
        5.2.3  Modeling the Advisor Estimator Problem
        5.2.4  Modeling the Optimal Advisor Problem
    5.3  Approximation algorithm for learning advisor sets

CHAPTER 6  Parameter Advising for Opal
    6.1  Introduction
    6.2  Experimental methods
    6.3  Comparison of advisor estimators
        6.3.1  Finding an estimator
        6.3.2  Comparing estimators to true accuracy
    6.4  Comparison of advisor sets
        6.4.1  Shared structure across advisor sets
    6.5  Application to parameter advising
        6.5.1  Learning advisor sets by different approaches
        6.5.2  Varying the exact set for the greedy algorithm
        6.5.3  Varying the error tolerance for the greedy algorithm
        6.5.4  Learning advisor sets for different estimators
    6.6  Software
        6.6.1  Opal version 3

CHAPTER 7  Aligner Advising for Ensemble Alignment
    7.1  Introduction
        7.1.1  Related work
    7.2  Constructing the universe for aligner advising
        7.2.1  Determining the universe of aligners
        7.2.2  Determining the universe of parameter settings
    7.3  Evaluating ensemble alignment
        7.3.1  Parameter advising
        7.3.2  Aligner advising
        7.3.3  Comparing ensemble alignment to meta-alignment
        7.3.4  Advising accuracy within difficulty bins
        7.3.5  Generalization of aligner advising
        7.3.6  Theoretical limit on advising accuracy
        7.3.7  Composition of advisor sets
        7.3.8  Running time for advising
    7.4  Software

CHAPTER 8  Adaptive Local Realignment
    8.1  Introduction
    8.2  Adaptive local realignment
        8.2.1  Identifying local realignment regions
        8.2.2  Local parameter advising on a region
        8.2.3  Iterative local realignment
        8.2.4  Combining local with global advising
    8.3  Assessing local realignment
        8.3.1  Effect of local realignment across difficulty bins
        8.3.2  Varying advising set cardinality
        8.3.3  Comparing estimators for local advising
        8.3.4  Effect of iterating local realignment
        8.3.5  Summarizing the effect of adaptive local realignment
        8.3.6  Running time
        8.3.7  Local and global advising in Opal

CHAPTER 9  Predicting Core Columns
    9.1  Introduction
        9.1.1  Related work
    9.2  Learning a coreness estimator
        9.2.1  Representing alignment columns
        9.2.2  Classes of column windows
        9.2.3  The coreness regression function
        9.2.4  Learning the distance function by linear programming
    9.3  Using coreness to improve accuracy estimation
        9.3.1  Creating a new coreness feature
        9.3.2  Augmenting former features by coreness
    9.4  Assessing the coreness prediction
        9.4.1  Constructing the coreness regressor
        9.4.2  Improving parameter advising

CHAPTER 10  Conclusions
    10.1  Further research

REFERENCES
LIST OF FIGURES

1.1  Effect of aligner parameter choice on alignment accuracy
1.2  Calculating alignment accuracy
1.3  Overview of the parameter advising process

3.1  Correlation of features with true accuracy
3.2  Using the Facet tool API
3.3  Using the Facet tool on the command line
3.4  The Facet website

6.1  Correlation of estimators with accuracy
6.2  Advising accuracy of Facet in benchmark bins for small cardinalities
6.3  Advising accuracy of Facet in benchmark bins for large cardinality
6.4  Advising using exact, greedy, and oracle sets with Facet
6.5  Greedily augmenting exact advisor sets
6.6  Effect of error tolerance on advising accuracy using greedy sets
6.7  Comparing testing and training accuracies of various estimators
6.8  Comparing estimators on benchmarks with at least four sequences
6.9  Comparing all estimators on greedy advisor sets

7.1  Overview of the ensemble alignment process
7.2  Effect of aligner choice on alignment accuracy
7.3  Accuracy of parameter advising using Facet
7.4  Aligner advising and parameter advising using Facet
7.5  Aligner advising and parameter advising using TCS
7.6  Comparison of ensemble and meta-alignment
7.7  Accuracy of default aligner advising across difficulty bins
7.8  General and default aligner advising
7.9  Aligner advising accuracy using a perfect estimator

8.1  Impact of adaptive local realignment on alignment accuracy
8.2  Overview of the local advising process
8.3  Effect of local advising across difficulty bins
8.4  Effect of local advising across set cardinalities
8.5  Effect of various estimators used for local advising

9.1  Fit of coreness regressor to average true coreness
9.2  Correlation of estimated and true count of core columns
9.3  Advising using oracle sets with modified estimators
9.4  Advising using greedy sets with modified estimators
9.5  Advising using Facet with predicted coreness
LIST OF TABLES

6.1  Greedy Advisor Sets for Opal Using Facet
6.2  Composition of Advisor Sets at Different Cardinalities k
6.3  Number of Folds Where Greedy and Exact Sets Share Parameters

7.1  Universe of Parameter Settings for General Aligner Advising
7.2  Greedy Default and General Advising Sets

8.1  Summary of Local Realignment on Default Alignments
ABSTRACT
The problem of aligning multiple protein sequences is essential to many biological analyses, but most standard formulations of the problem are NP-complete. Due to both the difficulty of the problem and its practical importance, there are many heuristic multiple sequence aligners that a researcher has at their disposal. A basic issue that frequently arises is that each of these alignment tools has a multitude of parameters that must be set, and which greatly affect the quality of the alignment produced. Most users rely on the default parameter setting that comes with the aligner, which is optimal on average, but can produce a low-quality alignment for the given inputs.

This dissertation develops an approach called parameter advising to find a parameter setting that produces a high-quality alignment for each given input. A parameter advisor aligns the input sequences for each choice in a collection of parameter settings, and then selects the best alignment from the resulting alignments produced. A parameter advisor has two major components: (i) an advisor set of parameter choices that are given to the aligner, and (ii) an accuracy estimator that is used to rank alignments produced by the aligner. Alignment accuracy is measured with respect to a known reference alignment; in practice, a reference alignment is not available, and we can only estimate accuracy. We develop a new accuracy estimator that we call Facet (short for "feature-based accuracy estimator") that computes an accuracy estimate as a linear combination of efficiently-computable feature functions, whose coefficients are learned by solving a large-scale linear programming problem. We also develop an efficient approximation algorithm for finding an advisor set of a given cardinality for a fixed estimator; this cardinality should ideally be small, as the aligner is invoked for each parameter choice in the set.

Using Facet for parameter advising boosts advising accuracy by almost 20% beyond using a single default parameter choice for the hardest-to-align benchmarks. This dissertation further applies parameter advising in two ways: (i) to ensemble alignment, which uses the advising process on a collection of aligners to choose both the aligner and its parameter settings, and (ii) to adaptive local realignment, which can align different regions of the input sequences with distinct parameter choices to conform to mutation rates as they vary across the lengths of the sequences.
CHAPTER 1
Introduction and Background
Overview
While multiple sequence alignment is an essential step in many biological analyses, all of the standard formulations of the problem are NP-complete. As a consequence, many heuristic aligners are used in practice to construct multiple sequence alignments. Each of these aligners contains a large number of tunable parameters that greatly affect the accuracy of the alignment produced. In this chapter, we introduce the concept of a parameter advisor, which selects a setting of parameter values for a particular set of input sequences.
1.1 Introduction
The problem of aligning a set of protein sequences is a critical step for many biological analyses, including creating phylogenetic trees, predicting secondary structure, homology modeling of tertiary structure, and many others. One issue is that while we can find optimal alignments of two sequences in polynomial time (Needleman and Wunsch, 1970), all of the standard formulations of the multiple sequence alignment problem are NP-complete (Wang and Jiang, 1994; Kececioglu and Starrett, 2004). Due to the importance of multiple sequence alignment and its complexity, it is an active field of research.
A multiple sequence alignment of a set of sequences {S_1, S_2, ..., S_k} is a k × ℓ matrix of characters, where row i in the matrix contains the characters of sequence S_i, in order, possibly with gap characters inserted. The length ℓ of the alignment is at least the length of the longest sequence, so ℓ ≥ max_{1 ≤ i ≤ k} |S_i|. Characters from two sequences are said to be aligned when they appear in the same column of the alignment. When the two aligned characters are the same, the pair is called an identity; otherwise it is a mismatch. Collectively, identities and mismatches are called substitutions.

The unweighted edit distance between two sequences is defined as the minimum number of single-character edits needed to transform one sequence into the other. The edit value of an alignment is its total number of inserted gap characters and mismatches. For two sequences, an optimal alignment of minimum value can be found using dynamic programming (Needleman and Wunsch, 1970). Finding an optimal alignment of more than two sequences is NP-complete (Wang and Jiang, 1994). For multiple sequence alignment, many heuristic approaches have been developed that typically use one of two common objectives. The sum-of-pairs score (SPS) of a multiple sequence alignment is the sum of the values of its induced pairwise alignments. Alternately, the tree-alignment objective is the sum of pairwise alignment values along all of the branches of an input phylogenetic tree, minimized over all possible choices of ancestral sequences.

The number of alignment tools, or aligners, available for finding multiple sequence alignments continues to grow because of the need for high-quality alignments. Many methods have been published that produce multiple sequence alignments using various heuristic strategies to deal with the problem's complexity. The most popular general method is progressive alignment, which aligns sequences using a guide tree (a binary tree where the leaves are the input sequences; Feng and Doolittle, 1987). Starting with two neighboring leaves, a progressive aligner will optimally align these two sequences and replace the subtree that contained only these sequences by the alignment of the two sequences. The progressive aligner then repeats this process in a bottom-up manner, at each step aligning two of the remaining leaves (where some leaves may now contain sub-alignments). In this way a progressive aligner is only ever aligning two sequences, or alignments, to each other. This strategy has been used successfully for general alignment methods such as Clustal (Thompson et al., 1994), MAFFT (Katoh et al., 2002), Muscle (Edgar, 2004a,b), Kalign (Lassmann and Sonnhammer, 2005a), and Opal (Wheeler and Kececioglu, 2007). Progressive alignment strategies have also been successfully applied in specialized alignment tools, such as those for whole-genome alignment like Mauve (Darling et al., 2004), and those for RNA-specific alignment like PMFastR (DeBlasio et al., 2009, 2012a) and mLocARNA (Will et al., 2007). Other aligners use consistency information from a library of two-sequence alignments, such as T-Coffee (Notredame et al., 2000), or collect sequence information into an HMM, as in PROMALS (Pei and Grishin, 2007). For most of the studies presented in this dissertation we focus on the Opal aligner, but we will later consider other aligners.

For the user, choosing an aligner is only a first step in producing a multiple sequence alignment for analysis. Each tool has a set of parameters whose values can greatly impact the quality of the computed alignment.
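To make the sum-of-pairs objective defined above concrete, here is a minimal Python sketch, under an assumed encoding of an alignment as a list of equal-length strings with '-' for gaps; the function names are illustrative, not taken from any aligner.

```python
# A minimal sketch of the sum-of-pairs (SPS) objective described above.
# An alignment is a list of equal-length strings over the amino acid
# alphabet plus '-' for gaps; the encoding and names are illustrative.

from itertools import combinations

def pairwise_edit_value(row_a: str, row_b: str) -> int:
    """Edit value of the pairwise alignment induced by two rows:
    the number of mismatches plus the number of gap characters,
    after dropping columns where both rows have a gap."""
    value = 0
    for x, y in zip(row_a, row_b):
        if x == '-' and y == '-':
            continue          # column absent from the induced alignment
        if x == '-' or y == '-':
            value += 1        # an inserted gap character
        elif x != y:
            value += 1        # a mismatch
    return value

def sum_of_pairs_value(alignment: list[str]) -> int:
    """Sum of the edit values of all induced pairwise alignments."""
    return sum(pairwise_edit_value(a, b)
               for a, b in combinations(alignment, 2))

alignment = ["AC-GT",
             "ACGGT",
             "A--GT"]
print(sum_of_pairs_value(alignment))   # 4
```

Real aligners score substitutions with a matrix and gaps with affine penalties, as discussed below; this sketch uses the unweighted edit value for simplicity.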
The alignment parameters for protein sequences typically consist of the gap-open and gap-extension penalties, as well as the choice of substitution penalties for each pair of the 20 amino acids, but the available tunable parameters can differ greatly between aligners. A parameter choice for an aligner is an assignment of values to all of the alignment parameters. Work has been done (Kim and Kececioglu, 2008) to efficiently find the optimal parameter choice for an aligner that yields the highest-accuracy alignments on average across a set of training benchmarks. This particular parameter choice would be the optimal default parameter choice. While such a default parameter choice works well in general, it can produce very low-accuracy alignments for some benchmarks. Figure 1.1 shows the effect of aligning the same set of five sequences under two different alignment parameter choices, one of which is the optimal default choice.

Setting the 214 parameters for the standard protein alignment model is made easier by the fact that amino acid substitution scores are well studied. Generally one of three substitution matrix families is used for alignment: PAM (Dayhoff et al., 1978), BLOSUM (Henikoff and Henikoff, 1992), and VTML (Müller et al., 2002), but others also exist (Kuznetsov, 2011). Recent work has shown that the highest-accuracy alignments are generally produced using BLOSUM and VTML matrices, so these are the only ones we consider (Edgar, 2009).

(a) Higher Accuracy Alignment
(b) Lower Accuracy Alignment
Figure 1.1: (a) Part of an alignment of benchmark sup 155 from the SABRE (Van Walle et al., 2005) suite computed by Opal (Wheeler and Kececioglu, 2007) using non-default parameter choice (VTML200, 45, 6, 40, 40); this alignment has accuracy value 75.8%, and Facet (Kececioglu and DeBlasio, 2013) estimator value 0.423. (b) Alignment of the same benchmark by Opal using the optimal default parameter choice (BLSM62, 65, 9, 44, 43); this alignment has lower accuracy 57.3%, and lower Facet value 0.402. In both alignments, the positions that correspond to core blocks of the reference alignment, which should be aligned in a correct alignment, are highlighted in bold.

We attempt to select a parameter choice that is best for a given input set of sequences (rather than on average) using an approach we call parameter advising, which we describe in the next section.
1.2 Parameter advising
The true accuracy of a computed alignment is measured as the fraction of substitutions that are also present in core columns of a reference alignment for these sequences. (Reference alignments represent the "correct" alignment of the sequences.) These reference alignments for protein sequences are typically constructed by aligning the three-dimensional structures of the proteins. Core columns of this reference alignment, on which we measure accuracy, are those sections where the aligned amino acids from all of the sequences are all mutually close in three-dimensional space. Figure 1.2 shows an example of computing true accuracy for a computed alignment.

(a) Reference alignment   (b) Computed alignment
Figure 1.2: A section of a reference alignment and a computed alignment. Accuracy of a computed alignment is measured with respect to a known reference alignment. We primarily use the sum-of-pairs accuracy measure, which is the fraction of aligned residue pairs from the core columns of the reference alignment that are recovered in the computed alignment. In the example, the accuracy is 66%, because 4 of the 6 aligned residue pairs in core columns of the reference alignment are recovered in the computed alignment.

What we have described and use throughout this dissertation is the sum-of-pairs definition of alignment accuracy. Another definition of multiple sequence alignment accuracy is known as "total-column" accuracy: the fraction of core columns from the reference multiple sequence alignment that are completely recovered in the computed alignment. For the example in Figure 1.2, the sum-of-pairs accuracy is 66%, but the total-column accuracy is only 50%: even though only one amino acid from a core column is out of place in the computed alignment, this misaligns the whole column, so only one of the two core columns is recovered. While arguments can be made for the merits of both the total-column and sum-of-pairs accuracy measurements, the total-column measure is more sensitive to small errors in the alignment. This is why we use the more fine-grained sum-of-pairs measure in this dissertation.

In the absence of a known reference alignment, we are left to estimate the accuracy of a computed alignment. Estimating the accuracy of a computed multiple sequence alignment (namely, how closely it agrees with the correct alignment of its sequences), without actually knowing the correct alignment, is an important problem.
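To make these two measures concrete, the following Python sketch computes both, under the same assumed alignment encoding as before, with core columns given as indices into the reference alignment; the names are illustrative.

```python
# Illustrative sketch of the sum-of-pairs and total-column accuracy
# measures described above. An alignment is a list of equal-length
# strings with '-' for gaps; core_cols holds the column indices of the
# reference alignment's core columns. Encoding and names are assumptions.

from itertools import combinations

def column_contents(alignment):
    """For each column, the set of residues it aligns; a residue is
    identified by (sequence index, position in the ungapped sequence)."""
    counts = [0] * len(alignment)      # residues seen so far, per sequence
    columns = []
    for col in range(len(alignment[0])):
        residues = frozenset((i, counts[i])
                             for i, row in enumerate(alignment)
                             if row[col] != '-')
        columns.append(residues)
        for i, row in enumerate(alignment):
            if row[col] != '-':
                counts[i] += 1
    return columns

def aligned_pairs(columns):
    """All residue pairs placed together in some column."""
    return {frozenset(pair)
            for residues in columns
            for pair in combinations(residues, 2)}

def sum_of_pairs_accuracy(reference, core_cols, computed):
    ref_cols = column_contents(reference)
    ref_pairs = aligned_pairs(ref_cols[c] for c in core_cols)
    comp_pairs = aligned_pairs(column_contents(computed))
    return len(ref_pairs & comp_pairs) / len(ref_pairs)

def total_column_accuracy(reference, core_cols, computed):
    """Fraction of core columns completely recovered in one computed column."""
    ref_cols = column_contents(reference)
    comp_cols = set(column_contents(computed))
    return sum(1 for c in core_cols if ref_cols[c] in comp_cols) / len(core_cols)
```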
A good accuracy estimator has very broad utility: for example, from building a meta-aligner that selects the most accurate alignment output by a collection of aligners, to boosting the accuracy of a single aligner by choosing values for the parameters of its alignment scoring function to yield the best output from that aligner.

Given an accuracy estimator E and a set P of parameter choices, a parameter advisor A tries each parameter choice p ∈ P, invokes an aligner to compute an alignment A_p using parameter choice p, and then selects the parameter choice p* that has maximum estimated accuracy E(A_{p*}). Figure 1.3 shows a diagram of the parameter advising process; a short code sketch of this selection loop appears after the list below.

Figure 1.3: Overview of the parameter advising process, from unaligned input sequences to the chosen aligned sequences. For the Opal aligner, a parameter choice consists of gap penalties γ_E, γ_I, λ_E, λ_I as well as the substitution scoring matrix σ. A candidate alignment is generated for each parameter choice, so the advisor set should be small. An accuracy estimator labels each candidate alignment with an accuracy estimate. Finally, the alignment with the highest estimated accuracy is chosen by the advisor.

An advisor has two crucial elements:
(1) the advisor estimator, which estimates the accuracy of a computed alignment, and which the advisor will use to choose between alternate alignments, and
(2) the advisor set, which is the set of parameter choices that is given to the aligner to produce the alternate alignments that the advisor will choose among.
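A minimal sketch of this advising loop follows, assuming hypothetical functions align (for instance, an invocation of Opal under a given parameter choice) and estimate_accuracy (for instance, Facet); both names are stand-ins, not actual APIs.

```python
# A minimal sketch of a parameter advisor, assuming hypothetical
# functions align(sequences, choice) -> alignment and
# estimate_accuracy(alignment) -> float in [0, 1].

def advise(sequences, advisor_set, align, estimate_accuracy):
    """Return the alignment with highest estimated accuracy over all
    parameter choices in the advisor set."""
    candidates = [align(sequences, choice) for choice in advisor_set]
    return max(candidates, key=estimate_accuracy)
```

Note the aligner is invoked once per parameter choice, so the running time grows linearly with the advisor set's cardinality; this is why small advisor sets are preferred.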
We say that an advisor’s accuracy on a set of input sequences is the true accuracy of the alignment obtained using the parameter choice selected from the advisor set with highest estimated accuracy. We develop a new advisor estimator we call Facet (feature-based accuracy estimator) in Chapters2 and3. Our accuracy estimator is a linear combination of efficiently-computable feature functions. We describe the framework for the estima- tor and the methods for finding its coefficients in Chapter2. We find the estimator coefficients using mathematical optimization, linear programming (LP), to identify coefficients that when used in the estimator can distinguish high accuracy alignments from low. The feature functions are measures of some aspect of an alignment that is easily computable and has a bounded value. iThe description of how to use linear programming to find an estimator as well the description of the feature functions used in Facet are described in Chapter3. To create a parameter advisor we also need to be able to find advisor sets that are of small cardinality (since the advisor will invoke the aligner for each of the parameter choices in the set) and give the best advising accuracy. An advisor set is a subset of the parameter universe, which is the enumeration of all possible combinations of settings for all of the parameters. We find advisor sets both for the oracle estimator (one that always returns the true accuracy of an alignment) and for a fixed estimator in Chapter5. While finding optimal advisor sets is NP- complete, we can find optimal sets of constant cardinality in polynomial time using exhaustive search. To find advising sets of any cardinality we give a polynomial- ` time k -approximation algorithm for finding an advisor set of cardinality k when provided an initial optimal solution of constant size ` < k. 21
The problem of finding an optimal advisor is to simultaneously find the advisor set and advisor estimator that together yield a parameter advisor with the highest possible advising accuracy. This problem can be formulated as an integer linear program (ILP), which can be restricted to find optimal advisor sets for a fixed estimator, or an optimal advisor estimator for a fixed set. Solving the ILP is intractable in practice, even for very small training sets and using the restrictions described. Finding an optimal advisor is NP-complete (see Chapter 4), as are the problems of finding an optimal advisor set and an optimal advisor estimator (the two restrictions to the ILP).

To learn an advisor, we collect a training set of example alignments whose true accuracy is known, and find estimator coefficients, and advising sets for the estimator, that give the best advising accuracy. We form the training set by the following steps (a short code sketch follows the list):
(1) collecting reference alignments from standard suites of benchmark protein multiple alignments;
(2) for each such reference alignment, calling a multiple sequence aligner on the reference alignment’s sequences with all parameter choices in the universe, producing a collection of alternate alignments; and
(3) labeling each such alternate alignment by its true accuracy with respect to the reference alignment for its sequences.
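The following sketch outlines steps (1)-(3); benchmarks, align, and true_accuracy are hypothetical stand-ins for a benchmark suite, an aligner invocation, and the accuracy computation against a reference.

```python
# Sketch of training-set construction, following steps (1)-(3) above.
# benchmarks, align, and true_accuracy are hypothetical stand-ins.

def build_training_set(benchmarks, universe, align, true_accuracy):
    """Return a list of (alignment, accuracy) examples: every benchmark's
    sequences aligned under every parameter choice in the universe."""
    examples = []
    for reference in benchmarks:                          # step (1)
        for choice in universe:                           # step (2)
            alternate = align(reference.sequences, choice)
            label = true_accuracy(alternate, reference)   # step (3)
            examples.append((alternate, label))
    return examples
```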
We use suites of benchmarks for which the reference alignments are obtained by structural alignment of the proteins using their known three-dimensional structures. The alternate alignments, together with their labeled accuracies, form the examples in our training set. Chapter 6 describes these examples in detail and experimentally demonstrates the increase in accuracy resulting from using our new advisor.

Chapters 7 and 8 show results on using the advising process for ensemble alignment (choosing both an aligner and its parameters, in Chapter 7) and adaptive local realignment (realigning regions under different parameter choices, in Chapter 8).

Since true accuracy is only measured on the core columns of an alignment, identifying these columns could boost the accuracy of our estimator, and hence our advisor. Chapter 9 describes a method to predict how much of a column in a computed alignment is from core columns of an unknown reference alignment, using a variant of nearest-neighbor classification. Finally, Chapter 10 provides a summary of our work and future directions for research.
1.3 Survey of related work
Parameter advising as described earlier can be called a posteriori advising: that is, advising on a parameter choice after seeing the resulting computed alignments. To our knowledge, this is the first successful method for selecting alignment parameter values for a given input by choosing among computed alignments. Work related to parameter advising can be divided into four major categories:
(i) accuracy estimation, which attempts to provide a score for an alignment, similar to the score produced by Facet,
(ii) a priori advising, which attempts to predict good parameter values for an aligner from unaligned sequences, as opposed to examining alignment accuracy after an alignment is generated,
(iii) meta-alignment, which takes the output of multiple alignment methods that are known to work well, and combines together segments of those alignments, and
(iv) column confidence scoring, which gives a confidence score to each column in an alignment rather than the alignment as a whole, and can be used to exclude low-confidence regions of the alignment from further analysis.
Work from each of these categories is described below.
1.3.1 Accuracy estimation
Several approaches have been developed for assessing the quality of a computed alignment without knowing a reference alignment for its sequences. These approaches follow two general strategies for estimating the accuracy with which a computed alignment recovers the unknown correct alignment.[1]

The first general strategy, which we call scoring-function-based, is to develop a new scoring function on alignments that ideally is correlated with accuracy (see Notredame et al., 1998; Thompson et al., 2001; Pei and Grishin, 2001; Ahola et al., 2006, 2008). These scoring functions combine local attributes of an alignment into a score, and typically include a measure of the conservation of amino acids in alignment columns (Pei and Grishin, 2001; Ahola et al., 2006). The second general strategy, which we call support-based, is to:
(a) examine a collection of alternate alignments of the same sequences, where the collection can be generated by changing the method used for computing the alignment, or by changing the input to a method; and then
(b) measure the support for the original computed alignment among the collection of alternate alignments
(See Lassmann and Sonnhammer, 2005b; Landan and Graur, 2007; Penn et al., 2010; Kim and Ma, 2011.) In this strategy, the support for the computed alignment, which essentially measures the stability of the alignment to changes in the method or input, serves as a proxy for accuracy.
Scoring-function-based approaches
Approaches that assess alignment quality via a scoring function include COFFEE (Notredame et al., 1998), AL2CO (Pei and Grishin, 2001), NorMD (Thompson et al., 2001), PredSP (Ahola et al., 2008), and StatSigMa (Prakash and Tompa, 2009). Several recently developed methods also consider protein tertiary (3-dimensional) structure; due to the limitations this imposes on our benchmark set we do not compare our method with these, but they include iRMSD (Armougom et al., 2006), STRIKE (Kemena et al., 2011), and an LS-SVM approach (Ortuño et al., 2013). We briefly describe each in turn.

COFFEE (Notredame et al., 1998) evaluates a multiple alignment by realigning its sequences pairwise; using the matches in all these pairwise alignments to determine transformed substitution scores for pairs of residues[2] in the columns of the multiple alignment, where these position-dependent transformed scores are in the range [0, 1]; accumulating the weighted sum of scores of all induced pairwise alignments in the multiple alignment without penalizing gaps, where substitutions are evaluated using the above transformed scores; and finally normalizing by the weighted sum of the lengths of all induced pairwise alignments. COFFEE is a component of the T-Coffee alignment package (Notredame et al., 2000). Updated versions of the COFFEE estimator have been published under a new name, TCS (Chang et al., 2014), that use an updated library of pairwise alignments but follow the same basic principles to construct an estimate of alignment accuracy.

AL2CO (Pei and Grishin, 2001) uses conservation measures on alignment columns that are based on weighted frequencies of the amino acids in the column, and assesses an alignment by averaging this measure over all its columns.

NorMD (Thompson et al., 2001) develops an elaborate alignment scoring function that transforms standard amino acid substitution scores on pairs of aligned residues into a geometric substitution score defined on a 20-dimensional Euclidean space; takes a weighted average of all these substitution scores in a column; transforms this average substitution score through exponential decay; sums these transformed scores across columns; then includes affine gap penalties (Gotoh, 1982) and Wilbur-Lipman hash scores (Wilbur and Lipman, 1983) for normalization. NorMD is used in several systems, including RASCAL (Thompson et al., 2003), LEON (Thompson et al., 2004), and AQUA (Muller et al., 2010).

[1] Here correctness can be either in terms of the unknown structural alignment (as in our present work on protein sequence alignment), or the unknown evolutionary alignment (as in simulation studies).

[2] A residue is a position in a protein sequence together with the amino acid at that position.
PredSP (Ahola et al., 2008) fits a beta distribution to the true accuracies associated with a sample of alignments, where the mean and variance of the distribution are transforms of a linear combination of four alignment features. The features they use are sequence percent identity, number of sequences, alignment length, and a conservation measure that is the fraction of residues in conserved columns as identified by a statistical model that takes into account amino acid background probabilities and substitution probabilities (Ahola et al., 2006). The accuracy that is predicted for the alignment is essentially the mean of the fitted beta distribution; a predicted confidence interval on the accuracy can also be quoted from the fitted distribution.

StatSigMa (Prakash and Tompa, 2009) scores an input alignment based on a phylogenetic tree, where the tree can be given by the user or based on an alignment of a second set of sequences with the same labels. A scoring function is generated based on how well the alignment fits the tree. They then test the probability of each branch in the tree given the test alignment using Karlin-Altschul (Karlin and Altschul, 1990) alignment statistics (the same statistics used for BLAST (Altschul et al., 1990) homology search). The p-value assigned to the alignment is then the maximum p-value over all branches of the tree.

iRMSD (Armougom et al., 2006) uses known tertiary structure that has been assigned to all sequences in the alignment. For each pair of sequences, and for each pair of columns, it compares the distance between the pair of columns in each protein. This difference in tertiary distances is summed and weighted to generate a score for an alignment.

STRIKE (Kemena et al., 2011) scoring uses a generated amino acid replacement matrix that scores based on how often two amino acids are in contact in tertiary structure. The scoring algorithm infers the tertiary structure of a multiple sequence alignment from the known structure of a single protein in that alignment. It then examines the pairs of columns that are in contact (close in 3-D space) in the tertiary structure, and sums the STRIKE matrix score for each sequence's amino acid pairs at those columns in the alignment.
The LS-SVM approach (Ortuño et al., 2013) uses a feature-based estimator strategy similar to Facet. They have developed 14 feature functions for an alignment; these functions each output a single numerical value and are combined to get a final accuracy estimate for an alignment. The functions used in the LS-SVM approach rely on tertiary structure and additional information about the protein sequences obtained by querying the PDB (Berman et al., 2000), PFam (Finn et al., 2009), UniProt (Apweiler et al., 2004), and Gene Ontology (GO; Camon et al., 2004) databases, which means these databases must be available at all times. As this method makes use of tertiary structure annotations of the proteins, it severely reduces the universe of analyzable sequences. The calculated features are fed into a least-squares support vector machine (LS-SVM) that has been trained to predict accuracy.
Support-based approaches
Approaches that assess alignment quality in terms of support from alternate alignments include MOS (Lassmann and Sonnhammer, 2005b), HoT (Landan and Graur, 2007), GUIDANCE (Penn et al., 2010), and PSAR (Kim and Ma, 2011). We briefly summarize each below.

MOS (Lassmann and Sonnhammer, 2005b) takes a computed alignment together with a collection of alternate alignments of the same sequences, and over all residue pairs aligned by the computed alignment, measures the average fraction of alternate alignments that also align the residue pair. In other words, MOS measures the average support for the substitutions in the computed alignment by other alignments in the collection.

HoT (Landan and Graur, 2007) considers a single alternate alignment, obtained by running the aligner that generated the computed alignment on the reverse of the sequences and reversing the resulting alignment, and reports the MOS value of the original alignment with respect to this alternate alignment.

GUIDANCE (Penn et al., 2010) assumes the computed alignment was generated by a so-called progressive aligner that uses a guide tree, and obtains alternate alignments by perturbing the guide tree and reinvoking the aligner. GUIDANCE reports the MOS value of the original alignment with respect to these alternate alignments.

PSAR (Kim and Ma, 2011) generates alternate alignments by probabilistically sampling pairwise alignments of each input sequence versus the pair-HMM obtained by collapsing the original alignment without the input sequence. PSAR reports the MOS value of the original alignment with respect to these alternates.

Note that in contrast to other approaches, HoT and GUIDANCE require access to the aligner that computed the alignment. They essentially measure the stability of the aligner to sequence reversal or guide tree alteration. Note also that scoring-function-based approaches can estimate the accuracy of a single alignment, while support-based approaches inherently require a set of alignments.
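Since the MOS value recurs in several of the methods above, a brief sketch may be helpful; it assumes residue pairs are encoded as in the accuracy sketch from Section 1.2, and the names are illustrative.

```python
# Sketch of the MOS support score described above: the average, over all
# residue pairs aligned by the computed alignment, of the fraction of
# alternate alignments that also align that pair. Residue pairs are
# encoded as in the earlier accuracy sketch; names are illustrative.

def mos_score(computed_pairs, alternate_pair_sets):
    """computed_pairs: residue pairs aligned by the computed alignment;
    alternate_pair_sets: one set of residue pairs per alternate alignment."""
    if not computed_pairs or not alternate_pair_sets:
        return 0.0
    support = sum(sum(1 for alt in alternate_pair_sets if pair in alt)
                  for pair in computed_pairs)
    return support / (len(computed_pairs) * len(alternate_pair_sets))
```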
1.3.2 A priori advising
As opposed to examining alignment accuracy after an alignment is generated, a priori advising attempts to make a prediction about an aligner's output from just the unaligned sequences. Such methods include AlexSys (Aniba et al., 2010), PAcAlCI (Ortuño et al., 2012), GLProbs (Ye et al., 2015), and FSM (Kuznetsov, 2011).
AlexSys (Aniba et al., 2010) uses a decision tree to classify the input sequences and identify which aligner should be used. At each step it tests the sequences' pairwise identity, sequence length differences, hydrophobicity characteristics, or PFam information to find relationships between sequences in the input.

PAcAlCI (Ortuño et al., 2012) uses methods similar to Facet and the LS-SVM approach described earlier, removing the features that rely on the alignment itself. Again, features are combined using an LS-SVM. By querying multiple databases and finding similarity in tertiary structure, PAcAlCI predicts the alignment accuracy of several major alignment tools (under default parameters).

GLProbs (Ye et al., 2015) uses average pairwise percent identity to determine which hidden Markov model (HMM) to use for alignment. While the actual difficulty assessment is simple, it allows the HMM parameters to be specific to the similarity of the particular sequences being aligned.

FSM (Kuznetsov, 2011) uses BLAST to find which family of SABmark benchmark sequences is most similar to the input sequences. It then recommends a substitution matrix that is specially tailored to these families. While BLAST is relatively fast, the applicability of this method is restricted to a narrow range of input sequences.
1.3.3 Meta-alignment
Meta-alignment uses alignments output from several aligners to construct a new alignment. Here the final alignment has aspects of the input alignments but, in contrast to other advising methods, is not necessarily the output of any single aligner. Such methods include ComAlign (Bucka-Lassen et al., 1999), M-Coffee (Wallace et al., 2006), Crumble and Prune (Roskin et al., 2011), MergeAlign (Collingridge and Kelly, 2012), and AQUA (Muller et al., 2010).

ComAlign (Bucka-Lassen et al., 1999) identifies paths through the standard m-dimensional dynamic programming table (which in principle would yield the optimal multiple sequence alignment of the m input sequences) that correspond to each of the candidate input multiple sequence alignments. It then finds points where these paths intersect, and constructs a consensus alignment by combining the highest-scoring regions of these paths.

M-Coffee (Wallace et al., 2006) uses several alignment programs to generate pairwise alignment libraries. It then uses this library (rather than simply the optimal pairwise alignments) to run the T-Coffee algorithm. T-Coffee produces multiple alignments by aligning pairs of alignments to maximize the support from the library of all matching pairs of characters in each column. In this way it attempts to find the alignment with the most support from the other aligners.

Crumble and Prune (Roskin et al., 2011) computes large-scale alignments by splitting the input sequences both vertically (by aligning subsets of sequences) and horizontally (by aligning substrings). The Crumble procedure finds similar substrings and uses any available alignment method to align these regions; then, for the overlapping regions between these blocks, it realigns the intersections to generate a full alignment. The Prune procedure splits an input phylogenetic tree into subproblems with a small number of sequences; these subproblems are then replaced by their consensus sequence when parent problems are aligned, allowing large numbers of sequences to be aligned. This method aims to reduce the computational resources needed to align large inputs, as opposed to increasing multiple sequence alignment accuracy.

MergeAlign (Collingridge and Kelly, 2012) generates a weighted directed acyclic graph (DAG), where each vertex represents a column in one of the input alignments, and an edge represents a transition from one column to its neighbor (the column directly to the right) in the same alignment. The weight of each edge is the number of alignments that have that transition. The consensus alignment is constructed as the maximum-weight single-source/single-sink path in the DAG.

AQUA (Muller et al., 2010) chooses between an alignment computed by Muscle (Edgar, 2004b) or MAFFT (Katoh et al., 2005), based on their NorMD (Thompson et al., 2001) score. Chapter 6 shows that for the task of choosing the more accurate alignment, the NorMD score used by AQUA is much weaker than the Facet estimator used here. AQUA can also be viewed as a meta-aligner because it chooses between the outputs of multiple aligners, rather than between parameter choices for a single aligner. Chapter 7 gives results on using Facet in the context of meta-alignment.
1.3.4 Column confidence scoring
In addition to scoring whole alignments, work has been done to identify poorly-aligned regions of alignments. This can help biologists find unreliable homology in an alignment to exclude from further analysis, as in programs like GBLOCKS (Castresana, 2000) and ALISCORE (Misof and Misof, 2009). Many of the accuracy estimators discussed earlier, such as TCS, also provide column-level scoring.

GBLOCKS (Castresana, 2000) identifies columns that are conserved and surrounded by other conserved regions, using only column percent identity. Columns that contain gaps are eliminated, as are runs of conservation that do not meet length cutoffs.

ALISCORE (Misof and Misof, 2009) uses a sliding window across all pairwise alignments, and determines whether the window is statistically different from two random sequences. A column's score is the number of windows containing it that are significantly non-random.

Recently, Ren (2014) developed a new method to classify columns with an SVM. This method passes 5 features of an alignment into the SVM to classify whether or not a column should be used for further analysis. Their study focuses mainly on using alignments for phylogenetic reconstruction.
1.4 Review of protein secondary structure
Multiple sequence alignment benchmarks of protein sequences are normally constructed by aligning the three-dimensional structures (sometimes referred to as tertiary structures) of the folded proteins. Amino acids that are close in 3-D space are considered aligned, and those that are simultaneously very close in all sequences are labeled as core columns of the alignment. The amino acid sequence of a protein is referred to as the primary structure. The secondary structure of a protein is an intermediate between primary and tertiary structure: it labels each amino acid as being in one of three structure classes: α-helix, β-sheet, or coil (other, or no structure). These structural classes tend to be conserved when the function of related proteins is conserved. This dissertation is focused on protein multiple sequence alignment, and we exploit the fact that proteins fold into structures to perform functions within the cell when computing alignments of protein sequences.

The tertiary structure of the proteins in a set of input sequences is not normally known, as determining it usually requires examining the crystalline structure of the protein, which is slow and costly. Instead, we use secondary structure in our accuracy estimator, and to predict the secondary structure we use PSIPRED (Jones, 1999). The output of PSIPRED is not only a label from the 3 secondary structure classes for each amino acid in the sequence, but also a confidence that the position in each sequence is in each structure state. We normalize the confidences so that, for any amino acid, the confidences for all three structure types sum to 1. PSIPRED can make predictions using either the amino acid sequence alone, or by searching through a database of protein domains to find similar sequences using BLAST (Altschul et al., 1990). The accuracy of PSIPRED increases substantially when using a BLAST search, so all of our results shown later are with the version of PSIPRED that searches through the UniRef90 (Suzek et al., 2007) database of protein domains (a non-redundant set of domains from the UniProt database; see The UniProt Consortium, 2007), filtered using the pfilt program provided with PSIPRED.
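As a small illustration of the normalization step just described (the raw confidence values here are made up):

```python
# Normalizing per-residue secondary structure confidences so the three
# structure classes (helix, sheet, coil) sum to 1; values are made up.
raw = {'H': 0.70, 'E': 0.20, 'C': 0.30}     # raw PSIPRED-style confidences
total = sum(raw.values())
normalized = {cls: conf / total for cls, conf in raw.items()}
print(normalized)    # {'H': 0.583..., 'E': 0.166..., 'C': 0.25}
```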
1.5 Plan of the dissertation
Chapter 2 next describes our approach to estimating alignment accuracy as a linear combination of feature functions. It also describes how to find the coefficients for such an estimator.

Chapter 3 describes our estimator Facet (short for "feature-based accuracy estimator"). In particular, the chapter describes the efficiently computable feature functions used in Facet.

Chapter 4 defines the problem of finding an optimal advisor (finding both the advisor set and the advisor estimator coefficients simultaneously). We also consider restrictions to finding just an optimal advisor set, or optimal estimator coefficients. We show that all of these problems are NP-complete.

Chapter 5 details the method we use to find advisor sets for a fixed estimator. While finding an optimal advisor set is NP-complete, we present an efficient approximation algorithm for finding near-optimal sets that perform well in practice. The chapter also describes an integer linear program for finding an optimal advisor (which cannot at present be solved in practice).

Chapter 6 provides experimental results for parameter advising, and discusses the approach we use to assess the effectiveness of advising.

Chapter 7 expands the universe of parameter choices in advising to include not only the settings of the alignment parameters, but also the choice of the aligner itself, which we call aligner advising. This yields the first true ensemble aligner. We also compare the accuracy of the alignments produced by the ensemble aligner to those obtained using a parameter advisor with a fixed aligner.

Chapter 8 presents an approach called adaptive local realignment that computes alignments that can use different parameter choices in different regions of the alignment. Since regions of a protein have distinct mutation rates, using different parameter choices across an alignment can be necessary.

Chapter 9 describes an approach to predicting how much of a column in a computed alignment comes from core columns of an unknown reference alignment, using a variant of nearest-neighbor classification. Since true accuracy is only measured on core columns, inferring such columns can boost the accuracy of our advisor.

Finally, Chapter 10 summarizes our results and gives future directions for research.
CHAPTER 2
Accuracy Estimation
Overview
The accuracy of a multiple sequence alignment is commonly measured as the fraction of aligned residues from the core columns of a known reference alignment that are also aligned in the computed alignment. Usually this reference alignment is unavailable, in which case we can only estimate the accuracy. We present a reference-free approach that estimates accuracy using a linear combination of bounded feature functions of an alignment. In this chapter, we describe this framework for accuracy estimation and show that all higher-order polynomial estimators can be reduced to a linear estimator. We also give several approaches for learning the coefficients of the estimator function through mathematical optimization. This chapter was adapted from portions of previous publications (DeBlasio et al., 2012b; Kececioglu and DeBlasio, 2013).
2.1 Introduction
Without knowing a reference alignment that establishes the ground truth against which the true accuracy of an alignment is measured, we can only estimate the accuracy of an alignment. Our approach to obtaining an estimator for alignment accuracy is to (a) identify multiple features of an alignment that tend to be correlated with accuracy, and (b) combine these features into a single accuracy estimate. Each feature, as well as the final accuracy estimator, is a real-valued function of an alignment.

The simplest estimator is a linear combination of feature functions, where features are weighted by coefficients. These coefficients can be learned by training the
estimator on example alignments whose true accuracy is known. This training process will result in a fixed coefficient, or weight, for each feature. Alignment accuracy is usually represented by a value in the range [0, 1], with 1 corresponding to perfect accuracy. Consequently, the value of the estimator on an alignment should be bounded, no matter how long the alignment or how many sequences it aligns. For boundedness to hold when using fixed feature weights, the feature functions themselves must also be bounded. Hence, we assume that the feature functions also have the range [0, 1]. (The particular features we use are presented in Chapter 3.) We can then guarantee that the estimator has range [0, 1] by ensuring that the coefficients found by the training process yield a convex combination of features. In practice, we have found that not all the features naturally span the entire range [0, 1], so we relax the convex combination condition, and instead only require that the estimator value is in the range [0, 1] on all training examples.
2.2 The estimator
In general, we consider estimators that are polynomial functions of alignment features. More precisely, suppose the features that we consider for alignments A are measured by the k feature functions f_i(A) for 1 ≤ i ≤ k. Then our accuracy estimator E(A) is a polynomial in the k variables f_i(A). For example, for a degree-2 polynomial,

    E(A) := a_0 + \sum_{1 \le i \le k} a_i f_i(A) + \sum_{1 \le i,j \le k} a_{ij} f_i(A) f_j(A) .

For a polynomial of degree d, our accuracy estimator E(A) has the general form

    E(A) := \sum_{\substack{p_1, \ldots, p_k \in \mathbb{Z}^+ \\ p_1 + \cdots + p_k \le d}} a_{p_1, \ldots, p_k} \prod_{1 \le i \le k} f_i(A)^{p_i} ,

where \mathbb{Z}^+ denotes the nonnegative integers, and the coefficients on the terms of the polynomial are given by the a_{p_1, ..., p_k}. In this summation, there are k index variables p_i, and each possible assignment of nonnegative integers to the p_i that satisfies \sum_i p_i \le d specifies one term of the summation, and hence the powers for one term of the polynomial.
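As an illustration of this general form, the sketch below enumerates the exponent tuples of a degree-d polynomial over k features and evaluates the estimator; the coefficient encoding and names are assumptions made for the example.

```python
# Sketch: evaluating the degree-d polynomial estimator defined above.
# Coefficients are indexed by the exponent tuple (p_1, ..., p_k) with
# p_1 + ... + p_k <= d; the constant term a_0 is the all-zeros tuple.

from itertools import product
from math import prod

def exponent_tuples(k: int, d: int):
    """All tuples of k nonnegative integers summing to at most d."""
    return [p for p in product(range(d + 1), repeat=k) if sum(p) <= d]

def evaluate_estimator(features, coefficients, d):
    """features: list of k feature values f_i(A) in [0, 1];
    coefficients: dict mapping exponent tuples to a_{p_1,...,p_k}."""
    k = len(features)
    return sum(coefficients.get(p, 0.0) *
               prod(f ** e for f, e in zip(features, p))
               for p in exponent_tuples(k, d))

# Example: a degree-2 estimator over k = 2 features (made-up values).
coeffs = {(0, 0): 0.1, (1, 0): 0.5, (0, 1): 0.3, (1, 1): 0.1}
print(evaluate_estimator([0.8, 0.6], coeffs, d=2))   # 0.728
```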
Encoding higher-order polynomial estimators
Learning an estimator from example alignments, as discussed in Section 2.3, corre- sponds to determining the coefficients for its terms. We can efficiently learn optimal values for the coefficients, that minimize the error between the estimate E(A) and the actual accuracy of alignment A on a set of training examples, even for estimators that are polynomials of arbitrary degree d. This can be done for arbitrary degree essentially because such an estimator can always be reduced to the linear case by a change of feature functions, as follows. For each term in the degree-d estimator,
where the term is specified by the powers pi of the fi, define a new feature function
Y pi gj(A) := fi(A) , 1≤i≤k
that has an associated coefficient cj := ap1,...,pk . Then in terms of the new feature functions gj, the original degree-d estimator is equivalent to the linear estimator
E(A) = c_0 + \sum_{1 \le j < t} c_j g_j(A),

where t is the total number of terms, matching the coefficient vector c = (c_0, \ldots, c_{t-1}) used below.

2.3 Learning the estimator from examples

In Section 1.2 we described the set of examples: benchmark sequences that have been aligned under various parameter choices by the aligner, and whose alignments are labeled with their true accuracy. In addition, we record the feature function values for each of these examples. We then use these examples, with their associated accuracy and feature values, to find coefficients that fit the accuracy estimator to true accuracy by two techniques that we describe below.

2.3.1 Fitting to accuracy values

A natural criterion for fitting the estimator is to minimize the error on the example alignments between the estimator and the true accuracy value. For alignment A in our training set S, let E_c(A) be its estimated accuracy, where vector c = (c_0, \ldots, c_{t-1}) specifies the values for the coefficients of the estimator polynomial, and let F(A) be the true accuracy of example A. Formally, minimizing the weighted error between estimated accuracy and true accuracy yields estimator E^* := E_{c^*} with coefficient vector

c^* := \operatorname*{argmin}_{c \in \mathbb{R}^t} \sum_{A \in S} w_A \bigl| E_c(A) - F(A) \bigr|^p,

where power p controls the degree to which large accuracy errors are penalized. Weights w_A correct for sampling bias among the examples, as explained below.

When p = 2, this corresponds to minimizing the L2 norm between the estimator and the true accuracies. The absolute value in the objective function may be removed, and the formulation becomes a quadratic programming problem in variables c, which can be efficiently solved. (Note that E_c is linear in c.) When p = 1, the formulation corresponds to minimizing the L1 norm. This is less sensitive to outliers than the L2 norm, which can be advantageous when the underlying features are noisy. Minimizing the L1 norm can be reduced to a linear programming problem as follows. In addition to variables c, we have a second vector of variables e, with an entry e_A for each example A ∈ S, to capture the absolute value in the L1 norm, along with the inequalities

e_A \ge E_c(A) - F(A), \qquad e_A \ge F(A) - E_c(A),

which are linear in variables c and e. We then minimize the linear objective function

\sum_{A \in S} w_A e_A.

For n examples, the linear program has n + t variables and O(n) inequalities, which is solvable even for very large numbers of examples.

If the feature functions all have range [0, 1], we can ensure that the resulting estimator E^* also has range [0, 1] by adding the linear inequalities

c_i \ge 0, \qquad \sum_{0 \le i < t} c_i \le 1.

But as mentioned earlier, it may be useful not to restrict the coefficients to a convex combination, because while the features are bounded, they may not have values across the whole range. Instead we can add the following inequalities for each training example A, which ensure that E^* has range [0, 1] on the training set:

E_c(A) \ge 0, \qquad E_c(A) \le 1.

The weights w_A on examples aid in finding an estimator that is good across all accuracies. In the suites of protein alignment benchmarks that are commonly available, a predominance of the benchmarks consist of sequences that are easily alignable, meaning that standard aligners find high-accuracy alignments for these benchmarks. (This is mainly a consequence of the fact that proteins for which reliable structural reference alignments are available tend to be closely related, and hence easier to align; it does not mean that typical biological inputs are easy.) In this situation, when training set S is generated as described earlier, most examples have high accuracy, with relatively few at moderate to low accuracies.
Without weights on examples, the resulting estimator E^* is strongly biased towards optimizing the fit for high-accuracy alignments, at the expense of a poor fit at lower accuracies. To prevent this, we bin the examples in S by their true accuracy, where B(A) ⊆ S is the set of alignments falling in the bin for example A, and then weight the error term for A by w_A := 1/|B(A)|. (In our experiments, we form 10 bins equally spaced at 10% increments in accuracy.) In the objective function this weights bins uniformly (rather than weighting examples uniformly) and weights the error equally across the full range of accuracies.

2.3.2 Fitting to accuracy differences

Many applications of an accuracy estimator E will use it to choose from a set of alignments the one that is estimated to be most accurate. (This occurs, for instance, in parameter advising as discussed in Chapter 6.) In such applications, the estimator is effectively ranking alignments, and all that is needed is for the estimator to be monotonic in true accuracy. Accordingly, rather than trying to fit the estimator to match accuracy values, we can instead fit it so that differences in accuracy are reflected by at least as large differences in the estimator. This fitting to differences is less constraining than fitting to values, and hence might be better achieved.

More precisely, suppose we have selected a set P ⊆ S^2 of ordered pairs of example alignments, where every pair (A, B) ∈ P satisfies F(A) < F(B). Set P holds pairs of examples on which accuracy F increases, and for which we desire similar behavior from our estimator E. (Later we discuss how we select a small set P of important pairs.) If estimator E increases at least as much as accuracy F on a pair in P, this is a success; if it increases less than F, we consider the amount it falls short an error, which we try to minimize. Notice this tries to match large accuracy increases, and penalizes less for not matching small increases.

We formulate fitting to differences as finding the optimal estimator E^* := E_{c^*} given by coefficients

c^* := \operatorname*{argmin}_{c \in \mathbb{R}^t} \sum_{(A,B) \in P} w_{AB} \Bigl( \max \bigl\{ \bigl(F(B) - F(A)\bigr) - \bigl(E_c(B) - E_c(A)\bigr), \, 0 \bigr\} \Bigr)^{p},

where w_{AB} weights the error term for a pair. When power p is 1 or 2, we can reduce this optimization problem to a linear or quadratic program as follows. We introduce a vector of variables e with an entry e_{AB} for each pair (A, B) ∈ P, along with the inequalities

e_{AB} \ge 0, \qquad e_{AB} \ge \bigl(F(B) - F(A)\bigr) - \bigl(E_c(B) - E_c(A)\bigr),

which are linear in variables c and e. We then minimize the objective function

\sum_{(A,B) \in P} w_{AB} (e_{AB})^{p},

which is linear or quadratic in the variables for p = 1 or 2, respectively.

For a set P of m pairs, these programs have m + t variables and m inequalities, where m = O(n^2) in terms of the number of examples n. For the programs to be manageable for large n, set P must be quite sparse. We can select a sparse set P of important pairs using one of two methods: threshold-minimum accuracy difference pairs, or distributed-example pairs. Recall that the training set S of examples consists of alternate alignments of the sequences in benchmark reference alignments, where the alternates are generated by aligning the benchmark under a constant number of different parameter choices.
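To illustrate the reductions above, the following Python sketch sets up the L1 value-fitting linear program of Section 2.3.1 using SciPy's linprog; the difference-fitting program has the same shape, with one error variable per pair instead of per example. The function name and data layout (a feature matrix whose first column is a constant 1 to carry c_0) are assumptions for the sketch, not the Facet code.

```python
import numpy as np
from scipy.optimize import linprog

def fit_l1(F, acc, w):
    """Sketch of the L1 value-fitting LP: minimize sum_A w_A e_A with
    e_A >= |E_c(A) - F(A)|, where E_c(A) = c . F(A).  Rows of F are
    feature vectors (include a constant-1 column to model c_0); acc
    holds the true accuracies F(A); w holds the example weights w_A."""
    n, t = F.shape
    # Variables are (c_0..c_{t-1}, e_1..e_n); only the e_A are in the objective.
    obj = np.concatenate([np.zeros(t), w])
    eye = np.eye(n)
    A_ub = np.vstack([
        np.hstack([F, -eye]),              # E_c(A) - e_A <= F(A)
        np.hstack([-F, -eye]),             # -E_c(A) - e_A <= -F(A)
        np.hstack([F, np.zeros((n, n))]),  # E_c(A) <= 1 on every example
        np.hstack([-F, np.zeros((n, n))]), # E_c(A) >= 0 on every example
    ])
    b_ub = np.concatenate([acc, -acc, np.ones(n), np.zeros(n)])
    bounds = [(None, None)] * t + [(0, None)] * n  # c free, e_A >= 0
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:t]                       # fitted coefficient vector c
```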
Threshold-difference pairs

While we would like an accuracy estimator that matches the difference in true accuracy between any two alignments, in parameter advising we are only concerned with choosing among alignments of the same set of sequences. With threshold-difference pairs, we include in P only pairs of alignments (A, B) of the same benchmark. In particular, we include all such pairs where F(B) − F(A) ≥ ε. Here ε > 0 is a tunable threshold: if the difference in accuracy is smaller than this threshold, we exclude the pair from training, as its effect on the parameter advisor is minimal, and it makes the linear or quadratic program much harder to solve. As ε approaches 0, the estimator becomes better at distinguishing small differences, but more constraints are included in the program, increasing the running time of the solver. For example pairs under this model, we set the weight w_{AB} to be 1/|B(C)|, where B gives the corresponding bin for benchmarks aligned under the default parameter settings, and C is the alignment under the default parameter settings of the benchmark sequences that A and B are aligning.

Distributed-example pairs

An estimator that is designed for parameter advising should rank the highest-accuracy alternate alignment for a benchmark above the other alternates for that benchmark. Consequently, for each benchmark we select for P its highest-accuracy alternate, paired with each of its other alternates whose difference in accuracy is at least ε, where ε is a tunable threshold. (Notice this picks O(n) pairs on the n examples.)

For the estimator to generalize beyond the training set, it helps to also properly rank alignments between benchmarks. To include some pairs between benchmarks, we choose the minimum, maximum, and median accuracy alignments for each benchmark, and form one list L of all these chosen alignments, ordered by increasing accuracy. Then for each alignment A in L, we scan L to the right to select the first k pairs (A, B) for which F(B) ≥ F(A) + iδ, where i = 1, ..., k, and for which B is from a different benchmark than A. While the constants ε ≥ 0, δ ≥ 0, and k ≥ 1 control the specific pairs that this procedure selects for P, it always selects O(n) pairs on the n examples.

Weighting distributed-example pairs

When fitting to accuracy differences, we again weight the error terms, which are now associated with pairs, to correct for sampling bias within P. We want the weighted pairs to treat the entire accuracy range equally, so the fitted estimator performs well at all accuracies. As when fitting to accuracy values, we partition the example alignments in S into bins B_1, ..., B_k according to their true accuracy. To model equal weighting of accuracy bins by pairs, we consider a pair (A, B) ∈ P to have half its weight w_{AB} on the bin containing A, and half on the bin containing B. (So in this model, a pair (A, B) with both ends A, B in the same bin B places all its weight w_{AB} on B.) Then we want to find weights w_{AB} > 0 that, for all bins \mathcal{B}, satisfy

\sum_{(A,B) \in P : A \in \mathcal{B}} \tfrac{1}{2} w_{AB} + \sum_{(A,B) \in P : B \in \mathcal{B}} \tfrac{1}{2} w_{AB} = 1.

In other words, the pairs should weight bins uniformly. We say a collection of weights w_{AB} is balanced if it satisfies the above property on all bins. While balanced weights do not always exist in general, we can identify an easily-satisfied condition that guarantees they do exist, and in this case find balanced weights by the following graph algorithm.
Construct an undirected graph G whose vertices are the bins B_i and whose edges (i, j) go between bins B_i, B_j that have an alignment pair (A, B) in P with A ∈ B_i and B ∈ B_j. (Notice G has self-loops when pairs have both alignments in the same bin.) Our algorithm first computes weights ω_{ij} on the edges (i, j) in G, and then assigns weights to pairs (A, B) in P by setting w_{AB} := 2 ω_{ij} / c_{ij}, where bins B_i, B_j contain alignments A, B, and c_{ij} counts the number of pairs in P between bins B_i and B_j. (The factor of 2 is due to a pair only contributing weight ½ w_{AB} to a bin.) A consequence is that all pairs (A, B) that go between the same bins get the same weight w_{AB}.

During the algorithm, an edge (i, j) in G is said to be labeled if its weight ω_{ij} has been determined; otherwise it is unlabeled. We call the degree of a vertex i the total number of endpoints of edges in G that touch i, where a self-loop contributes two endpoints to the degree. Initially all edges of G are unlabeled. The algorithm sorts the vertices of G in order of nonincreasing degree, and then processes the vertices from highest degree on down.

In the general step, the algorithm processes vertex i as follows. It accumulates w, the sum of the weights ω_{ij} of all labeled edges that touch i; counts u, the number of unlabeled edges touching i that are not a self-loop; and determines d, the degree of i. To the unlabeled edges (i, j) touching i, the algorithm assigns weight ω_{ij} := 1/d if the edge is not a self-loop, and weight ω_{ii} := ½ (1 − w − u/d) otherwise.

This algorithm assigns balanced weights if in graph G every bin has a self-loop, as stated in the following theorem.

Theorem 1 (Finding Balanced Weights) Suppose every bin has some pair (A, B) in P with both alignments A and B in that bin. Then the above graph algorithm finds balanced weights.

Proof We will show that: (a) for every edge (i, j) in G, its assigned weight satisfies ω_{ij} > 0; and (b) for every vertex i, the weights assigned to its incident edges (i, j) satisfy

\sum_{(i,j) : j \ne i} \omega_{ij} + 2 \omega_{ii} = 1.

From properties (a) and (b) it follows that the resulting weights w_{AB} are balanced.

The key observation is that when processing a vertex i of degree d, the edges touching i that are already labeled will have been assigned a weight ω_{ij} ≤ 1/d, since the other endpoint j must have degree at least d (as vertices are processed from highest degree on down). Unlabeled edges touching i, other than a self-loop, get assigned weight ω_{ij} = 1/d > 0. When assigning weight ω_{ii} for the unlabeled self-loop, the total weight w of incident labeled edges satisfies w ≤ (d − u − 2)/d, by the key observation above and the fact that vertex i always has a self-loop, which contributes 2 to its degree. This inequality in turn implies ω_{ii} ≥ 1/d > 0. Thus property (a) holds. Furthermore, twice the weight ω_{ii} assigned to the self-loop takes up the slack between 1 and the weights of all other incident edges, so property (b) holds as well. □

Regarding the condition in Theorem 1, if there are bins without self-loops, balanced weights do not necessarily exist. The smallest such instance is when G is a path of length 2. Notice that we can ensure the condition in Theorem 1 holds if every bin has at least two example alignments: simply add a pair (A, B) to P where both alignments are in the bin, if the procedure for selecting a sparse P did not already.
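The following Python sketch implements the pair-weighting algorithm as just described. The function name, the bin_of mapping, and the pair representation are our own; a production implementation would use the radix- and counting-sort organization noted below to reach linear time.

```python
from collections import defaultdict

def balanced_pair_weights(pairs, bin_of):
    """Sketch of the pair-weighting algorithm of Theorem 1.  pairs is
    a list of (A, B) example pairs; bin_of maps each alignment to its
    accuracy bin.  Assumes every bin has at least one self-loop pair."""
    count = defaultdict(int)     # c_ij: number of pairs between bins i, j
    incident = defaultdict(set)  # edges of G touching each bin
    degree = defaultdict(int)    # a self-loop contributes 2 to the degree
    for A, B in pairs:
        e = tuple(sorted((bin_of[A], bin_of[B])))
        count[e] += 1
        incident[e[0]].add(e); incident[e[1]].add(e)
        degree[e[0]] += 1; degree[e[1]] += 1
    omega = {}                   # edge weights, assigned in degree order
    for i in sorted(degree, key=degree.get, reverse=True):
        d = degree[i]
        labeled = sum(omega[e] for e in incident[i] if e in omega)
        unlabeled = [e for e in incident[i] if e not in omega]
        u = sum(1 for e in unlabeled if e != (i, i))
        for e in unlabeled:
            # non-self-loops get 1/d; the self-loop absorbs the slack
            omega[e] = 1 / d if e != (i, i) else (1 - labeled - u / d) / 2
    # Convert edge weights to pair weights: w_AB = 2 * omega_ij / c_ij.
    w = {}
    for A, B in pairs:
        e = tuple(sorted((bin_of[A], bin_of[B])))
        w[(A, B)] = 2 * omega[e] / count[e]
    return w
```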
When the training set S of example alignments is sufficiently large compared to the number of bins (which is within our control), every bin is likely to have at least two examples. So Theorem 1 essentially guarantees that in practice we can fit our estimator using balanced weights. For k bins and m pairs, the pair-weighting algorithm can be implemented to run in O(k + m) time, using radix sort to map pairs in P to edges in G, and counting sort to order the vertices of G by degree.

Summary

In this chapter, we have developed an accuracy estimator that is a linear combination of feature functions, and provided two approaches to learning the coefficients of this estimator. Chapter 3 next describes the specific features that, along with this framework, make up the Facet accuracy estimator. Results on using the Facet estimator with the feature functions described in the next chapter are presented in Chapter 6.

CHAPTER 3
The Facet Estimator

Overview

In Chapter 2, we described a general framework for creating an alignment accuracy estimator that is a linear combination of feature functions, and for learning the coefficients of such an estimator. In this chapter, we explore the feature functions used in our accuracy estimator Facet. Some of the features we use are standard metrics that are common for measuring multiple sequence alignment quality, such as amino acid percent identity and gap extension density, but many of the most reliable features are novel. The strongest feature functions tend to use predicted secondary structure. We describe in detail the most accurate and novel features: secondary structure blockiness and secondary structure consensus. This chapter was adapted from portions of previous publications (DeBlasio et al., 2012b; Kececioglu and DeBlasio, 2013).

3.1 Introduction

In Section 1.3.1 we described two classes of accuracy estimators: scoring-function-based and support-based. While our approach is within the general scoring-function-based category, compared to prior such approaches, we:

(a) introduce several novel feature functions that measure non-local properties of an alignment and have stronger correlation with accuracy (such as Secondary Structure Blockiness, described here in Section 3.2.1),

(b) consider larger classes of estimators beyond linear combinations of features (such as quadratic polynomials, described in Chapter 2), and

(c) develop new regression formulations for learning an estimator from examples (such as difference fitting, described in Chapter 2).

Our approach can readily incorporate new feature functions into the estimator, and is easily tailored to a particular class of alignments by choosing appropriate features and performing regression.

Compared to support-based approaches, our estimator does not degrade on difficult alignment instances, where for parameter advising, good accuracy estimation can have the greatest impact. As shown in our advising experiments in Chapter 6, support-based approaches lose the ability to detect accurate alignments of hard-to-align sequences, since for such sequences most alternate alignments are poor and lend little support to the alignment that is actually most accurate.

In this chapter, we begin by giving descriptions of the feature functions used in the Facet estimator in Section 3.2. For each feature, we also consider a few variants. In the next section we discuss the most accurate feature function, called Secondary Structure Blockiness.
Section 3.2.13 shows examples of the feature values for a set of computed alignments. Section 3.3 details our implementation of Facet in Java. Not only is Facet available as a stand-alone tool that can be incorporated into existing analysis pipelines that include multiple sequence alignment, it can also be used via an API within other multiple sequence alignment tools.

3.2 Estimator features

The quality of the estimator that results from our approach ultimately rests on the quality of the features that we consider. We consider twelve features of an alignment, the majority of which are novel. All are efficiently computable, so the resulting estimator is fast to evaluate. The strongest feature functions make use of predicted secondary structure (which is not surprising, given that protein sequence alignments are often surrogates for structural alignments). Details about protein secondary structure, and how we predict it for new proteins, can be found in Section 1.4.

Another aspect of some of the best alignment features is that they tend to use non-local information. This is in contrast to standard ways of scoring sequence alignments, such as with amino acid substitution scores or gap open and extension penalties, which are often a function of a single alignment column or two adjacent columns (as is necessary for efficient dynamic programming algorithms). While a good accuracy estimator would make an ideal scoring function for constructing a sequence alignment, computing an optimal alignment under such a non-local scoring function seems prohibitive (especially since multiple alignment is already NP-complete for the current highly-local scoring functions). Nevertheless, given that our estimator can be efficiently evaluated on any constructed alignment, it is well suited for selecting a sequence alignment from among several alternate alignments, as we discuss in Chapter 6 in the context of parameter advising (and later chapters further consider the contexts of ensemble alignment and adaptive local realignment).

Key properties of a good feature function are: (a) it should measure some attribute that discriminates high-accuracy alignments from others, (b) it should be efficiently computable, and (c) its value should be bounded (as discussed at the beginning of Chapter 2). Bounded functions are easily normalized, and we scale all our feature functions to the range [0, 1]. We also intend our features to be increasing functions of, or positively correlated with, alignment accuracy.

The following are the alignment feature functions we consider for our accuracy estimator. We highlight the first function, as it is the most novel, one of the strongest, and the most challenging to compute.

3.2.1 Secondary Structure Blockiness

The reference alignments in the most reliable suites of protein alignment benchmarks are computed by structural alignment of the known three-dimensional structures of the proteins. The so-called core blocks of these reference alignments, which are the columns in the reference to which an alternate alignment is compared when measuring its true accuracy, are typically defined as the regions of the structural alignment in which the residues of the different proteins are all within a small distance threshold of each other in the superimposed structures. These regions of structural agreement are usually in the embedded core of the folded proteins, and the secondary structure of the core usually consists of α-helices and β-strands.
(Details of secondary structure and its representation can be found in Section 1.4.) As a consequence, in the reference sequence alignment, the sequences in a core block often share the same secondary structure, and the type of this structure is usually α-helix or β-strand.

We measure the degree to which a multiple alignment displays this pattern of structure by a feature we call Secondary Structure Blockiness. Suppose that for the protein sequences in a multiple alignment we have predicted the secondary structure of each protein, using a standard prediction tool such as PSIPRED (Jones, 1999). Then in multiple sequence alignment A and for given integers k, ℓ > 1, define a secondary structure block B to be:

(i) a contiguous interval of at least ℓ columns of A, together with

(ii) a subset of at least k sequences in A,

such that on all columns in this interval, in all sequences in this subset, all the entries in these columns for these sequences have the same predicted secondary structure type, and this shared type is all α-helix or all β-strand. We call B an α-block or a β-block according to the common type of its entries. Parameter ℓ, which controls the minimum width of a block, relates to the minimum length of α-helices and β-strands; we can extend the definition to use different values ℓ_α and ℓ_β for α- and β-blocks.

A packing for alignment A is a set P = {B_1, ..., B_b} of secondary structure blocks of A, such that the column intervals of the B_i ∈ P are all disjoint. (In other words, in a packing, each column of A is in at most one block. The sequence subsets for the blocks can differ arbitrarily.) The value of a block is the total number of residue pairs (or equivalently, substitutions) in its columns; the value of a packing is the sum of the values of its blocks. Finally, the blockiness of an alignment A is the maximum value of any packing for A, divided by the total number of residue pairs in the columns of A. In other words, Secondary Structure Blockiness measures the fraction of substitutions in A that are in an optimal packing of α- or β-blocks.

At first glance measuring blockiness might seem hard (since optimal packing problems are often computationally intractable), yet surprisingly it can actually be computed in linear time in the size of the alignment, as the following theorem states. The main idea is that evaluating blockiness can be reduced to solving a longest path problem on a directed acyclic graph of linear size.

Theorem 2 (Evaluating Blockiness) Given a multiple alignment A of m protein sequences and n columns, where the sequences are annotated with predicted secondary structure, the blockiness of A can be computed in O(mn) time.

Proof The key is to not enumerate subsets of sequences in A when considering blocks for packings, and instead enumerate intervals of columns of A. Given a candidate column interval I for a block B, we can avoid considering all possible subsets of sequences, since there are only two possibilities for the secondary structure type s of B, and the sequences in B must have type s across I. To maximize the value of B, we can collect all sequences in A that have type α across I (if any), all sequences that have type β across I, and keep whichever subset has more sequences.
Following this idea, given alignment A, we form an edge-weighted, directed graph G that has a vertex for every column of A, plus an artificial sink vertex, and an edge of weight 0 from each column to its immediate successor, plus an edge of weight 0 from the last column of A to the sink. We call the vertex for the first column of A the source vertex. We could then consider all intervals I of at least ℓ columns, test whether the best sequence subset for each I as described above has at least k sequences, and if so, add an edge to G from the first column of I to the immediate successor of the last column of I, weighted by the maximum value of a block with interval I. A longest path in the resulting graph G from its source to its sink then gives an optimal packing for A, and the blockiness of A is the length of this longest path divided by the total number of substitutions in A.

This graph G would have Θ(n²) edges, however, and would not lead to an O(mn) time algorithm for blockiness. Instead, we only add edges to G for such intervals I whose number of columns, or width, is in the range [ℓ, 2ℓ−1]. Any block B whose interval has width at least ℓ is the concatenation of disjoint blocks whose intervals have widths in the above range. Furthermore, the value of block B is the sum of the values of the blocks in the concatenation. Only adding to G edges in the above width range gives a sparse graph with O(n) vertices and just O(ℓn) edges, which is O(n) edges for constant ℓ.

To implement this algorithm, first construct G in O(mn) time by (1) enumerating the O(n) edges of G in lexicographic order on the pair of column indices defining the column interval for the edge, and then (2) determining the weight of each successive edge e in this order in O(m) time by appending a single column of A to form the column interval for e from the shorter interval of its predecessor. Graph G is acyclic, and a longest source-sink path in a directed acyclic graph can be computed in time linear in its number of vertices and edges (Cormen et al., 2009, pp. 655–657), so the optimal packing in A by blocks can be obtained from G in O(n) time. This takes O(mn) time in total. □

There are a few further details in how we use Secondary Structure Blockiness in practice, which are discussed below; a sketch of the full computation follows this list.

(a) Blocks are calculated first on the structure classes α-helix and β-strand. We have an option to then also construct coil blocks on the columns of an alignment that are not already covered by α-helix and β-strand blocks. In practice, we found that including this second coil pass increases the advising accuracy over only including blocks for non-coil classes.

(b) We also specify a minimum number of rows k in the definition of a block. We find that in practice, blockiness shows the best performance when this minimum is set to k = 2 rows. While using a minimum of k = 1 would not have affected the results if we only used α- or β-blocks, using k > 1 increased the number of columns that could be included in coil blocks.

(c) Permitting gap characters in blocks allows them to be extended to regions that may have single insertions or deletions in one or more sequences. When gaps are allowed in a block they do not contribute to the value of the block (as the value is still defined as the number of residue pairs in the columns and rows of the block), but they can extend a block to include more rows. We find that including gaps increases the advising accuracy of blockiness in practice.

(d) In reality, α-helices and β-strands each physically have a minimum number of amino acids needed to form their structures. We have two modes to capture this: one that sets the minimum widths based on actual physical sizes, and one that sets the minimums to the same length. In the unequal mode, the minimum sizes are ℓ_α = 4 for α-helix, ℓ_β = 3 for β-strand, and ℓ_c = 2 for coil. In equal mode, ℓ_α = ℓ_β = ℓ_c = 2. We find that in practice, the unequal mode gives better advising accuracy.

(e) The secondary structure prediction tool PSIPRED outputs confidence values p_t for each structure type t ∈ {α, β, c}. These can be used to choose a single structure prediction at each position in a protein, by assigning the prediction with the highest confidence value. Alternately, we can choose a threshold τ, and say that a residue in the protein has not just one structure type, but all structure types with p_t > τ. In this way, a residue can be in multiple blocks of different structure types if both types have high confidence; in the final packing, however, it will only be in one, since the blocks of a packing are column-disjoint. We found that in practice, using confidences in this way to allow ambiguous structural types was detrimental to advising accuracy on the benchmarks we considered.
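As a concrete summary of Theorem 2, here is a simplified Python sketch of the blockiness computation. It uses a single minimum width ℓ for both block types, omits the coil pass and gap characters inside blocks, and recomputes each edge weight directly rather than incrementally, so it is illustrative rather than the O(mn)-time implementation; the input representation is our own assumption.

```python
from math import comb

def blockiness(ss, ell=2, k=2):
    """Sketch of the Theorem 2 longest-path algorithm.  ss is a list
    of aligned, equal-length strings of predicted structure types
    over {'H', 'E', 'C', '-'}, with '-' marking a gap character."""
    m, n = len(ss), len(ss[0])
    # Total residue pairs over all columns (the normalizer).
    total = sum(comb(sum(1 for row in ss if row[c] != '-'), 2)
                for c in range(n))
    best = [0.0] * (n + 1)       # longest-path value up to column c
    for c in range(1, n + 1):
        best[c] = best[c - 1]    # weight-0 edge: leave column c unpacked
        # Only widths in [ell, 2*ell - 1] are needed (see proof above).
        for w in range(ell, min(2 * ell - 1, c) + 1):
            lo = c - w           # candidate block spans columns lo..c-1
            for t in 'HE':       # alpha-block or beta-block
                s = sum(1 for row in ss if all(x == t for x in row[lo:c]))
                if s >= k:       # block value = width * pairs of rows
                    best[c] = max(best[c], best[lo] + w * comb(s, 2))
    return best[n] / total if total else 0.0
```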
The remaining feature functions in Facet are simpler to compute than Secondary Structure Blockiness.

3.2.2 Secondary Structure Agreement

The secondary structure prediction tool PSIPRED (Jones, 1999) outputs confidence values at each residue that are intended to reflect the probability that the residue has each of the three secondary structure types. Denote these three confidences for a residue i, r (the residue in the i-th sequence at the r-th column), normalized so they add up to 1, by p_α(i, r), p_β(i, r), and p_γ(i, r). Then we can estimate the probability that two sequences i, j in column r have the same secondary structure type that is not coil, by looking at the support for that pair from all intermediate sequences k. We first define the similarity of two residues (i, r) and (k, r) in column r as

S(i, k, r) := p_α(i, r) p_α(k, r) + p_β(i, r) p_β(k, r).

To measure how strongly the secondary structure locally agrees between sequences i and j around column r, we compute a weighted average P of S in a window of width 2ℓ + 1 centered around column r,

P(i, j, r) := \sum_{-\ell \le p \le \ell} w_p S(i, j, r + p),

where the weights w_p form a discrete distribution that peaks at p = 0 and is symmetric. We can define the support for the pair i, j from intermediate sequence k as the product of the similarities of each of i and j with k, namely P(i, k, r) P(k, j, r). The support Q for pair i, j from all intermediate sequences is then defined as

Q(i, j, r) := \sum_{\substack{0 \le k < N \\ k \ne i, \; k \ne j}} P(i, k, r) P(k, j, r).

The value of the Secondary Structure Agreement feature is then the average of Q(i, j, r) over all sequence pairs i, j in all columns r. This is the feature with the largest running time, but it is also one of the strongest features. Its running time is O(m³nℓ) for m sequences in an alignment of length n. The values of ℓ and w must be set by the user. We tried various values for both, and found that ℓ = 2 and w = (0.07, 0.24, 0.38, 0.24, 0.07) gave the best advising results.
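A vectorized Python sketch of this feature follows; the array layout of the PSIPRED confidences, the function name, and the default window weights are assumptions for illustration.

```python
import numpy as np

def ss_agreement(conf, w=(0.07, 0.24, 0.38, 0.24, 0.07)):
    """Sketch of Secondary Structure Agreement.  conf has shape
    (m, n, 3): confidences (p_alpha, p_beta, p_coil) for each of m
    sequences over n columns, normalized to sum to 1 per residue."""
    m, n, _ = conf.shape
    ell = len(w) // 2
    # S[i, j, r] = p_a(i,r) p_a(j,r) + p_b(i,r) p_b(j,r)
    S = np.einsum('irt,jrt->ijr', conf[:, :, :2], conf[:, :, :2])
    # P[i, j, r]: weighted average of S over a window around column r
    P = np.zeros_like(S)
    for p, wp in zip(range(-ell, ell + 1), w):
        lo, hi = max(0, -p), min(n, n - p)
        P[:, :, lo:hi] += wp * S[:, :, lo + p:hi + p]
    # Q[i, j, r]: support from intermediates k, excluding k = i and k = j
    Q = np.einsum('ikr,kjr->ijr', P, P)
    diag = np.einsum('iir->ir', P)        # P[i, i, r]
    Q -= diag[:, None, :] * P             # drop the k = i term
    Q -= P * diag[None, :, :]             # drop the k = j term
    iu = np.triu_indices(m, k=1)          # distinct pairs i < j
    return float(Q[iu].mean())
```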
3.2.3 Gap Coil Density

A gap in a pairwise alignment is a maximal run of either insertions or deletions. For every pair of sequences, there is a set of gap-residue pairs (residues that are aligned with gap characters), each of which has an associated secondary structure prediction given by PSIPRED (the structure assigned to the residue in the pair). The Gap Coil Density feature measures the fraction of all gap-residue pairs with a secondary structure type of coil.

As described, computing Gap Coil Density may seem quadratic in the number of sequences. By simply counting the number of gap characters g_i, coil-labeled non-gap entries γ_i, and non-coil-labeled non-gap entries s_i in column i, we can compute this feature by

\frac{\sum_{\text{columns } i} g_i \, \gamma_i}{\sum_{\text{columns } i} g_i \, (\gamma_i + s_i)}.

All this counting takes linear time total in the number of sequences, so the running time for computing Gap Coil Density is O(mn). Alternately, we can use PSIPRED confidences; the feature value is then the average coil confidence over all gap-residue pairs in the alignment. We find that in practice, using these confidences gives better advising accuracy.

3.2.4 Gap Extension Density

This feature counts the number of null characters in the alignment (the dashes that denote gaps). This is related to affine gap penalties (Gotoh, 1982), which are commonly used to score alignments. We normalize this count by the total number of alignment entries, or by an upper bound U on the number of possible entries. The reason to use the upper bound is so that we can compare the feature value across alignments of the same sequences that may have different alignment lengths, while still yielding a feature value that lies in the range [0, 1]. We calculate this upper bound as

U := \frac{N}{2} \left( \max_{s \in S} |s| + \max_{s' \in S'} |s'| \right),

where S' := S − argmax_{s ∈ S} |s|, so that the second max gives the length of the second-longest sequence in S. We find that normalizing by U gives better advising accuracies.

As the quantity described above is generally decreasing in alignment accuracy (since more gaps generally indicate a lower-quality alignment), for the actual feature value we use 1 minus the ratio described above.

Gap Extension Density essentially counts the number of null characters in an alignment, which can be done in linear time for each sequence. Thus Gap Extension Density can be computed in O(mn) time. The lengths of the input sequences can be computed in linear time, so U can be computed in this same amount of time.

3.2.5 Gap Open Density

This feature counts the number of runs of null characters in the rows of the alignment (which also relates to affine gap penalties). Again we provide options to normalize by the total length of all such runs, or by the upper bound U on alignment size (which tends to give better advising accuracy). Just as with Gap Extension Density, Gap Open Density can be computed in O(mn) time. Similar to Gap Extension Density, the ratio described above is generally decreasing in alignment accuracy, so for the feature value we use 1 minus the ratio described above.
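A minimal Python sketch of these two gap features, assuming rows given as strings with '-' as the null character and U computed as above:

```python
def gap_features(rows, U):
    """Sketch of Gap Extension Density and Gap Open Density.  rows are
    the equal-length strings of an alignment, '-' marking a null
    character; U is the upper bound on entries defined in the text."""
    extensions = sum(row.count('-') for row in rows)  # all null characters
    opens = 0
    for row in rows:
        prev = 'x'                                    # non-null sentinel
        for ch in row:
            if ch == '-' and prev != '-':
                opens += 1                            # a new run of nulls
            prev = ch
    # Both counts generally decrease with accuracy, so flip them.
    return 1 - extensions / U, 1 - opens / U
```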
3.2.6 Gap Compatibility

As in cladistics, we encode the gapping pattern in the columns of an alignment as a binary state: residue (1) or null character (0). For an alignment in this encoding, we then collapse together adjacent columns that have the same gapping pattern. We evaluate this reduced set of columns for compatibility by checking whether a perfect phylogeny exists on them, using the so-called "four gametes test" on pairs of columns. More specifically, a pair of columns passes the four gametes test if at most three of the four possible patterns 00, 01, 10, 11 occur in the rows of these two columns. A so-called perfect phylogeny exists, in which the binary gapping pattern in each column is explained by a single insertion or deletion event on an edge of the tree, if and only if all pairs of columns pass this test. (See Gusfield, 1997, pages 462–463, or Estabrook et al., 1975.)

The Gap Compatibility feature measures the fraction of pairs of columns in the reduced binary data that pass this test, which is a rough measure of how tree-like the gapping pattern is in the alignment. Rather than determining whether a complete column pair passes the four gametes test, we can instead measure the fraction of a column pair that passes this test (the largest subset of rows that pass the test, divided by the total number of rows), averaged over all pairs of columns. We find that this second version of the feature works better in practice, most likely because it is a less strict measure of the evolutionary compatibility of the gaps.

For each pair of columns, we can compute the encoding of each row in constant time, so we can collect the counts for the four-gametes states in linear time in the number of sequences for a given column pair. Since we must examine all pairs of columns, the running time for Gap Compatibility is quadratic in the number of columns. Evaluating this feature takes O(mn²) time for an alignment with m sequences and n columns.

3.2.7 Substitution Compatibility

Similar to Gap Compatibility, we encode the substitution pattern in the columns of an alignment by a binary state: using a reduced amino acid alphabet of equivalency classes, residues in the most prevalent equivalency class in the column are mapped to 1, and all others to 0. This feature measures the fraction of encoded column pairs that pass the four gametes test, which again is a rough measure of how tree-like the substitution pattern is in the alignment. Again we have options for using both whole-column and fractional-column measurements; we find that fractional-column measurements give better accuracy. We also considered the standard reduced amino-acid alphabets with 6, 10, 15, and 20 equivalency classes, and find that the 15-class alphabet gives the strongest correlation with accuracy. Just like Gap Compatibility, evaluating Substitution Compatibility takes O(mn²) time.
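The whole-column version of the four gametes test at the heart of both features can be sketched as follows; binary-encoded columns are assumed as input, and the fractional-column variant would instead find the largest subset of rows passing the test.

```python
from itertools import combinations

def four_gametes_fraction(cols):
    """Fraction of column pairs showing at most three of the patterns
    00, 01, 10, 11.  cols is a list of binary-encoded columns, each a
    tuple with one 0/1 value per row of the alignment."""
    pairs = list(combinations(cols, 2))
    if not pairs:
        return 1.0
    passing = sum(1 for a, b in pairs
                  if len(set(zip(a, b))) <= 3)  # distinct row patterns
    return passing / len(pairs)

# Example: of the three column pairs, one fails the test.
cols = [(0, 0, 1, 1), (0, 1, 1, 0), (0, 0, 0, 1)]
print(four_gametes_fraction(cols))  # 2/3
```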
3.2.8 Amino Acid Identity

This feature is usually called simply "percent identity." In each induced pairwise alignment, we measure the fraction of substitutions in which the residues have the same amino-acid equivalency class, where we use the reduced alphabet with 10 classes. The feature averages this fraction over all induced pairwise alignments. We can compute Amino Acid Identity for a whole alignment by determining the frequency of each amino-acid class in each column, and summing the number of pairs of each alphabet element in a column. Computing amino-acid identity in this way takes O(mn + n|Σ|) time for amino-acid equivalency alphabet Σ. Assuming |Σ| is constant, this is O(mn) time.

3.2.9 Secondary Structure Identity

This feature is like Amino Acid Identity, except instead of the protein's amino-acid sequence, we use the secondary-structure sequence predicted for the protein by PSIPRED (Jones, 1999), which is a string over the 3-letter secondary structure alphabet. Similar to the approach described for Amino Acid Identity, we can compute Secondary Structure Identity in O(mn) time (where the structural alphabet here is {α, β, γ}, so |Σ| = 3). We also consider a second version that uses the secondary structure confidences, where instead of counting identities, we calculate the probability that a pair i, j of residues has the same secondary structure, by

p_α(i) p_α(j) + p_β(i) p_β(j) + p_γ(i) p_γ(j).

In this version, we cannot use the prior running-time reduction trick, and must examine all pairs of rows in a column, which takes total time O(m²n) for an alignment with m rows and n columns.

3.2.10 Average Substitution Score

This computes the total score of all substitutions in the alignment, using a BLOSUM62 substitution-scoring matrix (Henikoff and Henikoff, 1992) that has been shifted and scaled so the amino acid similarity scores are in the range [0, 1]. We can normalize this total score by the number of substitutions in the alignment, or by the upper bound U given earlier, so the feature value is comparable between alignments of the same sequences. We find that normalizing by U provides a feature value that correlates better with true accuracy.

Similar to the running-time reduction of Amino Acid Identity, we can count the frequency of each amino acid in each column of an alignment, and sum the BLOSUM62 score for each possible amino-acid substitution multiplied by the product of the frequencies of the two amino acids. This reduces the running time to O(mn + n|Σ|²), which is faster than considering all pairs of rows when |Σ| < m (otherwise we can use the naïve O(m²n) approach).

3.2.11 Core Column Density

For this feature, we first predict core columns as those that only contain residues (and not gap characters) and whose fraction of residues that have the same amino acid equivalency class, for the 10-class alphabet, is above a threshold. The feature then normalizes the count of predicted core columns by the total number of columns in the alignment. We considered the standard reduced alphabets with 6, 10, 15, and 20 equivalency classes, and use the 10-class alphabet, as it gave the strongest correlation with true accuracy. We also tested various thresholds, and found that a value of 0.9 gave the best trend.

Using the same trick described earlier for Amino Acid Identity, a "core" label can be assigned to a column in linear time; therefore we can evaluate this naïve Core Column Density in O(mn) time. We will later develop a more sophisticated method for predicting core columns in an alignment in Chapter 9.

3.2.12 Information Content

This feature measures the average entropy of the alignment (Hertz and Stormo, 1999), by summing over the columns the log of the ratio of the abundance of a specific amino acid in the column over the background distribution for that amino acid, normalized by the number of columns in the alignment. Amino-acid frequencies can be calculated in linear time for each column, and background frequencies for each amino acid can also be found in one pass over the whole alignment. We then evaluate the information content of each column by making one pass over the frequencies for each element in the alphabet. This computes Information Content for an input alignment in O(mn + n|Σ|) time for alphabet Σ. Once again, if we assume the alphabet size is constant, this running time is O(mn). We considered the standard reduced alphabets with 6, 10, 15, and 20 equivalence classes, and used the 10-class alphabet, which gave the strongest correlation with true accuracy.
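The column-counting trick shared by Amino Acid Identity, Average Substitution Score, and Core Column Density can be sketched as below; the function name and the aa_class mapping are our own illustrative choices.

```python
from collections import Counter

def amino_acid_identity(alignment, aa_class):
    """Tally class frequencies per column, then count same-class
    residue pairs from the tallies instead of scanning all row pairs.
    aa_class maps an amino acid to its 10-class equivalency class."""
    same = total = 0
    n = len(alignment[0])
    for r in range(n):
        counts = Counter(aa_class[row[r]] for row in alignment
                         if row[r] != '-')
        k = sum(counts.values())             # residues in this column
        total += k * (k - 1) // 2            # all residue pairs
        same += sum(c * (c - 1) // 2 for c in counts.values())
    return same / total if total else 0.0
```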
3.2.13 Results

Figure 3.1 shows the correlation of each of the features described above with true accuracy. We describe this set of benchmarks and testing procedures in full detail in Chapter 6. Briefly, we collected a total of 861 benchmark alignments from the BENCH suite of Edgar (2009), which consists of 759 benchmarks, supplemented by a selection of 102 benchmarks from the PALI suite of Balaji et al. (2001). For each of these benchmarks we used the Opal aligner to produce a new alignment of the sequences under its default parameter setting. For each of these computed alignments, we know the underlying correct alignment, so we can evaluate the true accuracy of the computed alignment. We also calculated each of the 12 feature values for each of these alignments. The figure shows the correlation of each of the features with true accuracy, where each of the 861 circles in each plot is one benchmark, with its true accuracy on the horizontal axis and its feature function value on the vertical axis. Notice that while all of the features generally have a positive trend with true accuracy, the ranges of the feature values differ substantially.

Figure 3.1: Correlation of features with true accuracy. [Scatter plots, one per feature, not reproduced here.] The scatter plots show values of all twelve feature functions considered for the Facet estimator, on the 861 benchmarks used for testing (described in Chapter 6), using the default parameter setting for the Opal aligner. Each circle is one benchmark alignment, plotted with its true accuracy on the horizontal axis (since we know the reference, we can calculate true accuracy) and its feature value on the vertical axis. The line shown is a weighted least-squares fit, where the weight for a benchmark is calculated to remove the bias towards benchmarks with high accuracy under the default parameter settings. The precise form of the weighting is described in detail in Chapter 6.

This comprises the set of features considered for constructing the Facet accuracy estimator. The next section describes the software implementing the Facet estimator.

3.3 Software

We implemented the Facet estimator in the Java programming language. The software can be used in one of three ways:

(1) Command line – For a given set of sequences, the user must first run PSIPRED to predict the secondary structure for each unaligned sequence. The scripts provided put these predictions in a format that is readable by Facet. Facet can then be invoked for each alignment in FASTA or CLUSTAL format, and the result is returned via standard out. Details of running Facet on alignments of the same sequences are shown in Figure 3.3. The Facet coefficients can also be changed via command-line options (not shown in the figure), which override the default feature coefficients.

(2) Application Programming Interface – Within another Java application, a user can obtain the Facet score of an alignment by first creating a FacetAlignment object, which encapsulates the sequence alignment information as well as the secondary structure predictions. The user can then invoke Facet through a method call. An example of how to use Facet through the API is shown in Figure 3.2. The Facet coefficients can be changed in the API via a second argument to the method call (not shown in the figure).

(3) Within Opal – When creating an alignment using the Opal aligner, a user can pass the alignment structure via the --facet structure command-line option.
The structure format given to Facet embedded within Opal is different from the format for stand-alone Facet. When the structure is given to Opal, the Facet score is printed to standard out, or can be included in the output filename. Section 6.6.1 gives details on the changes made to Opal related to Facet and parameter advising.

Figure 3.2: Using the Facet tool API. [Code listing not reproduced here.]

Figure 3.3: Using the Facet tool on the command line. The session predicts structure once, then prints Facet values on standard out:

$ ./PSIPRED_wrapper.pl seqs.fa > seqs_struc 2> seqs_prob
$ ./FACET.sh align1.fa seqs_struc seqs_prob
align1.fa  0.565
$ ./FACET.sh align2.fa seqs_struc seqs_prob
align2.fa  0.868
$ ./FACET.sh align3.fa seqs_struc seqs_prob
align3.fa  0.342

All three implementations of Facet can be found on the Facet website at http://facet.cs.arizona.edu/ (see Figure 3.4). Along with Facet, and links to the Opal software, there are videos explaining our methodology, and supplementary data used in our experiments.

Figure 3.4: The Facet website. [Screenshot not reproduced here.]

Summary

In this chapter, we have described several easily-computable feature functions for estimating alignment accuracy. Using these features in the framework described in Chapter 2 yields our new accuracy estimator Facet. We later give the coefficients for the feature functions, when trained on example alignments from benchmarks with known reference alignments, in Chapter 6.

CHAPTER 4
The Optimal Advisor Problem

Overview

In this chapter, we define the problem of constructing an optimal advisor: finding both the estimator coefficients and the advisor set that give the highest average advising accuracy. We can also restrict this problem to just finding optimal estimator coefficients for a given advisor set, or finding an optimal advisor set for a given estimator. The optimal advisor problem is NP-complete (as are the restricted optimal estimator and optimal advisor set problems). This chapter was adapted from portions of previous publications (DeBlasio and Kececioglu, 2014a, 2016).

4.1 Introduction

A parameter advisor has two components: (i) the advisor estimator, which ranks the alternate alignments that the advisor will choose among; and (ii) the advisor set, which should be small but still provide, for each input, at least one good alternate alignment that the advisor can choose. These two components are very much interdependent.
A setting of estimator coefficients may work well for one advisor set, but may not be able to distinguish accurate alignments for another. Similarly, for a given advisor set, one setting of estimator coefficients may work well while another may not. An optimal advisor is both an advisor estimator and an advisor set that together produce the highest average advising accuracy on a collection of benchmarks.

In this chapter, we consider the problem of constructing an optimal advisor. We also discuss restrictions of this problem to finding an optimal advisor set for a given estimator, or an optimal estimator for a given advisor set. All three versions of the problem are NP-complete.

4.2 Learning an optimal advisor

We now define the computational problem of learning an optimal advisor. The problem has several variations, depending on whether the advisor's estimator or set of parameter choices is fixed:

(a) simultaneously finding both the best set and estimator,

(b) finding the best set of parameter choices to use with a given estimator, and

(c) finding the best estimator to use with a given set of parameter choices.

We assume throughout that the features used by the advisor's estimator are given and fixed. From a machine learning perspective, the problem formulations find an advisor that has optimal accuracy on a collection of training data. The underlying training data is

• a suite of benchmarks, where each benchmark B_i in the suite consists of a set of sequences to align, together with a reference alignment R_i for these sequences that represents their "correct" alignment, and

• a collection of alternate alignments of these benchmarks, where each alternate alignment A_ij results from aligning the sequences in benchmark i using a parameter choice j that is drawn from a given universe U of parameter choices.

Here a parameter choice is an assignment of values to all the parameters of an aligner that may be varied when computing an alignment. Typically an aligner has multiple parameters whose values can be specified, such as the substitution scoring matrix and gap penalties for its alignment scoring function. We represent a parameter choice by a vector whose components assign values to all these parameters. (So for protein sequence alignment, a typical parameter choice is a 3-vector specifying the (i) substitution matrix, (ii) gap-open penalty, and (iii) gap-extension penalty.) The universe U of parameter choices specifies all the possible parameter choices that might be used for advising. A particular advisor will use a subset P ⊆ U of parameter choices that it considers when advising. In the special case |P| = 1, the single parameter choice in set P that is available to the advisor is effectively a default parameter choice for the aligner.

Note that since a reference alignment R_i is known for each benchmark B_i, the true accuracy of each alternate alignment A_ij for benchmark B_i can be measured by comparing alignment A_ij to the reference R_i. Thus for a set P ⊆ U of parameter choices available to an advisor, the most accurate parameter choice j ∈ P to use on benchmark B_i can be determined in principle by comparing the resulting alternate alignments A_ij to R_i and picking the one of highest true accuracy. When aligning sequences in practice, a reference alignment is not known, so an advisor will instead use its estimator to pick the parameter choice j ∈ P whose resulting alignment A_ij has highest estimated accuracy.
In the problem formulations below, this underlying training data is summarized by

• the accuracies a_ij of the alternate alignments A_ij, where accuracy a_ij measures how well the computed alignment A_ij agrees with the reference alignment R_i, and

• the feature vectors F_ij of these alignments A_ij, where each vector F_ij lists the values for A_ij of the estimator's feature functions.

As we have defined in Chapter 2, for an estimator that uses t feature functions, each feature vector F_ij is a vector of t feature values,

F_{ij} = (g_{ij1} \; g_{ij2} \; \cdots \; g_{ijt}),

where each feature value g_{ijh} is a real number satisfying 0 ≤ g_{ijh} ≤ 1. Feature vector F_ij is used by the advisor to evaluate its accuracy estimator E on alignment A_ij. Let the coefficients of the estimator E be given by vector c = (c_1 c_2 ··· c_t). Then the value of accuracy estimator E on alignment A_ij is given by the inner product

E_c(A_{ij}) = c \cdot F_{ij} = \sum_{1 \le h \le t} c_h \, g_{ijh}. \qquad (4.1)

Informally, the objective function that the problem formulations seek to maximize is the average accuracy achieved by the advisor across the suite of benchmarks in the training set. The benchmarks may be nonuniformly weighted in this average to correct for bias in the training data, such as the over-representation of easy benchmarks that typically occurs in standard benchmark suites.

A subtle issue that the formulations must take into account is that when an advisor is selecting a parameter choice via its estimator, there can be ties in the estimator value, so there may not be a unique parameter choice that maximizes the estimator. In this situation, we assume that the advisor randomly selects a parameter choice among those of maximum estimator value. Given this randomness, we measure the performance of an advisor on an input by its expected accuracy on that input.

Furthermore, in practice any accuracy estimator inherently has error (otherwise it would be equivalent to true accuracy), and a robust formulation for learning an advisor should be tolerant of error in the estimator. Let ε ≥ 0 be a given error tolerance, and P be the set of parameter choices used by an advisor. We define the set O_i(P) of parameter choices that the advisor could potentially output for benchmark B_i as

O_i(P) = \bigl\{ j \in P : E_c(A_{ij}) \ge e_i^* - \epsilon \bigr\}, \qquad (4.2)

where e_i^* := \max \{ E_c(A_{i\tilde{j}}) : \tilde{j} \in P \} is the maximum estimator value on benchmark B_i. The parameter choice output by an advisor on benchmark B_i is selected uniformly at random among those in O_i(P). Note that when ε = 0, set O_i(P) is simply the set of parameter choices that are tied for maximizing the estimator. A nonzero tolerance ε > 0 can aid in learning an advisor that has improved generalization to testing data.
We write the objective function as f(P ) without subscript c when the estimator coefficient vector c is fixed or understood from context. We use the following argmin and argmax notation. For a function f and a subset S of its domain, argminf(x): x ∈ S denotes the set of all elements of S that achieve the minimum value of f, or in other words, the set of minimizers of f on S. Similarly, argmax is used to denote the set of maximizers. 4.2.1 Optimal Advisor We first define the problem of finding an optimal advisor: that is, simultaneously finding an advisor estimator and an advisor set that together yields the highest average advising accuracy. In the problem definition, • n is the number of benchmarks, and • t is the number of alignment features. 67 Set Q denotes the set of rational numbers. Definition 1 (Optimal Advisor) The Optimal Advisor problem takes as in- put • cardinality bound k ≥ 1, • universe U of parameter choices, • weights wi ∈ Q on the training benchmarks Bi, where each wi ≥ 0 and P i wi = 1, • accuracies aij ∈ Q of the alternate alignments Aij, where each 0≤aij ≤1, t • feature vectors Fij ∈ Q for the alternate alignments Aij, where each fea- ture value gijh in vector Fij satisfies 0≤gijh ≤1, and • error tolerance ∈ Q where ≥ 0. The output is t • estimator coefficient vector c ∈ Q , where each coefficient ci in vector c P satisfies ci ≥ 0 and 1≤i≤t ci = 1, and • set P ⊆ U of parameter choices for the advisor, with |P | ≤ k, that maximizes objective fc(P ) given by equation (4.4). 2 4.2.2 Advisor Set We can restrict the optimal advisor problem to finding an optimal set of parameter choices for advising with a given estimator. Definition 2 (Advisor Set) The Advisor Set problem takes as input • weights wi on the benchmarks, • accuracies aij of the alternate alignments, • feature vectors Fij for the alternate alignments, t • coefficients c = (c1 ··· ct) ∈ Q for the estimator, where each ci ≥ 0 and P 1≤i≤t ci = 1, and 68 • error tolerance . The output is • advisor set P that maximizes objective fc(P ) given by equation (4.4). 2 4.2.3 Advisor Estimator Similarly, we can define the problem of finding an optimal estimator where the set of parameter choices for the advisor is now given. Definition 3 (Advisor Estimator) The Advisor Estimator problem takes as in- put • weights wi on the benchmarks, • accuracies aij of the alternate alignments, • feature vectors Fij for the alternate alignments, • advisor set P , and • error tolerance . The output is t • coefficients c = (c1 ··· ct) ∈ Q for the estimator, where each ci ≥ 0 and P 1≤i≤t ci = 1, that maximize objective fc(P ) given by equation (4.4). 2 For Advisor Estimator, resolving ties to pick the worst among the parameter choices that maximize the estimator, as in the definition of A(i) in equation (4.4), is crucial, as otherwise the problem formulation becomes degenerate. If the advisor is free to pick any of the tied parameter choices, it can pick the tied one with highest true accuracy; if this is allowed, the optimal estimator c∗ that is found by the formulation would degenerate to the flattest possible estimator that evaluates all parameter choices as equally good (since the degenerate flat estimator would 69 make the advisor appear to match the performance of a perfect oracle on set P ). Resolving ties in the worst-case way eliminates this degeneracy. 
4.3 Complexity of learning optimal advisors

We now prove that Advisor Set, the problem of learning an optimal parameter set for an advisor (given by Definition 2 of Section 4.2), is NP-complete, and hence is unlikely to be efficiently solvable in the worst case. As is standard, we prove NP-completeness for a decision version of this optimization problem, which is a version whose output is a yes/no answer (as opposed to a solution that optimizes an objective function). The decision version of Advisor Set has an additional input $\ell \in \mathbb{Q}$, which lower bounds the objective function. The decision problem is to determine, for the input instance $(k, U, w_i, a_{ij}, F_{ij}, c, \epsilon, \ell)$, whether or not there exists a set $P \subseteq U$ with $|P| \le k$ for which the objective function has value $f_c(P) \ge \ell$.

Theorem 3 (NP-completeness of Advisor Set) The decision version of Advisor Set is NP-complete.

Proof We use a reduction from the Dominating Set problem, which is NP-complete (Garey and Johnson, 1979, problem GT2). The input to Dominating Set is an undirected graph $G = (V, E)$ and an integer $k$, and the problem is to decide whether or not $G$ contains a vertex subset $S \subseteq V$ with $|S| \le k$ such that every vertex in $V$ is in $S$ or is adjacent to a vertex in $S$. Such a set $S$ is called a dominating set for $G$.

Given an instance $(G, k)$ of Dominating Set, we construct an instance $(k, U, w_i, a_{ij}, F_{ij}, c, \epsilon, \ell)$ of the decision version of Advisor Set as follows. For the cardinality bound use the same value $k$; for the number of benchmarks take $n = |V|$, and index the universe of parameter choices by $U = \{1, \ldots, n\}$; have only one feature ($t = 1$) with estimator coefficients $c = (1)$; use weights $w_i = 1/n$, error tolerance $\epsilon = 0$, and lower bound $\ell = 1$. Let the vertices of $G$ be indexed $V = \{1, \ldots, n\}$. (So both the set of benchmarks and the universe of parameter choices in essence correspond to the set of vertices $V$ of graph $G$.) Define the neighborhood of vertex $i$ in $G$ to be $N(i) := \{ j : (i,j) \in E \} \cup \{ i \}$, which is the set of vertices adjacent to $i$, including $i$ itself. For the alternate alignment accuracies, take $a_{ij} = 1$ when $j \in N(i)$, and $a_{ij} = 0$ otherwise. For the feature vectors, assign $F_{ij} = (a_{ij})$.

We claim $(G, k)$ is a yes-instance of Dominating Set iff $(k, U, w_i, a_{ij}, F_{ij}, c, \epsilon, \ell)$ is a yes-instance of Advisor Set. To show the forward implication, suppose $G$ has a dominating set $S \subseteq V$ with $|S| \le k$, and consider the advisor set $P = S$. With the above construction, for every benchmark, set $O_i(P) = N(i) \cap S$, which is nonempty (since $S$ is a dominating set for $G$). So $A_i(P) = 1$ for all benchmarks. Thus for this advisor set $P$, the objective function has value $f_c(P) = 1 \ge \ell$. For the reverse implication, suppose advisor set $P$ achieves objective value $\ell = 1$. Since $P$ achieves value 1, for every benchmark it must be that $A_i(P) = 1$. By construction of the $a_{ij}$, this implies that in $G$ every vertex $i \in V$ is in $P$ or is adjacent to a vertex in $P$. Thus set $S = P$, which satisfies $|S| \le k$, is a dominating set for $G$.

This reduction shows Advisor Set is NP-hard, as the instance of Advisor Set can be constructed in polynomial time. Furthermore, it is in NP, as we can nondeterministically guess an advisor set $P$, and then check in polynomial time whether its cardinality is at most $k$ and its objective value is at least $\ell$. Thus Advisor Set is NP-complete. ∎

Note that the proof of Theorem 3 shows Advisor Set is NP-complete even for the special case of a single feature, error tolerance zero, binary accuracies and feature values, and uniformly weighted benchmarks.
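The construction in this proof is mechanical enough to state as code. The sketch below (the helper name and graph representation are our own) builds the Advisor Set instance from a graph with vertices $0, \ldots, n{-}1$; on the resulting instance (with $\epsilon = 0$ and $\ell = 1$), an advisor set $P$ attains objective value 1 precisely when $P$ is a dominating set.

```python
import numpy as np

def advisor_set_instance(n, edges, k):
    """Build the Advisor Set instance of the reduction from Dominating Set.
    Vertices are 0, ..., n-1; edges is a list of (u, v) pairs."""
    N = {i: {i} for i in range(n)}            # closed neighborhoods N(i)
    for u, v in edges:
        N[u].add(v)
        N[v].add(u)
    # accuracy a_ij = 1 iff j lies in the closed neighborhood N(i)
    a = np.array([[1.0 if j in N[i] else 0.0 for j in range(n)]
                  for i in range(n)])
    F = a[:, :, None]         # a single feature (t = 1), so F_ij = (a_ij)
    w = np.full(n, 1.0 / n)   # uniform benchmark weights
    c = np.array([1.0])       # the single estimator coefficient
    return a, F, w, c, k      # used with eps = 0 and lower bound l = 1
```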
In general, we would like to find an optimal parameter advisor, which requires simultaneously finding both the best possible parameter set and the best possible accuracy estimator, as in the Optimal Advisor problem of Definition 1. The decision version of Optimal Advisor, similar to the decision version of Advisor Set, has an additional input $\ell$ that lower bounds the objective function. We next prove that Optimal Advisor is NP-complete. While its NP-hardness follows from Advisor Set, the difficulty is in proving that this more general problem is still in the class NP.

Theorem 4 (NP-completeness of Optimal Advisor) The decision version of Optimal Advisor is NP-complete.

Proof The proof of Theorem 3 shows Advisor Set remains NP-hard for the special case of a single feature. To prove the decision version of Optimal Advisor is NP-hard, we use restriction: we simply reduce Advisor Set with a single feature to Optimal Advisor (reusing the instance of Advisor Set for Optimal Advisor). On this restricted input with $t = 1$, Optimal Advisor is equivalent to Advisor Set, so Optimal Advisor is also NP-hard.

We now show the general Optimal Advisor problem is in class NP. To decide whether its input is a yes-instance, after first nondeterministically guessing parameter set $P \subseteq U$ with $|P| \le k$, we then make for each benchmark $i$ a nondeterministic guess for its sets $O_i(P)$ and $M_i(P) := \mathrm{argmax}\{ c \cdot F_{ij} : j \in P \}$, without yet knowing the coefficient vector $c$. Call $\tilde{O}_i$ the guess for set $O_i(P)$, and $\tilde{M}_i$ the guess for set $M_i(P)$, where $\tilde{M}_i \subseteq \tilde{O}_i \subseteq P$. To check whether a coefficient vector $c$ exists that satisfies $O_i(P) = \tilde{O}_i$ and $M_i(P) = \tilde{M}_i$, we construct the following linear program with variables $c = (c_1, \ldots, c_t)$ and $\xi$. The objective function for the linear program is to maximize the value of variable $\xi$. The constraints are: $c_h \ge 0$ and $\sum_{1 \le h \le t} c_h = 1$; $0 \le \xi \le 1$; for all benchmarks $i$ and all parameter choices $j^* \in \tilde{M}_i$ and $j \in P - \tilde{M}_i$,
$$c \cdot F_{ij^*} \;\ge\; c \cdot F_{ij} + \xi\,;$$
for all benchmarks $i$ and all parameter choices $j, \tilde{\jmath} \in \tilde{M}_i$,
$$c \cdot F_{ij} \;=\; c \cdot F_{i\tilde{\jmath}}\,;$$
and for all benchmarks $i$ and all parameter choices $j^* \in \tilde{M}_i$ and $j \in \tilde{O}_i$,
$$c \cdot F_{ij} \;\ge\; c \cdot F_{ij^*} - \epsilon\,.$$
This linear program can be solved in polynomial time. If it has a feasible solution, then it has an optimal solution (as its objective function is bounded). In an optimal solution $(c^*, \xi^*)$ we check whether $\xi^* > 0$. If this condition holds, the guessed sets $\tilde{O}_i$, $\tilde{M}_i$ correspond to actual sets $O_i(P)$ and $M_i(P)$ for an estimator. For each benchmark $i$, we then evaluate $A_i(P)$, and check whether $\sum_i w_i A_i(P) \ge \ell$. Note that after guessing the sets $P$, $\tilde{O}_i$, and $\tilde{M}_i$, the rest of the computation runs in polynomial time. Thus Optimal Advisor is in NP. ∎

Theorem 5 (NP-completeness of Advisor Estimator) The decision version of Advisor Estimator is NP-complete.

Proof To show Advisor Estimator is NP-hard, we use a reduction from Dominating Set similar to the one used in proving Theorem 3. Given an instance $(G, k)$ of Dominating Set, we construct an instance $(w_i, a_{ij}, F_{ij}, P, \epsilon, \ell, \delta)$ of the decision version of Advisor Estimator, where $\delta$ is a precision lower bound on the nonzero estimator coefficients. We use the same number of benchmarks and parameter choices $n = |V|$, weights $w_i = 1/n$, error tolerance $\epsilon = 0$, accuracies $a_{ij}$ defined as before, and lower bound $\ell = 1$, as we did for Advisor Set. For the set $P$ of parameter choices for the advisor, we take $P = \{1, \ldots, n\}$. The number of features is now $t = n$.
(So in essence the set of benchmarks, the advisor set, and the set of features all coincide with the set of vertices $V$.) For the feature vectors we take
$$F_{ij} \;=\; (0 \,\cdots\, 0 \;\; a_{ij} \;\; 0 \,\cdots\, 0),$$
which has value $a_{ij}$ at location $j$. This is equivalent to a feature vector $F_{ij}$ that is all zeroes, except for a 1 at location $j$ if $j = i$ or vertex $j$ is adjacent to vertex $i$ in $G$. For the precision lower bound we take $\delta = 1/k$. Note that this instance of Advisor Estimator can be constructed in polynomial time.

We claim that $(G, k)$ is a yes-instance of Dominating Set iff $(w_i, a_{ij}, F_{ij}, P, \epsilon, \ell, \delta)$ is a yes-instance of Advisor Estimator. To show the reverse implication, first notice that with the chosen $\delta$, coefficient vector $c$ can have at most $k$ nonzero coefficients (since if $c$ had more than $k$ nonzero coefficients, then $\sum_i c_i > k\,\delta = 1$, a contradiction). Let feature subset $S \subseteq V$ be all indices $i$ at which $c_i > 0$. We call $S$ the support of $c$, and by our prior observation $|S| \le k$. By construction of the feature vectors, $c \cdot F_{ij} = c_j$ if $j \in N(i)$, and $c \cdot F_{ij} = 0$ otherwise. This further implies that $A_i(P) = 1$ if $S \cap N(i)$ is nonempty, and $A_i(P) = 0$ otherwise. So if there exists a coefficient vector $c$ for which the objective function achieves value 1, then the support $S$ of $c$ gives a vertex subset $S \subseteq V$ that is a dominating set for $G$. For the forward implication, given a dominating set $S \subseteq V$ for $G$, take for the estimator coefficients $c_i = 1/|S|$ if $i \in S$, and $c_i = 0$ otherwise. The nonzero coefficients of this vector $c$ have value at least $\delta$, and by the same reasoning as above, each $A_i(P) = 1$ since $S$ is a dominating set. So the estimator given by this vector $c$ yields an advisor that achieves objective value 1, which proves the claim.

We can show Advisor Estimator is in class NP using the same construction used to prove Optimal Advisor is in class NP. For each benchmark we can make a nondeterministic guess for its set $\tilde{O}_i$, and compute $\tilde{M}_i$. We can then construct a linear program to determine whether these guesses are the actual sets for some estimator. The guessing and the solution of the linear program can be performed in polynomial time. Thus Advisor Estimator is in NP. ∎

Summary

In this chapter, we have formally defined the problem of finding an optimal advisor, along with the two related problems of finding an optimal advisor set and an optimal advisor estimator. We then proved that all three problems (Optimal Advisor, Advisor Set, and Advisor Estimator) are NP-complete. In the next chapter, we describe practical approaches to the Advisor Set problem, and show how to model all three problems by mixed-integer linear programming (MILP).

CHAPTER 5

Constructing Advisor Sets

Overview

In this chapter, we consider the problem of learning an optimal set of parameter choices for a parameter advisor. We consider two forms of the advisor set problem: (i) sets that are estimator-unaware (and are optimal for a perfect estimator called an oracle), and (ii) sets that are optimal for a given accuracy estimator. In this context the optimal advisor set is one that maximizes the average true accuracy of the resulting parameter advisor over a collection of training benchmarks. In Chapter 4, we proved that learning an optimal set for an advisor is NP-complete. Here we model the problem of finding optimal advisor sets as an integer linear program (ILP). We find this ILP cannot be solved to optimality in practice, so we go on to develop an efficient approximation algorithm for this problem that finds near-optimal sets, and prove a tight bound on its approximation ratio.
This chapter was adapted from portions of previous publications (DeBlasio and Kececioglu, 2014a, 2015).

5.1 Introduction

In Chapter 4, we introduced the advisor set problem and showed that it is NP-complete. In this chapter, we show how to model the problem of finding optimal advisor sets, and more generally finding an optimal advisor, using integer linear programming. We have found that in practice these integer linear programming models are not solvable to optimality even on very small inputs. Consequently, Section 5.3 develops an efficient approximation algorithm that is guaranteed to find near-optimal advisor sets for a given estimator.

In this chapter we consider how to learn sets of parameter choices for a realistic advisor, where these sets are tailored to the actual estimator used by the advisor (as opposed to finding parameter sets for a perfect but unattainable oracle advisor). While learning such sets that are optimal is NP-complete, there is an efficient greedy approximation algorithm for this learning problem, and we derive a tight bound on its worst-case approximation ratio. Experiments show that the greedy parameter sets found by this approximation algorithm, using Facet, TCS, MOS, PredSP, or GUIDANCE as the advisor's accuracy estimator, outperform optimal oracle sets at all cardinalities. Furthermore, on the training data, for some estimators these suboptimal greedy sets perform surprisingly close to optimal exact sets found by exhaustive search. Moreover, these greedy sets actually generalize better than exact sets. As a consequence, on testing data, for some estimators the greedy sets output by the approximation algorithm can actually give superior performance to exact sets for parameter advising.

5.2 Constructing optimal advisors by integer linear programming

We now show how to construct optimal advisors by integer linear programming. Recall that an integer linear program (ILP) is an optimization problem with a collection of integer-valued variables, an objective function to optimize that is linear in these variables, and constraints that are linear inequalities in the variables. Our formulations of Advisor Estimator and Optimal Advisor are actually so-called mixed-integer programs, where some of the variables are real-valued, while Advisor Set has all integer variables.

The integer linear programming formulations we give below actually model a more general version of the advising problems. The advising problems in Section 4.2 define the advisor $A$ so that it carefully resolves ties among the parameter choices that achieve the optimum value of the estimator, by picking from this tied set the parameter choice that has lowest true accuracy. (This finds a solution that has the best possible average accuracy, even in the worst case.) We extend the definition of advisor $A$ to now pick from a larger set of near-optimal parameter choices with respect to the estimator. To make this precise, for benchmark $i$, set $P$ of parameter choices, and a real value $\delta \ge 0$, let
$$M_\delta(i) \;:=\; \Bigl\{\, j \in P \;:\; c \cdot F_{ij} \,\ge\, \max_{k \in P}\, c \cdot F_{ik} \,-\, \delta \,\Bigr\}.$$
Set $M_\delta(i)$ is the set of near-optimal parameter choices that are within $\delta$ of maximizing the estimator for benchmark $i$. (So $M_\delta(i) \supseteq \mathrm{argmax}_{j \in P} \{ c \cdot F_{ij} \}$, with equality when $\delta = 0$.) We then extend the definition of the advisor $A$ in equation (4.4) for $\delta \ge 0$ to
$$A(i) \;\in\; \mathrm{argmin} \bigl\{\, a_{ij} \;:\; j \in M_\delta(i) \,\bigr\}. \qquad (5.1)$$
At $\delta = 0$, this coincides with the original problem definitions.
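Procedurally, the extended advisor of equation (5.1) is straightforward; here is a minimal Python sketch (the function name is ours), given the precomputed estimator values $e_{ij} = c \cdot F_{ij}$:

```python
def advisor_choice(a_i, e_i, P, delta):
    """Extended advisor of equation (5.1): among the parameter choices in P
    within delta of the best estimator value, return the one of lowest
    true accuracy (worst-case tie-breaking). e_i[j] = c . F_ij."""
    e_best = max(e_i[j] for j in P)
    M_delta = [j for j in P if e_i[j] >= e_best - delta]
    return min(M_delta, key=lambda j: a_i[j])
```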
The extension to $\delta > 0$ is designed to boost the generalization of optimal solutions (in other words, to find a solution that is not overfit to the training data) when we do cross-validation experiments on independent training and test sets as in Chapter 6. We give integer linear programming formulations for this extended definition of our advising problems.

5.2.1 Modeling the Advisor Set Problem

The integer linear program (ILP) for Advisor Set has three classes of variables, which all take on binary values $\{0, 1\}$. Variables $x_{ij}$, for all benchmarks $i$ and all parameter choices $j$ from the universe, encode the advisor $A$: $x_{ij} = 1$ if the advisor uses choice $j$ on benchmark $i$; otherwise, $x_{ij} = 0$. Variables $y_j$, for all parameter choices $j$ from the universe, encode the set $P$ that is found by Advisor Set: $y_j = 1$ iff $j \in P$. Variables $z_{ij}$, for all benchmarks $i$ and parameter choices $j$, encode the parameter choice in $P$ with highest estimator value for benchmark $i$: if $z_{ij} = 1$ then $j \in \mathrm{argmax}_{k \in P}\, c \cdot F_{ik}$. This argmax set may contain several choices $j$, and in this situation the ILP given below arbitrarily selects one such choice $j$ for which $z_{ij} = 1$. For convenience, the description of the ILP below also refers to the new constants $e_{ij}$, which are the estimator values of the alternate alignments $A_{ij}$: for the fixed estimator $c$ for Advisor Set, $e_{ij} = c \cdot F_{ij}$.

The objective function for the ILP is to maximize
$$\sum_i \sum_j w_i\, a_{ij}\, x_{ij}. \qquad (5.2)$$
In this function, the inner sum $\sum_j a_{ij}\, x_{ij}$ will be equal to $a_{i,A(i)}$, as the $x_{ij}$ capture the (unique) parameter choice that advisor $A$ makes for benchmark $i$. This objective is linear in the variables $x_{ij}$.

The constraints for the ILP fall into three classes. The first class ensures that variables $y_j$ encode set $P$, and variables $x_{ij}$ encode an assignment to benchmarks from $P$. The ILP has constraints
$$\sum_j y_j \;\le\; k, \qquad (5.3)$$
$$\sum_j x_{ij} \;=\; 1, \qquad (5.4)$$
$$x_{ij} \;\le\; y_j, \qquad (5.5)$$
where equation (5.4) occurs for all benchmarks $i$, and inequality (5.5) occurs for all benchmarks $i$ and all parameter choices $j$. In the above, inequality (5.3) enforces $|P| \le k$. Equations (5.4) force the advisor to select one parameter choice for every benchmark. Inequalities (5.5) enforce that the advisor's selections must be parameter choices that are available in $P$.

The second class of constraints ensures that variables $z_{ij}$ encode a parameter choice from $P$ with highest estimator value. To enforce that the $z_{ij}$ encode an assignment to benchmarks from $P$,
$$\sum_j z_{ij} \;=\; 1, \qquad (5.6)$$
$$z_{ij} \;\le\; y_j, \qquad (5.7)$$
where equation (5.6) occurs for all $i$, and inequality (5.7) occurs for all $i$ and $j$. (In general, the $z_{ij}$ will differ from the $x_{ij}$, as the advisor does not necessarily select the parameter choice with highest estimator value.) For all benchmarks $i$, and all parameter choices $j$ and $k$ from the universe with $e_{ik} < e_{ij}$, we have the inequality
$$z_{ik} + y_j \;\le\; 1. \qquad (5.8)$$
Inequalities (5.8) ensure that if a parameter choice $k$ is identified as having the highest estimator value for benchmark $i$ by $z_{ik} = 1$, there must not be any other parameter choice $j$ in $P$ that has higher estimator value on $i$. Note that the constants $e_{ij}$ are known in advance, so inequalities (5.8) can be enumerated by sorting all $j$ by their estimator value $e_{ij}$, and collecting the ordered pairs $(k, j)$ from this sorted list.
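For instance, the pair enumeration for inequalities (5.8) on one benchmark might look as follows (a sketch with our own names; a real model would emit one ILP row per pair):

```python
def dominance_pairs(e_i, m):
    """Enumerate, for one benchmark, the ordered pairs (k, j) with
    e_ik < e_ij that each need an inequality (5.8): z_ik + y_j <= 1."""
    order = sorted(range(m), key=lambda j: e_i[j])
    return [(k, j)
            for idx, k in enumerate(order)
            for j in order[idx + 1:]
            if e_i[k] < e_i[j]]          # strict: tied values get no row
```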
The third class of constraints ensures that the parameter choices $x_{ij}$ selected by the advisor correspond to the definition in equation (5.1): namely, among the parameter choices in $P$ that are within $\delta$ of the highest estimator value from $P$ for benchmark $i$, the parameter choice of lowest accuracy is selected. For all benchmarks $i$, all parameter choices $j$, and all parameter choices $k$ and $h$ with both $e_{ik}, e_{ih} \in [\,e_{ij} - \delta,\; e_{ij}\,]$ and $a_{ih} < a_{ik}$, we have the inequality
$$x_{ik} + y_h + z_{ij} \;\le\; 2. \qquad (5.9)$$
Inequalities (5.9) ensure that for the parameter choices that are within $\delta$ of the highest estimator value for benchmark $i$, the advisor only selects parameter choice $k$ for $i$ if $k$ is within $\delta$ of the highest and there is no other parameter choice available in $P$ within $\delta$ of the highest that has lower accuracy. Finally, for all benchmarks $i$ and all parameter choices $j$ and $k$ with $e_{ik} < e_{ij} - \delta$, we have the inequality
$$x_{ik} + y_j \;\le\; 1. \qquad (5.10)$$
Inequalities (5.10) enforce that the advisor cannot select parameter choice $k$ for $i$ if the estimator value for $k$ is more than $\delta$ below that of an available parameter choice in $P$. (Inequalities (5.9) capture the requirements on parameter choices that are within $\delta$ of the highest, while inequalities (5.10) capture the requirements on parameter choices that are more than $\delta$ below the highest.)

A truly remarkable aspect of this formulation is that the ILP is able to capture all the subtle conditions the advisor must satisfy through its static set of inequalities (listed at "compile time"), without knowing when the ILP is written what the optimal set $P$ is, and hence without knowing what parameter choices in $P$ have the highest estimator value for each benchmark.

To summarize, the ILP for Advisor Set has binary variables $x_{ij}$, $y_j$, and $z_{ij}$, and inequalities (5.3) through (5.10). For $n$ benchmarks and a universe of $m$ parameter choices, this is $O(mn)$ variables, and $O(m^2 n + m\,\widetilde{m}^{\,2} n)$ constraints, where $\widetilde{m}$ is the maximum number of parameter choices that are within $\delta$ in estimator value of any given parameter choice. For small $\delta$, typically $\widetilde{m} \ll m$, which leads to $O(m^2 n)$ constraints in practice. In the worst case, though, the ILP has $\Theta(m^3 n)$ constraints. We also have an alternate ILP formulation that adds $O(n)$ real-valued variables to capture the highest estimator value from $P$ for each benchmark $i$, and only has $O(m^2 n)$ total constraints (so fewer constraints than the above ILP in the worst case), but its objective function is more involved, and attempts to solve the alternate ILP suffered from numerical issues.

5.2.2 Finding optimal Oracle Sets

While we would like to find advisor sets that are optimal for the actual accuracy estimator used by an advisor, in practice finding such an optimal set seems very hard. We can, however, in practice find optimal advisor sets that are estimator-oblivious, in the sense that the set-finding algorithm is unaware of the mistakes made by the advisor due to using an accuracy estimator rather than knowing true accuracy. More precisely, we can find an optimal advisor set for an advisor whose "estimator" is the true accuracy of an alignment. As mentioned previously, we call such an advisor an oracle. To find an optimal oracle set, we use the same objective function described in equation (5.2), together with constraints (5.3) through (5.5), which ensure an alignment is only selected if the parameter choice used to generate it is chosen. Solving the ILP with only these constraints yields an optimal advisor set for the oracle advisor; a sketch of this reduced model appears below.
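A minimal sketch of this reduced oracle-set model, written here with the PuLP modeling library (our choice of library and names; the dissertation does not prescribe one):

```python
import pulp

def optimal_oracle_set(a, w, k):
    """ILP for an optimal oracle set: objective (5.2) subject only to
    constraints (5.3)-(5.5). a[i][j]: true accuracy of alignment A_ij;
    w[i]: benchmark weight; k: cardinality bound. Returns the chosen set."""
    n, m = len(a), len(a[0])
    prob = pulp.LpProblem("oracle_set", pulp.LpMaximize)
    x = pulp.LpVariable.dicts("x", [(i, j) for i in range(n) for j in range(m)],
                              cat="Binary")
    y = pulp.LpVariable.dicts("y", range(m), cat="Binary")
    # objective (5.2): weighted true accuracy of the selected alignments
    prob += pulp.lpSum(w[i] * a[i][j] * x[i, j]
                       for i in range(n) for j in range(m))
    prob += pulp.lpSum(y[j] for j in range(m)) <= k           # (5.3)
    for i in range(n):
        prob += pulp.lpSum(x[i, j] for j in range(m)) == 1    # (5.4)
        for j in range(m):
            prob += x[i, j] <= y[j]                           # (5.5)
    prob.solve()
    return [j for j in range(m) if pulp.value(y[j]) > 0.5]
```

Because the model maximizes accuracy directly, the solver's choice of $x_{ij}$ automatically picks the most accurate alignment among the chosen parameter choices, which is exactly the behavior of the oracle.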
Note that ties in accuracy do not need to be resolved, as any alignment with a tied "estimator" value also has a tied accuracy value, and thus would not affect the objective value. With the reduced number of variables and constraints in this modified ILP, we are able to find optimal oracle sets in practice even for large set cardinalities.

5.2.3 Modeling the Advisor Estimator Problem

We now describe how to modify the above ILP for Advisor Set to obtain an ILP for Advisor Estimator. The modifications must address two issues: (a) the set $P$ is now fixed; and (b) the estimator $c$ is no longer fixed, so the enumeration of inequalities cannot exploit concrete estimator values.

We can easily handle that set $P$ is now part of the input by appropriately fixing the variables $y_j$ with new equations: for all $j \in P$ add equation $y_j = 1$, and for all $j \notin P$ add $y_j = 0$. To find the optimal estimator $c$, we add $t$ new real-valued variables $c_1, \ldots, c_t$ with the constraints $\sum_h c_h = 1$ and $c_h \ge 0$. We also add two new classes of binary-valued integer variables: (a) variable $s_{ij}$, for all benchmarks $i$ and all parameter choices $j$, which has value 1 when the estimator value of parameter choice $j$ on benchmark $i$, namely $c \cdot F_{ij}$, is within $\delta$ of the highest estimator value for $i$; and (b) variable $t_{ijk}$, for all benchmarks $i$ and all parameter choices $j$ and $k$, which has value 1 when $c \cdot F_{ij} > c \cdot F_{ik} - \delta$.

To set the values of the binary variables $t_{ijk}$, for all $i, j, k$ we add the inequalities
$$t_{ijk} \;\ge\; c \cdot F_{ij} \,-\, c \cdot F_{ik} \,+\, \delta. \qquad (5.11)$$
This inequality is linear in the variables $c_1, \ldots, c_t$. Note that the value of the estimator $c \cdot F_{ij}$ will always be in the range $[0, 1]$, and we are assuming that the constant $\delta \ll 1$. To set the values of the binary variables $s_{ij}$, for all $i, j, k$ we add the inequalities
$$s_{ij} \;\ge\; t_{ijk} + t_{ikj} + z_{ik} - 2. \qquad (5.12)$$
While the ILP only has to capture relationships between parameter choices that are in set $P$, we do not constrain the variables $s_{ij}$ and $t_{ijk}$ for parameter choices outside $P$ to be 0, but allow the ILP to set them to 1 if needed for a feasible solution.

We now use the variables $s_{ij}$ and $t_{ijk}$ to express the relationships in the former inequalities (5.8) through (5.10). We replace inequality (5.8) by the following inequality over all $i, j, k$,
$$z_{ij} + y_k \;\le\; 2 \,-\, \bigl( c \cdot F_{ik} \,-\, c \cdot F_{ij} \bigr). \qquad (5.13)$$
We replace inequality (5.9) by the following inequality over all $i, j, k$ with $a_{ik} < a_{ij}$,
$$x_{ij} + y_k + s_{ij} + s_{ik} \;\le\; 3. \qquad (5.14)$$
Finally, we replace inequality (5.10) by the following inequality over all $i, j, k$,