Fast, Sensitive Protein Sequence Searches Using Iterative Pairwise Comparison of Hidden Markov Models
Dissertation for the award of the doctoral degree of the Faculty of Chemistry and Pharmacy of the Ludwig-Maximilians-Universität München

Michael Remmert, from Cologne

2011

Declaration: This dissertation was supervised by Professor Dr. Patrick Cramer in accordance with §13 para. 3 or 4 of the doctoral degree regulations of 29 January 1998 (as amended by the sixth amending statute of 16 August 2010).

Affirmation on honor: This dissertation was prepared independently and without unauthorized help.

Munich, 16 September 2011

Michael Remmert

Dissertation submitted on: 16 September 2011
1st reviewer: Prof. Dr. Patrick Cramer
2nd reviewer: Prof. Dr. Dmitrij Frishman
Oral examination on: 23 November 2011

Acknowledgements

I owe my gratitude to all those people who have made this dissertation possible by supporting and encouraging me throughout the last years, and I can only try to acknowledge them here.

First and foremost, my deepest gratitude is to my advisor Dr. Johannes Söding for giving me the opportunity to work in his group and to contribute to many fascinating projects. Furthermore, I would like to thank Johannes for all the fruitful discussions, the constant support, and for making the last years such a great experience.

I would like to thank Prof. Dr. Patrick Cramer for being my doctoral supervisor, and Prof. Dr. Dmitrij Frishman for being my second PhD examiner. I am also very grateful to Prof. Dr. Klaus Förstemann, Prof. Dr. Roland Beckmann, Prof. Dr. Ulrike Gaul and Prof. Dr. Karl-Peter Hopfner for offering their time as members of my dissertation committee. Furthermore, I deeply appreciate the critical reading of this thesis by Dr. Johannes Söding and Theresa Niederberger.

I also would like to thank all members of the Söding and Tresch group for the great atmosphere in our office, for all their help and discussions, and for all the fun at the social activities.
Thanks to Andreas, Maria and Andy for the help in integrating their tools in this project.

Last but not least, I am deeply grateful to my parents, my sister Ulrike with Jürgen, my nieces Christiane and Caroline and all my good friends for all the support, patience, trust and encouragement.

Summary

Most sequence-based methods for protein structure or function prediction construct a multiple sequence alignment (MSA) of homologs as a first step. The standard search tool to generate multiple sequence alignments is PSI-BLAST (> 25,000 citations), an extension of BLAST to profile-sequence comparison. PSI-BLAST owes its sensitivity to its use of sequence profiles and to its iterative search scheme: significant sequence hits are added to the evolving multiple alignment, from which a sequence profile is generated for the next search iteration. HMMER3 is similar to PSI-BLAST, but uses a profile hidden Markov model (HMM) instead of a sequence profile to represent the evolving query multiple alignment. Its gain in sensitivity over PSI-BLAST comes at the cost of a three- to fourfold reduction in speed.

In my thesis work, I have developed HHblits, the first method for iterative sequence searching based on the comparison of profile HMMs. It is faster than PSI-BLAST and HMMER3, more sensitive, and constructs multiple alignments of significantly better quality. The method builds on the HHsearch algorithm for pairwise comparison of profile HMMs, to which it owes its high sensitivity and alignment quality. In parallel to HHblits, our group developed a fast clustering method that can generate a covering set of HMMs for the entire UniProt database in a few weeks' time. To speed up the search by a factor of ∼2000, I developed a prefilter that is based on a novel algorithm for profile-profile comparison designed for maximum speed.
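As an illustration of the iterative search scheme summarized above (significant hits join the evolving alignment, from which a new profile is built for the next iteration), the following is a minimal toy sketch. It is not the implementation of PSI-BLAST, HMMER3 or HHblits. It assumes ungapped sequences of equal length, a uniform background distribution, a simple log-odds profile with pseudocounts, and a raw score threshold in place of an E-value; all function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies


def build_profile(alignment, pseudocount=1.0):
    """Build per-column log-odds scores from an ungapped alignment."""
    total = len(alignment) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in ALPHABET
        })
    return profile


def score(profile, seq):
    """Sum the column log-odds scores of seq against the profile."""
    return sum(column[aa] for column, aa in zip(profile, seq))


def iterative_search(query, database, threshold, max_iterations=3):
    """Toy iterative search: rebuild the profile from all hits each round."""
    alignment = [query]
    for _ in range(max_iterations):
        profile = build_profile(alignment)
        new_hits = [s for s in database
                    if s not in alignment and score(profile, s) >= threshold]
        if not new_hits:  # converged: no new significant hits
            break
        alignment.extend(new_hits)
    return alignment
```

In this toy setting, a sequence that scores below the threshold against the single-sequence profile can become a hit in a later iteration, once an intermediate hit has broadened the profile. That is the effect the iterative scheme relies on.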
The prefilter algorithm effectively reduces profile-profile alignment to profile-sequence alignment by encoding the database profiles as "column state sequences", in which profile columns are represented by an alphabet of 219 discretized column states. This permits a fast implementation of the profile-profile comparison that employs the SIMD (single instruction, multiple data) instruction sets available on modern CPUs. In this way, HHblits performs 16 one-byte operations in parallel in a single clock cycle. Several further filtering steps reduce the number of slow, full-blown HMM-HMM comparisons to a fraction of less than 1/1000. Our tests show that the loss of sensitivity due to the prefilter is negligible.

On a standard SCOP20 ROC benchmark (SCOP1.73 proteins filtered to 20% maximum sequence identity), HHblits detects twice as many true positives as PSI-BLAST and 54% more than HMMER3 at 1% error rate in the first iteration. Two search iterations of HHblits detect significantly more true positives than five PSI-BLAST iterations. Alignment quality is likewise improved significantly. Furthermore, we are able to make confident fold predictions (E-value < 10^-3) and build structural models for 394 Pfam domains for which no fold prediction has been possible so far.

HHblits is a robust, general-purpose protein sequence search tool that has the potential to replace PSI-BLAST as the state-of-the-art method for the generation of MSAs. Due to their high quality, HHblits alignments are able to improve secondary structure prediction methods such as PSIPRED. Furthermore, these better alignments facilitate function and structure prediction for proteins about which nothing is yet known.

In the second part of this thesis we study the evolution of outer membrane β-barrels (OMBBs). These proteins are the major class of outer membrane proteins (OMPs) in Gram-negative bacteria, mitochondria and plastids.
Their transmembrane domains consist of 8 to 24 β-strands forming a closed, barrel-shaped β-sheet around a central pore. Despite their obvious structural regularity, evidence for an origin by duplication or for a common ancestry had not been found.

We use three complementary approaches to show that all OMBBs from Gram-negative bacteria evolved from a single, ancestral ββ hairpin. First, we link almost all families of known single-chain bacterial OMBBs with each other through transitive profile searches. Second, we identify a clear repeat signature in the sequences of many OMBBs in which the repeating sequence unit coincides with the structural ββ hairpin repeat. Third, we show that the observed sequence similarity between OMBB hairpins cannot be explained by structural or membrane constraints on their sequences. The third approach addresses a longstanding problem in protein evolution: how to distinguish between a very remotely homologous relationship and the opposing scenario of "sequence convergence".

The origin of a diverse group of proteins from a single hairpin module supports the hypothesis that, around the time of transition from the RNA to the protein world, proteins arose by amplification and recombination of short peptide modules that had previously evolved as cofactors of RNAs. This research provides the basis for the identification and classification of outer membrane β-barrels and explains the evolutionary origin of this important bacterial protein class.

Contents

Acknowledgements
Summary

1. Motivation and overview

I. HHblits - Iterative HMM-based homology searches

2. Introduction to remote homology searches
  2.1. Scoring models and gap penalties
  2.2. Pairwise sequence alignment
  2.3. Profile alignment
  2.4. HMM-HMM alignment
    2.4.1. Log-sum-of-odds score
    2.4.2. Pairwise alignment of HMMs
    2.4.3. HHsearch
  2.5. Iterative sequence search methods
    2.5.1. PSI-BLAST
    2.5.2. HMMER3

3. Material and methods
  3.1. Workflow of HHblits
  3.2. Generation of HHblits databases with kClust
  3.3. Fast prefiltering of HHblits
    3.3.1. Column state alphabet for sequence profile encoding
    3.3.2. Generation of the column state alphabet
    3.3.3. Translation of sequence profiles in column state sequences
    3.3.4. SSE2
    3.3.5. Fast prefilter algorithms
    3.3.6. Gapless local alignment with SSE2
    3.3.7. Smith-Waterman with SSE2
    3.3.8. Smith-Waterman with backtrace
  3.4. Additional filter steps
    3.4.1. Early stopping
    3.4.2. Tube shading
    3.4.3. Avoid aligning the same templates in multiple iterations
  3.5. P-value calculation with neural networks
  3.6. Realignment of matches by MAC algorithm
    3.6.1. Consecutive merging of best alignments
  3.7. Multi-threading support
  3.8. Compact database by FFINDEX

4. Results
  4.1. Benchmarks
    4.1.1. Homology detection benchmark
    4.1.2. Homology detection benchmark with multi-domain proteins
    4.1.3. Alignment quality
    4.1.4. Runtime
    4.1.5. Quality of E-values
    4.1.6. Comparison of HHblits to HHsearch
    4.1.7. Comparison of profiles built by HHblits or buildali.pl
    4.1.8. UniProt20 versus NR20 database
    4.1.9. Using HMMER3-formatted databases
    4.1.10. Prefilter parameters
    4.1.11. Parameter optimization
    4.1.12. Things that did not work
  4.2. Improved PSIPRED secondary structure prediction with HHblits profiles
  4.3. Structure prediction for selected Pfam domains
    4.3.1. CTK3 - a possible new CID
    4.3.2. PIP49 - a possible Ca2+-activated protein kinase
    4.3.3. The HAP complex
  4.4. Web server

5. Conclusion and Outlook

II. Evolution of outer membrane β-barrels

6. Introduction to the evolution of outer membrane β-barrels
  6.1. Cell envelope of Gram-negative bacteria
  6.2. Outer membrane β-barrels
  6.3. Protein evolution