HMMER User's Guide
Biological sequence analysis using profile hidden Markov models
Sean R. Eddy and the HMMER development team
http://hmmer.org
Version 3.3.2; Nov 2020

Copyright (C) 2020 Howard Hughes Medical Institute. HMMER and its documentation are freely distributed under the 3-Clause BSD open source license. For a copy of the license, see opensource.org/licenses/BSD-3-Clause.

HMMER development is supported in part by the National Human Genome Research Institute of the US National Institutes of Health under grant number R01HG009116. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Contents

Introduction
  How to avoid reading this manual
  Background and brief history
  Problems HMMER is designed for
  HMMER uses ensemble algorithms, not optimal alignment
  Assumptions and limitations of profile HMMs
  How to learn more
  How to cite HMMER
  How to report a bug
  When’s HMMER4 coming?
  What’s still missing
  How to avoid using this software (links to similar software)

Installation
  Quickest: install a precompiled binary package
  Quick-ish: compile the source code
  Geeky: compile source from our github repository
  Gory details
    System requirements
    Multicore parallelization is default
    MPI cluster parallelization is optional
    Using build directories
    Makefile targets
    Compiling the user guide
    What gets installed by make install, and where?
    Installing both HMMER2 and HMMER3
    Seeing more output from make
    Staged installations in a buildroot, for a packaging system
    Workarounds for unusual configure/compilation problems

Tutorial
  Tap, tap; is this thing on?
  The programs in HMMER
  Running a HMMER program
  Files used in the tutorial
  On sequence file formats, briefly
  Searching a sequence database with a profile
    Step 1: build a profile with hmmbuild
    Step 2: search the sequence database with hmmsearch
  Single sequence protein queries using phmmer
  Iterative protein searches using jackhmmer
  Searching a profile database with a query sequence
    Step 1: create a profile database file
    Step 2: compress and index the flatfile with hmmpress
    Step 3: search the profile database with hmmscan
    Summary statistics for a profile database: hmmstat
  Creating multiple alignments with hmmalign
  Searching DNA sequences
    Step 1: build a profile with hmmbuild
    Step 2: search the DNA sequence database with nhmmer

The HMMER profile/sequence comparison pipeline
  Null model
  MSV filter
  Biased composition filter
  Viterbi filter
  Forward filter/parser
  Domain definition
  Modifications to the pipeline as used for DNA search
    SSV, not MSV
    There are no domains, but there are envelopes
    Biased composition

Tabular output formats
  The target hits table
  The domain hits table (protein search only)

Manual pages for HMMER programs
  alimask - calculate and add column mask to a multiple sequence alignment
  hmmalign - align sequences to a profile
  hmmbuild - construct profiles from multiple sequence alignments
  hmmc2 - example client for the HMMER daemon
  hmmconvert - convert profile file to various formats
  hmmemit - sample sequences from a profile
  hmmfetch - retrieve profiles from a file
  hmmlogo - produce a conservation logo graphic from a profile
  hmmpgmd - daemon for database search web services
  hmmpgmd_shard - sharded daemon for database search web services
  hmmpress - prepare a profile database for hmmscan
  hmmscan - search sequence(s) against a profile database
  hmmsearch - search profile(s) against a sequence database
  hmmsim - collect profile score distributions on random sequences
  hmmstat - summary statistics for a profile file
  jackhmmer - iteratively search sequence(s) against a sequence database
  makehmmerdb - build nhmmer database from a sequence file
  nhmmer - search DNA queries against a DNA sequence database
  nhmmscan - search DNA sequence(s) against a DNA profile database
  phmmer - search protein sequence(s) against a protein sequence database

Manual pages for Easel miniapps
  esl-afetch - retrieve alignments from a multi-MSA database
  esl-alimanip - manipulate a multiple sequence alignment
  esl-alimap - map two alignments to each other
  esl-alimask - remove columns from a multiple sequence alignment
  esl-alimerge - merge alignments based on their reference (RF) annotation
  esl-alipid - calculate pairwise percent identities for all sequence
  esl-alirev - reverse complement a multiple alignment
  esl-alistat - summarize a multiple sequence alignment file
  esl-compalign - compare two multiple sequence alignments
  esl-compstruct - calculate accuracy of RNA secondary structure predictions
  esl-construct - describe or create a consensus secondary structure
  esl-histplot - collate data histogram, output xmgrace datafile
  esl-mask - mask sequence residues with X’s (or other characters)
  esl-mixdchlet - fitting mixture Dirichlets to count data
  esl-reformat - convert sequence file formats
  esl-selectn - select random subset of lines from file
  esl-seqrange - determine a range of sequences for one of many parallel
  esl-seqstat - summarize contents of a sequence file
  esl-sfetch - retrieve (sub-)sequences from a sequence file
  esl-shuffle - shuffling sequences or generating random ones
  esl-ssdraw - create postscript secondary structure diagrams
  esl-translate - translate DNA sequence in six frames into individual
  esl-weight - calculate sequence weights in MSA(s)

Input files and formats
  Reading from files, compressed files, and pipes
    .gz compressed files
  HMMER profile HMM files
    header section
    main model section
  Stockholm, the recommended multiple sequence alignment format
    syntax of Stockholm markup
    semantics of Stockholm markup
    recognized #=GF annotations
    recognized #=GS annotations
    recognized #=GC annotations
    recognized #=GR annotations
  A2M multiple alignment format
    An example A2M file
    Legal characters
    Determining consensus columns
  hmmpgmd sequence database format
    Fields in header line
    FASTA-like sequence format
    Creating a file in hmmpgmd format
  Score matrix files

Acknowledgements and history

Introduction

Most protein sequences are composed from a relatively small number of ancestral protein domain families. Our sampling of common protein domain families has become comprehensive and deep, while raw sequence data continues to accumulate explosively. It has become advantageous to compare sequences against all known domain families, instead of all known sequences.

This makes protein sequence analysis more like speech recognition. When you talk to your smartphone, it doesn’t compare your digitized speech to everything that’s ever been said. It compares what you say to a prebuilt dataset of statistical models of common words and phonemes. Using machine learning techniques,