HMMER User's Guide
Total Page:16
File Type:pdf, Size:1020Kb
HMMER User’s Guide Biological sequence analysis using profile hidden Markov models Sean R. Eddy and the HMMER development team http://hmmer.org Version 3.3; Nov 2019 Copyright (C) 2019 Howard Hughes Medical Institute. HMMER and its documentation are freely distributed under the 3-Clause BSD open source license. For a copy of the license, see opensource.org/licenses/BSD-3-Clause. HMMER development is supported in part by the National Human Genome Research Institute of the US National Institutes of Health under grant number R01HG009116. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Contents Introduction 7 How to avoid reading this manual.......................... 7 Background and brief history............................. 8 Problems HMMER is designed for.......................... 9 HMMER uses ensemble algorithms, not optimal alignment............ 10 Assumptions and limitations of profile HMMs................... 12 How to learn more................................... 12 How to cite HMMER.................................. 12 How to report a bug.................................. 13 When’s HMMER4 coming?.............................. 13 What’s still missing................................... 15 How to avoid using this software (links to similar software)........... 15 Installation 17 Quickest: install a precompiled binary package.................. 17 Quick-ish: compile the source code.......................... 17 Geeky: compile source from our github repository................. 18 Gory details....................................... 19 System requirements............................... 19 Multicore parallelization is default....................... 20 MPI cluster parallelization is optional..................... 21 Using build directories.............................. 21 Makefile targets.................................. 22 Compiling the user guide............................ 22 What gets installed by make install, and where?................ 23 Installing both HMMER2 and HMMER3 ................... 24 Seeing more output from make .......................... 24 Staged installations in a buildroot, for a packaging system......... 24 Workarounds for unusual configure/compilation problems........ 25 Tutorial 27 Tap, tap; is this thing on?............................... 27 The programs in HMMER............................... 27 Running a HMMER program............................. 28 4 sean r. eddy Files used in the tutorial................................ 29 On sequence file formats, briefly........................... 30 Searching a sequence database with a profile.................... 30 Step 1: build a profile with hmmbuild ....................... 31 Step 2: search the sequence database with hmmsearch........... 32 Single sequence protein queries using phmmer................... 41 Iterative protein searches using jackhmmer..................... 42 Searching a profile database with a query sequence................ 44 Step 1: create a profile database file...................... 44 Step 2: compress and index the flatfile with hmmpress........... 46 Step 3: search the profile database with hmmscan.............. 46 Summary statistics for a profile database: hmmstat............. 47 Creating multiple alignments with hmmalign................... 49 Searching DNA sequences............................... 51 Step 1: build a profile with hmmbuild..................... 52 Step 2: search the DNA sequence database with nhmmer.......... 52 The HMMER profile/sequence comparison pipeline 57 Null model........................................ 58 MSV filter........................................ 59 Biased composition filter................................ 60 Viterbi filter....................................... 61 Forward filter/parser.................................. 62 Domain definition.................................... 62 Modifications to the pipeline as used for DNA search............... 65 SSV, not MSV.................................... 65 There are no domains, but there are envelopes................ 66 Biased composition................................. 66 Tabular output formats 67 The target hits table................................... 67 The domain hits table (protein search only)..................... 70 Manual pages for HMMER programs 73 alimask - calculate and add column mask to a multiple sequence alignment.. 73 hmmalign - align sequences to a profile........................ 77 hmmbuild - construct profiles from multiple sequence alignments......... 79 hmmc2 - example client for the HMMER daemon.................. 85 hmmconvert - convert profile file to various formats................. 86 hmmemit - sample sequences from a profile...................... 87 hmmfetch - retrieve profiles from a file......................... 90 hmmlogo - produce a conservation logo graphic from a profile........... 92 hmmpgmd - daemon for database search web services................. 93 hmmpgmd_shard - sharded daemon for database search web services........ 95 hmmpress - prepare a profile database for hmmscan................. 97 hmmer user’s guide 5 hmmscan - search sequence(s) against a profile database............... 98 hmmsearch - search profile(s) against a sequence database............. 103 hmmsim - collect profile score distributions on random sequences......... 108 hmmstat - summary statistics for a profile file.................... 114 jackhmmer - iteratively search sequence(s) against a sequence database...... 116 makehmmerdb - build nhmmer database from a sequence file............ 125 nhmmer - search DNA queries against a DNA sequence database......... 126 nhmmscan - search DNA sequence(s) against a DNA profile database....... 134 phmmer - search protein sequence(s) against a protein sequence database..... 139 Manual pages for Easel miniapps 147 esl-afetch - retrieve alignments from a multi-MSA database........... 147 esl-alimanip - manipulate a multiple sequence alignment............. 149 esl-alimap - map two alignments to each other................... 153 esl-alimask - remove columns from a multiple sequence alignment....... 155 esl-alimerge - merge alignments based on their reference (RF) annotation.... 161 esl-alipid - calculate pairwise percent identities for all sequence......... 163 esl-alirev - reverse complement a multiple alignment............... 164 esl-alistat - summarize a multiple sequence alignment file........... 166 esl-compalign - compare two multiple sequence alignments............ 169 esl-compstruct - calculate accuracy of RNA secondary structure predictions... 171 esl-construct - describe or create a consensus secondary structure........ 173 esl-histplot - collate data histogram, output xmgrace datafile.......... 175 esl-mask - mask sequence residues with X’s (or other characters)......... 176 esl-mixdchlet - fitting mixture Dirichlets to count data............... 178 esl-reformat - convert sequence file formats..................... 180 esl-selectn - select random subset of lines from file................ 183 esl-seqrange - determine a range of sequences for one of many parallel..... 184 esl-seqstat - summarize contents of a sequence file................ 185 esl-sfetch - retrieve (sub-)sequences from a sequence file............. 186 esl-shuffle - shuffling sequences or generating random ones........... 189 esl-ssdraw - create postscript secondary structure diagrams............ 192 esl-translate - translate DNA sequence in six frames into individual...... 203 esl-weight - calculate sequence weights in MSA(s)................. 206 Input files and formats 207 Reading from files, compressed files, and pipes.................. 207 .gz compressed files............................... 209 HMMER profile HMM files.............................. 210 header section................................... 211 main model section................................ 214 Stockholm, the recommended multiple sequence alignment format....... 216 syntax of Stockholm markup.......................... 217 semantics of Stockholm markup........................ 217 6 sean r. eddy recognized #=GF annotations.......................... 218 recognized #=GS annotations.......................... 218 recognized #=GC annotations.......................... 219 recognized #=GR annotations.......................... 219 A2M multiple alignment format........................... 221 An example A2M file............................... 221 Legal characters.................................. 222 Determining consensus columns........................ 222 hmmpgmd sequence database format........................ 223 Fields in header line............................... 223 FASTA-like sequence format........................... 223 Creating a file in hmmpgmd format...................... 224 Score matrix files.................................... 225 Acknowledgements and history 227 Introduction Most protein sequences are composed from a relatively small number of ancestral protein domain families. Our sampling of common pro- tein domain families has become comprehensive and deep, while raw sequence data continues to accumulate explosively. It has become ad- vantageous to compare sequences against all known domain families, instead of all known sequences. This makes protein sequence analysis more like speech recogni- tion. When you talk to your smartphone, it doesn’t compare your digitized speech to everything that’s ever been said. It compares what you say to a prebuilt dataset of statistical models of common words and phonemes. Using machine learning techniques,