Hmmer and Applications
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
To Find Information About Arabidopsis Genes Leonore Reiser1, Shabari
UNIT 1.11 Using The Arabidopsis Information Resource (TAIR) to Find Information About Arabidopsis Genes Leonore Reiser1, Shabari Subramaniam1, Donghui Li1, and Eva Huala1 1Phoenix Bioinformatics, Redwood City, CA USA ABSTRACT The Arabidopsis Information Resource (TAIR; http://arabidopsis.org) is a comprehensive Web resource of Arabidopsis biology for plant scientists. TAIR curates and integrates information about genes, proteins, gene function, orthologs gene expression, mutant phenotypes, biological materials such as clones and seed stocks, genetic markers, genetic and physical maps, genome organization, images of mutant plants, protein sub-cellular localizations, publications, and the research community. The various data types are extensively interconnected and can be accessed through a variety of Web-based search and display tools. This unit primarily focuses on some basic methods for searching, browsing, visualizing, and analyzing information about Arabidopsis genes and genome, Additionally we describe how members of the community can share data using TAIR’s Online Annotation Submission Tool (TOAST), in order to make their published research more accessible and visible. Keywords: Arabidopsis ● databases ● bioinformatics ● data mining ● genomics INTRODUCTION The Arabidopsis Information Resource (TAIR; http://arabidopsis.org) is a comprehensive Web resource for the biology of Arabidopsis thaliana (Huala et al., 2001; Garcia-Hernandez et al., 2002; Rhee et al., 2003; Weems et al., 2004; Swarbreck et al., 2008, Lamesch, et al., 2010, Berardini et al., 2016). The TAIR database contains information about genes, proteins, gene expression, mutant phenotypes, germplasms, clones, genetic markers, genetic and physical maps, genome organization, publications, and the research community. In addition, seed and DNA stocks from the Arabidopsis Biological Resource Center (ABRC; Scholl et al., 2003) are integrated with genomic data, and can be ordered through TAIR. -
HMMER User's Guide
HMMER User's Guide Biological sequence analysis using pro®le hidden Markov models http://hmmer.wustl.edu/ Version 2.1.1; December 1998 Sean Eddy Dept. of Genetics, Washington University School of Medicine 4566 Scott Ave., St. Louis, MO 63110, USA [email protected] With contributions by Ewan Birney ([email protected]) Copyright (C) 1992-1998, Washington University in St. Louis. Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are retained on all copies. The HMMER software package is a copyrighted work that may be freely distributed and modi®ed under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. Some versions of HMMER may have been obtained under specialized commercial licenses from Washington University; for details, see the ®les COPYING and LICENSE that came with your copy of the HMMER software. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Appendix for a copy of the full text of the GNU General Public License. 1 Contents 1 Tutorial 5 1.1 The programs in HMMER . 5 1.2 Files used in the tutorial . 6 1.3 Searching a sequence database with a single pro®le HMM . 6 HMM construction with hmmbuild . 7 HMM calibration with hmmcalibrate . 7 Sequence database search with hmmsearch . 8 Searching major databases like NR or SWISSPROT . -
Apply Parallel Bioinformatics Applications on Linux PC Clusters
Tunghai Science Vol. : 125−141 125 July, 2003 Apply Parallel Bioinformatics Applications on Linux PC Clusters Yu-Lun Kuo and Chao-Tung Yang* Abstract In addition to the traditional massively parallel computers, distributed workstation clusters now play an important role in scientific computing perhaps due to the advent of commodity high performance processors, low-latency/high-band width networks and powerful development tools. As we know, bioinformatics tools can speed up the analysis of large-scale sequence data, especially about sequence alignment. To fully utilize the relatively inexpensive CPU cycles available to today’s scientists, a PC cluster consists of one master node and seven slave nodes (16 processors totally), is proposed and built for bioinformatics applications. We use the mpiBLAST and HMMer on parallel computer to speed up the process for sequence alignment. The mpiBLAST software uses a message-passing library called MPI (Message Passing Interface) and the HMMer software uses a software package called PVM (Parallel Virtual Machine), respectively. The system architecture and performances of the cluster are also presented in this paper. Keywords: Parallel computing, Bioinformatics, BLAST, HMMer, PC Clusters, Speedup. 1. Introduction Extraordinary technological improvements over the past few years in areas such as microprocessors, memory, buses, networks, and software have made it possible to assemble groups of inexpensive personal computers and/or workstations into a cost effective system that functions in concert and posses tremendous processing power. Cluster computing is not new, but in company with other technical capabilities, particularly in the area of networking, this class of machines is becoming a high-performance platform for parallel and distributed applications [1, 2, 11, 12, 13, 14, 15, 16, 17]. -
Homology & Alignment
Protein Bioinformatics Johns Hopkins Bloomberg School of Public Health 260.655 Thursday, April 1, 2010 Jonathan Pevsner Outline for today 1. Homology and pairwise alignment 2. BLAST 3. Multiple sequence alignment 4. Phylogeny and evolution Learning objectives: homology & alignment 1. You should know the definitions of homologs, orthologs, and paralogs 2. You should know how to determine whether two genes (or proteins) are homologous 3. You should know what a scoring matrix is 4. You should know how alignments are performed 5. You should know how to align two sequences using the BLAST tool at NCBI 1 Pairwise sequence alignment is the most fundamental operation of bioinformatics • It is used to decide if two proteins (or genes) are related structurally or functionally • It is used to identify domains or motifs that are shared between proteins • It is the basis of BLAST searching (next topic) • It is used in the analysis of genomes myoglobin Beta globin (NP_005359) (NP_000509) 2MM1 2HHB Page 49 Pairwise alignment: protein sequences can be more informative than DNA • protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties • codons are degenerate: changes in the third position often do not alter the amino acid that is specified • protein sequences offer a longer “look-back” time • DNA sequences can be translated into protein, and then used in pairwise alignments 2 Find BLAST from the home page of NCBI and select protein BLAST… Page 52 Choose align two or more sequences… Page 52 Enter the two sequences (as accession numbers or in the fasta format) and click BLAST. -
Assembly Exercise
Assembly Exercise Turning reads into genomes Where we are • 13:30-14:00 – Primer Design to Amplify Microbial Genomes for Sequencing • 14:00-14:15 – Primer Design Exercise • 14:15-14:45 – Molecular Barcoding to Allow Multiplexed NGS • 14:45-15:15 – Processing NGS Data – de novo and mapping assembly • 15:15-15:30 – Break • 15:30-15:45 – Assembly Exercise • 15:45-16:15 – Annotation • 16:15-16:30 – Annotation Exercise • 16:30-17:00 – Submitting Data to GenBank Log onto ILRI cluster • Log in to HPC using ILRI instructions • NOTE: All the commands here are also in the file - assembly_hands_on_steps.txt • If you are like me, it may be easier to cut and paste Linux commands from this file instead of typing them in from the slides Start an interactive session on larger servers • The interactive command will start a session on a server better equipped to do genome assembly $ interactive • Switch to csh (I use some csh features) $ csh • Set up Newbler software that will be used $ module load 454 A norovirus sample sequenced on both 454 and Illumina • The vendors use different file formats unknown_norovirus_454.GACT.sff unknown_norovirus_illumina.fastq • I have converted these files to additional formats for use with the assembly tools unknown_norovirus_454_convert.fasta unknown_norovirus_454_convert.fastq unknown_norovirus_illumina_convert.fasta Set up and run the Newbler de novo assembler • Create a new de novo assembly project $ newAssembly de_novo_assembly • Add read data to the project $ addRun de_novo_assembly unknown_norovirus_454.GACT.sff -
HMMER User's Guide
HMMER User’s Guide Biological sequence analysis using profile hidden Markov models http://hmmer.org/ Version 3.0rc1; February 2010 Sean R. Eddy for the HMMER Development Team Janelia Farm Research Campus 19700 Helix Drive Ashburn VA 20147 USA http://eddylab.org/ Copyright (C) 2010 Howard Hughes Medical Institute. Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are retained on all copies. HMMER is licensed and freely distributed under the GNU General Public License version 3 (GPLv3). For a copy of the License, see http://www.gnu.org/licenses/. HMMER is a trademark of the Howard Hughes Medical Institute. 1 Contents 1 Introduction 5 How to avoid reading this manual . 5 How to avoid using this software (links to similar software) . 5 What profile HMMs are . 5 Applications of profile HMMs . 6 Design goals of HMMER3 . 7 What’s still missing in HMMER3 . 8 How to learn more about profile HMMs . 9 2 Installation 10 Quick installation instructions . 10 System requirements . 10 Multithreaded parallelization for multicores is the default . 11 MPI parallelization for clusters is optional . 11 Using build directories . 12 Makefile targets . 12 3 Tutorial 13 The programs in HMMER . 13 Files used in the tutorial . 13 Searching a sequence database with a single profile HMM . 14 Step 1: build a profile HMM with hmmbuild . 14 Step 2: search the sequence database with hmmsearch . 16 Searching a profile HMM database with a query sequence . 22 Step 1: create an HMM database flatfile . 22 Step 2: compress and index the flatfile with hmmpress . -
The Uniprot Knowledgebase BLAST
Introduction to bioinformatics The UniProt Knowledgebase BLAST UniProtKB Basic Local Alignment Search Tool A CRITICAL GUIDE 1 Version: 1 August 2018 A Critical Guide to BLAST BLAST Overview This Critical Guide provides an overview of the BLAST similarity search tool, Briefly examining the underlying algorithm and its rise to popularity. Several WeB-based and stand-alone implementations are reviewed, and key features of typical search results are discussed. Teaching Goals & Learning Outcomes This Guide introduces concepts and theories emBodied in the sequence database search tool, BLAST, and examines features of search outputs important for understanding and interpreting BLAST results. On reading this Guide, you will Be aBle to: • search a variety of Web-based sequence databases with different query sequences, and alter search parameters; • explain a range of typical search parameters, and the likely impacts on search outputs of changing them; • analyse the information conveyed in search outputs and infer the significance of reported matches; • examine and investigate the annotations of reported matches, and their provenance; and • compare the outputs of different BLAST implementations and evaluate the implications of any differences. finding short words – k-tuples – common to the sequences Being 1 Introduction compared, and using heuristics to join those closest to each other, including the short mis-matched regions Between them. BLAST4 was the second major example of this type of algorithm, From the advent of the first molecular sequence repositories in and rapidly exceeded the popularity of FastA, owing to its efficiency the 1980s, tools for searching dataBases Became essential. DataBase searching is essentially a ‘pairwise alignment’ proBlem, in which the and Built-in statistics. -
Multiple Alignments, Blast and Clustalw
Multiple alignments, blast and clustalW 1. Blast idea: a. Filter out low complexity regions (tandem repeats… that sort of thing) [optional] b. Compile list of high-scoring strings (words, in BLAST jargon) of fixed length in query (threshold T) c. Extend alignments (highs scoring pairs) d. Report High Scoring pairs: score at least S (or an E value lower than some threshold) 2. Multiple Sequence Alignments: a. Attempts to extend dynamic programming techniques to multiple sequences run into problems after only a few proteins (8 average proteins were a problem early in 2000s) b. Heuristic approach c. Idea (Progressive Approach): i. homologous sequences are evolutionarily related ii. Build multiple alignment by series of pairwise alignments based off some phylogenetic tree (the initial tree or the guide tree ) iii. Add in more distantly related sequences d. Progressive Sequence alignment example: 1. NYLS & NKYLS: N YLS N(K|-)YLS NKYLS 2. NFS & NFLS: N YLS NF S NF(L|-)S NKYLS NFLS 3. N(K|-)YLS & NF(L|-)S N YLS N(K|-)(Y|F)(L|-)S NKYLS N YLS N F S N FLS e. Assessment: i. Works great for fairly similar sequences ii. Not so well for highly divergent ones f. Two Problems: i. local minimum problem: Algorithm greedily adds sequences based off of tree— might miss global solution ii. Alignment parameters: Mistakes (misaligned regions) early in procedure can’t be corrected later. g. ClustalW does multiple alignments and attempts to solve alignment parameter problem i. gap costs are dynamically varied based on position and amino acid ii. weight matrices are changed as the level of divergence between sequence increases (say going from PAM30 -> PAM60) iii. -
Software List for Biology, Bioinformatics and Biostatistics CCT
Software List for biology, bioinformatics and biostatistics v CCT - Delta Software Version Application short read assembler and it works on both small and large (mammalian size) ALLPATHS-LG 52488 genomes provides a fast, flexible C++ API & toolkit for reading, writing, and manipulating BAMtools 2.4.0 BAM files a high level of alignment fidelity and is comparable to other mainstream Barracuda 0.7.107b alignment programs allows one to intersect, merge, count, complement, and shuffle genomic bedtools 2.25.0 intervals from multiple files Bfast 0.7.0a universal DNA sequence aligner tool analysis and comprehension of high-throughput genomic data using the R Bioconductor 3.2 statistical programming BioPython 1.66 tools for biological computation written in Python a fast approach to detecting gene-gene interactions in genome-wide case- Boost 1.54.0 control studies short read aligner geared toward quickly aligning large sets of short DNA Bowtie 1.1.2 sequences to large genomes Bowtie2 2.2.6 Bowtie + fully supports gapped alignment with affine gap penalties BWA 0.7.12 mapping low-divergent sequences against a large reference genome ClustalW 2.1 multiple sequence alignment program to align DNA and protein sequences assembles transcripts, estimates their abundances for differential expression Cufflinks 2.2.1 and regulation in RNA-Seq samples EBSEQ (R) 1.10.0 identifying genes and isoforms differentially expressed EMBOSS 6.5.7 a comprehensive set of sequence analysis programs FASTA 36.3.8b a DNA and protein sequence alignment software package FastQC -
PTIR: Predicted Tomato Interactome Resource
www.nature.com/scientificreports OPEN PTIR: Predicted Tomato Interactome Resource Junyang Yue1,*, Wei Xu1,*, Rongjun Ban2,*, Shengxiong Huang1, Min Miao1, Xiaofeng Tang1, Guoqing Liu1 & Yongsheng Liu1,3 Received: 15 October 2015 Protein-protein interactions (PPIs) are involved in almost all biological processes and form the basis Accepted: 08 April 2016 of the entire interactomics systems of living organisms. Identification and characterization of these Published: 28 April 2016 interactions are fundamental to elucidating the molecular mechanisms of signal transduction and metabolic pathways at both the cellular and systemic levels. Although a number of experimental and computational studies have been performed on model organisms, the studies exploring and investigating PPIs in tomatoes remain lacking. Here, we developed a Predicted Tomato Interactome Resource (PTIR), based on experimentally determined orthologous interactions in six model organisms. The reliability of individual PPIs was also evaluated by shared gene ontology (GO) terms, co-evolution, co-expression, co-localization and available domain-domain interactions (DDIs). Currently, the PTIR covers 357,946 non-redundant PPIs among 10,626 proteins, including 12,291 high-confidence, 226,553 medium-confidence, and 119,102 low-confidence interactions. These interactions are expected to cover 30.6% of the entire tomato proteome and possess a reasonable distribution. In addition, ten randomly selected PPIs were verified using yeast two-hybrid (Y2H) screening or a bimolecular fluorescence complementation (BiFC) assay. The PTIR was constructed and implemented as a dedicated database and is available at http://bdg.hfut.edu.cn/ptir/index.html without registration. The increasing number of complete genome sequences has revealed the entire structure and composition of proteins, based mainly on theoretical predictions utilizing their corresponding DNA sequences. -
BLAST Practice
Using BLAST BLAST (Basic Local Alignment Search Tool) is an online search tool provided by NCBI (National Center for Biotechnology Information). It allows you to “find regions of similarity between biological sequences” (nucleotide or protein). The NCBI maintains a huge database of biological sequences, which it compares the query sequences to in order to find the most similar ones. Using BLAST, you can input a gene sequence of interest and search entire genomic libraries for identical or similar sequences in a matter of seconds. The amount of information on the BLAST website is a bit overwhelming — even for the scientists who use it on a frequent basis! You are not expected to know every detail of the BLAST program. BLAST results have the following fields: E value: The E value (expected value) is a number that describes how many times you would expect a match by chance in a database of that size. The lower the E value is, the more significant the match. Percent Identity: The percent identity is a number that describes how similar the query sequence is to the target sequence (how many characters in each sequence are identical). The higher the percent identity is, the more significant the match. Query Cover: The query cover is a number that describes how much of the query sequence is covered by the target sequence. If the target sequence in the database spans the whole query sequence, then the query cover is 100%. This tells us how long the sequences are, relative to each other. FASTA format FASTA format is used to represent either nucleotide or peptide sequences. -
Scaling HMMER Performance on Multicore Architectures
2011 International Conference on Complex, Intelligent, and Software Intensive Systems Scaling HMMER Performance on Multicore Architectures Sebastian Isaza∗, Ernst Houtgast∗, Friman Sanchezy, Alex Ramirezyz and Georgi Gaydadjiev∗ ∗Computer Engineering Laboratory, Delft University of Technology yComputer Architecture Department, Technical University of Catalonia zBarcelona Supercomputing Center Abstract—In bioinformatics, protein sequence alignment is one of the fundamental tasks that scientists perform. Since the growth of biological data is exponential, there is an ever- increasing demand for computational power. While current processor technology is shifting towards the use of multicores, the mapping and parallelization of applications has become a critical issue. In order to keep up with the processing demands, applications’ bottlenecks to performance need to be found and properly addressed. In this paper we study the parallelism and performance scalability of HMMER, a bioinformatics application to perform sequence alignment. After our study of the bottlenecks in a HMMER version ported to the Cell processor, we present two optimized versions to improve scalability in a larger multicore architecture. We use a simulator that allows us to model a system with up to 512 processors and study the performance of the three parallel versions of HMMER. Results show that removing the I/O bottleneck improves performance by 3× and 2:4× for a short Fig. 1. Swiss-Prot database growth. and a long HMM query respectively. Additionally, by offloading the sequence pre-formatting to the worker cores, larger speedups of up to 27× and 7× are achieved. Compared to using a single worker processor, up to 156× speedup is obtained when using growth is stagnating because of frequency, power and memory 256 cores.