Sequence Alignments


Felix Sappelt, Irina Wagner

Table of Contents
- Pairwise Alignments: FASTA, BLAST, HHSEARCH
- Multiple Alignments: ClustalW, MAFFT, Muscle, Cobalt, T-Coffee, 3D-Coffee, JalView

PAIRWISE SEQUENCE ALIGNMENTS

Pairwise Sequence Alignment Methods
- Dynamic programming
  - Global alignment (Needleman-Wunsch)
  - Local alignment (Smith-Waterman)
- Heuristic methods
  - FASTA
  - BLAST

Heuristic Methods
- Try only the most likely alignments and skip all others
- Much faster than dynamic programming methods, but less sensitive
- For large databases, such as whole genomes, speed is extremely important
- In some cases heuristic methods are the only option, because exact algorithms take too long

FASTA
- One of the earliest widely used database search tools (Lipman & Pearson, 1985)
- Heuristic method approximating Smith-Waterman
- Search time is proportional to the size of the database

FASTA Algorithm
- Find identical substrings
- Re-score and keep only high-scoring identities
- Discard substrings that cannot be easily joined
- Optimize using dynamic programming around the diagonal

Substitution Matrix
PAM (Point Accepted Mutation)
- Created by Margaret Dayhoff in 1970
- Based on an explicit evolutionary model
- PAM1 estimated from 1572 changes in 71 groups of protein sequences that were at least 85% similar
- PAM250 (20% similarity) obtained by multiplying PAM1 by itself 250 times
BLOSUM (Blocks Substitution Matrix)
- Deals with sequence changes over long timespans
- Based on multiple protein alignments
- Used 500 families of related proteins
- Not based on an explicit evolutionary model; derived from all amino acid changes observed in aligned regions of related protein families
When the correct scoring matrix is used, alignment statistics are meaningful.

PAM250 Matrix (figure)

BLAST
- BLAST is an improvement over FASTA
  - Greater speed through pre-indexing the database
  - More accurate results
- BLAST is the centerpiece of many bioinformatics assays, because it makes genome-scale sequences accessible
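As a reference point for the heuristics above, the exact local alignment (Smith-Waterman) that FASTA approximates can be sketched in a few lines of Python. This is a minimal sketch and not code from the slides: the function name and the simple match/mismatch/linear-gap scoring are assumptions chosen for illustration, whereas a real protein search would score pairs with a substitution matrix such as PAM250 or BLOSUM.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of strings a and b.

    Scoring values are illustrative assumptions (linear gap penalty,
    flat match/mismatch scores instead of a substitution matrix).
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The table has len(a) x len(b) cells per sequence pair, which is why running this exact method against every sequence in a large database is slow and heuristics such as FASTA and BLAST are used instead.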
The original BLAST paper (Altschul et al. 1990) was the most cited paper of the 1990s.

BLAST
- Mask low-complexity regions
- Over 50% of genomic DNA is repetitive
  - Retrotransposons
  - Repeats
  - ALU regions
  - Microsatellites
  - UTRs

BLAST
- The most widely used program for examining sequence alignments and similarities
- Instead of relying on global alignments, BLAST compares sequences by locating short matches between them
- BLAST creates a list of "words" that reach a certain "threshold" score when compared with the query sequence
- The database is searched for occurrences of these words
- Uses a hash table that contains the neighborhood words

BLAST
- List all k-tuples in the query sequence
  - The lower the k-tup value, the more background matches you will get
  - The higher the k-tup value, the faster the analysis
- Find all matching words in the database
- Keep only the high-scoring words -> difference from FASTA
- Build a search tree from the remaining words

BLAST
- Extend each match until the match score decreases or the end of the sequence has been reached
- Extended matches are called High-Scoring Segment Pairs (HSPs)

BLAST
- List all HSPs in the database whose score is high enough to be considered
- Assess statistical significance via the Gumbel extreme value distribution, which describes the distribution of Smith-Waterman scores
- Join HSPs into a longer alignment
- Output

BLAST Results
- Raw score: calculated by summing the scores for each aligned position and the scores for gaps
- Bit score: the raw score converted from the log base of the scoring matrix that created the alignment to log base 2; this rescaling allows scores to be compared between alignments
- E-value: expected number of chance alignments; the lower the E-value, the more significant the score. An expect value of 10.0 is the default significance threshold, but the user can adjust it
- P-value: the probability (in the range 0 to 1) of a given sequence match occurring by chance.
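The score definitions above can be tied together with the standard Karlin-Altschul relations: bit score S' = (lambda * S - ln K) / ln 2, E = m * n * 2^(-S'), and P = 1 - exp(-E). The following is a hedged sketch, not part of the slides: the function names are hypothetical, the search space is simplified to query length times database length, and the values of lam and k in the usage lines are illustrative placeholders, since the real parameters depend on the scoring matrix and gap costs.

```python
import math


def bit_score(raw_score, lam, k):
    """Rescale a raw score S to bits: (lam * S - ln k) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one chance hit: 1 - exp(-E)."""
    return -math.expm1(-e)


if __name__ == "__main__":
    # lam and k below are placeholder values chosen for illustration only.
    bits = bit_score(raw_score=100, lam=0.3, k=0.1)
    e = e_value(bits, query_len=300, db_len=10_000_000)
    print(bits, e, p_value(e))
```

For small E the two measures nearly coincide (P is approximately E), while for large E the P-value saturates at 1 and stops distinguishing between matches.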
  The P-value is less accurate than the E-value.

Other BLAST Variants
- BLASTN: nucleotide sequence comparison
- BLASTP: general protein comparison
- TBLASTN: compares a protein sequence to a translated DNA database
  - Use if a homolog is not found in the protein database
- TBLASTX: compares a translated DNA sequence to a translated DNA database
  - Identify new orthologs in closely related species
- BLASTX: compares a translated nucleotide query to a protein database

Other BLAST Variants
PSI-BLAST (Position Specific Iterative BLAST)
- Used to find distant relatives of a protein
- Easy-to-use version of a "profile" search
- Uses an iterative alignment procedure to develop position-specific scoring matrices, which increases its ability to detect weak pattern matches
PHI-BLAST (Pattern Hit Initiated BLAST)
- Uses protein motifs to increase the chance of finding biologically significant matches

HHSearch
- Represents query and database by profile Hidden Markov Models (HMMs)
- Database profiles are derived from multiple sequence alignments
- Before searching the HMM database, an MSA of related sequences is compiled using CSI-BLAST
- From this MSA, a profile is calculated
- The search is then run with this profile as the query

BLAST Algorithm
Speed: pre-indexing the database before the search, parallel processing
- Mask low-complexity regions (repeats)
- Make a word list of the k-letter substrings of the query sequence
- List common words between database and query; keep only the high-scoring ones (FASTA keeps all; this is the main difference)
- Build an efficient search tree
- Repeat the two previous steps for each k-letter substring of the query
- Scan the database sequences for exact matches with the remaining high-scoring words
- Extend exact matches to High-Scoring Segment Pairs (HSPs): extend the alignment to the left and right until the score gets worse (original BLAST; BLAST2 uses a lower neighborhood score threshold, which makes the words longer and saves time; there is more to this, see Wikipedia)
- List all HSPs in the database whose score is high enough to be considered; use a cutoff score S (empirically determined) to decide which ones to consider
- Evaluate HSP score significance
  using the Gumbel extreme value distribution (formula on Wikipedia)
- Two or more HSPs in one database sequence are combined into one alignment; compare the significance of the newly combined regions using the Poisson method or the sum-of-scores method
- Original BLAST: one alignment per HSP (multiple pairwise alignments if more than one HSP is found); BLAST2: one gapped Smith-Waterman alignment
- Report matches with an expect score lower than the E threshold

MULTIPLE SEQUENCE ALIGNMENTS

The MSA Problem
- Correctly align more than two sequences
- NP-complete problem
  - For k sequences of length n, complexity is O(n^k)
  - For 10 sequences of length 50, n^k is about 10^17
  - For 50 sequences of length 500, n^k is about 10^135
- World's biggest supercomputer: 2.5 TFLOPS (10^12)
- Since Planet Earth will be around for just 6 billion years, all current approaches are heuristic.

What are MSAs good for?
- Assess the evolutionary history and sequence homology of a set of sequences
- Useful for
  - Homology modelling
  - Phylogenetic research
  - Illustrating mutation events and evolutionary processes

MSA Workflow (figure)

Methods
- ClustalW: basic tree-based approach (1994)
- ProbCons: probabilistic approach (2005)
- MAFFT: Fast Fourier Transform (2002)
- Muscle: k-substring counting, profiles (2004)
- Cobalt: proteins, user input (2007)
- T-Coffee: library-based (2000)
- ... and many more.

ClustalW
- Published in 1994
- How it works:
  - First, do all possible pairwise alignments
  - Build a guide tree (neighbor-joining method)
  - Progressively align according to the branching order in the guide tree: starting from the leaves, build pairwise alignments towards the root

ClustalW
- Pros:
  - It's fast
  - Results are good for highly similar sequences
  - Position-specific gap penalties protect the hydrophobic core
- Cons:
  - Simple approach
  - Errors in the pairwise alignment stage propagate and cannot be corrected

ProbCons
- Probabilistic
- Idea of consistency
  - Prevents misalignments due to "faulty" pairwise alignments
  - Sequences x, y, z: if x_i aligns with y_j, and y_j aligns with z_k, then x_i aligns with z_k

ProbCons
- Compute the posterior probability matrix using an HMM
- Construct pairwise alignments that maximize "expected accuracy"
- Probabilistic consistency transformation of the posterior matrix
  - Incorporate similarity to other sequences into pairwise comparisons
- Build guide tree
- Progressive alignment

MAFFT
- Uses the Fast Fourier Transform to identify homologous regions
- Uses polarity and volume information for amino acids
- Can run in progressive and iterative mode
- Extremely fast
- Very accurate

Muscle
- Iterative method
- K-mer counting
  - Approximate the distance between two sequences by the number of common k-substrings
  - Very fast
- Log expectation
  - Profile function used to iteratively improve alignments

Cobalt
- Specializes in proteins
- Designed to exploit three strategies:
  - Using biological information by deriving constraints from protein databases
  - Using pairwise similarity present in multiple pairs
  - Allowing the user to specify regions that are to be aligned

T-Coffee
- Tree-based Consistency Objective Function for Alignment Evaluation
- Cédric Notredame, 2000
- Derives constraints from libraries of pairwise alignments
- Slow, but accurate
- 3D-Coffee: extends T-Coffee with structure information from PDB files
- Libraries contain pairwise alignments
- Each amino acid pair in them is a constraint
- Weights: percent identity of the alignments
- Fitting a set of weighted constraints onto an MSA is NP-complete -> heuristic solution: extension

What to use
- For small numbers of sequences (<20) with relatively high identity (>40%), any tool works
- Large numbers of sequences may require fast methods: MAFFT (progressive)
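The k-mer counting idea listed under Muscle, approximating the distance between two sequences by their number of common k-substrings without aligning them, can be sketched as follows. This is a simplified illustration rather than Muscle's implementation: the function names, the default k = 3, and the normalization by the shorter sequence's k-mer count are assumptions made for the example.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-letter substring of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(a, b, k=3):
    """Fraction of k-mers shared by a and b (multiset intersection)."""
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    possible = min(len(a), len(b)) - k + 1  # k-mers in the shorter sequence
    return shared / possible if possible > 0 else 0.0


def kmer_distance(a, b, k=3):
    """Alignment-free distance estimate: 1 - shared k-mer fraction."""
    return 1.0 - kmer_similarity(a, b, k)


if __name__ == "__main__":
    print(kmer_distance("ACGTAC", "ACGTTT"))
```

Because no alignment is computed, the cost per pair is roughly linear in sequence length, which is what makes this kind of estimate attractive for building a guide tree over many sequences.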