Protein Sequence Databases …And Your Mass Spectrometry-Based Proteomics Experiment

© 2015 Regents of the University of Minnesota. All rights reserved. Center for Mass Spectrometry and Proteomics | Phone | (612)625-2280 | (612)625-2279

Outline
• Protein Database (DB)
  • Origin
  • Sources
  • Format
  • Size
  • Composition
• Selecting a database for mass spec search
• Effect of DB on mass spec search results
• Post MS analysis: protein annotation, ontology, alignment

Terminology
• FASTA
• Database repository
• NCBI database
• UniProtKB
• Swiss Prot
• Ref Seq (reference sequence)
• Homology
• Contaminants DB
• Ontology

FASTA Protein Sequence
• Name and Origin
  • FASTA (pronounced ‘fast-aye’)
  • ORIGIN: named for a sequence similarity alignment tool (1985)
  • REF: DJ Lipman, WR Pearson (1985) PMID: 2983426
    "The algorithm has been implemented in a computer program designed to search protein databases very rapidly. For example, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC)."
  • Stands for “fast all”: the file format worked with ‘all’ alphabets (amino acid and nucleotide)

FASTA Protein Sequence Format
• Structure: TEXT file
  • Line 1: description line with sequence identifier
  • Line 2: protein sequence in single-letter amino acid code, 80 characters wide
• Allowed characters:
  • AMINO ACID ONE-LETTER CODE
  • X
  • *
  • -
  • Custom one-letter amino acid codes
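The two-line structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the parser used by any particular search program. It assumes the usual convention that a description line starts with ">" and that the sequence identifier is the first whitespace-delimited token. The two records and their identifiers are made up for the example.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (description_line, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines (often wrapped at a fixed width) are joined.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sequence_identifier(description: str) -> str:
    """Return the first whitespace-delimited token of a description line."""
    parts = description.split()
    return parts[0] if parts else ""


# Hypothetical records; the identifiers do not refer to real database entries.
example = """>demo|X0001|DEMO_1 Hypothetical demo protein
MKTAYIAKQR
QISFVKSHFS
>demo|X0002|DEMO_2 Another hypothetical entry
MSELEEK*
"""

records = list(parse_fasta(example))
for description, sequence in records:
    print(sequence_identifier(description), len(sequence))
```

Wrapped sequence lines are concatenated, so the 80-character line width affects only how the file is laid out, not the sequence that is read back.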
FASTA Format Header Line Sequence Identifiers
• Line 1: description line with sequence identifier

FASTA Protein Sequence from NCBI: example
(Figure labels: Line 1, Line 2)
NOTE: In Sept 2016, gi numbers were replaced with accession.version identifiers

Selecting a Protein Sequence Database
• Public repositories, such as
  • NCBI
  • UniProtKB
    • Swiss Prot: manually annotated and reviewed
    • TrEMBL: automatically annotated and not reviewed
• Custom (from customer)
  • NOTE: format is important!
• Represent species (1 or more) from which protein sample originated
  • Example: Mouse protein expressed in E. coli
• Ideal size range ~ 2000 to < 1 million entries

Selecting a Protein Database: UniProtKB repository

Selecting a Protein Database: NCBI Ref Seq repository

Choose Your Taxonomy or Taxonomies
NOTES:
• If recombinant protein expressed in host cell, include host proteins & expressed protein(s)
• If protein database for your species has <2000 proteins, merge with another protein database (yeast) for statistical reasons
• Protein sequence headers must be parsed correctly
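The merging notes above (host plus expressed protein, or padding a database of fewer than 2000 proteins) amount to concatenating FASTA files while keeping identifiers unique. The following is a rough sketch under that assumption. The helper names, the duplicate-handling rule, and the sample entries are all hypothetical, and a real search program may impose its own header requirements.

```python
MIN_ENTRIES = 2000  # lower bound of the size range quoted in the slides


def read_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def identifier(description):
    """First whitespace-delimited token of the description line."""
    parts = description.split()
    return parts[0] if parts else ""


def merge_databases(primary, supplement):
    """Append supplement records whose identifier is not already in primary."""
    seen = {identifier(desc) for desc, _ in primary}
    merged = list(primary)
    for desc, seq in supplement:
        ident = identifier(desc)
        if ident not in seen:
            seen.add(ident)
            merged.append((desc, seq))
    return merged


def needs_supplement(records, minimum=MIN_ENTRIES):
    """True when the database is smaller than the suggested minimum size."""
    return len(records) < minimum


# Hypothetical entries: an expressed protein plus two host proteins.
expressed = read_fasta(">demo|M0001 expressed mouse protein\nMSELEEKLTPVAEETR\n")
host = read_fasta(
    ">demo|E0001 host protein one\nMKQVAEVR\n"
    ">demo|M0001 duplicate of the expressed protein\nMSELEEKLTPVAEETR\n"
    ">demo|E0002 host protein two\nMAEQEAEVR\n"
)

merged = merge_databases(expressed, host)
print(len(merged), needs_supplement(merged))
```

Dropping duplicates by identifier is one possible policy. Keeping both copies under distinct identifiers would also work as long as the headers still parse correctly.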
Taxonomy specification - UniProtKB (19996)

Taxonomy specification - NCBI

Protein Database repository content for Thirteen-lined Ground Squirrel

Database Source               Number of Proteins
Swiss-Prot* reviewed          20
TrEMBL* unreviewed            20,076
UniProt Reference Proteome    19,966
NCBI (‘non-redundant’)        30,130
NCBI Reference Sequence       29,842
* From UniProt

Protein Database Characteristics …related to your mass spectrometry experiment

Splice form variants
Sequence alignments: Protein Cytochrome P450 2D6

Protein Sequence Variants (natural variants)
SNPs (single nucleotide polymorphisms)
https://hive.biochemistry.gwu.edu

In silico trypsin digest, ‘native’ protein

In silico trypsin digest, with VARIANTS (labeled 1 and 2)
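An in silico trypsin digest can be sketched as follows. The slides do not state the cleavage rule, so this example assumes the commonly used one: cleave C-terminal to K or R, except when the next residue is P. The function names and the optional missed-cleavage parameter are illustrative only.

```python
def cleavage_sites(sequence):
    """Positions after which trypsin is assumed to cut (after K/R, not before P)."""
    sites = []
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    return sites


def trypsin_digest(sequence, missed_cleavages=0):
    """Return peptides from a digest, allowing up to `missed_cleavages` skipped sites."""
    if not sequence:
        return []
    bounds = [0] + cleavage_sites(sequence) + [len(sequence)]
    peptides = []
    for start in range(len(bounds) - 1):
        last = min(start + 2 + missed_cleavages, len(bounds))
        for stop in range(start + 1, last):
            peptides.append(sequence[bounds[start]:bounds[stop]])
    return peptides


# Native peptide versus its Q -> K variant: the new lysine adds a cleavage site.
native = trypsin_digest("SELEEQLTPVAEETR")
variant = trypsin_digest("SELEEKLTPVAEETR")
print(native)
print(variant)
```

A single substitution that creates or removes a K or R changes which peptides exist at all, not just their masses, which is why a variant missing from the database can go unmatched.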
Effect of Variant on Peptide Mass

Peptide example        Peptide Mass*   Peptide Sequence
1 - native             1730.8443       SELEEQLTPVAEETR
1 - variant (Q -> K)   1730.8806       SELEEKLTPVAEETR
1 - variant (Q -> K)   734.3566        SELEEK
1 - variant (Q -> K)   1015.5418       LTPVAEETR
2 - native             830.4366        EQVAEVR
2 - variant (V -> E)   860.4108        EQEAEVR
* Monoisotopic [M + H]+1

Proteomics Search Program Meets Protein Sequence Database
• Protein sequence file is downloaded to local computer
• Merge with common lab contaminants (keratins and more) database
  • http://www.thegpm.org/crap/
• Protein database is imported or indexed in the proteomics search program (sequence format is critical)
• REVERSED sequences are generated for False Discovery Rate (FDR) calculations
• Protein sequences are digested with enzymes in silico

Database search > Protein List
• Database search algorithm matches spectrum > peptide > protein
• RESULTS: List of protein identifications with accession numbers
• POST Database search options (outside CMSP):
  1. Protein annotation
  2. Sequence alignment
  3. Obtain related Gene Ontology information

POST Database search options
What you can do with your protein list.

1) Protein Annotation from UniProtKB
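The monoisotopic [M + H]+ values in the peptide mass table earlier in this deck can be reproduced by summing standard monoisotopic residue masses and adding one water and one proton. The sketch below does that and also shows one simple way to build a reversed decoy entry. The residue masses are standard values rounded to five decimals. The function names and the "REV_" prefix are assumptions made for illustration, since search programs differ in how they label decoys.

```python
# Standard monoisotopic residue masses (Da), rounded to five decimals.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O added for the peptide termini
PROTON = 1.007276   # charge carrier for the singly protonated ion


def mh_plus(peptide):
    """Monoisotopic mass of the singly protonated peptide, [M + H]+."""
    return sum(MONO[residue] for residue in peptide) + WATER + PROTON


def reversed_decoy(description, sequence, prefix="REV_"):
    """Return a decoy record: prefixed description and reversed sequence."""
    return prefix + description, sequence[::-1]


for peptide in ("SELEEQLTPVAEETR", "SELEEKLTPVAEETR", "EQVAEVR", "EQEAEVR"):
    print(peptide, round(mh_plus(peptide), 4))

print(reversed_decoy("demo|X0001 hypothetical entry", "SELEEK"))
```

The Q -> K substitution shifts the intact peptide by only about 0.036 Da, while V -> E shifts it by about 29.97 Da, so the first variant is mainly detectable through the extra tryptic cleavage it introduces.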
2) Sequence alignment with UniProt alignment tool

2) Sequence alignment with UniProt alignment tool: numerous amino acid labeling options
• * (asterisk) indicates positions which have a single, fully conserved residue.
• : (colon) indicates conservation between groups of strongly similar properties, scoring > 0.5 in the Gonnet PAM 250 matrix.
• . (period) indicates conservation between groups of weakly similar properties, scoring <= 0.5 in the Gonnet PAM 250 matrix.

2) Sequence alignment with NCBI BLAST

3) Link Gene Ontology information to Proteins
• Define: “The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products across databases.”
• Ontologies/Vocabularies
  • molecular function: molecular activities of gene products
  • cellular component: where gene products are active
  • biological process: pathways and larger processes made up of the activities of multiple gene products
(http://geneontology.org/page/documentation)

Molecular Function Pie Chart for a List of 96 Protein Identifiers (gi numbers) submitted to PANTHER (http://www.pantherdb.org/)
Protein list from Supplemental data REF: Thu TM et al (2016) Cell Reports,