A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring System

Total Page:16

File Type:pdf, Size:1020Kb

A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring System From: ISMB-96 Proceedings. Copyright © 1996, AAAI (www.aaai.org). All rights reserved. A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring System Philipp Bucher and Kay Hofmann, Swiss Institute for Experimental Cancer Research (ISREC) Ch. des boveresses 155, CH-1066Epalinges, Switzerland { pbucber,khofmann } @isrec-sun 1 .unil.ch Abstract to define a metric over the sequence space (e.g. Sellers 1974). Wepresent a probabilistic interpretation of local Here, we advocate a different view of local sequence sequencealignment methodswhere the alignment scor- alignment methods, in which the scoring system plays ing system(ASS) plays the role of a stochastic process defining a probability distribution over all sequence the role of a stochastic process generating pairs of pairs. Anexplicit algorithmis given to computethe pro- related sequences. Based on such an interpretation, we bability of two sequencesgiven an ASS.Based on this propose a modified version of a Smith-Watermanalgo- definition, a modified version of the Smith-Waterman rithm where the score computedfor two sequences is a local similarity search algorithm has been devised, whichassesses sequencerelationships by log likelihood log likelihood ratio of two probabilities, one given the ratios. Whentested on classical examplessuch as glo- scoring system, and one given a null-model. The bins or G-protein-coupled receptors, the new method mathematical concepts underlying this approach is provedto be up to an order of magnitudemore sensitive closely related to maximumlikelihood estimation for than the native Smith-Watermanalgorithm. global sequence alignments (e.g. Bishop & Thompson 1986). The performance of the new database search Introduction technique is assessed by test protocols previously used for the evaluation of alternative alignment scoring sys- The comparison of a new protein sequence against a tems. database of knownproteins is perhaps the most impor- tant computer application in molecular sequence analysis. It is generally accepted that the Smith- Review of the Native Smith-Waterman Watermanlocal similarity search algorithm (Smith Algorithm Waterman1981) is the most sensitive technique to dis- cover significant weak similarities between two Let a=ala2"’’a,,, and b=blb2"’’b,~ be two sequences. The more frequently used heuristic algo- sequences of residues from an alphabet S containing N elements. A local sequence alignment between two rithms implemented in the programs FASTA(Pearson such sequences is defined by an alignment path 1990) and BLAST(Altschul et al. 1990) can be con- sidered approximations or special cases of a full represented as an ordered set of index pairs: Smith-Waterman algorithm offering high speed in U = (Xl,Yl), (x2,Y2), " (Xl, l) E ach index pair exchangefor reduced sensitivity. identifies a pair of matched residues in the sequence alignment. A valid path u for sequences a and b The Smith-Waterman algorithm maximizes an align- satisfies the following conditions: xk+l>Xk, y~-+l>yk, ment scoring function over all possible local alignments x , Yl<-rt . between two sequences. The scoring function depends I<_m An alignment scoring system (ASS.) consists of on a set of parameters referred to as an alignment scor- substitution matrix and a gap weighting function. The ing system, which consists of a residue substitution substitution matrix defines substitution scores s(a,b) matrix and a linear gap penalty function. Optimal for pairs of residues (a,b)~ 2. The gap weighting alignment scores computed by a Smith-Waterman or function assigns weights w(k)= ot + ~k to insertions related sequence alignment algorithm have traditionally and deletions of length k>l. For gap length zero, we been viewed and interpreted as measures of similarity or distance (Smith, Waterman, & Fitch 1981). Con- define w(0) = cerning their mathematical properties, they were shown 44 ISMB-96 The scoring system assigns an alignment score Sa to PrObHMM(a) = Pr obH~(a,u) any local sequence alignment (a,b,u): paths u l over all possible paths through the model. SA (a,b,u) = k)~.~s (axk,by (1) The probability of a sequence pair a,b being gen- k=l erated by an ASSis the sum of the probabilities I-I of sequence pair a,b being generated via a partic- - ~.~w(Xk+l--Xk--1)+w (Y~+l--Yk--1) ular path u: k=l ProbAss(a,b ) = ~ ProbAss(a,b,u) Note that the alignment score can be described as the paths u stun of two components, a sequence-dependent match over all valid local alignment paths. score: I (iii) In database applications, membership of a SM(a,b,u) = S( axk,by,) (2a) sequence a to an HMM-definedsequence class is k=l estimated by an HMMscorewhich has the form and a sequence-independent gap score: of a log likelihood ratio: l-I Prob~M(a) HMMscore(a) = log Sa(u) = - w(x~+l--x~--l)+wfyk+ryk--1) (2 b) Probnull(a) k=l These two definitions will be useful in the formulation where Probn,t/(a) is the probability of sequence of the probabilistic version of the Smith-Waterman given a specific null-model. algorithm. The native Smith-Waterman algorithm com- In a probabilistic Smith-Waterman search, an putes the optimal local alignment score for two ASS-defined kind of similarity between two sequences: sequences a,b will be estimated by a PSWscore of the following type: SWscore (a,b) = max SA (a,b,u), (3) paths u ProbAss(a,b) PSWscore (a,b) = log which serves as a measure of sequence similarity in Probnult(a,b) database search applications. An efficient procedure to What remains to be done in order to implement the compute SWscore(a,b) is described in (Gotoh 1982). method suggested by this analogy, is to chose an appropriate null-model for randomsequence pairs, and Probabilistic Smith-Waterman (PSW) to work out a reasonable definition for the probability Algorithm of a local sequence alignment Probass(a,b,u) based the definition of the alignment score. The probabilistic version of the Smith-Watermanalgo- The null-model we choose has the general form: rithm is based on an analogy between alignment scor- ing systems (ASSs) and hidden Markov models Probnun(a,b) PL(m,n)Po(a,b) (4) (HMMs),a class of statistical models that have recently where PL(m,n) is a length distribution over sequence been introduced to molecular sequence analysis (Baldi length pair classes, and Po(a,b) is the null-model pro- et al. 1994, Krogh et al. 1994). This analogy comprises bability of sequence pair a,b given the length pair pair the following ideas and assumptions: class n ,m, whichis defined as follows: (i) An HMMdefines a probability distribution over the sequence space by means of a stochastic pro- Po(a,b) = fi p(ai) fi p(bj) (5) i=l j=l cess involving a randomwalk through the model. An ASSdefines a probability distribution over the where p (a) denotes the null-model probability of resi- space of sequence pairs by means of a stochastic due a. The null-model thus essentially consists of a process involving a random walk through an residue probability distribution over the alphabet S. alignment path matrix. The distribution of sequence length pair classes will be the same for the null-model and the ASS-defined (ii) The probability of a sequence a being generated probability distribution. For reasons that will become by an HMMis the sum of the probabilities of clear later, we require that there is a logarithmic base z sequence a being generated via a particular path such that: U: Bucher 45 s~a~) p(a)p(b)z = 1. (6) B(n,m) = ~ sa<u) (11) aES,beS paths u Note that log-odds substitution matrices, such as those where So(u) is the gap score of alignment path u from the PAM or BLOSUMseries (Henikoff defined by equation 2b, if the null-model satisfies the Henikoff 1992), have known mutational equilibrium constraint imposed by equation 6. Let us first rewrite compositions and logarithmic bases satisfying the the expressionfor B (m,n ) as follows: above condition. If a different residue composition is ac¯ u)[.-, s ~-, sMCa,b,~)~. .. B(m,n) = 2~ ] 2.~ z r0ta,D ) (12) used as null-model, there will always be a unique solu- paths a tion for z solving the above equation, if the substitution LaeSm,beS. matrix satisfies the conditions necessary for local where S~(a,b,u) is the match score of the sequence sequence alignments (Altschul 1991). alignment (a,b,u) as defined by equation 2a. The inner The ASS-definedprobability distribution has the gen- sum in the above expression can be rewritten as shown eral form: below, using the notation v iv2 ¯ ’ ’ vm-~ , wlw2.., wn_ for the residues of sequences a and b, b) (7) t ProbAss(a, = PL (m ,n )PA (a,b) which are not part of a matchedresidue pair defined by where PA(a,b) is the ASS-defined probability path u: sequence pair a,b, given the length pair class n,m. The zsM(a’b’u)P0(a,b precise nature of the length pair class distribution is ) re,be n s (at ,bvk) irrelevant, as long as it is the same for the null-model aE S S t m-I n -I l and the ASS, since the term Pt.(m,n) cancels itself in =sumI"[P (avk)YIp (bw,)lIP (ax,)P (by,)z the definition of the PSWscore" aEsm. bESn k=l k=l k=l ProbAss(a,b) PSWscore - (8) p (b ~’C p’~ (a)p’t’ (b)z Prob.~u(a,b) a~S.b~S Pt.(n,m) Pa (a,b,u) palhs u Combiningequations 8, 9, and 11, we obtain the fol- PL (n ,m )Po(a,b) lowing intuitively plausible expression for the PSWscore: Z PA (a,b,u) paths u Z Z SA(a’b’u) Po(a,b) pathsu PSWscore (a,b) (13) For sequences of a given length pair class s#u~ a e Sm, b e S", the probability of a sequence align- paths u ment (a,b,u) will be defined as follows: Both sumsin the log likelihood ratio can efficiently be zSA(~,b,u) P0(a,b) PA (a,b,u) (9) computed by special cases of the forward algorithm B (n ,m used for computation of HMMscores (Krogh et al.
Recommended publications
  • Bioinformatics Study of Lectins: New Classification and Prediction In
    Bioinformatics study of lectins : new classification and prediction in genomes François Bonnardel To cite this version: François Bonnardel. Bioinformatics study of lectins : new classification and prediction in genomes. Structural Biology [q-bio.BM]. Université Grenoble Alpes [2020-..]; Université de Genève, 2021. En- glish. NNT : 2021GRALV010. tel-03331649 HAL Id: tel-03331649 https://tel.archives-ouvertes.fr/tel-03331649 Submitted on 2 Sep 2021 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. THÈSE Pour obtenir le grade de DOCTEUR DE L’UNIVERSITE GRENOBLE ALPES préparée dans le cadre d’une cotutelle entre la Communauté Université Grenoble Alpes et l’Université de Genève Spécialités: Chimie Biologie Arrêté ministériel : le 6 janvier 2005 – 25 mai 2016 Présentée par François Bonnardel Thèse dirigée par la Dr. Anne Imberty codirigée par la Dr/Prof. Frédérique Lisacek préparée au sein du laboratoire CERMAV, CNRS et du Computer Science Department, UNIGE et de l’équipe PIG, SIB Dans les Écoles Doctorales EDCSV et UNIGE Etude bioinformatique des lectines: nouvelle classification et prédiction dans les génomes Thèse soutenue publiquement le 8 Février 2021, devant le jury composé de : Dr. Alexandre de Brevern UMR S1134, Inserm, Université Paris Diderot, Paris, France, Rapporteur Dr.
    [Show full text]
  • A SARS-Cov-2 Sequence Submission Tool for the European Nucleotide
    Databases and ontologies Downloaded from https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab421/6294398 by guest on 25 June 2021 A SARS-CoV-2 sequence submission tool for the European Nucleotide Archive Miguel Roncoroni 1,2,∗, Bert Droesbeke 1,2, Ignacio Eguinoa 1,2, Kim De Ruyck 1,2, Flora D’Anna 1,2, Dilmurat Yusuf 3, Björn Grüning 3, Rolf Backofen 3 and Frederik Coppens 1,2 1Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium, 1VIB Center for Plant Systems Biology, 9052 Ghent, Belgium and 2University of Freiburg, Department of Computer Science, Freiburg im Breisgau, Baden-Württemberg, Germany ∗To whom correspondence should be addressed. Associate Editor: XXXXXXX Received on XXXXX; revised on XXXXX; accepted on XXXXX Abstract Summary: Many aspects of the global response to the COVID-19 pandemic are enabled by the fast and open publication of SARS-CoV-2 genetic sequence data. The European Nucleotide Archive (ENA) is the European recommended open repository for genetic sequences. In this work, we present a tool for submitting raw sequencing reads of SARS-CoV-2 to ENA. The tool features a single-step submission process, a graphical user interface, tabular-formatted metadata and the possibility to remove human reads prior to submission. A Galaxy wrap of the tool allows users with little or no bioinformatic knowledge to do bulk sequencing read submissions. The tool is also packed in a Docker container to ease deployment. Availability: CLI ENA upload tool is available at github.com/usegalaxy- eu/ena-upload-cli (DOI 10.5281/zenodo.4537621); Galaxy ENA upload tool at toolshed.g2.bx.psu.edu/view/iuc/ena_upload/382518f24d6d and https://github.com/galaxyproject/tools- iuc/tree/master/tools/ena_upload (development) and; ENA upload Galaxy container at github.com/ELIXIR- Belgium/ena-upload-container (DOI 10.5281/zenodo.4730785) Contact: [email protected] 1 Introduction Nucleotide Archive (ENA).
    [Show full text]
  • Six-Fold Speed-Up of Smith-Waterman Sequence Database Searches Using Parallel Processing on Common Microprocessors
    Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors Running head: Six-fold speed-up of Smith-Waterman searches Torbjørn Rognes* and Erling Seeberg Institute of Medical Microbiology, University of Oslo, The National Hospital, NO-0027 Oslo, Norway Abstract Motivation: Sequence database searching is among the most important and challenging tasks in bioinformatics. The ultimate choice of sequence search algorithm is that of Smith- Waterman. However, because of the computationally demanding nature of this method, heuristic programs or special-purpose hardware alternatives have been developed. Increased speed has been obtained at the cost of reduced sensitivity or very expensive hardware. Results: A fast implementation of the Smith-Waterman sequence alignment algorithm using SIMD (Single-Instruction, Multiple-Data) technology is presented. This implementation is based on the MMX (MultiMedia eXtensions) and SSE (Streaming SIMD Extensions) technology that is embedded in Intel’s latest microprocessors. Similar technology exists also in other modern microprocessors. Six-fold speed-up relative to the fastest previously known Smith-Waterman implementation on the same hardware was achieved by an optimised 8-way parallel processing approach. A speed of more than 150 million cell updates per second was obtained on a single Intel Pentium III 500MHz microprocessor. This is probably the fastest implementation of this algorithm on a single general-purpose microprocessor described to date. Availability: Online searches with the software are available at http://dna.uio.no/search/ Contact: [email protected] Published in Bioinformatics (2000) 16 (8), 699-706. Copyright © (2000) Oxford University Press. *) To whom correspondence should be addressed.
    [Show full text]
  • Bioinformatics Courses Lecture 3: (Local) Alignment and Homology
    C E N Bioinformatics courses T R E Principles of Bioinformatics (BSc) & F B O I Fundamentals of Bioinformatics R O I I N N (MSc) T F E O G R R M A A T T I I Lecture 3: (local) alignment and V C E S homology searching V U Centre for Integrative Bioinformatics VU (IBIVU) Faculty of Exact Sciences / Faculty of Earth and Life Sciences 1 http://ibi.vu.nl, [email protected], 87649 (Heringa), Room P1.28 Divergent evolution sequence -> structure -> function • Common ancestor (CA) CA • Sequences change over time • Protein structures typically remain the same (robust against Sequence 1 ≠ Sequence 2 multiple mutations) • Therefore, function normally is Structure 1 = Structure 2 preserved within orthologous families Function 1 = Function 2 “Structure more conserved than sequence” 2 Reconstructing divergent evolution Ancestral sequence: ABCD ACCD (B C) ABD (C ø) mutation deletion ACCD or ACCD Pairwise Alignment AB─D A─BD 3 Reconstructing divergent evolution Ancestral sequence: ABCD ACCD (B C) ABD (C ø) mutation deletion ACCD or ACCD Pairwise Alignment AB─D A─BD true alignment 4 Pairwise alignment examples A protein sequence alignment MSTGAVLIY--TSILIKECHAMPAGNE----- ---GGILLFHRTHELIKESHAMANDEGGSNNS * * * **** *** A DNA sequence alignment attcgttggcaaatcgcccctatccggccttaa att---tggcggatcg-cctctacgggcc---- *** **** **** ** ****** 5 Evolution and three-dimensional protein structure information Multiple alignment Protein structure What do we see if we colour code the space-filling (CPK) protein model? • E.g., red for conserved alignment positions to blue for variable (unconserved) positions. 6 Evolution and three-dimensional protein structure information Isocitrate dehydrogenase: The distance from the active site (in yellow) determines the rate of evolution (red = fast evolution, blue = slow evolution) Dean, A.
    [Show full text]
  • A Survey on Data Compression Methods for Biological Sequences
    Article A Survey on Data Compression Methods for Biological Sequences Morteza Hosseini *, Diogo Pratas and Armando J. Pinho Institute of Electronics and Informatics Engineering of Aveiro/Department of Electronics, Telecommunications and Informatics (IEETA/DETI), University of Aveiro, 3810-193 Aveiro, Portugal; [email protected] (D.P.); [email protected] (A.J.P.) * Correspondence: [email protected]; Tel.: +351-234-370520; Fax: +351-234-370545 Academic Editor: Khalid Sayood Received: 27 June 2016; Accepted: 29 September 2016; Published: 14 October 2016 Abstract: The ever increasing growth of the production of high-throughput sequencing data poses a serious challenge to the storage, processing and transmission of these data. As frequently stated, it is a data deluge. Compression is essential to address this challenge—it reduces storage space and processing costs, along with speeding up data transmission. In this paper, we provide a comprehensive survey of existing compression approaches, that are specialized for biological data, including protein and DNA sequences. Also, we devote an important part of the paper to the approaches proposed for the compression of different file formats, such as FASTA, as well as FASTQ and SAM/BAM, which contain quality scores and metadata, in addition to the biological sequences. Then, we present a comparison of the performance of several methods, in terms of compression ratio, memory usage and compression/decompression time. Finally, we present some suggestions for future research on biological data compression. Keywords: protein sequence; DNA sequence; reference-free compression; reference-based compression; FASTA; Multi-FASTA; FASTQ; SAM; BAM 1. Introduction With the development of high-throughput sequencing (HTS) technologies, huge volumes of sequencing data are produced everyday, fast and cheaply, that can lead to upgrading the scientific knowledge of genome sequence information as well as facilitating the diagnosis and therapy.
    [Show full text]
  • Biohackathon Series in 2013 and 2014
    F1000Research 2019, 8:1677 Last updated: 02 AUG 2021 OPINION ARTICLE BioHackathon series in 2013 and 2014: improvements of semantic interoperability in life science data and services [version 1; peer review: 2 approved with reservations] Toshiaki Katayama 1, Shuichi Kawashima1, Gos Micklem 2, Shin Kawano 1, Jin-Dong Kim1, Simon Kocbek1, Shinobu Okamoto1, Yue Wang1, Hongyan Wu3, Atsuko Yamaguchi 1, Yasunori Yamamoto1, Erick Antezana 4, Kiyoko F. Aoki-Kinoshita5, Kazuharu Arakawa6, Masaki Banno7, Joachim Baran8, Jerven T. Bolleman 9, Raoul J. P. Bonnal10, Hidemasa Bono 1, Jesualdo T. Fernández-Breis11, Robert Buels12, Matthew P. Campbell13, Hirokazu Chiba14, Peter J. A. Cock15, Kevin B. Cohen16, Michel Dumontier17, Takatomo Fujisawa18, Toyofumi Fujiwara1, Leyla Garcia 19, Pascale Gaudet9, Emi Hattori20, Robert Hoehndorf21, Kotone Itaya6, Maori Ito22, Daniel Jamieson23, Simon Jupp19, Nick Juty19, Alex Kalderimis2, Fumihiro Kato 24, Hideya Kawaji25, Takeshi Kawashima18, Akira R. Kinjo26, Yusuke Komiyama27, Masaaki Kotera28, Tatsuya Kushida 29, James Malone30, Masaaki Matsubara 31, Satoshi Mizuno32, Sayaka Mizutani 28, Hiroshi Mori33, Yuki Moriya1, Katsuhiko Murakami34, Takeru Nakazato1, Hiroyo Nishide14, Yosuke Nishimura 28, Soichi Ogishima32, Tazro Ohta1, Shujiro Okuda35, Hiromasa Ono1, Yasset Perez-Riverol 19, Daisuke Shinmachi5, Andrea Splendiani36, Francesco Strozzi37, Shinya Suzuki 28, Junichi Takehara28, Mark Thompson38, Toshiaki Tokimatsu39, Ikuo Uchiyama 14, Karin Verspoor 40, Mark D. Wilkinson 41, Sarala Wimalaratne19, Issaku Yamada
    [Show full text]
  • Database and Bioinformatic Analysis of BCL-2 Family Proteins and BH3-Only Proteins Abdel Aouacheria, Vincent Navratil, Christophe Combet
    Database and Bioinformatic Analysis of BCL-2 Family Proteins and BH3-Only Proteins Abdel Aouacheria, Vincent Navratil, Christophe Combet To cite this version: Abdel Aouacheria, Vincent Navratil, Christophe Combet. Database and Bioinformatic Analysis of BCL-2 Family Proteins and BH3-Only Proteins. Gavathiotis E. BCL-2 Family Proteins. Methods in Molecular Biology, 1877, Springer Nature, pp.23-43, 2019, Methods in Molecular Biology, 978-1-4939- 8860-0. 10.1007/978-1-4939-8861-7_2. hal-02347884 HAL Id: hal-02347884 https://hal.archives-ouvertes.fr/hal-02347884 Submitted on 5 Nov 2019 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Copyright Database and bioinformatic analysis of BCL-2 family proteins and BH3-only proteins Abdel Aouacheria 1,*, Vincent Navratil 2 and Christophe Combet 3 1 ISEM, Institut des Sciences de l’Evolution de Montpellier, Université de Montpellier, UMR 5554, CNRS, IRD, EPHE, Place Eugène Bataillon, 34095 Montpellier, France 2 PRABI, Rhône Alpes Bioinformatics Center, UCBL, Lyon1, Université de Lyon, Lyon, France. 3 Centre de Recherche en Cancérologie de Lyon, UMR Inserm U1052, CNRS 5286, Université Claude Bernard Lyon 1, Centre Léon Bérard, Lyon, France.
    [Show full text]
  • Protein Database
    UniProt Unit-2 Paper- Bioinformatics (DSE-1) B.Sc. (H) Microbiology V Sem UniProt It provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information UniProt - Universal protein resource Swiss-Prot & TrEMBL PIR-PSD •Merger of these three databases, since 2002 •Funded mainly by NIH (US) to be the highest quality, most thoroughly annotated protein sequence database •UniProt is a collaboration between the European Bioinformatics Institute (EMBL- EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). •Collaboration through different tasks such as database curation, software development and support. Source of protein sequence data • Protein sequencing is rare • Most protein sequence Large-scale sequencing Individual Patent derived from nucleotide data projects scientists Offices Nucleotide sequencing Submit Protein sequencing Submit Nucleotide Derive Protein sequence protein sequence sequence database database More than 95 % of the protein sequences provided by UniProtKB are derived from the translation of the coding sequences (CDS) Protein sequence is mainly derived data submit DNA sequence ACGCTCGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACGATTCT may not have direct Predicted Predicted transcribe Predicted evidence start splice sites stop Derived mRNA sequence AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC translate Derived protein sequence MRSNECCCAMSC UniProt • The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. • The UniProt databases are: 1. UniProt Knowledgebase (UniProtKB), 2. UniProt Reference Clusters (UniRef), 3. UniProt Archive (UniParc). UniProtKB UniProtKB • Reviewed (Swiss-Prot) - Manually annotated Records with information extracted from literature and curator-evaluated computational analysis and and scientific conclusions. • Is a high quality non-redundant protein sequence database • Unreviewed (TrEMBL) - Computationally analyzed Records that await full manual annotation.
    [Show full text]
  • Sequence Database Searching
    Sequence Database Searching 2007-01-14 Gao, Ge Center for Bioinformatics, Peking University Sequence Alignment: A Quick Review Object: see how close two sequences are Usages: To infer functions. To determine the common area of some sequences To infer protein structure, if the structure of one of the sequences is known. To “guess” whether they are descended from a common ancestor (and construct evolutionary trees). © Copyright 2006 Center for Bioinformatics, Peking University Sequence Alignment: in Biology The purpose of a sequence alignment is to line up all residues in the sequence that were derived from the same residue position in the ancestral gene or protein in any number of sequences A B A B © Copyright 2006g aCpe n=te irn fsoer rBtiooinf oorrm dateicles,t iPoenking University Sequence Alignment: Scoring GAATC GAAT-C -GAAT-C CATAC C-ATAC C-A-TAC GAATC- GAAT-C GA-ATC CA-TAC CA-TAC CATA-C Scoring function: measure the quality of a candidate alignment. scoring matrix, gap penalty © Copyright 2006 Center for Bioinformatics, Peking University Sequence Alignment : Global vs. Local Global alignments: align residues in whole sequence, are most useful when the sequences are similar and of roughly equal size. Algorithm: Needleman-Wunsch Local alignments: align residues in regions are more useful for diverged sequences Algorithm: Smith-Waterman With sufficiently similar sequences, there is no difference between local and global alignments. © Copyright 2006 Center for Bioinformatics, Peking University Sequence Database Searching Rather than do the alignment pair-wise, it’s necessary to run database searching in a high-throughput style. Identify similarities between novel query sequences whose structures and functions are unknown and uncharacterized sequences in (public) databases whose structures and functions have been elucidated.
    [Show full text]
  • Accessing Molecular Biology Information * Genome Browsers
    Accessing molecular biology information * Genome browsers - UCSC * NCBI * Galaxy Flow of genetic information and databases Exon 1 Exon 2 Exon 3 Exon 4 intron intron intron DNA transcription 5' 3' RNA Genbank/ EMBL splicing 5' UTR 3' UTR polyA mature coding sequence mRNA translation protein GenPept / SwissProt folding / UniProt Protein Data Bank (PDB) Databases, cont. Redundancy at GenBank => RefSeq Many sequences are represented more than once in GenBank 2003 RefSeq collection : curated secondary database non-redundant selected organisms •Genome DNA (assemblies) •Transcripts (RNA) •Protein http://www.ncbi.nlm.nih.gov/books/bv.fcg i?rid=handbook RefSeq vs GenBank GenBank RefSeq Not curated Curated Author submits NCBI creates from existing data Only author can revise NCBI reivses as new data emerge Multiple records from same loci Single records for each molecule common of major organisms Records can contradict each other No limit to species included Limited to model organisms Akin to primary literature Akin to review articles Archives with nucleotide sequences • Genbank/EMBL • Sequence read archive (NCBI) • GEO (Gene expression omnibus, NCBI) • 1000 Genomes Project • The Cancer Genome Atlas • The Cancer Genome Project • Species-specific databases like –FlyBase – WormBase – Saccharomyces Genome Database Genome sequencing using a shotgun approach Sequenced eukaryotic genomes Sequencing going wild ... BGI : "capacity to sequence the equivalent of 1,600 complete human genomes each day" "BGI and BGI Americas aim to build a library of digital
    [Show full text]
  • Introduction to Bioinformatics
    INTRODUCTION TO BIOINFORMATICS > Please take the initial BIOINF525 questionnaire: < http://tinyurl.com/bioinf525-questions Barry Grant University of Michigan www.thegrantlab.org BIOINF 525 http://bioboot.github.io/bioinf525_w16/ 12-Jan-2016 Barry Grant, Ph.D. [email protected] Ryan Mills, Ph.D. [email protected] Hongyang Li (GSI) [email protected] COURSE LOGISTICS Lectures: Tuesdays 2:30-4:00 PM Rm. 2062 Palmer Commons Labs: Session I: Thursdays 2:30 - 4:00 PM Session II: Fridays 10:30 - 12:00 PM Rm. 2036 Palmer Commons Website: http://tinyurl.com/bioinf525-w16 Lecture, lab and background reading material plus homework and course announcements MODULE OVERVIEW Objective: Provide an introduction to the practice of bioinformatics as well as a practical guide to using common bioinformatics databases and algorithms 1.1. ‣ Introduction to Bioinformatics 1.2. ‣ Sequence Alignment and Database Searching 1.3 ‣ Structural Bioinformatics 1.4 ‣ Genome Informatics: High Throughput Sequencing Applications and Analytical Methods TODAYS MENU Overview of bioinformatics • The what, why and how of bioinformatics? • Major bioinformatics research areas. • Skepticism and common problems with bioinformatics. Bioinformatics databases and associated tools • Primary, secondary and composite databases. • Nucleotide sequence databases (GenBank & RefSeq). • Protein sequence database (UniProt). • Composite databases (PFAM & OMIM). Database usage vignette • Searching with ENTREZ and BLAST. • Reference slides and handout on major databases. HOMEWORK Complete the initial
    [Show full text]
  • An Open-Access Platform for Glycoinformatics
    Glycobiology vol. 21 no. 4 pp. 493–502, 2011 doi:10.1093/glycob/cwq188 Advance Access publication on November 23, 2010 EUROCarbDB: An open-access platform for glycoinformatics Claus-Wilhelm von der Lieth 2,†, Ana Ardá Freire 3, relational database containing glycan structures, their Dennis Blank 4, Matthew P Campbell 5,6,11, Alessio Ceroni 7, biological context and, when available, primary and David R Damerell 7, Anne Dell 7,RaymondADwek6, interpreted analytical data from high-performance liquid Beat Ernst 8, Rasmus Fogh 9,12, Martin Frank 2, chromatography, mass spectrometry and nuclear magnetic Hildegard Geyer 4, Rudolf Geyer 4, Mathew J Harrison 9,13, resonance experiments. Database content can be accessed Kim Henrick 9,14, Stefan Herget 2, William E Hull 2, via a web-based user interface. The database is complemen- John Ionides 9,14, Hiren J Joshi 2,9, Johannis P Kamerling 3, ted by a suite of glycoinformatics tools, specifically designed 3 3,12 Bas R Leeflang , Thomas Lütteke , to assist the elucidation and submission of glycan structure Downloaded from Magnus Lundborg 10, Kai Maass 4,15, Anthony Merry 9, and experimental data when used in conjunction with con- René Ranzinger 2,16, Jimmy Rosen 3, Louise Royle 5,6, temporary carbohydrate research workflows. All software Pauline M Rudd 5,6, Siegfried Schloissnig 2, tools and source code are licensed under the terms of the Roland Stenutz 10, Wim F Vranken 9,14, Göran Widmalm 10, Lesser General Public License, and publicly contributed glycob.oxfordjournals.org and Stuart M Haslam 1,7 structures and data are freely accessible.
    [Show full text]