Fast, Sensitive Homology Detection Using HMMER


Rob Finn, Sequence Families Team Lead (@robdfinn), 14th Nov 2018

Making sense of sequence data

[Figure: diagram relating sequence data to the information available for it. Labels: Model Organisms, Reference Proteomes, Complete Proteomes, Other Sequences & Metagenomics; Experimental, Literature, Sporadic Literature, Similarity, Uncharacterized.]

MGnify Protein Database

[Figure: Growth of MGnify compared with UniProt. x-axis: Year (2002 to 2018); y-axis: Number of Sequences; series: UniProt, MGnify.]

[Figure: Length distribution. x-axis: length of sequence; y-axis: Frequency (millions); categories: Full length, Partial, C-term truncated, N-term truncated.]

• >1 billion sequences, mean length of 205
• <1% match UniProtKB, but 58% match Pfam

Sequence And Structure Alignments

Fig. adapted from Chap. 16, Structural Bioinformatics, 2nd Ed., Marti-Renom et al. (p. 406, STRUCTURE COMPARISON AND ALIGNMENT):

Figure 16.2. Structure alignment for c-phycocyanin (1CPC:A) (black) and colicin A (1COL:A) (gray) as computed by SALIGN. The alignment extended over 86 residues with a 0.97 Å RMSD. The sequence identity of the superposed residues with respect to the shorter of the two structures was 11.9%.

undergone convergent evolution to form a stable 3-on-3 α-helical sandwich fold. Interestingly, it was subsequently discovered that phycocyanins can aggregate forming clusters that then adhere to the membrane forming the so-called phycobilisomes. Such a functional relationship may indeed point to convergent evolution from a distant common ancestor.

The second example, which is extracted from the work of one of our groups (Tsigelny et al., 2000), illustrated how the combination and integration of different sources of information, including structural alignments, could help to functionally characterize a protein. In our work, two new EF-hand motifs were identified in acetylcholinesterase (AChE) and related proteins by combining the results from a hidden Markov model sequence search, Prosite pattern extraction, and protein structure alignments by CE. It was also found that the α/β hydrolase fold family, including acetylcholinesterases, contains putative Ca²⁺ binding sites, indicative of an EF-hand motif, and which in some family members may be critical for heterologous cell associations. This putative finding represented the second characterization of an EF-hand motif within an extracellular protein, which previously had only been found in osteonectins. Thus, structure alignment had contributed to our understanding of an important family of proteins.

Finally, the third example, also from a previous work of one of our groups (McMahon et al., 2005), combined information from structural alignments deposited in the DBAli database and experiments to analyze the sequence and fold diversity of a C-type lectin domain. We demonstrated that the C-type lectin fold adopted by a major tropism determinant sequence, a retroelement-encoded receptor binding protein, provides a highly static structural scaffold in support of a diverse array of sequences. Immunoglobulins are known to fulfill the same role of a scaffold supporting a large variety of sequences necessary for an antigenic response. C-type lectins were shown to represent a different evolutionary solution taken by retroelements to balance diversity against stability.

MULTIPLE STRUCTURE ALIGNMENT

Our discussions thus far have involved only pair-wise structure comparison and alignment, or at best, alignment of multiple structures to a single representative in a pair-wise fashion (i.e., progressive pair-wise structure alignment). Most of the available methods for multiple structure alignment start by computing all pair-wise alignments between a set of structures but then use them to generate the optimal consensus alignment between all the structures.

Profile hidden Markov models

• Statistical inference, accounting for uncertainty
• Use more information

For a target sequence t, two probabilities are compared:

P(t | H): P(t | model of homology to q)
P(t | R): P(t | model of nonhomology)

\[ S = \log \frac{P(t \mid H)}{P(t \mid R)} \]

Using the joint probability of t and the optimal alignment \pi_o instead:

\[ S = \log \frac{P(t, \pi_o \mid H)}{P(t \mid R)} \]

Optimal alignment scores are only an approximation, and the approximation breaks down on remote homologs.
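One way to see why the optimal alignment score is only an approximation is to compare the probability of a single alignment with the total over all alignments. The short derivation below is a sketch in the slides' notation (t for the target, \pi for an alignment, \pi_o for the optimal one, H and R for the homology and nonhomology models). The posterior P(\pi_o \mid t, H) is introduced here for illustration and does not appear on the slides.

```latex
\begin{align*}
P(t \mid H) &= \sum_{\pi} P(t, \pi \mid H)
  \;\ge\; \max_{\pi} P(t, \pi \mid H) = P(t, \pi_o \mid H) \\[4pt]
\log \frac{P(t \mid H)}{P(t \mid R)} - \log \frac{P(t, \pi_o \mid H)}{P(t \mid R)}
  &= \log \frac{P(t \mid H)}{P(t, \pi_o \mid H)}
   = -\log P(\pi_o \mid t, H) \;\ge\; 0
\end{align*}
```

Read this way, the optimal alignment score falls short of the full log-odds score by exactly the negative log posterior probability of the optimal alignment. When one alignment dominates, the shortfall is close to zero. When many alignments are plausible, which is the expected situation for remote homologs, the shortfall grows.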
Example alignment:

...GHRL...
...| |...
...GI-M...

According to inference theory, the correct score is a log-odds ratio summed over all alignments. This depends on:
- a probability model of alignment, not just scores
- algorithms fast enough to use in practice

Optimal alignment score (HMMs: "Viterbi" score, V); BLAST (almost):

\[ V = \log \frac{P(t, \pi_o \mid H)}{P(t \mid R)} = \log \frac{\max_{\pi} P(t, \pi \mid H)}{P(t \mid R)} \]

HMMs: "Forward" score, F; HMMER:

\[ F = \log \frac{P(t \mid H)}{P(t \mid R)} = \log \frac{\sum_{\pi} P(t, \pi \mid H)}{P(t \mid R)} \]

Profile Hidden Markov Models: Encapsulate diversity

Input multiple alignment, with consensus columns assigned, defining inserts and deletes:

seq1 ACG-LD
seq2 SCG--E
Seq3 NCGgFD
Seq4 TCG-WQ
     123-45

[Figure: Plan7 core model with begin state B, match states M1 to M5, delete states D1 to D5, insert states I0 to I5, and end state E.]
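The consensus-column assignment for the four-sequence alignment above can be sketched in a few lines of Python. This is a minimal illustration, not HMMER's implementation. The function names and the simple "at least half of the sequences have a residue" rule are assumptions made here, since the slides do not state the rule used. The alignment rows are copied from the slide.

```python
# Toy alignment copied from the slide; the lowercase 'g' is an inserted residue.
ALIGNMENT = [
    ("seq1", "ACG-LD"),
    ("seq2", "SCG--E"),
    ("Seq3", "NCGgFD"),
    ("Seq4", "TCG-WQ"),
]


def consensus_mask(rows, min_residue_fraction=0.5):
    """Return one bool per column: True if the column is treated as consensus.

    Assumption (hypothetical rule): a column is a consensus (match) column
    when at least `min_residue_fraction` of the sequences have a residue
    (anything other than '-') in it.
    """
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned rows must have the same length")
    mask = []
    for col in range(length):
        residues = sum(1 for row in rows if row[col] != "-")
        mask.append(residues / len(rows) >= min_residue_fraction)
    return mask


def number_line(mask):
    """Render the mask like the slide: match columns numbered, others '-'."""
    out = []
    position = 0
    for is_match in mask:
        if is_match:
            position += 1
            out.append(str(position % 10))
        else:
            out.append("-")
    return "".join(out)


def classify(row, mask):
    """Label each column of one row: M (match), D (delete), I (insert), '.' (nothing)."""
    labels = []
    for char, is_match in zip(row, mask):
        if is_match:
            labels.append("M" if char != "-" else "D")
        else:
            labels.append("I" if char != "-" else ".")
    return "".join(labels)


rows = [sequence for _, sequence in ALIGNMENT]
mask = consensus_mask(rows)
print("     " + number_line(mask))
for name, sequence in ALIGNMENT:
    print(f"{name} {sequence}  {classify(sequence, mask)}")
```

Under this rule the fourth column, where only Seq3 has a residue, becomes an insert column, and the gap in seq2 at the fourth consensus position is labelled as a delete. The five consensus columns correspond to the five match states M1 to M5 of the core model.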