The Uniprot Knowledgebase BLAST
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Uniprot at EMBL-EBI's Role in CTTV
Barbara P. Palka, Daniel Gonzalez, Edd Turner, Xavier Watkins, Maria J. Martin, Claire O’Donovan European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK UniProt at EMBL-EBI’s role in CTTV: contributing to improved disease knowledge Introduction The mission of UniProt is to provide the scientific community with a The Centre for Therapeutic Target Validation (CTTV) comprehensive, high quality and freely accessible resource of launched in Dec 2015 a new web platform for life- protein sequence and functional information. science researchers that helps them identify The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of therapeutic targets for new and repurposed medicines. functional information on proteins, with accurate, consistent and rich CTTV is a public-private initiative to generate evidence on the annotation. As much annotation information as possible is added to each validity of therapeutic targets based on genome-scale experiments UniProtKB record and this includes widely accepted biological ontologies, and analysis. CTTV is working to create an R&D framework that classifications and cross-references, and clear indications of the quality of applies to a wide range of human diseases, and is committed to annotation in the form of evidence attribution of experimental and sharing its data openly with the scientific community. CTTV brings computational data. together expertise from four complementary institutions: GSK, Biogen, EMBL-EBI and Wellcome Trust Sanger Institute. UniProt’s disease expert curation Q5VWK5 (IL23R_HUMAN) This section provides information on the disease(s) associated with genetic variations in a given protein. The information is extracted from the scientific literature and diseases that are also described in the OMIM database are represented with a controlled vocabulary. -
Sequencing Alignment I Outline: Sequence Alignment
Sequencing Alignment I Lectures 16 – Nov 21, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall (JHN) 022 1 Outline: Sequence Alignment What Why (applications) Comparative genomics DNA sequencing A simple algorithm Complexity analysis A better algorithm: “Dynamic programming” 2 1 Sequence Alignment: What Definition An arrangement of two or several biological sequences (e.g. protein or DNA sequences) highlighting their similarity The sequences are padded with gaps (usually denoted by dashes) so that columns contain identical or similar characters from the sequences involved Example – pairwise alignment T A C T A A G T C C A A T 3 Sequence Alignment: What Definition An arrangement of two or several biological sequences (e.g. protein or DNA sequences) highlighting their similarity The sequences are padded with gaps (usually denoted by dashes) so that columns contain identical or similar characters from the sequences involved Example – pairwise alignment T A C T A A G | : | : | | : T C C – A A T 4 2 Sequence Alignment: Why The most basic sequence analysis task First aligning the sequences (or parts of them) and Then deciding whether that alignment is more likely to have occurred because the sequences are related, or just by chance Similar sequences often have similar origin or function New sequence always compared to existing sequences (e.g. using BLAST) 5 Sequence Alignment Example: gene HBB Product: hemoglobin Sickle-cell anaemia causing gene Protein sequence (146 aa) MVHLTPEEKS AVTALWGKVN VDEVGGEALG RLLVVYPWTQ RFFESFGDLS TPDAVMGNPK VKAHGKKVLG AFSDGLAHLD NLKGTFATLS ELHCDKLHVD PENFRLLGNV LVCVLAHHFG KEFTPPVQAA YQKVVAGVAN ALAHKYH BLAST (Basic Local Alignment Search Tool) The most popular alignment tool Try it! Pick any protein, e.g. -
Comparative Analysis of Multiple Sequence Alignment Tools
I.J. Information Technology and Computer Science, 2018, 8, 24-30 Published Online August 2018 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2018.08.04 Comparative Analysis of Multiple Sequence Alignment Tools Eman M. Mohamed Faculty of Computers and Information, Menoufia University, Egypt E-mail: [email protected]. Hamdy M. Mousa, Arabi E. keshk Faculty of Computers and Information, Menoufia University, Egypt E-mail: [email protected], [email protected]. Received: 24 April 2018; Accepted: 07 July 2018; Published: 08 August 2018 Abstract—The perfect alignment between three or more global alignment algorithm built-in dynamic sequences of Protein, RNA or DNA is a very difficult programming technique [1]. This algorithm maximizes task in bioinformatics. There are many techniques for the number of amino acid matches and minimizes the alignment multiple sequences. Many techniques number of required gaps to finds globally optimal maximize speed and do not concern with the accuracy of alignment. Local alignments are more useful for aligning the resulting alignment. Likewise, many techniques sub-regions of the sequences, whereas local alignment maximize accuracy and do not concern with the speed. maximizes sub-regions similarity alignment. One of the Reducing memory and execution time requirements and most known of Local alignment is Smith-Waterman increasing the accuracy of multiple sequence alignment algorithm [2]. on large-scale datasets are the vital goal of any technique. The paper introduces the comparative analysis of the Table 1. Pairwise vs. multiple sequence alignment most well-known programs (CLUSTAL-OMEGA, PSA MSA MAFFT, BROBCONS, KALIGN, RETALIGN, and Compare two biological Compare more than two MUSCLE). -
Chapter 6: Multiple Sequence Alignment Learning Objectives
Chapter 6: Multiple Sequence Alignment Learning objectives • Explain the three main stages by which ClustalW performs multiple sequence alignment (MSA); • Describe several alternative programs for MSA (such as MUSCLE, ProbCons, and TCoffee); • Explain how they work, and contrast them with ClustalW; • Explain the significance of performing benchmarking studies and describe several of their basic conclusions for MSA; • Explain the issues surrounding MSA of genomic regions Outline: multiple sequence alignment (MSA) Introduction; definition of MSA; typical uses Five main approaches to multiple sequence alignment Exact approaches Progressive sequence alignment Iterative approaches Consistency-based approaches Structure-based methods Benchmarking studies: approaches, findings, challenges Databases of Multiple Sequence Alignments Pfam: Protein Family Database of Profile HMMs SMART Conserved Domain Database Integrated multiple sequence alignment resources MSA database curation: manual versus automated Multiple sequence alignments of genomic regions UCSC, Galaxy, Ensembl, alignathon Perspective Multiple sequence alignment: definition • a collection of three or more protein (or nucleic acid) sequences that are partially or completely aligned • homologous residues are aligned in columns across the length of the sequences • residues are homologous in an evolutionary sense • residues are homologous in a structural sense Example: 5 alignments of 5 globins Let’s look at a multiple sequence alignment (MSA) of five globins proteins. We’ll use five prominent MSA programs: ClustalW, Praline, MUSCLE (used at HomoloGene), ProbCons, and TCoffee. Each program offers unique strengths. We’ll focus on a histidine (H) residue that has a critical role in binding oxygen in globins, and should be aligned. But often it’s not aligned, and all five programs give different answers. -
The ELIXIR Core Data Resources: Fundamental Infrastructure for The
Supplementary Data: The ELIXIR Core Data Resources: fundamental infrastructure for the life sciences The “Supporting Material” referred to within this Supplementary Data can be found in the Supporting.Material.CDR.infrastructure file, DOI: 10.5281/zenodo.2625247 (https://zenodo.org/record/2625247). Figure 1. Scale of the Core Data Resources Table S1. Data from which Figure 1 is derived: Year 2013 2014 2015 2016 2017 Data entries 765881651 997794559 1726529931 1853429002 2715599247 Monthly user/IP addresses 1700660 2109586 2413724 2502617 2867265 FTEs 270 292.65 295.65 289.7 311.2 Figure 1 includes data from the following Core Data Resources: ArrayExpress, BRENDA, CATH, ChEBI, ChEMBL, EGA, ENA, Ensembl, Ensembl Genomes, EuropePMC, HPA, IntAct /MINT , InterPro, PDBe, PRIDE, SILVA, STRING, UniProt ● Note that Ensembl’s compute infrastructure physically relocated in 2016, so “Users/IP address” data are not available for that year. In this case, the 2015 numbers were rolled forward to 2016. ● Note that STRING makes only minor releases in 2014 and 2016, in that the interactions are re-computed, but the number of “Data entries” remains unchanged. The major releases that change the number of “Data entries” happened in 2013 and 2015. So, for “Data entries” , the number for 2013 was rolled forward to 2014, and the number for 2015 was rolled forward to 2016. The ELIXIR Core Data Resources: fundamental infrastructure for the life sciences 1 Figure 2: Usage of Core Data Resources in research The following steps were taken: 1. API calls were run on open access full text articles in Europe PMC to identify articles that mention Core Data Resource by name or include specific data record accession numbers. -
Dual Proteome-Scale Networks Reveal Cell-Specific Remodeling of the Human Interactome
bioRxiv preprint doi: https://doi.org/10.1101/2020.01.19.905109; this version posted January 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Dual Proteome-scale Networks Reveal Cell-specific Remodeling of the Human Interactome Edward L. Huttlin1*, Raphael J. Bruckner1,3, Jose Navarrete-Perea1, Joe R. Cannon1,4, Kurt Baltier1,5, Fana Gebreab1, Melanie P. Gygi1, Alexandra Thornock1, Gabriela Zarraga1,6, Stanley Tam1,7, John Szpyt1, Alexandra Panov1, Hannah Parzen1,8, Sipei Fu1, Arvene Golbazi1, Eila Maenpaa1, Keegan Stricker1, Sanjukta Guha Thakurta1, Ramin Rad1, Joshua Pan2, David P. Nusinow1, Joao A. Paulo1, Devin K. Schweppe1, Laura Pontano Vaites1, J. Wade Harper1*, Steven P. Gygi1*# 1Department of Cell Biology, Harvard Medical School, Boston, MA, 02115, USA. 2Broad Institute, Cambridge, MA, 02142, USA. 3Present address: ICCB-Longwood Screening Facility, Harvard Medical School, Boston, MA, 02115, USA. 4Present address: Merck, West Point, PA, 19486, USA. 5Present address: IQ Proteomics, Cambridge, MA, 02139, USA. 6Present address: Vor Biopharma, Cambridge, MA, 02142, USA. 7Present address: Rubius Therapeutics, Cambridge, MA, 02139, USA. 8Present address: RPS North America, South Kingstown, RI, 02879, USA. *Correspondence: [email protected] (E.L.H.), [email protected] (J.W.H.), [email protected] (S.P.G.) #Lead Contact: [email protected] bioRxiv preprint doi: https://doi.org/10.1101/2020.01.19.905109; this version posted January 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. -
Toolboxes for a Standardised and Systematic Study of Glycans
Campbell et al. BMC Bioinformatics 2014, 15(Suppl 1):S9 http://www.biomedcentral.com/1471-2105/15/S1/S9 RESEARCH Open Access Toolboxes for a standardised and systematic study of glycans Matthew P Campbell1, René Ranzinger2, Thomas Lütteke3, Julien Mariethoz4, Catherine A Hayes5, Jingyu Zhang1, Yukie Akune6, Kiyoko F Aoki-Kinoshita6, David Damerell7,11, Giorgio Carta8, Will S York2, Stuart M Haslam7, Hisashi Narimatsu9, Pauline M Rudd8, Niclas G Karlsson4, Nicolle H Packer1, Frédérique Lisacek4,10* From Integrated Bio-Search: 12th International Workshop on Network Tools and Applications in Biology (NETTAB 2012) Como, Italy. 14-16 November 2012 Abstract Background: Recent progress in method development for characterising the branched structures of complex carbohydrates has now enabled higher throughput technology. Automation of structure analysis then calls for software development since adding meaning to large data collections in reasonable time requires corresponding bioinformatics methods and tools. Current glycobioinformatics resources do cover information on the structure and function of glycans, their interaction with proteins or their enzymatic synthesis. However, this information is partial, scattered and often difficult to find to for non-glycobiologists. Methods: Following our diagnosis of the causes of the slow development of glycobioinformatics, we review the “objective” difficulties encountered in defining adequate formats for representing complex entities and developing efficient analysis software. Results: Various solutions already implemented and strategies defined to bridge glycobiology with different fields and integrate the heterogeneous glyco-related information are presented. Conclusions: Despite the initial stage of our integrative efforts, this paper highlights the rapid expansion of glycomics, the validity of existing resources and the bright future of glycobioinformatics. -
To Find Information About Arabidopsis Genes Leonore Reiser1, Shabari
UNIT 1.11 Using The Arabidopsis Information Resource (TAIR) to Find Information About Arabidopsis Genes Leonore Reiser1, Shabari Subramaniam1, Donghui Li1, and Eva Huala1 1Phoenix Bioinformatics, Redwood City, CA USA ABSTRACT The Arabidopsis Information Resource (TAIR; http://arabidopsis.org) is a comprehensive Web resource of Arabidopsis biology for plant scientists. TAIR curates and integrates information about genes, proteins, gene function, orthologs gene expression, mutant phenotypes, biological materials such as clones and seed stocks, genetic markers, genetic and physical maps, genome organization, images of mutant plants, protein sub-cellular localizations, publications, and the research community. The various data types are extensively interconnected and can be accessed through a variety of Web-based search and display tools. This unit primarily focuses on some basic methods for searching, browsing, visualizing, and analyzing information about Arabidopsis genes and genome, Additionally we describe how members of the community can share data using TAIR’s Online Annotation Submission Tool (TOAST), in order to make their published research more accessible and visible. Keywords: Arabidopsis ● databases ● bioinformatics ● data mining ● genomics INTRODUCTION The Arabidopsis Information Resource (TAIR; http://arabidopsis.org) is a comprehensive Web resource for the biology of Arabidopsis thaliana (Huala et al., 2001; Garcia-Hernandez et al., 2002; Rhee et al., 2003; Weems et al., 2004; Swarbreck et al., 2008, Lamesch, et al., 2010, Berardini et al., 2016). The TAIR database contains information about genes, proteins, gene expression, mutant phenotypes, germplasms, clones, genetic markers, genetic and physical maps, genome organization, publications, and the research community. In addition, seed and DNA stocks from the Arabidopsis Biological Resource Center (ABRC; Scholl et al., 2003) are integrated with genomic data, and can be ordered through TAIR. -
Life with 6000 Genes Author(S): A. Goffeau, B. G. Barrell, H. Bussey, R
Life with 6000 Genes Author(s): A. Goffeau, B. G. Barrell, H. Bussey, R. W. Davis, B. Dujon, H. Feldmann, F. Galibert, J. D. Hoheisel, C. Jacq, M. Johnston, E. J. Louis, H. W. Mewes, Y. Murakami, P. Philippsen, H. Tettelin and S. G. Oliver Source: Science, Vol. 274, No. 5287, Genome Issue (Oct. 25, 1996), pp. 546+563-567 Published by: American Association for the Advancement of Science Stable URL: https://www.jstor.org/stable/2899628 Accessed: 28-08-2019 13:31 UTC REFERENCES Linked references are available on JSTOR for this article: https://www.jstor.org/stable/2899628?seq=1&cid=pdf-reference#references_tab_contents You may need to log in to JSTOR to access the linked references. JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected]. Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at https://about.jstor.org/terms American Association for the Advancement of Science is collaborating with JSTOR to digitize, preserve and extend access to Science This content downloaded from 152.3.43.41 on Wed, 28 Aug 2019 13:31:29 UTC All use subject to https://about.jstor.org/terms by FISH across the whole genome. A subset of the 37. G. D. Billingsleyet al., Am. J. Hum. Genet. 52, 343 (1994); L. -
Sequence Motifs, Correlations and Structural Mapping of Evolutionary
Talk overview • Sequence profiles – position specific scoring matrix • Psi-blast. Automated way to create and use sequence Sequence motifs, correlations profiles in similarity searches and structural mapping of • Sequence patterns and sequence logos evolutionary data • Bioinformatic tools which employ sequence profiles: PFAM BLOCKS PROSITE PRINTS InterPro • Correlated Mutations and structural insight • Mapping sequence data on structures: March 2011 Eran Eyal Conservations Correlations PSSM – position specific scoring matrix • A position-specific scoring matrix (PSSM) is a commonly used representation of motifs (patterns) in biological sequences • PSSM enables us to represent multiple sequence alignments as mathematical entities which we can work with. • PSSMs enables the scoring of multiple alignments with sequences, or other PSSMs. PSSM – position specific scoring matrix Assuming a string S of length n S = s1s2s3...sn If we want to score this string against our PSSM of length n (with n lines): n alignment _ score = m ∑ s j , j j=1 where m is the PSSM matrix and sj are the string elements. PSSM can also be incorporated to both dynamic programming algorithms and heuristic algorithms (like Psi-Blast). Sequence space PSI-BLAST • For a query sequence use Blast to find matching sequences. • Construct a multiple sequence alignment from the hits to find the common regions (consensus). • Use the “consensus” to search again the database, and get a new set of matching sequences • Repeat the process ! Sequence space Position-Specific-Iterated-BLAST • Intuition – substitution matrices should be specific to sites and not global. – Example: penalize alanine→glycine more in a helix •Idea – Use BLAST with high stringency to get a set of closely related sequences. -
ELIXIR Poster Numbers: P El001 - 037 Application Posters: P El034 - 037
POSTER LIST ORDERED ALPHABETICALLY BY POSTER TITLE GROUPED BY THEME/TRACK THEME/TRACK: ELIXIR Poster numbers: P_El001 - 037 Application posters: P_El034 - 037 Poster EasyChair Presenting Author list Title Abstract Theme/track Topics number number author P_El001 714 Joan Segura, Daniel Joan Segura 3DBIONOTES: Unifying molecular biology With the advent of next generation sequencing methods, the amount of proteomic and genomic information is growing faster than ever. Several projects have been undertaken to annotate the ELIXIR poster ELIXIR Tabas Madrid, Ruben information genomes of most important organisms, including human. For example, the GENECODE project seeks to enhance all human genes including protein-coding loci with alternatively splices Sanchez-Garcia, Jesús variants, non-coding loci and pseudogenes. Another example is the 1000 genomes, a repository of human genetic variations, including SNPs and structural variants, and their haplotype Cuenca, Carlos Oscar contexts. These projects feed most relevant biological databases as UNIPROT and ENSEMBL, extending the amount of available annotation for genes and proteins.Genomic and proteomic Sánchez Sorzano, Ardan annotations are a valuable contribution in the study of protein and gene functions. However, structural information is an essential key for a deeper understanding of the molecular properties Patwardhan and Jose that allow proteins and genes to perform specific tasks. Therefore, depicting genomic and proteomic information over structural data would offer a very complete picture in order to understand Maria Carazo how proteins and genes behave in the different cellular processes.In this work we present the second version of a web platform -3DBIONOTES- that aims to merge the different levels of molecular biology information, including genomics, proteomics and interactomics data into a unique graphical environment. -
Webnetcoffee
Hu et al. BMC Bioinformatics (2018) 19:422 https://doi.org/10.1186/s12859-018-2443-4 SOFTWARE Open Access WebNetCoffee: a web-based application to identify functionally conserved proteins from Multiple PPI networks Jialu Hu1,2, Yiqun Gao1, Junhao He1, Yan Zheng1 and Xuequn Shang1* Abstract Background: The discovery of functionally conserved proteins is a tough and important task in system biology. Global network alignment provides a systematic framework to search for these proteins from multiple protein-protein interaction (PPI) networks. Although there exist many web servers for network alignment, no one allows to perform global multiple network alignment tasks on users’ test datasets. Results: Here, we developed a web server WebNetcoffee based on the algorithm of NetCoffee to search for a global network alignment from multiple networks. To build a series of online test datasets, we manually collected 218,339 proteins, 4,009,541 interactions and many other associated protein annotations from several public databases. All these datasets and alignment results are available for download, which can support users to perform algorithm comparison and downstream analyses. Conclusion: WebNetCoffee provides a versatile, interactive and user-friendly interface for easily running alignment tasks on both online datasets and users’ test datasets, managing submitted jobs and visualizing the alignment results through a web browser. Additionally, our web server also facilitates graphical visualization of induced subnetworks for a given protein and its neighborhood. To the best of our knowledge, it is the first web server that facilitates the performing of global alignment for multiple PPI networks. Availability: http://www.nwpu-bioinformatics.com/WebNetCoffee Keywords: Multiple network alignment, Webserver, PPI networks, Protein databases, Gene ontology Background tools [7–10] have been developed to understand molec- Proteins are involved in almost all life processes.