Nucleic Acids Research, 2020, Vol. 48, Database issue, D328–D334
Published online 14 November 2019
doi: 10.1093/nar/gkz995

The neXtProt knowledgebase in 2020: data, tools and usability improvements

Monique Zahn-Zabal¹, Pierre-André Michel¹, Alain Gateau¹, Frédéric Nikitin¹, Mathieu Schaeffer¹,², Estelle Audot¹, Pascale Gaudet¹, Paula D. Duek¹, Daniel Teixeira¹, Valentine Rech de Laval¹,²,³, Kasun Samarasinghe¹,², Amos Bairoch¹,² and Lydie Lane¹,²,*

¹CALIPHO group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland, ²Department of microbiology and molecular medicine, Faculty of Medicine, University of Geneva, Geneva, Switzerland and ³Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland

Received September 11, 2019; Revised October 10, 2019; Editorial Decision October 11, 2019; Accepted October 18, 2019

*To whom correspondence should be addressed. Tel: +41 22 379 58 41; Email: [email protected]

© The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

The neXtProt knowledgebase (https://www.nextprot.org) is an integrative resource providing both data on human proteins and the tools to explore them. In order to provide comprehensive and up-to-date data, we evaluate and add new data sets. We describe the incorporation of three new data sets that provide expression, function, protein-protein binary interaction, post-translational modification (PTM) and variant information. New SPARQL query examples illustrating uses of the new data were added. neXtProt has continued to develop tools for proteomics. We have improved the peptide uniqueness checker and have implemented a new protein digestion tool. Together, these tools make it possible to determine which proteases can be used to identify trypsin-resistant proteins by mass spectrometry. In terms of usability, we have finished revamping our web interface and completely rewritten our API. Our SPARQL endpoint now supports federated queries. All the neXtProt data are available via our user interface, API, SPARQL endpoint and FTP site, including the new PEFF 1.0 format files. Finally, the data on our FTP site is now licensed under CC BY 4.0 to promote its reuse.

INTRODUCTION

Comprehensive, current, high quality data, as well as innovative and powerful tools, are necessary for researchers to make the most of the ever-increasing data relevant to human biology. neXtProt (1), a knowledgebase focusing exclusively on human proteins, leverages the expert manual annotation carried out at specialist resources and in-house to provide a single point of reference. Information concerning human protein function, cellular localization, tissular expression, interactions, variants and their phenotypic effect, post-translational modifications (PTMs), as well as peptides identified in mass spectrometry experiments and epitopes recognized by antibodies, has been integrated from a number of resources. By doing so, neXtProt extends the contents of UniProtKB/Swiss-Prot (2) to provide a more comprehensive data set.

However, data alone is not sufficient for scientists to comprehend complex information rapidly. For this reason, neXtProt organizes the information concerning an entry in several views, with interactive viewers that allow the user to select the data displayed. We also provide tools to analyze and explore the data. A basic, full text search, as well as an advanced, SPARQL-based search, allow users to search the data in neXtProt. Additional tools have been implemented. Users can store and compare private lists of entries. The peptide uniqueness checker (3) determines which peptides are unambiguous and can thus be used to confidently identify protein entries (4).

In this manuscript, we describe the latest progress on developing neXtProt. Since 2016, three major data sets have been integrated. Firstly, high quality, tissular expression data from the Human Protein Atlas (HPA) obtained by RNA-seq (5) has been added. Secondly, information annotated from the literature on the function, cellular localization, interactions and phosphorylations carried out by human protein kinases has been incorporated. Lastly, variant frequency data from the Genome Aggregation Database (gnomAD) (6) extends the information on sequence variations at the protein level. We also report on improvements made to the peptide uniqueness checker and the implementation of the new protein digestion tool. Finally, we present improvements to the web site and SPARQL endpoint that increase the accessibility and usability of the neXtProt data.

neXtProt data overview

The first neXtProt release in April 2011 contained data from UniProtKB, Ensembl, HPA, Bgee and GOA. Since then neXtProt has been steadily incorporating new data from additional resources, with a particular emphasis on expression data, proteomics data and variant data. The current neXtProt release was built using human genome assembly GRCh38 (7). The data from UniProtKB (2) is currently supplemented with data from Bgee (8), HPA (5,9), PeptideAtlas (10), SRMAtlas (11), GOA (12), dbSNP (13), Ensembl (14), COSMIC (15), DKF GFP-cDNA localization (16,17), Weizmann Institute of Science's Kahn Dynamic Proteomics Database (18), IntAct (19), GlyConnect (20), gnomAD (6), as well as in-house curated data (21,22). Table 1 summarizes the changes in the content since our last neXtProt update (1).

The data in the UniProtKB/Swiss-Prot (Reviewed) entries for Homo sapiens (TaxID: 9606) having the keyword Complete proteome (KW-0181) provide the groundwork for neXtProt. In order to evaluate the improvement in coverage through the integration of data from sources other than UniProtKB, we used SPARQL queries to compare the number of entries in neXtProt having data from UniProtKB with the number having data from any source (Table 2). UniProtKB provides excellent coverage for a single resource; it thus provides a good foundation for the construction of neXtProt. The incorporation of data from additional sources considerably improves the coverage: over 78% of entries in neXtProt have information about the function, cellular localization, interactions, expression, post-translational modifications and variants.

RNA-seq

We incorporated the RNA-seq data from 37 different normal tissues from the Human Protein Atlas. As RNA-seq data is highly accurate for quantifying expression levels with high reproducibility, this improved the expression data provided in neXtProt at the level of the transcript, which until then came from microarray and expressed sequence tag (EST) data. While RNA-seq is quantitative, the semi-quantitative expression values (undetected, low, medium and high) provided by HPA were taken and are displayed in the same manner as the other data in the Expression view to make for easier comparison (Figure 1). This also enables the RNA-seq data to be queried in the same manner using SPARQL.

Protein kinases

Another new large set that was integrated is a set of manual annotations that we have created to capture a wide range of published experimental results concerning 300 protein kinases. The proteins phosphorylated by these kinases, as well as whenever possible the specific amino acid residue which is phosphorylated, were annotated. [...] in a biological process were captured using Gene Ontology (GO) terms (23,24). Binary interactions with human proteins were also annotated and are displayed in the Interactions view.

As with the phenotypic data described in our previous report (1), this protein kinase data is available in the new Protein kinase function portal. Accessible from the top menu 'Portals', the data are presented in tabular form, with each column being searchable and sortable. More details concerning the experimental context for the data in all portals are now provided. Two new columns, labeled Cell line / Tissue and Experimental details, can be used to filter the data. The data in the portals can be downloaded in CSV format, copied or printed. The entry accession (AC) corresponding to the annotation subject has also been added.

Variant frequency

To date the corpus of variant data in neXtProt covers variants observed in health and disease, as well as the phenotypic effect of the variants. The neXtProt database contains over six million single amino acid variations imported from UniProtKB, dbSNP and COSMIC or manually annotated from the literature, but it is difficult to make use of this variant data in the absence of information about their frequencies in human populations. The Genome Aggregation Database (gnomAD) (6) spans 126 216 exome sequences and 15 136 whole-genome sequences extracted from a variety of large-scale sequencing projects and provides computed allele frequencies for most of the reported variants. We have thus integrated variant frequency information from gnomAD version 2.1.1.

neXtProt now contains 18 685 entries (92%) and 2 691 323 variants (45%) with frequency data from gnomAD. We display the number of times the allele was sequenced (allele count), the number of individuals homozygous for the allele (homozygous count), the total number of alleles sequenced (allele number) and the allele frequency in the evidence (Figure 2). SPARQL queries can be used to answer questions such as which variants have a frequency greater than 0.1 (NXQ_00255) or which variants are frequently found in a homozygous state (NXQ_00256).

Peptide uniqueness checker

The peptide uniqueness checker (3) allows scientists to define which peptides can be used to validate the existence of human proteins by determining whether a peptide maps uniquely versus multiply to human protein sequences, taking into account isobaric substitutions, alternative splicing and single amino acid variants. It was adapted to take into ac-
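
The idea of checking whether a peptide maps uniquely or multiply can be illustrated with a minimal Python sketch. It is not the neXtProt implementation: the function names, entry identifiers and sequences below are hypothetical, and the sketch covers only two of the three factors named in the text. Isobaric substitutions are handled by treating isoleucine (I) and leucine (L) as interchangeable, since the two residues have the same mass, and alternative splicing is handled by searching every isoform of an entry. Single amino acid variants are deliberately left out.

```python
from typing import Dict, List


def normalize_isobaric(sequence: str) -> str:
    """Map isoleucine (I) onto leucine (L), which has the same mass."""
    return sequence.upper().replace("I", "L")


def matching_entries(peptide: str, proteome: Dict[str, List[str]]) -> List[str]:
    """Return the entries in which the peptide occurs in at least one isoform.

    `proteome` maps an entry identifier to the sequences of its isoforms.
    """
    target = normalize_isobaric(peptide)
    return sorted(
        entry
        for entry, isoforms in proteome.items()
        if any(target in normalize_isobaric(isoform) for isoform in isoforms)
    )


def is_unique(peptide: str, proteome: Dict[str, List[str]]) -> bool:
    """A peptide is unique if it maps to exactly one entry."""
    return len(matching_entries(peptide, proteome)) == 1


# Hypothetical toy data: identifiers and sequences are invented for illustration.
toy_proteome = {
    "ENTRY_A": ["MKTLLVAGGR", "MKTLLVAGGRPEPTK"],  # two isoforms of one entry
    "ENTRY_B": ["MSSIIVAGGRQW"],
    "ENTRY_C": ["MAAPEPTKDD"],
}

if __name__ == "__main__":
    for peptide in ["KTLLVAGGR", "LLVAGGR", "PEPTK"]:
        print(peptide, matching_entries(peptide, toy_proteome))
```

In this toy data set, the I/L normalization is what makes `LLVAGGR` match the second entry as well, even though that entry's sequence contains `IIVAGGR`. A peptide that maps to several isoforms of the same entry still counts as unique here, because matches are counted per entry rather than per isoform.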