Contributing to the UniProt Knowledgebase - how you can help

Cecilia Arighi, PhD, PIR team lead, UniProt

1 Outline

• UniProt Overview
• UniProt curated and additional publications
• Community contribution: Why? What is the benefit?
• Your personal researcher identifier
• Publication submission and review processes
• What happens after submission
• Examples and demo

2 The Universal Protein Resource

www.uniprot.org

Comprehensive, high-quality and freely accessible resource of protein sequence and functional information

3 Resources we provide

>500K unique visitors per month

Tools: BLAST, sequence alignment, ID mapping, peptide search

Programmatic access, downloads

https://www.uniprot.org

4 The knowledgebase UniProtKB

Reviewed (Swiss-Prot)
• high quality expert curation
• non-redundant (1 entry/gene/species)
• cross-references in every entry

Unreviewed (TrEMBL)
• automatic annotation
• sequence redundancy allowed
• computationally generated
• cross-references in every entry

Release 2020_02

5 Types of data in a UniProtKB entry

• Functional annotation: comprehensive
• Names: standardized
• Sequences: isoforms, variants, etc.
• Identifiers: unique, stable
• Links to specialized databases
• Amino acid-specific data

6 Diagram labels: Literature curation; Expert curation; UniProtKB entry; annotations organized in topics in entry view

CLDN1 - Claudin-1 - Homo sapiens (Human) - CLDN1 gene & protein http://www.uniprot.org/uniprot/O95832

UniProtKB - O95832 (CLD1_HUMAN)

Protein Claudin-1 Gene CLDN1 Organism Homo sapiens (Human)

Status: Reviewed - Annotation score: - Experimental evidence at protein level

Function

Claudins function as major constituents of the tight junction complexes that regulate the permeability of epithelia. While some claudin family members play essential roles in the formation of impermeable barriers, others mediate the permeability to ions and small molecules. Often, several claudin family members are coexpressed and interact with each other, and this determines the overall permeability. CLDN1 is required to prevent the paracellular diffusion of small molecules through tight junctions in the epidermis and is required for the normal barrier function of the skin. Required for normal water homeostasis and to prevent excessive water loss through the skin, probably via an indirect effect on the expression levels of other proteins, since CLDN1 itself seems to be dispensable for water barrier formation in keratinocyte tight junctions (PubMed:23407391). Evidence: 1 Publication

(Microbial infection) Acts as a receptor for hepatitis C virus in hepatocytes (PubMed:17325668). Acts as a receptor for dengue virus (PubMed:24074594). Evidence: 2 Publications

GO - Molecular function
• identical protein binding (Source: UniProtKB)
• structural molecule activity (Source: InterPro)
• virus receptor activity (Source: UniProtKB-KW)

GO - Biological process
• aging (Source: Ensembl)
• bicellular tight junction assembly (Source: UniProtKB)
• calcium-independent cell-cell adhesion via plasma membrane cell-adhesion molecules (Source: UniProtKB)
• cell-cell junction organization (Source: MGI)
• cellular response to butyrate (Source: Ensembl)
• cellular response to interferon-gamma (Source: Ensembl)
• cellular response to lead ion (Source: Ensembl)
• cellular response to transforming growth factor beta stimulus (Source: Ensembl)
• cellular response to tumor necrosis factor (Source: Ensembl)
• drug transport across blood-nerve barrier (Source: Ensembl)
• establishment of blood-nerve barrier (Source: Ensembl)

Diagram labels: Updates or New Entries

Poux, Arighi, Magrane et al. Bioinformatics 2017, 33(21):3454, doi: 10.1093/bioinformatics/btx439

7 However…

• UniProt has a finite curation task force
• Expert curation activity is prioritized, focusing on certain taxonomic groups or protein sets
• The set of articles supporting annotations is a selection representing the landscape of knowledge about the protein at a given time (PMID:29036270)
• Emerging critical topics, like COVID-19, accumulate knowledge rapidly and demand up-to-date coverage

8 To expand access to published knowledge about a protein entry

• Complement UniProt literature set with additional publications
• Computationally mapped publications from external resources
• Leverage community expertise for adding publications and information (annotations)
• Classify publications into the entry annotation topics to improve navigation and discovery

9 Publication display in UniProt entry

Sperm-associated antigen 5

UniProt

Additional

Filter by annotation topic

https://www.uniprot.org/uniprot/Q96R06/publications

10 Publication display in UniProt entry

Sperm-associated antigen 5

https://www.uniprot.org/uniprot/Q96R06/publications

11 Computationally mapped bibliography

Sources

• Sources of literature: curated and text mining sources

MGI, SGD, dictyBase, WormBase, PomBase, TAIR, FlyBase, ZFIN, RGD, IC4R, BioCyc, Reactome, PhosphoSitePlus, iPTMnet, PRO, PDB, pGenN, PubTator, BioMuta, MEROPS, IntAct, GeneRif, GAD, Alzforum

Additional bibliography for UniProt release 2020_02 covers (unique):
  39,366,390 AC/PMID pairs
  347,890 ACs
  985,675 PMIDs

• Article categorization into different UniProt topics

UPCLASS classification for UniProt release 2020_02 covers:
  37,893,926 AC/PMID pairs
  345,617 ACs
  954,619 PMIDs
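To put the two sets of counts side by side, the short Python sketch below computes what share of the additional bibliography has a UPCLASS topic classification. It assumes the classified set is a subset of the additional bibliography, which the slide does not state explicitly; the function and variable names are illustrative only.

```python
# Counts copied from the slide (UniProt release 2020_02).
ADDITIONAL = {"AC/PMID pairs": 39_366_390, "ACs": 347_890, "PMIDs": 985_675}
UPCLASS = {"AC/PMID pairs": 37_893_926, "ACs": 345_617, "PMIDs": 954_619}


def classified_share(additional, classified):
    """Return classified/additional as a percentage for each count.

    Assumption: every classified item is also part of the additional
    bibliography, so the ratio can be read as a coverage figure.
    """
    return {key: 100 * classified[key] / additional[key] for key in additional}


shares = classified_share(ADDITIONAL, UPCLASS)
for key, pct in shares.items():
    print(f"{key}: {pct:.1f}% classified into topics")
```

Under that assumption, most of the computationally mapped AC/PMID pairs can be filtered by annotation topic in the entry publication view.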

12 Community: You as a contributor of literature and knowledge

Why?
• You have the expertise
• You can help scale up curation
• You asked

Benefits to you
• Recognition for the papers and annotations contributed
• Contributions are citable and can be used as a deliverable of your research
• Play an active role in improving the database
• An improved database better supports the research community

13 Did you know about the different publication sources in UniProt (curated, from external resources, and community) prior to this webinar?

Icon made by Flat Icons from www.flaticon.com

14 ORCID https://orcid.org/

• Unique digital identifier for researchers
• You control public data in your profile
• Used as login mechanism to verify your identity
• Used for giving you recognition for your contribution

15 Do you have an ORCID ID?

16 System Overview

17 Snapshot of the Submission Form

1-Auto filled with data from the entry
2-Checking publication exists and it has not been curated in UniProt
3-What aspects does the publication describe about the protein?
4-Does publication provide name for gene/protein for the entry?
5-Does the publication show some function about this protein?
6-Does the publication show any aspect that associates this protein with some disease?
7-Any other annotation?
8-Your contact information (not to be shared)
9-Agree to show ORCID on website for recognition

18 Review Process

• Ensure content is appropriate. Only facts related to the protein as described in the publication, not personal opinions

• Minor edits to correct typos, grammar and for standardization purposes

• Other content changes are done only with the submitter’s permission

Track status of submissions: https://community.uniprot.org/bbsub/bbsubinfo.html

19

Status tag | What does it mean? | Who can view these?
Public | The submission can be viewed in the entry publication section on the UniProt website | Everyone
Reviewed | The submission has been reviewed and will show on UniProt entry page on website in upcoming release | Everyone
Under Review | The submission has not yet been reviewed by UniProt and it is not ready for release | Submitter when signed in with ORCID
Dropped | The submission has been found inappropriate (e.g., incorrect association of paper to entry) | Submitter when signed in with ORCID
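The visibility rules in the status table can be summarized as a small lookup. The following Python sketch is only an illustration of those rules; the `Status` enum and `can_view` function are hypothetical names and do not reflect how the submission system is implemented.

```python
from enum import Enum


class Status(Enum):
    """Status tags as listed in the table (hypothetical representation)."""
    PUBLIC = "Public"
    REVIEWED = "Reviewed"
    UNDER_REVIEW = "Under Review"
    DROPPED = "Dropped"


# Per the table, these two tags are visible to everyone.
VISIBLE_TO_EVERYONE = {Status.PUBLIC, Status.REVIEWED}


def can_view(status, is_submitter_signed_in):
    """Return True if a viewer may see a submission with this status.

    is_submitter_signed_in: True when the viewer is the submitter and is
    signed in with ORCID.
    """
    return status in VISIBLE_TO_EVERYONE or is_submitter_signed_in


for status in Status:
    print(status.value, can_view(status, is_submitter_signed_in=False))
```

In this reading, a submission tagged Under Review or Dropped is only visible to its submitter after ORCID sign-in.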

20 https://community.uniprot.org/bbsub/bbsubinfo.html

21 ORCID as Source Attribution for your Work

https://www.uniprot.org/uniprot/Q96R06/publications

22 Community Submission Statistics https://community.uniprot.org/bbsub/STATS.html

23 Linking a Publication to an Entry

• You don’t need to be an author of the publication to contribute
• It is important to match the protein that is described in the publication with the correct species
• If you are the author, you know what species you worked on
• If you are not the author, you can consider the following tips:
  • Check the species info in the materials and methods section of the paper to do a search in UniProt with name and species
  • Does the publication provide any type of identifiers for the proteins/genes (e.g., GenBank, PDB, etc.)?
  • Is there any sequence that can be compared to the UniProt one?
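A search by name and species can also be expressed as a query URL. The Python sketch below only builds the URL and sends no request. The endpoint (`rest.uniprot.org/uniprotkb/search`) and the query fields `gene` and `organism_id` are assumptions based on the current UniProt REST API and are not named in the slides; CLDN1 and human (taxon 9606) are used because that entry appears earlier in the deck.

```python
from urllib.parse import urlencode

# Assumption: current UniProt REST search endpoint (not named in the slides).
SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(gene, taxon_id, reviewed_only=False):
    """Build a UniProtKB search URL for a gene name in one species.

    gene: gene name as given in the paper, e.g. "CLDN1"
    taxon_id: NCBI taxonomy identifier of the species, e.g. 9606 for human
    reviewed_only: restrict to reviewed (Swiss-Prot) entries if True
    """
    query = f"gene:{gene} AND organism_id:{taxon_id}"
    if reviewed_only:
        query += " AND reviewed:true"
    return f"{SEARCH_URL}?{urlencode({'query': query, 'format': 'json'})}"


url = build_search_url("CLDN1", 9606)
print(url)
```

Restricting by taxonomy identifier rather than by species name avoids ambiguity when a gene name is shared across organisms.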

24 “Sequence analysis indicates that the cDNA is 3,843-nt long and encodes a protein of 1,193 aa with a predicted molecular mass of 134,400 Da (Fig. 2; accession no. AF399910).” https://www.ncbi.nlm.nih.gov/pmc/articles/PMC64699/

• Use UniProt Retrieve/ID mapping to map external identifiers to UniProt https://www.uniprot.org/uploadlists/
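As a rough illustration of this tip, the Python sketch below pulls candidate GenBank nucleotide accessions out of a sentence and prepares the fields for an ID mapping job. It sends no request. The regular expression covers the common accession shapes only, and the form field names (`from`, `to`, `ids`) and database names (`EMBL-GenBank-DDBJ`, `UniProtKB`) are assumptions based on the current UniProt REST ID mapping service, not on the Retrieve/ID mapping page shown in the slide.

```python
import re

# Common GenBank nucleotide accession shapes: 1 letter + 5 digits,
# 2 letters + 6 digits, or 2 letters + 8 digits.
GENBANK_NUCLEOTIDE = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})\b")


def find_genbank_accessions(text):
    """Return candidate accessions in order of appearance, without duplicates."""
    found = []
    for match in GENBANK_NUCLEOTIDE.findall(text):
        if match not in found:
            found.append(match)
    return found


def id_mapping_form(accessions):
    """Form fields for an ID mapping job (field and database names are assumptions)."""
    return {"from": "EMBL-GenBank-DDBJ", "to": "UniProtKB", "ids": ",".join(accessions)}


sentence = (
    "Sequence analysis indicates that the cDNA is 3,843-nt long and encodes "
    "a protein of 1,193 aa with a predicted molecular mass of 134,400 Da "
    "(Fig. 2; accession no. AF399910)."
)
accessions = find_genbank_accessions(sentence)
form = id_mapping_form(accessions)
print(accessions, form)
```

A pattern match is only a candidate; the mapped UniProt entry should still be checked against the species and protein described in the paper.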

25 Publication has some sequence that can be used to find entry

TRPV1-long and short in vampire bat differ in C-terminal sequence

TRPV1-S GIKRTLSFSLRSSRAV
TRPV1-L GIKRTLSFSLRSSRVAGRNWKNFALVPLLRDASTRERQP

Gracheva et al., Nature. 2011 Aug 3;476(7358):88-9. PMID: 21814281

• Use peptide search with subsequence to find correct entry https://www.uniprot.org/peptidesearch/

Desmodus rotundus (Vampire bat)
TRPV1-L GIKRTLSFSLRSSRVAGRNWKNFALVPLLRDASTRERQP
TRPV1-S GIKRTLSFSLRSSRAV
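Because the two isoforms share most of the published fragment, the choice of subsequence matters. The Python sketch below, using only the two sequences from the slide, splits them into the shared part and the isoform-specific tails, then builds a short peptide that spans the point of divergence. The helper names and the three-residue flank are illustrative choices, not part of the UniProt peptide search tool.

```python
from os.path import commonprefix

# C-terminal fragments from the slide (Gracheva et al., PMID: 21814281).
TRPV1_S = "GIKRTLSFSLRSSRAV"
TRPV1_L = "GIKRTLSFSLRSSRVAGRNWKNFALVPLLRDASTRERQP"


def split_shared_and_unique(seq_a, seq_b):
    """Return (shared prefix, rest of seq_a, rest of seq_b)."""
    shared = commonprefix([seq_a, seq_b])  # character-wise common prefix
    return shared, seq_a[len(shared):], seq_b[len(shared):]


def distinguishing_peptide(shared, tail, flank=3):
    """Peptide spanning the divergence point: last `flank` shared residues plus `flank` of the tail."""
    return shared[-flank:] + tail[:flank]


shared, tail_s, tail_l = split_shared_and_unique(TRPV1_S, TRPV1_L)
query_s = distinguishing_peptide(shared, tail_s)
query_l = distinguishing_peptide(shared, tail_l)
print(shared, query_s, query_l)
```

A query taken entirely from the shared part would match both isoforms, while a query that crosses the divergence point matches only one of them.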

26 DEMO ON COMMUNITY SUBMISSION

Go to https://community.uniprot.org/bbsub/doc/public/CommunitysubmissionUniProtdemo_voice.mp4

27 https://covid-19.uniprot.org/uniprotkb?query=*

28 Future work in community submissions

• Submission in batch, for publications describing many proteins
• Make publications and annotation more discoverable
• Improve display on website
• Standardize when possible
• Link public submissions to your ORCID profile

Demo available here: https://community.uniprot.org/bbsub/doc/public/CommunitysubmissionUniProtdemo_voice.mp4

29 UniProt Team

PIs: Alex Bateman, Alan Bridge, Cathy Wu

Key staff: Cecilia Arighi (Curation), Lionel Breuza (Curation), Elisabeth Coudert (Curation), Hongzhan Huang (Development), Damien Lieberherr (Curation), Michele Magrane (Curation), Maria Martin (Development), Peter McGarvey (Content), Darren Natale (Content), Sandra Orchard (Content), Ivo Pedruzzi (Curation), Sylvain Poux (Curation), Manuela Pruess (Coordination), Shriya Raj (Coordination), Nicole Redaschi (Development)

Content / Curation: Lucila Aimo, Ghislaine Argoud-Puy, Andrea Auchincloss, Kristian Axelsen, Emmanuel Boutet, Emily Bowler, Ramona Britto, Hema Bye-A-Jee, Cristina Casals-Casas, Anne Estreicher, Livia Famiglietti, Marc Feuermann, John S. Garavelli, Penelope Garmiri, George Georghiou, Arnaud Gos, Nadine Gruaz, Emma Hatton-Ellis, Ursula Hinz, Chantal Hulo, Nevila Hyka-Nouspikel, Florence Jungo, Kati Laiho, Philippe Lemercier, Yvonne Lussi, Alistair MacDougall, Patrick Masson, Anne Morgat, Sandrine Pilbout, Catherine Rivoire, Karen Ross, Christian Sigrist, Elena Speretta, Shyamala Sundaram, Nidhi Tyagi, C. R. Vinayaka, Qinghua Wang, Kate Warner, Lai-Su Yeh, Rossana Zaru

Development: Shadab Ahmed, Emanuele Alpi, Leslie Arminski, Parit Bansal, Delphine Baratin, Teresa Batista Neto, Jerven Bolleman, Borisas Bursteinas, Chuming Chen, Yongxing Chen, Beatrice Cuche, Alan Da Silva, Edouard De Castro, Tunca Dogan, Leyla Garcia Castro, Elisabeth Gasteiger, Sebastien Gehant, Leonardo Gonzales, Alexandr Ignatchenko, Giuseppe Insana, Rizwan Ishtiaq, Vishal Joshi, Dushyanth Jyothi, Arnaud Kerhornou, Thierry Lombardot, Jie Luo, Mahdi Mahmoudy, Andrew Nightingale, Joseph Onwubiko, Monica Pozzato, Sangya Pundir, Guoying Qi, Daniel Rice, Rabie Saidi, Edward Turner, Preethi Vasudev, Vladimir Volynkin, Yuqi Wang, Xavier Watkins, Hermann Zellner, Jian Zhang

European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
Protein Information Resource (PIR), Washington DC and Delaware, USA
SIB Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland

https://www.uniprot.org/help/uniprot_staff

30 Thanks for tuning in!
