BD2K All Hands Meeting

Interoperability, Sustainability & Impact: A UniProt Case Study

BD2K All Hands Meeting November 30, 2016 Bethesda, MD

Alex Bateman – EMBL-EBI (European Bioinformatics Institute, UK)
Cathy Wu – PIR (Protein Information Resource, USA)
Ioannis Xenarios – SIB (Swiss Institute of Bioinformatics, Switzerland)

UniProt Consortium http://www.uniprot.org

UniProt: Hub for Protein Sequence & Function Information

UniProt BD2K Supplement: A Sustainable Infrastructure Mapping Protein Function onto DNA Variation

The Need for Interoperability and Engaging the Clinical Genomics Community

Many medical geneticists in the community do not know how to make use of protein functional information in UniProt for variant curation.
=> Add UniProt Sequence Annotation to Human Reference Genome

• Aim 1. To support the worldwide genomics community in the full interpretation and exploitation of the human genome & proteome by providing an up-to-date, high-quality mapping between genomic coordinates and the protein sequence and its features
• Aim 2. To develop programmatic interoperability between UniProtKB and ClinGen/ClinVar and CPTAC, and to collaborate with the wider genomic and proteomic communities to develop use cases and knowledge exchange to best use the integrated data for curation of the wealth of variant information being generated

UniProt Genome Annotation Tracks

Mapping of reference proteome and annotations to the reference genome
• Human protein sequences mapped to the human reference genome (GRCh38)
• BED/BigBED track hubs are available for 26 Feature Types plus proteins & isoforms
• View protein sequence annotations via public track hubs on UCSC & Ensembl browsers
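The idea of placing a protein feature on the genome can be sketched in a few lines. This is a minimal illustration, not UniProt's actual pipeline: it assumes the coding exons of one transcript are already known as 0-based, half-open genomic intervals, and the function names `protein_to_genome` and `bed_line`, the coordinates and the feature name are all hypothetical.

```python
def protein_to_genome(coding_exons, strand, aa_start, aa_end):
    """Map a 1-based, inclusive residue range to genomic blocks.

    coding_exons: list of (start, end) genomic intervals, 0-based and
    half-open, covering only the coding sequence (an assumption of this
    sketch; UTRs are not handled).
    """
    # Offsets of the residue range within the coding sequence (CDS).
    cds_start = (aa_start - 1) * 3
    cds_end = aa_end * 3
    # Walk exons in transcription order: ascending on "+", descending on "-".
    ordered = sorted(coding_exons, reverse=(strand == "-"))
    blocks = []
    consumed = 0  # CDS bases covered by the exons visited so far
    for ex_start, ex_end in ordered:
        ex_len = ex_end - ex_start
        lo = max(cds_start, consumed)
        hi = min(cds_end, consumed + ex_len)
        if lo < hi:  # the feature overlaps this exon
            if strand == "+":
                blocks.append((ex_start + lo - consumed, ex_start + hi - consumed))
            else:
                blocks.append((ex_end - (hi - consumed), ex_end - (lo - consumed)))
        consumed += ex_len
    return sorted(blocks)


def bed_line(chrom, blocks, name, strand):
    """Format the overall span of the blocks as a six-column BED record."""
    start = blocks[0][0]
    end = blocks[-1][1]
    return "\t".join([chrom, str(start), str(end), name, "0", strand])


if __name__ == "__main__":
    # Hypothetical transcript: two coding exons of 9 bases (3 codons) each.
    exons = [(100, 109), (200, 209)]
    blocks = protein_to_genome(exons, "+", 3, 4)  # residues 3-4 span the intron
    print(blocks)
    print(bed_line("chrX", blocks, "example_feature", "+"))
```

A six-column record only captures the overall span. A feature that crosses an intron, as in the example, would normally be written in the 12-column BED layout so that each block is drawn separately.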

UniProt Feature Viewer

Knowing the genome position of functional protein features is valuable for gene, variant and protein curation, because such features can often be associated with a biological cause

UniProt Genome Tracks on UCSC Browser

[Browser panels: Active Sites, Disulfide Bond, Glycosylation Site]

• GLA gene (associated with Fabry disease) on browser plus ClinVar, dbSNP & OMIM tracks
• Active site - variant disrupts enzyme active site; SNPs not observed in other resources
• Disulfide bond - variant associated with FD that removes a Cys in a structural fold
• Glycosylation - variant disrupts site for lysosome targeting; pathogenic variants in ClinVar

Aligning ClinVar SNPs to UniProt Features

• Mapping of ClinVar SNPs to UniProt features by genomic position shows the raw number and % of ClinVar pathogenic SNPs that aligned with different features (Table 1).
• Comparison between ClinVar ‘pathogenic’ SNPs and UniProt ‘disease-associated’ variants shows that 36% of ClinVar pathogenic SNPs are in UniProt and 48% of UniProt disease-associated variants are currently in ClinVar (Table 2).
• Collaboration with NCBI ClinVar group
  ‒ reciprocal links at variant level
  ‒ supporting data provider (like OMIM)
  ‒ UniProt variant submission

Table 1 (CV = ClinVar)

UniProt Feature Type (total)    CV SNPs that map to feature    CV Pathogenic SNPs that map to feature    Pathogenic in ClinVar (%)
Intramembrane (265)             233                            181                                       77.68%
DNA Binding dom. (712)          727                            474                                       65.20%
Region (8,969)                  15,962                         5,225                                     32.73%
Domain (65,655)                 106,572                        33,616                                    31.54%
Metal Binding (2,735)           2,082                          1,158                                     55.62%
Topological dom. (18,494)       18,865                         6,407                                     33.96%
Natural variant (72,808)        32,373                         20,092                                    62.06%
Coiled Coil (10,858)            11,126                         2,665                                     23.95%
Peptide (384)                   199                            94                                        47.24%
Transit peptide (458)           344                            97                                        28.20%
Transmembrane (39,729)          9,822                          5,192                                     52.86%
Repeat (14,913)                 6,253                          1,900                                     30.39%
Nucleotide binding (3,379)      561                            351                                       62.57%
Ca Binding Site (458)           83                             42                                        50.60%
Signal Peptide (9,274)          1,791                          603                                       33.67%
Motif (2,991)                   297                            142                                       47.81%
Zn Finger (8,972)               760                            333                                       43.82%
Site (1,945)                    124                            66                                        53.23%
Binding Sites (5,589)           183                            134                                       73.22%
Active Site (3,608)             60                             41                                        68.33%
Modified Residue (51,094)       561                            184                                       32.80%
Cross Link (2,798)              24                             7                                         29.17%
Carbohydrate Site (16,166)      154                            39                                        25.32%
Lipid (1,018)                   10                             2                                         20.00%

Table 2

                                         All ClinVar SNPs (111,022)    Pathogenic ClinVar SNPs (33,521)    Non-Pathogenic ClinVar SNPs (77,501)
All UniProt Variants (73,968)            18,641                        12,337                              6,304
UniProt Disease Variants (28,025)        13,505                        11,253                              2,252
UniProt non-Disease Variants (45,943)    5,136                         1,084                               4,052

Integrating UniProt Features into ClinGenKB & Pathogenicity Calculator

• 20,000+ evidence documents for UniProt AA disease-associated variants provided for use in the ClinGen Pathogenicity Calculator
• UniProt AA variations and proteins to be loaded in public ClinGen Allele Registry as protein alleles (early 2017).
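The percentages in Table 1 and the 48% figure drawn from Table 2 are plain ratios of the counts shown. The sketch below recomputes a few of them and shows one way to count SNPs that fall inside feature intervals by genomic position. The helper names and the toy coordinates are hypothetical, and the interval test is a simplification rather than the procedure actually used for the slide.

```python
def percent(part, total):
    """Return part/total as a percentage rounded to two decimals."""
    return round(100.0 * part / total, 2)


def snps_in_features(snp_positions, features):
    """Return the SNP positions inside any (start, end) interval.

    Intervals are 0-based and half-open; a linear scan is enough for a
    small illustration.
    """
    return [p for p in snp_positions
            if any(start <= p < end for start, end in features)]


if __name__ == "__main__":
    # Counts from Table 1: (SNPs mapped to feature, pathogenic among them).
    table1 = {
        "Intramembrane": (233, 181),
        "DNA Binding dom.": (727, 474),
        "Active Site": (60, 41),
    }
    for name, (mapped, pathogenic) in table1.items():
        print(name, percent(pathogenic, mapped))

    # Table 2: UniProt disease variants found in ClinVar, as a share of
    # all UniProt disease variants.
    print(percent(13505, 28025))

    # Toy data, not real coordinates.
    print(snps_in_features([5, 12, 40], [(10, 20), (35, 45)]))
```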

ClinGen protein feature Linked data document model


UniProt is FAIR

SCIENTIFIC DATA | 3:160018 | DOI: 10.1038/sdata.2016.18

Findable Accessible Interoperable Reusable

Examples of FAIRness and the resulting value-added


Growth of UniProt Databases

[Chart: number of entries (y-axis, 0 to 140,000,000) for UniParc, UniProtKB, UniRef100, UniRef90, UniRef50 and UniProtKB/Swiss-Prot]


Dealing with Scale

Growth of Biomedical Literature

[Figure labels: Curatable? Evaluated? Curated]

• Is expert curation sustainable?
• How many articles do we evaluate in total every year?
• What proportion of PubMed is curatable for UniProtKB/Swiss-Prot?

Literature Triage: Curation workflow

• 4 curators from different annotation programs
• Run tests over 8 months
• Use PubTator to select publications (in collaboration with Z Lu at NCBI)
• Tag curatable papers
• Tag non-curatable papers and describe why


Sustainability of Literature curation

• Curators evaluate ~50,000-60,000 papers per year
  ‒ ~10K are curated and added to UniProt
  ‒ ~10K are redundant to existing information
  ‒ ~10K are low priority
  ‒ ~10K are not well supported
  ‒ ~20K are out of scope
• Sampling shows 90% of PubMed is out of scope
• We estimate that we curate 35-45% of the curatable part of PubMed
• The major challenge is the literature triage step
• The number of publications curated is important, but it is equally important to select papers that provide the most high-quality information, to make best use of our resources

Expert curation is sustainable
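As a rough check on the triage figures, the per-category counts can be summed and expressed as shares of the papers evaluated. This sketch uses the approximate numbers from the slide as given; the variable and function names are hypothetical, and it does not attempt to reproduce the 35-45% estimate, which depends on sampling data not shown here.

```python
# Approximate yearly counts from the slide, in papers.
evaluated = {
    "curated and added to UniProt": 10_000,
    "redundant to existing information": 10_000,
    "low priority": 10_000,
    "not well supported": 10_000,
    "out of scope": 20_000,
}

total = sum(evaluated.values())


def share(count, whole):
    """Return count as a percentage of whole, rounded to one decimal."""
    return round(100.0 * count / whole, 1)


shares = {label: share(n, total) for label, n in evaluated.items()}

if __name__ == "__main__":
    print(total)
    for label, pct in shares.items():
        print(f"{label}: {pct}%")
```

The categories sum to the upper end of the ~50,000-60,000 range, so roughly one evaluated paper in six ends up curated.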

UniProt @ Innovations in Curation Workshop

Text-mining assisted manual curation
• PubTator for literature triage
Additional Bibliography: computationally mapped references
• Bibliography from other curated sources
• In progress: Text mining-assisted UniProt tagging (including ePMC); Computational assignment of concept categories
Integration of text mining into publishing
• UniProt ID assignment at point of paper submission in collaboration with journals (BioCreative VI Bio-ID Track)

Impact: Resource Utilization

UniProt Google Analytics Statistics

Period / monthly average    Visits     Unique visitors    Pageviews
June 2010 - May 2011        612,905    320,892            3,177,758
June 2011 - May 2012        724,286    369,485            3,703,560
June 2012 - May 2013        820,623    408,244            4,022,786
June 2013 - May 2014        808,135    409,848            4,255,675
March 2014 - Feb. 2015*     821,368    433,136            4,097,871
March 2015 - Feb. 2016      952,837    509,278            4,758,278

* Different period due to new NIH grant period; these numbers were reported to the NIH
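The change in usage between the first and last reporting periods follows directly from the monthly averages. The snippet below is a small sketch of that arithmetic; the variable and function names are hypothetical, and the comparison ignores the fact that the asterisked period is offset from the others.

```python
# Monthly averages from the table: (visits, unique visitors, pageviews).
first = (612_905, 320_892, 3_177_758)   # June 2010 - May 2011
last = (952_837, 509_278, 4_758_278)    # March 2015 - Feb. 2016
labels = ("visits", "unique visitors", "pageviews")


def growth_percent(old, new):
    """Return the relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


growth = {label: growth_percent(o, n)
          for label, o, n in zip(labels, first, last)}

if __name__ == "__main__":
    for label, pct in growth.items():
        print(f"{label}: +{pct:.1f}%")
```

On these figures all three measures grew by between about 50% and 59% from the first period to the last.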

Impact: Communities Served

Impact: Citation, Linking, Reuse

[Figure: WoS citation network of UniProt paper]
[Figure: Increased use of resource URLs for citation]
[Figure: Increased # UniProt links to external resources]
[Chart: Impact of Linking from UniProt (MobiDB)]
[Chart: Links to external resources in UniProt; #UniProt entries with links (0 to 50,000,000) and #new resources linked (0 to 16), by year of UniProt release 04, 2012 to 2014]

UniProt data is reused in hundreds of tools and resources (e.g., NCBI) - How to assess the impact of resource reuse?

The Team

PIs: Alex Bateman, Cathy Wu, Ioannis Xenarios

Key staff: Cecilia Arighi (Curation), Lydie Bougueleret (Co-Direction), Alan Bridge (Content), Hongzhan Huang (Development), Michele Magrane (Curation), Maria Martin (Development), Peter McGarvey (Content), Darren Natale (Content), Claire O’Donovan (Content), Sylvain Poux (Curation), Manuela Pruess (Coordination), Nicole Redaschi (Development)

Content/Curation: Lucila Aimo, Ghislaine Argoud-Puy, Andrea Auchincloss, Kristian Axelsen, Sara Benmohhamed, Brigitte Boeckmann, Emmanuel Boutet, Lionel Breuza, Ramona Britto, Hema Bye-A-Jee, Cristina Casals Casas, Elisabeth Coudert, Melanie Courtot, Anne Estreicher, Livia Famiglietti, Marc Feuermann, John S. Garavelli, Penelope Garmiri, Daniel Gonzalez, Arnaud Gos, Nadine Gruaz, Emma Hatton-Ellis, Ursula Hinz, Chantal Hulo, Nevila Hyka-Nouspikel, Florence Jungo, Guillaume Keller, Kati Laiho, Philippe Lemercier, Damien Lieberherr, Alistair MacDougall, Patrick Masson, Anne Morgat, Barbara Palka, Ivo Pedruzzi, Klemens Pichler, Sandrine Pilbout, Catherine Rivoire, Bernd Roechert, Karen Ross, Michel Schneider, Aleksandra Shypitsyna, Christian Sigrist, Elena Speretta, Andre Stutz, Shyamala Sundaram, Michael Tognolli, Nidhi Tyagi, C. R. Vinayaka, Qinghua Wang, Kate Warner, Lai-Su Yeh, Rosanna Zaru

Development: Emanuele Alpi, Ricardo Antunes, Leslie Arminski, Parit Bansal, Delphine Baratin, Teresa Batista Neto, Benoit Bely, Mark Bingley, Jerven Bolleman, Borisas Bursteinas, Chuming Chen, Yongxing Chen, Beatrice Cuche, Alan Da Silva, Edouard De Castro, Maurizio De Giorgi, Tunca Dogan, Leyla Garcia Castro, Elisabeth Gasteiger, Sebastien Gehant, Leonardo Gonzales, Arnaud Kerhornou, Vicente Lara, Wudong Liu, Thierry Lombardot, Jie Luo, Xavier Martin, Andrew Nightingale, Joseph Onwubiko, Monica Pozzato, Sangya Pundir, Guoying Qi, Alexandre Renaux, Steven Rosanoff, Rabie Saidi, Tony Sawford, Edward Turner, Vladimir Volynkin, Yuqi Wang, Tony Wardell, Xavier Watkins, Hermann Zellner, Jian Zhang

European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
Protein Information Resource (PIR), Washington DC and Delaware, USA
SIB Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland