Protein Data Bank: an Open Access Resource Enabling Basic and Applied Research and Education in Biology and Medicine

John Westbrook, Ph.D.
RCSB PDB Data & Software Architect Lead

Overview
• A bit of background about the PDB
• PDB data content
• PDB data representation and data quality standards
• The PDB biocuration platform
• Some key features delivered by the RCSB PDB

Protein Data Bank
• First open access digital resource in biology (est. 1971 with 7 entries)
• Single global archive of 3-D macromolecular structures (contains >120,000 entries)
• Freely available to all at pdb.org
• US PDB headquartered at Rutgers/UCSD (NSF, NIH, DOE)
• US PDB part of Worldwide PDB with partners in EU and Japan

Worldwide Protein Data Bank
• Established in 2003, wwPDB ensures data are freely & globally available in a common repository
• Partners collaborate on data quality and representation standards, and on tools and procedures for biocuration
• Each partner delivers different services and views of the common repository of data
[Photo: wwPDB Advisory Committee Meeting, 2015]

[Figure: timeline of science, technology, and community milestones, 1970s to 2010s]
• Science (in chronological order): small enzymes, RNA, DNA, protein-DNA complexes, ribosomes, large viruses, macromolecular machines
• Technology: 1970s, X-ray diffraction, diffractometers, punched cards; 1980s, synchrotron radiation, computer graphics, NMR; 1990s, electron microscopy, fast computers, fast detectors; 2000s, high throughput structural genomics, robots; 2010s, hybrid methods
• Community: 1970s, PDB archive established; 1980s, IUCr deposition guidelines; 1990s, standardization, RCSB; 2000s, experimental data required, wwPDB; 2010s, validation standards

Diverse Molecular Content of the PDB
[Figure]

Supporting Rapidly Evolving Experimental Technologies
[Figure: entries in archive by year, one panel each for X-ray Crystallography, Nuclear Magnetic Resonance Spectroscopy, 3D Electron Microscopy, and Integrative & Hybrid Methods, with the first entry and the first 500 entries marked for each method]

PDB Growth and Data Usage
PDB Depositors
• >800 new entries/month
• As of 2015, ~55% increase in the number of global depositions since 2008
[Figure: Growth in PDB Depositions. Total number of annual depositions and projected annual depositions (# of entries) by year, 2000 to 2018]
PDB Users
• FTP and RSYNC download traffic in 2015: 526 million downloads
• RCSB PDB: 347 million; PDBe: 100 million; PDBj: 58 million

PDB Data Content
• Atomic coordinates and primary experimental data
• Sample composition and preparation details
• Protein and nucleic acid polymer sequences and taxonomy
• Small molecules (ligands)
• Experimental data collection, structure solution, and structure refinement
• Structure classification by sequence, function and other criteria
• Citation and references to related data resources
• PDB entries contain 250 – 1200+ unique items of data
[Figure: The 3.8 angstrom resolution cryo-EM structure of Zika virus. Sirohi, D., Chen, Z., Sun, L., Klose, T., Pierson, T.C., Rossmann, M.G., Kuhn, R.J. (2016) Science 352: 467-470]

Community Data Standards
• PDB manages data using the macromolecular extension to the Crystallographic Information Framework (mmCIF), originally developed as an IUCr data standard
• PDB coordinates the extension of the standard (PDBx/mmCIF) to support the broader needs of both contributors and users of the archive (> 4400 data terms)
• A PDBx/mmCIF Working Group of community experts and methods developers oversees the evolution of the standard and ensures that the standard is well supported by key community software tools
• PDB hosts community workshops to support the data standard and maintains a web site serving PDBx/mmCIF data dictionaries, schema and software tools (mmcif.wwpdb.org)
[Figure: PDBx Format Round Trip. Structure Determination in the Lab; wwPDB Deposition; wwPDB Processing and Annotation; In PDB Archive]
[Photos: Workshop Participants, September 2011; Workshop Participants, October 2014]

PDBx/mmCIF Development Timeline
[Figure: timeline, 1991 to 2012, showing the following]
• IUCr mmCIF Working Party; IUCr mmCIF Maintenance Group
• Core CIF V1; mmCIF V1; mmCIF V2; mmCIF/Core sync'd
• Workshops: Rutgers, York, CARB, Honolulu, Glasgow, Tarrytown, St. Louis, Orlando, Brussels, Seattle, Rutgers, EBI
• DDL 1; DDL 2
• mmCIF + Extensions; PDB Exchange Dictionary
• wwPDB One Archive – One Dictionary; wwPDB Common Deposition & Annotation System

Community Standards for Data Quality
Method-specific Community Validation Task Forces have been convened to collect recommendations and develop consensus on data quality standards, identify software tools to perform required validation tasks, and define related content requirements for archiving.
• X-ray Validation Task Force. Meetings/workshops: 2008, 2015. Chair: Randy Read (Univ of Cambridge); 17 members. Outcome: (2011) Structure 19: 1395-1412
• NMR Validation Task Force. Meetings/workshops: 2009, 2011, 2013 (x2), 2015, 2016. Chairs: Gaetano Montelione (Rutgers), Michael Nilges (Institut Pasteur); 10 members. Outcome: (2013) Structure 21: 1563-1570
• 3DEM Validation Task Force. Meetings/workshops: 2010. Chairs: Richard Henderson (MRC-LMB), Andrej Sali (UCSF); 21 members. Outcome: (2012) Structure 20: 205-214
• Small-Angle Scattering Task Force. Meetings/workshops: 2011, 2014. Chair: Jill Trewhella (Univ Sydney); 6 members. Outcome: (2013) Structure 21: 875-881
• Hybrid Methods Task Force. Meetings/workshops: 2014. Chairs: Andrej Sali (UCSF), Torsten Schwede (Univ Basel), Jill Trewhella (Univ Sydney); 27 members. Outcome: (2015) Structure 23: 1156-1167

Presenting Data Quality to Diverse Audiences
• Provide relative and absolute quality metrics in graphical format
• Provide tabulations of key data, refinement statistics, and quality diagnostics
• Assess all macromolecular and ligand structural components
• PDF format reports can be uploaded with manuscript submission to a journal
• Diagnostics are also delivered as XML format data files
[Figure panels: Overall Quality; Residue Plots. Residue plot key: grey, not modeled; green, yellow, orange, red, 0, 1, 2, 3 or more issues; red dot, poor fit to electron density]

OneDep – The PDB Biocuration Platform
OneDep: a unified global deposition, biocuration, and validation system
1. Data Harvesting (wwpdb.org/deposition)
2. Pre-deposition Validation (validate.wwpdb.org)
3. Deposition (deposit.wwpdb.org)
4. Biocuration
5. Public Release (ftp://ftp.wwpdb.org, rcsb.org)
Checks: polymer check, ligand check, electron density fit (reported as OK, 1 issue, 2 issues, 3+ issues)
Data providers
• Generate atomic coordinate and experimental data files
• Assemble mandatory data items for deposition
Data users
• Access data via web and ftp download and web services
• Enable other research for user community

RCSB PDB Data Delivery Pipeline
[Figure: Deposition & Biocuration]

RCSB PDB Web Portal
• Launchpad for a wide range of functionalities
  o Deposit
  o Search
  o Analyze
  o Visualize
  o Tabulate
  o Download
http://rcsb.org

RCSB PDB Mobile App
• Provides convenient access to PDB data on the go with a minimal feature set
• Provides a browser, simple search, and an interactive 3D viewer
• Supports iPhone, iPad, and Android
http://www.rcsb.org/pdb/static.do?p=mobile/RCSBapp.html

Web Services (RESTful APIs)
• Programmatic access to data: application-to-application communication
• Provides external workflows and analysis tools with direct access to a wide range of PDB data and services
• Enables integration of PDB data and services with programs and scripts in a variety of computer languages and computing environments
http://www.rcsb.org/pdb/software/rest.do

Enabling Data Access Through Integration and Visualization
• Protein Feature View: mapping protein sequence to 3D structure
• Gene View: mapping genome location to 3D structure
• Visualization
  o Browser native visualization tools using an efficient data compression protocol
  o Small molecule electron density and binding interactions
• Data integration resource files provided for download:
  o Correspondences with CCDC ligand structures
  o Sequence cluster data files
• Phased release of data to support blinded molecular docking tests

Protein Sequence Integrated View
[Figure]

Genome Sequence Integrated View
[Figure]

Molecular Visualization
http://mmtf.rcsb.org/
https://github.com/arose/ngl

Visualizing Small Molecule Interactions
[Figure]

Reaching Diverse User Communities
Who are our users? What are they using?
• Biologists (structural biology, biophysics, biochemistry, genetics, immunology, pharmacology, cell and molecular biology, …): RCSB PDB website, deposition tools, data
• Other scientists (bioinformatics, software developers, …): Web Services, search engines, data
• Students & teachers: PDB-101
• Media (writers, textbook authors, patient advocacy groups, …): images, data, information, outreach material, e.g., posters
• General public (curious/interested individuals, artists, sculptors, …): images, Molecule of the Month, information from external media

Online Educational Resources
Resources to help understand biology at the molecular level
http://pdb101.rcsb.org/
• Animations
• Paper Models
• Posters

Molecule of the Month
[Figure]

PDB Management
The Protein Data Bank Archive is managed by the Worldwide Protein Data Bank (wwpdb.org).
[Photo: PDB members past and present at the PDB40 Anniversary Symposium, 2011]
Members
• rcsb.org. Facebook: RCSB Protein Data Bank; Twitter: @buildmodels. Funding: NSF, NIH, DOE
• pdbe.org. Facebook: proteindatabank; Twitter: @PDBEurope. Funding: EMBL-EBI, Wellcome Trust, BBSRC, NIGMS, EU
• pdbj.org. Facebook: ja-jp.facebook.com/PDBjapan; Twitter: @PDB_ja. Funding: NBDC-JST
• bmrb.wisc.edu. Funding: NLM
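
To make the PDBx/mmCIF representation from the Community Data Standards slide more concrete, here is a minimal sketch of reading the two basic constructs of the format: single `_category.item value` pairs and `loop_` tables. It is an illustration only, not the wwPDB software. The sample text is a made-up fragment rather than a real entry, the function names are hypothetical, and multi-line (semicolon-delimited) values and other edge cases are deliberately not handled.

```python
import shlex

# Hypothetical PDBx/mmCIF fragment for illustration (not a real PDB entry).
SAMPLE = """\
data_DEMO
#
_entry.id DEMO
_cell.length_a 50.0
#
loop_
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.Cartn_x
1 N N 10.0
2 C CA 11.5
3 O "O5'" 12.0
#
"""


def split_tag(tag):
    """Split '_category.item' into ('category', 'item')."""
    category, item = tag[1:].split(".", 1)
    return category, item


def parse_mmcif(text):
    """Return {category: [row, ...]} where each row maps item name to value."""
    tables = {}
    loop_tags = []        # tags declared by the loop currently being read
    values = []           # loop values not yet grouped into rows
    reading_tags = False  # True between 'loop_' and the first data line

    def flush():
        # Group the collected loop values into rows, one value per tag.
        width = len(loop_tags)
        if width:
            category = split_tag(loop_tags[0])[0]
            items = [split_tag(tag)[1] for tag in loop_tags]
            for start in range(0, len(values) - width + 1, width):
                row = dict(zip(items, values[start:start + width]))
                tables.setdefault(category, []).append(row)
        loop_tags.clear()
        values.clear()

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#") or line.startswith("data_"):
            flush()                      # separators end any open loop
            reading_tags = False
        elif line == "loop_":
            flush()
            reading_tags = True
        elif line.startswith("_"):
            if reading_tags:
                loop_tags.append(line.split()[0])
            else:
                flush()
                # Simple key-value item; shlex handles quoted values.
                tag, value = shlex.split(line)[:2]
                category, item = split_tag(tag)
                tables.setdefault(category, [{}])[0][item] = value
        else:
            reading_tags = False
            values.extend(shlex.split(line))
    flush()
    return tables


tables = parse_mmcif(SAMPLE)
```

Each category becomes a list of rows, so `tables["atom_site"]` holds one dictionary per atom and `tables["entry"][0]["id"]` holds the entry identifier. For real archive files, a maintained parser from the tools listed at mmcif.wwpdb.org is the safer choice.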

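The Web Services slide describes programmatic, application-to-application access to PDB data. The sketch below shows roughly what a small client could look like. It assumes the RCSB Data API URL pattern `https://data.rcsb.org/rest/v1/core/entry/<ID>`, which does not appear in the slides (they point to http://www.rcsb.org/pdb/software/rest.do), and the function names are illustrative.

```python
import json
import urllib.request

# Assumed endpoint pattern (RCSB Data API); not taken from the slides.
DATA_API = "https://data.rcsb.org/rest/v1/core/entry/{pdb_id}"


def entry_url(pdb_id):
    """Build the metadata URL for a 4-character PDB ID."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"not a 4-character PDB ID: {pdb_id!r}")
    return DATA_API.format(pdb_id=pdb_id)


def fetch_entry(pdb_id, timeout=30):
    """Download one entry's metadata as a dict (requires network access)."""
    with urllib.request.urlopen(entry_url(pdb_id), timeout=timeout) as resp:
        return json.load(resp)
```

A script would call `fetch_entry("4HHB")` and read fields from the returned dictionary. Keeping URL construction separate from the network call lets the ID validation be checked offline.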