Bioinformatic Databases

PharmaMatrix Workshop 2010, 14 July 2010
Philip Winter & Ishwar Hosamani

Database Growth

[Figure: database growth] Source: http://www.kokocinski.net/bioinformatics/databases.php

Database Survey

• Genes & Proteins: Pfam, PDB, TGI, UniProt, dbSNP, GenBank, GEO
• Cheminformatics (Drugs & Metabolites): SciFinder, DrugBank, PubChem, ZINC
• Gene & Protein Interactions: KEGG, NetPath, BioGRID

Curation

• Manual curation (or just curation): A human creates and annotates the database entry
• Automatic curation: A computer program creates and annotates the database entry
• Semi-automatic curation: A combination of manual and automatic

Database Identifiers

• Every database record will have a unique identifier; often this will be called an accession number, which is assigned when the record is first added to the database
• Be careful: databases will often permit a record to be modified but keep the same accession number; you should record the version number as well
• Furthermore, databases may have different rules for handling records that are merged or split

Database Identifier Cheat Sheets

| Pattern | Identifier name | Examples | Entity | Database | URL |
|---|---|---|---|---|---|
| [optional "GI:"][digits] | GenInfo Identifier | GI:34222261 | Nucleotide or protein sequence | GenBank, RefSeq | http://www.ncbi.nlm.nih.gov/ |
| [letter][5 digits] OR [2 letters][6 digits] | GenBank ACCESSION | AB088100 | Nucleotide sequence | GenBank | http://www.ncbi.nlm.nih.gov/ |
| [2 letter type code]_[digits] | RefSeq ACCESSION | NM_178014 | Nucleotide or protein sequence | RefSeq | http://www.ncbi.nlm.nih.gov/ |
| [GenBank or RefSeq ACCESSION].[version number] | GenBank or RefSeq VERSION | AB088100.1, NM_178014.2 | Nucleotide or protein sequence | GenBank, RefSeq | http://www.ncbi.nlm.nih.gov/ |
| (identical to accession for recent entries) | GenBank LOCUS | | Nucleotide or protein sequence | GenBank, RefSeq | http://www.ncbi.nlm.nih.gov/ |
| [Protein code]_[Species code] | Swiss-Prot ID (entry name) | TBB5_HUMAN | Protein sequence | UniProtKB/Swiss-Prot | http://www.uniprot.org/ |
| [UniProt AC]_[Species code] | UniProt ID (entry name) | Q9BUU9_HUMAN | Protein sequence | UniProtKB/TrEMBL | http://www.uniprot.org/ |
| [A-N,R-Z][0-9][A-Z][A-Z, 0-9][A-Z, 0-9][0-9] OR [O,P,Q][0-9][A-Z, 0-9][A-Z, 0-9][A-Z, 0-9][0-9] | UniProt AC (accession number) | P07437 | Protein sequence | UniProtKB | http://www.uniprot.org/ |
| [capital letters or digits; no initial digit] | HGNC gene symbol | TUBB, TUBB1 | Human gene | HGNC database | http://www.genenames.org/ |
| GO:[7 digits] | GO accession number | GO:0005874 | Gene class | AmiGO | http://www.geneontology.org/ |
| [0-9][A-Z, 0-9][A-Z, 0-9][A-Z, 0-9] | PDB ID | 1TUB | Protein, nucleic acid, or complex structure | PDB | http://www.rcsb.org/ |
| [2 or 3 letters or digits] | PDB ligand ID | CN2 | Ligand | PDB | http://www.rcsb.org/ |
| [up to 7 digits]-[2 digits]-[1 digit] | CAS registry number | 64-86-8 | Chemical structure | SciFinder | https://scifinder-cas-org.login.ezproxy.library.ualberta.ca/ |
| [digits] | PubChem CID (compound ID) | 6167 | Chemical structure | PubChem | http://pubchem.ncbi.nlm.nih.gov/ |
| ZINC[8 digits] OR [digits] | ZINC ID | ZINC00621853, 621853 | Chemical structure | ZINC | http://zinc.docking.org/ |
| DB[5 digits] | DrugBank accession number | DB01394 | Drug (chemical structure) | DrugBank | http://www.drugbank.ca/ |

Key File Formats for Sequences and Structures

• Sequences
  – FASTA format: .fasta .fst .txt
• Macromolecule structures
  – PDB format: .pdb .ent

Accessing Databases

• Web interface
• Query string, e.g. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=34222261&rettype=fasta&retmode=fasta
• Web services (SOAP)
• FTP -> local copy

Cheminformatic Database Survey

• CAS SciFinder (https://scifinder-cas-org.login.ezproxy.library.ualberta.ca/): >52 million organic compounds; >61 million inorganic compounds; physical property info
• PubChem (http://pubchem.ncbi.nlm.nih.gov/): >27 million unique structures; >23 million with 3D conformations; mostly organic, biologically interesting compounds
• DrugBank (http://www.drugbank.ca/): ~4,800 drugs; >1,350 FDA approved drugs; includes drug target info
• ZINC (http://zinc.docking.org/): >13 million purchasable compounds; ready to dock

Stereochemistry Issues

5 chaetocin structures from PubChem:

• CID 161591: no stereochemistry
• CID 5390098: bad stereochemistry
• CID 11563851: incomplete stereochemistry
• CID 46191942: enantiomer of natural product
• CID 11657687: natural product stereochemistry

Other Cheminformatic Issues

• Tautomers / protonation states?
• Salt forms?
• Implicit or explicit hydrogens?
• 2D connectivity only or 3D conformation?
• Non-organic elements?
  – Many programs only handle: CHNOPS + halogens
  – But some drugs have B, Pt, Hg, As, …

SMILES

Example: CC(=O)N[C@H]1CCC2=CC(=C(C(=C2C3=CC=C(C(=O)C=C13)OC)OC)OC)OC

• Isomeric SMILES
  – Allows specification of stereochemistry
• Canonical SMILES
  – Canonicalization will generate a unique string for a molecule, regardless of atom order
  – Different programs will canonicalize differently
• SMARTS
  – Chemical patterns for searching or filtering

http://www.daylight.com/smiles/index.html

File formats

• MDL Molfile .mol
  – Allows a 3D conformation to be stored
• SDF .sdf
  – Wraps Molfile format; multiple structures; annotations
• PDB .pdb .ent
  – Not the best for small molecules

Need to convert? -> Try OpenBabel http://openbabel.org/wiki/Main_Page

Pathway and Interaction Databases

• KEGG Pathways (http://www.genome.jp/kegg/): manually drawn pathways of metabolism, signaling, and other biological processes; >300 pathways + organism specific versions
• NetPath (http://www.netpath.org/): curated protein signal pathways in humans; 20 pathways, 1,800 interactions
• BioGRID (http://thebiogrid.org/): a repository for protein and gene interaction data; 345,620 interactions

Pathway Formats

• SBML .xml
  – The Systems Biology Markup Language http://sbml.org/Main_Page
• Also check out the BioPAX format http://www.biopax.org/

Pathway Tools

• libSBML http://sbml.org/Software/libSBML
• CellDesigner http://www.celldesigner.org/
• Cytoscape http://www.cytoscape.org/

SBML example (excerpt; the slide cuts off inside the kinetic law):

    <?xml version="1.0" encoding="UTF-8"?>
    <sbml level="2" version="3" xmlns="http://www.sbml.org/sbml/level2/version3">
      ...
      <listOfSpecies>
        <species compartment="cytosol" id="ES" />
        <species compartment="cytosol" id="P" />
        <species compartment="cytosol" id="S" />
        <species compartment="cytosol" id="E" />
      </listOfSpecies>
      <listOfReactions>
        <reaction id="veq">
          <listOfReactants>
            <speciesReference species="E"/>
            <speciesReference species="S"/>
          </listOfReactants>
          <listOfProducts>
            <speciesReference species="ES"/>
          </listOfProducts>
          <kineticLaw>
            <math xmlns="http://www.w3.org/1998/Math/MathML">
              <apply>
                <times/>
                <ci>cytosol</ci>
                ...

[Figure: KEGG, Pathways in Cancer]

[Figure: NetPath, EGFR1 pathway]

Exercises

1. What databases are these identifiers from?
   a. 3KYL
   b. EZH2
   c. Q15910
   d. GO:0008017
   e. GI:8017
   f. A9145C
2. Try finding the corresponding entries online

Exercise Answers

1. What databases are these identifiers from?
   a. 3KYL -> PDB (a protein-RNA structure for telomerase reverse transcriptase, catalytic region)
   b. EZH2 -> HGNC (a human gene for a histone lysine methyl transferase)
   c. Q15910 -> UniProt (a protein sequence for EZH2)
   d. GO:0008017 -> AmiGO (microtubule binding gene ontology)
   e. GI:8017 -> GenBank (a DNA sequence from D. melanogaster)
   f. A9145C -> this one's a trick: it's a chemical compound; you can look it up in PubChem with CID 6438632.
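
The identifier patterns in the cheat sheets can be tried out on the exercise identifiers with a few regular expressions. The following is only a rough sketch: the regexes are a direct transcription of the cheat-sheet patterns, and the names CANDIDATE_PATTERNS and candidates() are hypothetical helpers introduced here, not part of any database's API. Entry-name patterns (Swiss-Prot ID, UniProt ID) and LOCUS are left out for brevity.

```python
import re

# Building blocks reused by the VERSION pattern below.
GENBANK_ACC = r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6})"
REFSEQ_ACC = r"[A-Z]{2}_\d+"

# Hypothetical lookup table: identifier name -> regex transcribed from the
# cheat sheets. Several patterns overlap on purpose, as they do on the slides.
CANDIDATE_PATTERNS = {
    "GenInfo Identifier": r"(?:GI:)?\d+",
    "GenBank ACCESSION": GENBANK_ACC,
    "RefSeq ACCESSION": REFSEQ_ACC,
    "GenBank or RefSeq VERSION": rf"(?:{GENBANK_ACC}|{REFSEQ_ACC})\.\d+",
    "UniProt AC": r"(?:[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9]|[OPQ][0-9][A-Z0-9]{3}[0-9])",
    "HGNC gene symbol": r"[A-Z][A-Z0-9]*",
    "GO accession number": r"GO:\d{7}",
    "PDB ID": r"[0-9][A-Z0-9]{3}",
    "PDB ligand ID": r"[A-Z0-9]{2,3}",
    "CAS registry number": r"\d{1,7}-\d{2}-\d",
    "PubChem CID": r"\d+",
    "ZINC ID": r"(?:ZINC\d{8}|\d+)",
    "DrugBank accession number": r"DB\d{5}",
}


def candidates(identifier):
    """Return every identifier type whose pattern matches the whole string."""
    return [
        name
        for name, pattern in CANDIDATE_PATTERNS.items()
        if re.fullmatch(pattern, identifier)
    ]


if __name__ == "__main__":
    # Identifiers from the exercise slide.
    for identifier in ["3KYL", "EZH2", "Q15910", "GO:0008017", "GI:8017", "A9145C"]:
        print(identifier, "->", candidates(identifier))
```

A match only narrows the candidates. Q15910 also fits the one-letter GenBank ACCESSION pattern and the HGNC symbol pattern, and A9145C fits the HGNC symbol pattern even though it is a chemical compound, so the record still has to be looked up in the database itself.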

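The query string shown under Accessing Databases can also be assembled programmatically rather than typed by hand. This is a minimal sketch using only the Python standard library; efetch_url() is a hypothetical helper name, the parameter values are copied from the slide, and no network request is made. The current NCBI service may expect different rettype/retmode values than the 2010 slide shows.

```python
from urllib.parse import urlencode

# Base address taken from the slide's example query.
EFETCH_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(db, record_id, rettype="fasta", retmode="fasta"):
    """Build an efetch query URL; defaults mirror the slide's example."""
    params = {"db": db, "id": record_id, "rettype": rettype, "retmode": retmode}
    # urlencode escapes the values and joins the pairs with "&".
    return EFETCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # GI number from the cheat sheet example (GI:34222261).
    print(efetch_url("nucleotide", 34222261))
```

Keeping the parameters in a dictionary makes it easy to swap the database or identifier without re-editing the URL by hand.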