PharmaMatrix Workshop 2010
Bioinformatic Databases
14 July 2010
Philip Winter & Ishwar Hosamani
Database Growth
Source: http://www.kokocinski.net/bioinformatics/databases.php
Database Survey
• Genes & Proteins: Pfam, PDB, TGI, UniProt, dbSNP, GenBank, GEO
• Cheminformatics (Drugs & Metabolites): SciFinder, DrugBank, PubChem, ZINC
• Gene & Protein Interactions: KEGG, NetPath, BioGRID
Curation
• Manual curation (or just curation): A human creates and annotates the database entry
• Automatic curation: A computer program creates and annotates the database entry
• Semi-automatic curation: A combination of manual and automatic
Database Identifiers
• Every database record will have a unique identifier; often this will be called an accession number, which is assigned when the record is first added to the database
• Be careful: databases will often permit a record to be modified but keep the same accession number; you should record the version number as well
• Furthermore, databases may have different rules for handling records that are merged or split
Database Identifier Cheat Sheets
| Pattern | Identifier Name | Examples | Entity | Database | URL |
|---|---|---|---|---|---|
| [optional “GI:”][digits] | GenInfo Identifier | GI:34222261 | Nucleotide or protein sequence | GenBank, RefSeq | http://www.ncbi.nlm.nih.gov/ |
| [letter][5 digits] OR [2 letters][6 digits] | GenBank ACCESSION | AB088100 | Nucleotide sequence | GenBank | http://www.ncbi.nlm.nih.gov/ |
| [2 letter type code]_[digits] | RefSeq ACCESSION | NM_178014 | Nucleotide or protein sequence | RefSeq | http://www.ncbi.nlm.nih.gov/ |
| [GenBank or RefSeq ACCESSION].[version number] | GenBank or RefSeq VERSION | AB088100.1, NM_178014.2 | Nucleotide or protein sequence | GenBank, RefSeq | http://www.ncbi.nlm.nih.gov/ |
| (identical to accession for recent entries) | GenBank LOCUS | | Nucleotide or protein sequence | GenBank, RefSeq | http://www.ncbi.nlm.nih.gov/ |
| [Protein code]_[Species code] | Swiss-Prot ID (entry name) | TBB5_HUMAN | Protein sequence | UniProtKB/Swiss-Prot | http://www.uniprot.org/ |
| [UniProt AC]_[Species code] | UniProt ID (entry name) | Q9BUU9_HUMAN | Protein sequence | UniProtKB/TrEMBL | http://www.uniprot.org/ |
| [A-N,R-Z][0-9][A-Z][A-Z, 0-9][A-Z, 0-9][0-9] OR [O,P,Q][0-9][A-Z, 0-9][A-Z, 0-9][A-Z, 0-9][0-9] | UniProt AC (accession number) | P07437 | Protein sequence | UniProtKB | http://www.uniprot.org/ |
| [capital letters or digits; no initial digit] | HGNC gene symbol | TUBB, TUBB1 | Human gene | HGNC database | http://www.genenames.org/ |
| GO:[7 digits] | GO accession number | GO:0005874 | Gene class | AmiGO | http://www.geneontology.org/ |
| [0-9][A-Z, 0-9][A-Z, 0-9][A-Z, 0-9] | PDB ID | 1TUB | Protein, nucleic acid, or complex structure | PDB | http://www.rcsb.org/ |
| [2 or 3 letters or digits] | PDB ligand ID | CN2 | Ligand | PDB | http://www.rcsb.org/ |
| [up to 7 digits]-[2 digits]-[1 digit] | CAS registry number | 64-86-8 | Chemical structure | SciFinder | https://scifinder-cas-org.login.ezproxy.library.ualberta.ca/ |
| [digits] | PubChem CID (compound ID) | 6167 | Chemical structure | PubChem | http://pubchem.ncbi.nlm.nih.gov/ |
| ZINC[8 digits] OR [digits] | ZINC ID | ZINC00621853, 621853 | Chemical structure | ZINC | http://zinc.docking.org/ |
| DB[5 digits] | DrugBank accession number | DB01394 | Drug (chemical structure) | DrugBank | http://www.drugbank.ca/ |
Key File Formats for Sequences and Structures
• Sequences – FASTA format: .fasta, .fst, .txt
• Macromolecule structures – PDB format: .pdb, .ent
Accessing Databases
• Web interface
• Query string
e.g. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=34222261&rettype=fasta&retmode=fasta
• Web services (SOAP)
• FTP -> local copy
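The query-string route listed under Accessing Databases can be scripted. The following is a minimal sketch in Python, assuming the EFetch URL and parameters exactly as given on the slide (the service may have changed since 2010). The names `efetch_url` and `parse_fasta` are invented here for illustration, and the FASTA record in the usage note is made-up toy data, not a real database entry.

```python
from urllib.parse import urlencode

# Base URL copied from the slide; treat it as an assumption, not current documentation.
EFETCH_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(db, record_id, rettype="fasta", retmode="fasta"):
    """Build an EFetch query string for one record (parameter values as on the slide)."""
    query = urlencode(
        {"db": db, "id": record_id, "rettype": rettype, "retmode": retmode}
    )
    return f"{EFETCH_BASE}?{query}"


def parse_fasta(text):
    """Parse FASTA text into a dict mapping each header line to its sequence."""
    chunks = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A ">" line starts a new record; the rest of the line is the header.
            header = line[1:]
            chunks[header] = []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks[header].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}
```

Calling `efetch_url("nucleotide", 34222261)` reproduces the example URL above. The text returned by such a request could then be passed to `parse_fasta`; for a toy input such as `">toy_record\nACGT\nTTGA\n"` it returns `{"toy_record": "ACGTTTGA"}`.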
Cheminformatic Database Survey
• CAS SciFinder (https://scifinder-cas-org.login.ezproxy.library.ualberta.ca/): >52 million organic compounds; >61 million inorganic compounds; physical property info
• PubChem (http://pubchem.ncbi.nlm.nih.gov/): >27 million unique structures; >23 million with 3D conformations; mostly organic, biologically interesting compounds
• DrugBank (http://www.drugbank.ca/): ~4,800 drugs; >1,350 FDA approved drugs; includes drug target info
• ZINC (http://zinc.docking.org/): >13 million purchasable compounds; ready to dock
Stereochemistry Issues
5 Chaetocin structures from PubChem
• CID 161591: no stereochemistry
• CID 5390098: bad stereochemistry
• CID 11563851: incomplete stereochemistry
• CID 46191942: enantiomer of natural product
• CID 11657687: natural product stereochemistry
Other Cheminformatic Issues
• Tautomers / protonation states?
• Salt forms?
• Implicit or explicit hydrogens?
• 2D connectivity only or 3D conformation?
• Non-organic elements?
– Many programs only handle: CHNOPS + halogens
– But some drugs have B, Pt, Hg, As, …
SMILES
CC(=O)N[C@H]1CCC2=CC(=C(C(=C2C3=CC=C(C(=O)C=C13)OC)OC)OC)OC
• Isomeric SMILES
– Allows specification of stereochemistry
• Canonical SMILES
– Canonicalization will generate a unique string for a molecule, regardless of atom order
– Different programs will canonicalize differently
• SMARTS
– Chemical patterns for searching or filtering
http://www.daylight.com/smiles/index.html
File formats
• MDL Molfile .mol – Allows a 3D conformation to be stored
• SDF .sdf – Wraps Molfile format; multiple structures; annotations
• PDB .pdb .ent – Not the best for small molecules
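To make the "wraps Molfile; multiple structures; annotations" description of SDF concrete, here is a minimal sketch in Python of splitting an SDF text into records. It assumes the common layout in which each record is a Molfile block ending in `M  END`, followed by `> <TAG>` data fields, and terminated by a `$$$$` line. The function names and the two-record `TOY_SDF` are invented for illustration; a real toolkit such as OpenBabel handles many cases this sketch ignores.

```python
# Made-up two-record SDF used only to exercise the parser below.
TOY_SDF = "\n".join([
    "water",
    "  toy-example",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <NAME>",
    "water",
    "",
    "> <FORMULA>",
    "H2O",
    "",
    "$$$$",
    "ammonia",
    "  toy-example",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <NAME>",
    "ammonia",
    "",
    "$$$$",
])


def parse_record(lines):
    """Split one SDF record into its title, Molfile block, and data fields."""
    # The embedded Molfile ends at the "M  END" line.
    end = next(i for i, line in enumerate(lines) if line.startswith("M  END"))
    data, tag = {}, None
    for line in lines[end + 1:]:
        if line.startswith(">"):
            # Data header of the form "> <TAG>": keep the text between < and >.
            tag = line[line.find("<") + 1:line.rfind(">")]
            data[tag] = []
        elif line.strip() and tag is not None:
            data[tag].append(line.strip())
        else:
            # A blank line closes the current data field.
            tag = None
    return {
        "title": lines[0].strip(),
        "molblock": "\n".join(lines[:end + 1]),
        "data": {key: "\n".join(values) for key, values in data.items()},
    }


def parse_sdf(text):
    """Return one dict per record; records are terminated by a '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(parse_record(current))
            current = []
        else:
            current.append(line)
    return records
```

With the toy input, `parse_sdf(TOY_SDF)` yields two records, and the first record's `data` dict holds the `NAME` and `FORMULA` annotations alongside the untouched Molfile block.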
Need to convert? -> Try OpenBabel http://openbabel.org/wiki/Main_Page
Pathway and Interaction Databases
• KEGG Pathways (http://www.genome.jp/kegg/): manually drawn pathways of metabolism, signaling, and other biological processes; >300 pathways + organism specific versions
• NetPath (http://www.netpath.org/): curated protein signaling pathways in humans; 20 pathways, 1,800 interactions
• BioGRID (http://thebiogrid.org/): a repository for protein and gene interaction data; 345,620 interactions
Pathway Formats
• SBML .xml – The Systems Biology Markup Language http://sbml.org/Main_Page
• Also check out the BioPAX format http://www.biopax.org/
Pathway Tools
• libSBML http://sbml.org/Software/libSBML
• CellDesigner http://www.celldesigner.org/
• Cytoscape http://www.cytoscape.org/
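Because SBML is plain XML, a file can be inspected even without the tools listed above. The following is a rough sketch that uses Python's standard `xml.etree.ElementTree` instead of libSBML, so it does no validation and understands none of the SBML semantics. The `TOY_SBML` model and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up minimal model; real SBML files carry far more detail.
TOY_SBML = """
<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="toy_model">
    <listOfSpecies>
      <species id="S1" name="substrate" compartment="cell"/>
      <species id="P1" name="product" compartment="cell"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="R1">
        <listOfReactants><speciesReference species="S1"/></listOfReactants>
        <listOfProducts><speciesReference species="P1"/></listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def summarize_sbml(text):
    """Return (species ids, reaction ids) found anywhere in the document."""
    root = ET.fromstring(text.strip())
    species = [el.get("id") for el in root.iter() if local_name(el.tag) == "species"]
    reactions = [el.get("id") for el in root.iter() if local_name(el.tag) == "reaction"]
    return species, reactions
```

For the toy model, `summarize_sbml(TOY_SBML)` returns `(["S1", "P1"], ["R1"])`. Comparing local tag names sidesteps the fact that the namespace URI differs between SBML levels and versions; for anything beyond a quick look, libSBML is the more appropriate choice.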
Exercise
1. What databases are these identifiers from?
a. 3KYL
b. EZH2
c. Q15910
d. GO:0008017
e. GI:8017
f. A9145C
2. Try finding the corresponding entries online
Exercise Answers
1. What databases are these identifiers from?
a. 3KYL -> PDB (a protein-RNA structure for telomerase reverse transcriptase, catalytic region)
b. EZH2 -> HGNC (a human gene for a histone lysine methyltransferase)
c. Q15910 -> UniProt (a protein sequence for EZH2)
d. GO:0008017 -> AmiGO (microtubule binding gene ontology)
e. GI:8017 -> GenBank (a DNA sequence from D. melanogaster)
f. A9145C -> this one’s a trick: it’s a chemical compound; you can look it up in PubChem with CID: 6438632
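The pattern column of the identifier cheat sheet lends itself to regular expressions. The following is a minimal sketch in Python, not an official validator: the `PATTERNS` list and the `candidate_identifier_types` function are names invented here, the regexes are transcribed from a subset of the cheat sheet rows, and real identifiers may deviate from them. Because several patterns overlap, the function returns every candidate rather than a single answer.

```python
import re

# (identifier name, regex) pairs transcribed from the cheat sheet patterns.
# Only a subset of rows is covered; the labels are for illustration.
PATTERNS = [
    ("GenInfo Identifier (GI)", r"(GI:)?\d+"),
    ("GenBank ACCESSION", r"[A-Z]\d{5}|[A-Z]{2}\d{6}"),
    ("RefSeq ACCESSION", r"[A-Z]{2}_\d+"),
    ("UniProt AC", r"[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9]|[OPQ][0-9][A-Z0-9]{3}[0-9]"),
    ("HGNC gene symbol", r"[A-Z][A-Z0-9]*"),
    ("GO accession number", r"GO:\d{7}"),
    ("PDB ID", r"[0-9][A-Z0-9]{3}"),
    ("CAS registry number", r"\d{1,7}-\d{2}-\d"),
    ("PubChem CID", r"\d+"),
    ("ZINC ID", r"ZINC\d{8}|\d+"),
    ("DrugBank accession number", r"DB\d{5}"),
]


def candidate_identifier_types(identifier):
    """Return the names of all cheat sheet patterns the whole string matches."""
    return [
        name
        for name, pattern in PATTERNS
        if re.fullmatch(pattern, identifier)
    ]


if __name__ == "__main__":
    # The six identifiers from the exercise.
    for example in ["3KYL", "EZH2", "Q15910", "GO:0008017", "GI:8017", "A9145C"]:
        print(example, "->", candidate_identifier_types(example))
```

Run on the exercise identifiers, 3KYL matches only the PDB ID pattern and EZH2 only the HGNC pattern, but Q15910 fits the GenBank, UniProt, and HGNC patterns at once, and the trick identifier A9145C fits the HGNC pattern even though it is a chemical compound. Pattern matching narrows down where to look; it does not replace finding the entry in the database.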