PharmaMatrix Workshop 2010: Bioinformatic Databases

14 July 2010
Philip Winter & Ishwar Hosamani

Database Growth

Source: http://www.kokocinski.net/bioinformatics/databases.php

Database Survey


Genes & Proteins
  Pfam, PDB, TGI, UniProt, dbSNP, GenBank, GEO

Cheminformatics: Drugs & Metabolites
  SciFinder, DrugBank, PubChem, ZINC

Gene & Protein Interactions
  KEGG, NetPath, BioGRID

Curation

• Manual curation (or just curation): A person creates and annotates the database entry

• Automatic curation: A computer program creates and annotates the database entry

• Semi-automatic curation: A combination of manual and automatic

Database Identifiers

• Every database record will have a unique identifier; often this will be called an accession number, which is assigned when the record is first added to the database

• Be careful: databases will often permit a record to be modified but keep the same accession number; you should record the version number as well

• Furthermore, databases may have different rules for handling records that are merged or split

Database Identifier Cheat Sheets

Pattern | Identifier Name | Examples | Entity | Database | URL
[optional “GI:”][digits] | GenInfo Identifier | GI:34222261 | Nucleotide or protein sequence | GenBank, RefSeq | http://www.ncbi.nlm.nih.gov/
[letter][5 digits] OR [2 letters][6 digits] | GenBank ACCESSION | AB088100 | Nucleotide sequence | GenBank | http://www.ncbi.nlm.nih.gov/
[2 letter type code]_[digits] | RefSeq ACCESSION | NM_178014 | Nucleotide or protein sequence | RefSeq | http://www.ncbi.nlm.nih.gov/
[GenBank or RefSeq ACCESSION].[version number] | GenBank or RefSeq VERSION | AB088100.1, NM_178014.2 | Nucleotide or protein sequence | GenBank, RefSeq | http://www.ncbi.nlm.nih.gov/
(identical to accession for recent entries) | GenBank LOCUS | | Nucleotide or protein sequence | GenBank, RefSeq | http://www.ncbi.nlm.nih.gov/
[Protein code]_[Species code] | Swiss-Prot ID (entry name) | TBB5_HUMAN | Protein sequence | UniProtKB/Swiss-Prot | http://www.uniprot.org/
[UniProt AC]_[Species code] | UniProt ID (entry name) | Q9BUU9_HUMAN | Protein sequence | UniProtKB/TrEMBL | http://www.uniprot.org/
[A-N,R-Z][0-9][A-Z][A-Z, 0-9][A-Z, 0-9][0-9] OR [O,P,Q][0-9][A-Z, 0-9][A-Z, 0-9][A-Z, 0-9][0-9] | UniProt AC (accession number) | P07437 | Protein sequence | UniProtKB | http://www.uniprot.org/
[capital letters or digits; no initial digit] | HGNC gene symbol | TUBB, TUBB1 | Human gene | HGNC database | http://www.genenames.org/
GO:[7 digits] | GO accession number | GO:0005874 | Gene class | AmiGO | http://www.geneontology.org/
[0-9][A-Z, 0-9][A-Z, 0-9][A-Z, 0-9] | PDB ID | 1TUB | Protein, nucleic acid, or complex structure | PDB | http://www.rcsb.org/
[2 or 3 letters or digits] | PDB ligand ID | CN2 | Ligand | PDB | http://www.rcsb.org/
[up to 7 digits]-[2 digits]-[1 digit] | CAS registry number | 64-86-8 | Chemical structure | SciFinder | https://scifinder-cas-org.login.ezproxy.library.ualberta.ca/
[digits] | PubChem CID (compound ID) | 6167 | Chemical structure | PubChem | http://pubchem.ncbi.nlm.nih.gov/
ZINC[8 digits] OR [digits] | ZINC ID | ZINC00621853, 621853 | Chemical structure | ZINC | http://zinc.docking.org/
DB[5 digits] | DrugBank accession number | DB01394 | Drug (chemical structure) | DrugBank | http://www.drugbank.ca/

Key File Formats for Sequences and Structures

• Sequences – FASTA format .fasta .fst .txt

• Macromolecule structures – PDB format .pdb .ent

Accessing Databases

• Web interface

• Query string

e.g. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=34222261&rettype=fasta&retmode=fasta

• Web services (SOAP)

• FTP -> local copy

Cheminformatic Database Survey

CAS SciFinder: https://scifinder-cas-org.login.ezproxy.library.ualberta.ca/
  – >52 million organic compounds
  – >61 million inorganic compounds
  – Physical property info

PubChem: http://pubchem.ncbi.nlm.nih.gov/
  – >27 million unique structures
  – >23 million with 3D conformations
  – Mostly organic, biologically interesting compounds

DrugBank: http://www.drugbank.ca/
  – ~4,800 drugs
  – >1,350 FDA approved drugs
  – Includes drug target info

ZINC: http://zinc.docking.org/
  – >13 million purchasable compounds
  – Ready to dock

Stereochemistry Issues

5 Chaetocin structures from PubChem

[Figure: five structure drawings, labeled as follows]

CID 161591: no stereochemistry
CID 5390098: bad stereochemistry
CID 11563851: incomplete stereochemistry
CID 46191942: Enantiomer of natural product
CID 11657687: Natural product stereochemistry

Other Cheminformatic Issues

• Tautomers / protonation states?
• Salt forms?
• Implicit or explicit hydrogens?
• 2D connectivity only or 3D conformation?
• Non-organic elements?
  – Many programs only handle: CHNOPS + halogens
  – But some drugs have B, Pt, Hg, As, …

SMILES

CC(=O)N[C@H]1CCC2=CC(=C(C(=C2C3=CC=C(C(=O)C=C13)OC)OC)OC)OC

[Figure: 2D structure drawing of the molecule encoded by this SMILES string]
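The "CHNOPS + halogens" restriction noted under Other Cheminformatic Issues can be checked directly on a SMILES string before a structure is handed to a program with that limit. The following is a minimal sketch, not a full SMILES parser: it assumes well-formed input, and the function names and the platinum-containing test string are illustrative and do not come from the slides.

```python
import re

# Elements many programs handle, per the slide: CHNOPS + halogens.
SUPPORTED = {"C", "H", "N", "O", "P", "S", "F", "Cl", "Br", "I"}

# One atom token: a bracket atom, an organic-subset atom, or an aromatic atom.
_ATOM = re.compile(r"\[([^\]]+)\]|(Br|Cl|[BCNOPSFI])|([bcnops])")
# Inside brackets: optional isotope digits, then the element symbol.
_BRACKET_SYMBOL = re.compile(r"\d*([A-Z][a-z]?|[a-z]{1,2})")


def elements_in_smiles(smiles):
    """Return the set of element symbols written as atoms in a SMILES string."""
    found = set()
    for bracket, organic, aromatic in _ATOM.findall(smiles):
        if bracket:
            match = _BRACKET_SYMBOL.match(bracket)
            if match:  # wildcard atoms such as [*] are skipped
                found.add(match.group(1).capitalize())
        elif organic:
            found.add(organic)
        else:
            found.add(aromatic.upper())
    return found


def unsupported_elements(smiles):
    """Return the elements that fall outside CHNOPS + halogens."""
    return elements_in_smiles(smiles) - SUPPORTED


if __name__ == "__main__":
    slide_smiles = "CC(=O)N[C@H]1CCC2=CC(=C(C(=C2C3=CC=C(C(=O)C=C13)OC)OC)OC)OC"
    print(sorted(elements_in_smiles(slide_smiles)))
    # Illustrative platinum-containing string (an assumption, not from the slides).
    print(sorted(unsupported_elements("N[Pt](N)(Cl)Cl")))
```

Hydrogens written inside brackets (as in `[C@H]`) are not reported separately, which is harmless here because hydrogen is in the supported set.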

• Isomeric SMILES
  – Allows specification of stereochemistry

• Canonical SMILES
  – Canonicalization will generate a unique string for a molecule, regardless of atom order
  – Different programs will canonicalize differently

• SMARTS
  – Chemical patterns for searching or filtering

http://www.daylight.com/smiles/index.html

File formats

• MDL Molfile .mol
  – Allows a 3D conformation to be stored

• SDF .sdf
  – Wraps Molfile format; multiple structures; annotations

• PDB .pdb .ent
  – Not the best for small molecules
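To make the SDF bullet concrete, here is a small sketch of how an SDF file wraps several Molfile blocks and their annotations. It relies on three conventions of the format: each record ends with a `$$$$` line, the Molfile block ends with `M  END`, and annotations are written as `> <FIELD>` headers followed by value lines. The function names and the two sample records are hypothetical and only illustrate the layout; a real toolkit should be preferred for production work.

```python
import re

# Illustrative two-record SDF text (hypothetical data, not from a real database).
SAMPLE_SDF = """\
methane
  hypothetical example header line

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <NAME>
methane

> <SOURCE>
illustrative record, not from a real database

$$$$
water
  hypothetical example header line

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <NAME>
water

$$$$
"""


def parse_record(lines):
    """Split one SDF record into its title, Molfile block, and annotations."""
    end = next((i for i, line in enumerate(lines) if line.startswith("M  END")),
               len(lines) - 1)
    fields, current = {}, None
    for line in lines[end + 1:]:
        if line.startswith(">"):
            match = re.search(r"<([^>]+)>", line)
            current = match.group(1) if match else None
            if current is not None:
                fields[current] = []
        elif not line.strip():
            current = None  # a blank line closes the current annotation
        elif current is not None:
            fields[current].append(line)
    return {
        "title": lines[0].strip(),
        "molfile": "\n".join(lines[: end + 1]),
        "data": {name: "\n".join(value) for name, value in fields.items()},
    }


def iter_sdf_records(text):
    """Yield one parsed record per '$$$$'-terminated block."""
    block = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if any(part.strip() for part in block):
                yield parse_record(block)
            block = []
        else:
            block.append(line)


if __name__ == "__main__":
    for record in iter_sdf_records(SAMPLE_SDF):
        print(record["title"], record["data"])
```

Keeping the Molfile block as an untouched string means each structure can still be written out as a standalone .mol file.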

Need to convert? -> Try OpenBabel http://openbabel.org/wiki/Main_Page

Pathway and Interaction Databases

KEGG Pathways: http://www.genome.jp/kegg/
  – Manually drawn pathways of metabolism, signaling, and other biological processes
  – >300 pathways + specific versions

NetPath: http://www.netpath.org/
  – Curated protein signal pathways
  – 20 pathways, 1,800 interactions

BioGRID: http://thebiogrid.org/
  – A repository for protein and gene interaction data
  – 345,620 interactions

Pathway Formats

• SBML .xml
  – The Systems Biology Markup Language http://sbml.org/Main_Page

• Also check out the BioPAX format http://www.biopax.org/

Pathway Tools

• libSBML http://sbml.org/Software/libSBML

• CellDesigner http://www.celldesigner.org/

• CytoScape http://www.cytoscape.org/

[Figure: KEGG: Pathways in Cancer]

[Figure: NetPath: EGFR1 pathway]

Exercises

1. What databases are these identifiers from?
   a. 3KYL
   b. EZH2
   c. Q15910
   d. GO:0008017
   e. GI:8017
   f. A9145C
2. Try finding the corresponding entries online

Exercise Answers

1. What databases are these identifiers from?
   a. 3KYL -> PDB (a protein-RNA structure for telomerase reverse transcriptase, catalytic region)
   b. EZH2 -> HGNC (a human gene for a histone lysine methyl transferase)
   c. Q15910 -> UniProt (a protein sequence for EZH2)
   d. GO:0008017 -> AmiGO (microtubule binding)
   e. GI:8017 -> GenBank (a DNA sequence from D. melanogaster)
   f. A9145C -> this one’s a trick: it’s a chemical compound; you can look it up in PubChem with CID: 6438632
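The exercise can also be approached mechanically. The sketch below transcribes the cheat-sheet patterns into regular expressions and lists every identifier type a string could belong to; for a GI it then builds the efetch query string in the form shown under Accessing Databases. The function names are illustrative, the length limits on the UniProt entry name are an assumption, and the efetch parameter values are copied from the slide's example rather than checked against current NCBI documentation.

```python
import re
from urllib.parse import urlencode

# Patterns transcribed from the identifier cheat sheet.
# The {1,10}/{1,5} limits on the UniProt entry name are an assumption.
PATTERNS = {
    "GenInfo Identifier (GI)": r"(GI:)?\d+",
    "GenBank ACCESSION": r"[A-Z]\d{5}|[A-Z]{2}\d{6}",
    "RefSeq ACCESSION": r"[A-Z]{2}_\d+",
    "GenBank or RefSeq VERSION": r"([A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}_\d+)\.\d+",
    "UniProt ID (entry name)": r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}",
    "UniProt AC": r"[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9]|[OPQ][0-9][A-Z0-9]{3}[0-9]",
    "HGNC gene symbol": r"[A-Z][A-Z0-9]*",
    "GO accession number": r"GO:\d{7}",
    "PDB ID": r"[0-9][A-Z0-9]{3}",
    "PDB ligand ID": r"[A-Z0-9]{2,3}",
    "CAS registry number": r"\d{1,7}-\d{2}-\d",
    "PubChem CID": r"\d+",
    "ZINC ID": r"ZINC\d{8}|\d+",
    "DrugBank accession number": r"DB\d{5}",
}

# Base URL as written on the Accessing Databases slide.
EFETCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def classify(identifier):
    """Return every identifier type whose pattern matches the whole string."""
    return [name for name, pattern in PATTERNS.items()
            if re.fullmatch(pattern, identifier)]


def efetch_url(gi):
    """Build the efetch query string for a GI, mirroring the slide's example."""
    number = gi[3:] if gi.startswith("GI:") else gi
    query = urlencode({"db": "nucleotide", "id": number,
                       "rettype": "fasta", "retmode": "fasta"})
    return f"{EFETCH}?{query}"


if __name__ == "__main__":
    for identifier in ["3KYL", "EZH2", "Q15910", "GO:0008017", "GI:8017", "A9145C"]:
        print(identifier, "->", classify(identifier))
    print(efetch_url("GI:8017"))
```

A match only narrows the candidates. Q15910 fits the GenBank, UniProt, and HGNC patterns at once, and A9145C fits the HGNC pattern although it is a chemical compound, so the entry itself still has to be looked up to confirm its source.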