Bioinformatic Databases

PharmaMatrix Workshop 2010
14 July 2010
Philip Winter & Ishwar Hosamani

Database Growth
Source: http://www.kokocinski.net/bioinformatics/databases.php

Database Survey
• Genes & Proteins: Pfam, PDB, TGI, UniProt, dbSNP, GenBank, GEO
• Cheminformatics (Drugs & Metabolites): SciFinder, DrugBank, PubChem, ZINC
• Gene & Protein Interactions: KEGG, NetPath, BioGRID

Curation
• Manual curation (or just curation): A human creates and annotates the database entry
• Automatic curation: A computer program creates and annotates the database entry
• Semi-automatic curation: A combination of manual and automatic

Database Identifiers
• Every database record will have a unique identifier; often this will be called an accession number, which is assigned when the record is first added to the database
• Be careful: databases will often permit a record to be modified but keep the same accession number; you should record the version number as well
• Furthermore, databases may have different rules for handling records that are merged or split

Database Identifier Cheat Sheets

Pattern | Identifier Name | Examples | Entity | Database | URL
[optional "GI:"][digits] | GenInfo Identifier | GI:34222261 | Nucleotide or protein sequence | GenBank, RefSeq | http://www.ncbi.nlm.nih.gov/
[letter][5 digits] OR [2 letters][6 digits] | GenBank ACCESSION | AB088100 | Nucleotide sequence | GenBank | http://www.ncbi.nlm.nih.gov/
[2 letter type code]_[digits] | RefSeq ACCESSION | NM_178014 | Nucleotide or protein sequence | RefSeq | http://www.ncbi.nlm.nih.gov/
[GenBank or RefSeq ACCESSION].[version number] | GenBank or RefSeq VERSION | AB088100.1, NM_178014.2 | Nucleotide or protein sequence | GenBank, RefSeq | http://www.ncbi.nlm.nih.gov/
(identical to accession for recent entries) | GenBank LOCUS | | Nucleotide or protein sequence | GenBank, RefSeq | http://www.ncbi.nlm.nih.gov/

Pattern | Identifier Name | Examples | Entity | Database | URL
[Protein code]_[Species code] | Swiss-Prot ID (entry name) | TBB5_HUMAN | Protein sequence | UniProtKB/Swiss-Prot | http://www.uniprot.org/
[UniProt AC]_[Species code] | UniProt ID (entry name) | Q9BUU9_HUMAN | Protein sequence | UniProtKB/TrEMBL | http://www.uniprot.org/
[A-N,R-Z][0-9][A-Z][A-Z, 0-9][A-Z, 0-9][0-9] OR [O,P,Q][0-9][A-Z, 0-9][A-Z, 0-9][A-Z, 0-9][0-9] | UniProt AC (accession number) | P07437 | Protein sequence | UniProtKB | http://www.uniprot.org/

Pattern | Identifier Name | Examples | Entity | Database | URL
[capital letters or digits; no initial digit] | HGNC gene symbol | TUBB, TUBB1 | Human gene | HGNC database | http://www.genenames.org/
GO:[7 digits] | GO accession number | GO:0005874 | Gene class | AmiGO | http://www.geneontology.org/
[0-9][A-Z, 0-9][A-Z, 0-9][A-Z, 0-9] | PDB ID | 1TUB | Protein, nucleic acid, or complex structure | PDB | http://www.rcsb.org/
[2 or 3 letters or digits] | PDB ligand ID | CN2 | Ligand | PDB | http://www.rcsb.org/

Pattern | Identifier Name | Examples | Entity | Database | URL
[up to 7 digits]-[2 digits]-[1 digit] | CAS registry number | 64-86-8 | Chemical structure | SciFinder | https://scifinder-cas-org.login.ezproxy.library.ualberta.ca/
[digits] | PubChem CID (compound ID) | 6167 | Chemical structure | PubChem | http://pubchem.ncbi.nlm.nih.gov/
ZINC[8 digits] OR [digits] | ZINC ID | ZINC00621853, 621853 | Chemical structure | ZINC | http://zinc.docking.org/
DB[5 digits] | DrugBank accession number | DB01394 | Drug (chemical structure) | DrugBank | http://www.drugbank.ca/

Key File Formats for Sequences and Structures
• Sequences
  – FASTA format .fasta .fst .txt
• Macromolecule structures
  – PDB format .pdb .ent

Accessing Databases
• Web interface
• Query string, e.g. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=34222261&rettype=fasta&retmode=fasta
• Web services (SOAP)
• FTP -> local copy

Cheminformatic Database Survey
• CAS SciFinder: https://scifinder-cas-org.login.ezproxy.library.ualberta.ca/
  – >52 million organic compounds
  – >61 million inorganic compounds
  – Physical property info
• PubChem: http://pubchem.ncbi.nlm.nih.gov/
  – >27 million unique structures
  – >23 million with 3D conformations
  – Mostly organic, biologically interesting compounds
• DrugBank: http://www.drugbank.ca/
  – ~4,800 drugs
  – >1,350 FDA approved drugs
  – Includes drug target info
• ZINC: http://zinc.docking.org/
  – >13 million purchasable compounds
  – Ready to dock

Stereochemistry Issues
5 Chaetocin structures from PubChem
• CID 161591: no stereochemistry
• CID 5390098: bad stereochemistry
• CID 11563851: incomplete stereochemistry
• CID 46191942: Enantiomer of natural product
• CID 11657687: Natural product stereochemistry

Other Cheminformatic Issues
• Tautomers / protonation states?
• Salt forms?
• Implicit or explicit hydrogens?
• 2D connectivity only or 3D conformation?
• Non-organic elements?
  – Many programs only handle: CHNOPS + halogens
  – But some drugs have B, Pt, Hg, As, …

SMILES
CC(=O)N[C@H]1CCC2=CC(=C(C(=C2C3=CC=C(C(=O)C=C13)OC)OC)OC)OC
• Isomeric SMILES
  – Allows specification of stereochemistry
• Canonical SMILES
  – Canonicalization will generate a unique string for a molecule, regardless of atom order
  – Different programs will canonicalize differently
• SMARTS
  – Chemical patterns for searching or filtering
http://www.daylight.com/smiles/index.html

File formats
• MDL Molfile .mol
  – Allows a 3D conformation to be stored
• SDF .sdf
  – Wraps Molfile format; multiple structures; annotations
• PDB .pdb .ent
  – Not the best for small molecules
Need to convert? -> Try OpenBabel http://openbabel.org/wiki/Main_Page

Pathway and Interaction Databases
• KEGG Pathways: http://www.genome.jp/kegg/
  – Manually drawn pathways of metabolism, signaling, and other biological processes
  – >300 pathways + organism specific versions
• NetPath: http://www.netpath.org/
  – Curated protein signal pathways in humans
  – 20 pathways, 1,800 interactions
• BioGRID: http://thebiogrid.org/
  – A repository for protein and gene interaction data
  – 345,620 interactions

Pathway Formats
• SBML .xml
  – The Systems Biology Markup Language http://sbml.org/Main_Page
• Also check out the BioPAX format http://www.biopax.org/

Pathway Tools
• libSBML http://sbml.org/Software/libSBML
• CellDesigner http://www.celldesigner.org/
• Cytoscape http://www.cytoscape.org/

    <?xml version="1.0" encoding="UTF-8"?>
    <sbml level="2" version="3" xmlns="http://www.sbml.org/sbml/level2/version3">
      ...
      <listOfSpecies>
        <species compartment="cytosol" id="ES" />
        <species compartment="cytosol" id="P" />
        <species compartment="cytosol" id="S" />
        <species compartment="cytosol" id="E" />
      </listOfSpecies>
      <listOfReactions>
        <reaction id="veq">
          <listOfReactants>
            <speciesReference species="E"/>
            <speciesReference species="S"/>
          </listOfReactants>
          <listOfProducts>
            <speciesReference species="ES"/>
          </listOfProducts>
          <kineticLaw>
            <math xmlns="http://www.w3.org/1998/Math/MathML">
              <apply>
                <times/>
                <ci>cytosol</ci>
                ...

Figure: KEGG: Pathways in Cancer
Figure: NetPath: EGFR1 pathway

Exercises
1. What databases are these identifiers from?
   a. 3KYL
   b. EZH2
   c. Q15910
   d. GO:0008017
   e. GI:8017
   f. A9145C
2. Try finding the corresponding entries online

Exercise Answers
1. What databases are these identifiers from?
   a. 3KYL -> PDB (a protein-RNA structure for telomerase reverse transcriptase, catalytic region)
   b. EZH2 -> HGNC (a human gene for a histone lysine methyl transferase)
   c. Q15910 -> UniProt (a protein sequence for EZH2)
   d. GO:0008017 -> AmiGO (microtubule binding gene ontology)
   e. GI:8017 -> GenBank (a DNA sequence from D. melanogaster)
   f. A9145C -> this one's a trick: it's a chemical compound; you can look it up in PubChem with CID: 6438632.
