SESSION III – DATABASES
“Databases and Identification”
Prof. Jacques Vervoort
BRAMA training for technicians – Module I, Rome
For more information see http://fiehnlab.ucdavis.edu
To be a master of spectra you need to be a master of structures in the first place.
[Figure: MS/MS spectrum (nist_msms) and structure of vincristine]
• Complex MS data interpretation is only possible with software
• MS data are obtained by hyphenated techniques (GC-MS, LC-MS)
• Mass spectral database search and structure search are routinely used
• Mass spectrometers deliver multidimensional data

Be prepared – visualize your structures
Try MarvinSpace via WebStart

Organic Chemistry Reminder
Molecular Formula
C3H7F
[Figure: EI mass spectrum (mainlib) and structure of propane, 2-fluoro-]

Be prepared – Stereoisomers
How many stereoisomers can you expect from glucose (KEGG)?
[Figure: structure of glucose]
Example calculated with MarvinView (via Java WebStart)

Be prepared – Tautomers
How many tautomers can you expect? Important for mass spectral interpretations.
[Figure: structure of methyl acetate]
Example calculated with MarvinView (start via WebStart)

Be prepared – Resonance (electron shifts)
What are the possible resonance structures? Important for mass spectral interpretation (electron impact, electrospray).
[Figure: structure of phenol]
Example calculated with MarvinView (start via WebStart)

Structure search – know what could be possible
How many compounds (isomer structures) are found in public databases?
http://www.chemspider.com/
Chemical Structure Handling

[Figure: structure of moronic acid, CID: 489941]

Most common structure formats you need to know:
• SMILES/SMARTS - Simplified Molecular Input Line Entry System
• SDF/MOL - Structure Data File
• InChI/InChIKey - IUPAC International Chemical Identifier
• PDB - Protein Data Bank
• CML - Chemical Markup Language
Some problems:
• Data format needs to be based on Open Standard (problem with SMILES, ok with CML)
• Stereo and aromatic bond information needs to be saved (ok with SDF)
• Format needs to be small in space for millions of compounds (ok with SMILES)
• SMILES notation needs to be unique (problem with SMILES)
• Structure representation should be portable and based on Open Standard (ok with CML)
Chemical Structure Identifiers

Structure identifiers are needed for uniquely identifying structures. They are important for searching chemical structures in text and databases.

[Figure: structure of caffeine]

Structure Name – IUPAC name or common name: 1,3,7-trimethylpurine-2,6-dione
CAS RN – Chemical Abstracts identifier: 58-08-2
PubChem ID – PubChem Compound ID: CID 2519
InChIKey – Short representation of InChI: InChIKey=RYYVLZVUVIJVGH-UHFFFAOYAW
InChI – IUPAC International Chemical Identifier: InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
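An InChI string is built from layers separated by "/". The following minimal Python sketch splits the caffeine InChI into its layers. The function name split_inchi is hypothetical, and the sketch assumes the string has a formula layer directly after the version, which holds for ordinary molecules like this one but not for every InChI.

```python
def split_inchi(inchi: str) -> dict:
    """Split an InChI string into version, formula and prefixed layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    # Assumption: parts[0] is the version and parts[1] the molecular formula.
    layers = {"version": parts[0], "formula": parts[1]}
    for layer in parts[2:]:
        # Later layers carry a one-letter prefix, e.g. c = connections, h = hydrogens.
        layers[layer[0]] = layer[1:]
    return layers


caffeine = "InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"
layers = split_inchi(caffeine)
print(layers["formula"])  # molecular formula layer
print(layers["c"])        # atom connections
print(layers["h"])        # hydrogen positions
```

The formula layer alone is often enough to cross-check a database entry against a measured accurate mass.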
SMILES structure format

Positive:
• Good for storing structures in a single line
• Fast text-based search possible; human readable
Negative:
• Many different SMILES codes exist
• SMILES for the same structure can be different (canonical or unique SMILES needed)

Simple SMILES examples: CCC, CCCC, CCCCO, CCCCN

[Figure: structure of caffeine]
InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3

All of these SMILES codes represent caffeine (source: InChI FAQ):
[c]1([n+]([CH3])[c]([c]2([c]([n+]1[CH3])[n][cH][n+]2[CH3]))[O-])[O-]
CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12
Cn1cnc2n(C)c(=O)n(C)c(=O)c12
Cn1cnc2c1c(=O)n(C)c(=O)n2C
N1(C)C(=O)N(C)C2=C(C1=O)N(C)C=N2
O=C1C2=C(N=CN2C)N(C(=O)N1C)C
CN1C=NC2=C1C(=O)N(C)C(=O)N2C

SDF/MOL structure format
Positive:
• Established standard format; good for storing structures safely
• Can store 3D structure; can store metadata (boiling points, toxicity, mass spectra)
Negative:
• Large file size, needs compression
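To make the layout of a MOL record concrete, here is a minimal Python sketch that reads the counts line, the atom block and the bond block of a V2000 record using fixed-width columns. The function name parse_molblock and the ETHANE constant are hypothetical. The record is an ethane example with all coordinates set to zero, and the sketch ignores properties, charges and multi-record SDF handling.

```python
def parse_molblock(molblock: str):
    """Return (atoms, bonds) from a V2000 MOL block.

    atoms: list of (symbol, x, y, z); bonds: list of (atom1, atom2, order).
    """
    lines = molblock.splitlines()
    counts = lines[3]                       # 4th line is the counts line
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:       # atom block: x, y, z, element symbol
        x, y, z = float(line[0:10]), float(line[10:20]), float(line[20:30])
        atoms.append((line[31:34].strip(), x, y, z))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:   # bond block
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds


ETHANE = "\n".join([
    "",                                          # line 1: molecule name (empty)
    " OpenBabel02240823422D",                    # line 2: creator program and time stamp
    "",                                          # line 3: comment
    "  2  1  0  0  0  0  0  0  0  0999 V2000",   # line 4: 2 atoms, 1 bond
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0",
    "  1  2  1  0  0  0",                        # atom 1 - atom 2, single bond
    "M  END",
])

atoms, bonds = parse_molblock(ETHANE)
print(atoms)
print(bonds)
```

For real work a cheminformatics toolkit is the safer choice; the sketch only shows where creator, coordinates and connections sit in the record.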
Example records (one, two and three carbon atoms; all coordinates are zero):

 OpenBabel02240823422D

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0
M  END
$$$$

 OpenBabel02240823422D

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0
  1  2  1  0  0  0
M  END
$$$$

 OpenBabel02240823422D

  3  2  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0
  1  2  1  0  0  0
  2  3  1  0  0  0
M  END
$$$$

Parts of a record: creator (the OpenBabel line), coordinates for 3D (atom block), connection of atoms (bond block).

Molecules and mass spectra
Close relationship between molecular structure and mass spectra
Molecular structure is reflected in mass spectral features (peaks, peak heights and peak combinations)
Mass spectra reflect a state of gas phase ion physics and chemistry (rearrangements, fragmentations, bond cleavages)
[Figure: three EI mass spectra with structures, each with base peak m/z 130: (mainlib) tert-Butylaminotrimethylsilane; (mainlib) N,N-Diethyl-1,1,1-trimethylsilylamine; (replib) Silanamine, N,1,1,1-tetramethyl-N-[1-methyl-2-phenyl-2-[(trimethylsilyl)oxy]ethyl]-, [S-(R*,R*)]-]
Electron impact (70 eV) mass spectra; Source: NIST05

Molecules and mass spectra
Similar structures may or may not have similar mass spectra
[Figure: head-to-tail comparison of the EI mass spectra of Silanamine, N,1,1,1-tetramethyl-N-[1-methyl-2-phenyl-2-[(trimethylsilyl)oxy]ethyl]-, [S-(R*,R*)]- and N-Methylphenylethanolamine, bis(trimethylsilyl)-]
Electron impact (70 eV) mass spectra; Source: NIST05; Created using structure similarity search in NIST MS Search program

Molecules and mass spectra
Similar mass spectra may or may not have similar structures
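Spectral similarity can be expressed as a number. The sketch below uses a plain cosine (dot-product) score between two centroided spectra. This is a common textbook measure chosen for illustration and is not claimed to be the scoring used by the NIST MS Search program. The function name cosine_score and both peak lists are hypothetical.

```python
import math


def cosine_score(spec_a: dict, spec_b: dict) -> float:
    """Cosine similarity of two spectra given as {nominal m/z: intensity}."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(i * i for i in spec_a.values()))
    norm_b = math.sqrt(sum(i * i for i in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical peak lists, invented for this example only.
spectrum_1 = {41: 60.0, 55: 100.0, 69: 70.0}
spectrum_2 = {41: 80.0, 55: 100.0, 69: 75.0}

score = cosine_score(spectrum_1, spectrum_2)
print(round(score, 3))
```

A score close to 1 only says that the peak lists agree; it does not prove that the structures are the same.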
[Figure: head-to-tail comparison of the EI mass spectra of 1-Tetradecene and Cyclotetradecane]
Electron impact (70 eV) mass spectra; Source: NIST05; Created using spectral similarity search in NIST MS Search program

Mass spectral databases I
Name            Spectra count   Type
NIST05          200,000         electron impact spectra (EI 70 eV)
Wiley 8         400,000         electron impact spectra (EI 70 eV)
Palisade 600K   600,000         electron impact spectra (EI 70 eV)
NIST MS/MS      5,200           MS/MS (ESI, +/-, 30-100V CID)
MassFrontier    7,000           MSn, ESI (Spectral Tree Library)

• Data quality is important
• Annotation with CAS, structure and formula
• Link to literature or publication is useful
• Currently no large ESI, APPI, APCI libraries available (free or commercial)
Mass spectral databases II

Smaller specialized libraries:
• Pfleger Maurer Weber (Drugs), MS+RI, 70eV
• MassFinder (Volatiles), MS+RI, 70eV
• RIZA DB (Toxicants), MS+RI, 70eV
• Golm DB (primary Metabolites), MS+RI, 70eV
• Fiehnlib (primary Metabolites), MS+RI, 70eV
• MassBank (Metabolites), ESI, MSn, accurate masses
• AAFS (Drugs, Forensic, Toxicology), MS+RI, 70eV
• ChemicalSoft (Drugs), MS/MS, MS E

[Figure: EI mass spectrum and structure of Mirex from the RIZA DB (riza_web): RI 2583, KEY 1596, CAS 2385-85-5]

For electron impact (EI) spectra, the same GC column (DB-5, RTX-5, DB-1, OV-1) and temperature program must be used for matching retention indices.

For ESI and APPI spectra (LC-MS), the same mass spectrometer design and setup (triple-quad, ion-trap, TOF, Q-TOF) and the same collision energy should be used.

Searching Molecules on PubChem
18 million compound DB (++)
Go to PubChem Structure Search

CAS SciFinder
• 33 million molecules and 60 million peptides/proteins
• largest reaction DB (14 million reactions) and literature DB
• substructure and similarity search of structures
• a must for chemists and biochemists/biologists
• no bulk download, no good import/export, no link-outs
Structure search in SciFinder

Retrieved 4000 papers (refine search: only MS and MALDI)

Atomic Mass
Correct unit is [u] (unified atomic mass unit) or [Da] (Dalton); see SI units.
1 u = 1 Da = 1/12 of the mass of a carbon-12 (12C) atom = 1.66053886 x 10^-27 kg
[Figure: simulated isotopic pattern of hexachlorobenzene (C6Cl6) at resolving power 1000; relative abundance vs. m/z, main peaks at 281.81, 283.81 (base peak), 285.81, 287.80 and 289.80]

Hexachlorobenzene (C6Cl6)
average mass - 284.7804 u
integer mass - 282.0 u
monoisotopic mass - 281.81312 u
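The three mass definitions for hexachlorobenzene can be reproduced with a few lines of Python. This is a minimal sketch: the element tables hold rounded literature values for a handful of elements only, the helper names parse_formula and formula_mass are hypothetical, and the formula parser handles plain formulas without brackets.

```python
import re

# Mass of the most abundant isotope (u) and average atomic weight (u).
MONOISOTOPIC = {"H": 1.00782503, "C": 12.0, "N": 14.0030740,
                "O": 15.9949146, "Cl": 34.9688527}
AVERAGE = {"H": 1.00794, "C": 12.0107, "N": 14.0067,
           "O": 15.9994, "Cl": 35.453}


def parse_formula(formula: str) -> dict:
    """Turn 'C6Cl6' into {'C': 6, 'Cl': 6} (no brackets supported)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def formula_mass(formula: str, table: dict) -> float:
    """Sum the per-element masses from the given table."""
    return sum(table[el] * n for el, n in parse_formula(formula).items())


formula = "C6Cl6"
monoisotopic = formula_mass(formula, MONOISOTOPIC)
average = formula_mass(formula, AVERAGE)
# Integer (nominal) mass: most abundant isotopes, rounded to whole numbers.
nominal = sum(round(MONOISOTOPIC[el]) * n
              for el, n in parse_formula(formula).items())

print(f"monoisotopic {monoisotopic:.5f} u")
print(f"average      {average:.3f} u")
print(f"integer      {nominal} u")
```

The average mass depends slightly on which table of atomic weights is used, so the last decimals can differ between programs.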
Always (always) check molecular masses obtained from databases or publications. For mass spectrometry the monoisotopic mass is used.
InChIKey (hexachlorobenzene): CKAPSXZOOQJIBF-UHFFFAOYAV

Mass Accuracy
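Instruments must be calibrated to obtain high mass accuracy. In the case of FT-ICR-MS, mass calibration can be stable over weeks. Post-acquisition mass calibration can be performed if a calibrant was run with the samples. The mass of the electron (about 0.00055 u) amounts to roughly 1 ppm at 500 Da, and to more at lower masses, so it must be taken into account when working at ppm accuracy.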
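The effect of the electron mass can be checked with a short calculation. The sketch below uses the usual relative mass error in parts per million, (m_exp - m_calc) / m_exp x 1E+6, and compares an ideal singly charged cation (one electron removed) with a calculated value that wrongly keeps the neutral mass. The function name ppm_error and the chosen test masses are assumptions made for illustration.

```python
ELECTRON_MASS = 0.00054858026  # u


def ppm_error(m_exp: float, m_calc: float) -> float:
    """Relative mass error in parts per million."""
    return (m_exp - m_calc) / m_exp * 1e6


for neutral_mass in (100.0, 500.0, 1000.0):
    # A cation has lost one electron, so the measured mass is slightly lower
    # than the neutral mass used here (wrongly) as the calculated value.
    measured = neutral_mass - ELECTRON_MASS
    print(neutral_mass, round(ppm_error(measured, neutral_mass), 2), "ppm")
```

At 500 Da the neglected electron costs about 1 ppm, which is the same order as the accuracy of the best instruments.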
Type              Mass Accuracy
FT-ICR-MS         0.1 - 1 ppm
Orbitrap          0.5 - 1 ppm
Magnetic Sector   1 - 2 ppm
TOF-MS            3 - 5 ppm
Q-TOF             3 - 5 ppm
Triple Quad       3 - 5 ppm
Linear IonTrap    50 - 200 ppm (10 ppm in Ultra-Zoom)

ppm = ((m_exp - m_calc) / m_exp) * 1E+6

m(e-) = 0.00054858026 u = mass of electron
m(1H) = 1.0078246 u = mass of the hydrogen atom (1H, proton plus electron), not of the bare proton

Resolving Power
[Figure: Example Solanine (CID=30185), measured at RP = 1700 and at RP = 48,250]

High resolving power is helpful for separation of species with almost the same mass (isobars).

High resolving power cannot be used to distinguish between structural isomers.

Example: C8H10N2O has 100,082,479 isomers.

Isotopic Pattern Generators
Elements can be
a) monoisotopic (F, Na, P, I)
b) polyisotopic (H, C, N, O, S, Cl, Br)
Isotopic pattern generators calculate the isotopic abundances for a given molecular formula. The calculation is very time-consuming; fast implementations are based on Fast Fourier algorithms.
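What an isotopic pattern generator does can be sketched in Python by repeatedly convolving the isotope distributions of the individual atoms. This sketch works at nominal (integer) mass only and uses direct convolution rather than the Fourier-based methods mentioned above. The abundance values are rounded natural abundances, and the names ISOTOPES, convolve and isotope_pattern are hypothetical.

```python
from collections import defaultdict

# Nominal isotope mass -> natural abundance (rounded values).
ISOTOPES = {
    "C": {12: 0.9893, 13: 0.0107},
    "Cl": {35: 0.7576, 37: 0.2424},
}


def convolve(dist_a: dict, dist_b: dict) -> dict:
    """Combine two mass/probability distributions."""
    out = defaultdict(float)
    for mass_a, p_a in dist_a.items():
        for mass_b, p_b in dist_b.items():
            out[mass_a + mass_b] += p_a * p_b
    return dict(out)


def isotope_pattern(counts: dict) -> dict:
    """Relative abundances (base peak = 100) per nominal mass."""
    pattern = {0: 1.0}
    for element, n in counts.items():
        for _ in range(n):                 # add one atom at a time
            pattern = convolve(pattern, ISOTOPES[element])
    top = max(pattern.values())
    return {m: 100.0 * p / top for m, p in sorted(pattern.items())}


pattern = isotope_pattern({"C": 6, "Cl": 6})   # hexachlorobenzene
for nominal_mass, rel in pattern.items():
    if rel >= 1.0:                             # skip very small peaks
        print(nominal_mass, round(rel, 1))
```

For hexachlorobenzene the base peak falls at nominal mass 284, not at the monoisotopic peak at 282, because one 37Cl among six chlorine atoms is the most probable combination.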
Charge states

[Figure: isotopic patterns for charge state 1 and charge state 2]
CID: 3081765 MW = 1125.50082 C50H72N13O15P
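The position of a multiply charged ion follows from dividing the ion mass by the number of charges, and the spacing between isotope peaks shrinks to 1/z. The short Python sketch below computes both for the mass given above, without adduct and, as an assumption for comparison, with one proton per charge. The function name mz and the rounded proton mass are introduced here for illustration.

```python
PROTON_MASS = 1.00727647  # u, bare proton (rounded)


def mz(neutral_mass: float, charge: int, adduct_mass: float = 0.0) -> float:
    """m/z of an ion carrying `charge` adducts of `adduct_mass` each."""
    return (neutral_mass + charge * adduct_mass) / charge


M = 1125.50082  # monoisotopic mass of C50H72N13O15P

for z in (1, 2):
    without_adduct = mz(M, z)
    protonated = mz(M, z, PROTON_MASS)
    print(f"z={z}: m/z {without_adduct:.4f} (no adduct), "
          f"{protonated:.4f} (protonated), isotope spacing {1 / z}")
```

An isotope spacing of 0.5 in a spectrum therefore points to a doubly charged ion.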
Different charge states and peak resolutions

[Figure: simulated isotopic patterns of C50H72N13O15P, relative abundance vs. m/z, resolving power at FWHM, four panels:
 2000 resolving power, charge state 2: peaks at m/z 562.75, 563.25, 563.76, 564.26
 2000 resolving power, charge state 1: peaks at m/z 1125.50, 1126.51, 1127.52, 1128.52
 200,000 resolving power, charge state 2: peaks at m/z 562.75, 563.25, 563.75, 564.25 (spacing 0.5)
 200,000 resolving power, charge state 1: peaks at m/z 1125.50, 1126.50, 1127.51, 1128.51 (spacing 1.0)]
Example of phosphorylated angiotensin isotopic pattern without adduct [M+H]+, simulated by Thermo XCalibur

Molecular Formula Generators
Formula generators are used to create candidate molecular formulae from accurate masses. Input requires 1) the accurate isotopic mass (with or without adduct) and 2) the allowed error in ppm or mDa (millidalton).
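The principle of a formula generator can be shown with a brute-force Python sketch. It is restricted to the elements C, H, N and O, takes a neutral monoisotopic mass and a ppm tolerance, and applies no chemical plausibility rules, so real tools return far better filtered lists. The names generate_formulas and format_formula, the element limits and the caffeine test mass are assumptions made for this example.

```python
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def format_formula(c: int, h: int, n: int, o: int) -> str:
    """Write counts as a formula string, e.g. C8H10N4O2."""
    parts = []
    for symbol, count in (("C", c), ("H", h), ("N", n), ("O", o)):
        if count:
            parts.append(symbol + (str(count) if count > 1 else ""))
    return "".join(parts)


def generate_formulas(target_mass: float, ppm_tol: float, max_counts: dict) -> list:
    """Return all CHNO formulas within ppm_tol of target_mass."""
    tol = target_mass * ppm_tol / 1e6          # tolerance in u
    hits = []
    for c in range(max_counts["C"] + 1):
        for n in range(max_counts["N"] + 1):
            for o in range(max_counts["O"] + 1):
                rest = target_mass - (c * MONO["C"] + n * MONO["N"] + o * MONO["O"])
                if rest < -tol:
                    break                      # more oxygen only makes it worse
                h = round(rest / MONO["H"])    # only the nearest H count can fit
                if not 0 <= h <= max_counts["H"]:
                    continue
                calc = c * MONO["C"] + h * MONO["H"] + n * MONO["N"] + o * MONO["O"]
                if abs(target_mass - calc) <= tol:
                    hits.append(format_formula(c, h, n, o))
    return hits


limits = {"C": 20, "H": 40, "N": 6, "O": 6}
candidates = generate_formulas(194.08038, 5.0, limits)   # caffeine, neutral
print(candidates)
```

Widening the tolerance or adding more elements makes the candidate list grow quickly, which is why mass accuracy matters so much for this step.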
[Figure: formula generator input with accurate mass and mass error; example MWTWIN]

The molecular formula space of small molecules calculated by the Seven Golden Rules
Each molecular formula can expand to billions of structural isomers.
Molecular Formula ≠ Molecular Isomer
http://fiehnlab.ucdavis.edu/projects/Seven_Golden_Rules/

Frequency distribution of molecular formulas
Impact of mass accuracy on number of formulas

Mass accuracy and isotopic pattern
[M+H]+

C45H73NO15 MW = 867.49799

Example: ESI-MS (+) of Solanine on an LTQ
Resolving Power: 1700
Mass Accuracy: 46 ppm
Isotopic Abundance Error: ±1.46%
Isotopic abundances as orthogonal filter

Tasks
(1) Calculate the number of isomers for C12H12
(2) Generate the isotopic pattern for Chlorophyll a and Hexachlorobenzene.
(3) Find the molecular formula for the mass spectrum on the next page. http://www.ch.ic.ac.uk/java/applets/FormToM.html Use H=24, C=24, O=8 and others=4 in the settings. Include S! Use the isotope generator to check which of the possible formulas best fits the observed pattern.
(4) Find the possible molecule(s) in SciFinder and in the National Library of Medicine. Which one is the most likely? http://chem.sis.nlm.nih.gov/chemidplus/ (note: use the formula with hyphen and use the letters alphabetically) https://scifinder.cas.org/scifinder/login.jsf
[Figure: mass spectrum for task (3), peaks labelled Int=100, Int=9, Int=2, Int=20, Int=2]

Web applications
• Isotope calculator: http://yanjunhua.tripod.com/pattern.htm
• Mass to Formula and Formula to Mass: http://www.ch.ic.ac.uk/java/applets/FormToM.html
• Tutorial GC-MS: http://eu.shimadzu.de/products/chromato/gcms/TutorialGCMS/default.aspx

Databases:
• Dictionary of Natural Products (access is limited because of lack of license): http://dnp.chemnetbase.com/dictionary-search.do?method=view&id=2885722&si
• Chemical lookup service: http://cactus.nci.nih.gov/
• SciFinder: this needs to be activated through the university library link.

Good website for mass spectrometry background:
• “The expanding role of Mass spectrometry in Biotechnology”: http://masspec.scripps.edu/book_toc.php