<<

1. Introduction searching queries (i.e., sets of conditions which information must information which conditions of sets (i.e., queries searching In ofcreation the to related issues selected discuss we information. sections, next recovering and searching for tools sophisticated the large amount of gathered data, impose the necessity of developing the globalinterfaceofEntrez. using query search one only entering and formulating by searched by can databases these All databases. related and biological, chemical, 30 searching large from resources provides [1] system Entrez the example, For systems. of development the enables what other, each compatible with are bases these that noting worth is It record. one only even or few a only – groups) research small or laboratories by held ones the (like other million), 25.7 them has ZINC while records, million 33 of Some databases. over has LLC Chemical Fine Aurora (e.g., records of 400 millions provide almost from Substance information PubChem example, contains For datasets. other from compiled hundred thousand). several and one between are substances tested of (sets bioactivity contains Compound PubChem BioAssay contains information about screening PubChem results for databases, andSubstance, PubChem in structures chemical about information other many from compounds) complex and extracts, mixtures, as (such substances and bonds). PubChem Substance contains information about 1000 than (less small about information contain which bases three of consists Pubchem world. the in structures chemical Health (NIH). Currently, it has the largest free of charge dataset of of Institutes National US the at (NCBI) Information Biotechnology forCenter National the by managed is and 2004 in originated base This years. last the over database Pubchem in data of increase the are characterized by a large increase of information. Table 1 shows BioAssay PubChem Compound PubChem Substance PubChem Grzegorz FIC–PhD., (Eng.),e-mail:[email protected] Corresponding author: The aforementioned features of chemical databases, foremost databases, chemical of features aforementioned The information contain databases chemical large of majority A Modern databases which contain structures ­ Values wereobtainedbytheuseofsearchingquery:all[filt] Increase ofPubChemdatabaseinyears2007–2016[mln]. Please citeas:CHEMIK2016,70,8,410–418 of Technology, Rzeszów, Poland Grzegorz FIC*,MariuszSKOMRA, BarbaraDĘBSKA –Faculty ofChemistry, RzeszowUniversity structure andchemicalcompounddatabases Current strategiesforsearchingthrough 29.10.07 0.000667 18.1 21.3 nr 8/2016 •tom 70 16.09.08 0.001188 19.3 44.6 11.11.09 0.001917 25.7 61.2 19.02.10 0.002153 26.1 62.5 18.03.13 0.649147 116.8 46.7 1.112105 6.11.14 178.4 62.0 1.154427 9.02.16 T 216.9 87.2 able 1 divided into: compound chemical through searching databases for Strategies 2. 3. Methodsofenteringstructuralinformation 2. 1. • searching queries,canbeentered inseveralwaysusing: ae o fzy usrcue, D n 3 ceia similarity, chemical 3D and 2D ontology strategies,andstructure-propertiessimilarities. substructures, fuzzy on strategies as based such area, this in solutions newest the to attention fulfil) and strategies of searching through databases. We pay special 5. 4. 3. ot eeal, taeis hc ae urnl ue cn be can used currently are which strategies generally, Most • • • • • • • distinguished: category,be this can In methods MOL. several or SMILES as such structural, which use fields containing structural information properties etc. bibliographical , ier oe MLS 2, nh vcos (IUPAC vectors InChI [2], SMILES code linear Number), EINECS (e.g., number id name, (IUPAC),vernacular represent chemical structures in queries, such as systematic name structure the of for identifiers text essential is which structure, chemical about Information similar propertiestotheseofthedefinedstructure/substructure structure-properties type and search to allow browse basesbytermswhichcharacterizeindividualsubstances and ontologies, some require databases, ontological needed and/orforbiddenpropertiesofsubstances the define which logical) and numerical, textual, (structural, type hybrid erhn fr l ioes fr example (for isomers all for searching Same SkeletoninChemSpider), structural skeleton (for example defined the contain which structures possible all for searching strategy inBioPath databaseofmetabolictransformations), the example (for synthetized be can substructure defined the which from structures – precursors for searching with conjunction screening tests(forexampleintheprocessofdrugdesign), in similarity; for structures of sets create to allows it strategy Substructure structural user-defined fulfill of which conditions structures similar for searching assisted organicsynthesis(searchforbuildingblocks), the in query (Superstructure) defined – this strategy is useful during structure computer the of substructures for searching the query(Substructure), searching for structures containing the substructure defined in in ChemSpiderdatabase), one the with identical defined in the query are (Exact ), including tautomers All ( Tautomers which structures for searching ChemSpider). , where the query is composed of conditions of different of conditions of composed is query the where , , which are used mainly in the chemical and biological and chemical the in mainly used are which , , which use fields like: name, author, producer, author, name, like: fields use which , , searching for compounds which have – these are strings which can which strings are these – Flex, Flexplus in ChemIDplus, CAS Registry Number, Registry CAS l Isomers All Precursor • 415 in science • technique science • technique 416 • 4. Fuzzysubstrcutures no h sac egn’ itrae y seil om r y SMARTS by or notation (Fig.3) [7], whichbelongstoSMILESnotation family. form special a by interface engine’s search the into entered is fuzziness about Information substructures. many but one, not represents substructure fuzzy every Therefore, and values. forbidden allowed of set a has attribute Every value. one than more have can position), type, (bond bonds and/or etc.) chain aliphatic aromatic system, ring, a in position electrons, free of number neighbors, of the attributes which describe atoms’ properties (such as type, number groups of compounds with and similar structures. classes In of fuzzy substructures representation [6] consistent providing of purpose the in introduced was substructures fuzzy of idea The substructures. fuzzy 2). (Fig. way Markush structures, which include “fuzzy” varying fragments, are ancestors of a in or value) one exactly has which them attribute describes every (when way “sharp” a in defined be can They searching queries. Substructures are fragments of chemical structures. • • • Fig. 1.Selectedcaffeinetext identifiers. From above:IUPA InChI, hIKey, SMILES,CAS Registry Number, EINECSNumber many for element important an is “substructure” so-called The information graphical file Internet the in or computer a on saved be can File pdb). smi, cml, mol, as (such formats available the of one in structure about information information structural file computer to theuser’s Internet browser plug-in or of an applet which is send by the server an of form a have can which software special a by engine) (search server database the to send is SMILES) as (such form coded a in graphical structure information N2 atoms). and C2 and atoms C1 and N1 are these 1. Fig. (in symbols atoms’ after numbers identical of pair a by represented are chain end the of the and beginning The ring. every in bond one “cutting” of i.e., process, structure chemical the of linearization the of result a is SMILES databases. structure through searching during used are defined a They isomers. containing differentiate second the while compounds skeleton, structural for characteristic is them of First dashes. by separated blocks, three of consists It algorithm. SHA-256 by information InChI of hash-process [5]). a in exceptions created is It some are there (but representation structure containing information unreadable properties. for humans. It isotopic is an unambiguous and characters 27 of length constant stereochemical with vector a is 1) InChIKey(Fig. or charge to related sublayers more be can There atoms. other with connected atoms ’s connection table, “/h” – sublayer which describe hydrogen the of beginning the denotes “/c” next, comes formula chemical version, InChI the denotes 1S 1): (Fig. prefix a and “/” sign the by InChI) [4]. InChI is made up of several [3], layers Identifier) Chemical International of information separated vial i CeSie.o ad hmcl tutr Lookup Structure Service 2008. Chemical and ChemSpider.com in available search engine(suchasgif, jpg,tif, pdf);thistechnologyiscurrently of scan or figure from picture an article) saved in one of the formats accepted by the handmade a as (such structure the of image CN1C=NC2=C1C(=O)N(C(=O)N2C)C 5(10)7(13)12(3)8(14)11(6)2/h4H InChI=1S/C8H10N4O2/c1–10–4–9–6– – uploaded file which contains the contains which file uploaded – RYYVLZVUVIJVGH-UHFFFAOYSA-N – uploaded file which contains which file uploaded – 1,3,7-trimethylpurine-2,6-dione – drawn structure/substructure or InChIKey (hash version of version (hash InChIKey or 200-362-1 58-08-2 ,1–3H3 C name, 5. Chemicalstructuresimilarity • spatial full or representation whichareusedduringtheevaluationofsimilarities: similarity) 2D of types main three chemical are There similarity). chemical (3D structure (2D structure constitutional chemical defined a similarity measure. and method representation structure proper a needs them of one Every proposed. been have structures chemical fuzzy structure,ontheright–selectedsubstructureswhichfulfilthis (extension ofSMILESnotationwhichallowforsearchingwithfuzzy (C(=O)).(CC)>>C(=CC. (C(=O) Fig. 2.Examplesoffuzzysubstructures.Ontheleft–definition h hs agrtm s sd n a binary a and Afterwards, used . is the algorithm hash in the atoms of number the denotes n where atomic n- to one from substructures, possible all of set the creates process fragmentation structure complete The advance. in defined not is features of set the case, this In disadvantage. this one on based another.on based dissimilar similar very are structures two used the on depends similarity of In effect, results of comparison are not fully objective. The degree impossibility of determination of all compound structural features. of type this of disadvantage major A structure). compound the in present are that features structural of amount the define which numbers are vector of elements (the descriptor 166/322-bit forms: accommodate they OpenBabel, 55/307 substructure types) [9], MACCS [10] – which has different (in FP4 and FP3 [8], features) in 3D. The most common are PubChem fingerprint (881 structure bits and (iii) the selection of elements which describe the structure of meaning (ii) bits, of number (i) by: differ They described. been different Many compound. the in property some of presence the 0) (value excludes or 1) (value confirms element fingerprints methods. representation Structure between similarity of evaluation for methods computer Different Fig.3. ExamplesofsearchingqueriesinSMARTSnotation definition (no.1:sharpsubstructure,withoutfuzziness) [n;R]&*C(=O)O .OCC)>>C(=)CC. [OH]c1ccccc1 bnr vcos ih ifrn lnts wee every where lengths, different with vectors binary – substructures approach) fingerprint Hash Searching forallmoleculeswhichcontain Searching forallmoleculeswithcontain a carboxylicgroupandnitrogenatom and a benzeneringwithhydroxylgroup attempt to eliminate to attempt fingerprints nr 8/2016 •tom 70 – it is possible that possible is it – fingerprint Searching foranintermolecular Searching foranintramolecular 6/2-lmns number 166/322-elements in somearomaticring esterification reaction esterification reaction s generated is fingerprint hy osdr only consider They is the is fingerprint have fingerprints and fingerprint Euclidean distance (ED), where coefficients which allow to consider asymmetry. Another measure is the where TS of version modified a is (TI) Tversky of index asymmetrical activity.The biological similar example, for have, similarity of two chemical structures. It is very likely that such structure that they have the same not indicate that the compared structures are identical, but only states does TS=1 value that here mention to needs One pairs. bit [1,1] of absent in the second structure), amount of features represented in the (TS): similarity Tanimoto by which similarity evaluate them, to used are of Some structures. chemical two between similarity • • to mentionthefollowing: 3D molecules. worth in especially is compared it similarity, 3D atoms measuring of of methods of all Among conformations position different the for know to coordinates necessary is it similarity, • • known [14]. also are above, described ones the from different measures, similarity chemical Another descriptors. on based are which methods in mostly measure a as used is ED structures. between similarity the is smaller the is, value ED the greater The B. and A structures two for elements Fig. 4.Definitionsofselectedchemicalsimilaritymeasureswhichare (Comparative Molecular SimilarityIndexAnalysis), other),for and example CoMFA density (Comparative Molecular Field Analysis) or CoMSIA electron potential field, shape (electrostatic geometrical similarity field, field molecular of evaluation bonds angles,distancesbetween atoms), as (such 3D in angles and distances to related descriptors of use similarity. chemical 3D measures. Similarity chemical structure,forexampleBCUTdescriptors[13]. molecular eigenvalues – one numerical value represents the whole features ofthemolecule which structural specified of the of more each or one quantitatively values, represents numerical – [12] descriptors molecular of bondsfrominitialstate). of neighbors in another layers (which are measured by the number range of features is expanded and takes into account the influence hash generation new of group – [11] ECFP atoms, 1–7 containing fragments to limited is fragmentation hash used commonly most The inadvisable. rather but allowed is situation more than one structure element (so called “bits’ collision” ). That substructures. As a result of hash process some bits can of represent number the of independently elements, of number same the bits). 512 or 1024 of length has usually (it r: P (pnae) hr structure where (OpenBabel) FP2 are: fingerprints nr 8/2016 •tom 70 fingerprints. TS values above 0.85 indicate high is the amount of [1,0] bit pairs (i.e., the (i.e., pairs bit [1,0] of amount the is a hy ecie n qatttv wy the way quantitative a in describe They used intext In order to measure the 3D chemical 3D the measure to order In b – amount of [0,1] bit pairs, a i

and , are shown in Figure 4. Figure in shown are fingerprints, b fingerprint of one structure, but i

are the where the initial the where fingerprints i-values has always has Fingerprint and a of fingerprints are weight are b c – amount Tanimoto (T fields molecular for version modified the in similarity the assess to made and CT≥0.5. ST when similar are conformers two that supposes algorithm where CT is determined in every point of this superposition. PubChem measure: similarity shape 3D a is Tanimoto(ST) Shape [16]. 4) (Fig. measures the in B and A for some feature, some for their common part (V two-stage a in determined that such conformers two are for superposition 3D the when (i) process: CT and ST different. are features 7. Ontologicalstrategies 6. Structure-propertiesstrategies Fig. 4): A and B which are absent in their common superposition, common their in absent are which B and A oos n acpos ctos ain, yrpoiiy rns C is a CT sum over features: rings. hydrophobicity, anions, cations, acceptors, and donors bonds hydrogen features: structural six for measure compatibility a is TanimotoColor superposition. their in B i A of volume common (CT) hB otlg (i. ) s sn tre uotlge: i chemical (i) subontologies: three using is 5) (Fig. ontology ChEBI example. an as show be can ChEMBL, database compounds bioactive (Chemical Entities of Biological Interest) ChEBI [18] ontology, hierarchization. which is a part order of and categorization In use descriptions. ontologies in that do unambiguity to the ensure to is task key Its problem. Ontology is set of terms which describe some field of science. this to solution a as proposed been has ontologies of Creation results. research the of analysis and comparison with problems have world the of parts different from scientist result, a As discoveries. new their scientist from whole world create their own nomenclature for describing engine ChemSpider.com. search chemical the in example, minute). for implemented, 1 is method than LASSO less takes compounds mln 1 through scan the that say (authors networks neural using provided be can activity biological highly similar. Fast selection of compounds which may have a specified or identical are descriptors LASSO their when similar are molecules with places There are 23 distinguished surface points, such as hydrophobic places, properties. specified with surface compound is on element points of every number which a in vector a is this – evaluated is descriptor LASSO the database the in molecule every For proteins). same the with bonds surfaces similar with (compounds properties similar have Order) method [17]: two ligands Similarity have Surface similar in activity Activity (Ligand when LASSO their is surfaces method An this activity.of example biological as such properties, similar with but diversity, chemical high with sets of creation the in importance particular of is similarity,and different, low.are is methods, classical by assessed This this to According structures their when even similar proposed. be can molecules two framework, also were properties the on based is similarity where methods structures, their of comparison and analysis • • As an effect of dynamic development of different fields of science, of fields different of development dynamic of effect an As on based similarity chemical assess which methods to addition In molecular moments comparison, for example CoMMA ( onformers’ 3D similarity measures. similarity 3D Conformers’ descriptors. parameter, spherical Weighted( TaftWHIM groups), substituent Molecular) Invariant Holistic volume, Waalsof description quantitative a for allow (which parameters STERIMOL shapes, der molecule van on based example are which descriptors of use center, center,mass to regard in moments molecular center,charge dipol between descriptors include which Analysis) Moments Molecular X iA V and AA fet, lcs ih yrgn od dnr ec Two etc. donors bonds hydrogen with places effects, p , V X BB i-th iB are volumes of conformer’s fragments for molecules for fragments conformer’s of volumes are are values of some attribute of compared molecules V element of the field. PubChem database uses two uses database PubChem field. the of element AA V uniaie D iiaiy 1] Oe f hm is them of One [15]. similarity 3D quantitative AB and AB ) is maximal is created and ST is determined, (ii) – volume of fragments which are in accordance V BB – volume of fragments on which given which on fragments of volume – Special measures were measures Special Comparative V AB • 417 is the is 0.8 ≥ 3 S in S

science • technique science • technique 418 • Literature 8. Summary – arenecarboxamide CHEBI:22645 CHEBI:22702 benzamides–CHEBI:7496nelfinavir – amide acid monocarboxylic organooxygen compound CHEBI:36963 – – CHEBI:37622 carboxamide compound – organochalcogen CHEBI:29347 CHEBI:36962 – entity – entity molecular heteroorganic CHEBI:33285 – entity molecular organic CHEBI:50860 group carbon CHEBI:33582 – entity molecular p-block CHEBI:33675 – entity molecular group main CHEBI:33579 than smaller elements atoms (e.g.,electrons,photons,nucleons). classifies which particles subatomic (iii) and acceptor,(e.g., role chemical their on or fuels), donor, ligand) solvent, coenzymes, hormones), on the usage by people (e.g., pesticides, drugs, classify compounds based on their biological functions (e.g., antibiotics, and composition structure (e.g., hydrocarbons, carboxylic their acids, amins), (ii) roles which on based compounds classify which entities specified proteinandsmallforanothergroup. (using allows strategy ChemSpider.com of database) to find molecules with large bioactivity for kind that currently, Even promising. very ontologies (including ChEBIdescribedearlier). different several provides it as area, this in exception an is PubChem engines. search few a only in implemented been has and popular yet not is tool This properties. biological or chemical similar 3D similarity(bothareavailableinPubChem). tool for this type of work. supporting Clusterization can be done useful based on 2D very and a is structures, similar in contain which grouped clusters are structures which in Clusterization, databases. large for evaluation of chemical similarity and strategies for searching through algorithms notably methods, cheminformatical of development the of effect an is chemical This properties. desirable with drugs) as (such compounds designing of process the in are screening which virtual for structures, necessary compounds with libraries virtual of creation the is databases compounds chemical in searching for tools different assisted organicsynthesis[19]). Superstructure or fuzzy substructures), another have only specific usage (for example similarity chemical on based strategies example (for chemists by used 5. 4. 3. 2. 1. protease inhibitor. Fulltreecontainsc.a.50branches:http://www.ebi. ac.uk/chebi/chebiOntology.do?chebiId=CHEBI:7496&treeView=tru- Fig. 5.SelectedbranchfromChEBIontologytreeofnelfinavir–HIV CHEBI:24431 chemical entity – CHEBI:23367 molecular entity – entity molecular CHEBI:23367 – entity chemical CHEBI:24431 chikey-collision-is-discovered-and-not-based-on-stereochemistry/. http://www.chemconnector.com/2011/09/01/an-in- 2011, reochemistry, Ste- on Based NOT and Discovered is Collision InChIkey An A.: Williams jcheminf.springeropen.com/articles/10.1186/s13321–015–0068–4. IUPAC International Chemical Identifier, J. Cheminform. 2015, 7, 23, http:// Heller S. R., McNaught A., Pletnev I., Stein S., Tchekhovskoi D.: InChI, the springeropen.com/articles/10.1186/1758–2946–5-7. http://jcheminf. 7, 5, 2013, Cheminform. J. standard, identifier structure chemical worldwide the – InChI I. Pletnev TchekhovskoiD., S., Stein A., iupac.org/home/publications/e-resources/inchi.html; http://www. 2012, IUPAC, Identifier, Chemical International IUPAC The smiles.html. http://www.daylight.com/dayhtml/doc/theory/theory.2011, Inc. Systems, Information Chemical Daylight Language, Chemical Simplified A – SMILES gquery. Entrez, the Life Sciences Search Engine; http://www.ncbi.nlm.nih.gov/sites/ seems strategies structure-properties of development Future with compounds for search rapidly can strategies Ontological of applications current important most the of one that seems It often more and more are here described strategies the of Some strategies which are used in computer in used are which strategies Precursors and e#vizualisation (17.02.2016) eller S., McNaught S., Heller 19. 18. 17. 16. 15. 14. 13. 12. 11. 10. 9. 8. 7. 6. systems forchemicalsciencesandrelated, e-learning. intelligent chemometrics, – interests Research publications. scientific 187 of awarded the Knight’s Cross of the Order of the Rebirth of Polish. The author was She Bioinformatics). and Biotechnology of Technology(Department of University Rzeszów at Professor Associate informatics. chemical and nology tech- chemical in sciences technical in DSc. Ph.D., Engineering); of (School assisted organicsynthesis),computerprocessingofstructuralinformation. (computer CAOS – interests Research Chemistry). of (Department nology Techof - University Rzeszów at student Ph.D. Technology(biotechnology). sis), computerprocessingofstructuralinformation,bioinformatics. synthe- organic assisted (computer CAOS – interests Research related. and blications, information systems and computer programs for chemical sciences Department of Biotechnology and Bioinformatics). Author of 77 scientific pu- Chemistry,Technologyof of (Faculty University Rzeszów in lecturer Senior University). (Jagiellonian chemistry in Ph.D. technology); (chemical nology e-mail: [email protected], phone:+4817–865–1358 Rzeszow in WSI of graduate a (Eng.), D.Sc., Ph.D., – DĘBSKA Barbara e-mail: [email protected],phone:+4817–865–1838 of University Rzeszów of graduate a (Eng.), M.Sc., – SKOMRA Mariusz e-mail: [email protected],phone:+4817–865–1838 *Grzegorz FIC – PhD., (Eng.), a graduate of Rzeszów University of Tech- raizyh sa atan i irni owj, re. hm 2011, Chem. Przem. 90(1), 78–83. rozwoju, kierunki i aktualny stan – organicznych syntez wieloetapowych projektowanie Komputerowe G.: Fic G., Nowak www.ncbi.nlm.nih.gov/pmc/articles/PMC3531142/. http:// issue):D456–63, 41(Database 2013, Res. Acids Nucleic 2013, for enhancements chemistry: relevant biologically for ontology and database shnan V, Owen G, Turner S, Williams M, Steinbeck C. The ChEBI reference Hastings J., de Matos P., Dekker A., Ennis M., Harsha B., Kale N.: Muthukri- Aided Mol.Des.2008,22(6–7),479–87. ce similarity order: a new tool for ligand based virtual screening, J. Comput. Reid D., Sadjad B. S., Zsoldos Z., Simon A.: activity LASSO-ligand by surfa- J. conformers. Cheminform. 2011,3,13,http://www.jcheminf.com/content/3/1/13. Similar PubChem3D: H.: S. Bryant S., Kim E., E. Bolton ley &Sons,2008. Todeschini R.,ConsonniV.: HandbookofMolecularDescriptors,JohnWi- www.daylight.com/dayhtml/doc/theory/theory.finger.html. Subspace Concept,J.Chem.Inf. Comput.Sci.1999,39(1),28–35. Receptor-Relevantthe Validation and Metric M.: K. Smith S., R. Pearlman Wiley-VCH, 2009. TodeschiniV.:Consonni R., Chemoinformatics, for Descriptors Molecular Model. 2010,50(5),742–754. Inf. Chem. J. Fingerprints, Extended-Connectivity M.: Hahn D., Rogers modules/html/MACCSKeys.html. http://www.mayachemtools.org/docs/ 2015, MACCS, MayaChemTools: Open Babelv2.3.0documentation,http://openbabel.org/docs/dev/. fingerprints.pdf. ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_ specification: theory/theory.smarts.html. cal Information Systems, Inc. 2011, http://www.daylight.com/dayhtml/doc/ SMARTS – A Language for Describing Molecular Patterns, Daylight Chemi- 2000, 40,325–329. Sci., Comput Inf. Chem. J. Structures, Chemical in Fragments Molecular Sci. 1994, 19, 21–30; Dębska B., B.: Guzowska-Świder Fuzzy Definition of mical Syntheses by Means of Molecular Graphs, Found. Comput. Decision Hippe Z. S., Fic G., Nowak G.: Representation of Common Sense in Che- nr 8/2016 •tom 70 (Received –14.07.2016r.)