
DIPLOMARBEIT/DIPLOMA THESIS

Titel der Diplomarbeit / Title of the Diploma Thesis “An evaluation of the accuracy of drug-related InChI & InChIKey on ChemSpider, DrugBank, PharmXplorer, PubChem and Wikipedia”

verfasst von / submitted by

Joachim Tscherny

angestrebter akademischer Grad / in partial fulfilment of the requirements for the degree of Magister der Pharmazie (Mag.pharm.)

Wien, 2018 / Vienna, 2018

Studienkennzahl lt. Studienblatt / degree programme code as it appears on the student record sheet: A 449

Studienrichtung lt. Studienblatt / degree programme as it appears on the student record sheet: Diplomstudium Pharmazie

Betreut von / Supervisor: Univ.-Prof. Mag. Dr. Gerhard Ecker


Acknowledgments

First and foremost, I would like to thank my supervisor Gerhard Ecker for the opportunity to realize this project. Through this thesis, I was able to expand my knowledge extensively.

Furthermore, I would like to thank Daniela Digles, who supported me especially at the beginning of my work with the KNIME Analytics Platform. I would also like to express my gratitude to Norbert Haider, who provided the data from PharmXplorer.

I would like to express my appreciation to the developers of the KNIME Analytics Platform; I could not have carried out this type of computational work without access to this software.

I would like to take this opportunity to thank my family, especially Mom, Dad and my sister Katharina, for the continuous and unconditional support they have given me throughout my studies.

And finally, I extend my personal gratitude to Martina for all the love, patience, and guidance she has given me during the last years.

“Iucundi acti labores” (Marcus Tullius Cicero, Brutus 70)


Abstract

Freely available online resources such as ChemSpider, DrugBank, PubChem, and Wikipedia are widely used for obtaining information on drugs. For pharmacy students of the University of Vienna, PharmXplorer is a commonly used source of information. This project investigates whether the drug-related InChIs and InChIKeys are consistent across the databases ChemSpider, DrugBank, PubChem, and Wikipedia. In addition, a gold-standard dataset was created from the results of the consistency test and used to validate the databases ChemSpider, DrugBank, PubChem, PharmXplorer, and Wikipedia.

The workflow tool KNIME Analytics Platform was used to obtain InChI & InChIKey for all drugs approved in Austria from ChemSpider, DrugBank, PubChem, and Wikipedia. The consistency test showed that the total consistency is 79.34%.

The database validation revealed that PubChem performed best with a correctness of 96.59%, followed by DrugBank (96.07%), ChemSpider (93.88%), Wikipedia (92.83%) and PharmXplorer (83.94%).

All in all, whenever international nonproprietary names are used to query InChI & InChIKey automatically in four different databases, the query returns at least two different InChIs & InChIKeys in about 20% of the cases.


Zusammenfassung

Freely available online platforms such as ChemSpider, DrugBank, PubChem, and Wikipedia are frequently used to obtain information on drugs. For pharmacy students of the University of Vienna, PharmXplorer is a frequently used source of information. This project investigates whether the InChIs and InChIKeys belonging to the drugs are consistent across the databases ChemSpider, DrugBank, PubChem, and Wikipedia. Based on the results of the consistency test, a gold-standard dataset was created, which served to validate the databases ChemSpider, DrugBank, PubChem, PharmXplorer, and Wikipedia.

The workflow tool KNIME Analytics Platform was used to obtain the InChIs and InChIKeys of all drugs approved in Austria from ChemSpider, DrugBank, PubChem, and Wikipedia. The consistency test yielded an agreement of 79.34% of the InChIs.

The validation of the databases against the gold-standard dataset showed that PubChem performed best with a correctness of 96.59%, followed by DrugBank (96.07%), ChemSpider (93.88%), Wikipedia (92.83%), and PharmXplorer (83.94%).

When the international nonproprietary name is used to query the corresponding InChI and InChIKey automatically in four different databases, at least two different InChIs and InChIKeys appear in 20% of the cases.


Table of Contents

Acknowledgments
Abstract
Zusammenfassung
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation of the Thesis
1.2 Statement of the problem
1.3 Research Question
1.4 Aim of the thesis
2 Background Methodology
2.1 The Internet – source of information
2.2 Definitions
2.2.1 ATC-Classification System
2.2.2 Molecule Representation
2.2.3 Why InChI & InChIKey
2.3 Databases
2.3.1 ChemSpider
2.3.1.1 Content of a ChemSpider entry
2.3.1.2 Access ChemSpider webservices
2.3.2 DrugBank
2.3.2.1 Content of a DrugBank entry
2.3.2.2 Access DrugBank Data
2.3.3 PharmXplorer
2.3.3.1 PharmXplorer information platform
2.3.4 PubChem
2.3.4.1 Content of a PubChem Compound entry
2.3.4.2 Access PubChem: The PubChem API - PUG REST
2.3.5 Wikipedia
2.3.5.1 Use of drug information
2.3.5.2 How Wikipedia works
2.3.5.3 Is Wikipedia reliable
2.3.5.4 Content of a drug article
2.3.5.5 Use of the Media Wiki API
2.4 Used Tools & Software
2.4.1 Knime
2.4.2 Pywinauto
3 Development of the Methods
3.1 Preparation for retrieval of Drug Information from ChemSpider, DrugBank, PubChem and Wikipedia
3.1.1 Retrieval of international nonproprietary names
3.1.2 Extraction of ATC codes from Austrian Medicinal Product Index
3.2 Retrieval of Drug Information from ChemSpider, DrugBank, PubChem and Wikipedia
3.2.1 Retrieval of Drug Information from ChemSpider
3.2.2 Retrieval of Drug Information from DrugBank
3.2.3 Retrieval of Drug Information from PharmXplorer
3.2.4 Retrieval of Drug Information from PubChem
3.2.5 Retrieval of Drug Information from Wikipedia
3.3 Data preparation for consistency test
3.3.1 Concatenate Data from ChemSpider, DrugBank, PubChem and Wikipedia
3.3.2 InChI to Boolean
3.3.3 Binary Code – 16 Cases
3.4 Consistency Test
3.5 Creating gold-standard dataset
3.6 Comparing InChI related InChIKey from Public Databases versus gold-standard dataset
3.6.1 Comparing InChI related InChIKey from ChemSpider, DrugBank, PubChem & Wikipedia versus gold-standard dataset
3.6.2 Comparing InChI related InChIKey from PharmXplorer versus gold-standard dataset
4 Results and Discussion
4.1 Results of consistency Test
4.2 Validation with gold-standard dataset
4.3 Overview of Errors
5 Conclusion and Outlook
6 References
7 Appendix
7.1 List of Abbreviation
7.2 Scripts
7.2.1 Python script for batch conversion of .skc files to Mol files

List of Figures

Figure 1 L-Alanine in a molfile V2000 format, screenshot from Wikipedia [8]
Figure 2 SMILES string of Ciprofloxacin, adopted and modified from [11]
Figure 3 InChI of (S)-Carboxy(chloro)methyl]azanium from InChI-Trust Technical FAQ [19]
Figure 4 The InChI and InChIKey of (-)-Menthol drawn from [20]
Figure 5 Screenshot of ChemSpider Entry for amlodipine CID 2077 [23]
Figure 6 Screenshot of DrugBank entry for Amlodipine (small part) [25]
Figure 7 PharmXplorer entry of hydrochlorothiazide [27]
Figure 8 Screenshot of PubChem entry for amlodipine (small part) [29]
Figure 9 Conceptual framework (=URL Path) PUG-REST services [30]
Figure 10 Amlodipine as an example of a Wikipedia drug article
Figure 11 Full single-drug template with extended fields [47]
Figure 12 An example of setting up an API query in the MediaWiki API sandbox
Figure 13 Response of MediaWiki API in JSON format (fragment)
Figure 14 Screenshot of ATC-index entry for A10BB from https://www.whocc.no/atc_ddd_index/?code=A10BB
Figure 15 Workflow to retrieve International Nonproprietary Names; cyan highlighted
Figure 16 Inside Metanode_2_ATC-Index- WHO-Crawler
Figure 17 Workflow snippet of Metanode_2_1_WHO_Crawler Part 1
Figure 18 Input of table creator node:
Figure 19 The ready URL’s listed in the “URL-CALL” column
Figure 20 Workflow snippet of Metanode_2_1_WHO_Crawler Part 2
Figure 21 Snippet of XML cell for https://www.whocc.no/atc_ddd_index/?code=A10BB generated via HTMLParser node
Figure 22 Result of XPath Expression //dns:a
Figure 23 Href result of //dns:a/@href
Figure 24 Workflow snippet of Metanode_2_1_WHO_Crawler Part 3
Figure 25 ATC index English
Figure 26 Screenshot from https://aspregister.basg.gv.at
Figure 27 Inside Metanode_3_Austrian Medicinal Product Index: workflow to extract ATC codes
Figure 28 Excel table input with the dataset of Austrian Medicinal Product Index
Figure 29 Workflow to retrieve Drug Information from ChemSpider
Figure 30 Generic Web Service Client settings to retrieve ChemSpiderID’s
Figure 31 Generic Web Service Client settings to retrieve InChI, InChIKey, and Smiles from ChemSpider
Figure 32 Segment from Table of Extracted Drug Information from ChemSpider
Figure 33 Workflow to retrieve Drug Information from DrugBank dataset
Figure 34 Column Filter settings to retrieve DrugBank Database_ID, InChI, InChIKey, Generic name
Figure 35 Segment from Table of Drug Information from DrugBank
Figure 36 Process of retrieving InChI & InChIKey from PharmXplorer
Figure 37 Workflow “Retrieval of Drug Information from PharmXplorer”
Figure 38 Workflow to retrieve Drug Information from PubChem
Figure 39 PUG REST URI for amlodipine to retrieve Drug Information from PubChem
Figure 40 JSON Path node preferences to extract InChI, InChIKey and CID
Figure 41 Segment from Table of Extracted Drug Information from PubChem
Figure 42 Workflow to retrieve Drug Information from Wikipedia
Figure 43 The API query string to retrieve the content of Wikipedia drugbox amlodipine
Figure 44 API response structure for amlodipine
Figure 45 JSON Path node setting to extract “*” property, title & pageid
Figure 46 Extraction of InChI from Wikipedia Drugbox
Figure 47 Extraction of InChIKey from Wikipedia Drugbox
Figure 48 Segment from Table of Extracted Drug information from Wikipedia
Figure 49 Fragment of concatenating table of retrieved data from ChemSpider, DrugBank, PubChem, and Wikipedia
Figure 50 Configuration of Row Splitter node to include standard InChIs
Figure 51 Regular expression to include valid InChIKeys
Figure 52 Fragment of filtered concatenate table with InChI & InChIKey
Figure 53 Fragment of Table with 1,265 Generic names and their related InChI & InChIKey
Figure 54 Screenshot of overview if InChI database entry was found
Figure 55 Workflow overview test for consistency
Figure 56 Consistency test for case 16 – InChIKey included in all four databases
Figure 57 Result table consistency test (fragment)
Figure 58 Flowchart showing the process of generating a gold-standard dataset
Figure 59 Workflow overview comparing InChIKey from ChemSpider versus gold-standard dataset
Figure 60 selected columns to prepare for comparison
Figure 61 Rule Engine node configured to prove if InChIKey ChemSpider matches InChIKey Pharmaquiz gold-standard
Figure 62 chart of total consistency
Figure 63 Chart showing the correctness of InChI on public available databases in relation to the gold-standard dataset
Figure 64 Error categories by percentage


List of Tables

Table 1 Main groups of First Level ATC code [5]
Table 2 ATC-Levels of glibenclamide [6]
Table 3 Multiple SMILES strings of ethanol [12]
Table 4 URI parts of PUG-REST request
Table 5 16 Cases of InChI
Table 6 result of consistency test in absolute numbers
Table 7 result of the correctness of InChI on public available databases in relation to the gold-standard
Table 8 Overview of error type


1 Introduction

1.1 Motivation of the Thesis

Some time ago, I read a Wikipedia drug article about tolvaptan, a selective, competitive vasopressin receptor 2 antagonist. My attention was drawn to the drugbox of the article, and I clicked on the linked DrugBank identifier DB06212, chosen at random. After a few seconds, the DrugBank entry of tolvaptan opened in my web browser, and I inspected the chemical structure shown there. I thought: “Why do I see a specified stereocenter here? According to Wikipedia, tolvaptan is a racemate, a 1:1 mixture of the S and R enantiomers, so there should not be a defined stereocenter.” I scrolled down, compared the InChIs from DrugBank and Wikipedia, and came to the same conclusion: the stereocenter was specified in one and unspecified in the other. The difference was visible simply by comparing the InChI strings. So I thought it would be fascinating to search automatically for a generic name, e.g., tolvaptan, in publicly accessible databases, extract the InChI from each database, and subsequently compare the InChI strings. If the strings differ, there is probably a mistake in at least one of the sources. Through fortunate circumstances, I happened to talk to Professor Gerhard Ecker about this issue, and he immediately suggested that it would be an attractive topic for a master's thesis. That is how it all started. The incentive of this project is therefore to investigate whether the InChIs related to the chemical structures are consistent across publicly accessible databases. A literature search was carried out to check whether a freely available gold-standard database already existed against which the InChI strings in question could be compared.

1.2 Statement of the problem

The initial literature search produced two basic findings: there is no gold standard for pharmaceutical structures, and the quality of chemistry content on the internet varies widely. Williams & Ekins published their study “A quality alert and call for improved curation of public chemistry databases” in Drug Discovery Today in September 2011. As their main result, both researchers state that there is an urgent need to improve the quality of internet chemistry and to limit the errors and wasted effort arising from public online databases, because these databases have become trusted, valuable sources upon which researchers rely for chemical structures and scientific information [1]. In 2012, a study conducted by Akhondi, Kors & Muresan showed that the consistency between chemical identifiers and their corresponding MOL representation varies greatly between data sources (37.2% to 98.5%). Stereochemistry was shown to have a great impact: when it was disregarded, the consistency increased (84.8% to 99.9%) [2].

Williams, Ekins & Tkachenko (2012, p. 686) stated in their publication “Towards a gold standard: regarding quality in public domain chemistry databases and approaches to improving the situation”: “Unfortunately, for chemistry databases there are as yet no agreed upon standards and there is no freely available gold standard structure database which we can yet rely on. Despite the decades of experience that underpin the assembly of commercial molecule databases (e.g. Scifinder, MDDR, among others) primarily depending on skilled staff for curation and data checking, the delivery of online databases commonly appears to focus more on the development of the underlying cheminformatics architecture and platform rather than the delivery of a high quality resource of data.” [3]


1.3 Research Question

The problem described above gives rise to the following research questions:

1. When generic names/INNs of drugs approved in Austria are searched for in publicly accessible databases such as ChemSpider, DrugBank, PubChem, and Wikipedia, how consistent are the corresponding chemical structures?

2. Are the publicly accessible databases more valid than PharmXplorer (e-learning platform of the Universities of Graz, Innsbruck, and Vienna)?

3. Is it possible to extract a gold standard from the combined publicly accessible databases?

1.4 Aim of the thesis

The current thesis aims to investigate the consistency of the structural data in publicly accessible databases as well as the possibility to generate a gold-standard dataset by combining data from several public databases. If a gold standard dataset could be created, this might later be used for composing a chemistry learning application.


2 Background Methodology

2.1 The Internet – source of information

Nowadays people live in a fast-moving world and use the Internet as a quick search tool to get the information they are interested in, instead of consulting scientific books or researchers. According to Maurice de Kunder (https://www.worldwidewebsize.com), there were at least 4.52 billion indexed web pages on Wednesday, February 21st, 2018 [4]. Part of this tremendous amount of information originates from scientists and experts who publish their latest results. However, it may also come from companies that try to persuade people to buy their products, or invalid content may be spread as if it were reliable information. For that reason, there is no certainty that a piece of information originates from a reliable source or has been published by an expert in the respective research area. This becomes problematic when people trust such information on the vital topic of health, and especially on the drugs they take.

2.2 Definitions

2.2.1 ATC-Classification System

The Anatomical Therapeutic Chemical (ATC) Classification System is maintained by the World Health Organization Collaborating Centre (WHOCC) for Drug Statistics Methodology [5] and is used for the classification of active ingredients of drugs. According to the World Health Organization publication “Introduction to Drug Utilization Research” (2003, pp. 32-33), “The ATC classification system divides the drugs into different groups according to the organ or system on which they act and according to their chemical, pharmacological and therapeutic properties. Drugs are classified in groups at five different levels. The drugs are divided into 14 main groups (first level), with two therapeutic/pharmacological subgroups (second and third levels). The fourth level is a therapeutic/pharmacological/chemical subgroup and the fifth level is the chemical substance. The second, third and fourth levels are often used to identify pharmacological subgroups when these are considered to be more appropriate than therapeutic or chemical subgroups” [6].

First level The first level ATC code consists of one letter and indicates the main anatomical group. The 14 main groups of first level ATC code are displayed in Table 1.

Code  Contents
A     Alimentary tract and metabolism
B     Blood and blood forming organs
C     Cardiovascular system
D     Dermatologicals
G     Genito-urinary system and sex hormones
H     Systemic hormonal preparations, excluding sex hormones and insulins
J     Antiinfectives for systemic use
L     Antineoplastic and immunomodulating agents
M     Musculo-skeletal system
N     Nervous system
P     Antiparasitic products, insecticides and repellents
R     Respiratory system
S     Sensory organs
V     Various

Table 1 Main groups of First Level ATC code [5]

Second level The second level consists of two digits and indicates the main therapeutic group.

Third level The third level consists of one letter and indicates the therapeutic/pharmacological subgroup.

Fourth level The fourth level consists of one letter and indicates the chemical/therapeutic /pharmacological subgroup.

Fifth level The fifth level consists of two digits and indicates the chemical substance.

Glibenclamide was chosen as an example to illustrate the structure of the complete ATC classification system, as shown in Table 2.

A         Alimentary tract and metabolism (first level, main anatomical group)
A10       Drugs used in diabetes (second level, main therapeutic group)
A10B      Oral blood-glucose-lowering drugs (third level, therapeutic/pharmacological subgroup)
A10B B    Sulfonamides, urea derivatives (fourth level, chemical/therapeutic/pharmacological subgroup)
A10B B01  Glibenclamide (fifth level, a subgroup for chemical substance)

Table 2 ATC-Levels of glibenclamide [6]
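The level structure described above (one letter, two digits, one letter, one letter, two digits) can be illustrated with a minimal Python sketch. It is not part of the KNIME workflow of this thesis; the function name `split_atc_code` and the returned dictionary keys are chosen for illustration only, and the sketch assumes a complete fifth-level code as input.

```python
import re

# Five ATC levels: letter, two digits, letter, letter, two digits.
ATC_PATTERN = re.compile(r"^([A-Z])(\d{2})([A-Z])([A-Z])(\d{2})$")


def split_atc_code(code):
    """Return the code prefix that identifies each of the five ATC levels."""
    # Codes are sometimes printed with spaces (e.g. "A10B B01"); remove them.
    compact = code.replace(" ", "").upper()
    match = ATC_PATTERN.match(compact)
    if match is None:
        raise ValueError(f"not a complete fifth-level ATC code: {code!r}")
    parts = match.groups()
    names = ["first", "second", "third", "fourth", "fifth"]
    # Each level is the concatenation of all parts up to and including it.
    return {name: "".join(parts[: i + 1]) for i, name in enumerate(names)}


# Glibenclamide, as in Table 2.
levels = split_atc_code("A10B B01")
for name, prefix in levels.items():
    print(name, prefix)
```

For glibenclamide this yields the same prefixes as the rows of Table 2 (A, A10, A10B, A10BB, A10BB01).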


2.2.2 Molecule Representation

Molecules can be represented in a computer-readable format in many ways, and the formats differ significantly. The differences lie mainly in:

• conformation information
• human readability
• length (characters per atom → data size)
• stereochemical information
• uniqueness

Structural Representations

One possibility to represent molecules digitally is the chemical table file (CT file). The chemical table file belongs to the text-based chemical file formats, and the family comprises several file formats. The basic principle of these files is that the information about the atoms, bonds, connectivity, and coordinates of a molecule is stored in the form of a connection table [7]. An example of the MDL Molfile structure of L-alanine is given in Figure 1. Inside the connection table, the stereochemical information is contained in the atom block and the bond block. Because structural representations can carry conformational data, they are unique. Their principal disadvantage is the substantial memory requirement: whether held in memory or stored in a file, a structural representation always takes up considerable storage. Consequently, the use of structural representations as unique identifiers is questionable, especially as the volume of information grows (e.g., more characters per atom). Human readability is not a strength of structural representations either. Although a Molfile is a plain-text format that can be opened with any simple editor, it is very difficult to visualize the molecule in 2D or 3D without special visualization software.


Figure 1 L-Alanine in a molfile V2000 format, screenshot from Wikipedia [8]

String Representations

Several different formats exist for representing molecules as strings. The most widely used string representation is the Simplified Molecular Input Line Entry System (SMILES) format [9]. SMILES strings are human readable and, above all, allow molecules to be represented and exchanged quickly and easily (e.g., by copying a SMILES string from the Wikipedia infobox and pasting it into software such as LigandScout [10], which allows creating three-dimensional pharmacophore models). In addition, anyone can write a SMILES string with a little practice. An example of a SMILES string is given in Figure 2.


Figure 2 SMILES string of Ciprofloxacin, adopted and modified from [11]

As can be seen in Figure 2, a SMILES string normally provides information about atom types, bond orders, and connectivity. The representation of stereochemistry is possible but limited. Although the SMILES string offers many advantages, such as compact data size, its main disadvantage, and the main reason it is not suitable for this project, is the lack of uniqueness. For a given molecule there usually exist many different valid SMILES strings. To demonstrate this issue, different SMILES strings of ethanol are listed in Table 3.


CCO ethanol

OCC ethanol

C(O)C ethanol

[CH3][CH2][OH] ethanol

[H][C]([H])([H])C([H])([H])[O][H] ethanol

Table 3 Multiple SMILES strings of ethanol [12]

Therefore, canonicalization algorithms have been developed to create one preferred, unique SMILES string for each molecule. The problem is that there is no single, universally accepted canonicalization algorithm [13]. Different canonicalization algorithms lead to different SMILES strings for the same molecule, so uniqueness is lost as soon as strings from different tools are compared. Various canonicalization algorithms are provided by software companies such as Daylight Chemical Information Systems, OpenEye Scientific Software, MEDIT, Chemical Computing Group, and MolSoft LLC [14].
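The idea behind canonicalization can be made concrete with a deliberately naive Python sketch. It is a toy under strong assumptions and does not reproduce any of the vendor algorithms named above: it parses only a tiny SMILES subset (single-letter atoms, single bonds, branches in parentheses, no brackets, rings, or explicit hydrogens) and finds a canonical form by brute force over all atom orderings, which is only feasible for very small molecules. The function names are chosen for illustration.

```python
from itertools import permutations


def parse_simple_smiles(smiles):
    """Parse a tiny SMILES subset into atom labels and a set of bonds."""
    atoms, bonds, stack, prev = [], set(), [], None
    for ch in smiles:
        if ch == "(":
            stack.append(prev)          # remember the branch point
        elif ch == ")":
            prev = stack.pop()          # return to the branch point
        elif ch.isalpha() and ch.isupper():
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.add((prev, idx))  # single bond to the previous atom
            prev = idx
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds


def canonical_form(smiles):
    """Return the smallest (labels, edges) pair over all atom renumberings."""
    atoms, bonds = parse_simple_smiles(smiles)
    best = None
    for perm in permutations(range(len(atoms))):
        # perm[old] is the new index assigned to the atom at index old.
        labels = [None] * len(atoms)
        for old, new in enumerate(perm):
            labels[new] = atoms[old]
        edges = tuple(sorted(tuple(sorted((perm[a], perm[b]))) for a, b in bonds))
        candidate = (tuple(labels), edges)
        if best is None or candidate < best:
            best = candidate
    return best


# Three of the ethanol spellings from Table 3 collapse to one canonical form.
ethanol_variants = ["CCO", "OCC", "C(O)C"]
forms = {canonical_form(s) for s in ethanol_variants}
print(len(forms))
```

Because every input is reduced to the same minimal labelling, the three spellings of ethanol become identical, while a different molecule such as dimethyl ether (`COC`) yields a different form. Two tools that pick different "smallest" conventions would still disagree, which is exactly the uniqueness problem described above.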

InChI – International Chemical Identifier

The International Chemical Identifier (InChI) was developed under the auspices of the International Union of Pure and Applied Chemistry (IUPAC) [15]. In addition, NIST [16] and the InChI Trust [17] played a very important role in this project. IUPAC launched the project in 2000 because at that time no open-source, non-proprietary identifier was available that made it possible to link chemical structures via the internet. The core of the InChI project is the InChI algorithm, which converts the chemical structure of a molecule into a unique InChI. To accomplish this, the input structure is transformed into a connection table, and three fundamental steps then produce the unique InChI: normalization, canonicalization, and serialization [18]. A detailed description of the algorithm and other useful content is available at https://www.inchi-trust.org/downloads/.


An example illustrating the parts of which an InChI consists is given in Figure 3.

Figure 3 InChI of [(S)-Carboxy(chloro)methyl]azanium from InChI-Trust Technical FAQ [19]

As shown in Figure 3, the InChI consists of several layers, each providing a different type of information. For instance, the main layer presents the chemical formula and describes how the atoms are connected (canonical numbering), whereas the stereochemical layer provides the stereochemical information. Because of this layered design, the InChI software can in principle generate different InChI strings for the same molecule if different settings are chosen (e.g., accounting for tautomerism or not). To solve this problem, the Standard InChI was launched in 2008. The Standard InChI takes the same details into account as the general InChI but is always generated with a fixed set of options, which gives it the distinct advantage of being truly unique. Whether a string is a general InChI or a Standard InChI can be determined very easily from its prefix: ‘InChI=1/’ (any InChI) or ‘InChI=1S/’ (Standard InChI).
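The layered structure and the prefix check can be sketched in a few lines of Python. This is an illustrative helper, not part of the InChI software; the name `split_inchi` is an assumption, and the sketch assumes that the first layer after the version is the chemical formula, which holds for ordinary compounds such as the ethanol example used here.

```python
def split_inchi(inchi):
    """Split an InChI string into version, Standard flag, and layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("string does not start with 'InChI='")
    version, *segments = inchi[len(prefix):].split("/")
    # A trailing 'S' in the version marks a Standard InChI ("1S" vs "1").
    is_standard = version.endswith("S")
    layers = []
    for position, segment in enumerate(segments):
        if not segment:
            continue
        if position == 0:
            # Assumption: the first layer is the formula and has no prefix letter.
            layers.append(("formula", segment))
        else:
            # All further layers start with a one-letter prefix (c, h, t, m, s, ...).
            layers.append((segment[0], segment[1:]))
    return version, is_standard, layers


# Standard InChI of ethanol.
version, is_standard, layers = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(version, is_standard)
for name, content in layers:
    print(name, content)
```

The layers are returned as a list of pairs rather than a dictionary because a non-standard InChI may repeat a prefix letter in a later part of the string.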

InChI was not designed for human readability and is therefore barely readable for humans, which could be seen as a disadvantage.


Heller, McNaught, Stein, Tchekhovskoi, and Pletnev stated in their publication “InChI - the worldwide chemical structure identifier standard” that InChI is more like a bar code [20].

The length of an InChI increases with the size of the structure. A very large structure with, e.g., many atoms (100+) and many stereocenters results in a very long string, which ultimately makes it impossible to search for it successfully in a search engine. To solve this issue, a shorter unique identifier was needed, and the InChIKey was developed [20]. The InChIKey is a derivative of the InChI based on the SHA-256 hash: a hash algorithm converts the InChI into a compact 27-character string. The InChI and InChIKey of (-)-menthol are displayed in Figure 4.

Figure 4 The InChI and InChIKey of (-)-Menthol drawn from [20]

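As can be seen from Figure 4, the 27-character InChIKey is divided by two hyphens into three blocks. The first block of 14 letters encodes the molecular skeleton (connectivity). In the second block, the first 8 letters encode stereochemistry and isotopes; they are followed by one letter that flags a Standard (S) or non-standard (N) InChI and one letter for the InChI version. In the third block, a single letter indicates the number of protons (N stands for neutral).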
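The block structure lends itself to a simple format check. The following Python sketch validates the shape of an InChIKey with a regular expression and splits it into its blocks. The pattern is an assumption derived from the block layout described above and is not necessarily the expression used later in the KNIME workflow; it checks only the format, not whether the key belongs to any real structure.

```python
import re

# 14 letters, hyphen, 8 letters + Standard flag + version letter, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def split_inchikey(key):
    """Return the blocks of a well-formed InChIKey, or None if malformed."""
    match = INCHIKEY_PATTERN.match(key)
    if match is None:
        return None
    skeleton, stereo, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,        # connectivity hash (block 1)
        "stereo_isotopes": stereo,   # hash of the remaining layers (block 2)
        "standard": flag == "S",     # S = Standard InChI, N = non-standard
        "version": version,
        "protonation": protonation,  # N = neutral (block 3)
    }


# InChIKey of ethanol.
parts = split_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N")
print(parts)
```

Splitting the key this way also makes it possible to compare only the first block when differences in stereochemistry should be ignored.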


2.2.3 Why InChI & InChIKey

The International Chemical Identifier was selected because it is freely available under a license that is even less restrictive than the GNU Lesser General Public License. In addition, the design, layout, and algorithms described above are convincing. By far the most important reason for choosing the InChI is its uniqueness. Although an InChI is not directly intelligible to a human reader, this is not relevant for the purpose of this thesis. The Standard InChI can be converted to the 27-character InChIKey, which makes it possible to check for consistency automatically.
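What such an automatic check could look like is sketched below in Python. This is a simplified illustration and not the KNIME workflow developed in this thesis: the function name `compare_inchikeys` and the `ignore_stereo` option are assumptions, and the keys in the example are invented placeholders that only follow the InChIKey format (the second block `UHFFFAOYSA` is the value that appears when no stereochemical or isotopic information is present).

```python
def compare_inchikeys(keys_by_source, ignore_stereo=False):
    """Return (is_consistent, distinct_keys) for the keys found for one drug.

    keys_by_source maps a database name to an InChIKey, or to None if the
    database returned no entry. Missing entries are ignored.
    """
    found = [key for key in keys_by_source.values() if key]
    if ignore_stereo:
        # Compare only the first block (molecular skeleton).
        found = [key.split("-")[0] for key in found]
    distinct = sorted(set(found))
    return len(distinct) == 1, distinct


# Hypothetical placeholder keys: same skeleton, one source specifies stereo.
example = {
    "ChemSpider": "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "DrugBank": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "PubChem": "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "Wikipedia": None,
}

strict_ok, strict_keys = compare_inchikeys(example)
skeleton_ok, skeleton_keys = compare_inchikeys(example, ignore_stereo=True)
print(strict_ok, strict_keys)
print(skeleton_ok, skeleton_keys)
```

In this made-up example the full keys disagree, but the skeleton blocks agree, which mirrors the kind of stereocenter discrepancy described for tolvaptan in the motivation.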


2.3 Databases

2.3.1 ChemSpider

ChemSpider (www.chemspider.com) is a chemical database that started as a hobby project of the chemist Antony Williams. In 2009, ChemSpider was acquired by the Royal Society of Chemistry, which provided a solid infrastructure for support and access to even more data. ChemSpider combines datasets from more than 400 sources into ChemSpider entries that are primarily based on the chemical structure. As far as possible, the origin of the data is always stated. Among others, ChemSpider has combined data from Wikipedia, PubChem, Chemical Entities of Biological Interest (ChEBI), and the Kyoto Encyclopedia of Genes and Genomes (KEGG) [21]. As of July 2018, the ChemSpider homepage states that the database contains 67 million chemical structures from approximately 250 unique data sources [22]. According to Williams, some chemists describe ChemSpider as the Google for chemistry and a Wikipedia for chemists [21]. ChemSpider offers a standard search and an advanced search option to access the corresponding ChemSpider entry. The standard search allows searching for trade names, synonyms, registry numbers, and systematic names. The advanced search is much more extensive: one can draw a chemical structure or search by identifiers such as the InChI or InChIKey, by molecular weight, by calculated properties such as logP, and much more.

2.3.1.1 Content of a ChemSpider entry

Whether it is reached through the search or through a cross-link, every ChemSpider entry has a unique ChemSpider ID. As mentioned above, the focus of ChemSpider is the chemical structure; accordingly, an image of the structure is shown on the left side of the entry. The basic information is placed in the upper part for clarity. It includes the molecular formula, average mass, monoisotopic mass, and the ChemSpider ID. A screenshot of the ChemSpider entry of amlodipine is shown in Figure 5.


Figure 5 Screenshot of ChemSpider Entry for amlodipine CID 2077 [23]

As can be seen in Figure 5, a box with further details contains the InChI and InChIKey next to the systematic name and the SMILES string. Below it, there are several tabs such as names and identifiers, properties, searches, spectra, articles, and vendors. For example, the vendors tab lists very precisely where the data came from.

2.3.1.2 Access ChemSpider webservices

To gain access to the ChemSpider content from KNIME, the web services described via the Web Services Description Language (WSDL) are the medium of choice. The ChemSpider blog provides a user manual at http://www.chemspider.com/blog/how-to-use-chemspider-webservices-from-knime.html. Essentially, two things are needed: the WSDL file with the specifications of the services and, for certain operations with limited access, an API key. The WSDL file can be downloaded directly from http://www.chemspider.com/blog/wp-content/uploads/2011/12/chemspiderSearchWSDL_no_soapencArray.zip.

The access token can be requested from the Royal Society of Chemistry. A more detailed description can be found in the methods section.


2.3.2 DrugBank

DrugBank is a web-enabled drug database that is freely available at www.drugbank.ca. As the name implies, DrugBank specializes in pharmaceuticals and includes extensive data on molecular information, mechanisms of action, interactions, and targets. DrugBank 1.0 was released in 2006, and since then the database has grown steadily. In 2018, DrugBank 5.0 contains all approved or previously approved drugs in Canada and the United States (2,358), as well as investigational drugs in Phase I, II, or III (4,501) [24].

2.3.2.1 Content of a DrugBank entry

Every DrugBank entry, whether reached through the search or a cross-link, has a unique DrugBank ID (accession number) starting with DB followed by five digits. The focus of DrugBank is on the pharmaceutical information. The beginning of an entry shows whether the drug is a small molecule or a biologic, its approval status, and a brief explanation of the mechanism of action. This is complemented by a picture of the structure and a list of other names (synonyms). The next section contains extensive information on product ingredients, product images, prescription products, generic prescription products, mixture products, unapproved/other products, international/other brands, and categories. It is followed by the section that is relevant for this project: the chemical structure information, which is visible in the example of amlodipine in Figure 6 [25].


Figure 6 Screenshot of DrugBank entry for Amlodipine (small part) [25]

Furthermore, the DrugBank entry provides extensive information on , interactions, clinical studies, pharmacoeconomics, properties, targets, etc. In this project, however, the focus is on the InChI.

2.3.2.2 Access DrugBank Data

Although DrugBank offers an API, it is not available for free. Fortunately, DrugBank offers records for download as SDF. These DrugBank datasets are released under a Creative Commons Attribution-NonCommercial 4.0 International License. The data was therefore extracted from the freely available SDF file, as described in detail in the methods section.


2.3.3 PharmXplorer

PharmXplorer is an online multifunctional learning management system which provides a database with comprehensive pharmaceutically relevant information. The project was launched in 2003 by the Universities of Graz, Innsbruck, and Vienna as part of the "New Media in Education" campaign of the Ministry of Education [26].

PharmXplorer is divided into different platforms that access the same databases and are integrated into an overall system:

• Information platform
• Learning platform
• Continuing education platform

In the case of this project, only the Information platform for approved drugs is relevant.

2.3.3.1 PharmXplorer information platform

The PharmXplorer information platform provides information about drugs approved in Austria, including chemical, physical, pharmaceutical, and pharmacological properties. As can be seen from the PharmXplorer entry of hydrochlorothiazide in Figure 7, an image of the chemical structure and an ATC code are present. Several different search queries are available to reach the desired drug entry: it is possible to search directly for the drug name, by indication group, or by ATC code [27].


Figure 7 PharmXplorer entry of hydrochlorothiazide [27]

A point that distinguishes PharmXplorer from all other databases is that students provide part of the content. At the University of Vienna, pharmacy students of pharmaceutical chemistry participate in a seminar in which they are divided into groups to expand the entry of a certain drug and keep it up to date. The contents are reviewed mutually by the students and then uploaded to the platform. PharmXplorer contains neither InChI nor InChIKey and does not have an application programming interface, so the retrieval of drug information from PharmXplorer was fundamentally different (detailed in the methods part).


2.3.4 PubChem

PubChem is a publicly accessible database containing chemical substances and their biological activities. PubChem was launched in 2004 as a component of the Molecular Libraries Roadmap Initiatives of the US National Institutes of Health (NIH), and in 2015 it had more than 60 million unique chemical structure entries. On top of that, there are over one million biological assay descriptions, covering about ten thousand unique protein target sequences. In general, PubChem is made up of three interconnected databases:

• PubChem Substance
• PubChem Compound
• PubChem BioAssay

PubChem Substance contains chemical information, while PubChem BioAssay contains biological activity data. Both databases are filled with data from more than 350 contributors. The sources are listed at https://pubchem.ncbi.nlm.nih.gov/sources. The PubChem Compound database contains only unique chemical structures, which are generated from the other two databases by a standardization procedure. More information about the standardization procedure can be found in the publication “PubChem Substance and Compound Databases” by Kim et al. (Nucleic Acids Research, 2016, Vol. 44) [28]. The PubChem Compound database was used for this project because it offers all the necessary information.

2.3.4.1 Content of a PubChem Compound entry

Every PubChem Compound entry, whether reached through the search or a cross-link, has a unique PubChem Compound Identifier (CID). The entry starts with a brief overview of basic information such as the CID, chemical names, molecular formula, molecular weight, and InChIKey, as well as a brief description of how the substance is used, if known.


In addition, a picture of the chemical structure is displayed (2D structure as well as 3D conformer). Figure 8 shows part of the PubChem Compound entry of amlodipine.

Figure 8 Screenshot of PubChem entry for amlodipine (small part) [29]

As can be seen from the table of contents in Figure 8, the information provided by PubChem is very extensive, ranging from item 1 "2D structure" to item 18 "Information source". Furthermore, it can be seen that the information needed for this project, the InChI, is found under item 3 "names and identifiers".

2.3.4.2 Access PubChem: The PubChem API - PUG REST

With PUG (Power User Gateway) REST, PubChem offers a web interface for accessing PubChem data and services. The documentation of the PubChem API is available at https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest. A tutorial document on PUG-REST is also available at https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest-tutorial. The conceptual framework (= URL path) of the PUG-REST services is displayed in Figure 9 [30]. PUG-REST is built around three unique PubChem database identifiers: SID for substances, CID for compounds, and AID for assays. A PUG-REST request consists of prolog, input, operation, and output, and is submitted to the PubChem server. In the first step, the input part is converted into a unique database identifier. If this step is successful, the operation step can be executed. During the operation step, different tasks can be processed, and the results are returned to the requester in the desired output format (examples are shown in Figure 9). If the request cannot be completed successfully, error messages are returned.

Figure 9 Conceptual framework (=URL Path) PUG-REST services [30]
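The four-part request structure described above can be illustrated with a minimal Python sketch. The prolog URL and the property names InChI and InChIKey follow the public PUG-REST documentation rather than this thesis, and the helper name build_pug_rest_url is hypothetical; the sketch only assembles the URL and does not send a request.

```python
from urllib.parse import quote

# Prolog as given in the public PUG-REST documentation (not taken from the thesis).
PUG_REST_PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_pug_rest_url(domain, namespace, identifier, operation, output):
    """Assemble prolog / input / operation / output into one request URL."""
    # The input part consists of domain, namespace and the (URL-encoded) identifier.
    input_part = f"{domain}/{namespace}/{quote(str(identifier), safe='')}"
    return f"{PUG_REST_PROLOG}/{input_part}/{operation}/{output}"


# Hypothetical example: ask for InChI and InChIKey of the compound named amlodipine.
url = build_pug_rest_url(
    "compound", "name", "amlodipine", "property/InChI,InChIKey", "JSON"
)
print(url)
```

Pasting the printed URL into a browser would submit the request; the server then resolves the name to a CID before executing the property operation.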


2.3.5 Wikipedia

According to its own entry (accessed on February 21st, 2018), Wikipedia is defined as follows: “Wikipedia is a multilingual, web-based, free-content encyclopedia project supported by the Wikimedia Foundation and based on a model of openly editable content. Wikipedia is the largest and most popular general reference work on the Internet and is named as one of the most popular websites. The project is owned by the Wikimedia Foundation, a non-profit which "operates on whatever monies it receives from its annual fund drive"” [31].

2.3.5.1 Use of drug information

Which search engines are used for searching information about drugs? Laurent & Vickers (2009) evaluated the ranking of different data sources in the search results of a number of well-known search engines such as Google [32], Google UK [33], Yahoo [34] and MSN [35]. The English Wikipedia was ranked among the first ten results in 75 to 85% of search engines and keywords, which were taken from MedlinePlus, NHS Direct Online, and the National Organization of Rare Diseases.

Another result was that Wikipedia ranked higher than MedlinePlus and NHS Direct Online in most cases, except when Google UK was used. Furthermore, it was shown that the Wikipedia pages corresponding to the search results were viewed more often than the MedlinePlus topics [36].

Two years later, Law, Mintzes & Morgan (2011) published a study on the sources of online information about prescription drugs. They evaluated the search results of Bing [37], Google Canada & USA and Yahoo for the names of the most dispensed prescription drugs in the US (brand and generic names). On Google USA, three quarters of the first results for both brand and generic names linked to the National Library of Medicine. In contrast, Wikipedia was the first result for about 80% on the other three search engines (Bing, Google Canada, and Yahoo). On these sites, more than two thirds of brand name searches led to industry-sponsored sites. Opiates, benzodiazepines, , and were the Wikipedia pages with the highest number of hits and visits. The authors state that Wikipedia and the National Library of Medicine are highly ranked in online drug searches. They also suggested that efforts to improve the quality of online data should focus on the drugs for which Internet users search the most [38]. To sum up, both studies showed that Wikipedia is a widely accessed health information source.

Brokowski & Sheehan (2009) published a study which showed that about 80% of pharmacists use the internet as a source of drug information. Between February 2nd and March 14th, 2009, 1,067 questionnaires were completed and sent back to the authors. Eleven questionnaires had to be excluded because they came from pharmacy students. 35% of practicing pharmacists used Wikipedia to obtain drug-related information, and 19% stated that they considered Wikipedia a trustworthy source. 12% would recommend Wikipedia to other pharmacists and 7% would recommend it to consumers or patients. The majority of pharmacists reported that they use Wikipedia to look up the indications of drugs. The authors expressed concern that 28% of the surveyed pharmacists had no knowledge of how Wikipedia content is created [40]. From the three studies mentioned above, it can be deduced that a great number of people (medical students, pharmacists etc.) rely on drug-related information from Wikipedia. It is therefore essential that the data it contains is correct, valid and up to date.

2.3.5.2 How Wikipedia works

Wikipedia emerged from an international project that aims to collaboratively create an encyclopedia containing the entirety of human knowledge [41].

Everybody accessing the Wikipedia website can create and modify specific content. This approach is called collaborative editing. Today, Wikipedia is available in 298 languages, thirteen of which contain over one million articles. The English Wikipedia ranks first on the list with almost 5.6 million articles (11.7%) [42].


2.3.5.3 Is Wikipedia reliable

As previously stated, a great number of people use Wikipedia to find information on topics such as health, disease, and drugs. If users rely on this website for such vital issues, any lack of confidence in the data gives rise to concern. Numerous studies have been published on this topic, each using a different approach and method to evaluate the quality of Wikipedia data.

Koppen, Phillips & Papageorgiou (2015) analyzed the reference sources which are quoted at the bottom of each Wikipedia article. They pooled a small sample of 22 drugs which appeared on the Food and Drug Administration’s (FDA) MedWatch [43] website during a pre-defined period of seven months. Drugs are featured on MedWatch when concerns about their safety are reported or when drugs are recalled [44]. The assumption was therefore that any piece of information reported on MedWatch is relevant and important. Accordingly, the study measured how long it took until the MedWatch citation was added to the references of the article, that is, how long it took for the content of the article to be updated with the newly available information. On average, these references appeared 5.9 days after the MedWatch alert was published. Wikipedia contributors most commonly cited peer-reviewed journal articles (49.2%) and newspaper articles (12.0%). The study also pointed out the lack of evidence-based guidelines in the references, although these contain important drug information for physicians. The authors concluded that, due to the nature of the information sources referenced, Wikipedia could be useful to healthcare providers in gaining insight into the amount and nature of the information that their patients may access. However, this information may not be up to date [45].

Kraenbring et al. (2014) analyzed the accuracy and completeness of drug information in Wikipedia (German and English versions) in comparison to standard textbooks of pharmacology. Additionally, references, revisions, and readability were evaluated. For this purpose, 100 curricular drugs were retrieved from German standard textbooks of general pharmacology and compared with the Wikipedia articles in both languages. The accuracy of drug information in Wikipedia was 99.7% (± 0.2%) compared to the textbook data in both languages. The degree of completeness, meaning the share of information in the standard textbooks that is also present on Wikipedia, differed between the two languages: it was 83.8% for the German and 91.1% for the English drug pages. The quoted references were also evaluated: peer-reviewed academic publications (24.5%), textbooks (22.9%) and scientific databases and prescribing information (36.4%) were cited most commonly, and potentially biased sources such as news media (online or print) accounted for only 3.2% of the references. The authors suggested that Wikipedia is an accurate and comprehensive source of drug-related information for undergraduate medical education [46].

2.3.5.4 Content of a drug article

Independent of the language, all Wikipedia pages are structured in a similar way. Figure 10 shows an example of a Wikipedia page.

Figure 10 Amlodipine as an example of a Wikipedia drug article

The main menu is located on the left-hand side, where different options are available. Under the header Article, the main text of the article is found in the middle of the web page. Most drug pages contain a table called infobox on the right-hand side. For chemicals, this box is called chembox, whereas for a drug it is referred to as drugbox or infobox drug. The drugbox was used in the current thesis to obtain drug information.


The drugbox is structured into subsections, including: structural representation of the molecule, IUPAC name, clinical data (e.g. tradename, link to Drugs.com, routes of administration, legal status), pharmacokinetic data (e.g. bioavailability, elimination_half-life), identifiers (e.g. CAS number, ATC code), and chemical and physical data (e.g. molecular formula, molecular weight, InChI, InChIKey, density, melting point). A full single-drug template with extended fields is shown in Figure 11. For this thesis, InChI and InChIKey were of major importance.

{{Infobox drug | drug_name = | INN = | type = | IUPAC_name = | image = | width = | alt = | image2 = | width2 = | alt2 = | imageL = | widthL = | altL = | imageR = | widthR = | altR = | caption = | pronounce = | tradename = | Drugs.com = | MedlinePlus = | licence_CA = | licence_EU = | DailyMedID = | licence_US = | pregnancy_AU = | pregnancy_AU_comment = | pregnancy_US =


| pregnancy_US_comment = | pregnancy_category= | dependency_liability = | addiction_liability = | routes_of_administration = | legal_AU = | legal_AU_comment = | legal_BR = | legal_BR_comment = | legal_CA = | legal_CA_comment = | legal_DE = | legal_DE_comment = | legal_NZ = | legal_NZ_comment = | legal_UK = | legal_UK_comment = | legal_US = | legal_US_comment = | legal_UN = | legal_UN_comment = | legal_status = | bioavailability = | protein_bound = | metabolism = | metabolites = | onset = | elimination_half-life = | duration_of_action = | excretion = | CAS_number = | CAS_supplemental = | class = | ATCvet = | ATC_prefix = | ATC_suffix = | ATC_supplemental = | PubChem = | PubChemSubstance = | IUPHAR_ligand =

| DrugBank = | ChemSpiderID = | UNII = | KEGG = | ChEBI = | ChEMBL = | NIAID_ChemDB = | PDB_ligand = | synonyms = | chemical_formula = | C= | H= | Ag= | Al= | As= | Au= | B= | Bi= | Br= | Ca= | Cl= | Co= | F= | Fe= | Gd= | I= | K= | Li= | Mg= | Mn= | N= | Na= | O= | P= | Pt= | S= | Sb= | Se= | Sr= | Tc= | Zn= | charge= | molecular_weight = | SMILES = | Jmol = | StdInChI = | StdInChI_comment = | StdInChIKey = | density = | density_notes = | melting_point = | melting_high = | melting_notes = | boiling_point = | boiling_notes = | solubility = | sol_units = | specific_rotation = }}

Figure 11 Full single-drug template with extended fields [47]
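To illustrate how the StdInChI and StdInChIKey fields could be read out of such a template, the following Python sketch parses "| name = value" lines from a piece of wikitext. It is only an illustration and not the Knime workflow used in this thesis; the function name parse_infobox_fields and the sample text (a made-up methane entry) are assumptions, and values that span several lines or contain nested templates are not handled.

```python
import re

# One "| name = value" pair per line; the value may be empty.
FIELD_PATTERN = re.compile(r"^\s*\|\s*([^=|\n]+?)\s*=[ \t]*(.*)$", re.MULTILINE)


def parse_infobox_fields(wikitext):
    """Return a dict mapping field names to their (stripped) values."""
    return {m.group(1): m.group(2).strip() for m in FIELD_PATTERN.finditer(wikitext)}


# Illustrative sample only, not copied from a real Wikipedia article.
sample = """{{Infobox drug
| drug_name = Example
| StdInChI = 1S/CH4/h1H4
| StdInChIKey = VNWKTOKETHGBQD-UHFFFAOYSA-N
| density =
}}"""

fields = parse_infobox_fields(sample)
print(fields["StdInChI"], fields["StdInChIKey"])
```

Empty fields such as density are kept with an empty string, so a missing identifier can be distinguished from a field that is absent from the template altogether.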


2.3.5.5 Use of the Media Wiki API

Magnus Manske developed MediaWiki as a free and open-source wiki software. It is written in the PHP programming language and is the software on which Wikipedia, Wiktionary and Wikimedia Commons run [48]. MediaWiki offers an API, which is called the MediaWiki web service API. The documentation of the API is available at http://www.mediawiki.org. Additionally, users can quickly try out all queries in the API sandbox provided at https://www.mediawiki.org/wiki/Special:ApiSandbox. Figure 12 displays an example of how the sandbox works.

Figure 12 An example of setting up an API query in the MediaWiki API sandbox

Based on the example above, the parameter action was set to query, so that the API call requests information from Wikipedia, and the format parameter was set to json (JavaScript Object Notation), an open standard file format commonly used for asynchronous browser–server communication because it is organized hierarchically. With action=query, additional parameters have to be set, some of them optional. For instance, the titles parameter (in the action=query tab) was set to amlodipine in this case. This means that the call requests information about the Wikipedia article titled “amlodipine”.


These titles can be derived from the URI of a Wikipedia article, for example: https://en.wikipedia.org/wiki/Amlodipine

Instead of the title, the pageids or revids parameter can be used, but only one of the three at a time. These IDs are assigned to each page and each revision, respectively. Wikipedia calls each version of an article a revision; with each update, a new revision of the article is created and stored. As a result, older versions of the pages can also be accessed via the MediaWiki API. Revisions are requested via the prop parameter (query tab), which tells the API that the “revisions” property of the amlodipine page is being called for; as a consequence, the revisions tab opens. In the revisions tab, rvprop is set to content.

Clicking the “Make request” button makes the sandbox display the URI of the GET request used for this query:

/w/api.php?action=query&format=json&prop=revisions&titles=Amlodipine&rvprop=content

All information that was selected in the sandbox is included, but in order to use the query in a web browser, the API endpoint also needs to be specified:

https://en.wikipedia.org/

This endpoint ensures that the English-language Wikipedia is queried.
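As a minimal sketch, the complete request URL can be composed from the endpoint and the query parameters with the Python standard library. The function name build_revision_query is hypothetical; the parameter names and values are the ones selected in the sandbox above.

```python
from urllib.parse import urlencode

ENDPOINT = "https://en.wikipedia.org"  # English-language Wikipedia
API_PATH = "/w/api.php"


def build_revision_query(title):
    """Compose the GET request URL for the current content of one article."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvprop": "content",
    }
    return f"{ENDPOINT}{API_PATH}?{urlencode(params)}"


url = build_revision_query("Amlodipine")
print(url)
```

Using urlencode instead of manual string concatenation also takes care of titles that contain spaces or special characters.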

The response of the MediaWiki API is hierarchically structured in JSON format; a fragment of the response is shown in Figure 13.


Figure 13 Response of MediaWiki API in JSON format (fragment)
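The hierarchical structure of such a response can be walked with a few dictionary lookups. The sketch below assumes the legacy JSON layout, in which the article text is stored under the key "*" of the first revision; the page ID, the shortened content, and the function name extract_wikitext are made up for illustration.

```python
import json


def extract_wikitext(response_text):
    """Return the wikitext of the first page contained in an API response."""
    data = json.loads(response_text)
    pages = data["query"]["pages"]      # keyed by page ID
    page = next(iter(pages.values()))   # only one title was requested
    return page["revisions"][0]["*"]    # legacy layout: content under "*"


# Hand-written sample response; page ID and content are placeholders.
sample_response = json.dumps({
    "query": {
        "pages": {
            "12345": {
                "pageid": 12345,
                "title": "Amlodipine",
                "revisions": [
                    {"contentformat": "text/x-wiki",
                     "contentmodel": "wikitext",
                     "*": "{{Infobox drug\n| StdInChIKey = PLACEHOLDER\n}}"}
                ],
            }
        }
    }
})

wikitext = extract_wikitext(sample_response)
print(wikitext)
```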

The API request can be made directly by pasting the query into the web browser or any other tool, e.g., in this case, the workflow tool Knime.


2.4 Used Tools & Software

2.4.1 Knime

Knime is an open source workflow tool first released by a team of developers from Silicon Valley headed by Michael Berthold at the University of Konstanz in 2006 [49]. Knime (derived from Konstanz Information Miner) provides an environment where the user can visually assemble and customize the flow of analysis. In Knime the central units are called nodes. The nodes can perform many tasks:

• Read in data
• Modify data
• Transform data
• Visualize data
• Output data

The nodes can be connected to each other. A simple workflow starts with a node that reads in data. This node is connected to a next node that processes the data. The processing node is then connected to an output node, which completes a simple workflow [50]. This allows automated data analysis with little or even no programming or scripting knowledge.

2.4.2 Pywinauto

Pywinauto [51] is software that automates tasks normally performed by humans directly via mouse and keyboard input; in other words, it is a graphical user interface (GUI) automation tool. The software developer Mark Mc Mahon started the development in 2006. The focus was on the GUI of Windows, and the project makes use of the strengths of the programming language Python. Since 2015, an open source community led by the maintainer Vasily Ryabov has been responsible for further development and maintenance. The project page of pywinauto is http://pywinauto.github.io/. There, all information, updates, and instructions for using pywinauto in practice can be found.


3 Development of the Methods

3.1 Preparation for retrieval of Drug Information from ChemSpider, DrugBank, PubChem and Wikipedia

In order to retrieve the InChI and InChIKey from ChemSpider, DrugBank, PubChem and Wikipedia, a list of International Nonproprietary Names (INN) was needed first. Furthermore, all drugs listed in the Austrian Medicinal Product Index had to be matched with their INN.

3.1.1 Retrieval of international nonproprietary names

Since the Austrian Medicinal Product Index includes ATC codes, a way to link ATC codes with INNs was sought.

Figure 14 Screenshot of ATC-index entry for A10BB from https://www.whocc.no/atc_ddd_index/?code=A10BB


Figure 14 shows that a fourth-level entry of the WHO ATC Index includes levels one to four as well as all fifth-level codes, i.e. all drugs belonging to the superior fourth level. The “name” of a fifth-level ATC code is almost identical to the International Nonproprietary Name. Therefore, a web crawler was built in Knime to obtain the INNs and ATC codes. Figure 15 shows the workflow that was used to retrieve the International Nonproprietary Names of approved drugs.

Figure 15 Workflow to retrieve International Nonproprietary Names; cyan highlighted

In “Metanode_1_ATC-Index-Hauptverband”, an ATC index in German is read in via the File Reader node. The German ATC index is included in the “Elektronischer Erstattungskodex”, which was downloaded from http://www.hauptverband.at. All fourth-level ATC codes were extracted and sent to the next metanode, “Metanode_2_ATC-Index- WHO-Crawler”.

Metanodes look like a single node, although they can contain many nodes and, as in “Metanode_2_ATC-Index- WHO-Crawler”, even further metanodes (shown in Figure 16) [52].


Figure 16 Inside Metanode_2_ATC-Index- WHO-Crawler

Inside Metanode_2_ATC-Index- WHO-Crawler the Metanode_2_1_WHO_Crawler is responsible for retrieving the ATC Index.

Figure 17 Workflow snippet of Metanode_2_1_WHO_Crawler Part 1

Figure 17 shows how the URLs for the crawler are generated. The fourth-level ATC codes come in through the WrappedNode Input and are joined via the Cross Joiner node with the output of the Table Creator node (the input of the Table Creator node is shown in Figure 18).


Figure 18 Input of table creator node:

The “$$$$” placeholder of the pre-URL http://www.whocc.no/atc_ddd_index/?code=$$$$ is replaced via the String Manipulation node with the particular fourth-level ATC code. A small sample is shown in Figure 19.

Figure 19 The finished URLs listed in the “URL-CALL” column
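The placeholder substitution can be expressed in a few lines of Python. This is only a sketch of the idea behind the String Manipulation step, not the Knime node itself; the function name build_urls and the two example codes are chosen for illustration.

```python
PRE_URL = "http://www.whocc.no/atc_ddd_index/?code=$$$$"


def build_urls(fourth_level_codes):
    """Replace the $$$$ placeholder with each fourth-level ATC code."""
    return [PRE_URL.replace("$$$$", code) for code in fourth_level_codes]


urls = build_urls(["A10BA", "A10BB"])
for url in urls:
    print(url)
```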

The “URL-CALL” column was finally sent to the HttpRetriever node. The HttpRetriever node sent an HTTP GET request to the respective URL (e.g. http://www.whocc.no/atc_ddd_index/?code=A10BB), and the results were provided as cells of type HttpResult. According to the cited source (accessed February 13th, 2018), “The HttpResult type bundles the actual binary content of the result, status code and all response headers” [53]. In the next step, the HttpResult was sent to the HtmlParser node shown in Figure 20.


Figure 20 Workflow snippet of Metanode_2_1_WHO_Crawler Part 2

The HtmlParser node converted the HttpResult into XML cells (shown in Figure 21).

Figure 21 Snippet of XML cell for https://www.whocc.no/atc_ddd_index/?code=A10BB generated via HTMLParser node

The conversion into XML cells made it possible, in a further step, to select specific HTML tags via the XPath node. The XPath query was set to “//dns:a” in order to select the HTML anchor elements (tag a). The ungrouped results of the expression “//dns:a” are shown in Figure 22.


Figure 22 Result of XPath Expression //dns:a

The data was once again supplied to an XPath node, in which the XPath query was set to //dns:a/@href in order to retrieve specifically the hyperlinks stored in the href attribute (the href result is shown in Figure 23).

Figure 23 Href result of //dns:a/@href
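The two XPath steps can be reproduced in a small Python sketch with the standard library. Two things differ from the Knime workflow and are assumptions of this sketch: the prefix dns is bound here to the XHTML namespace, and because xml.etree.ElementTree supports only a subset of XPath without the /@href step, the attribute is read with get("href") instead. The sample document is hand-written and much shorter than the real WHO page.

```python
import xml.etree.ElementTree as ET

# Assumption: "dns" stands for the default (XHTML) namespace of the parsed page.
NAMESPACES = {"dns": "http://www.w3.org/1999/xhtml"}

# Hand-written stand-in for the parsed page, not the actual WHO markup.
sample_xhtml = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<a href="./?code=A10BB01&amp;showdescription=yes">glibenclamide</a>
<a href="./?code=A10BB02&amp;showdescription=yes">chlorpropamide</a>
<p>no link here</p>
</body></html>"""


def extract_links(xhtml):
    """Return (href, link text) for every anchor element in the document."""
    root = ET.fromstring(xhtml)
    anchors = root.findall(".//dns:a", NAMESPACES)  # counterpart of //dns:a
    return [(a.get("href"), (a.text or "").strip()) for a in anchors]


links = extract_links(sample_xhtml)
for href, text in links:
    print(href, text)
```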

The data was sent to the RowSplitter node, which allows row filtering according to certain criteria. The matching criteria were set to include only rows of the href column whose values match the pattern “*http://www.whocc.no/atc_ddd_index/?code=*”. The filtered URLs had a structure similar to http://www.whocc.no/atc_ddd_index/?code=A01AA01&showdescription=yes


First, the http://www.whocc.no/atc_ddd_index/?code= part was removed via the String Replacer node, so that finally all requested ATC codes were obtained. In the above example, the extracted fifth-level ATC code is A01AA01. At this stage, only the ATC codes had been collected; the descriptions of the ATC codes were still missing. The workflow overview for extracting the ATC descriptions is shown in Figure 24.

Figure 24 Workflow snippet of Metanode_2_1_WHO_Crawler Part 3
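The filtering and code extraction described above (RowSplitter followed by String Replacer) can be sketched in Python as follows. Note that this sketch swaps the string replacement for query-string parsing, which also drops the trailing showdescription parameter; the function names are hypothetical, and the length check relies on fifth-level ATC codes having seven characters.

```python
from urllib.parse import parse_qs, urlsplit

PREFIX = "http://www.whocc.no/atc_ddd_index/?code="


def atc_code_from_url(url):
    """Return the value of the code parameter, or None for unrelated links."""
    if not url.startswith(PREFIX):
        return None
    codes = parse_qs(urlsplit(url).query).get("code", [])
    return codes[0] if codes else None


def fifth_level_codes(urls):
    """Keep only links that carry a fifth-level (seven-character) ATC code."""
    codes = (atc_code_from_url(u) for u in urls)
    return [c for c in codes if c is not None and len(c) == 7]


hrefs = [
    "http://www.whocc.no/atc_ddd_index/?code=A01AA01&showdescription=yes",
    "http://www.whocc.no/atc_ddd_index/?code=A01AA",
    "http://example.org/unrelated",
]
print(fifth_level_codes(hrefs))
```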

To extract the data corresponding to the href attribute, the XPath query was set to //dns:a[contains(@href, '')]. The data was further processed, e.g. by removing the “Show text from Guidelines” entries via the RowSplitter node. As a result, the full ATC Index in English was obtained (a snippet of the ATC index is shown in Figure 25).

Figure 25 ATC index English


3.1.2 Extraction of ATC codes from Austrian Medicinal Product Index

In order to extract the ATC codes from the Austrian Medicinal Product Index, the dataset first had to be downloaded. The complete dataset was downloaded from https://aspregister.basg.gv.at/ on October 19th, 2017. At that time, 17,781 human and veterinary medicinal products were listed.

Figure 26 Screenshot from https://aspregister.basg.gv.at

The ATC code extraction was performed in Metanode_3_Austrian Medicinal Product Index (previously shown in Figure 15). Inside this metanode, the dataset from the Austrian Medicinal Product Index was read in via the ExcelReader node shown in Figure 27 (the Excel table input is shown in Figure 28).


Figure 27 Inside Metanode_3_Austrian Medicinal Product Index: workflow to extract ATC codes

Figure 28 Excel table input with the dataset of Austrian Medicinal Product Index

In the next steps, the ATC code column was grouped and further processed so that, in the end, all fifth-level ATC codes of the Austrian Medicinal Product Index were extracted. In total, 1,595 unique fifth-level ATC codes were obtained. These codes were joined via the Joiner node with the English ATC index obtained from Metanode_2_ATC-Index- WHO-Crawler. Although 1,595 fifth-level codes had been extracted, only 1,563 could be joined to the ATC index dataset. The main reason for this are alterations in the ATC classification. According to World Health Organization: Introduction to Drug Utilization Research (2003, p. 35), “Changes to the ATC classification would be made when the main use of a drug had clearly changed, and when new groups are required to accommodate new substances or to improve the specificity of the groupings” [6]. These alterations are published annually and can be found at https://www.whocc.no/atc_ddd_alterations__cumulative/atc_alterations. In addition, the ATC codes are entered by hand by employees of the Austrian Federal Office for Safety in Health Care (BASG), so typing errors are another reason. All in all, in a database with 17,781 medicinal product entries, only 32 of the 1,595 unique fifth-level ATC codes (2.01%) were outdated or wrong.

Furthermore, the 1,563 matched ATC codes were supplemented by 39 fifth-level codes, because these 39 are listed only with their fourth-level ATC code in the Austrian Medicinal Product Index (this mostly concerns EU approvals). Finally, all ATC codes were removed and the ATC descriptions were grouped. The final dataset of ATC descriptions/International Nonproprietary Names has 1,458 entries, because one substance can have several ATC codes; acetylsalicylic acid, for example, has three different ATC codes as a mono preparation (A01AD05, B01AC06, N02BA01). After finishing the steps mentioned above, the result was a list of INN/ATC description names that are listed in the Austrian Medicinal Product Index. This list is called result list 1 (workflow column q).
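Why grouping by ATC description yields fewer entries than there are ATC codes can be shown with a short Python sketch. The three acetylsalicylic acid codes are taken from the text above; the amlodipine code and the function name group_codes_by_name are added for illustration only. The last lines recompute the share of unmatched codes from the figures given in this section.

```python
from collections import defaultdict


def group_codes_by_name(code_to_name):
    """Collect all ATC codes that share the same ATC description (INN)."""
    grouped = defaultdict(list)
    for code, name in code_to_name.items():
        grouped[name].append(code)
    return dict(grouped)


code_to_name = {
    "A01AD05": "acetylsalicylic acid",
    "B01AC06": "acetylsalicylic acid",
    "N02BA01": "acetylsalicylic acid",
    "C08CA01": "amlodipine",  # illustrative extra entry
}

grouped = group_codes_by_name(code_to_name)
print(len(code_to_name), "codes ->", len(grouped), "names")

# Share of extracted fifth-level codes that could not be joined to the WHO index.
extracted, matched = 1595, 1563
unmatched_share = round(100 * (extracted - matched) / extracted, 2)
print(extracted - matched, "unmatched codes,", unmatched_share, "%")
```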


3.2 Retrieval of Drug Information from ChemSpider, DrugBank, PubChem and Wikipedia

3.2.1 Retrieval of Drug Information from ChemSpider

To retrieve data from ChemSpider through Knime, the WSDL file is needed, which was downloaded from http://www.chemspider.com/blog/wp-content/uploads/2011/12/ChemSpiderSearchWSDL_no_soapencArray.zip. A screenshot of the workflow for retrieving drug information from ChemSpider is shown in Figure 29.

Figure 29 Workflow to retrieve Drug Information from ChemSpider

The generic drug names from result list 1 (column q) were sent to the Generic Web Service Client node. The WSDL file was read in, and SimpleSearch was chosen as the operation. The SimpleSearch operation requires an access token. The input parameter query was mapped to column q (“Mapped column”; generic drug names from result list 1), and the token was entered in the constant value field. The settings of the Generic Web Service Client node are displayed in Figure 30.


Figure 30 Generic Web Service Client settings to retrieve ChemSpiderIDs

The result of the SimpleSearch operation was a list of ChemSpiderIDs, the ChemSpider database identifiers related to the generic drug names. This list of ChemSpiderIDs was used in a second Generic Web Service Client node to retrieve the InChI and InChIKey. For this purpose, the operation GetCompoundInfo was used. The specific settings are shown in Figure 31.


Figure 31 Generic Web Service Client settings to retrieve InChI, InChIKey, and Smiles from ChemSpider

The result of the GetCompoundInfo operation was a table with the generic drug names and their related InChI, InChIKey, and SMILES. The data was further processed with other Knime nodes, and the final result of the retrieval of drug information from ChemSpider was a table with the information shown in Figure 32.

Figure 32 Segment from Table of Extracted Drug Information from ChemSpider

3.2.2 Retrieval of Drug Information from DrugBank

The whole dataset was downloaded from https://www.drugbank.ca/releases/ on 20 October 2017 in SDF format (“all drugs”). The dataset is released under a Creative Commons Attribution-NonCommercial 4.0 International License and can be used freely in non-commercial projects. An overview of the workflow is given in Figure 33.

Figure 33 Workflow to retrieve Drug Information from DrugBank dataset

The workflow starts with reading in the DrugBank SDF file via the SDF Reader node. The SDF Extractor node was used to extract properties from the SDF column. The result was a table with over 30 columns of drug-specific information. For this project, the Column Filter node was used to keep only the columns relevant to the research question (shown in Figure 34).

Figure 34 Column Filter settings to retrieve DrugBank Database_ID, InChI, InChIKey, Generic name

In the next steps, the selected columns were renamed, the Generic_Name column strings were converted to lowercase via the Case Converter node, and all columns were stripped to remove non-visible spaces. To retrieve the drug information for the generic drug names of result list 1 from the DrugBank SDF file, a Joiner node was configured to perform a “Full Outer Join”. The DrugBank_Generic_Name column was assigned to the top input (the ‘left’ table), and the q column (generic names from result list 1) to the bottom input (the ‘right’ table). Two rows were joined only if the generic drug names in both columns were exactly equal, that is, if all letters and blank characters appeared in the same order. If the names differed and the join failed, the columns of the other table were left empty and marked in the joined table with a red “?”. The joined table was then processed further to remove generic drug names that are not included in result list 1. The final result of the retrieval of drug information from DrugBank was a table with the information shown in Figure 35.

Figure 35 Segment from Table of Drug Information from DrugBank
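
The normalisation and join logic described for DrugBank can be sketched in plain Python. This is only an analogue of the KNIME nodes, not the workflow itself; the helper names, the dictionary keys and the sample records (including the name "hypothetical-name") are assumptions for illustration.

```python
def normalise(name):
    """Lower-case and strip a generic name, as done before the join."""
    return name.strip().lower()


def full_outer_join(drugbank_rows, result_list_1):
    """Join DrugBank records to result list 1 on the exact normalised name.

    Returns (name, drugbank_record_or_None, in_result_list) tuples.
    """
    drugbank = {normalise(r["generic_name"]): r for r in drugbank_rows}
    wanted = {normalise(n) for n in result_list_1}
    joined = []
    for name in sorted(set(drugbank) | wanted):
        # A missing DrugBank record plays the role of the red "?" cells.
        joined.append((name, drugbank.get(name), name in wanted))
    return joined


def keep_result_list_entries(joined):
    """Drop DrugBank-only rows; unmatched result-list names stay as gaps."""
    return [(name, record) for name, record, in_list in joined if in_list]


# Assumed sample data for illustration only.
drugbank_rows = [
    {"generic_name": " Amlodipine", "database_id": "DB00381"},
    {"generic_name": "Lepirudin", "database_id": "DB00001"},
]
result_list_1 = ["amlodipine", "hypothetical-name"]

table = keep_result_list_entries(full_outer_join(drugbank_rows, result_list_1))
for name, record in table:
    print(name, record["database_id"] if record else "?")
```

Because the comparison is an exact string match, any spelling difference between the two sources produces a gap rather than a wrong link.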

3.2.3 Retrieval of Drug Information from PharmXplorer

PharmXplorer offers no API, and on the information platform the chemical structures are provided as .gif images. There are no InChI or InChIKey entries. Hence, a way had to be found to obtain the InChI for each displayed .gif image. Professor Haider, who is responsible for PharmXplorer at the University of Vienna, generously provided the chemical structures underlying the .gif images as .skc files. It should be noted that MDL molfiles containing the calculated 3D structures of the drug compounds are also freely available for download from the respective page; they were not examined in the present investigation. The flowchart for retrieving InChI & InChIKey from PharmXplorer is displayed in Figure 36.

Figure 36 Process of retrieving InChI & InChIKey from PharmXplorer

No freely available KNIME SKC Reader node was found, but a Molfile Reader node is contained in KNIME Base Chemistry Types & Nodes provided by KNIME GmbH, Konstanz, Germany. It was therefore decided to convert the .skc files to the MOL file format so that the freely available Molfile Reader node could be used.

For this purpose ACD/ChemSketch [54] was used, a program that allows not only drawing chemical structures but also importing and exporting structure files. To convert .skc files to Molfiles using the ACD/ChemSketch GUI, the following steps must be performed:

1. click on File/Open
2. select the desired .skc file
3. click on the import button
4. click on File/Export
5. select the mol format
6. rename the file
7. click on the export button

Done by hand, this conversion would have been a Sisyphean task, because over a thousand .skc files had to be converted. Pywinauto [51], a GUI automation library written in Python, was used to automate the steps described above. The Python batch conversion script is based on the script that can be found at https://nextmovesoftware.com/blog/2012/09/14/using-python-for-batch-conversion-of-chemsketch-files-to-mol-files/. The modified Python script is attached in the appendix.

After the conversion process was finished, the structure files were read via the Molfile Reader node into the KNIME workflow “Retrieval of Drug Information from PharmXplorer” (an overview of the workflow is shown in Figure 37).

Figure 37 Workflow “Retrieval of Drug Information from PharmXplorer”

The .skc files were named according to the PharmXplorer database identifier, called “Hauptnummer”, plus the extension .skc, for example “00002.skc”. Via the String Manipulation and String to Number nodes, the “Hauptnummer” was extracted; in the previous example this is the number 2. The PharmXplorer data is provided in German, which is why it was impossible to join the German generic names directly to the English generic names of the gold-standard dataset. Fortunately, Prof. Haider also provided an SDF file that includes the database identifier “Hauptnummer” and the Fifth Level ATC codes. The converted Molfiles were joined to their Fifth Level ATC codes. Finally, InChI and InChIKey were generated from the converted Molfiles via the OpenBabel node.
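
As a minimal sketch of the identifier handling, the following Python code derives the “Hauptnummer” from a file name and uses it as the join key to the ATC code. The function names and the sample mapping are hypothetical and only mirror the String Manipulation, String to Number and Joiner steps described above.

```python
from pathlib import Path


def hauptnummer_from_filename(filename):
    """Return the integer identifier encoded in a name such as '00002.skc'."""
    stem = Path(filename).stem  # '00002'
    return int(stem)            # leading zeros are dropped


def join_on_hauptnummer(structure_files, atc_by_hauptnummer):
    """Attach the Fifth Level ATC code to each file via the Hauptnummer."""
    return {
        filename: atc_by_hauptnummer.get(hauptnummer_from_filename(filename))
        for filename in structure_files
    }


# Assumed sample mapping; the real one comes from the provided SDF file.
atc_by_hauptnummer = {2: "C08CA01"}
linked = join_on_hauptnummer(["00002.mol", "00003.mol"], atc_by_hauptnummer)
print(linked)
```

Files without a known ATC code map to `None`, so they can be inspected separately instead of being dropped silently.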

3.2.4 Retrieval of Drug Information from PubChem

The data from PubChem was retrieved using the PUG-REST service. An overview of the workflow is shown in Figure 38.

Figure 38 Workflow to retrieve Drug Information from PubChem

The base form of the PUG-REST request was set to https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/$$$$/property/InChI,InChikey,IUPACName/json and added to the workflow using the Table Creator node. An overview of the parts of the URL is shown in Table 4.

Part          Value
prolog        https://pubchem.ncbi.nlm.nih.gov/rest/pug/
input         compound/name/
placeholder   $$$$/
operation     property/InChI,InChikey,IUPACName/
format        json

Table 4 URI parts of PUG-REST request

Via the Cross Joiner node, a table was created with the column of generic drug names from result list 1 (left side) and a column with the base form of the PUG-REST URI shown in Table 4 (right side). The String Manipulation node replaced the “$$$$” (placeholder part of the PUG-REST URI) with each generic drug name of result list 1.

Figure 39 displays the finished URI for amlodipine to retrieve InChI, InChIKey and the IUPAC name.

Figure 39 PUG REST URI for amlodipine to retrieve Drug Information from PubChem
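
The placeholder substitution can be sketched in a few lines of Python. This is an illustration rather than the KNIME String Manipulation node: the function name is hypothetical, the property list uses the spelling `InChIKey`, and the percent-encoding of names is an added assumption, since the thesis does not state how spaces in names were handled.

```python
from urllib.parse import quote

URI_TEMPLATE = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
    "$$$$/property/InChI,InChIKey,IUPACName/json"
)


def build_pug_rest_uri(generic_name, template=URI_TEMPLATE):
    """Replace the $$$$ placeholder with one generic drug name."""
    # Percent-encode the name so that spaces survive inside the URL path.
    return template.replace("$$$$", quote(generic_name, safe=""))


uris = [build_pug_rest_uri(n) for n in ["amlodipine", "acetylsalicylic acid"]]
for uri in uris:
    print(uri)
```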

The finished table of URIs was passed to the Get Request node, which sent the requests to the PubChem server. The PubChem servers converted the generic drug name to a CID (PubChem Compound Identifier) and returned the requested InChI, InChIKey and IUPAC name in the desired JSON format. The returned JSON string was turned into a data tree using the JSON Path node. The JSON Path node was set to extract InChI, InChIKey, and CID (preferences shown in Figure 40).

Figure 40 JSON Path node preferences to extract InChI, InChIKey and CID
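
The extraction performed by the JSON Path node can be approximated with the Python standard library. The sample below is hand-written in the shape of a PUG-REST property response; its InChI, InChIKey and name values are placeholders and not real PubChem data, and the function name is hypothetical.

```python
import json

# Hand-written sample; the string values are placeholders, not real data.
SAMPLE_RESPONSE = """
{
  "PropertyTable": {
    "Properties": [
      {
        "CID": 2162,
        "InChI": "InChI=1S/PLACEHOLDER",
        "InChIKey": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
        "IUPACName": "placeholder name"
      }
    ]
  }
}
"""


def extract_properties(response_text):
    """Return (CID, InChI, InChIKey) tuples from a PUG-REST JSON string."""
    data = json.loads(response_text)
    records = data.get("PropertyTable", {}).get("Properties", [])
    return [(r.get("CID"), r.get("InChI"), r.get("InChIKey")) for r in records]


rows = extract_properties(SAMPLE_RESPONSE)
print(rows)
```

Using `.get` with defaults means that a response without a property table yields an empty list instead of an exception, which is convenient when some names cannot be resolved.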

The JSON Path node output was processed over multiple nodes, and the final result of the Retrieval of Drug Information from PubChem was a table with the following information shown in Figure 41.

Figure 41 Segment from Table of Extracted Drug Information from PubChem

3.2.5 Retrieval of Drug Information from Wikipedia

A Workflow was built to retrieve the content of the drugboxes of each entry of result list 1.

Figure 42 Workflow to retrieve Drug Information from Wikipedia

Figure 42 shows that the workflow starts with a Table Creator node, in which the basic query structure was set to https://en.wikipedia.org//w/api.php?action=query&format=json&prop=revisions&titles=AAAA&redirects=1&rvprop=content&rvsection=0. With the aid of the Cross Joiner node, a table was created that contains the drug names of result list 1 on the left side and the basic query structure on the right side. The String Manipulation node replaced the “AAAA” in the basic query structure with each entry of result list 1. In this way, all API query strings were built.

Figure 43 The API query string to retrieve the content of Wikipedia drugbox amlodipine

Figure 43 shows how the query was built, using amlodipine as an example. The action parameter was set to query and format was set to json. The prop parameter was set to revisions, so that page revisions were requested. The titles parameter was set to each entry of result list 1 (amlodipine in the example above). The redirects parameter was set to 1, so that a title such as acetylsalicylic acid is automatically redirected to the Wikipedia article Aspirin. The rvprop parameter was set to content, so that the content of the revision was retrieved, and rvsection was set to 0. As a result, only section 0, which contains the drugbox, was retrieved.
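
The same query string can be assembled from its parameters with `urllib.parse`. This is a sketch with a hypothetical function name; it builds the URL from a parameter dictionary instead of replacing a placeholder, which is a different technique from the one used in the workflow but yields an equivalent request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_wikipedia_query(title):
    """Build the API query for section 0 of the article with this title."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "redirects": 1,        # follow redirects such as synonyms
        "rvprop": "content",   # return the wikitext of the revision
        "rvsection": 0,        # lead section only
    }
    return API_ENDPOINT + "?" + urlencode(params)


uri = build_wikipedia_query("acetylsalicylic acid")
parsed = parse_qs(urlsplit(uri).query)
print(uri)
```

`urlencode` takes care of escaping spaces and special characters in drug names, so titles with several words need no manual treatment.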

The URIs were passed to the Get Request node, which sent the requests to the API. The returned JSON string was turned into a data tree using the JSON Path node.

Figure 44 API response structure for amlodipine

Figure 44 shows the structure of the resulting data tree. InChI, InChIKey and all the other properties of interest are still inside one unstructured string of text in the “*” property. Therefore, the JSON Path node was set to extract the “*” property in addition to the title and the pageid (settings shown in Figure 45).

Figure 45 JSON Path node setting to extract “*” property, title & pageid

Each property extracted by the JSON Path node is presented in its own column. The whole data were sent to two separate Java Snippet nodes. The first Java Snippet node was configured to extract the InChI from the drugbox (*s column). For this purpose, regular expressions were used in a short script shown in Figure 46.

Figure 46 Extraction of InChI from Wikipedia Drugbox

The other Java Snippet node was configured similarly. The search term was changed from StdInChI to StdInChIKey (script displayed in Figure 47).

Figure 47 Extraction of InChIKey from Wikipedia Drugbox
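
The Java scripts in Figures 46 and 47 are not reproduced here. As a rough Python analogue, the following sketch extracts the two fields from a drugbox string with regular expressions. The sample wikitext is hand-written with placeholder values, and the exact patterns are assumptions rather than the ones used in the thesis.

```python
import re

# Hand-written fragment in the style of a drugbox; values are placeholders.
SAMPLE_WIKITEXT = """{{Infobox drug
| StdInChI = 1S/PLACEHOLDER/c1-2
| StdInChIKey = AAAAAAAAAAAAAA-BBBBBBBBSA-N
}}"""

# "StdInChI" must be followed directly by optional spaces and "=", so the
# first pattern does not also match the "StdInChIKey" field.
STD_INCHI = re.compile(r"\|\s*StdInChI\s*=\s*([^\n|}]+)")
STD_INCHIKEY = re.compile(r"\|\s*StdInChIKey\s*=\s*([^\n|}]+)")


def extract_field(pattern, wikitext):
    """Return the trimmed field value, or None if the field is absent."""
    match = pattern.search(wikitext)
    return match.group(1).strip() if match else None


inchi = extract_field(STD_INCHI, SAMPLE_WIKITEXT)
inchikey = extract_field(STD_INCHIKEY, SAMPLE_WIKITEXT)
print(inchi, inchikey)
```

Anchoring each pattern on the field name plus the equals sign is what keeps the two similarly named fields apart.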

The outputs of the two Java Snippet nodes were processed over multiple nodes and finally joined together. The final result of the retrieval of drug information from Wikipedia was a table with the information shown in Figure 48.

Figure 48 Segment from Table of Extracted Drug information from Wikipedia

3.3 Data preparation for consistency test

3.3.1 Concatenate Data from ChemSpider, DrugBank, PubChem and Wikipedia

The retrieved data from ChemSpider, DrugBank, PubChem, and Wikipedia were concatenated via the Concatenate node into one table. A fragment of the concatenated table is shown in Figure 49.

Figure 49 Fragment of concatenated table of retrieved data from ChemSpider, DrugBank, PubChem, and Wikipedia.

The data in the table shown in Figure 49 were filtered via Row Splitter nodes. First, the InChI column was restricted to standard InChIs. The configuration of the Row Splitter node is displayed in Figure 50.

Figure 50 Configuration of Row Splitter node to include standard InChIs

In a further step, another Row Splitter node was used to remove incorrect InChIKeys. For this purpose, a regular expression was used; the configuration of the Row Splitter node is shown in Figure 51.

Figure 51 Regular expression to include valid InChIKeys

The result of the InChI and InChIKey filtering was that only rows with at least one InChI and one InChIKey remained. A fragment of the filtered table is displayed in Figure 52.

Figure 52 Fragment of filtered concatenated table with InChI & InChIKey
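
The two filters can be sketched as follows. The exact Row Splitter settings are only visible in Figures 50 and 51, so both checks are assumptions: the standard InChI test uses the prefix `InChI=1S/`, and the InChIKey test reuses the regular expression quoted in Section 3.6.1. The row layout and function names are hypothetical.

```python
import re

# Pattern as quoted in Section 3.6.1; assumed to apply here as well.
INCHIKEY_PATTERN = re.compile(r"[a-zA-Z]{14}-[a-zA-Z]{10}-[a-zA-Z]{1}")


def is_standard_inchi(value):
    """Assumed check: standard InChIs start with the prefix 'InChI=1S/'."""
    return isinstance(value, str) and value.startswith("InChI=1S/")


def is_valid_inchikey(value):
    """True if the whole string has the 14-10-1 letter block layout."""
    return isinstance(value, str) and INCHIKEY_PATTERN.fullmatch(value) is not None


def filter_rows(rows):
    """Keep only rows that carry both a standard InChI and a valid InChIKey."""
    return [
        r for r in rows
        if is_standard_inchi(r.get("inchi")) and is_valid_inchikey(r.get("inchikey"))
    ]


# Placeholder rows for illustration only.
rows = [
    {"name": "a", "inchi": "InChI=1S/PLACEHOLDER", "inchikey": "AAAAAAAAAAAAAA-BBBBBBBBSA-N"},
    {"name": "b", "inchi": "InChI=1/PLACEHOLDER", "inchikey": "AAAAAAAAAAAAAA-BBBBBBBBSA-N"},
    {"name": "c", "inchi": "InChI=1S/PLACEHOLDER", "inchikey": "not a key"},
    {"name": "d", "inchi": None, "inchikey": None},
]
kept = filter_rows(rows)
print([r["name"] for r in kept])
```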

The data were processed over multiple nodes, and finally a table with 1,265 distinct generic names was created. A fragment of this table is shown in Figure 53.

Figure 53 Fragment of Table with 1,265 Generic names and their related InChI & InChIKey

3.3.2 InChI to Boolean

The goal was to get an overview of the databases in which an InChI is found for a given generic name. As a first step, the value TRUE = 1 was set when an InChI was found in a specific database, and FALSE = 0 when no InChI was found. The resulting table shows for each drug whether an InChI exists in each database or not. A section of this table can be seen in Figure 54.

Figure 54 Screenshot of overview if InChI database entry was found

3.3.3 Binary Code – 16 Cases

To determine exactly in which databases an InChI is found, a binary code is used. Each of the 4 databases has two options (entry found = 1 | not found = 0), which results in 2^4 = 16 combinations. These 16 possible cases and their related binary strings are provided in Table 5.

ChemSpider  DrugBank  PubChem  Wikipedia  Case                                         Case_string  Case number
0           0         0        0          no entry found                               0000         1
0           0         0        1          Wikipedia                                    0001         2
0           0         1        0          PubChem                                      0010         3
0           0         1        1          PubChem + Wikipedia                          0011         4
0           1         0        0          DrugBank                                     0100         5
0           1         0        1          DrugBank + Wikipedia                         0101         6
0           1         1        0          DrugBank + PubChem                           0110         7
0           1         1        1          DrugBank + PubChem + Wikipedia               0111         8
1           0         0        0          ChemSpider                                   1000         9
1           0         0        1          ChemSpider + Wikipedia                       1001         10
1           0         1        0          ChemSpider + PubChem                         1010         11
1           0         1        1          ChemSpider + PubChem + Wikipedia             1011         12
1           1         0        0          ChemSpider + DrugBank                        1100         13
1           1         0        1          ChemSpider + DrugBank + Wikipedia            1101         14
1           1         1        0          ChemSpider + DrugBank + PubChem              1110         15
1           1         1        1          ChemSpider + DrugBank + PubChem + Wikipedia  1111         16

Table 5 16 Cases of InChI
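
The mapping in Table 5 follows a simple rule: the case string is the four found/not-found flags in the order ChemSpider, DrugBank, PubChem, Wikipedia, and the case number is the binary value of that string plus one. A short Python sketch, with hypothetical function names, reproduces the table:

```python
DATABASES = ("ChemSpider", "DrugBank", "PubChem", "Wikipedia")


def case_string(found):
    """Encode a {database: bool} mapping as a 4-character binary string."""
    return "".join("1" if found.get(db) else "0" for db in DATABASES)


def case_number(bits):
    """Table 5 numbers the cases from 1, so add one to the binary value."""
    return int(bits, 2) + 1


example = case_string({"PubChem": True, "Wikipedia": True})
print(example, case_number(example))

# Enumerate all 16 combinations in the order of Table 5.
all_cases = [format(i, "04b") for i in range(2 ** len(DATABASES))]
```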

3.4 Consistency Test

The InChIs were converted to their related InChIKeys through the ChemSpider Web Services (a description can be found at https://www.chemspider.com/InChI.asmx?op=InChIToInChIKey) [55].

A fragment of the workflow of consistency test is displayed in Figure 55.

Figure 55 Workflow overview test for consistency

The data table “Preparation for consistency check” was filtered via multiple Rule-based Row Filter nodes, each configured to let through only the data of the respective case number. The data of each case number were then sent to Rule Engine nodes. These Rule Engine nodes were configured to check whether the converted InChIKeys from the different databases match for the respective generic drug name. The configuration of the Rule Engine node for case 16 (InChIKey included in all four databases) is shown in Figure 56. The configuration for the other 15 cases was carried out analogously.

Figure 56 Consistency test for case 16 – InChIKey included in all four databases

If all conditions were met, the result was a “true” label; otherwise an empty row was generated. The output of the sixteen Rule Engine nodes was combined through multiple Concatenate nodes and subsequently sent to a Rule-based Row Splitter node, which separated the matches from the mismatches. The mismatches were labeled as inconsistency and given the value 0, and the matches were labeled as consistency and given the value 1. The data were further processed and resulted in the table shown in Figure 57.

Figure 57 Result table consistency test (fragment)
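
The core of the consistency test can be expressed compactly: an entry is consistent if all InChIKeys that are present for a generic name are identical. The following Python sketch replaces the sixteen case-specific Rule Engine nodes with one generic function; the function name and the placeholder keys are assumptions for illustration.

```python
def consistency_label(inchikeys):
    """Label one drug from a {database: InChIKey or None} mapping.

    Returns None if fewer than two databases provide a key, otherwise
    ("consistency", 1) or ("inconsistency", 0).
    """
    present = [key for key in inchikeys.values() if key]
    if len(present) < 2:
        return None  # does not meet the inclusion criterion
    if len(set(present)) == 1:
        return ("consistency", 1)
    return ("inconsistency", 0)


# Placeholder keys for illustration only.
KEY_A = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
KEY_B = "CCCCCCCCCCCCCC-BBBBBBBBSA-N"

same = consistency_label({"ChemSpider": KEY_A, "PubChem": KEY_A, "Wikipedia": None})
differ = consistency_label({"ChemSpider": KEY_A, "DrugBank": KEY_A, "PubChem": KEY_B})
print(same, differ)
```

Comparing the set of present keys avoids writing a separate rule for each of the 16 cases, at the cost of no longer recording which database deviates.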

Additionally, the data were exported with the CSV Writer node in .csv format for data evaluation.

3.5 Creating gold-standard dataset

The gold-standard dataset was created starting from the result table of the consistency test (Figure 57). The data were divided based on the results of the InChI consistency test. The InChIs were added to the gold-standard dataset without manual control if the respective generic names were listed in at least three different databases and the InChIs were consistent. If the data were inconsistent or listed in only two databases, the InChIs were added to the group for manual control. A flowchart of the gold-standard dataset generation is shown in Figure 58.

Figure 58 Flowchart showing the process of generating a gold-standard dataset
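
The decision rule of the flowchart can be written as a small function. This is a sketch with hypothetical names; it encodes only the two criteria stated in the text, the number of databases and the outcome of the consistency test.

```python
def route_entry(n_databases, consistent):
    """Decide how an entry enters the gold-standard dataset."""
    if n_databases >= 3 and consistent:
        return "automatic"        # added without manual control
    return "manual control"       # inconsistent, or only two databases


examples = {
    (4, True): route_entry(4, True),
    (3, True): route_entry(3, True),
    (3, False): route_entry(3, False),
    (2, True): route_entry(2, True),
}
print(examples)
```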

Finally, the data were assembled into the gold-standard dataset.

3.6 Comparing InChI related InChIKey from Public Databases versus gold-standard dataset

3.6.1 Comparing InChI related InChIKey from ChemSpider, DrugBank, PubChem & Wikipedia versus gold-standard dataset

The workflows for comparing the InChIKeys from ChemSpider, DrugBank, PubChem & Wikipedia are built in a very similar manner. Therefore, ChemSpider was chosen as an example to explain the construction of the workflow. The workflow for comparing InChIKeys from ChemSpider with the gold-standard dataset is displayed in Figure 59.

Figure 59 Workflow overview comparing InChIKey from ChemSpider versus gold-standard dataset

The data from ChemSpider and the gold-standard dataset (“Pharmaquiz”) were joined via the Joiner node. The GroupBy node was configured to select the columns InChI_Pharmaquiz and InChIKey_Pharmaquiz, the database ID of the specific database (for ChemSpider, the csid), and the InChI & InChIKey from the respective database (in this case, InChI_ChemSpider and InChIKey_ChemSpider). Additionally, the name column was selected to label the name of the database in each row. A fragment of the table is shown in Figure 60.

Figure 60 selected columns to prepare for comparison

The comparison of the InChIKeys from the two data sources was performed with the Rule Engine node, which was configured to check whether the InChIKey from the specific public database matches the Pharmaquiz gold-standard. The configuration of the Rule Engine node is displayed in Figure 61.

Figure 61 Rule Engine node configured to prove if InChIKey ChemSpider matches InChIKey Pharmaquiz gold-standard

The Rule Engine node returns the result of the match check in the appended “TestChemSpider” column (for the PubChem database, for example, this column is called TestPubChem). In the next step, the whole table was sent to a Row Splitter node, which was configured to validate the InChIKey in the InChIKey_ChemSpider column. For this purpose, a regular expression was used, and the matching criterion was set to [a-zA-Z]{14}-[a-zA-Z]{10}-[a-zA-Z]{1}. If the data did not match the regular expression, no valid InChIKey had been found and thus no entry had been retrieved from ChemSpider. Via a Constant Value Column node, the value of the “TestChemSpider” column was replaced with the constant value “2” for “no entry”. Additionally, another Constant Value Column node was used to append a new column called “result” with the constant value “no entry”.

If the data matched the regular expression, an entry from ChemSpider existed because a valid InChIKey was included, but it was not yet clear whether the InChIKey from ChemSpider matched the gold-standard. Therefore, the data were sent to a Row Splitter node, which was configured to split according to the TestChemSpider column. The matching criterion was set to the pattern “true” with case-sensitive matching. If the data matched the term “true”, the InChIKey from ChemSpider matched the gold-standard; the constant value of the TestChemSpider column was set to “1” and the constant value “match” was added to the result column. If the data failed to match the term “true”, the InChIKey from ChemSpider and the reference InChIKey were not the same; the constant value of the TestChemSpider column was set to “0” and the constant value “no match” was added to the result column. After that, the separately labeled data were combined via the Concatenate node.
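
The three-way labeling described above (2 = no entry, 1 = match, 0 = no match) can be condensed into one function. The sketch below uses the regular expression quoted in the text; the function name and the placeholder keys are hypothetical, and the two Row Splitter stages are merged into a single chain of conditions.

```python
import re

INCHIKEY_PATTERN = re.compile(r"[a-zA-Z]{14}-[a-zA-Z]{10}-[a-zA-Z]{1}")


def classify(database_key, gold_key):
    """Return (test value, result label) for one database entry."""
    if not database_key or not INCHIKEY_PATTERN.fullmatch(database_key):
        return 2, "no entry"   # no valid InChIKey was retrieved
    if database_key == gold_key:
        return 1, "match"
    return 0, "no match"


# Placeholder keys for illustration only.
GOLD = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
OTHER = "CCCCCCCCCCCCCC-BBBBBBBBSA-N"

results = [classify(GOLD, GOLD), classify(OTHER, GOLD), classify("", GOLD)]
print(results)
```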

3.6.2 Comparing InChI related InChIKey from PharmXplorer versus gold-standard dataset

Overall, the workflow for comparing PharmXplorer with the gold-standard dataset was built similarly. The only difference was that the data sources were not merged through the generic names, due to the fact that the PharmXplorer data does not include English generic names. Instead, the Fifth Level ATC codes were used to link both data sources.

4 Results and Discussion

4.1 Results of consistency Test

A total of 1,031 drugs that met the inclusion criterion were found in DrugBank, ChemSpider, PubChem, and Wikipedia. To meet the inclusion criterion for the consistency test, an InChI & InChIKey had to exist in at least two of the four databases. The exact results of the consistency test are shown in Table 6. It should be noted that the correctness of the data was not taken into consideration here.

Database                                      inconsistent  consistent  Grand Total
ChemSpider + Wikipedia                        2             3           5
ChemSpider + DrugBank                         2             3           5
ChemSpider + DrugBank + Wikipedia             6             0           6
ChemSpider + PubChem                          31            38          69
ChemSpider + DrugBank + PubChem               15            69          84
ChemSpider + PubChem + Wikipedia              25            68          93
ChemSpider + DrugBank + PubChem + Wikipedia   132           637         769
Grand Total                                   213           818         1,031

Table 6 Result of consistency test in absolute numbers

Only seven different combinations of databases occurred in practice, although 11 combinations of at least two databases are possible. The worst result is the combination ChemSpider + DrugBank + Wikipedia, where 100% of the entries are inconsistent; however, only six drugs are listed. The best result was obtained when the InChI & InChIKey was present in all four databases: 637 entries were consistent and 132 inconsistent, which corresponds to a consistency of 82.83%. With n=769, this group is also more representative. Nevertheless, all cases should be considered together in order to obtain the total consistency and thus a more meaningful result. The total consistency chart is displayed in Figure 62.

Figure 62 Chart of total consistency (bar chart: inconsistent 20.66 %, n=213; consistent 79.34 %, n=818)

As can be seen from the chart, the total consistency is 79.34%, which means that for 20.66% of the drugs queried the InChIs from the different databases do not agree.
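
The percentages follow directly from the absolute numbers in Table 6. As a short check in Python (the helper name `share` is hypothetical):

```python
def share(part, total):
    """Percentage of part in total, rounded to two decimals."""
    return round(100 * part / total, 2)


consistent, inconsistent = 818, 213
total = consistent + inconsistent

total_consistency = share(consistent, total)
total_inconsistency = share(inconsistent, total)
all_four_consistency = share(637, 637 + 132)  # all four databases

print(total, total_consistency, total_inconsistency, all_four_consistency)
```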

4.2 Validation with gold-standard dataset

The gold-standard dataset includes 914 drugs in total. This is exactly 117 drugs fewer than in the consistency dataset, because no unambiguous InChI and InChIKey could be assigned to these drugs. One example is the drug reboxetine, which has two stereocenters and therefore four possible configurations (RR, SS, RS, SR). Only the RR and SS forms are approved. However, it is not possible without further effort to generate a single InChI that represents exactly these two configurations. The results of the comparison of the publicly accessible databases with the gold-standard dataset are shown in Table 7.

Database       match   no match   Total drugs
ChemSpider     859     56         915
PubChem        877     31         908
Wikipedia      777     60         837
DrugBank       783     32         815
PharmXplorer   554     106        660

Table 7 Result of the correctness of InChI on publicly available databases in relation to the gold-standard

As can be seen from Table 7, the total number of drugs obtained differs between the databases (in the range of 660 to 915). Only ChemSpider provides an InChI & InChIKey for every drug queried. It has to be noted that only ChemSpider and PubChem were queried with an API of the provider. For Wikipedia, the MediaWiki API was used. PharmXplorer was manually linked via ATC codes. No API was used for DrugBank because it is not freely available, so the records were linked using the DrugBank SDF file. Nevertheless, ChemSpider is the clear leader in this project regarding quantity. Quality is easier to assess from the percentage of matches; the chart in Figure 63 provides an overview.

Figure 63 Chart showing the correctness of InChI on publicly available databases in relation to the gold-standard dataset (stacked bar chart “% of Correct InChI & InChIKey”, match / no match: PubChem 96.59% / 3.41%; DrugBank 96.07% / 3.93%; ChemSpider 93.88% / 6.12%; Wikipedia 92.83% / 7.17%; PharmXplorer 83.94% / 16.06%)

As can be seen from the chart in Figure 63, PubChem ranks first in quality with a correctness of 96.59%. With a correctness of 83.94%, PharmXplorer ranks fifth and thus last. The other publicly accessible databases all reach a correctness of close to 93% or higher.
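
The ranking can be recomputed from the absolute numbers in Table 7. The following Python sketch, with hypothetical names, derives the correctness per database and sorts the databases accordingly:

```python
# (match, no match) per database, taken from Table 7.
TABLE_7 = {
    "ChemSpider": (859, 56),
    "PubChem": (877, 31),
    "Wikipedia": (777, 60),
    "DrugBank": (783, 32),
    "PharmXplorer": (554, 106),
}


def correctness(match, no_match):
    """Share of matching InChIKeys in percent, rounded to two decimals."""
    return round(100 * match / (match + no_match), 2)


percent = {db: correctness(*counts) for db, counts in TABLE_7.items()}
ranking = sorted(percent, key=percent.get, reverse=True)

for db in ranking:
    print(f"{db}: {percent[db]:.2f}%")
```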

4.3 Overview of Errors

Based on the comparison of the publicly available databases with the gold-standard dataset, all drugs that received the label “no match” were manually checked and examined to classify the type of error. The errors were assigned to the following main groups:

• Charge
• Drug preparations (e.g. ester, or water)
• Salt
• Stereochemistry
• Wrong structure

Error type         ChemSpider  DrugBank  PharmXplorer  PubChem  Wikipedia  Total error
Charge             1           1         3             1        0          7
Drug preparation   5           1         0             2        5          13
Salt               4           0         1             0        6          11
Stereochemistry    33          27        68            23       29         178
Wrong structure    13          3         34            4        20         75
Total error        56          32        106           31       60         285

Table 8 Overview of error type

Table 8 lists the errors according to their class. The pie chart in Figure 64 shows which mistakes are most common.

Figure 64 Error categories by percentage (pie chart: Charge 2.11%, Drug preparation 4.56%, Salt 3.86%, Stereochemistry 63.51%, Wrong structure 25.96%)

The pie chart in Figure 64 shows that by far the biggest source of error is stereochemistry with a share of 63.51%, followed by the category wrong structure with 25.96%. The smallest share, 2.11%, belongs to the error category charge.

5 Conclusion and Outlook

A very important outcome of this research is that, if generic names of drugs approved in Austria are used to query the related InChI & InChIKey from DrugBank, ChemSpider, PubChem, and Wikipedia, the total consistency of the related InChI & InChIKey is 79.34%. Conversely, this means that over 20% of InChIs do not match, which is certainly more than expected. Furthermore, it is shown that the publicly accessible databases perform considerably better than PharmXplorer regarding quality. The validation with the gold-standard dataset shows that PubChem, with a correctness of 96.59%, is 12.65 percentage points more accurate than PharmXplorer, with a correctness of 83.94%. The latter figure refers only to the examined 2D structures in PharmXplorer, as indicated in section 3.2.3. A comparison with the 3D structures in PharmXplorer would most probably have given a significantly better outcome. In order to validate the data, a gold-standard dataset had to be created in advance. To a large extent it was created automatically, namely whenever the InChI of a drug was completely identical in at least three databases. It must be noted that the databases may ultimately take over data from each other, so that mistakes creep in. One special case in which this occurred is the structure of flupentixol. Flupentixol was added to the gold-standard dataset automatically in its cis-form because it was identical in all four databases (ChemSpider, DrugBank, PubChem & Wikipedia). The cis-form is probably the more active form, but the form originally patented was the trans-form of flupentixol. This example shows that identical systematic errors can occur across databases. The case of flupentixol was only discovered because PharmXplorer displayed the trans-form.

Nevertheless, in the author’s view, this is a good approach to validating data. If, for example, the InChIs do not agree across all the databases queried, it is very likely that at least one database contains an error, and the entry can therefore be subjected to manual control. However, reaching a higher level of accuracy still requires human review. The author has tried his best to check the data for accuracy, but he has probably not reached 100% correctness in the gold-standard dataset, because mistakes can occur wherever people work.

As a product of this project, a list of incorrect drug entries with specified error type and correct InChI & InChIKey was created for ChemSpider, DrugBank, PharmXplorer, PubChem, and Wikipedia.

Another side-product of the project is the gold-standard dataset of drugs approved in Austria. This gold-standard dataset could be used in the future to develop a chemistry learning app. For example, it is possible to generate SVG images from the InChIs in the dataset.

Finally, a method has been developed that could be used in the future to pre-select entries for manual verification in even larger datasets.

6 References

1. Williams, A.J. and S. Ekins, A quality alert and call for improved curation of public chemistry databases. Drug Discov Today, 2011. 16(17-18): p. 747-50. 2. Akhondi, S.A., J.A. Kors, and S. Muresan, Consistency of systematic chemical identifiers within and between small-molecule databases. J Cheminform, 2012. 4(1): p. 35. 3. Williams, A.J., S. Ekins, and V. Tkachenko, Towards a gold standard: regarding quality in public domain chemistry databases and approaches to improving the situation. Drug Discov Today, 2012. 17(13-14): p. 685-701. 4. Kunder, M.d. The size of the World Wide Web (The Internet). 21.02.2018]; Available from: http://www.worldwidewebsize.com/. 5. WHO Collaborating Centre for Drug Statistics Methodology. ATC/DDD Index 2018. 2018 23.02.2018]; Available from: https://www.whocc.no/atc_ddd_index/. 6. World Health Organization, Introduction to Drug Utilization Research. 2003. 7. Dalby, A., et al., Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited. Journal of Chemical Information and Computer Sciences, 1992. 32(3): p. 244-255. 8. Wikipedia. Molfile of L-Alanine. 2018 02.07.2018; Available from: https://en.wikipedia.org/w/index.php?title=Chemical_table_file&oldid=8481246 64. 9. Weininger, D., SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 1988. 28(1): p. 31-36. 10. Inteligand. LigandScout. 2018 [cited 2018 08.08.2018]; Available from: http://www.inteligand.com/ligandscout/. 11. Wikimedia Commons contributors. SMILES.png. 2018 4 July 2018 09:47 UTC]; Available from: https://commons.wikimedia.org/w/index.php?title=File:SMILES.png&oldid=245 273554. 12. Craig A. James. Writing SMILES: Normalizations. 06.07.2018]; Available from: http://opensmiles.org/spec/open-smiles-4-output.html. 13. 
O'Boyle, N.M., Towards a Universal SMILES representation - A standard method to generate canonical SMILES based on the InChI. J Cheminform, 2012. 4(1): p. 22. 14. Wikipedia. Simplified molecular-input line-entry system. [cited 2018 06.07.2018]; Available from: https://en.wikipedia.org/w/index.php?title=Simplified_molecular-input_line- entry_system&oldid=844937808. 15. IUPAC - International Union of Pure and Applied Chemistry. IUPAC 2018 [cited 2018 19.03.2018]; Available from: https://iupac.org/. 16. NIST. National Institute of Standards and Technology: Home. [cited 2018 11.07.2018]; Available from: https://www.nist.gov/. 17. InChI-Trust. InChI Trust: Home. 11.07.2018]; Available from: https://www.inchi- trust.org/. 18. Warr, W.A., Many InChIs and quite some feat. J Comput Aided Mol Des, 2015. 29(8): p. 681-94.

79

19. InChI-Trust. Technical-Faq. [cited 13.07.2018]; Available from: https://www.inchi-trust.org/technical-faq/#4.4.
20. Heller, S., et al., InChI - the worldwide chemical structure identifier standard. J Cheminform, 2013. 5(1): p. 7.
21. Pence, H.E. and A. Williams, ChemSpider: An Online Chemical Information Resource. Journal of Chemical Education, 2010. 87(11): p. 1123-1124.
22. ChemSpider. What is ChemSpider? [cited 24.07.2018]; Available from: http://www.chemspider.com/AboutUs.aspx.
23. ChemSpider. Amlodipine. [cited 24.07.2018]; Available from: http://www.chemspider.com/Chemical-Structure.2077.html.
24. Wishart, D.S., et al., DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res, 2018. 46(D1): p. D1074-D1082.
25. DrugBank. Amlodipine. [cited 26.07.2018]; Available from: https://www.drugbank.ca/drugs/DB00381.
26. Norbert Haider. Diplomstudium Pharmazie: großflächiger Einsatz von Blended Learning in der Studieneingangsphase. 2007; Available from: http://merian.pch.univie.ac.at/elearning/.
27. Schweiger, K., et al., PharmXplorer. Zeitschrift für Hochschulentwicklung, ZFHD 01, 2011. DOI: 10.3217/zfhd01/04.
28. Kim, S., et al., PubChem Substance and Compound databases. Nucleic Acids Res, 2016. 44(D1): p. D1202-13.
29. PubChem. Amlodipine. [cited 30.06.]; Available from: https://pubchem.ncbi.nlm.nih.gov/compound/2162.
30. Kim, S., et al., PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem. Nucleic Acids Res, 2015. 43(W1): p. W605-11.
31. Wikipedia. 2018 [cited 21.02.2018]; Available from: https://en.wikipedia.org/w/index.php?title=Wikipedia&oldid=826715619.
32. Google com. Google.com. 2018 [cited 21.02.2018]; Available from: https://www.google.com/.
33. Google UK. Google. 2018 [cited 21.02.2018]; Available from: https://www.google.co.uk.
34. Yahoo. Yahoo. 2018 [cited 21.02.2018]; Available from: https://www.yahoo.com/.
35. MSN. MSN. 2018 [cited 21.02.2018]; Available from: https://www.msn.com.
36. Laurent, M.R. and T.J. Vickers, Seeking health information online: does Wikipedia matter? J Am Med Inform Assoc, 2009. 16(4): p. 471-9.
37. Bing. Bing. 2018 [cited 22.02.2018]; Available from: https://www.bing.com/.
38. Law, M.R., B. Mintzes, and S.G. Morgan, The sources and popularity of online drug information: an analysis of top search engine results and web page views. Ann Pharmacother, 2011. 45(3): p. 350-6.
39. Judd, T. and G. Kennedy, Expediency-based practice? Medical students' reliance on Google and Wikipedia for biomedical inquiries. British Journal of Educational Technology, 2011. 42(2): p. 351-360.
40. Brokowski, L. and A.H. Sheehan, Evaluation of pharmacist use and perception of Wikipedia as a drug information resource. Ann Pharmacother, 2009. 43(11): p. 1912-3.
41. Mesgari, M., et al., “The sum of all human knowledge”: A systematic review of scholarly research on the content of Wikipedia. Journal of the Association for Information Science and Technology, 2015. 66(2): p. 219-245.


42. Wikipedia. List of Wikipedias. 2018 [cited 21.02.2018]; Available from: https://en.wikipedia.org/w/index.php?title=List_of_Wikipedias&oldid=826370213.
43. U.S. Food & Drug Administration. MedWatch: The FDA Safety Information and Adverse Event Reporting Program. 2018 [cited 21.02.2018]; Available from: https://www.fda.gov/Safety/MedWatch/default.htm.
44. Hyman, W.A., Finding Recalls. Biomedical Safety & Standards, 2015. 45(7): p. 49-50.
45. Koppen, L., J. Phillips, and R. Papageorgiou, Analysis of reference sources used in drug-related Wikipedia articles. J Med Libr Assoc, 2015. 103(3): p. 140-4.
46. Kraenbring, J., et al., Accuracy and completeness of drug information in Wikipedia: a comparison with standard textbooks of pharmacology. PLoS One, 2014. 9(9): p. e106930.
47. Wikipedia. Wikipedia Infobox_drug. [cited 21.02.2018]; Available from: https://en.wikipedia.org/wiki/Template:Infobox_drug.
48. MediaWiki. MediaWiki. 2018 [cited 22.02.2018]; Available from: http://www.mediawiki.org.
49. KNIME AG. KNIME Open Source Story. [cited 11.2.2018]; Available from: https://www.knime.com/knime-open-source-story.
50. Berthold, M.R., et al., KNIME - the Konstanz information miner: version 2.0 and beyond. SIGKDD Explor. Newsl., 2009. 11(1): p. 26-31.
51. Mark Mc Mahon and contributors. Pywinauto. 2016 [cited 20.03.2018]; Available from: https://pywinauto.github.io/.
52. KNIME AG. Metanode. [cited 12.2.2018]; Available from: https://www.knime.com/metanodes.
53. KNIME AG. HTTP Nodes. [cited 13.02.2018]; Available from: https://www.knime.com/book/http-nodes.
54. ACD/Labs. ACD/ChemSketch. Available from: http://www.acdlabs.com/resources/freeware/chemsketch/.
55. ChemSpider. GenerateInChIKey. 2018 [cited 08.03.2018]; Available from: https://www.ChemSpider.com/InChI.asmx?op=GenerateInChIKey.


7 Appendix

7.1 List of Abbreviations

API      Application Programming Interface
ATC      Anatomical Therapeutic Chemical
FDA      Food and Drug Administration
GUI      Graphical User Interface
InChI    International Chemical Identifier
INN      International Nonproprietary Name
IUPAC    International Union of Pure and Applied Chemistry
JSON     JavaScript Object Notation
SMILES   Simplified Molecular Input Line Entry System
URI      Uniform Resource Identifier


7.2 Scripts

7.2.1 Python script for batch conversion of .skc files to Mol files

    import doctest
    import os
    from time import sleep

    from pywinauto import application
    from pywinauto.timings import Timings

    # Slow down the automation so that ACD/ChemSketch can keep up with the input.
    Timings.Slow()
    Timings.after_sendkeys_key_wait = 0.015
    Timings.after_menu_wait = 0.1
    Timings.after_click_wait = 0.2
    Timings.after_closeclick_wait = 0.2

    path_to_files = "C:\\Pharmhaider\\skcToMol\\input\\"
    path_to_output = "C:\\Pharmhaider\\skcToMol\\output\\"


    def escape(filename):
        """Escape parentheses so that TypeKeys types them literally.

        >>> escape("(3-ethoxypropyl)mercury bromide.skc")
        '{(}3-ethoxypropyl{)}mercury bromide.skc'
        """
        return filename.replace("(", "{(}").replace(")", "{)}")


    if __name__ == "__main__":
        doctest.testmod()

        filenames = os.listdir(path_to_files)

        # Attach to the ACD/ChemSketch instance that is already running;
        # 2404 is the process ID of that session and must be adapted.
        app = application.Application().connect(process=2404)

        # Set to True to skip successfully converted files; conversion then
        # resumes at the file whose name ends with "ContinueHere.skc".
        dont_convert = False
        for filename in filenames:
            if filename.endswith("ContinueHere.skc"):
                dont_convert = False
            if dont_convert:
                continue

            # Open the Import dialog from the main window (Alt+F, I).
            window = app.window_(title_re=r"ACD/ChemSketch.*\.sk2")
            window.TypeKeys("%fi")

            # Enter the path of the .skc file and confirm.
            window2 = app.window_(title_re="Import")
            window2.TypeKeys("%n")
            filename1 = path_to_files + escape(filename)
            window2.TypeKeys(filename1 + "%n", with_spaces=True)
            window2.TypeKeys("^c")
            window2.TypeKeys("{ENTER}")

            # Open the Export dialog (Alt+F, E) and save the structure as .mol.
            sleep(0.2)
            window.TypeKeys("%f" + "e")
            window3 = app.window_(title_re="Export")
            window3.TypeKeys(path_to_output + escape(filename) + ".mol")
            window3.TypeKeys("{ENTER}")

            # Give ChemSketch time to finish before the next file is processed.
            sleep(0.5)
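The escape() function in the script protects only parentheses. File names of chemical compounds can also contain other characters that the pywinauto key-sequence syntax treats as control characters. The following is a minimal sketch of a more general helper, assuming that the characters + ^ % ~ ( ) { } are the ones with special meaning in that syntax. The names escape_keys and SPECIAL_KEYS are hypothetical and are not part of the script used in this thesis.

```python
# Hypothetical generalisation of escape(); not part of the original script.
# Assumption: these characters are interpreted by the pywinauto key-sequence
# syntax (modifier keys, the ENTER shortcut, grouping and brace codes).
SPECIAL_KEYS = set("+^%~(){}")


def escape_keys(text):
    """Wrap every special character in braces so that it is typed literally."""
    return "".join("{" + ch + "}" if ch in SPECIAL_KEYS else ch for ch in text)


if __name__ == "__main__":
    # Illustrative file name only; it does not stem from the PharmXplorer data.
    print(escape_keys("(+)-camphor 50%.skc"))
```

For names that contain only parentheses, this helper returns the same result as escape(), so it could be substituted without changing the behaviour for such files.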
