Machine learning techniques applied to chemotaxonomy and ligand-based virtual screening

Dissertation

Dimitar P. Hristozov

2007


Submitted to the Faculties of Natural Sciences of the Friedrich-Alexander-Universität Erlangen-Nürnberg for the award of the doctoral degree

by Dimitar P. Hristozov from Plovdiv, Bulgaria

Accepted as a dissertation by the Faculties of Natural Sciences of the Universität Erlangen-Nürnberg

Date of the oral examination: 27 September 2007

Chair of the doctoral committee: Prof. Dr. E. Bänsch

First reviewer: Prof. Dr. J. Gasteiger

Second reviewer: Prof. Dr. T. Clark

This work would not have been possible without the help, support, and advice of my supervisor, Prof. Dr. Johann Gasteiger, whom I would like to thank for giving me the opportunity to work on this fascinating subject and for the support and inspiring discussions during the years in his group.

I would like to thank the coauthors of the publications originating from this work. Prof. Dr. Fernando Batista Da Costa from the University of São Paulo, Ribeirão Preto, SP, Brazil: I would need a whole page to say everything I want, so I will only say thank you very much for everything, starting with science and ending with beer (or vice versa). Prof. Dr. Tudor I. Oprea from the University of New Mexico, Albuquerque, USA, for the excellent scientific collaboration and for his support and kindness during my stay in Albuquerque.

Furthermore, my special thanks go to Dr. Simon Spycher, Dr. Eric Pellegrini, Dr. Duŝica Vidović, Vladimir Sykora, Maria Ester Camargo and Jörg Marusczyk for many inspiring scientific discussions and for the nice time we spent together.

Many thanks to the MOSES development team – Dr. Thomas Kleinöder, Dr. Achim Herwig, and Jörg Marusczyk for the nice system they have created and for the countless hours they have spent helping me to use it.

Additionally, I would like to thank all my present and former colleagues in this group for the nice working atmosphere and their valuable help with numerous scientific and administrative tasks.

I am also grateful to our secretaries Ulrike Scholz, Karin Holzke and Carolin Hidalgo – without their help I would have been lost in the administrative jungle.

I would like to express my special thanks to Prof. Dr. Georgi Andreev, Dr. Veselin Kmetov, Dr. Violeta Stefanova, and Deyana Georgieva from the University of Plovdiv, Bulgaria. Without them my involvement with science would have been much harder and much less fun. Many thanks to Prof. Dr. Antonio Canals from the University of Alicante, Spain, for the exceptional time I spent in his group. Many thanks also to Dr. Jean-Pierre Kocher from the Mayo Clinic, USA, for his valuable advice.

A big thank you to my family – my mother Mariya, my father Panayot, my brother Ljubomir, his wife Rosalina, and my grandparents Gana and Ljuben. Without your endless support I would have been lost. I love you all.

And last but not least, a big, big thank you to my wife Mariya. There is not enough space to list even half of the reasons. I am lucky to have you in my life. I love you.

Contents

1 INTRODUCTION
  1.1 MOTIVATION
  1.2 SCIENTIFIC BACKGROUND
    1.2.1 Machine learning
    1.2.2 Chemotaxonomy
    1.2.3 Virtual screening
  1.3 OBJECTIVES AND STRUCTURE OF THIS WORK
  1.4 REFERENCES

2 SESQUITERPENE LACTONES-BASED CLASSIFICATION OF THE FAMILY ASTERACEAE USING NEURAL NETWORKS AND K-NEAREST NEIGHBORS
  OVERVIEW
  ORIGINAL ARTICLE
  2.1 INTRODUCTION
  2.2 MATERIALS AND METHODS
    2.2.1 Data sets
    2.2.2 Structure representation
    2.2.3 Classification methods
    2.2.4 Model validation
    2.2.5 Determination of Prediction Space
  2.3 RESULTS
    2.3.1 CPG neural network models
    2.3.2 k-NN models
    2.3.3 Comparison between the obtained models
    2.3.4 Predictions on the second test set
    2.3.5 Defining the prediction space
  2.4 DISCUSSION
    2.4.1 Classification methods
    2.4.2 Structural descriptors
    2.4.3 Applicability
    2.4.4 Chemotaxonomic analysis and majority vote
    2.4.5 Prediction space
  2.5 CONCLUSIONS
  2.6 ACKNOWLEDGMENTS
  2.7 REFERENCES
  FURTHER COMMENTS AND DISCUSSION

3 MULTI-LABELED CLASSIFICATION APPROACH TO FIND A SOURCE FOR TERPENOIDS
  OVERVIEW
  ORIGINAL ARTICLE
  3.1 INTRODUCTION
    3.1.1 Multi-labeled classification
    3.1.2 Literature on multi-labeled classification
    3.1.3 Assessing the performance of a multi-labeled classification
  3.2 MATERIALS AND METHODS
    3.2.1 Data sets
    3.2.2 Structure representation
    3.2.3 Classification methods
    3.2.4 Model validation and performance measures
  3.3 RESULTS
    3.3.1 Cross-training SVM (ct-SVM) models
    3.3.2 ML-kNN models
    3.3.3 Measures based on the label rankings
    3.3.4 Comparison with single-labeled classifier
  3.4 DISCUSSION
    3.4.1 Comparison of the classification methods
    3.4.2 Comparison between the P-, T-, and C-criterion
    3.4.3 Analyses of the ct-SVM results under C-criterion
  3.5 CONCLUSIONS
  3.6 ACKNOWLEDGMENTS
  3.7 REFERENCES
  FURTHER COMMENTS AND DISCUSSION

4 LIGAND-BASED VIRTUAL SCREENING BY NOVELTY DETECTION WITH SELF-ORGANIZING MAPS
  OVERVIEW
  ORIGINAL ARTICLE
  4.1 INTRODUCTION
  4.2 MATERIALS AND METHODS
    4.2.1 Data sets
    4.2.2 Chemical structure representation
    4.2.3 Virtual screening methods
    4.2.4 Performance measures
    4.2.5 Validation
  4.3 RESULTS AND DISCUSSION
    4.3.1 Training set selection
    4.3.2 Validation
    4.3.3 Method comparison
    4.3.4 Methods complimentary
    4.3.5 Scaffold analysis
    4.3.6 Rejection rates
    4.3.7 Multiple structure representations with Mahalanobis distance
    4.3.8 Comparison at different ranks
  4.4 CONCLUSIONS
  4.5 ACKNOWLEDGMENTS
  4.6 REFERENCES
  FURTHER COMMENTS AND DISCUSSION

5 VIRTUAL SCREENING APPLICATIONS – A STUDY OF LIGAND-BASED METHODS AND DESCRIPTORS IN FOUR DIFFERENT SCENARIOS
  OVERVIEW
  ORIGINAL ARTICLE
  5.1 INTRODUCTION
    5.1.1 Scenario 1: Prioritizing compounds for a subsequent HTS (SC.1)
    5.1.2 Scenario 2: Selecting compounds for a subsequent lead-optimization (SC.2)
    5.1.3 Scenario 3: Is a given compound active? (SC.3)
    5.1.4 Scenario 4: Identification of the most active compound (SC.4)
  5.2 MATERIALS AND METHODS
    5.2.1 Chemical databases
    5.2.2 Biological targets
    5.2.3 Virtual screening protocol
    5.2.4 Assessing the performance
    5.2.5 Virtual screening methods
    5.2.6 Structure representation
  5.3 RESULTS AND DISCUSSION
    5.3.1 Scenario 1: Prioritizing compounds for a subsequent HTS
    5.3.2 Scenario 2: Selecting compounds for a subsequent lead-optimization
    5.3.3 Scenario 3: Is a given compound active?
    5.3.4 Scenario 4: Identification of the most active compound
  5.4 CONCLUSIONS
  5.5 REFERENCES

6 CONCLUSION AND OUTLOOK

7 SUMMARY

8 ZUSAMMENFASSUNG

APPENDIX A. PUBLICATIONS

APPENDIX B. LEBENSLAUF


1 Introduction

1.1 Motivation

Today we are overwhelmed with data. The amount of data we are confronted with is rapidly increasing and there is no end in sight. It has been estimated that the amount of data stored in the world’s databases doubles approximately every twenty months.1 However, there is a growing gap between the generation of data and our ability to understand them. The amount of data is constantly increasing but the proportion of it that people can handle decreases rapidly. Lying hidden in these data is information, potential knowledge that is hard to find.

Knowledge, on the other hand, has become a vital part of the modern economy. Luis Suarez-Villa, in his book “Invention and the Rise of Technocapitalism”, argues that:

“Technocapitalism is a new form of market capitalism that is rooted in technological invention and innovation. It can be considered an emerging era, now in its early stage, that is supported by such intangibles as creativity and knowledge. Intangibles are at the core of technocapitalism. Creativity and knowledge are to technocapitalism what tangible raw materials, factory labor and capital were to industrial capitalism.”2

The use of the vast amounts of data in a way that leads to knowledge is vital for any scientific discipline and for any modern company. Around 60% of total U.S. employment in 2000 consisted of information workers.3 However, even a highly skilled professional may discover only a very limited part of the knowledge hidden amongst millions of data records. A novel scientific field named machine learning has emerged in an attempt to bridge the gap between the amount of data and the human ability to comprehend them. The major focus of machine learning research is to extract information and knowledge from data automatically by computational and statistical methods. However, human intuition cannot be entirely eliminated. The designer of the system must specify how the data are to be represented and what mechanisms will be used to search for knowledge in the data, and it is ultimately a human who will make use of the discovered knowledge.

Chemistry has generated vast amounts of data in recent years. The Google directory of chemical databases4 alone lists 69 different sources of chemical information. Just one of these sources, the Chemical Abstracts Service (CAS) REGISTRY™ database, “contains more than 31 million organic and inorganic substances and over 59 million sequences and is updated daily”, according to the CAS website.5 Proprietary databases containing millions of chemical compounds are nowadays common in the pharmaceutical industry.6 Thus, the application of the advances made in the field of machine learning to chemistry-related problems is highly relevant. In fact, a whole field, known as chemoinformatics, has evolved in the past few years, focusing on the use of computer and information techniques applied to a range of problems in chemistry. The enormous breadth and the exciting prospects which this field offers are reflected in the “Handbook of Chemoinformatics”.7 In addition, teaching of chemoinformatics has become important. Chemoinformatics is being introduced as a discipline in many universities and a comprehensive textbook8 is available.

In the rest of this introduction, a general overview of the field of machine learning and of some of the methods it offers for discovering knowledge from data is given first. Afterwards, the field of chemotaxonomy, which attempts to relate the taxonomic classification of living species to their biochemistry, is introduced. The application of different classification techniques to gain understanding of the relationships between the taxonomic classification and the secondary metabolism of plants, and to facilitate the search for sources of desired natural compounds, is the subject of Chapter 2 and Chapter 3 of this work. Finally, an overview of virtual screening, a technique in daily use in the pharmaceutical industry, is presented. Virtual screening provides an example where machine learning techniques help in navigating through vast amounts of chemical data and is the subject of Chapter 4 and Chapter 5 of this work.

1.2 Scientific background

1.2.1 Machine learning

Machine learning is concerned with the design and development of algorithms and techniques that allow computers to "learn". The word “learn” is put into quotation marks because it is hard to prove that a machine can actually learn in the sense a human does.1 However, since the term machine learning is widely accepted we will avoid this rather philosophical discussion.

The major focus of machine learning research is to extract information from data automatically by computational and statistical methods. There are, generally speaking, two broad classes of machine learning algorithms, termed supervised and unsupervised learning.

Supervised learning aims at creating a function from training data. The training data consist of pairs of input objects (typically vectors), and known outputs. The output of the function can be a continuous value (called regression), or can predict a class label of the input object (called classification). The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples (i.e. pairs of input and target output). To achieve this, the learner has to generalize from the presented data to unseen situations. Chapter 2 and Chapter 3 of this thesis make use of this type of machine learning in an attempt to relate the secondary metabolism of plants to their taxonomic classification.
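As an illustration of the classification case, the following is a minimal sketch of a k-nearest-neighbors classifier, one of the supervised methods used later in this work. The toy vectors, the class labels "A" and "B", the function name `knn_predict`, and the use of Euclidean distance are assumptions made for this example only and are not taken from the studies described here.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Predict a class label for `query` by majority vote among
    the k training vectors closest to it (Euclidean distance)."""
    # `train` is a list of (input vector, known class label) pairs.
    neighbors = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training examples: pairs of input vector and target output.
train = [
    ((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
    ((1.0, 1.0), "B"), ((0.9, 1.1), "B"), ((1.1, 0.9), "B"),
]

# The learner generalizes to an input it has not seen before.
prediction = knn_predict(train, (0.15, 0.15), k=3)
```

Here the training pairs play the role of the known input-output examples, and the call on an unseen vector corresponds to the generalization step described above.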

Unsupervised learning is a method in which a model is fit to observations. In contrast to the supervised learning there is no a priori known output. Thus, learning in this case is basically an optimization process. Perhaps the most common form of unsupervised learning are different types of clustering. In hierarchical clustering, for example, the system is presented only with a set of objects and two distances: between two objects and between two groups of objects. The system “learns” how to organize the objects into groups and the groups into hierarchy using only the above conditions.9 Unsupervised learning is also useful for data visualization.10,11 Chapter 4 and Chapter 5 of this thesis make use of unsupervised learning in an attempt to find new chemical compounds, which may have a given biological activity.
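The two distances required by hierarchical clustering can be made concrete in a short sketch. It assumes Euclidean distance between two objects and single linkage (the smallest pairwise distance) between two groups; both choices, as well as the function names and the toy points, are illustrative assumptions and are not prescribed by the text.

```python
from itertools import combinations
from math import dist


def single_linkage(group_a, group_b):
    """Distance between two groups: smallest distance between any two members."""
    return min(dist(a, b) for a in group_a for b in group_b)


def agglomerate(points, group_distance=single_linkage):
    """Repeatedly merge the two closest groups and return the merge history,
    which describes the resulting hierarchy from the bottom up."""
    groups = [(p,) for p in points]  # every object starts as its own group
    merges = []
    while len(groups) > 1:
        # Find the pair of groups with the smallest group distance.
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda ij: group_distance(groups[ij[0]], groups[ij[1]]),
        )
        merges.append((groups[i], groups[j]))
        merged = groups[i] + groups[j]
        groups = [g for idx, g in enumerate(groups) if idx not in (i, j)]
        groups.append(merged)
    return merges


# Hypothetical objects: two tight pairs that lie far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
history = agglomerate(points)
```

No output values are supplied at any point; the hierarchy emerges solely from the objects and the two distance definitions.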

Finally, the discovery of new, otherwise inaccessible knowledge is not limited to the use of large databases. Often by using relatively small amounts of data together with the appropriate machine learning technique useful relationships can be discovered and described in intuitive terms. In addition, such models usually can be used to predict the behavior of yet unseen data – an ability, which is widely exploited in chemoinformatics.

1.2.2 Chemotaxonomy

In chemistry, the data more often than not consist of chemical compounds. A large part of chemoinformatics is devoted to discovering relationships between the chemical structure and a given property (biological activity, octanol-water partitioning constants, etc.). Such relationships are generally termed structure-property relationships (SPR) and, when they exist, can provide a very fast and economical way of calculating the values of properties which would otherwise require extensive and expensive laboratory measurements.

Predicting properties of chemical compounds is not the only area in which machine learning techniques can be used. Recently, considerable effort has been devoted to understanding complex biological systems. In fact, the “sister” of chemoinformatics, a field called bioinformatics, is gaining more and more popularity and has been claimed2 to be one of the hallmarks of the twenty-first century. According to the National Institutes of Health (NIH),12 bioinformatics is defined as “research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data”. It tries to shed light on how genes express proteins and regulate biochemical pathways. Although chemistry plays a major role in any biological system, the relationship between chemo- and bioinformatics is frequently overlooked.

Chemotaxonomy is defined as “the classification or of organisms based on differences and similarities in biochemistry” and clearly stands on the border between chemo- and bioinformatics. The ultimate goal is the classification of biological organisms, while the actual data used to arrive at this classification consist of chemical structures and reactions.

Chemotaxonomy can be subdivided into two primary approaches, macro- and micromolecular.13

The macromolecular approach involves techniques such as allozyme electrophoresis and amino-acid or deoxyribonucleic acid (DNA) sequencing. This approach is gaining more and more importance with recent technological advances, which have made the application of some of these techniques routine.

The micromolecular approach is based on secondary metabolites that often have restricted occurrence amongst plants and, therefore, are of taxonomic value. Amongst the useful secondary metabolites are phenolics, terpenoids, alkaloids, and glucosinolates. Taxonomically useful secondary metabolites are usually rather large molecules which allow a diversity of molecular types that may be of limited occurrence. More complex molecules which require many steps for their biosynthesis are likely to have limited distribution and, therefore, to be of greatest taxonomic value.13,14 Chart 1.1 shows an example of one such class of secondary metabolites, the sesquiterpene lactones. The biosynthesis of the sesquiterpene lactone (+)-costunolide in chicory, proposed in reference 15, is shown. (+)-Costunolide is generally accepted15,16 as the parent compound of the three major classes of sesquiterpene lactones (guaianolides, germacranolides, and eudesmanolides), whose skeletons are shown in Chart 1.1 as well.

Chart 1.1 Biosynthetic route for (+)-costunolide, an intermediate in sesquiterpene lactone biosynthesis. I, Cyclization of farnesyl diphosphate (1) to (+)-germacrene A (2) by (+)-germacrene A synthase (a sesquiterpene synthase). II, Hydroxylation of the isopropenyl side chain by (+)-germacrene A hydroxylase, a cytochrome P450 enzyme. III, Oxidation of germacra-1(10),4,11(13)-trien-12-ol (3) via germacra-1(10),4,11(13)-trien-12-al (4) to germacra-1(10),4,11(13)-trien-12-oic acid (5) catalyzed by NAD(P)+-dependent dehydrogenase(s). IV, Postulated hydroxylation at the C6-position of germacratrien-12-oic acid (5) and subsequent lactonization will yield (+)-costunolide (6). The skeletons of the three major STL groups accessible from (+)-costunolide (6) - guaianolides, germacranolides, and eudesmanolides are shown as well. Adapted from reference 15.

[Chart 1.1: reaction scheme leading from farnesyl diphosphate (1) via (+)-germacrene A (2) and germacrene acid (5) to (+)-costunolide (6), together with the guaianolide, germacranolide, and eudesmanolide skeletons; structures not reproduced.]

The application of sesquiterpene lactones to a chemotaxonomic study of the plant family Asteraceae is the subject of Chapter 2 and Chapter 3 of this work.

1.2.3 Virtual screening

Over the past decade, high-throughput screening (HTS) has become a cornerstone technology of pharmaceutical research.6,17 Through a combination of modern robotics, data processing and control software, liquid handling devices, and sensitive detectors, HTS allows a researcher to effectively conduct millions of biochemical, genetic or pharmacological tests in a short period of time. This process allows the rapid identification of compounds which exhibit biological activity. The results of HTS experiments provide starting points for drug design and for understanding the interaction or role of a particular biochemical process in biology.

Screening of one million compounds per target was predicted to become a standard for the major pharmaceutical companies as early as 2003.17 HTS robots currently exist which can test up to 100,000 compounds per day.18 However, despite increased research and development budgets and large investments in HTS technologies, the number of new drugs introduced per year has remained constant at best over the past years.19 Thus, there is an effort to replace the brute-force approach of HTS with a knowledge-based approach in which fewer, but “smarter”, experiments are performed.

Various computational approaches are available to complement the array of high-throughput discovery technologies, and among these, virtual screening is one of the most popular. In essence, virtual screening methods are designed for searching large compound databases with the help of a computer (often termed in silico screening) and selecting a limited number of candidate molecules for testing, in order to identify novel chemical entities that have the desired biological activity.20,21

The main methods for virtual screening are protein-structure-based compound screening or docking22 and chemical-similarity searching based on small molecules.23

The first approach requires the 3D structure of the target protein and is known as structure-based virtual screening. A detailed discussion of the multitude of available docking algorithms24 utilized for structure-based virtual screening is beyond the scope of this work. In summary, the docking process involves the prediction of ligand conformation and orientation within the binding site of the target. There are two aims of docking studies: accurate structural modeling and correct prediction of activity. A detailed review of docking and scoring in virtual screening can be found in reference 24.

The second approach is called ligand-based virtual screening. It requires one or more small molecules (ligands) that are known to be active against the target protein and is employed when the 3D structure of the target protein is unknown. However, even when the structure of the target protein is available, ligand-based screening is still useful, because hit or lead information is still the predominant source of data in many cases.

There are many different methods for performing ligand-based virtual screening. When a single active structure is available, similarity searching23 is the preferred approach, and popular tools include various two-dimensional (2D) and three-dimensional (3D) structural queries, pharmacophore models, 2D and 3D fingerprints, volume-matching techniques and complex molecular descriptors. Some of these are schematically presented in Figure 1.1.

[Figure 1.1: template molecule with panels labeled docking, 2D representation, substructure, volume/surface matching, and 3D pharmacophore; image not reproduced.]

Figure 1.1 A known acetylcholine esterase inhibitor is shown as a template molecule for virtual screening. For single templates, similarity search is a preferred approach, and some of the used approaches include various two-dimensional (2D) and three-dimensional (3D) structural queries, pharmacophore models, volume and shape matching techniques, and complex structural descriptors. Adapted from reference 17.
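Similarity searching with a single template can be sketched in a few lines. The example below represents each binary fingerprint as the set of its on-bit positions and ranks database entries by the Tanimoto coefficient. The choice of the Tanimoto coefficient, the compound names, the bit positions, and the function names are assumptions made for illustration; the text above does not prescribe a particular similarity measure.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two binary fingerprints, each given as the
    set of bit positions that are switched on."""
    shared = len(fp_a & fp_b)
    total = len(fp_a | fp_b)
    return shared / total if total else 0.0


def similarity_search(query_fp, database, top_n=2):
    """Rank database entries by decreasing similarity to the query
    and return the top_n (score, name) pairs."""
    scored = [(tanimoto(query_fp, fp), name) for name, fp in database.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_n]


# Hypothetical fingerprints; real ones typically have hundreds of bits.
query = {1, 4, 7, 9}
database = {
    "cmpd_a": {1, 4, 7, 9, 12},  # 4 shared bits out of 5 distinct bits
    "cmpd_b": {1, 4, 20},        # 2 shared bits out of 5 distinct bits
    "cmpd_c": {30, 31},          # no shared bits
}
hits = similarity_search(query, database)
```

The top-ranked entries are the candidates that would be selected for testing in a virtual screening run based on a single active template.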


Other methods, including almost all machine learning techniques, require multiple molecules as input (such as a series of inhibitors). These approaches include 2D or 3D quantitative structure-activity relationship (QSAR) models and diverse clustering and partitioning methods and are schematically presented in Figure 1.2. Chapter 4 and Chapter 5 of this work focus on ligand-based virtual screening with multiple molecules as input.


Figure 1.2 Various virtual screening approaches, including almost all machine learning techniques, require multiple active molecules, such as the three acetylcholine esterase inhibitors shown. These approaches include 2D or 3D quantitative structure-activity relationship (QSAR) models or diverse clustering and partitioning methods. Adapted from reference 17.

1.3 Objectives and structure of this work

The main objective of this work was to discover knowledge from chemical data. To achieve this, different machine learning techniques were applied to two fields – chemotaxonomy and ligand-based virtual screening.

Chapter 2 presents the application of different classification methods to the assignment of a special class of plant secondary metabolites – sesquiterpene lactones (STL) – to seven tribes of the plant family Asteraceae. The extent to which the secondary metabolism of Asteraceae plants corresponds to their taxonomic classification is the main question investigated in this study. In addition, a few chemoinformatics related questions – comparison between different supervised learning techniques and structure representations and comparison between different approaches to defining the applicability domain of a model – are investigated. The results of this study have been published in the Journal of Chemical Information and Modeling.

Chapter 3 presents a logical extension of the study described in Chapter 2. To overcome a common problem encountered in chemotaxonomy, namely the appearance of secondary metabolites in several taxa, the concept of multi-labeled classification is introduced. Multi-labeled classification is a machine learning technique which allows the assignment of an object to more than one class simultaneously. Multi-labeled classification models capable of relating STLs from Asteraceae to the tribe(s) from which they come, taking into account skeletal types and substitution patterns, are developed, and the results are interpreted from a chemotaxonomic and from a practical point of view. The results of this study have been submitted for publication in the Journal of Chemical Information and Modeling.
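The idea of assigning an object to more than one class at once can be illustrated with a minimal sketch. It uses a simple neighbor vote in which a label is assigned when more than half of the k nearest training objects carry it. This is a deliberately simplified stand-in and not the ct-SVM or ML-kNN methods studied in Chapter 3; the vectors, the labels "tribe_1" to "tribe_3", and the function name are hypothetical.

```python
from math import dist


def multilabel_knn(train, query, k=3):
    """Return every label carried by more than half of the k training
    objects nearest to `query` (Euclidean distance)."""
    # `train` is a list of (input vector, set of labels) pairs.
    neighbors = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    candidate_labels = set().union(*(labels for _, labels in neighbors))
    return {
        label
        for label in candidate_labels
        if sum(label in labels for _, labels in neighbors) > k / 2
    }


# Hypothetical compounds, some of which occur in more than one tribe.
train = [
    ((0.0, 0.0), {"tribe_1"}),
    ((0.2, 0.1), {"tribe_1", "tribe_2"}),
    ((0.1, 0.3), {"tribe_1", "tribe_2"}),
    ((2.0, 2.0), {"tribe_3"}),
]
predicted = multilabel_knn(train, (0.1, 0.1), k=3)
```

In contrast to a single-labeled classifier, the result is a set of labels, which may contain one, several, or no classes.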

Chapter 4 presents an attempt to explore the vast amount of data stored in current chemical databases. To achieve this, and to show how a machine learning technique can help in the discussed transition from the brute-force HTS approach to knowledge-driven experiments, virtual screening for five activity classes in a large database of biologically active compounds is performed. The work described in Chapter 4 has several aims: first, to examine the applicability of novelty detection, a class of machine learning techniques devised for working with single-class data, in the field of ligand-based virtual screening; second, to compare a particular novelty detection implementation, based on Self-Organizing Maps11 (SOM), with the most commonly used ligand-based virtual screening method, similarity search with binary fingerprints; and finally, to study the effect of different methods for subset selection on the results of a virtual screening experiment. The results of this study have been accepted for publication in the Journal of Chemical Information and Modeling.
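The principle of novelty detection with single-class data can be sketched as follows. The example learns only from known actives and flags a candidate as novel when it lies farther from every training object than a threshold derived from the training set itself. Note that this sketch uses the training objects directly as prototypes and does not implement a Self-Organizing Map; the threshold rule, the function names, and the toy vectors are assumptions made for illustration.

```python
from math import dist


def fit_threshold(actives):
    """Largest distance from any training object to its nearest
    other training object; used here as the novelty threshold."""
    return max(
        min(dist(a, b) for j, b in enumerate(actives) if j != i)
        for i, a in enumerate(actives)
    )


def is_novel(candidate, actives, threshold):
    """A candidate is novel if its nearest training object is
    farther away than the threshold."""
    return min(dist(candidate, a) for a in actives) > threshold


# Hypothetical descriptor vectors of known actives (corners of a unit square).
actives = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
threshold = fit_threshold(actives)

inside = is_novel((0.5, 0.5), actives, threshold)   # resembles the actives
outside = is_novel((3.0, 3.0), actives, threshold)  # unlike the actives
```

In a virtual screening setting, candidates that are not flagged as novel resemble the known actives and would be the ones proposed for testing.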

Chapter 5 continues the exploration of ligand-based virtual screening in more concrete practical scenarios. Four such scenarios are examined: prioritizing compounds for a subsequent HTS experiment; selecting compounds for a subsequent lead-optimization; assessing the probability that a given structure will exhibit a particular biological activity; and identifying the most active structure. The aim of the work presented in Chapter 5 is to test the applicability of different ligand-based virtual screening methods and of different chemical structure representations in each of the above scenarios. Different measures for the success of the virtual screening experiment in each scenario are presented and discussed. The optimal size of the training set, the difference in the chemical spaces covered by two large databases of biologically active compounds, MDL Drug Data Report (MDDR)25 and World Of Molecular BioAcTivity (WOMBAT),26 the bias introduced by the training set selection, and the differences in the compounds recovered by different methods and/or descriptors are discussed, and the best method-descriptor combination is identified for each scenario. The results of this study have been submitted for publication in the Journal of Computer-Aided Molecular Design.

Each chapter is preceded by a short introduction, which presents some concepts specific to the particular study and is followed by short comments putting the results in the perspective of the overall work.

1.4 References

(1) Witten, I. H.; Frank, E. Data Mining: Practical Machine Learning Tools and Techniques; Morgan Kaufmann: San Francisco, 2000.

(2) Suarez-Villa, L. Invention and the Rise of Technocapitalism; Rowman & Littlefield Publishers, Inc: New York, NY, USA, 2000.

(3) Wolff, E. N. The Growth of Information Workers in the U.S. Economy. Commun. ACM 2005, 48, 37-42.

(4) Google Directory of Chemical Databases, http://www.google.com/Top/Science/Chemistry/Chemical_Databases/, accessed 06.2007

(5) Chemical Abstracts Service. http://www.cas.org/expertise/cascontent/registry/index.html, accessed 06.2007

(6) Bleicher, K. H.; Böhm, H. J.; Müller, K.; Alanine, A. Hit and Lead Generation: Beyond High-Throughput Screening. Nat. Rev. Drug Discov. 2003, 2, 369-378.

(7) Handbook of Chemoinformatics, Gasteiger, J., Ed., 4 Volumes; Wiley-VCH Verlag: Weinheim, 2003.

(8) Chemoinformatics, Gasteiger, J., Engel, T. Eds.; Wiley-VCH Verlag: Weinheim, 2003.

(9) Zupan, J.; Gasteiger, J. Neural Networks in Chemistry and Drug Design; 2nd Edt., Wiley-VCH: Weinheim, 1999.

(10) Venables, W. N.; Ripley, B. D. Modern Applied Statistics With S; Springer: New York, USA, 2002.

(11) Kohonen, T. Self-Organizing Maps; Springer: Berlin, 2001.

(12) Bioinformatics Definition Committee. NIH Working Definition Of Bioinformatics And Computational Biology. National Institutes of Health, 2000, http://www.bisti.nih.gov/CompuBioDef.pdf, accessed 06.2007

(13) Beaman, J. H. Plant Taxonomy. Clin. Dermatol. 1986, 4, 23-30.

(14) Hegnauer, R. Phytochemistry and Plant Taxonomy -- an Essay on the Chemotaxonomy of Higher Plants. Phytochemistry 1986, 25, 1519-1535.

(15) de Kraker, J. W.; Franssen, M. C.; Joerink, M.; de Groot, A.; Bouwmeester, H. J. Biosynthesis of Costunolide, Dihydrocostunolide, and Leucodin. Demonstration of Cytochrome P450-Catalyzed Formation of the Lactone Ring Present in Sesquiterpene Lactones of Chicory. Plant Physiol. 2002, 129, 257-268.

(16) Seaman, F. C. Sesquiterpene Lactones As Taxonomic Characters in the Asteraceae. Bot. Rev. 1982, 48, 123-551.

(17) Bajorath, J. Integration of Virtual and High-Throughput Screening. Nat. Rev. Drug Discov. 2002, 1, 882-884.

(18) Hann, M. M.; Oprea, T. I. Pursuing the Leadlikeness Concept in Pharmaceutical Research. Curr. Opin. Chem. Biol. 2004, 8, 255-263.

(19) Smith, A. Screening for Drug Discovery: The Leading Question. Nature 2002, 418, 453-459.

(20) Walters, W. P.; Stahl, M. T.; Murcko, M. A. Virtual Screening - an Overview. Drug Discov. Today 1998, 3, 160-178.

(21) Bajorath, J. Selected Concepts and Investigations in Compound Classification, Molecular Descriptor Analysis, and Virtual Screening. J. Chem. Inf. Model. 2001, 41, 233-245.

(22) Kuntz, I. D. Structure-Based Strategies for Drug Design and Discovery. Science 1992, 257, 1078-1082.

(23) Willett, P.; Barnard, J. M.; Downs, G. M. Chemical Similarity Searching. J. Chem. Inf. Model. 1998, 38, 983-996.

(24) Kitchen, D. B.; Decornez, H.; Furr, J. R.; Bajorath, J. Docking and Scoring in Virtual Screening for Drug Discovery: Methods and Applications. Nat. Rev. Drug Discov. 2004, 3, 935-949.

(25) MDL Drug Data Report, version 2006.1, MDL Information Systems, http://www.mdli.com/, accessed 06.2007

(26) Olah, M.; Mracec, M.; Ostopovici, L.; Rad, R.; Bora, A.; Hadaruga, N.; Olah, I.; Banda, M.; Simon, Z.; Mracec, M.; Oprea, T. I. WOMBAT: World of Molecular Bioactivity. In Cheminformatics in Drug Discovery; Oprea, T. I., Ed.; Wiley-VCH: New York, 2003.


2 Sesquiterpene Lactones-Based Classification of the Family Asteraceae Using Neural Networks and k-Nearest Neighbors

Overview

Our trip in learning from chemical data begins with an exploration of the relation between the distribution of a special class of plant secondary metabolites – sesquiterpene lactones – and the taxonomic classification of plants from the family Asteraceae.

The Asteraceae (known also as Compositae) family is one of the largest angiosperm plant families. It constitutes a group of plants spread widely across the world, comprising about 23,000 species. Generally, a family of plants is divided into this decreasing ranking: subfamilies, tribes, subtribes, genera and species. The subdivision of the Asteraceae family into groups is not strictly the same for different botanists. The work presented in this chapter follows the most commonly used taxonomic classification proposed by Bremer.1

According to Bremer, the Asteraceae family is divided into three subfamilies and seventeen tribes: subfamily Barnadesioideae (tribe Barnadesieae); subfamily Cichorioideae (tribes Arctoteae, Cardueae, Lactuceae, Liabeae, Mutisieae, Vernonieae); and subfamily Asteroideae (tribes Anthemideae, Astereae, Calenduleae, Eupatorieae, Gnaphalieae, Helenieae, Heliantheae, Inuleae, Plucheeae, Senecioneae). The tribes Cardueae, Lactuceae, Vernonieae, Anthemideae, Eupatorieae, Heliantheae, and Inuleae are the subject of the investigation in the rest of this chapter.

The plant species from the Asteraceae family have various uses and economic values:

• food: sunflower, artichokes, lettuce, chicory and herbal tea like camomile.

• fodder: sandbietou or tickberry (Chrysanthemoides monilifera subsp. pisifera), blouheuningkaroo (Felicia muricata), witheuningkaroo (Phymaspermum parvifolium) and bierbos (Pteronia membranacea)

• medicinal: wilde-als or African wormwood (Artemisia afra), kapokbos or wild rosemary (Eriocephalus africanus) and wild camphor bush (Tarchonanthus camphoratus), amongst many others

• garden plants and cut flowers: edelweiss (Leontopodium alpinum), Barberton daisy (Gerbera species), asters, dahlias, chrysanthemums, cornflowers and sunflowers, amongst many others

The chemistry of Asteraceae is quite diverse. According to Zdero and Bohlmann,2 over 5000 species of the family have already been chemically studied, and 7000 micromolecular compounds isolated. These compounds include monoterpenes, sesquiterpenes, sesquiterpene lactones, diterpenes, triterpenes, polyacetylenes, flavonoids, coumarins, benzofuranes, acetophenones and phenylpropanes.

Amongst the terpenoids, the sesquiterpene lactones are that structural class of secondary metabolites which is most often used for chemotaxonomic studies in the family Asteraceae. The sesquiterpene lactones, being complex molecules which require many steps for their biosynthesis, are considered taxonomic markers within the Asteraceae family.3 They are widespread in most of the 17 tribes of Asteraceae and have been reported in about 90% of the investigated Asteraceae species. Currently there are more than 4,000 known sesquiterpene lactones with around 40 different carbon skeletons, some of which are depicted in Chart 2.1.

The large amount of chemical information is a rich source of data. The use of these data for the investigation of the relations between the distribution of sesquiterpene lactones and the taxonomic classification of plants from the Asteraceae family by various machine learning techniques is the subject of this chapter.

The remainder of this chapter corresponds exactly to the original paper published in the Journal of Chemical Information and Modeling other than the last section after the references list and the numbering.

References

(1) Bremer, K. Asteraceae: Cladistics and Classification; Timber Press: Portland, 1994.

(2) Zdero, C.; Bohlmann, F. Systematics and Evolution Within the Compositae, Seen With the Eyes of a Chemist. Plant Syst. Evol. 1990, 171, 1-14.

(3) Seaman, F. C. Sesquiterpene Lactones As Taxonomic Characters in the Asteraceae. Bot. Rev. 1982, 48, 123-551.

Chart 2.1 Common sesquiterpene lactone skeletons.

[Structure drawings not reproduced. Skeleton labels in the chart: germacranolide, eudesmanolide, eremophilanolide, elemanolide, 2-3-seco-germacranolide, 8-9-seco-germacranolide, guaianolide, nor-guaianolide, pseudoguaianolide, carabrone, xanthanolide, cadinanolide, trichosalviolide, 10-1-seco-guaianolide, 3-4-seco-pseudoguaianolide, lasiolaenolide, and 10-1-seco-eudesmanolide. Several skeletons are drawn in a second variant labeled "8".]


Original Article

Sesquiterpene Lactones-Based Classification of the Family Asteraceae Using Neural Networks and k-Nearest Neighbors

Dimitar Hristozov1, Fernando B. Da Costa1,2*, and Johann Gasteiger1

1Computer-Chemie-Centrum, Universität Erlangen-Nürnberg, Nägelsbachstr. 25, D-91052 Erlangen, Germany

2Laboratório de Farmacognosia, Faculdade de Ciências Farmacêuticas de Ribeirão Preto, Universidade de São Paulo, Av. do Café s/no., 14040-903, Ribeirão Preto, SP, Brazil

Hristozov, D., Da Costa, F.B., Gasteiger, J., J. Chem. Inf. Model., 2007, 47 (1), pp. 9-19

Abstract

In a recent publication we described the application of an unsupervised learning method using self-organizing maps to the separation of three tribes and seven subtribes of the plant family Asteraceae based on a set of sesquiterpene lactones (STLs) isolated from individual species. In the present work, two different structure representations, atom counts (2D) and radial distribution function (RDF) codes (3D), and two supervised classification methods, counter-propagation neural networks and k-nearest neighbors (k-NN), were used to predict the tribe in which a given STL occurs. The data set was extended from 144 to 921 STLs and the number of Asteraceae tribes was increased from three to seven. The k-NN classifier with k=1 showed the best performance, and the RDF code outperformed the atom counts. The quality of the obtained model was assessed with two test sets, which exemplified two possible applications: 1) finding a plant source for a desired compound and 2) based on the chemical profile (STLs) of a plant species: a) studying the relationship between the current taxonomic classification and the plant's chemistry; b) assigning a species to a tribe by majority vote. In addition, the problem of defining the applicability domain of the models was assessed by means of two different approaches: principal component analysis combined with the Hotelling T² statistic and an a posteriori probability-based rule.

2.1 Introduction

Plant chemotaxonomy is generally focused on the investigation of chemical relationships among taxa at different taxonomic hierarchical levels, like families, tribes, genera or species. For instance, these relationships are investigated taking into account the observation and the analysis of biosynthesis, occurrence and special trends of accumulation of certain structural classes of secondary metabolites within a given taxon. Among several possible applications, the obtained information can be used for classification of plants into groups. Hence, chemotaxonomy had a strong influence on plant systematics and new taxonomic classifications were developed taking into account the distribution of such metabolites.1 Moreover, the impact of this subject may also exert an effect on other areas of research like plant metabolism, biological activities, geographic distribution of secondary metabolites, analysis of cultivars and quality control of plant drugs.

Amongst the terpenoids, the sesquiterpene lactones (STLs) comprise the most used structural class of secondary metabolites for chemotaxonomic studies in the family Asteraceae. They are considered taxonomic markers, show many biological activities, and are of commercial as well as of ecological value.2,3 According to Bremer,4 the family Asteraceae has ca. 23,000 species arranged into 17 tribes, 82 subtribes, and around 1,500 genera where the STLs are widespread. Currently there are more than 4,000 known STLs and around 40 different carbon skeletons comprise the most important group.5,6 Consequently, this large amount of chemical information is a rich source of data which has great value for the investigation of relationships among taxa at tribal, subtribal, generic or species levels. Such investigation can be done by statistical methods but so far only principal component analysis (PCA) has been explored.6 Hence, machine learning techniques, for example classification methods, may provide useful ways for exploration of these data.

The use of classification methods for predictive purposes raises the question about the applicability domain of the built models. Given the fact that a machine learning technique is always trained on data covering a limited area of the chemical universe, it becomes important that it is able to differentiate between (i) compounds which are very far from the training data and (ii) compounds which are in or very near to the space spanned by the training data. While in the field of QSAR and chemoinformatics the use of such methods is still in an early stage, as reviewed by Netzeva et al.,7 different approaches have been proposed in the field of machine learning, termed “novelty detection”. They can be separated into two major groups – statistical methods8 and neural networks.9 Targeting specifically the k-NN classifier, a number of rejection strategies have been proposed,8,10,11 mostly utilizing the a posteriori probability estimates.

The representation of chemical structures is another important aspect that can change dramatically the quality of the obtained models. There are several ways to describe a chemical structure,12 e.g. based on the connectivity information alone (“2D”) or on the spatial arrangement of the atoms in a molecule (3D).

Recently we described the separation of seven subtribes of the plant family Asteraceae based on sesquiterpene lactones (STLs) using a data set which consisted of ca. 150 STLs.13 That approach was based on unsupervised learning through the projection of the 3D structures of STLs encoded by their Radial Distribution Functions (RDF) into Self-Organizing Maps (SOMs). The study provided useful insights into the STL-based chemosystematics.

In this work, we present classification models which predict the Asteraceae tribe from which a given STL has been isolated based on two different structure representations – atom counts augmented with stereo information14 (“2D”) and a description of the 3D structure by RDF code.15 The applicability of such classification is demonstrated with two use cases: (1) targeted collection of plant material – given a known and/or desired STL, the most probable tribe where it can be found is suggested by the classification, thus substantially limiting the possible plant sources (Figure 2.1); (2) once the STL profile of a given plant species is known, the proposed model can be used in the subsequent chemotaxonomic analysis and the species can be assigned to a tribe based on majority vote (Figure 2.2). The second use case can lead to new insights into the relationship between the current taxonomic classification and plant’s chemistry. Counter-propagation (CPG) neural networks16 and k-nearest neighbors (k-NN) were used to build the classification models. A comparison is presented between two different approaches for rejecting patterns which are likely to be misclassified: (i) a distance metric based on PCA and the Hotelling T² statistic17 and (ii) a probability-based approach that takes into account the distances to the nearest neighbors as well as their class membership.11


Figure 2.1 Illustration of the first use case of the proposed classification model: classifying compound(s) of interest with the aim of finding plant source(s) for them.

Figure 2.2 Illustration of the second use case of the proposed classification model: starting with an STL profile of a given species (STL1+STL2+STL3, first grey area), the classification is made for each single STL from this profile and the results are used to analyze chemical relationships between different tribes – the second grey area; in addition, the plant species can be assigned to a tribe by majority vote


Thus, this work has the following objectives: (1) development of a classification model, which is capable of classifying STLs from plant species into their corresponding Asteraceae tribes; (2) comparison between different supervised learning techniques and structure representations; (3) comparison between different approaches for defining the applicability domain of a model. An advantage of this model is that it can deal with literature-new or previously unreported structures rather than only retrieving known STLs from a data set.

2.2 Materials and Methods

2.2.1 Data sets

The structures of the 921 STLs and the information concerning their taxonomic origin were taken from comprehensive surveys published in the literature.2,18-40 The taxa (tribes and genera) presented herein are in accordance with the division made by Bremer.4 This information was used to assign all chemical structures to their corresponding taxa. One hundred and nine (around 11%) STLs were reported in more than one tribe, i.e. they overlap. In this work, each overlapping STL was assigned to only one tribe. The criterion used to assign such structures was their predominance within the tribes from which they were reported. This means that if a certain STL occurs in more than one tribe, it was placed in the one in which it was reported most often, i.e. in the one with the highest number of occurrences. The number of occurrences of each STL in the corresponding tribes was obtained through a systematic survey in SciFinder.41 This criterion does not lead to misrepresentation of tribes with respect to their actual chemical profile since only the most representative STLs of each tribe were considered. Thus, such representative STLs cover a broad array of chemodiversity since their skeletons and most common substitutional features are present in the data set. During the data collection, care was taken to select as many different STL types as possible, i.e. to select structures that belong to a great variety of skeletal classes and subclasses among all tribes. The most common major biogenetic types of compounds, germacranolides, guaianolides and eudesmanolides, which together comprise the tripartite chemistry4 of the family Asteraceae, were selected on a large scale. Analogously, the biogenetically advanced but less abundant skeletal types, like pseudoguaianolides, elemanolides, xanthanolides and seco-derivatives, were also taken into consideration.
Ring functionalization and substitution features like side chain esters and sugar moieties were also taken into account, as well as the presence of dimers and other minor, rarely occurring components. As nowadays the great majority of all reported structures of STLs have their stereo centers fully assigned, in this work only those with fully assigned stereochemistry were selected. Table 2.1 gives the distribution of the 921 STLs into their corresponding tribes.

Table 2.1 Distribution of STLs in their corresponding Asteraceae tribes

Tribe          Abbreviation   Number of STLs
Anthemideae    ANT            175
Cardueae       CAR             73
Eupatorieae    EUP            202
Heliantheae    HLT            296
Inuleae        INU             50
Lactuceae      LAC             48
Vernonieae     VER             77
Total                         921

The most represented tribe in the data set in terms of number of occurrences and skeletal diversity of STLs is Heliantheae (HLT), from which the largest number and diversity of STLs was reported so far. The number and variety of structures of STLs in the remaining six tribes followed the same criterion, i.e. the selection of STL-rich and representative taxa. Most of these taxa have already been subject of chemotaxonomic studies based only on their STL profile.2,3,13

The data set was then split semi-randomly, i.e. the distribution of STLs among the tribes was preserved, into three subsets: a training set with around 70% of the data set (644 structures), a cross-validation set (135 structures) and a test set (142 structures). The test set was not used in the model building but only after all parameters of the underlying classification method were established. It illustrates the first use case, i.e. given an STL that is of particular interest – even a literature-new structure – the classification is made in order to find a plant source for it (cf. Figure 2.1).
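The semi-random split can be illustrated with a short Python sketch. It is not the procedure actually used in this work: the function name `stratified_split`, the subset fractions and the random seed are assumptions made for illustration, and only the tribe sizes are taken from Table 2.1. Each tribe is shuffled separately and cut into three parts, so that every subset keeps approximately the tribe distribution of the whole data set.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split sample indices into subsets while preserving class proportions.

    labels    -- one class label (here: tribe abbreviation) per STL
    fractions -- approximate share of each subset (assumed values)
    Returns one list of indices per subset.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    subsets = [[] for _ in fractions]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        start = 0
        for s, frac in enumerate(fractions):
            if s == len(fractions) - 1:
                # The last subset takes what remains, so no index is lost to rounding.
                end = len(members)
            else:
                end = start + round(frac * len(members))
            subsets[s].extend(members[start:end])
            start = end
    return subsets


# Tribe sizes from Table 2.1; the labels stand in for the real structures.
tribe_sizes = {"ANT": 175, "CAR": 73, "EUP": 202, "HLT": 296,
               "INU": 50, "LAC": 48, "VER": 77}
labels = [tribe for tribe, n in tribe_sizes.items() for _ in range(n)]
train, valid, test = stratified_split(labels)
print(len(train), len(valid), len(test))
```

Because of the per-tribe rounding, the subset sizes produced by such a sketch are only close to, and not necessarily identical with, the 644/135/142 structures reported above.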

To exemplify the second use case (cf. Figure 2.2) a second test set consisting of 72 STLs from nine plant species from the seven tribes was built. The information about these STLs, i.e. their plant sources and chemical structures, was collected from recent articles as indicated in Table 2.2. In this case, starting with an STL profile of a given species – including previously unreported structures – the classification is made for each single STL and the results are used to analyze relationships among different tribes in terms of their chemical profiles. This second set is summarized in Table 2.2.

In addition to the chemotaxonomic analysis, the second test set was also used with the final model to predict the tribe to which a species belongs. A prediction was made for each single compound from a species chemical profile and the majority vote was used to determine the tribe. For example, given a profile, consisting of seven STLs, if five of them were predicted to belong to a certain tribe A and two of them to tribe B, then the species was assigned to tribe A (Figure 2.2).
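The majority vote can be written down in a few lines of Python. This is a minimal sketch; the function name `assign_tribe` is hypothetical, and the input is the example from the text (seven STLs, five predicted for tribe A and two for tribe B).

```python
from collections import Counter


def assign_tribe(predicted_tribes):
    """Return the most frequently predicted tribe and its share of the votes."""
    votes = Counter(predicted_tribes)
    tribe, count = votes.most_common(1)[0]
    return tribe, count / len(predicted_tribes)


# One predicted tribe per STL of the species profile.
profile_predictions = ["A", "A", "B", "A", "A", "B", "A"]
tribe, share = assign_tribe(profile_predictions)
print(tribe, round(share, 2))
```

The text does not state how a tie between two tribes would be resolved at the species level; in this sketch the tribe encountered first among the tied ones is returned, which is an arbitrary choice.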

Table 2.2 Second test set – STLs from nine plant species

Species                           Tribe   Abbreviation   Number of STLs
Anthemis wiedemanniana42          ANT     ANT.es1         3
Achillea depressa43               ANT     ANT.es2         7
Centaurea babylonica44            CAR     CAR.es1         6
Centaurea acaulis45               CAR     CAR.es2         6
Stevia alpina var. glutinosa46    EUP     EUP.es1         9
Viguiera eriophora47              HLT     HLT.es1        11
Carpesium abrotanoides48          INU     INU.es1         8
Leontodon palisae49               LAC     LAC.es1         5
Vernonanthura lipeoensis50        VER     VER.es1        17
Total                                                    72

2.2.2 Structure representation

Two different structure representations – atom counts and radial distribution functions – were investigated.

2.2.2.1 Atom counts

This representation involves a histogram of atom counts with atom types, which was built for each structure. These atom types were previously defined and originally used for the prediction of 13C NMR spectra using artificial intelligence.14 In addition, as can be deduced from the 2D representation of the structures, the relative configuration of some centers, i.e. the orientation of some bonds, was taken into account. As already mentioned, the structures in the data set have fully assigned stereochemistry, and in this work their corresponding 2D structures were drawn in the way that is common in most publications regarding STLs. This code was built by considering the same atom types as different if, for instance, an sp3 carbon atom has an α- or a β-oriented substituent, i.e. a hydroxyl group or an ester moiety below or above the plane, respectively. The originally reported atom types were chosen with the aim to describe any common organic structure.14 In our study, the great majority of the structures of STLs have mainly sp3 and sp2 carbon as well as oxygen atoms. As a consequence, only the following atom types were considered: –CH3, –CH2, –CH–, –C–, =CH2, =CH, =C–, C=O, –C–OH, –C–O–, and =C–OH, as well as other minor types. The inclusion of stereo information as already mentioned resulted in 27 different atom types. However, the examination of the histograms which were generated using all the 27 atom types revealed that the number of compounds in the training set that have stereo information for certain atom types was very low. So, the corresponding counts were removed and the final representation consists of a histogram with 17 bins (different atom types) for each STL: –CH3, –CH3 α-oriented, –CH3 β-oriented, –CH2–, –CH–, –CH– α-oriented, –CH– β-oriented, =CH2, =CH–, =C–, C=O, –C–OH, –C–OH α-oriented, –C–OH β-oriented, –C–O–, –C–O– α-oriented, and –C–O– β-oriented. Figure 2.3 gives an example of the histogram obtained for the shown STL.
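To make the 17-bin representation concrete, the following Python sketch turns a list of atom-type labels into a fixed-length count vector. It assumes that the atom typing itself (including the α/β assignment) has already been done by other means; the label strings, the function name `atom_count_histogram` and the example atom list are hypothetical and do not correspond to a real STL.

```python
from collections import Counter

# The 17 atom types of the final representation, in the order given in the text.
# The suffixes "_a" and "_b" stand for alpha- and beta-oriented (label strings are assumed).
ATOM_TYPES = [
    "CH3", "CH3_a", "CH3_b", "CH2", "CH", "CH_a", "CH_b",
    "=CH2", "=CH", "=C", "C=O",
    "C-OH", "C-OH_a", "C-OH_b", "C-O", "C-O_a", "C-O_b",
]


def atom_count_histogram(atom_labels):
    """Count how often each of the 17 atom types occurs in one structure."""
    counts = Counter(atom_labels)
    unknown = set(counts) - set(ATOM_TYPES)
    if unknown:
        raise ValueError(f"unknown atom types: {sorted(unknown)}")
    return [counts[atom_type] for atom_type in ATOM_TYPES]


# Made-up list of typed atoms for a single structure.
example_atoms = ["CH3", "CH3_b", "CH2", "CH2", "CH2", "CH_a", "CH_b", "CH",
                 "=CH2", "=C", "=CH", "=C", "C=O", "C-O_a", "C-OH_b"]
histogram = atom_count_histogram(example_atoms)
print(histogram)
```

Every structure is thus described by a vector of the same length, regardless of its size, which is what the classification methods below require as input.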


Figure 2.3 Example of an STL coded with histogram of atom counts.

The rationale behind this descriptor is that different plant species from specific tribes have special trends of accumulation of compounds which show certain features at specific positions of their skeletons, e.g. α- or β-oriented hydroxyl groups and their corresponding esters, reduction of double bonds, epoxy groups, etc.

2.2.2.2 Radial Distribution Function (RDF)

The second representation is based on RDF codes15 calculated using the three dimensional structures. Single, low energy 3D conformations were generated for the STLs from their 2D constitution using CORINA.51,52 The RDF codes were calculated with the descriptor calculation package ADRIANA.CODE53 according to the following equation:

$$ g(r) = \sum_{i=1}^{N-1} \sum_{j>i}^{N} A_i A_j \, e^{-B (r - r_{ij})^2} \qquad (2.1) $$

where N is the number of atoms in a molecule, $A_i$ and $A_j$ are properties associated with the atoms i and j, respectively, $r_{ij}$ represents the distance between atoms i and j, and B is a smoothing factor. The above formula was applied with the parameter A set to the atomic number of the considered atom, and 32-, 64- and 128-dimensional RDF codes were calculated. The function g(r) was defined in the interval 2.0–9.0 Å. The RDF of an ensemble of atoms can be interpreted as a probability distribution of the atoms in 3D space. The length of the code is independent of the number of atoms, the code is unique regarding the three-dimensional arrangement of the atoms and it is invariant against translation and rotation of the entire molecule. The RDF code can distinguish diastereoisomers as well as epimers but not enantiomers. Although the STLs show several chiral centers and may potentially occur as enantiomeric pairs, the STL-producing plants do not biosynthesize stereoisomers and the fact that RDF does not reflect such differences is not relevant in this study. Figure 2.4 shows the same STL as in Figure 2.3 and its corresponding RDF code.
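Equation 2.1 can be evaluated directly from atomic coordinates, as the following Python sketch shows. It is not the ADRIANA.CODE implementation: the function name `rdf_code`, the value of the smoothing factor B, the equidistant sampling of r including both interval ends, and the three-atom example are all assumptions made for illustration.

```python
import math


def rdf_code(coords, props, n_points=64, r_min=2.0, r_max=9.0, B=100.0):
    """Sample g(r) of equation 2.1 at n_points equidistant values of r.

    coords -- list of (x, y, z) atom positions in Angstrom
    props  -- atomic property A for each atom (here: atomic number)
    B      -- smoothing factor in 1/Angstrom^2 (assumed value)
    """
    n_atoms = len(coords)
    step = (r_max - r_min) / (n_points - 1)
    radii = [r_min + k * step for k in range(n_points)]
    code = [0.0] * n_points
    for i in range(n_atoms - 1):
        for j in range(i + 1, n_atoms):  # every atom pair is counted once (j > i)
            r_ij = math.dist(coords[i], coords[j])
            weight = props[i] * props[j]
            for k, r in enumerate(radii):
                code[k] += weight * math.exp(-B * (r - r_ij) ** 2)
    return radii, code


# Made-up three-atom arrangement (C, O, C); not a real molecule.
coords = [(0.0, 0.0, 0.0), (2.5, 0.0, 0.0), (0.0, 4.0, 0.0)]
props = [6, 8, 6]
radii, code = rdf_code(coords, props)

# Translating the whole arrangement leaves the code unchanged.
shifted = [(x + 1.5, y - 2.0, z + 0.3) for x, y, z in coords]
_, code_shifted = rdf_code(shifted, props)
```

Since only interatomic distances enter the formula, the invariance against translation and rotation mentioned above follows directly, and the same holds for mirror images, which is why enantiomers cannot be distinguished.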

Figure 2.4 Example of an STL coded with 64 dimensional radial distribution function in the interval 2.0–9.0 Angstrom using atomic number as a property for each atom.


2.2.3 Classification methods

2.2.3.1 Counter-propagation (CPG) neural network

A CPG neural network consists of a SOM block (input layer) and an additional output block. The input data are stored in a two-dimensional grid of neurons, each containing as many elements (weights) as there are input variables (Figure 2.5). All the units of the inputs (object + properties) are linked to the SOM block.16 In this work, the input data are the structures of STLs as represented by their atom counts or RDF codes. During the training, a winning neuron is chosen by the SOM block for each individual object (coded structure of an STL). Each neuron in the SOM block is in turn connected to a neuron in the output block (Figure 2.5). In this study, the output block consisted of seven layers, one layer for each of the seven tribes of Asteraceae. The weights of the output layers are adapted in order to become closer to the output value of the presented object. Predictions for new compounds are made by determining the winning neuron, defined as the neuron with the smallest Euclidean distance between its weight vector and the X-variables of the STL. CPG neural networks were calculated using SONNIA.54
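The prediction step of a trained CPG network can be sketched in Python as follows. This is not the SONNIA implementation: the 2×2 grid, all weight values and the function names are made up, and reading off the tribe as the output layer with the largest weight at the winning neuron is an assumption of this sketch.

```python
import math

TRIBES = ["ANT", "CAR", "EUP", "HLT", "INU", "LAC", "VER"]


def winning_neuron(som_weights, x):
    """Grid position of the neuron whose weight vector is closest to x (Euclidean)."""
    return min(som_weights, key=lambda pos: math.dist(som_weights[pos], x))


def cpg_predict(som_weights, out_weights, x):
    """Predict a tribe from the output weights stored at the winning neuron."""
    pos = winning_neuron(som_weights, x)
    outputs = out_weights[pos]  # one value per output layer, i.e. per tribe
    best = max(range(len(TRIBES)), key=lambda k: outputs[k])
    return TRIBES[best]


# Toy "trained" network: a 2x2 SOM block with two input variables per neuron.
som_weights = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 0.0],
               (1, 0): [0.0, 1.0], (1, 1): [1.0, 1.0]}
out_weights = {(0, 0): [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
               (0, 1): [0.0, 0.0, 0.0, 0.8, 0.2, 0.0, 0.0],
               (1, 0): [0.0, 0.0, 0.7, 0.0, 0.0, 0.3, 0.0],
               (1, 1): [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]}

prediction = cpg_predict(som_weights, out_weights, [0.9, 0.2])
print(prediction)
```

In the real application the weight vectors have 17 (atom counts) or 64 (RDF code) elements and the grid is far larger; the lookup itself stays the same.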

Figure 2.5 Architecture of the counterpropagation neural network for the classification of STLs into seven Asteraceae tribes with seven output layers – one for each tribe.

2.2.3.2 k-Nearest Neighbor (k-NN)

This method belongs to the family of “lazy” learners. It is memory-based and requires no model to be fit.55 All training examples are stored in memory and the prediction for a new pattern is made by finding its k nearest neighbors in terms of some predefined distance measure and assigning the pattern to the class to which the majority of the nearest neighbors belong. In this study, the measure was the Euclidean distance, which is the most commonly used one. Although it is a very simple method, k-NN has been successfully utilized in many real-world applications. It is able to approximate highly irregular class boundaries, which are inevitable when highly overlapping classes have to be classified. The choice of k is crucial because it controls the bias-variance trade-off of the method. A low number of neighbors provides low bias and high variance, while high values of k tend to reduce the variance but increase the bias.56
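A voting k-NN classifier with Euclidean distance fits in a few lines of Python. The sketch below uses a made-up two-dimensional toy data set and a hypothetical function name `knn_predict`; ties in the vote are resolved by the order of the neighbors, which is a simplification.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=1):
    """Assign x to the majority class among its k nearest training patterns."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy training data: two descriptors per pattern, two classes.
train_X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.0, 0.8]]
train_y = ["ANT", "ANT", "HLT", "HLT", "HLT"]

print(knn_predict(train_X, train_y, [0.1, 0.0], k=1))
print(knn_predict(train_X, train_y, [0.95, 1.0], k=3))
```

No model is fitted: all the work happens at prediction time, when the distances to every stored training pattern are computed.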

2.2.4 Model validation

The quality of the initial model was assessed by means of the first test set (142 STLs, see section 2.2.1). The final model, which is based on the entire set of 921 STLs, was evaluated by means of the second test set (72 STLs, see Table 2.2) as well as with stratified 10-fold cross-validation.55 The data set was randomly divided into ten subsets, each of which contained approximately the same distribution of the seven classes as the whole data set. Then a model was fitted taking nine of these subsets as a training set and the remaining one as a test set. The procedure was repeated ten times until each subset has been used as a test set once. Due to the random splitting the estimates can vary; therefore, the whole procedure was repeated ten times, leading to a hundred model fitting runs. The confusion matrix of the average cross-validated predictions was computed at the end of the procedure. From it the following per class and overall statistics were derived:

- recall (also known as sensitivity), which is obtained by dividing the number of correctly classified compounds of a given tribe by the total number of compounds in this tribe;

- precision, which is obtained by dividing the number of correctly classified compounds of a given tribe by the total number of compounds which were predicted to belong to this tribe;

- F-measure, which is calculated as $F = \frac{2 \times \mathrm{recall} \times \mathrm{precision}}{\mathrm{recall} + \mathrm{precision}}$ and combines the recall and precision in a single efficiency measure (it is the harmonic mean of precision and recall);

- Cohen’s kappa, which gives a measure of the classification accuracy after accounting for chance effects. It is independent of the prevalence of a given class57 and is therefore better suited than the overall correct classification rate for data sets in which the different classes are not evenly distributed. It usually takes values between 0 and 1 (negative values are possible when the agreement is worse than expected by chance), with values of 0.0 – 0.4 considered to indicate slight to fair model performance, values of 0.4 – 0.6 moderate, 0.6 – 0.8 substantial and 0.8 – 1.0 almost perfect.
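The four statistics can be computed from a confusion matrix as in the following Python sketch. The function names and the 3×3 matrix are made up for illustration; rows are taken as the true classes and columns as the predicted classes, which is an assumed convention.

```python
import math


def per_class_stats(conf, k):
    """Recall, precision and F-measure for class k.

    conf[i][j] = number of compounds of true class i predicted as class j.
    """
    true_pos = conf[k][k]
    n_actual = sum(conf[k])
    n_predicted = sum(row[k] for row in conf)
    recall = true_pos / n_actual if n_actual else math.nan
    precision = true_pos / n_predicted if n_predicted else math.nan
    if math.isnan(recall) or math.isnan(precision) or recall + precision == 0:
        f_measure = math.nan
    else:
        f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure


def cohen_kappa(conf):
    """Agreement between true and predicted classes, corrected for chance."""
    n_classes = len(conf)
    total = sum(sum(row) for row in conf)
    observed = sum(conf[i][i] for i in range(n_classes)) / total
    expected = sum(
        sum(conf[i]) * sum(row[i] for row in conf) for i in range(n_classes)
    ) / total ** 2
    return (observed - expected) / (1 - expected)


# Made-up confusion matrix; the third class is never predicted.
conf = [[8, 2, 0],
        [1, 9, 0],
        [3, 2, 0]]
for k in range(3):
    print(k, per_class_stats(conf, k))
print(cohen_kappa(conf))
```

A class that is never predicted has a recall of zero and an undefined precision, which the sketch reports as "not a number".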

2.2.5 Determination of Prediction Space

2.2.5.1 Principal Component Analysis (PCA) and Hotelling T²

Eriksson et al.17 gave a detailed description of applying PCA in concert with the Hotelling T² score as a method for defining the prediction space. The methodology was also used in a recent classification study.58 In summary, a PCA is performed on the training set and the obtained loading matrix is used to calculate scores on the external set. Using these scores and a given confidence level, a pattern from the external set is considered to belong to the prediction space if its Hotelling T² score satisfies the following criterion:

$$ T_i^2 < \frac{A(N^2 - 1)}{N(N - A)} \, F(p = \alpha) \qquad (2.2) $$

where F(p=α) is a tabulated value for an F distribution using a confidence level α, A is the number of principal components used to build the Hotelling’s test, and N is the number of compounds of the training set, and

$$ T_i^2 = \sum_{a=1}^{A} \frac{t_{ia}^2}{s_a^2} \qquad (2.3) $$

where $s_a^2$ is the variance explained by principal component a and $t_{ia}$ is the score of compound i for principal component a.

A graphical overview of this approach is shown in Figure 2.6. It shows the PCA-score plot of our test set (142 STLs) described by 64-dimensional RDF codes.

Figure 2.6 PCA-score plot of the test set (142 STLs) with 64 dimensional RDF codes. The ellipse defines the 95% confidence region, as calculated by using the training set (779 STLs). Each point represents an STL, the tribe from which it was isolated being indicated by the corresponding symbol.

The ellipse defines the 95% confidence region and was determined from the training set of 779 STLs. This number comprises the 644 structures of the original training set (around 70% of the data set) plus the 135 structures of the cross-validation set (around 15%). All points outside the ellipse are considered to be outside the prediction space and are subsequently not classified. The application of this approach is not limited to the first two PCs, as Figure 2.6 might suggest: the ellipsoid can be defined by any sensible number of PCs. In this study, we used the first 5 PCs, which covered approximately 85% of the variance. This approach is rather global, i.e. it rejects patterns that fall clearly outside of the volume spanned by the training set. However, if a pattern falls inside this volume it is always accepted, even if the training data density around it is sparse.
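The acceptance test of equations 2.2 and 2.3 is sketched below in Python. The PCA itself is assumed to have been carried out elsewhere, so the sketch starts from the scores of a compound and the score variances of the training set. The function names, all score and variance values, and the critical F value `f_crit` are placeholders; in practice `f_crit` has to be replaced by the tabulated value for the chosen confidence level.

```python
def hotelling_t2(scores, score_variances):
    """Equation 2.3: sum over the components of t_ia^2 / s_a^2 for one compound."""
    return sum(t * t / s2 for t, s2 in zip(scores, score_variances))


def t2_limit(n_train, n_components, f_crit):
    """Right-hand side of equation 2.2."""
    A, N = n_components, n_train
    return A * (N ** 2 - 1) / (N * (N - A)) * f_crit


def in_prediction_space(scores, score_variances, n_train, f_crit):
    """True if the compound lies inside the confidence ellipsoid."""
    limit = t2_limit(n_train, len(scores), f_crit)
    return hotelling_t2(scores, score_variances) < limit


# Setting taken from the text: 779 training compounds and 5 principal components.
N_TRAIN = 779
F_CRIT = 2.23  # placeholder, NOT a value reported in this work
score_variances = [4.0, 3.0, 2.0, 1.5, 1.0]   # made-up variances of the 5 PCs
near = [1.0, -0.5, 0.8, 0.3, -0.2]            # made-up scores close to the centre
far = [6.0, 4.0, -3.0, 2.0, 1.5]              # made-up scores far from the centre

print(in_prediction_space(near, score_variances, N_TRAIN, F_CRIT))
print(in_prediction_space(far, score_variances, N_TRAIN, F_CRIT))
```

The test depends only on the distance from the centre of the training data, scaled by the variance along each component, which is why sparse regions inside the ellipsoid are not detected.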

2.2.5.2 Reject rule

In a statistical classifier, an estimate $\hat{p}(x \mid w_i)$ of the probability density function (p.d.f.) of each class in the feature space is computed from a set of training observations. In the voting k-NN classifier the decision rule is equivalent to the Bayes’ minimum error rule with the following p.d.f. estimation:11

$$ \hat{p}(x \mid w_i) = \frac{k_i}{n_i v} \qquad (2.4) $$

where $n_i$ is the number of prototypes of the class $w_i$, $k_i$ is the number of prototypes from class $w_i$ among the k nearest neighbors, and v is the volume of a hypersphere comprising all of them. Bayes’ theorem can be applied to this p.d.f. estimation to obtain an a posteriori probability estimate $\hat{P}(w_i \mid x)$. However, for finite-size data sets and small values of k it is often preferable to apply a measure which takes into account the distances as well as the class membership of the nearest neighbors.11 A commonly used estimation, and the one which was applied in this study, is:

$$ \hat{P}(w_i \mid x) = \frac{\sum_{j \in s_i} \frac{1}{d(x, y_j)}}{\sum_{j=1}^{k} \frac{1}{d(x, y_j)}} \qquad (2.5) $$

where d is the distance measure used and $s_i$ is the set of sub-indices of the prototypes from class $w_i$ among the k nearest neighbors retrieved, $y_1 \ldots y_k$. A suitable small value should be used when the distance computation gives a result of zero.

This measure can be interpreted as the confidence of the classifier in its decision. An ambiguity threshold $T_a \in [0, 1]$ can be applied to reject the patterns that are not clearly classified in one class, i.e. those for which the value of the estimator is smaller than or equal to $T_a$.
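The reject rule based on equation 2.5 can be sketched in Python as follows. The function names, the toy training data, the threshold value and the small constant `eps` that replaces a zero distance are assumptions made for illustration only.

```python
import math


def posterior_estimates(train_X, train_y, x, k, eps=1e-12):
    """Equation 2.5: inverse-distance weighted class shares among the k nearest neighbors."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    # eps stands in for the "suitable small value" used when a distance is zero.
    inverse = {i: 1.0 / max(math.dist(train_X[i], x), eps) for i in order}
    total = sum(inverse.values())
    posterior = {}
    for i in order:
        posterior[train_y[i]] = posterior.get(train_y[i], 0.0) + inverse[i] / total
    return posterior


def classify_with_reject(train_X, train_y, x, k=3, threshold=0.75):
    """Return (class, confidence); the class is None if the pattern is rejected."""
    posterior = posterior_estimates(train_X, train_y, x, k)
    label = max(posterior, key=posterior.get)
    confidence = posterior[label]
    if confidence <= threshold:
        return None, confidence
    return label, confidence


# Toy one-dimensional example with two classes.
train_X = [[0.0], [1.0], [4.0], [5.0]]
train_y = ["ANT", "ANT", "HLT", "HLT"]

print(classify_with_reject(train_X, train_y, [0.5]))   # close to the ANT patterns
print(classify_with_reject(train_X, train_y, [2.2]))   # between the two classes
```

A pattern lying between the two classes receives a confidence close to that of a coin toss and is rejected, whereas a pattern surrounded by neighbors of one class is accepted.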

2.3 Results

2.3.1 CPG neural network models

To select the best size and topology of the SOM block, different size/topology combinations were examined with both structure representations. The initial size of the SOM block was set to 5×√N neurons, where N is the number of compounds in the training set, following an empirical rule,59 and was increased up to 0.8×N. Each size was tested with toroidal and rectangular topology of the SOM block, with both structure representations and with a training time of 50 epochs. The quality of the obtained models was judged using the validation set. After the best one had been selected – a 19×22 SOM block with rectangular topology for both structure representations – the validation set was merged with the training set and a new model was built using the already determined best parameters. Table 2.3 gives the classification statistics of this model when applied to the first test set with atom counts and 64-dimensional RDF codes. Since the training of a SOM is a stochastic approximation process, the given values are averaged over ten runs with different seeds.

Table 2.3 Classification statistics of the CPG neural network models applied on the first test set (142 STLs, 19x22 dimensional SOM layer, rectangular topology, 50 epochs, average values over ten runs with different seeds)

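| Tribe | Atom counts: Recall | Atom counts: Precision | Atom counts: F-measure | RDF codes (64 dimensional): Recall | RDF codes (64 dimensional): Precision | RDF codes (64 dimensional): F-measure |
|---|---|---|---|---|---|---|
| Anthemideae | 0.589 | 0.493 | 0.535 | 0.726 | 0.498 | 0.59 |
| Cardueae | 0.527 | 0.481 | 0.499 | 0.554 | 0.534 | 0.536 |
| Eupatorieae | 0.573 | 0.583 | 0.576 | 0.57 | 0.48 | 0.518 |
| Heliantheae | 0.693 | 0.582 | 0.631 | 0.582 | 0.547 | 0.562 |
| Inuleae | 0 | NaN (a) | NaN (a) | 0 | NaN (a) | NaN (a) |
| Lactuceae | 0.444 | 0.895 | 0.589 | 0.344 | 0.718 | 0.458 |
| Vernonieae | 0.492 | 0.724 | 0.58 | 0 | NaN (a) | NaN (a) |
| Kappa | 0.439 | | | 0.38 | | |

(a) “not a number” – indicates that no STLs were predicted as occurring in the corresponding tribe.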
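For readers who wish to reproduce statistics of this kind, the following Python sketch computes recall, precision, F-measure and kappa from a confusion matrix. It assumes that the F-measure is the harmonic mean of precision and recall and that kappa is Cohen's kappa; the function names and the small confusion matrix are hypothetical and are not results from this study.

```python
import math


def per_class_stats(confusion, cls):
    """Recall, precision and F-measure for one class.

    confusion[actual][predicted] holds counts. Precision and F-measure are
    NaN when no pattern was predicted as `cls`.
    """
    tp = confusion[cls].get(cls, 0)
    actual = sum(confusion[cls].values())
    predicted = sum(row.get(cls, 0) for row in confusion.values())
    recall = tp / actual if actual else math.nan
    precision = tp / predicted if predicted else math.nan
    if math.isnan(recall) or math.isnan(precision) or recall + precision == 0:
        f_measure = math.nan  # undefined when there are no true positives
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


def cohen_kappa(confusion):
    """Agreement between actual and predicted classes, corrected for chance."""
    classes = list(confusion)
    n = sum(sum(row.values()) for row in confusion.values())
    observed = sum(confusion[c].get(c, 0) for c in classes) / n
    expected = sum(
        sum(confusion[c].values())
        * sum(row.get(c, 0) for row in confusion.values())
        for c in classes
    ) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical three-tribe confusion matrix: {actual: {predicted: count}}.
confusion = {
    "ANT": {"ANT": 8, "HLT": 2},
    "HLT": {"ANT": 1, "HLT": 9},
    "INU": {"HLT": 5},
}
inu_recall, inu_precision, inu_f = per_class_stats(confusion, "INU")
kappa = cohen_kappa(confusion)
```

In the hypothetical matrix nothing is predicted as INU, so its recall is 0 while precision and F-measure are NaN, which is the situation marked by the table footnote.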

RDF codes with different dimensionalities, i.e. 32 and 128, were examined as well. The obtained kappa values – 0.342 and 0.298 – were somewhat lower, and we concluded that 64 is the best RDF dimensionality for our case.

2.3.2 k-NN models

The selection of the optimal number of neighbors for the k-NN classifier was made by varying k from 1 to 9. Analogously to the CPG neural network approach, the validation set (135 STLs) was used at this point and, once the optimal value for k had been determined, this set was merged with the training data. The best results were obtained with k=1. Table 2.4 gives the classification statistics of the 1-nearest neighbor models applied to the first test set (142 STLs) with atom counts and 64 dimensional RDF codes. If more than one nearest neighbor was found, i.e. the tested pattern had the same distance to more than one pattern in the training set, the result was determined by majority vote with possible ties broken at random.
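The tie handling described above can be sketched in Python as follows. The Euclidean distance, the tolerance used to decide that two distances are equal, and all names are assumptions made for this illustration.

```python
import math
import random
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equally long vectors (assumed metric)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def one_nn_predict(x, training, rng=random):
    """1-nearest neighbor prediction.

    training: list of (vector, label) pairs. When several training patterns
    share the minimum distance, the label is chosen by majority vote among
    them; remaining ties are broken at random.
    """
    distances = [(euclidean(x, vec), label) for vec, label in training]
    d_min = min(d for d, _ in distances)
    tied = [label for d, label in distances
            if math.isclose(d, d_min, rel_tol=0.0, abs_tol=1e-12)]
    votes = Counter(tied)
    top = max(votes.values())
    winners = sorted(label for label, n in votes.items() if n == top)
    return winners[0] if len(winners) == 1 else rng.choice(winners)


# Hypothetical two-dimensional patterns; three of them lie at distance 1 from x.
training = [((0, 0), "ANT"), ((2, 0), "HLT"), ((1, 1), "HLT"), ((5, 5), "VER")]
prediction = one_nn_predict((1, 0), training)
```

Here two of the three equidistant patterns belong to HLT, so the majority vote settles the prediction without a random choice.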

Table 2.4 Classification statistics of the 1-nearest neighbor models applied on the first test set

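| Tribe | Atom counts: Recall | Atom counts: Precision | Atom counts: F-measure | RDF codes (64 dimensional): Recall | RDF codes (64 dimensional): Precision | RDF codes (64 dimensional): F-measure |
|---|---|---|---|---|---|---|
| Anthemideae | 0.759 | 0.824 | 0.790 | 0.741 | 0.741 | 0.741 |
| Cardueae | 0.636 | 0.621 | 0.628 | 0.818 | 0.6 | 0.692 |
| Eupatorieae | 0.777 | 0.747 | 0.762 | 0.8 | 0.857 | 0.828 |
| Heliantheae | 0.878 | 0.778 | 0.824 | 0.911 | 0.788 | 0.845 |
| Inuleae | 0.325 | 0.299 | 0.31 | 0.375 | 1 | 0.545 |
| Lactuceae | 0.455 | 1 | 0.625 | 0.667 | 0.857 | 0.75 |
| Vernonieae | 0.867 | 0.948 | 0.904 | 0.667 | 0.8 | 0.727 |
| Kappa | 0.691 | | | 0.723 | | |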

As with the CPG neural network, the 64 dimensional RDF code gave the best performance compared to the 32 and 128 dimensional ones, which gave kappa values of 0.584 and 0.636, respectively.

2.3.3 Comparison between the obtained models

It is generally agreed that the more data a model has been trained on, the more likely it is to be reliable.55 Therefore our final models were built using the entire data set, i.e. 921 STLs, with the settings described above. The CPG neural network classifier was dropped at this stage since Table 2.3 and Table 2.4 show that the 1-nearest neighbor performed significantly better on this type of data. The classification statistics of the 1-nearest neighbor models with both structure representations based on ten times 10-fold stratified CV are given in Table 2.5.

Table 2.5 Classification statistics of the final 1-nearest neighbor models based on ten times 10-fold cross-validation

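| Tribe | Atom counts: Recall | Atom counts: Precision | Atom counts: F-measure | RDF codes (64 dimensional): Recall | RDF codes (64 dimensional): Precision | RDF codes (64 dimensional): F-measure |
|---|---|---|---|---|---|---|
| Anthemideae | 0.731 | 0.727 | 0.729 | 0.726 | 0.774 | 0.749 |
| Cardueae | 0.726 | 0.768 | 0.746 | 0.863 | 0.700 | 0.773 |
| Eupatorieae | 0.752 | 0.749 | 0.751 | 0.807 | 0.795 | 0.801 |
| Heliantheae | 0.780 | 0.729 | 0.754 | 0.780 | 0.740 | 0.760 |
| Inuleae | 0.460 | 0.548 | 0.500 | 0.500 | 0.641 | 0.562 |
| Lactuceae | 0.625 | 0.682 | 0.652 | 0.646 | 0.756 | 0.697 |
| Vernonieae | 0.792 | 0.871 | 0.830 | 0.649 | 0.714 | 0.680 |
| Kappa | 0.665 | | | 0.682 | | |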

As can be seen in Table 2.5, the use of RDF codes gave slightly better results. However, there is substantial variance in each single CV run. Although the presented results are averaged over 10 CV runs, which reduces the variance, the use of a statistical test rather than a direct comparison of the values has been suggested.55 To test whether the difference between using atom counts and RDF codes with the 1-nearest neighbor classifier is statistically significant, we used a two-tailed paired t-test on the average kappa values produced in each single 10-fold CV run. The same splits of the data set were used with each structural descriptor, thus allowing the application of this test.55 The test produced a p-value of 0.0066; therefore the null hypothesis – that the mean kappa values obtained with the two descriptors are equal – can be rejected with high confidence. Because the individual CV estimates were obtained on the same data set, they are not actually independent, and the above test can confirm only that these estimates differ, not that the “true” classifier performance differs across different training sets. However, it has been shown that repeated 10-fold CV is one of the most accurate estimators available; thus, we conclude that the RDF code indeed outperforms the atom counts.
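The mechanics of such a paired test can be sketched with the Python standard library. The ten kappa values per descriptor below are invented for illustration and are not the values obtained in this study; instead of computing a p-value, the sketch compares the t statistic with the tabulated two-tailed 5% critical value for nine degrees of freedom (2.262).

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t statistic and degrees of freedom of a paired t-test on samples a, b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical average kappa values from ten 10-fold CV runs on identical splits.
kappa_rdf = [0.69, 0.68, 0.67, 0.70, 0.68, 0.69, 0.67, 0.68, 0.69, 0.67]
kappa_atoms = [0.67, 0.66, 0.67, 0.68, 0.66, 0.66, 0.66, 0.67, 0.66, 0.66]

t_stat, dof = paired_t_statistic(kappa_rdf, kappa_atoms)

T_CRIT_TWO_TAILED_05_DF9 = 2.262  # tabulated critical value, alpha = 0.05
significant = abs(t_stat) > T_CRIT_TWO_TAILED_05_DF9
```

Pairing the runs in this way is valid only because both descriptors were evaluated on the same splits, as stated above.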

2.3.4 Predictions on the second test set

In addition to providing another measure of the quality of the built model, the purpose of the second test set (72 STLs) was to illustrate the applicability of the proposed model to chemotaxonomic analysis (cf. Figure 2.2 and Discussion) as well as the ability to correctly classify individual plant species rather than single STLs. The entire data set (921 STLs) was used as training data and the STL profiles of the plant species (Table 2.2) were classified using 1-nearest neighbor with 64 dimensional RDF codes. The classification based on single STLs produced a kappa value of 0.447. In addition, each species was assigned to a tribe using a majority vote and the results are summarized in Table 2.6.

Table 2.6 Assignment of the individual plant species into their corresponding tribes of Asteraceae using majority vote (the correctly assigned species are given in bold)

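| Species | Number of STLs | Correctly classified STLs | Species classified as |
|---|---|---|---|
| **ANT.es1** | 3 | 2 | ANT |
| **ANT.es2** | 7 | 4 | ANT |
| **CAR.es1** | 6 | 4 | CAR |
| CAR.es2 | 6 | 1 | ANT |
| **EUP.es1** | 9 | 8 | EUP |
| **HLT.es1** | 9 | 8 | HLT |
| INU.es1 | 8 | 2 | EUP |
| **LAC.es1** | 5 | 3 | LAC |
| **VER.es1** | 17 | 5 | VER |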
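The majority vote used to assign a species can be sketched in a few lines of Python. The function name is hypothetical; the vote lists reproduce the per-tribe counts reported for ANT.es2 and CAR.es2 in the discussion of this table. How ties between tribes would be resolved is not specified here, so the sketch simply returns the first of the tied tribes.

```python
from collections import Counter


def assign_species(stl_predictions):
    """Assign a species to the tribe predicted for most of its STLs.

    Returns (tribe, share_of_votes). With tied counts, Counter.most_common
    returns the tribe encountered first; this is an arbitrary choice.
    """
    votes = Counter(stl_predictions)
    tribe, count = votes.most_common(1)[0]
    return tribe, count / len(stl_predictions)


# Predicted tribes for the STLs of two species of the second test set.
ant_es2 = ["ANT"] * 4 + ["HLT"] * 3
car_es2 = ["CAR"] * 1 + ["ANT"] * 3 + ["HLT"] * 2

ant_es2_result = assign_species(ant_es2)
car_es2_result = assign_species(car_es2)
```

The share of votes returned alongside the tribe gives a rough indication of how close a decision was: 4 of 7 for ANT.es2 and 3 of 6 for CAR.es2.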

2.3.5 Defining the prediction space

The methods for defining the prediction space were tested initially on the first test set (142 STLs), using the remaining 779 STLs of the entire data set for defining the space, and subsequently on the second test set (72 STLs), using the full data set (921 STLs) for defining the prediction space. 1-nearest neighbor with 64 dimensional RDF codes was used as the classification method.

2.3.5.1 PCA and Hotelling T2

The rejection rate was varied by using different confidence levels α, cf. Equation 2.2. All STLs, which were identified outside the prediction space at a given level, were left out of the test set, i.e. were rejected and no prediction was made for them. Figure 2.7 shows the overall classification quality, which was obtained on the first and second test set given by the kappa statistic, at different rejection rates. All STLs were described by 64 dimensional RDF codes.


Figure 2.7 Overall classification quality against the rejection rate with PCA based approach.

Even at high rejection rates no significant improvement in the classification quality can be observed. For the second test set, the quality even decreases with the rejection rate, thus rather than excluding the misclassified patterns some of the correctly classified ones have been deemed new.

2.3.5.2 Rejection rule

The rejection rate was varied using increasing values for the ambiguity threshold Ta. All STLs were first classified and the probability given by Equation 2.5 was calculated using the five nearest neighbors. If the calculated probability for a given structure was smaller than or equal to the predefined threshold, the structure was removed from the test set. After all such STLs had been removed, the classification statistics were calculated from the remaining patterns, i.e. those classified with high confidence. Figure 2.8 shows the overall classification quality, given by the kappa statistic, which was obtained on the first and second test set at different rejection rates. All STLs were described by 64 dimensional RDF codes.
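The threshold sweep described in this section can be sketched as follows. For brevity the sketch reports the fraction of correctly classified patterns among those retained, rather than the kappa statistic used in this study; the function name and the confidence values are hypothetical.

```python
def rejection_curve(results, thresholds):
    """Rejection rate and quality on the retained patterns for each threshold.

    results: list of (confidence, is_correct) pairs, where confidence is the
    value of Equation 2.5 for the predicted class.
    Returns a list of (t_a, rejection_rate, accuracy_on_retained) tuples.
    """
    curve = []
    for t_a in thresholds:
        # A pattern is rejected when its confidence is <= t_a.
        kept = [ok for conf, ok in results if conf > t_a]
        rejection_rate = 1 - len(kept) / len(results)
        accuracy = sum(kept) / len(kept) if kept else float("nan")
        curve.append((t_a, rejection_rate, accuracy))
    return curve


# Hypothetical (confidence, correctly classified?) pairs for eight patterns.
results = [(0.95, True), (0.9, True), (0.8, True), (0.6, False),
           (0.55, True), (0.5, False), (0.4, False), (1.0, True)]
curve = rejection_curve(results, thresholds=[0.0, 0.5, 0.7])
```

With these invented values, raising the threshold from 0 to 0.5 and 0.7 rejects 25% and 50% of the patterns and raises the fraction of correct classifications among the rest from 5/8 to 5/6 and 1.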


Figure 2.8 Overall classification quality against the rejection rate with reject rule.

The classification quality generally increases with increase in the rejection rate with small fluctuations. Even at small rejection rates – less than 20% – there is an improvement in the overall classification quality.

2.4 Discussion

The main goal of this study was to develop a methodology which allows the assignment of different STLs, isolated from taxa of the family Asteraceae, to their corresponding tribes. This methodology is a valuable tool in various cases: collection of new plant material with the aim of finding compounds with special structural features (cf. Figure 2.1), classification of novel compounds, and studies of the relationship between taxonomy and chemistry (cf. Figure 2.2). The proposed method allows the classification of novel compounds (unreported or literature-new). This is important since currently one is limited to searching in different databases, which can deal only with already known structures. To achieve that, we used two supervised learning algorithms and two different structure representations.

2.4.1 Classification methods

With regard to the classification algorithms, comparing the kappa values given in Table 2.3 and Table 2.4 clearly shows that the 1-nearest neighbor classifier outperforms the CPG neural network: 0.691 versus 0.439 with atom counts and 0.723 versus 0.38 with RDF codes. This observation, together with the fact that 1-nearest neighbor outperforms all other numbers of neighbors tested, suggests that the decision boundaries between the different tribes are highly irregular and can be approximated only locally.56 This is not unexpected, since it is known that a given compound of high structural complexity often occurs simultaneously in two or more taxa in the plant kingdom, and this is certainly valid within Asteraceae. This natural overlap, or co-occurrence of natural compounds, makes the decision boundaries hard to capture and thus the 1-nearest neighbor classifier gave the best performance. In addition, the overall characteristics of the structures of secondary metabolites may cause a dilemma in the assignment of a given structure to its corresponding class. In the present study, if a compound was reported to appear in more than one tribe, it was assigned exclusively to the one in which it is more frequently found. Although this is the most common approach when dealing with such multi-labeled data,60 it additionally blurs the boundaries between the classes.

The most misclassified tribe was Inuleae (INU), regardless of the classification method or structure representation used. This can be attributed to various reasons: first, the number of STLs belonging to this tribe in the data set (50) might not be enough (although the STLs from this tribe are quite similar to each other and an increase in the data set would certainly not cause dramatic changes in the results); second, around 50% of the 50 STLs present have also been isolated from taxa belonging to some of the other six tribes, i.e. INU has the highest percentage of STLs which co-occur within the other Asteraceae tribes, thus indicating that it does not present a distinguishable pattern of STLs. The most common co-occurrence is between INU and HLT, and consequently almost all STLs marked as belonging to INU are misclassified as belonging to HLT. This can be seen as an indication of chemical similarity between these two tribes with regard to certain skeletal types of STLs, like the guaianolides, and other closely related biogenetic derivatives, which is in agreement with previous chemotaxonomic studies.2,3

2.4.2 Structural descriptors

At first glance, both structural descriptors gave similar performance, but as shown by the paired t-test the RDF code produces a statistically significant improvement. Turning to the individual tribes rather than the overall performance, one can notice that with the RDF code the recall for CAR and LAC increases while it decreases for VER. This can be attributed to the fact that in VER nearly all STLs have an 8α-oxygen function,3 which is better reflected by the atom counts. Moreover, the skeletal type variability of the STLs from VER is quite low when compared to HLT or EUP. The precision remains largely the same with regard to the different representations. However, using the RDF code no STLs from other tribes are predicted as INU – a precision of one (cf. Table 2.4) – while in the case of atom counts some, mainly from HLT, were misclassified as belonging to INU. Again, this can be explained by the fact that some STLs from HLT (the major and more diversified group) show overlap with some compounds from INU. Despite the improvement brought by using RDF codes, the fact that the atom counts have approximately four times smaller dimensionality could still argue for their usage. However, it should be stressed that the atom counts used in this study have been augmented with stereo information, i.e. the stereo bond types – α or β – were taken into account. This prevents this descriptor from being purely 2D, i.e. completely deducible from the connectivity alone, and requires some prior knowledge about the 3D configuration of the stereo centers. We tried to build a classification model by omitting the stereo augmentation, but the quality of the results decreased drastically with a resulting kappa value of 0.231, compared to 0.665 (cf. Table 2.5). 
This is understandable, bearing in mind that, for example, the 8α- or 8β-oxygen orientation is very important for some tribes like VER, EUP or HLT from the chemotaxonomic point of view.2,3 Therefore, the RDF codes are preferable since they bring a statistically significant improvement and are also a more general descriptor, while the selected atom types for the histogram might be (slightly) influenced by the particularity of a certain subset of data.

2.4.3 Applicability

The proposed model allows the classification of a single STL into a corresponding Asteraceae tribe to be done with high confidence. As illustrated in Figure 2.1, such an assignment can be utilized for the selection of a plant source for a given (or desired) natural compound, even if it is a novel structure. Although a tribe is somewhat high in the taxonomic hierarchy within the family, selecting the most probable tribe already limits the possible plant sources significantly, since there is a large number of species within Asteraceae. Thus, the proposed methodology allows for “targeted” collection of plant material based on chemotaxonomic relationships. When applied to real life situations, this feature can drastically narrow the search for a plant and save time and money. This procedure is carried out in several plant screening projects and is certainly a valuable strategy for the discovery of biologically active natural products.61,62 In the same direction, another possible use of this model is in structure elucidation, since spectroscopic data of compounds from plant sources with a similar STL profile can be easily accessed and used for further comparison.

From another point of view, it is often the case in a phytochemistry laboratory that plant material has been collected and its compounds have been isolated and identified. In such a case the proposed model can help in the subsequent chemotaxonomic analysis, especially when one is not familiar with the chemistry and other special features of STLs from this huge plant family. This is illustrated by our second test set, in which the whole profile of a given Asteraceae species was investigated. This allowed us to assign the species as a whole to a given tribe by using a majority vote on the predictions for each STL, as well as to draw conclusions about the similarity in the secondary metabolite chemistry of the species across different Asteraceae tribes.

2.4.4 Chemotaxonomic analysis and majority vote

As can be seen from Table 2.6, seven out of nine species were assigned correctly to their actual tribe. However, while for some of them almost all STLs were classified correctly – ANT.es1, EUP.es1, HLT.es1 – this was not the case for the rest of the species. For both plant species which belong to ANT, all STLs from the second test set possess skeletons and substitution features which are typical for this specific tribe. Consequently, they were classified correctly as expected. However, the decision for ANT.es2 was very close – with 4 STLs correctly classified as ANT and 3 assigned to HLT. This result suggests a high degree of similarity between the secondary metabolite chemistry of the species Achillea depressa and species from the tribe HLT. The first CAR species was correctly assigned with four correct classifications out of six, while the second species was misclassified – our model predicted correctly only one of the six reported STLs. Of the remaining five structures, three were predicted as belonging to ANT, and two as HLT. The species was incorrectly classified as belonging to ANT, although the confidence in this decision is rather low. Bearing in mind that the reported STLs for this species are very common across the whole of Asteraceae, including for example the compound costunolide, it is not surprising that the classification was incorrect and scattered over several tribes. One must bear in mind that costunolide is regarded as the biogenetic precursor of all STLs and is widespread within Asteraceae, and thus is not a typical compound of any tribe. The species from EUP and HLT were very confidently assigned to their correct tribe, which was an expected result. This is because these tribes are the two most represented in the training set (Table 2.1) and also because to a certain extent their STL profile is characteristic. The INU.es1 was misclassified, with the majority of its compounds predicted as HLT. The STLs from INU were found generally hard to separate from the rest (Table 2.4 and Table 2.5) and these results confirm that, based on the current taxonomic classification, it is difficult to handle this tribe correctly based solely on the STL profile. However, as already mentioned, INU and HLT are not very well separated with regard to their STL profiles, thus our results confirm the earlier observations.2,3 The single LAC species was classified correctly. The two misclassified STLs were assigned to ANT and CAR. The most complete profile of all species listed in Table 2.2 belongs to VER.es1. Some of the 17 reported STLs are frequently found in other tribes – mainly EUP and ANT. The rest are typical for VER, although somewhat unusual due to the position/type of substituents, rarely occurring skeletons and/or oxidative levels. Applying our model to this rich profile of VER.es1, we were able to classify the species correctly as belonging to VER based on five correct predictions (Table 2.6). In concert with the above observations, four STLs were misclassified as ANT, three as EUP, and three as HLT. Although we use the term “misclassified”, it is not unlikely that some of these compounds have been isolated from species from these tribes as well. 
Building a classifier which is capable of dealing correctly with such multi-labeled data is a possible extension of the method proposed herein, and this will be the subject of further studies.

2.4.5 Prediction space

Another possible reason for misclassification with any model might be that the new instances which are to be tested lie far away from the prediction space. Although the models presented herein are aimed at a specific class of secondary metabolites, i.e. STLs, the problem is still present since several novel STLs are reported each year,18-39 some of them with rather unusual structures, including for example dimers.63 This is what we examined next. As can be seen from Figure 2.7, the Hotelling test proved unsuitable for the task at hand. Even at high rejection rates the quality of the classification does not increase, or even decreases. This means that the test rejected correctly classified STLs rather than the misclassified ones, as we had expected it to do. This can be attributed to the fact that, although this approach can identify global outliers, i.e. patterns which lie relatively far from the space spanned by the whole data, it neither takes the class membership into account nor can it deal with local spots of low data density. Due to a phenomenon known as “the curse of dimensionality”56 such spots are inevitable when high dimensional patterns are used. In contrast, the second approach – a reject rule based on a measure which combines the a posteriori probabilities with the actual distances, cf. Equation 2.5 – is rather local and does take the class membership into account. It is incapable of defining outliers, since even at large distances the a posteriori probability estimate can be rather high if all or most of the neighbors are from the same class. Nevertheless, as shown in Figure 2.8, it brings almost steady improvement to the classification quality at increased reject rates. Figure 2.8 shows the improvement in the overall classification quality but it does not contain any information about the species. 
We examined whether the classification of the species listed in Table 2.2 (second test set) improves at a rejection rate of about 28%, i.e. 27 out of 72 STLs were rejected from the second test set. Initially, seven out of the nine species were classified correctly. After rejecting around one third of the data, the assignment of the species to their corresponding tribes by majority vote remained the same. However, the proportion of true votes, i.e. the confidence of the classification, increased. Interestingly, almost all – 13 out of 17 – of the STLs in VER.es1 were rejected. Thus, their unusual substitution patterns, skeletons and oxidative levels have been captured by the method.

2.5 Conclusions

A classification model capable of classifying a special type of secondary metabolites – sesquiterpene lactones (STLs) – into seven Asteraceae tribes was developed. The applicability of the presented model for (1) identifying plant sources for a given STL and (2) studying the relationship in the secondary metabolism across different tribes of individual plant species was shown. The k-NN classifier with k=1 gave the best results, regardless of the structural descriptor used, thus suggesting highly irregular class boundaries. Two chemical structure descriptors – a histogram of atom counts augmented with stereo information, and RDF codes – were investigated. The RDF code gave better results and the difference in the performance was statistically significant. Therefore, it is proposed in concert with the 1-nearest neighbor classifier, which clearly outperformed the CPG neural network. Two approaches to identifying patterns which are likely to be misclassified were studied: (1) a distance metric based on principal component analysis and the Hotelling T2 statistic17 and (2) a rejection rule based on the a posteriori probability estimates and the distances to the nearest neighbors.11 While the former did not bring any significant improvement, the latter provided a useful way to reject patterns whose classification the method was not confident enough about. This approach is an example of how statistical methods and machine learning tools can be combined and applied to a real life problem in order to study the intricate mechanism of naturally occurring compounds from plants.

2.6 Acknowledgments

FBC is grateful to the Alexander von Humboldt Foundation (Germany) for a Research Fellowship at the Computer-Chemie-Centrum. Simon Spycher is thanked for his valuable comments.

2.7 References

(1) Wink, M. Evolution of Secondary Metabolites From an Ecological and Molecular Phylogenetic Perspective. Phytochemistry 2003, 64, 3-19.

(2) Seaman, F. C. Sesquiterpene Lactones As Taxonomic Characters in the Asteraceae. Bot. Rev. 1982, 48, 123-551.

(3) Zdero, C.; Bohlmann, F. Systematics and Evolution Within the Compositae, Seen With the Eyes of a Chemist. Plant Syst. Evol. 1990, 171, 1-14.

(4) Bremer, K. Asteraceae: Cladistics and Classification; Timber Press: Portland, 1994.

(5) Emerenciano, V. P.; Ferreira, M. J. P.; Branco, M. D.; Dubois, J. E. The Application of Bayes' Theorem in Natural Products As a Guide for Skeletons Identification. Chemometr. Intell. Lab. Sys. 1998, 40, 83-92.

(6) Alvarenga, S. A. V.; Ferreira, M. J. P.; Emerenciano, V. P.; Cabrol-Bass, D. Chemosystematic Studies of Natural Compounds Isolated From Asteraceae: Characterization of Tribes by Principal Component Analysis. Chemometr. Intell. Lab. Sys. 2001, 56, 27-37.

(7) Netzeva, T. I.; Worth, A. P.; Aldenberg, T.; Benigni, R.; Cronin, M. T. D.; Gramatica, P.; Jaworska, J. S.; Kahn, S.; Klopman, G.; Marchant, C. A.; Myatt, G.; Nikolova-Jeliazkova, N.; Patlewicz, G. Y.; Perkins, R.; Roberts, D. W.; Schultz, T. W.; Stanton, D. T.; van de Sandt, J. J. M.; Tong, W.; Veith, G.; Yang, C. Current Status of Methods for Defining the Applicability Domain of (Quantitative) Structure-Activity Relationships. ATLA, Alt. Lab. Anim. 2005, 33, 155-174.

(8) Markou, M.; Singh, S. Novelty Detection: a Review - Part 1: Statistical Approaches. Signal Process. 2003, 83, 2481-2497.

(9) Markou, M.; Singh, S. Novelty Detection: a Review - Part 2: Neural Network Based Approaches. Signal Process. 2003, 83, 2499-2521.

(10) Fumera, G.; Roli, F.; Giacinto, G. Reject Option With Multiple Thresholds. Pattern Recogn. 2000, 33, 2099-2101.

(11) Arlandis, J.; Perez-Cortes, J. C.; Cano, J. Rejection Strategies and Confidence Measures for a k-NN Classifier in an OCR Task. 16th International Conference on Pattern Recognition (ICPR'02) 2002, 1, 10576-10580.

(12) Gasteiger, J. A Hierarchy of Structure Representations. In Handbook of Chemoinformatics; Gasteiger, J., Ed.; Wiley-VCH: Weinheim, 2003; pp 1034-1061.

(13) Da Costa, F. B.; Terfloth, L.; Gasteiger, J. Sesquiterpene Lactone-Based Classification of Three Asteraceae Tribes: a Study Based on Self-Organizing Neural Networks Applied to Chemosystematics. Phytochemistry 2005, 66, 345-353.

(14) Gastmans, J. P.; Zurita, J. C.; Sahao, J.; Emerenciano, V. P. Prevision Des Spectres De Resonance Magnetique Nucleaire De 13C Par Intelligence Artificielle: Le Probleme De La Codification: Prediction of 13C-Nuclear Magnetic Resonance Spectra by Artificial Intelligence: the Problem of Coding Structures. Anal. Chim. Acta 1989, 217, 85-100.

(15) Hemmer, M. C.; Steinhauer, V.; Gasteiger, J. Deriving the 3D Structure of Organic Molecules From Their Infrared Spectra. Vib. Spectrosc. 1999, 19, 151-164.

(16) Zupan, J.; Gasteiger, J. Neural Networks in Chemistry and Drug Design; Wiley-VCH: Weinheim, 1999.

(17) Eriksson, L.; Jaworska, J.; Worth, A. P.; Cronin, M. T.; McDowell, R. M.; Gramatica, P. Methods for Reliability and Uncertainty Assessment and for Applicability Evaluations of Classification- and Regression-Based QSARs. Environ. Health Persp. 2003, 111, 1361-1375.

(18) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 1985, 2, 147-161.

(19) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 1986, 3, 273-296.

(20) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 1987, 4, 473-498.

(21) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 1988, 5, 497-521.

(22) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 1990, 7, 61-84.

(23) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 1990, 7, 515-537.

(24) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 1992, 9, 217-241.

(25) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 1992, 9, 557-580.

(26) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 1993, 10, 397-419.

(27) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 1994, 11, 533-554.

(28) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 1995, 12, 303-320.

(29) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 1996, 13, 307-326.

(30) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 1997, 14, 145-162.

(31) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 1998, 15, 73-92.

(32) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 1999, 16, 21-38.

(33) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 1999, 16, 711-730.

(34) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 2000, 17, 483-504.

(35) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 2001, 18, 650-673.

(36) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 2002, 19, 650-672.

(37) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 2003, 20, 392-413.

(38) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 2004, 21, 669-693.

(39) Fraga, B. M. Natural Sesquiterpenoids. Nat. Prod. Rep. 2005, 22, 465-486.

(40) Herz, W. Chemistry of the Eupatoriinae. Biochem. Syst. Ecol. 2001, 29, 1115-1137.

(41) SciFinder Scholar, version 2004; American Chemical Society: Columbus, Ohio, 2003.

(42) Celik, S.; Rosselli, S.; Maggio, A. M.; Raccuglia, R. A.; Uysal, I.; Kisiel, W.; Bruno, M. Sesquiterpene Lactones From Anthemis wiedemanniana. Biochem. Syst. Ecol. 2005, 33, 952-956.

(43) Trifunovic, S.; Aljancic, I.; Vajs, V.; Macura, S.; Milosavljevic, S. Sesquiterpene Lactones and Flavonoids of Achillea depressa. Biochem. Syst. Ecol. 2005, 33, 317-322.

(44) Bruno, M.; Rosselli, S.; Maggio, A.; Raccuglia, R. A.; Arnold, N. A. Guaianolides From Centaurea babylonica. Biochem. Syst. Ecol. 2005, 33, 817-825.

(45) Bentamene, A.; Benayache, S.; Creche, J.; Petit, G.; Bermejo-Barrera, J.; Leon, F.; Benayache, F. A New Guaianolide and Other Sesquiterpene Lactones From Centaurea acaulis L. (Asteraceae). Biochem. Syst. Ecol. 2005, 33, 1061-1065.

(46) Hernandez, Z. N. J.; Catalan, C. A.; Hernandez, L. R.; Guerra-Ramirez, D.; Joseph-Nathan, P. Sesquiterpene Lactones From Stevia alpina var. glutinosa. Phytochemistry 1999, 51, 79-82.

(47) Spring, O.; Zipper, R.; Klaiber, I.; Reeb, S.; Vogler, B. Sesquiterpene Lactones in Viguiera eriophora and Viguiera puruana (Heliantheae; Asteraceae). Phytochemistry 2000, 55, 255-261.

(48) Lee, J. S.; Min, B. S.; Lee, S. M.; Na, M. K.; Kwon, B. M.; Lee, C. O.; Kim, Y. H.; Bae, K. H. Cytotoxic Sesquiterpene Lactones From Carpesium abrotanoides. Planta Med. 2002, 745-747.

(49) Zidorn, C.; Ellmerer, E. P.; Konwalinka, G.; Schwaiger, N.; Stuppner, H. 13-Chloro-3-O-β-D-Glucopyranosylsolstitialin From Leontodon palisae: the First Genuine Chlorinated Sesquiterpene Lactone Glucoside. Tetrahedron Lett. 2004, 45, 3433-3436.

(50) Pollora, G. C.; Bardon, A.; Catalan, C. A. N.; Griffin, C. L.; Herz, W. Elephantopus-Type Sesquiterpene Lactones From a Second Vernonanthura Species, Vernonanthura lipeoensis. Biochem. Syst. Ecol. 2004, 32, 619-625.

(51) Sadowski, J.; Gasteiger, J. From Atoms and Bonds to Three-Dimensional Atomic Coordinates: Automatic Model Builders. Chem. Rev. 1993, 93, 2567-2581.

(52) CORINA, version 3.0; Molecular Networks GmbH: Erlangen, Germany, 2003, http://www.molecular-networks.com.

(53) ADRIANA.CODE, version 1.0; Molecular Networks GmbH: Erlangen, Germany, 2006, http://www.molecular-networks.com.

(54) SONNIA - Self Organizing Neural Network for Information Analysis, version 4.1; Molecular Networks GmbH: Erlangen, Germany, 2002, http://www.molecular-networks.com.

(55) Witten, I. H.; Frank, E. Data Mining: Practical Machine Learning Tools and Techniques; Morgan Kaufmann: San Francisco, 2000.

(56) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: New York, 2001.

(57) Forbes, A. D. Classification-Algorithm Evaluation: Five Performance Measures Based on Confusion Matrices. J. Clin. Monitor. 1995, 11, 189-206.

(58) Spycher, S.; Pellegrini, E.; Gasteiger, J. Use of Structure Descriptors To Discriminate Between Modes of Toxic Action of Phenols. J. Chem. Inf. Model. 2005, 45, 200-208.

(59) Vesanto, J.; Himberg, J.; Alhoniemi, E.; Parhankangas, J. SOM Toolbox for Matlab 5; Technical Report A57; Helsinki University of Technology, Neural Networks Research Centre: Espoo, Finland, 2000.

(60) Boutell, M. R.; Luo, J.; Shen, X.; Brown, C. M. C. Learning Multi-Label Scene Classification. Pattern Recogn. 2004, 37, 1757-1771.

(61) Hostettmann, K.; Wolfender, J. L. The Search for Biologically Active Secondary Metabolites. Pestic. Sci. 1997, 51, 471-482.

(62) Cordell, G. A.; Shin, Y. G. Finding the Needle in the Haystack. The Dereplication of Natural Products Extracts. Pure Appl. Chem. 1999, 71, 1089-1094.

(63) Staneva, J.; Trendafilova-Savkova, A.; Todorova, M. N.; Evstatieva, L.; Vitkova, A. Terpenoids From Anthemis austriaca Jacq. Z. Naturforsch. C 2004, 59, 161-165.


Further comments and discussion

The importance of secondary metabolites as taxonomic markers has recently been questioned.1 It is not uncommon for the same secondary metabolites to be found in entirely unrelated species as a result of adaptations and particular life strategies. Thus, the distribution of secondary metabolites has to be analyzed carefully and critically, like any adaptive trait.

In spite of this, the classification techniques applied in this study (counter-propagation neural networks and k-nearest neighbors) produced models of good quality. Thus, we can conclude that there is a certain relationship between the Asteraceae secondary metabolism (in terms of sesquiterpene lactones) and the taxonomic division proposed by Bremer.2

On the other hand, as mentioned at the beginning of this chapter, the taxonomic classification of Asteraceae is disputed even among botanists. Therefore, the good classification models obtained can be seen as supporting the division proposed by Bremer. Of course, fully supporting a given taxonomic division requires a comparative approach. For example, classification models can be built using different taxonomic classifications to assign a class label to each STL. The quality of each classification model can then be interpreted as the degree of agreement between the secondary metabolism of the Asteraceae plants and the corresponding taxonomic classification.

In addition to the chemotaxonomy-related problems, the important problem of defining the applicability domain of the built models has been studied. This problem is gaining more and more attention in the chemoinformatics community.3 It is known4 from machine learning and statistical theory that most statistical methods are not well suited for extrapolation. Thus, an object which is far from the space covered by the data used to build the model is quite likely to be predicted wrongly. While this may not cause large problems in a chemotaxonomic study, it may lead to misleading and costly errors in other fields. Pursuing the development of a potential drug which was wrongly predicted by a statistical model, for example, may cost millions. Thus, the two ways of defining the applicability domain of a classification model which are discussed in the previous section are among the important contributions of the presented work.

The coding of chemical structures in a way suitable for use with different machine learning techniques is a large area of chemoinformatics.5 We have demonstrated that descriptors based on the three-dimensional chemical structure perform better for the assignment of STLs to the correct Asteraceae tribe. This result is not surprising, given that the secondary metabolites are formed in enzyme pockets, which are three-dimensional entities. Of course, given the high conformational flexibility of the STLs, further improvements may be achievable with conformation-aware methods. However, for the purpose of chemotaxonomic analysis the consideration of a single low-energy conformation proved to be sufficient.

With respect to the machine learning techniques, it is worth noting that the two classification methods used in this study are only a tiny part of the array of classification techniques available at present. Given that the k-nearest neighbor method, which gave the better results, is a relatively simple technique, other classification algorithms may produce still better results.

On the other hand, there are at least two reasons to prefer relatively simple machine learning techniques. First, Occam’s Razor (named after the medieval philosopher William of Occam) states that, other things being equal, simple theories are preferable to complex ones. Secondly, and probably more importantly, the simpler the machine learning method, the better its interpretability. Gaining knowledge is much easier from highly interpretable models than from more sophisticated ones, even if the latter happen to produce better results. Thus, as will be shown in the next section, the use of a more sophisticated classification technique, support vector machines,6 does lead to better classification performance, but the interpretability of the resulting model decreases.

Additional References

(1) Wink, M. Evolution of Secondary Metabolites From an Ecological and Molecular Phylogenetic Perspective. Phytochemistry 2003, 64, 3-19.

(2) Bremer, K. Asteraceae: Cladistics and Classification.; Timber Press: Portland, 1994.

(3) Netzeva, T. I.; Worth, A. P.; Aldenberg, T.; Benigni, R.; Cronin, M. T. D.; Gramatica, P.; Jaworska, J. S.; Kahn, S.; Klopman, G.; Marchant, C. A.; Myatt, G.; Nikolova-Jeliazkova, N.; Patlewicz, G. Y.; Perkins, R.; Roberts, D. W.; Schultz, T. W.; Stanton, D. T.; van de Sandt, J. J. M.; Tong, W.; Veith, G.; Yang, C. Current Status of Methods for Defining the Applicability Domain of (Quantitative) Structure-Activity Relationships. ATLA - Alt. Lab. Anim. 2005, 33, 155-174.

(4) Witten, I. H.; Frank, E. Data Mining: Practical Machine Learning Tools and Techniques; Morgan Kaufmann: San Francisco, 2000.

(5) Gasteiger, J. A Hierarchy of Structure Representations. In Handbook of Chemoinformatics; Gasteiger, J., Ed.; Wiley-VCH Verlag: Weinheim, 2003; Chapter 3.1, pp. 1034-1061

(6) Vapnik, V. N. Statistical Learning Theory; Wiley-Interscience: 1998.


3 Multi-labeled Classification Approach to Find a Plant Source for Terpenoids

Overview

Sesquiterpene lactones are terpenoid compounds characteristic of the Asteraceae.1 Many of them show different biological activities.1-4 However, finding specific sesquiterpene lactones in tens of thousands of Asteraceae species is like finding a needle in a haystack. Therefore, a method which helps in the identification of possible plant sources for specific sesquiterpene lactones has multiple applications. One such method based on different classification techniques has been presented in Chapter 2.

A limitation of both classification techniques described in Chapter 2 is that they are not able to assign sesquiterpene lactones to more than one tribe simultaneously. Around ten per cent of the sesquiterpene lactones in our data set were isolated from several plant species which belong to more than one Asteraceae tribe. In Chapter 2 we have assigned all such sesquiterpene lactones to the tribe in which they have been found more often. While this approach is reasonable, it causes some loss of information contained in the original data. This chapter tries to solve the problem of co-occurring sesquiterpene lactones by utilizing a class of machine learning techniques known as multi-labeled classification. The multi-labeled classification techniques are created especially for handling objects, which can belong to more than one class simultaneously.

Building on the results described in Chapter 2, here we explore the possibility of assigning sesquiterpene lactones to more than one Asteraceae tribe simultaneously. To achieve this, two multi-labeled classification techniques are applied to the data set described in Chapter 2. A detailed overview of multi-labeled classification in general and of the two specific techniques, cross-training with support vector machines and multi-labeled k-nearest neighbors, is presented. The results are compared to the single-label classification presented in Chapter 2 and are analyzed from a chemotaxonomic point of view. The practical application of such models for targeted collection of plants is emphasized. Such a knowledge-driven approach may save a lot of time and effort in locating plant species which are likely to provide a source of sesquiterpene lactones with desired properties.

The remainder of this chapter corresponds exactly to the original paper submitted for publication in the Journal of Chemical Information and Modeling other than the last section after the references list and the numbering.

References

(1) Picman, A. K. Biological Activities of Sesquiterpene Lactones. Biochem. Syst. Ecol. 1986, 14, 255-281.

(2) Castillo, M.; Martinez-Pardo, R.; Garcera, M. D.; Couillaud, F. Biological Activities of Natural Sesquiterpene Lactones and the Effect of Synthetic Sesquiterpene Derivatives on Insect Juvenile Hormone Biosynthesis. J. Agric. Food Chem. 1998, 46, 2030-2035.

(3) Kumar, A.; Singh, S. P.; Bhakuni, R. S. Secondary Metabolites of Chrysanthemum and Their Biological Activities. Curr. Sci. 2005, 89, 1489-1501.

(4) Youl Cho, J. Sesquiterpene Lactones As a Potent Class of NF-κB Activation Inhibitors. Curr. Enz. Inhib. 2006, 2, 329-341.


Original article

Multi-labeled Classification Approach to Find a Plant Source for Terpenoids

Dimitar Hristozov1*, Fernando B. Da Costa1,2, and Johann Gasteiger1

1Computer-Chemie-Centrum, Universität Erlangen-Nürnberg, Nägelsbachstr. 25, D-91052 Erlangen, Germany

2Laboratório de Farmacognosia, Faculdade de Ciências Farmacêuticas de Ribeirão Preto, Universidade de São Paulo, Av. do Café s/no., 14040-903, Ribeirão Preto, SP, Brazil

Hristozov, D., Da Costa, F.B., Gasteiger, J., 2007, submitted to J. Chem. Inf. Model.

Abstract

Recently we have built a classification model capable of assigning a given sesquiterpene lactone (STL) into exactly one tribe of the plant family Asteraceae from which the STL has been isolated. Although many plant species are able to biosynthesize a set of peculiar compounds, the occurrence of the same secondary metabolites in more than one tribe of Asteraceae is frequent. Building on our previous work, we explore in this paper the possibility of assigning an STL to more than one tribe (class) simultaneously. When an object may belong to more than one class simultaneously it is called multi-labeled. In this work, we present a general overview of the available techniques to deal with multi-labeled data. The problem of evaluating the performance of a multi-labeled classifier is discussed. Two particular multi-labeled classification methods – cross-training with Support Vector Machines (ct-SVM) and multi-labeled k-Nearest Neighbors (ML-kNN) – were applied to the classification of the STLs into seven tribes from the plant family Asteraceae. The results are compared to a single-label classification and are analyzed from a chemotaxonomic point of view. The multi-labeled approach allowed us to (1) model the reality as closely as possible; (2) improve our understanding of the relationship between the secondary metabolite profiles of different Asteraceae tribes, and (3) significantly decrease the number of plant sources to be considered for finding a certain STL. The presented classification models are useful for the targeted collection of plants with the aim of finding plant sources of natural compounds which are biologically active or possess other specific properties of interest.

3.1 Introduction

Natural products of plant origin play a crucial role in different areas of research. These compounds belong to different chemical classes (alkaloids, phenolics, terpenoids, etc.) and have chemically diverse and complex structures. Because of these complex structures many of the natural compounds are hard to synthesize.1 Natural products have a wealth of applications. Some of them are used as drugs while others possess important biological properties, or are used as dietary supplements, as dyes, as flavoring agents, or as ingredients in the cosmetics industry.2 Nowadays, both academia and industry are interested in finding different plant sources of such compounds.3 However, finding specific chemical compounds in tens of thousands of plant species can be compared to finding a needle in a haystack.

A frequently used strategy for targeted collection of plants is the chemotaxonomic approach.4 It consists of selecting plants which are likely to produce the desired chemical compounds based on the relationship between the taxonomic classification in the plant kingdom and the secondary metabolism of the plants. For example, if one is interested in a special type of terpenoids, the sesquiterpene lactones (STLs), the plant family of choice is Asteraceae, the sunflower family. Within this family, several subgroups of STL structures with special substitution patterns occur in one or more of the ten Asteraceae tribes.5,6 However, plants in each of the Asteraceae tribes synthesize STLs which are somewhat typical for the particular tribe.7 Thus, the STLs are often used as taxonomic markers to characterize or cluster taxa (tribes, subtribes, genera, species, etc.).5 Moreover, many of the STLs show biological activities, for example lactucin, the bitter principle from chicory (Cichorium intybus) roots, and the antiphlogistic compound matricin from chamomile (Matricaria chamomilla), among others (Chart 3.1).

Chart 3.1 Biologically active STLs of Asteraceae: lactucin (1, tribe Lactuceae), matricin (2, tribe Anthemideae), isabelin (3, tribes Anthemideae and Heliantheae) and isoallantolactone (4, tribes Anthemideae, Heliantheae and Inuleae).


The aforementioned characteristics of the STLs have long attracted the attention of many scientists, making the STLs one of the most explored classes of terpenoids. Therefore, a method which helps in the identification of possible plant sources for one or more specific STLs has multiple applications. As can be seen from Chart 3.1, the co-occurrence, or overlap, of STLs across different tribes in Asteraceae is not uncommon. Therefore, a technique which is capable of assigning an STL to more than one tribe is preferable. Such a technique can predict more than one possible source and thus provides an option for selecting the most advantageous plant source. Thus, it may significantly lower the costs associated with the collection of plant material.

One technique capable of handling data which may belong to more than one class is the so-called collaborative filtering,8 which has been recently applied to a family of biological targets.9 Another approach is based on the so-called multi-labeled classification. In the following we first give a general overview of multi-labeled classification. Subsequently, a short survey of the literature on multi-labeled classification is presented. In the final part of the introduction some measures suitable for assessing the performance of multi-labeled classification are presented. Two concrete multi-labeled classification implementations are described in Materials and Methods. The results of their application to the problem of finding the Asteraceae tribe(s) in which a particular STL(s) can be found are presented in the Results section. The interpretation of the obtained classification models from a chemotaxonomic point of view is presented in the Discussion.

3.1.1 Multi-labeled classification

Classification is the task of assigning objects to a pre-specified set of classes. In traditional classification tasks, these classes are mutually exclusive by definition. Various learning methods have been developed to deal with such problems.10 With any of these methods, errors occur when the classes overlap in the selected feature space (Figure 3.1).


Figure 3.1 Typical classification problem: the two classes contain instances which are difficult to separate in the feature space.

However, in some classification tasks the assumption of mutual exclusiveness of the classes is violated by definition. For example, in text categorization a document may belong to multiple genres;11,12 a biologically active compound may exhibit more than one activity; in biochemistry a gene may have multiple functions, yielding multiple labels;13 a plant secondary metabolite may appear in more than one taxon5,7 (tribe, subtribe, genus, etc.). When an object may belong to more than one class it is called multi-labeled (Figure 3.2).

Figure 3.2 Multi-labeled classification problem: the data marked with both symbols belong to both base classes simultaneously.

The most common approach to deal with multi-labeled data is to assign each object to the class to which it most likely belongs, by some perhaps subjective criterion. For example, one may assign a co-occurring STL as belonging only to the tribe from which it was most frequently reported. Following this approach, we recently built a k-NN classifier with good performance.14 However, it has been shown that the decision borders in this case are additionally blurred by STLs which simultaneously occur in several tribes.

Another possible method for handling multi-labeled data is to ignore the multi-labeled instances when training the classifier. This will help the underlying method in finding the decision borders between classes. Serious drawbacks of this approach are: a) available data are discarded; b) predictions for multi-labeled samples are likely to be incorrect, or at best incomplete; c) important and characteristic chemical groups are not considered.

A straightforward approach is to consider all objects with multiple labels as a new class, i.e. the “mixed” class. An important limitation is that the data belonging to multiple classes are usually too sparse to build useful models. Although around 10% of the STLs in our data set belong to more than one tribe (cf. Materials and Methods, Table 3.1), many combined classes are extremely small (containing fewer than five STLs), which effectively prevents building any useful classification model for such “combined” classes.

Boutell et al.15 have introduced the concept of “cross-training”. In this approach, the multi-labeled data are used more than once when training the classifier, i.e., each multi-labeled object is used as a positive example for each of the classes it belongs to. Figure 3.3 shows the decision borders obtained by cross-training in the two-class scenario. The multi-labeled instances (marked with a star and a circle) are considered to belong to the ‘star’ class when training a classifier for this class and to the ‘circle’ class when a classifier for the ‘circle’ class is built. They are not used as a negative example for either the ‘star’ or the ‘circle’ class. The central area between the decision borders (dashed curves) belongs to both classes simultaneously. When classifying a pattern in this area, both classifiers are expected to classify it as an instance of their own class. In such a case, the pattern will obtain multiple labels, ‘star’ and ‘circle’.


Figure 3.3 Decision borders obtained via cross-training. The area between the dashed lines belongs to both “star” and “circle” classes.

The advantage of cross-training is that it allows most binary classifiers which output real-valued scores to be turned into multi-labeled classifiers. This is done by applying some predefined criteria (cf. Materials and Methods, Section 3.2.3.1) to transform the real-valued scores into labels. Cross-training overcomes the problem of sparse multi-labeled data, since no ‘combined’ classes are created. An inherent limitation is that it is based on the one-against-all approach, which may lead to highly unbalanced classifiers.16 In a chemical context, cross-training using multiple logistic regression as a classifier was applied17 to the classification of compounds into different toxic modes of action.
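The way cross-training prepares the training data for the one-against-all classifiers can be sketched in a few lines of Python. This is only an illustration of the idea described above, not the code used in this work; the function name `cross_training_targets` and the toy 'star'/'circle' data are hypothetical.

```python
def cross_training_targets(label_sets, classes):
    """Build one binary target vector per base class (one-against-all).

    A multi-labeled object is a positive example (1) for every class it
    belongs to, so it is never used as a negative example for any of them.
    """
    return {c: [1 if c in labels else 0 for labels in label_sets]
            for c in classes}


# Hypothetical toy data: two single-labeled objects and one multi-labeled one.
label_sets = [{"star"}, {"circle"}, {"star", "circle"}]
targets = cross_training_targets(label_sets, ["star", "circle"])
# The third object is positive in both target vectors.
print(targets)
```

Each target vector would then be passed, together with the descriptors, to a separate binary classifier.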

3.1.2 Literature on multi-labeled classification

The literature on multi-labeled classification is generally related to text classification. The first approach addressing this problem was reported by Schapire and Singer12 and is called BoosTexter. It is an extended version of the popular ensemble learning method AdaBoost.18 Following this work, multi-labeled learning has attracted more attention. McCallum11 proposed a Bayesian approach to multi-labeled document classification. Elisseeff and Weston19 proposed a kernel method for multi-labeled classification based on a special cost function – “ranking loss” – and the corresponding margin for multi-labeled models. The popular C4.5 decision tree20 has been adapted13 to handle multi-labeled data through modifying the definition of entropy. Two probabilistic generative models for multi-labeled text called Parametric Mixture Models (PMM1 and PMM2) have been proposed.21

Alternating decision trees22 have been extended to the multi-labeled case as well.23 As already mentioned, Boutell et al.15 applied multi-labeled learning to scene classification. They decomposed the multi-labeled problem into multiple independent binary classification problems, one for each base class, and utilized support vector machines (SVM) as the classifier. Recently, Zhang and Zhou24 proposed an extended version of the k-nearest neighbors (k-NN) algorithm that is capable of handling multi-labeled data. Their approach is based on the label sets of the k nearest neighbors and utilizes the maximum a posteriori (MAP) principle to determine the label set of a new instance. Both the prior and the posterior probabilities are required and can be estimated from the training set. In addition, the method is capable of ranking each of the base class labels. The proposed multi-labeled k-NN, called ML-kNN, was successfully applied to a yeast gene functional data set, yielding a performance that is comparable to an SVM-based method.

3.1.3 Assessing the performance of a multi-labeled classification

An important question related to multi-labeled classification is how to assess the model performance. The measures usually used in a single-label classification include precision, recall, accuracy and F-measure.10 In a multi-labeled case, the evaluation of the model performance is more complicated because a result can be fully correct, partly correct, or fully wrong. As a simple example, let us consider an object which belongs to two classes, c1 and c2.

All the following outcomes are possible: 1) c1, c2 – correct; 2) c1 – partly correct; 3) c1, c3 – partly correct; 4) c1, c3, c4 – partly correct; 5) c3, c4 – wrong; and all these results differ in their degrees of correctness. Given a data set $D$ containing $m$ instances with $Q$ possible classes, for each pattern $x$ let $Y_x$ be the set of true labels and $P_x$ the set of predicted labels. The easiest way to assess the accuracy of a multi-labeled classification is to use the Hamming loss, defined as:

$$HL(D) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{Q}\left|P_{x_i}\,\Delta\,Y_{x_i}\right| \tag{3.1}$$

where $\Delta$ stands for the symmetric difference between the two sets. The smaller the value of $HL(D)$, the better the classifier performance. When $|Y_{x_i}| = 1$ for all instances, a multi-labeled system is in fact a multi-class single-labeled one and the Hamming loss is $2/Q$ times the usual classification error.
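As a minimal sketch, the Hamming loss can be computed directly from label sets with Python's built-in set type. The function name `hamming_loss` and the three-compound toy data are illustrative assumptions, not taken from this work.

```python
def hamming_loss(true_sets, pred_sets, n_classes):
    """Mean size of the symmetric difference between predicted and true
    label sets, normalised by the number of classes Q."""
    if len(true_sets) != len(pred_sets):
        raise ValueError("true_sets and pred_sets must have equal length")
    total = 0.0
    for y, p in zip(true_sets, pred_sets):
        total += len(set(y) ^ set(p)) / n_classes  # ^ is the symmetric difference
    return total / len(true_sets)


# Hypothetical toy data: three STLs, seven tribes (Q = 7).
truth = [{"ANT"}, {"ANT", "HLT"}, {"INU"}]
pred = [{"ANT"}, {"ANT"}, {"HLT"}]
loss = hamming_loss(truth, pred, 7)  # symmetric differences of size 0, 1 and 2
print(loss)
```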

Another score was proposed by Boutell et al.15 They introduced the so-called “α-evaluation”, which is a generalized version of Jaccard’s similarity metric.25 In addition to the above definitions, let $M_x = Y_x \setminus P_x$ (missed labels) and $F_x = P_x \setminus Y_x$ (false positive labels). As a result, the prediction of each instance is scored according to:

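$$\mathrm{score}(x) = \left(1 - \frac{\beta\,|M_x| + \gamma\,|F_x|}{|Y_x \cup P_x|}\right)^{\alpha} \qquad (\alpha \ge 0,\; 0 \le \beta, \gamma \le 1,\; \beta = 1 \text{ or } \gamma = 1) \tag{3.2}$$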

The constraints on β and γ are chosen to constrain the score to be non-negative. These parameters allow false positives and misses to be penalized differently, so that the measure can be customized to the task at hand. Setting β = γ = 1 yields a simpler formula:

$$\mathrm{score}(x) = \left(\frac{|Y_x \cap P_x|}{|Y_x \cup P_x|}\right)^{\alpha} \qquad (\alpha \ge 0) \tag{3.3}$$

The parameter α is called the “forgiveness rate” because it reflects how much to forgive errors made in predicting labels. Small values of α are more forgiving (aggressive), and larger values are more conservative (penalizing errors more severely). Using this score, the authors15 defined recall and precision of multi-labeled classes as well as accuracy on a given testing set. The multi-labeled accuracy on a data set D of size m is defined as:

$$\mathrm{accuracy}_{ML}(D) = \frac{1}{m}\sum_{i=1}^{m}\mathrm{score}(x_i) \tag{3.4}$$
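The α-evaluation score and the multi-labeled accuracy can be sketched as follows, using the five outcomes listed earlier for an object that belongs to c1 and c2. The function names are hypothetical, and treating an empty truth set with an empty prediction as fully correct is an assumption made here only to avoid a division by zero.

```python
def alpha_score(true_set, pred_set, alpha=1.0, beta=1.0, gamma=1.0):
    """Alpha-evaluation score of a single prediction (Equation 3.2)."""
    y, p = set(true_set), set(pred_set)
    union = y | p
    if not union:
        return 1.0  # assumption: nothing to predict and nothing predicted
    missed = len(y - p)      # |M_x|
    false_pos = len(p - y)   # |F_x|
    base = 1.0 - (beta * missed + gamma * false_pos) / len(union)
    return max(0.0, base) ** alpha


def accuracy_ml(true_sets, pred_sets, alpha=1.0):
    """Multi-labeled accuracy: mean score over the data set (Equation 3.4)."""
    scores = [alpha_score(y, p, alpha) for y, p in zip(true_sets, pred_sets)]
    return sum(scores) / len(scores)


# The five outcomes discussed in the text for an object in classes c1 and c2.
truth = {"c1", "c2"}
outcomes = [{"c1", "c2"}, {"c1"}, {"c1", "c3"}, {"c1", "c3", "c4"}, {"c3", "c4"}]
scores = [alpha_score(truth, p) for p in outcomes]
print(scores)  # with alpha = 1: 1, 1/2, 1/3, 1/4 and 0
```

With α = 1 the score falls from 1 (fully correct) to 0 (fully wrong) as the prediction degrades. A larger α, e.g. α = 2, lowers every partly correct score.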

In addition to the above definitions, let $H_x^c = 1$ if $c \in Y_x$ and $c \in P_x$ (“hit” label), 0 otherwise. Analogously, let $\tilde{Y}_x^c = 1$ if $c \in Y_x$, 0 otherwise, and let $\tilde{P}_x^c = 1$ if $c \in P_x$, 0 otherwise. Consequently, base-class recall and precision on a data set D are defined as follows:

$$\mathrm{recall}_c = \frac{\sum_{x \in D} H_x^c}{\sum_{x \in D} \tilde{Y}_x^c} \tag{3.5}$$

$$\mathrm{precision}_c = \frac{\sum_{x \in D} H_x^c}{\sum_{x \in D} \tilde{P}_x^c} \tag{3.6}$$

The accuracyML (Equation 3.4) and the base-class recall (Equation 3.5) and precision (Equation 3.6) allow a comparison between multi-labeled and single-labeled classification schemes.
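Base-class recall and precision reduce to three counts per class, as the following sketch shows. The function name and the toy data are hypothetical, and returning NaN when a class never occurs (or is never predicted) is a choice made here for illustration.

```python
def base_class_recall_precision(true_sets, pred_sets, label):
    """Recall and precision of one base class (Equations 3.5 and 3.6)."""
    hits = sum(1 for y, p in zip(true_sets, pred_sets)
               if label in y and label in p)
    n_true = sum(1 for y in true_sets if label in y)   # objects carrying the label
    n_pred = sum(1 for p in pred_sets if label in p)   # objects predicted with it
    recall = hits / n_true if n_true else float("nan")
    precision = hits / n_pred if n_pred else float("nan")
    return recall, precision


# Hypothetical toy data with tribe abbreviations as labels.
truth = [{"ANT"}, {"ANT", "HLT"}, {"INU"}, {"HLT"}]
pred = [{"ANT"}, {"ANT"}, {"HLT"}, {"HLT", "ANT"}]
ant = base_class_recall_precision(truth, pred, "ANT")
hlt = base_class_recall_precision(truth, pred, "HLT")
print(ant, hlt)
```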

All the measures discussed so far are based on the actual labels assigned to an instance. However, in most cases, the learning method will produce a ranking function, $f(x,\cdot)$, which for a given instance $x$ will order the labels in $\Psi$ (where $\Psi = \{1, 2, \ldots, Q\}$ is the complete set of labels; in this study this set contains the seven Asteraceae tribes). That is, a label (class) $l_1$ is considered to be ranked higher than $l_2$ if $f(x, l_1) > f(x, l_2)$. Based on this ranking function, different performance measures can be defined. The first one is called one-error:

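$$\text{one-err}_D(f) = \frac{1}{m}\sum_{i=1}^{m} H(x_i), \quad \text{where} \quad H(x_i) = \begin{cases} 0, & \text{if } \arg\max_{k \in \Psi} f(x_i, k) \in Y_{x_i} \\ 1, & \text{otherwise} \end{cases} \tag{3.7}$$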

The smaller the value of this measure, the better is the performance. For single-label classification problems, the one-error is identical to the ordinary classification error.
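A minimal sketch of the one-error, assuming that the ranking function is available as one dictionary of label scores per instance. The function name and the toy scores are hypothetical.

```python
def one_error(true_sets, score_dicts):
    """Fraction of instances whose top-ranked label is not a true label."""
    errors = 0
    for y, scores in zip(true_sets, score_dicts):
        top_label = max(scores, key=scores.get)  # arg max over all labels
        if top_label not in y:
            errors += 1
    return errors / len(true_sets)


# Hypothetical ranking scores f(x, l) for two instances and three labels.
truth = [{"ANT", "HLT"}, {"INU"}]
scores = [{"ANT": 0.9, "HLT": 0.4, "INU": 0.1},
          {"ANT": 0.2, "HLT": 0.7, "INU": 0.6}]
err = one_error(truth, scores)  # the second instance has HLT on top: one error
print(err)
```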

The second ranking-based measure is called coverage and is defined as:

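$$\mathrm{coverage}_D(f) = \frac{1}{m}\sum_{i=1}^{m}\left|C(x_i)\right| - 1, \quad \text{where} \quad C(x_i) = \left\{\, l \mid f(x_i, l) \ge f(x_i, l_i'),\; l \in \Psi \,\right\} \quad \text{and} \quad l_i' = \arg\min_{k \in Y_{x_i}} f(x_i, k) \tag{3.8}$$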

It measures how far we need, on average, to go down the list of labels in order to cover all possible labels assigned to an instance. The smaller its value, the better the performance.
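The coverage can be sketched in the same way, again assuming one dictionary of label scores per instance. The names and the toy scores are hypothetical.

```python
def coverage(true_sets, score_dicts):
    """Average depth in the ranked label list needed to cover all true
    labels of an instance, minus one (Equation 3.8)."""
    total = 0
    for y, scores in zip(true_sets, score_dicts):
        lowest_true = min(scores[label] for label in y)  # worst-ranked true label
        # |C(x_i)|: labels ranked at least as high as that label
        total += sum(1 for s in scores.values() if s >= lowest_true)
    return total / len(true_sets) - 1


truth = [{"ANT", "HLT"}, {"INU"}]
scores = [{"ANT": 0.9, "HLT": 0.4, "INU": 0.1},
          {"ANT": 0.2, "HLT": 0.7, "INU": 0.6}]
cov = coverage(truth, scores)  # two labels are needed for each instance
print(cov)
```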

The last ranking-based measure to be introduced is the average precision, which was originally used in information retrieval systems.26 Nevertheless, it can be used to measure the effectiveness of the label rankings:


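$$\mathrm{ave\_prec}_D(f) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{\left|Y_{x_i}\right|} R(x_i), \quad \text{where} \quad R(x_i) = \sum_{k \in Y_{x_i}} \frac{\left|\left\{\, l \mid f(x_i, l) \ge f(x_i, k),\; l \in Y_{x_i} \,\right\}\right|}{\left|\left\{\, l \mid f(x_i, l) \ge f(x_i, k),\; l \in \Psi \,\right\}\right|} \tag{3.9}$$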

The average precision evaluates the average fraction of labels ranked above a particular label $l \in Y_{x_i}$ which are actually in $Y_{x_i}$. When the value of the average precision is equal to one, the system achieves perfect performance. The larger its value, the better the performance.
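A sketch of the average precision under the same assumption of one score dictionary per instance. The function name and the toy scores are hypothetical.

```python
def average_precision(true_sets, score_dicts):
    """Average precision of the label rankings (Equation 3.9)."""
    total = 0.0
    for y, scores in zip(true_sets, score_dicts):
        r = 0.0
        for k in y:
            # Labels ranked at least as high as the true label k.
            at_or_above = [l for l, s in scores.items() if s >= scores[k]]
            r += sum(1 for l in at_or_above if l in y) / len(at_or_above)
        total += r / len(y)
    return total / len(true_sets)


truth = [{"ANT", "HLT"}, {"INU"}]
scores = [{"ANT": 0.9, "HLT": 0.4, "INU": 0.1},
          {"ANT": 0.2, "HLT": 0.7, "INU": 0.6}]
ap = average_precision(truth, scores)
print(ap)
```

In this toy example the first instance is ranked perfectly (value 1), while the only true label of the second instance is ranked second (value 1/2).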

We have briefly introduced the concept of multi-labeled classification and different methods to assess its performance. In the rest of this paper we first describe two particular multi-labeled classification techniques in more detail and afterwards show their application to the classification of natural products from plants. The aims of the present work are: 1) to introduce the concepts of multi-labeled classification; 2) to build and to evaluate a multi-labeled classification model which is capable of relating STLs from Asteraceae to the tribe(s) from which they come, taking into account skeletal types and substitution patterns; 3) to compare two different methods for multi-labeled classification: SVM-based cross-training and multi-labeled k-nearest neighbors (ML-kNN); and 4) to interpret the results from a chemotaxonomic point of view.

3.2 Materials and Methods

3.2.1 Data sets

The data set of 921 STLs from ref. 14 was used. Table 3.1 gives the distribution of the data set (N = 921) into the seven corresponding Asteraceae tribes, the abbreviations used in this paper for these tribes, and the number of structures present in the training, validation and test sets, respectively. All structures were assigned to their corresponding tribe(s) according to the current taxonomic classification.27 For each STL all reported sources were checked. When a structure was reported in more than one tribe, it was assigned to each of these tribes, i.e. it has multiple labels. The occurrence of such cases is given in the “multiple” columns of Table 3.1.

Table 3.1 Overview of the data set used in this study comprising 921 structures of STLs and the respective tribes from which the STLs were isolated. Single-labeled compounds occur in one tribe only; multi-labeled compounds occur in several tribes.

| tribe | abbreviation | training set, single | training set, multiple | validation set, single | validation set, multiple | test set, single | test set, multiple |
|---|---|---|---|---|---|---|---|
| Anthemideae | ANT | 109 | 26 | 20 | 10 | 25 | 5 |
| Cardueae | CAR | 41 | 26 | 10 | 4 | 11 | 2 |
| Eupatorieae | EUP | 123 | 36 | 27 | 8 | 29 | 7 |
| Heliantheae | HLT | 178 | 51 | 41 | 11 | 41 | 6 |
| Inuleae | INU | 29 | 21 | 5 | 4 | 5 | 7 |
| Lactuceae | LAC | 30 | 11 | 6 | 3 | 6 | 4 |
| Vernonieae | VER | 51 | 18 | 11 | 2 | 12 | 2 |
| Total (921) | | 561 | 80a | 120 | 18a | 129 | 13a |

a The number is smaller than the sum of the corresponding column because a multi-labeled STL is counted toward each individual tribe.

The complete data set of 921 STLs was split semi-randomly (the corresponding distribution of STLs in the tribes was preserved) into three subsets: a training set with around 70% of the data, and a validation set and a test set, each containing around 15% of the structures, cf. Table 3.1.

3.2.2 Structure representation

All STLs were represented by their RDF codes,28 which were calculated using the three-dimensional structures. Single, low-energy 3D conformations were generated for the STLs from their 2D constitution using CORINA.29,30 The RDF codes were calculated according to the following equation:

$$g(r) = \sum_{i=1}^{N-1}\sum_{j>i}^{N} A_i A_j\, e^{-B\,(r - r_{ij})^2} \tag{3.10}$$

where N is the number of atoms in a molecule, $A_i$ and $A_j$ are properties associated with the atoms i and j, respectively, $r_{ij}$ represents the distance between atoms i and j, and B is a smoothing factor. The above formula was applied with the property A set to the atomic number of the considered atom, and 64-dimensional RDF codes were calculated using the descriptor calculation package ADRIANA.Code.31 The function g(r) was defined in the interval 2.0–9.0 Å.
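Equation 3.10 can be evaluated directly from atomic coordinates, as in the following sketch. It is not the ADRIANA.Code implementation. The value of the smoothing factor B is not given in the text, so `smoothing=100.0` is a placeholder, and sampling g(r) at 64 evenly spaced points including both interval ends is likewise an assumption.

```python
import math


def rdf_code(coords, props, n_points=64, r_min=2.0, r_max=9.0, smoothing=100.0):
    """Sample g(r) of Equation 3.10 at n_points evenly spaced distances.

    coords: list of (x, y, z) atom positions in Angstrom
    props:  list of atomic properties A_i (here: atomic numbers)
    smoothing: the factor B (placeholder value, not taken from the text)
    """
    step = (r_max - r_min) / (n_points - 1)
    n_atoms = len(coords)
    code = []
    for point in range(n_points):
        r = r_min + point * step
        g = 0.0
        for i in range(n_atoms - 1):
            for j in range(i + 1, n_atoms):
                r_ij = math.dist(coords[i], coords[j])
                g += props[i] * props[j] * math.exp(-smoothing * (r - r_ij) ** 2)
        code.append(g)
    return code


# Hypothetical two-atom example: a carbon and an oxygen atom 3.0 Angstrom apart.
code = rdf_code([(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)], [6, 8])
peak = code.index(max(code))  # index of the sampling point closest to 3.0 Angstrom
print(len(code), peak)
```

A larger B gives sharper peaks around each interatomic distance, while a smaller B smears neighbouring distances into one another.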

It should be noted that the STLs used in this study possess a certain degree of conformational flexibility. However, unlike the classification of compounds according to their biological activity, where a change in the conformation may render a compound inactive, we were more concerned with the skeletal types and substitution features, which may help us to identify a plausible plant source for a given STL. Nevertheless, the STLs are ultimately formed in specific enzyme pockets, and utilizing a 3D descriptor is relevant, as backed up by our previous experience.14

3.2.3 Classification methods

3.2.3.1 Cross-training with Support vector machines (ct-SVM)

A detailed description of the SVM learning technique is outside the scope of this work. Several comprehensive texts on this subject exist16,32 as well as practical guides.33 In this work, SVM classifiers with a radial basis function (RBF) kernel were used. All calculations were performed in R34 using the package e1071.35 The “one-against-all” strategy, as described in the introduction, was applied, resulting in seven binary classifiers, one for each tribe of Asteraceae. Each binary classifier was tuned separately, as suggested,33 via ten-fold cross-validation and with the validation set, cf. Table 3.1. Three testing criteria were used to transform the SVM scores into labels:15

1) P-Criterion, in which the test data are labeled by all of the classes corresponding to positive SVM scores; if no score is positive, the pattern is labeled as “unknown”;

2) T-Criterion, which is similar to the P-Criterion, but uses the Closed World Assumption (CWA) in which all examples belong to at least one of the Q classes; if all SVM scores are negative, the pattern is assigned the label of the SVM producing the top (least negative) score;

3) C-Criterion, in which the decision depends on the closeness between the top SVM scores, regardless of their sign; how close two scores have to be can be determined via cross-validation, on a hold-out set, or by using the maximum a posteriori (MAP) principle.
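The three criteria can be sketched in Python as follows. The function names are hypothetical, and the C-criterion shown is only one possible reading of the description above: it is assumed here that it returns the T-criterion labels plus every class whose score lies within a threshold of the top score. The exact rule of ref 15 may differ.

```python
def p_criterion(scores):
    """Label with all classes that have a positive score.

    An empty result corresponds to the label "unknown".
    """
    return {cls for cls, s in scores.items() if s > 0}


def t_criterion(scores):
    """Like the P-criterion, but fall back to the top-scoring class."""
    labels = p_criterion(scores)
    if not labels:
        labels = {max(scores, key=scores.get)}
    return labels


def c_criterion(scores, threshold):
    """Assumed reading: T-criterion labels plus all classes whose score
    is within `threshold` of the top score, regardless of sign."""
    top = max(scores.values())
    close = {cls for cls, s in scores.items() if top - s <= threshold}
    return t_criterion(scores) | close
```

With the scores `{"ANT": -0.4, "EUP": -0.1, "HLT": -0.25}` and a threshold of 0.2, the P-criterion yields no label, the T-criterion yields EUP, and this version of the C-criterion yields EUP and HLT.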

3.2.3.2 Multi-labeled k-NN (ML-kNN)

This method belongs to the family of “lazy” learners. It is memory-based and requires no model to be fitted.36 All training examples are stored in memory, and the prediction for a new pattern is made by finding its k nearest neighbors in terms of some predefined distance measure and averaging the values of their known class. In this study, the measure was the Euclidean distance, which is the most commonly used one.

However, the standard k-NN assigns an instance to exactly one class. As mentioned in the introduction, Zhang and Zhou24 described a modified version of the algorithm, ML-kNN, which can handle multi-labeled data. In this work, ML-kNN was implemented as an R34 script following the algorithm described in the original article. Compared to standard k-NN classification, the main difference is that, instead of selecting the class for a new instance by majority vote, ML-kNN uses the maximum a posteriori (MAP) principle, combining prior and posterior probabilities calculated from the training set, to assign labels to an instance as well as to rank the labels. It is worth noting that, although this is not discussed by the authors,24 there are cases, as also occurred in the SVM approach, where ML-kNN failed to assign an instance to any of the known classes. Therefore, the same three testing criteria described in Section 3.2.3.1 (P-, T-, and C-criterion) were applied to the label rankings output by ML-kNN (see Results). The threshold used with the C-criterion and the value of k were chosen to give the best performance on the validation set.
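The MAP-based label assignment can be sketched in Python as shown below. This is not the R script used in this work. It follows our reading of the algorithm of ref 24, and the function names, the dictionary-based model, and the smoothing constant `s = 1` are assumptions of the sketch.

```python
import math


def _neighbors(x, X, k, skip=None):
    """Indices of the k training points closest to x (Euclidean distance)."""
    order = sorted((i for i in range(len(X)) if i != skip),
                   key=lambda i: math.dist(x, X[i]))
    return order[:k]


def mlknn_fit(X, Y, labels, k=5, s=1.0):
    """X: list of descriptor vectors; Y: list of label sets; s: smoothing."""
    m = len(X)
    model = {"X": X, "Y": Y, "labels": labels, "k": k, "prior": {}, "like": {}}
    # For every training instance, count how many of its k neighbors
    # (excluding itself) carry each label.
    counts = []
    for i in range(m):
        nn = _neighbors(X[i], X, k, skip=i)
        counts.append([sum(lab in Y[j] for j in nn) for lab in labels])
    for li, lab in enumerate(labels):
        n_pos = sum(lab in y for y in Y)
        model["prior"][lab] = (s + n_pos) / (2 * s + m)
        with_lab = [0] * (k + 1)     # instances carrying the label, by count
        without_lab = [0] * (k + 1)  # instances not carrying the label
        for i in range(m):
            target = with_lab if lab in Y[i] else without_lab
            target[counts[i][li]] += 1
        model["like"][lab] = (
            [(s + c) / (s * (k + 1) + sum(with_lab)) for c in with_lab],
            [(s + c) / (s * (k + 1) + sum(without_lab)) for c in without_lab],
        )
    return model


def mlknn_predict(model, x):
    """Return (assigned label set, posterior probability per label)."""
    nn = _neighbors(x, model["X"], model["k"])
    assigned, posterior = set(), {}
    for lab in model["labels"]:
        j = sum(lab in model["Y"][i] for i in nn)
        p1 = model["prior"][lab] * model["like"][lab][0][j]
        p0 = (1 - model["prior"][lab]) * model["like"][lab][1][j]
        posterior[lab] = p1 / (p1 + p0)
        if p1 > p0:
            assigned.add(lab)
    return assigned, posterior
```

The assigned set is empty whenever no label wins the MAP comparison, which is the situation in which the P-, T-, and C-criterion are applied to the posterior values instead.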

3.2.4 Model validation and performance measures

The proposed models using cross-training SVM and ML-kNN were validated using a test set consisting of 142 STLs, cf. Table 3.1. This set was not used in the model building phase, i.e., in determining the best model parameters. Only after the final models had been built was the test set submitted to them and the performance assessed. An overview of the multi-labeled classification measures has already been presented in the introduction. In this work, for the two models (ct-SVM and ML-kNN), six different performance measures arranged in two distinct groups were used. The first group comprises Hamming loss (Eq. 3.1), accuracyML(α=1) (Eq. 3.4), and the base-class metrics, recall and precision (Eq. 3.5 and Eq. 3.6). These measures require that a set of labels has been assigned to each instance and are therefore connected to the three testing criteria (P-, T-, and C-criterion, Section 3.2.3.1). The other three measures are one-error, coverage, and average precision (Eq. 3.7, 3.8, and 3.9, respectively). They are calculated based solely on the real-valued scores output by the classifier and as such are independent of the testing criteria. The overall workflow is outlined in Figure 3.4.
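The label-set measures can be illustrated with a short Python sketch. Eqs. 3.1 and 3.4 to 3.6 are given in the introduction and are not repeated here, so the code assumes the commonly used definitions: Hamming loss as the average size of the symmetric difference between true and predicted label sets divided by the number of classes, and accuracyML as the average of (|Y ∩ Z| / |Y ∪ Z|)^α. The function names are hypothetical.

```python
def hamming_loss(true_sets, pred_sets, n_classes):
    """Average fraction of class labels that are wrong per instance."""
    wrong = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets))
    return wrong / (len(true_sets) * n_classes)


def accuracy_ml(true_sets, pred_sets, alpha=1.0):
    """Average of (|Y & Z| / |Y | Z|) ** alpha over all instances."""
    scores = []
    for t, p in zip(true_sets, pred_sets):
        union = t | p
        # An instance with no true and no predicted label counts as correct.
        scores.append((len(t & p) / len(union)) ** alpha if union else 1.0)
    return sum(scores) / len(scores)


def base_class_recall_precision(true_sets, pred_sets, cls):
    """Recall and precision of a single base class (tribe)."""
    pairs = list(zip(true_sets, pred_sets))
    hits = sum(cls in t and cls in p for t, p in pairs)
    n_true = sum(cls in t for t, _ in pairs)
    n_pred = sum(cls in p for _, p in pairs)
    recall = hits / n_true if n_true else float("nan")
    precision = hits / n_pred if n_pred else float("nan")
    return recall, precision
```

For three instances with true labels {ANT}, {HLT, INU}, {EUP} and predictions {ANT}, {HLT}, {HLT} over seven classes, the sketch gives a Hamming loss of 3/21, an accuracy of 0.5, and a recall of 1.0 with a precision of 0.5 for HLT.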

Figure 3.4 Workflow for obtaining a multi-labeled classifier and assessing its performance.

Using the training set, an initial classifier (ct-SVM or ML-kNN) is built (1). In step (2), the classifier parameters as well as the threshold for the C-criterion are optimized to give the best performance on the validation set. Afterwards, the training and the validation sets are merged and a new classifier is built (3) using the parameters obtained in step (2). The final classifier is applied to the test set in step (4). For each instance (structure of an STL), the base classes (tribes of Asteraceae) are ranked. Based on this ranking, the classifier performance can be assessed with the corresponding measures according to step (5). By applying the different criteria (6) described in Section 3.2.3.1, each STL is assigned to the corresponding tribe(s). Based on this assignment, the classifier performance can be assessed by the additional measures, as shown in step (7).

3.3 Results

3.3.1 Cross-training SVM (ct-SVM) models

Table 3.2 gives an overview of the performance based on the actually assigned classes (tribes). The base-class metrics – recall and precision – calculated according to Eq. 3.5 and Eq. 3.6 are also given. Under P-criterion 30 STLs (ca. 21%) were not assigned to any of the seven classes.

Table 3.2 Overall performance measures and base-class recall and precision for the ct-SVM model applied to the test set. The number of STLs from the test set belonging to each tribe (including multi-labeled STLs) is given in parentheses. The measures are based on the classes (tribes) to which an STL was assigned under the corresponding criterion (P-, T-, and C-criterion).

                   P-criterion^a         T-criterion           C-criterion
Hamming loss       0.054                 0.080                 0.081
AccuracyML(α=1)    0.838                 0.746                 0.753

Tribe              Recall   Precision    Recall   Precision    Recall   Precision
ANT (30)           0.880    0.786        0.867    0.684        0.867    0.667
CAR (13)           0.727    0.800        0.692    0.750        0.769    0.667
EUP (36)           0.808    0.875        0.667    0.800        0.667    0.800
HLT (47)           0.974    0.860        0.830    0.830        0.872    0.837
INU (12)           0.333    0.750        0.333    0.667        0.333    0.667
LAC (10)           0.571    1.000        0.500    0.714        0.500    0.625
VER (14)           0.750    1.000        0.714    0.833        0.714    0.833

^a 30 STLs (ca. 21%) were not assigned to any of the seven Asteraceae tribes.

3.3.2 ML-kNN models

Table 3.3 gives an overview of the ML-kNN performance based on the actually assigned classes (tribes). In addition, the base-class metrics, recall and precision, calculated according to Eq. 3.5 and Eq. 3.6 are given. Under the P-criterion, 33 STLs (ca. 23%) were not assigned to any of the seven classes.

Table 3.3 Overall performance measures and base-class recall and precision for the ML-kNN model with k=5 applied to the test set. The number of STLs from the test set belonging to each tribe (including multi-labeled STLs) is given in parentheses. The measures are based on the classes (tribes) to which an STL was assigned under the corresponding criterion (P-, T-, and C-criterion).

                   P-criterion^a         T-criterion           C-criterion
Hamming loss       0.081                 0.096                 0.098
AccuracyML(α=1)    0.754                 0.695                 0.698

Class (tribe)      Recall   Precision    Recall   Precision    Recall   Precision
ANT (30)           0.769    0.800        0.767    0.742        0.767    0.742
CAR (13)           0.545    0.667        0.615    0.667        0.615    0.667
EUP (36)           0.538    0.824        0.556    0.714        0.639    0.657
HLT (47)           0.927    0.776        0.894    0.737        0.915    0.729
INU (12)           0.250    0.667        0.167    0.667        0.167    0.667
LAC (10)           0.625    0.833        0.500    0.833        0.500    0.833
VER (14)           0.750    0.857        0.571    0.667        0.571    0.615

^a 33 STLs (ca. 23%) were not assigned to any of the seven Asteraceae tribes.

3.3.3 Measures based on the label rankings

The classification performance of the two multi-labeled classifiers on the test set, based on the label rankings (one-error, coverage, and average precision), is shown in Table 3.4.

Table 3.4 Performance measures for the ct-SVM and ML-kNN (k=5) models applied to the test set. The measures are based on the label rankings produced by each method.

Measure              ct-SVM    ML-kNN
one-error            0.204     0.268
coverage             1.563     1.711
average precision    0.876     0.832

Short descriptions:
one-error: gives the ratio of the number of STLs which were not found in the top-ranked tribe to the total number of STLs. Bound between zero and one. The smaller the value, the better the performance.
coverage: shows to how many tribes an STL has to be assigned on average to make sure that all true tribes are included in the prediction. Bound between one and seven (the number of classes). The smaller the value, the better the performance.
average precision: shows how often the true tribes are top-ranked. Bound between zero and one. The larger the value, the better the performance.
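The ranking-based measures can be sketched in Python as follows. Eqs. 3.7 to 3.9 are defined in the introduction, so the code relies on the usual definitions and on the bounds stated in Table 3.4. In particular, coverage is assumed to be the average 1-based rank of the lowest-ranked true tribe, which gives the stated range of one to seven. The function name is hypothetical.

```python
def ranking_measures(true_sets, score_dicts):
    """Return (one-error, coverage, average precision).

    true_sets:   list of non-empty sets of true labels.
    score_dicts: list of {label: real-valued score}, one per instance.
    """
    n = len(true_sets)
    one_error = coverage = avg_precision = 0.0
    for truth, scores in zip(true_sets, score_dicts):
        ranked = sorted(scores, key=scores.get, reverse=True)
        rank = {cls: pos + 1 for pos, cls in enumerate(ranked)}
        # One-error: the top-ranked label is not a true label.
        one_error += ranked[0] not in truth
        # Coverage: rank needed to include every true label (1-based).
        coverage += max(rank[cls] for cls in truth)
        # Average precision: for each true label, the fraction of labels
        # ranked at or above it that are also true.
        avg_precision += sum(
            sum(rank[other] <= rank[cls] for other in truth) / rank[cls]
            for cls in truth
        ) / len(truth)
    return one_error / n, coverage / n, avg_precision / n
```

A perfect ranking of a single-labeled instance contributes a one-error of 0, a coverage of 1, and an average precision of 1.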

3.3.4 Comparison with single-labeled classifier

As a baseline, Table 3.5 compares the accuracy of the single-labeled k-nearest neighbor classifier that we have previously reported14 with the accuracy of the two multi-labeled methods under the C-criterion.

Table 3.5 Accuracy of a single-labeled k-nearest neighbor classifier with k=1 (cf. ref 14) and of the two multi-labeled methods under C-criterion.

            1-NN^a    ct-SVM    ML-kNN
Accuracy    0.722     0.753     0.698

^a Single-labeled kNN with k=1, cf. ref 14.

3.4 Discussion

The main aim of this work was to build and evaluate a multi-labeled classification model (Figure 3.4) that is capable of assigning STLs from Asteraceae species to their corresponding tribes based on skeletal types and substitution patterns. Such a model can further be used for the targeted collection of plants with the goal of isolating specific STLs. To this end, two multi-labeled classification techniques were applied: cross-training with support vector machines as binary classifiers (ct-SVM) and a modified version of the k-nearest neighbor method (ML-kNN). Both methods are able to assign a sample to more than one class and proved to be appropriate for building models with good performance.

3.4.1 Comparison of the classification methods

3.4.1.1 Based on the actual tribes to which an STL has been assigned

The following discussion is based on the results under C-criterion, i.e., the last two columns in Table 3.2 and Table 3.3. However, the same trend is observed with the other two criteria.

Starting with the overall measures, Hamming loss and the multi-labeled accuracy, one can see that both methods performed well on the test set. Recall that for a classifier which performs perfectly the Hamming loss (Eq. 3.1) equals zero. Therefore, the low values obtained (0.081 for ct-SVM and 0.098 for ML-kNN) show good performance. The multi-labeled accuracy (Eq. 3.4), on the other hand, equals one when the underlying classifier performs perfectly and zero when all predictions are wrong. Once again, the values obtained (0.753 for ct-SVM and 0.698 for ML-kNN) show good performance. According to both metrics, ct-SVM performs better than ML-kNN.

Figure 3.5 compares the performance of ct-SVM and ML-kNN with regard to the base classes under the C-criterion (cf. Table 3.2 and Table 3.3). Considering recall (Figure 3.5a), the ct-SVM model performed at least as well as ML-kNN for all base classes with the exception of HLT. The recall of a base class, as calculated according to Eq. 3.5, measures the fraction of STLs correctly predicted as belonging to the corresponding tribe (base class). That is, a recall of one shows that all STLs isolated from a given tribe (including the multi-labeled cases) were predicted as belonging to at least that tribe. If we consider, for example, the tribe Anthemideae (ANT), there were 30 STLs in the test set.

[Figure 3.5, panels a (recall) and b (precision): values between 0 and 1 per tribe (ANT, CAR, EUP, HLT, INU, LAC, VER) for ct-SVM and ML-kNN.]

Figure 3.5 Base-class recall and precision under C-criterion (cf. Table 3.2 and Table 3.3).

The ct-SVM under the C-criterion achieved a recall of 0.867 (cf. Table 3.2). This means that ~87% of the 30 STLs in the test set (26 STLs) obtained at least an ANT label, i.e., were classified (at least partially) correctly. From the base-class recall values presented in Table 3.2 and Table 3.3 and depicted in Figure 3.5a it is clear that both multi-labeled classification methods performed well for almost all base classes (tribes), with the exception of the tribe Inuleae (INU), for which relatively low recall values were obtained. Both methods produced the same recall values for LAC, while ML-kNN gave a slightly better recall for HLT. The ct-SVM method achieved higher recall for ANT, CAR, EUP, INU, and VER.

The base-class precision calculated according to Eq. 3.6, on the other hand, shows the fraction of STLs predicted as belonging to a given tribe which really belong to this tribe. That is, a perfectly performing classification method is expected to achieve a precision of one. Such a precision indicates that all STLs predicted as produced by a given tribe are actually found in at least this tribe. In other words, no false positive predictions are made. Let us consider again, as an example, the tribe Anthemideae (ANT). From all 142 STLs in the test set (cf. Table 3.1), 39 were predicted by ct-SVM as synthesized at least by the plant species in ANT. Of these 39 STLs, 26 were indeed isolated from plant species belonging to the tribe Anthemideae. The ratio (26 / 39) gives the resulting precision of 0.667 (cf. Table 3.2). A look at Figure 3.5b reveals that both ct-SVM and ML-kNN produced similar base-class precisions, greater than 0.6 in all cases. Therefore, both methods produce only a moderate number of false positives and thus have a good performance.
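The two ANT values quoted for ct-SVM under the C-criterion follow directly from the counts given in the text (30 STLs reported from ANT, 39 predicted as ANT, 26 in both groups):

```latex
\[
\text{recall}_{\mathrm{ANT}} = \frac{26}{30} \approx 0.867,
\qquad
\text{precision}_{\mathrm{ANT}} = \frac{26}{39} \approx 0.667
\]
```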

3.4.1.2 Based on the label rankings (Table 3.4)

A look at Table 3.4 confirms the good performance of both methods. Low one-error and coverage values were obtained while the average precision was high. All metrics in this category favor the ct-SVM method as can be seen from Table 3.4. Although the difference is not very large, combined with the better base-class recall it makes the ct-SVM the method of choice.

3.4.1.3 Comparison with single-labeled classification (Table 3.5)

It is worth noting that both multi-labeled methods exhibit an accuracy similar to that of the single-labeled classification, as can be seen from Table 3.5. In the case of ct-SVM, the multi-labeled method even outperformed the single-labeled kNN. Two possible reasons can explain this result. First, the SVM method was not explored in our previous study14 and it may be more suited to this data set. The second possible reason is that the multi-labeled ct-SVM classifier based on cross-training uses the multi-labeled patterns more than once (cf. Introduction). Consequently, each of the individual binary classifiers is based on more data compared to the single-labeled case, which may lead to better performance.

3.4.2 Comparison between the P-, T-, and C-criterion

Table 3.2 and Table 3.3 show that the use of the P-criterion produced the best performance metrics, compared to the T- and C-criterion. However, a noticeable aspect is that under the P-criterion both classification methods were not able to assign around 22% of the STLs to any of the seven tribes (Table 3.2 and Table 3.3).

Naturally, one would expect a large intersection between the two sets of STLs left unassigned by ct-SVM and by ML-kNN. However, only 14 STLs were left unassigned by both classification methods (Chart 3.2), while the rest (16 and 19 for ct-SVM and ML-kNN, respectively) were unique to the corresponding classification method. This indicates that, as expected given the distinct nature of the two classification methods, they have found somewhat different decision boundaries between the classes. Since the multi-labeled instances usually lie very close to the decision boundaries, cf. Figure 2.1, it was surprising that only one of the fourteen rejected STLs was actually reported from more than one tribe (structure 14, Chart 3.2). Nevertheless, all of the fourteen structures are somewhat hard to classify into a single tribe. If skeletal types (or their subtypes) are analyzed alone, only two of the structures shown in Chart 3.2 (8 and 13) are considered unique to the tribe they belong to. Based on the literature,5 all of the remaining STLs from Chart 3.2 may appear in at least one more tribe. The exclusiveness of these structures to only one tribe, when it occurs, is solely attributed to certain peculiarities regarding specific substituents, or combinations of substituents, in their molecules, and we believe that such features were somewhat difficult to capture using the P-criterion. This is the case, for example, for STLs 10 – 12 and 18, which are typical for the tribes they come from,5 but not exclusive at all.

Chart 3.2 Structures of STLs from the test set which both classification methods did not assign to any tribe under P-criterion and the corresponding tribe(s) from which they have been isolated.

[Structure drawings not reproduced. Compound numbers with the tribe(s) of origin in parentheses:]

5 (ANT), 6 (ANT), 7 (CAR), 8 (EUP), 9 (EUP), 10 (EUP), 11 (EUP), 12 (HLT), 13 (HLT), 14 (HLT;INU), 15 (INU), 16 (LAC), 17 (VER), 18 (VER)

Another possible explanation for the inability to assign certain STLs to any particular tribe is that these unassigned STLs are outliers due to specific skeletal or substitution patterns. However, such an assumption is not backed up by the structures in Chart 3.2. In addition, we have already attempted to account for outliers in this data set. By using Principal Component Analysis in concert with Hotelling's T² test, it was shown that no improvement in the classification accuracy is obtained even when a significant amount of the test data was discarded.14 Consequently, this result is more likely caused by the fact that these STLs lie very close to the decision boundaries of the corresponding methods.

Looking at the two remaining criteria tested (T- and C-criterion), the tendency of both classification methods, SVM and ML-kNN, was to obtain slightly better results under the C-criterion. The difference looks marginal at first glance, cf. Table 3.2 and Table 3.3. Nevertheless, from the data shown in Table 3.2 and Table 3.3 it is difficult to evaluate how each classifier actually treated the multi-labeled cases. Contrary to the expected behavior, the examination of the 13 multi-labeled STLs in the test set reveals that the C-criterion does not improve their predictions but improves the overall performance instead. The reason is that some of the STLs misclassified under the T-criterion were assigned to more than one tribe under the C-criterion, thus making the prediction partially correct. By its nature, the C-criterion produces a higher number of multi-labeled instances. Hence, while under the T-criterion the ct-SVM assigned multiple classes to nine STLs, under the C-criterion this number was 16. The corresponding numbers for ML-kNN were seven and 17, respectively.

The aim of our models is to assist in a targeted collection of plants. With this aim in mind, although the P-criterion produced the best performance metrics (cf. Table 3.2 and Table 3.3), its inability to assign around 22% of the STLs to any of the considered tribes limits its use. Both the T- and the C-criterion are preferable. When choosing between them, one needs to consider two costs: 1) completely missing the correct tribe and 2) predicting more than one possible tribe for an STL which has been reported from only one tribe. The C-criterion is preferable to the T-criterion when the second cost is lower. With the aim of assisting in a targeted collection of plants, the second cost is lower, especially when the “wrong” tribes are closely related to the correct one. This is the case because only a small part of the Asteraceae plants have been studied so far and it is likely that some of the STLs from our data set will be found in additional tribes in the future. Thus, based on the discussion so far, we have identified ct-SVM in concert with the C-criterion as the best method for helping in the targeted collection of plants. In the following section we present an analysis of the performance of the proposed method (ct-SVM with the C-criterion) with regard to the chemical structures of the STLs reported from more than one tribe.

3.4.3 Analyses of the ct-SVM results under C-criterion

Although the current taxonomic classification27 of Asteraceae used to label the STLs in this work is mostly based on morphologic aspects, we obtained good classification models. Hence, taking into account only the C-criterion it is clear that the distribution of the STLs among the tribes differs sufficiently to allow a targeted plant collection. This observation is backed up by the good base-class performance, achieved by ct-SVM. However, another aspect to be considered is that even though all plant sources were checked for each single STL, there is a probability that in the future some of the currently single-labeled STLs may appear in more than one tribe. Since the main advantage of the proposed methodology is its ability to assign an STL to multiple tribes, we examined the STLs which were predicted as belonging to more than one tribe under the C-criterion using the cross-training SVM method. Chart 3.3 gives the 16 STLs which were predicted as occurring in more than one tribe.

First we examine the case when a structure was reported from a single tribe (structures 10, 12, 20 – 25, 27, 31, and 32 in Chart 3.3) and predicted as occurring in two or more tribes. In most cases, the additional “new” tribe (or tribes) is meaningful from a chemotaxonomic point of view. The “new” tribes usually possess closely related STLs. Therefore, the obtained results indicate that the chemical boundaries among such tribes are blurred. Consider, for example, STLs 22 and 23 from Chart 3.3, which actually belong to EUP. The predictions indicated that they belong to EUP as well as HLT. In this example HLT, the “new” tribe, should not be considered a completely wrong prediction for 22 and 23 since, according to the structures in the data set, closely related STLs also occur in HLT. Data from the literature also support this statement since EUP and HLT have a strong chemical connection with each other.5,6,7 It should be emphasized that structures 10 and 12 appear in both Chart 3.2 and Chart 3.3 and their exclusiveness to only one tribe has already been discussed above, thus corroborating the current results under the C-criterion. The above explanation is valid for the remaining STLs in Chart 3.3 as well. The predicted tribes were completely wrong only for structures 27 and 32.

The second interesting case is when a structure is reported from two or more tribes (STLs 19, 26, 28 – 30), i.e., it is multi-labeled. In these examples, all tribe assignments were quite good. Totally correct assignments were achieved for structures 26 and 30. Although the STLs 28 and 29 have been assigned to only two tribes (out of five and four tribes, respectively), all of the assigned tribes were correct. The only completely wrongly assigned STL was 19.

Chart 3.3 STLs from the test set classified as belonging to more than one Asteraceae tribe by the ct-SVM under the C-criterion. The tribe(s) from which the STLs have been reported is (are) given together with the tribes assigned by the classification.

[Structure drawings not reproduced.]

STL    Reported from            Assigned by ct-SVM (C-criterion)
10     EUP                      EUP;ANT
12     HLT                      HLT;VER
19     ANT;EUP                  CAR;INU
20     CAR                      CAR;HLT
21     CAR                      CAR;HLT
22     EUP                      EUP;HLT
23     EUP                      EUP;HLT
24     HLT                      HLT;EUP;INU
25     HLT                      HLT;EUP
26     HLT;INU                  HLT;INU
27     INU                      CAR;HLT
28     CAR;EUP;HLT;INU;LAC      HLT;INU
29     ANT;EUP;HLT;INU          HLT;INU
30     CAR;LAC                  CAR;LAC
31     LAC                      CAR;LAC
32     VER                      CAR;LAC

As shown in Chart 3.3, five out of the 13 multi-labeled STLs in the test set obtained more than one label from the classification (ct-SVM, C-criterion). Chart 3.4 shows the remaining eight originally multi-labeled STLs.

Chart 3.4 STLs from the test set reported in more than one Asteraceae tribe and their subsequent predictions by ct-SVM using C-criterion. The tribes from which the STLs have been reported are given together with the predicted tribe. Only the eight multi-labeled STLs not shown in Chart 3.3 are shown.

[Structure drawings not reproduced.]

STL    Reported from    Predicted by ct-SVM (C-criterion)
33     ANT;EUP          ANT
34     EUP;VER          EUP
35     HLT;INU          HLT
36     EUP;HLT;INU      HLT
37     HLT;INU          ANT
38     EUP;INU          ANT
39     ANT;LAC          ANT
40     ANT;LAC;VER      ANT

With the exception of STLs 37 and 38, Chart 3.4 shows that all the multi-labeled STLs were predicted partially correctly. It is interesting to point out that if one excludes STL 35 from this analysis, whose skeleton is more restricted to HLT or INU, the skeletons of all the remaining STLs can actually be found in all of the seven tribes since they are of a widely occurring nature. For this reason, it was unexpected that only one tribe was assigned to each of these structures. The obtained results demonstrate that the classification in this case was based more on the substitution features than on the skeletal types of the STLs. In the light of a focused collection of plants, such results, although missing possible plant sources, are useful when the substitution features of the desired STL are of high importance.

An examination of the actual scores produced by the SVM classifier showed that in most cases the “true” labels obtained the highest scores, although the difference among them was larger than the threshold used.

3.5 Conclusions

We have presented a general overview of multi-labeled classification, a machine learning technique which allows an object to be simultaneously classified into several classes. This technique has applications in various domains: text categorization, selectivity of biologically active compounds, gene function analysis, etc. Two multi-labeled classification methods, cross-training with support vector machines as classifiers and multi-labeled k-nearest neighbor, have been successfully applied to the assignment of a special type of secondary metabolites, sesquiterpene lactones, to the Asteraceae tribe(s) from which they were isolated. The utility of the proposed classification model for a targeted collection of plant material with the aim of finding a particular natural compound was shown. The SVM model yielded better results, outperforming both the multi-labeled k-nearest neighbor and the previously built single-labeled k-NN classifier. Considering the STLs which have been isolated from more than one tribe, i.e., the multi-labeled STLs, both methods (cross-training with support vector machines and multi-labeled k-nearest neighbor) performed reasonably, although not as well as in separating the individual tribes. Taking into account the three different testing criteria used to convert the real-valued classifier output to labels, the criterion of choice was the C-criterion. The C-criterion showed the best probability of correctly labeling the STLs from the test set. The simultaneous assignment of STLs to multiple Asteraceae tribes was exemplified and discussed. The handling of STLs which appear in more than one tribe was shown and discussed as well.
Both analyses demonstrated the value of the proposed methodology 1) to study the relationships between the secondary metabolism of the plant family Asteraceae and its current taxonomic classification; and 2) to assist in the targeted collection of plant material with the aim of isolating particular sesquiterpene lactones.

3.6 Acknowledgments

FBC is grateful to the Alexander von Humboldt-Foundation (Germany) for a Research Fellowship at the Computer-Chemie-Centrum.

3.7 References

(1) Mann, J. Chemical Aspects of Biosynthesis. Oxford University Press: Oxford, United Kingdom, 1994.

(2) Cordell, G. Natural Products in Drug Discovery - Creating a New Vision. Phytochem. Rev. 2002, 1, 261-273.

(3) Abel, U.; Koch, C.; Speitling, M.; Hansske, F. G. Modern Methods to Produce Natural-Product Libraries. Curr. Opin. Chem. Biol. 2002, 6, 453-458.

(4) Hostettmann, K.; Wolfender, J. L. The Search for Biologically Active Secondary Metabolites. Pestic. Sci. 1997, 51, 471-482.

(5) Seaman, F. C. Sesquiterpene Lactones As Taxonomic Characters in the Asteraceae. Bot. Rev. 1982, 48, 123-551.

(6) Alvarenga, S. A. V.; Ferreira, M. J. P.; Emerenciano, V. P.; Cabrol-Bass, D. Chemosystematic Studies of Natural Compounds Isolated From Asteraceae: Characterization of Tribes by Principal Component Analysis. Chemometr. Intell. Lab. 2001, 56, 27-37.

(7) Zdero, C.; Bohlmann, F. Systematics and Evolution Within the Compositae, Seen With the Eyes of a Chemist. Plant Syst. Evol. 1990, 171, 1-14.

(8) Herlocker, J. L. Understanding and Improving Automated Collaborative Filtering Systems. Ph.D Dissertation, University of Minnesota, 2000.

(9) Erhan, D.; L'Heureux, P. J.; Yue, S. Y.; Bengio, Y. Collaborative Filtering on a Family of Biological Targets. J. Chem. Inf. Model. 2006, 46, 626-635.

(10) Witten, I. H.; Frank, E. Data Mining: Practical Machine Learning Tools and Techniques; Morgan Kaufmann: San Francisco, 2000.

(11) McCallum, A. Multi-label text classification with a mixture model trained by EM. In Proceedings of AAAI'99 Workshop on Text Learning, 1999.

(12) Schapire, R. E.; Singer, Y. BoosTexter: A Boosting-Based System for Text Categorization. Mach. Learn. 2000, 39, 135-168.

(13) Clare, A.; King, R. D. Knowledge Discovery in Multi-Label Phenotype Data. In Lecture Notes in Computer Science; Raedt, L. D., Siebes, A., Eds.; Springer: Berlin, Germany, 2001; Vol. 2168, pp. 42-53.

(14) Hristozov, D.; DaCosta, F. B.; Gasteiger, J. Sesquiterpene Lactones-Based Classification of the Family Asteraceae Using Neural Networks and K-Nearest Neighbors. J. Chem. Inf. Model. 2007, 47, 9-19.

(15) Boutell, M. R.; Luo, J.; Shen, X.; Brown, C. M. C. Learning Multi-Label Scene Classification. Pattern. Recogn. 2004, 37, 1757-1771.

(16) Schölkopf, B.; Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; MIT Press: Cambridge, MA, 2001.

(17) Spycher, S.; Nendza, M.; Gasteiger, J. Comparison of Different Classification Methods Applied to a Mode of Toxic Action Data Set. QSAR Combinat. Sci. 2004, 23, 779-791.

(18) Freund, Y.; Schapire, R. E. A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 1997, 55, 119-139.

(19) Elisseeff, A.; Weston, J. A Kernel Method for Multi-Labelled Classification. In Advances in Neural Information Processing Systems; Dietterich, T. G., Becker, S., Ghahramani, Z., Eds.; MIT Press: Cambridge, MA, 2002; Vol. 14.

(20) Quinlan, J. R. C4.5: Programs for Machine Learning; Morgan Kaufmann: San Mateo, California, 1993.

(21) Ueda, N.; Saito, K. Parametric Mixture Models for Multi-Label Text. In Advances in Neural Information Processing Systems; Becker, S., Thrun, S., Obermayer, K., Eds.; MIT Press: Cambridge, MA, 2003; Vol. 15.

(22) Freund, Y.; Mason, L. The alternating decision tree learning algorithm. In Proceedings of 16th International Conference on Machine Learning, Morgan Kaufmann: San Francisco, CA, 1999, pp. 124-133.

(23) De Comite, F.; Gilleron, R.; Tommasi, M. Learning Multi-Label Alternating Decision Trees From Texts and Data. In Lecture Notes in Computer Science; Perner, P., Rosenfeld, A., Eds.; Springer: Berlin, Germany, 2003; Vol. 2734, pp. 35-49.

(24) Zhang, M.-L.; Zhou, Z.-H. A k-Nearest Neighbor Based Algorithm for Multi-label Classification. In 2005 IEEE International Conference on Granular Computing, 2005, Vol. 2, pp. 718-721.

(25) Gower, J. C.; Legendre, P. Metric and Euclidean Properties of Dissimilarity Coefficients. J. Classif. 1986, 3, 5-48.

(26) Salton, G. Developments in Automatic Text Retrieval. Science 1991, 253, 974-980.

(27) Bremer, K. Asteraceae: Cladistics and Classification. Timber Press: Portland, 1994.

(28) Hemmer, M. C.; Steinhauer, V.; Gasteiger, J. Deriving the 3D Structure of Organic Molecules From Their Infrared Spectra. Vib. Spectrosc. 1999, 19, 151-164.

(29) Sadowski, J.; Gasteiger, J. From Atoms and Bonds to Three-Dimensional Atomic Coordinates: Automatic Model Builders. Chem. Rev. 1993, 93, 2567-2581.

(30) CORINA, version 3.2, Molecular Networks GmbH: Erlangen, Germany, http://www.molecular-networks.com (accessed 06.2006)

(31) ADRIANA.Code, version 1.0, Molecular Networks GmbH: Erlangen, Germany, http://www.molecular-networks.com (accessed 06.2006)

(32) Vapnik, V. N. Statistical Learning Theory. Wiley-Interscience: New York, NY, 1998.

(33) Hsu, C.-W.; Chang, C.-C.; Lin, C.-J. A Practical Guide to Support Vector Classification. 2006, http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf (accessed 06.2006)

(34) R Development Core Team. R: A language and environment for statistical computing, version 2.2.1, 2005, http://www.r-project.org (accessed 06.2006).

(35) Dimitriadou, E.; Hornik, K.; Leisch, F.; Meyer, D.; Weingessel, A. e1071: Misc functions of the department of statistics (e1071), TU Wien, 2005.

(36) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: New York, 2001.

Further comments and discussion

The study presented in the previous section concludes our attempt to extract new knowledge, or to confirm existing knowledge, in the chemotaxonomy of Asteraceae. Once again, good classification models were obtained. Thus, the taxonomic division of Asteraceae as proposed by Bremer relates to the plant’s ability to synthesize specialized sesquiterpene lactones. However, to further support this claim, additional studies utilizing some of the other taxonomic classifications of Asteraceae might be needed. In addition, the secondary metabolism of Asteraceae is by no means limited to sesquiterpene lactones. The consideration of additional classes of secondary metabolites such as flavonoids, coumarins, benzofurans, etc. is another possible extension of the work presented in Chapter 2 and Chapter 3.

A general overview of the techniques to deal with multi-labeled data was presented. Such methods have a wide spectrum of applications in all domains where an object may belong to more than one class simultaneously. These domains are quite diverse, including text categorization, selectivity of biologically active compounds, gene functions analysis, and image classification, to name just a few.

The use of a more sophisticated classification technique – namely support vector machines – resulted in better classification performance. However, the conceptually simpler k-nearest neighbor algorithm once again performed well, even in its multi-labeled reincarnation.

The search for sources of natural compounds which exhibit desired properties – biological activity, specific flavor, nutritional value, etc. – can be very time- and resource-consuming, considering the large number of plant species. The developed models can decrease the costs by allowing the targeted, knowledge-driven collection of plants.

In Chapter 2 and Chapter 3 we have shown how with the help of appropriate machine learning techniques chemical information can be transformed into knowledge concerning the taxonomic classification of plant species from Asteraceae and how this knowledge can be transferred into practical value. With this, we conclude our exploration of small chemical data sets. In the following chapters ways to navigate in chemical databases which contain hundreds of thousands of compounds will be presented.

4 Ligand-based Virtual Screening by Novelty Detection with Self-Organizing Maps

Overview

We continue our exploration in discovering knowledge from chemical data by moving from small data sets to large chemical databases. The focus of this and the next two chapters is ligand-based virtual screening. As mentioned in the introduction of this work, virtual screening, in general, is designed for searching large databases of chemical compounds with the help of a computer, with the aim of selecting a number of candidate molecules for testing in order to identify novel chemical entities that may have a desired biological activity. As such, it complements high-throughput screening, which is in everyday use in the modern pharmaceutical industry. Ligand-based virtual screening, in particular, requires one or more small molecules (ligands) which are known to be active against the target protein and does not require that the 3D structure of the protein is known.

When only a single known active molecule is available, it does not provide enough data to apply a full-blown machine learning technique. In such cases similarity search1 is the method of choice. A detailed overview of the similarity search process is presented in the next section of this chapter. Briefly, the known active molecule is used as probe against which all compounds from the screened set are compared. This can be done by substructure search, which looks for common fragments between the probe and the screened structure or by first encoding the probe and the screened molecules by a given structure representation and comparing the resulting representations. The latter approach is by far the most commonly used. Different types of binary fingerprints are the most often used structure representation in this type of ligand-based virtual screening.

The origins of the binary structural fingerprints can be traced back to the structural keys used to speed up substructure searching. A structural key is represented as a Boolean array in which each element is either one (true) or zero (false). In a structural key each position in this array represents the presence (one) or absence (zero) of a specific structural feature. To create a structural key, the important structural features (patterns) are determined and a Boolean array with the same size as the number of selected features is created for each molecule in the database.

The structural keys described above lack generality. The choice of patterns included in the key is somewhat arbitrary but has a critical effect on any subsequent processing. Fingerprints address this problem by eliminating the idea of pre-defined patterns. Fingerprints, just like structural keys, are stored in a Boolean array (hence the name binary fingerprints) but, unlike a structural key, there is no assigned meaning to each bit. There is a variety of methods to encode the chemical structure using the above idea. These methods have resulted in a variety of binary fingerprints. The first group is the so-called hashed fingerprints. The hashed fingerprints encode different patterns, where a pattern describes, for example, a path of length n bonds, i.e., (atom-bond-atom)n, with the natures of the atoms and bonds defined. The set of these patterns differs from molecule to molecule and it is generally not possible to assign each potential pattern to a specific position in a Boolean array of predefined length. Instead, the pattern is passed to a hashing function, which generates the corresponding position (or positions). The most popular types of hashed fingerprints are the ones from Daylight Chemical Information Systems2 and the Unity fingerprints from Tripos, Inc.3
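The hashing idea described above can be illustrated with a minimal Python sketch. It is not the Daylight or Unity algorithm: the molecule is a hypothetical list of atom symbols plus a bond list (bond types are ignored), every linear path of up to `max_bonds` bonds is written as a string, and SHA-256 is used as an arbitrary stand-in for the hashing function that maps each pattern to one bit position.

```python
import hashlib

def enumerate_paths(atoms, bonds, max_bonds):
    """Return the set of linear atom paths (as strings) with 0..max_bonds bonds."""
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    patterns = set()

    def walk(path):
        forward = [atoms[i] for i in path]
        # A path and its reverse are the same pattern; keep one canonical form.
        patterns.add(min("-".join(forward), "-".join(reversed(forward))))
        if len(path) - 1 == max_bonds:
            return
        for nxt in neighbors[path[-1]]:
            if nxt not in path:  # no atom is visited twice
                walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return patterns

def hashed_fingerprint(atoms, bonds, n_bits=1024, max_bonds=3):
    """Set one bit per pattern; the position is derived from a hash of the pattern."""
    bits = [0] * n_bits
    for pattern in enumerate_paths(atoms, bonds, max_bonds):
        digest = hashlib.sha256(pattern.encode("utf-8")).hexdigest()
        bits[int(digest, 16) % n_bits] = 1
    return bits

# Hypothetical example: the heavy atoms of ethanol, C-C-O
ethanol_fp = hashed_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
```

Because two different patterns may hash to the same position, a set bit cannot be traced back to a single structural feature, which is the loss of assigned meaning mentioned in the text.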

Another type of binary fingerprints is based on the circular substructure. The circular substructure is a descriptor where each atom is represented by a string of extended connectivity values calculated using a modification of the Morgan algorithm.4 Different variations of this type of fingerprints are provided by Scitegic’s Pipeline Pilot Software.5

Hashed binary fingerprints calculated with the algorithm of Daylight Chemical Information Systems2 (referred to as Daylight fingerprints from here on) are used throughout Chapter 4 and Chapter 5 of this work.

An important limitation of the binary fingerprints described above is that, although they are not based on predefined structural fragments, they encode mostly structural features of the molecules. This works well when the final goal is to retrieve molecules which are structurally similar to the probe. Based on the similarity principle, which states that similar molecules have similar properties, good results were obtained with all types of fingerprints when screening for biologically active molecules. However, there are notorious cases in which a very small change can lead to a total loss of activity,6 thus violating the similarity principle.

Another type of chemical structure representation – typically representing the chemical structure as a vector of real numbers – tries to solve this problem by taking into account different physico-chemical properties of the molecule or of the atoms and bonds from which the molecule is built. This approach is based on the assumption that structural similarity alone is not sufficient, or, at the extreme, not even needed as long as two molecules share the same charge or hydrophobic distribution, for example. In addition, while the binary fingerprints described above usually operate on the 2-dimensional chemical structure, some vectorial descriptors can describe the 3-dimensional structure as well. In Chapter 4 a topological autocorrelation, which is a vectorial descriptor based on the molecular constitution alone (2-dimensional), is used, while in Chapter 5 the radial distribution function (RDF) – a vectorial descriptor based on the 3-dimensional structure of the molecule, as already discussed in Chapter 2 and Chapter 3 – is investigated as well. A detailed description of these two structure representations is presented in the corresponding sections of Chapter 4 and Chapter 5. Additional information about these and other types of structural descriptors can be found in the Textbook7 and in the Handbook8 of Chemoinformatics. A full-featured implementation of topological autocorrelation and radial distribution function together with fast empirical methods for the calculation of different physico-chemical properties is available in Molecular Networks’ ADRIANA.Code Software.9

When several known active structures are available they present an opportunity to describe the activity space in a much better way. At this point different machine learning techniques can be applied with the aim of gaining better understanding of the underlying activity mechanism or simply to retrieve potential drug candidates. Nowadays a number of databases of biologically active compounds are freely – PubChem,10 National Cancer Institute (NCI) database11 – or commercially – MDL Drug Data Report (MDDR),12 WOrld of Molecular BioAcTivity (WOMBAT)13 – available. The availability of these databases presents an opportunity to use a set of active structures rather than a single query. However, the majority of machine-learning techniques require both active and inactive structures. While the access to active structures is relatively easy having in mind the number of databases mentioned above, the access to proven inactive compounds is limited since they are usually not stored. The problem is, to some extent, inverted in the pharmaceutical industry. Due to the availability of high-throughput screening techniques, the proprietary databases of the pharmaceutical companies usually contain many more tested inactive structures than known active molecules.

The class of machine learning techniques devoted to learning from such single-class data is broadly termed novelty detection or one-class classification.14,15 In the next section we investigate the ability of one particular novelty detection method, based on Self-Organizing Maps (SOM),16 to help us in navigating through a large database of biologically active compounds. The results are compared to an adapted version of the similarity search algorithm, which can take multiple known active structures into account.

The remainder of this chapter corresponds exactly to the original paper accepted for publication in the Journal of Chemical Information and Modeling, other than the last section after the references list and the numbering.

References

(1) Willett, P.; Barnard, J. M.; Downs, G. M. Chemical Similarity Searching. J. Chem. Inf. Comput. Sci. 1998, 38, 983-996.

(2) Daylight Chemical Information Systems, Inc. http://www.daylight.com/dayhtml/doc/theory/theory.finger.html, accessed 06.2007

(3) Tripos, Inc. http://www.tripos.com, accessed 06.2007

(4) Morgan, H. L. The Generation of a Unique Machine Description for Chemical Structures-A Technique Developed at Chemical Abstracts Service. J. Chem. Doc. 1965, 5, 107-113.

(5) Scitegic, Inc. http://www.scitegic.com, accessed 06.2007

(6) Martin, Y. C.; Kofron, J. L.; Traphagen, L. M. Do Structurally Similar Molecules Have Similar Biological Activity? J. Med. Chem. 2002, 45, 4350-4358.

(7) Terfloth, L. Calculation of Structure Descriptors. In Chemoinformatics; Gasteiger J., Engel T., Eds.; Wiley-VCH: Weinheim, 2003, Chapter 8, pp. 401-438

(8) Gasteiger, J. A Hierarchy of Structure Representations. In Handbook of Chemoinformatics; Gasteiger, J., Ed.; Wiley-VCH Verlag: Weinheim, 2003; Chapter 3.1, pp. 1034-1061

(9) ADRIANA.Code, version 1.0, Molecular Networks GmbH, Erlangen, Germany, http://www.molecular-networks.com, accessed 06.2006

(10) PubChem Project, http://pubchem.ncbi.nlm.nih.gov/, accessed 06.2007

(11) National Cancer Institute Database, http://129.43.27.140/ncidb2/, accessed 06.2007

(12) MDL Drug Data Report, version 2006.1, MDL Information Systems, http://www.mdli.com/, accessed 06.2007

(13) Olah, M.; Mracec, M.; Ostopovici, L.; Rad, R.; Bora, A.; Hadaruga, N.; Olah, I.; Banda, M.; Simon, Z.; Mracec, M.; Oprea, T. I. WOMBAT: World of Molecular Bioactivity. In Cheminformatics in Drug Discovery; Oprea, T. I., Ed.; Wiley-VCH: New York, 2003.

(14) Markou, M.; Singh, S. Novelty Detection: a Review - Part 1: Statistical Approaches. Signal Process. 2003, 83, 2481-2497.

(15) Markou, M.; Singh, S. Novelty Detection: a Review - Part 2: Neural Network Based Approaches. Signal Process. 2003, 83, 2499-2521.

(16) Kohonen, T. Self-Organizing Maps, Springer: Berlin, 2001.

Original Article

Ligand-based Virtual Screening by Novelty Detection with Self-Organizing Maps

Dimitar Hristozov1, Tudor I. Oprea2, and Johann Gasteiger1*

1Computer-Chemie-Centrum, Universität Erlangen-Nürnberg, Nägelsbachstr. 25, D-91052 Erlangen, Germany

2Division of Biocomputing, University of New Mexico School of Medicine, MSC 11 6145, 1 University of New Mexico, Albuquerque, New Mexico 87131-0001, USA

Hristozov, D., Oprea, T.I., Gasteiger, J., 2007, submitted to J. Chem. Inf. Model.

Abstract

We describe a novel method for ligand-based virtual screening, based on utilizing Self-Organizing Maps (SOM) as a novelty detection device. Novelty detection (or one-class classification) refers to the attempt to identify patterns that do not belong to the space covered by a given data set. In ligand-based virtual screening, chemical structures perceived as novel lie outside the known activity space and can therefore be discarded from further investigation. In this context, the concept of “novel structure” refers to a compound that is unlikely to share the activity of the query structures. Compounds not perceived as “novel” are suspected to share the activity of the query structures. Nowadays, various databases contain active structures, but access to compounds which have been found to be inactive in a biological assay is limited. This work addresses this problem via novelty detection, which does not require proven inactive compounds. The structures are described by spatial autocorrelation functions weighted by atomic physicochemical properties. Different methods for selecting a subset of targets from a larger set are discussed. A comparison with similarity search based on Daylight fingerprints followed by data fusion is presented. The two methods complement each other to a large extent. In a retrospective screening of the WOMBAT database, novelty detection with SOM gave enrichment factors between 105 and 462, an improvement of between 25% and 100% over the similarity search based on Daylight fingerprints, when the one hundred top-ranked structures were considered. Novelty detection with SOM is applicable (1) to improve the retrieval of potentially active compounds, also in concert with other virtual screening methods; (2) as a library design tool for discarding a large number of compounds that are unlikely to possess a given biological activity; (3) for selecting a small number of potentially active compounds from a large data set.

4.1 Introduction

Virtual screening of databases is a popular method for selecting available compounds for biological assays. It involves the scoring of molecules in a database of chemical compounds in order of decreasing probability of biological activity. The result of this process is a ranked list in which the highest ranked compounds are assumed most likely to share the target’s activity. In such a manner, potential hits may be acquired or synthesized, then tested in the early stages of a lead-discovery program. When the 3D structure of the biological target protein is available, structure-based approaches to virtual screening may be used.1,2 However, when such information is not available, ligand-based approaches can be used, with similarity search3 being the most common one. Similarity searching requires a known active structure, such as a hit from high-throughput screening, used as query. This query is compared with a set of potentially active compounds by means of a similarity measure.

The use of known active compounds as template to rank a set of compounds with unknown activity has been explored in a variety of studies.4,5,6 Once the compounds from the screened set are compared to the query, the set is sorted using the similarity coefficients; the top-ranked compounds are expected to share the biological properties of the query molecule. Most commonly 2D fingerprints (as descriptors) and Tanimoto coefficients3 (as similarity metric) are utilized, although other types of similarity measures7,8 exist. Similarity measures are often continuous values, and can be adapted to real-value representation. A number of studies have applied autocorrelation vectors,5 reduced graphs,9 or other real valued descriptors10 to this problem.

The availability of published active compounds often presents an opportunity to use a set of active structures rather than a single query. Several reports11,12,13 utilizing this information for similarity search exist. A common way of incorporating such multi-target data is via data fusion.12 The fusion process usually improves the results, and Hert et al.14 recently reported additional improvements when the nearest non-active neighbors are used as queries. Another possibility for utilizing data from multiple targets is to apply novelty detection techniques.

Novelty detection refers to the attempt to identify patterns that do not belong to the space covered by a given set.15,16,17 The set usually contains patterns for only one of the available classes, hence novelty detection techniques are sometimes called one-class classification. This approach is valuable in many cases where data about a given state of the system can be acquired easily, while the collection of data for the other states is difficult. Novelty detection is commonly applied to fault detection, network intrusion detection, handwritten digit recognition, Internet and e-commerce transactions, statistical process control, etc.16,17 Prior to virtual screening, one can often access different databases of active structures; however, access to compounds that have been found to be inactive in a biological assay is relatively difficult unless such results are reported or stored in on-line repositories (such as PubChem). Therefore, building two-class classification models that might distinguish actives from inactives is hindered, and a method that does not require proven inactive compounds is preferable.

In a typical novelty detection application, say, fault detection, there is usually much more data for the “normal” case. In such a scenario one is interested in the patterns predicted as novel (signaling fault conditions). However, when novelty detection is put into a virtual screening context, the “normal” case is described using only known active compounds. Since the system has knowledge of the active class alone, it predicts as novel those compounds which lie far enough from the chemical space of the active compounds. Thus, the process is inverted in the sense that the emphasis is now on the compounds predicted as known, that is, active. In addition, while the discovery of new active scaffolds and chemotypes is a desirable feature of any VS method, the term “novelty” throughout this text refers only to compounds lying sufficiently far from the chemical space of the given activity.

Novelty detection can be performed by a wide variety of statistical and machine-learning methods. A comprehensive review covering both approaches has been presented by Markou and Singh.16,17 From the large variety of novelty detection techniques we selected one based on Self-Organizing Maps (SOM). SOM is a neural network model well-known for its applications to high-dimensional data analysis in many fields, including chemistry.18,19 In spite of its well-documented use as a novelty detector in the field of machine learning,20,21,22 and the continuous use of SOM for solving chemistry-related problems,23 including virtual screening,24,25 to the best of our knowledge novelty detection with SOM has not been applied to chemical problems so far.

There are several reasons for the application of SOM as a novelty detector: a) the method allows the ranking of the accepted structures according to their proximity to the chemical space covered by the trained SOM, besides the ability to detect novel patterns; this feature allows us to compare the results of this method with the results obtained using similarity search; b) the size of the trained SOM is usually smaller than the size of the entire training set, which offers speed improvements; c) the SOM is an unsupervised learning method, thus its application to such “one-class” problems is intuitive.
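Point a) can be illustrated with a small sketch of the accept, reject, and rank logic. It rests on assumptions not stated in this section: the trained SOM is represented only by a hypothetical list of prototype (weight) vectors, no training is performed, and the distance `threshold` that separates accepted from novel compounds is an arbitrary illustrative value rather than the criterion used in this work.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def novelty_screen(prototypes, screened, threshold):
    """Split screened compounds into accepted (ranked) and novel (discarded).

    prototypes: vectors describing the active space (assumed to be given;
                in the text they would come from a trained SOM).
    screened:   dict mapping compound name -> descriptor vector.
    threshold:  hypothetical distance above which a compound counts as novel.
    """
    accepted, novel = [], []
    for name, vector in screened.items():
        distance = min(euclidean(vector, p) for p in prototypes)
        if distance > threshold:
            novel.append(name)
        else:
            accepted.append((name, distance))
    accepted.sort(key=lambda item: item[1])  # closest to the active space first
    return accepted, novel

# Hypothetical two-dimensional example
prototypes = [[0.0, 0.0], [1.0, 1.0]]
library = {"a": [0.1, 0.0], "b": [5.0, 5.0], "c": [1.0, 1.5]}
accepted, novel = novelty_screen(prototypes, library, threshold=1.0)
```

In this toy example compound "b" is rejected as novel, while "a" and "c" are accepted and ranked by their distance to the nearest prototype.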

To build a good description of the active space, a representative set of actives is needed. Clearly, the success of a retrospective virtual screening with SOM novelty detection cannot be measured on the same compounds which were used to build the novelty detector. Thus, different methods for splitting the available sets of actives into training and test set were studied. Although not strictly needed for a similarity search with Daylight fingerprints, such a split might be useful when a representative set has to be selected in order to decrease the run time of the search.

The aims of this work were to a) examine the applicability of SOM novelty detection to ligand-based virtual screening; b) compare SOM novelty detection with the most commonly used similarity search with Daylight fingerprints; c) study the effect of different methods for subset selection on the results of a virtual screening experiment.

4.2 Materials and Methods

4.2.1 Data sets

4.2.1.1 Overview

Five sets of known inhibitors – 568 Acetylcholinesterase (AChE), 999 Cyclooxygenase 2 (COX-2), 457 3',5'-cyclic-nucleotide phosphodiesterase IV (PDE4), 2105 Thrombin, and 199 u-Plasminogen activator (uPA) inhibitors – were extracted from the WOMBAT (World Of Molecular BioAcTivity) database,26 version 2006.1. WOMBAT is a target-annotated database available from Sunset Molecular Discovery (Santa Fe, New Mexico). Release 2006.1 contains 154,236 compounds, collected from articles in medicinal chemistry journals published between 1975 and 2006. A reduced set of 135,877 unique chemical structures resulted after removing duplicate structures as well as those for which some of the used physicochemical properties could not be calculated.

In addition to the structural information, WOMBAT also contains the reported activity values, expressed as pKi values, information about the species in which the tests were performed, the biological role of the structure (inhibitor, antagonist, etc.), as well as additional properties of interest. For all sets of actives used in this study, only inhibitors of the corresponding enzymes tested in human and with reported activity less than 30 μM were selected. An overview of the selected active subsets is given in Table 4.1.

Table 4.1 Subsets of inhibitors used in this study

| Target | Abbreviation | N | μ^a | σ^a |
|---|---|---|---|---|
| Acetylcholinesterase | AChE | 568 | 0.433 | 0.168 |
| Cyclooxygenase 2 | COX-2 | 999 | 0.345 | 0.142 |
| 3',5'-cyclic-nucleotide phosphodiesterase | PDE4 | 457 | 0.430 | 0.161 |
| Thrombin | Thrombin | 2105 | 0.377 | 0.133 |
| u-Plasminogen activator | uPA | 199 | 0.397 | 0.172 |

a μ and σ give the mean and standard deviation of the intraclass similarities, obtained with Daylight fingerprints and Tanimoto coefficient

In addition to the number of compounds in each subset, the average self-similarity, μ, of the whole subset is given. This value was calculated by taking the pairwise Tanimoto similarity between all compounds in the set using the 1024-bit Daylight fingerprints, and calculating the average value of all similarity scores. As can be seen, the inhibitors of AChE are the most self-similar group, having the highest value of μ, while the COX-2 set is the least self-similar, having the lowest μ value.
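The calculation of μ and σ can be sketched as follows. The sketch makes assumptions that the text does not settle: each fingerprint is represented as a Python set of the positions of its "on" bits, each unordered pair of distinct compounds is counted once, and the population standard deviation is used. The function names are illustrative only.

```python
from itertools import combinations
from statistics import mean, pstdev

def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    common = len(bits_a & bits_b)
    union = len(bits_a) + len(bits_b) - common
    return common / union if union else 0.0

def intraclass_similarity(fingerprints):
    """Mean and standard deviation of the similarity over all distinct pairs."""
    scores = [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]
    return mean(scores), pstdev(scores)

# Hypothetical toy set of three fingerprints
toy_set = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3, 4}]
mu, sigma = intraclass_similarity(toy_set)
```

For the toy set the three pairwise similarities are 0.5, 0.75, and 0.75, so μ is 2/3. For a set of N compounds the number of pairs grows as N(N-1)/2, which is about 2.2 million pairs for the 2105 thrombin inhibitors.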

4.2.1.2 Training set selection

Three methods for splitting the available actives into training and test set were compared – random, semi-random and Taylor-Butina27,28 clustering. In all cases, half of the known actives were used as training set, and the other half as a test set. The test set was merged with the rest of the WOMBAT structures (up to 135,877), and the performance of the different methods was measured by their ability to retrieve the known active compounds from this set.

The semi-random method was based on the activity values, assigned to each structure. Each subset was sorted according to the reported activity and three groups – of high (pKi > 8), medium (6 < pKi <= 8), and low (4.5 < pKi <= 6) activity were distinguished. Half of the structures from each of these groups were randomly picked as training and the remaining were kept aside as a test set.
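A minimal sketch of such a semi-random (activity-stratified) split is given below. The pKi ranges are those stated above; the function names, the fixed random seed, and the rounding down of odd-sized groups are assumptions made for illustration.

```python
import random

def activity_group(pki):
    """Assign the activity class using the pKi ranges given in the text."""
    if pki > 8:
        return "high"
    if pki > 6:
        return "medium"
    if pki > 4.5:
        return "low"
    return None  # outside the ranges considered

def semi_random_split(compounds, seed=0):
    """compounds: list of (identifier, pKi). Returns (training, test) identifier lists."""
    rng = random.Random(seed)
    groups = {"high": [], "medium": [], "low": []}
    for ident, pki in compounds:
        group = activity_group(pki)
        if group is not None:
            groups[group].append(ident)

    training, test = [], []
    for members in groups.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2  # assumption: odd groups give the extra compound to the test set
        training.extend(shuffled[:half])
        test.extend(shuffled[half:])
    return training, test

# Hypothetical identifiers and pKi values
data = [("h1", 9.0), ("h2", 8.5), ("h3", 8.2), ("h4", 9.5),
        ("m1", 7.0), ("m2", 6.5), ("m3", 7.9), ("m4", 6.1),
        ("l1", 5.0), ("l2", 5.5), ("x", 4.0)]
train_ids, test_ids = semi_random_split(data)
```

Compared with a purely random split, this guarantees that both halves contain compounds from each activity range.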

The Taylor-Butina clustering was first described by Taylor27 and later by Butina.28 It is an unsupervised non-hierarchical clustering method which guarantees that every cluster contains molecules which are within a distance cut-off of the central molecule. The similarity matrix obtained with Daylight fingerprints and Tanimoto coefficients for each of the active subsets was used as input, and a cut-off value of 0.8 was used. Half of the structures in each cluster were selected as training set. Half of the singletons were randomly assigned to the training set as well. The clustering as well as data set splitting was carried out using the R environment.29
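The clustering and the subsequent split can be sketched in Python (the work itself used R). The sketch follows the commonly described form of the algorithm, in which the molecule with the most neighbors becomes the next cluster center, and it assumes that the cut-off of 0.8 is applied as a minimum Tanimoto similarity. The function names and the toy matrix are hypothetical.

```python
import random

def taylor_butina(similarity, cutoff=0.8):
    """Cluster items from a symmetric similarity matrix (list of lists).

    Returns a list of clusters; the first index in each cluster is its center.
    """
    n = len(similarity)
    neighbors = [
        [j for j in range(n) if j != i and similarity[i][j] >= cutoff]
        for i in range(n)
    ]
    # Candidates with the most neighbors are considered first.
    order = sorted(range(n), key=lambda i: len(neighbors[i]), reverse=True)

    assigned = [False] * n
    clusters = []
    for center in order:
        if assigned[center]:
            continue
        members = [center] + [j for j in neighbors[center] if not assigned[j]]
        for j in members:
            assigned[j] = True
        clusters.append(members)
    return clusters

def split_by_cluster(clusters, seed=0):
    """Send half of each cluster, and half of all singletons, to the training set."""
    rng = random.Random(seed)
    singletons = [c[0] for c in clusters if len(c) == 1]
    groups = [c for c in clusters if len(c) > 1] + [singletons]
    training, test = [], []
    for members in groups:
        shuffled = list(members)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        training.extend(shuffled[:half])
        test.extend(shuffled[half:])
    return training, test

# Hypothetical similarity matrix for six compounds
toy = [
    [1.0, 0.9, 0.85, 0.1, 0.1, 0.1],
    [0.9, 1.0, 0.5, 0.1, 0.1, 0.1],
    [0.85, 0.5, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.95, 0.1],
    [0.1, 0.1, 0.1, 0.95, 1.0, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1, 1.0],
]
clusters = taylor_butina(toy)
training, test = split_by_cluster(clusters)
```

Compounds 1 and 2 end up in the same cluster although their mutual similarity is only 0.5: the method only guarantees that every member is within the cut-off of the cluster center.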

4.2.2 Chemical structure representation

4.2.2.1 Binary fingerprints (BFP)

1024-dimensional unfolded Daylight fingerprints30 were generated with the Chemical Descriptors Library.31

4.2.2.2 Topological autocorrelation (AC2D)

Introduced by Moreau et al.32 the topological autocorrelation descriptors have since then been applied in a number of studies.5,24,33 The descriptors are calculated according to eq. 4.1:

$$A(d) = \sum_{i=1}^{k} \sum_{j=i}^{k} p_i \, p_j \, \delta(d - d_{ij}) \qquad (4.1)$$

Here $k$ is the number of atoms in the molecule, $p_i$ is some atomic property of atom $i$, $d_{ij}$ is the topological distance (i.e., the number of bonds on the shortest path) between atoms $i$ and $j$, and $\delta$ is the binning function, with $\delta(x) = 1$ for $x = 0$ and $\delta(x) = 0$ for $x \neq 0$. In the present study, the autocorrelation function was evaluated for topological distances from 0 to 10. Thus, for each atomic property used, the chemical structures were represented as 11-dimensional vectors.
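Eq. 4.1 can be evaluated directly once the topological distances are known. The following sketch is not the ADRIANA.Code implementation: the molecule is a hypothetical list of atomic property values plus a bond list, and the distances are obtained by a breadth-first search over the molecular graph.

```python
from collections import deque

def topological_distances(n_atoms, bonds):
    """All-pairs shortest path lengths (in bonds) by breadth-first search."""
    neighbors = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    dist = [[None] * n_atoms for _ in range(n_atoms)]
    for start in range(n_atoms):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in neighbors[cur]:
                if dist[start][nxt] is None:
                    dist[start][nxt] = dist[start][cur] + 1
                    queue.append(nxt)
    return dist

def autocorrelation(props, bonds, max_distance=10):
    """A(d) for d = 0..max_distance, summing p_i * p_j over pairs with j >= i."""
    k = len(props)
    dist = topological_distances(k, bonds)
    vector = [0.0] * (max_distance + 1)
    for i in range(k):
        for j in range(i, k):
            d = dist[i][j]
            # The binning function keeps only pairs whose distance equals d.
            if d is not None and d <= max_distance:
                vector[d] += props[i] * props[j]
    return vector

# Hypothetical example: carbon skeleton of propane with the identity property
vec = autocorrelation([1.0, 1.0, 1.0], [(0, 1), (1, 2)])
```

With the identity property, A(d) simply counts the atom pairs at distance d: three atoms at distance 0, two bonded pairs, and one pair two bonds apart, giving an 11-dimensional vector that starts with 3, 2, 1 and is zero elsewhere.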

Three atomic properties were calculated with the software package PETRA34 by previously published empirical methods for all atoms in a molecule: sigma electronegativity (χσ),35 effective atom polarizability (α),36 and partial atomic charge (qtot).37 In addition, the identity function, i.e., each atom was represented by 1, was used. The atomic properties for each molecule, with the exception of the identity, were auto-scaled to zero mean and unit variance before applying eq. 4.1. The scaling has been shown38 to diminish the correlations between the bins of autocorrelation and to better preserve the physicochemical information. The resulting autocorrelation vectors were calculated with ADRIANA.Code39 and were additionally auto-scaled to ensure that the values are comparable when a distance measure (such as the Euclidean distance, used by SOM) is calculated.
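The two auto-scaling steps can be sketched as follows. The use of the population standard deviation and the handling of constant values are assumptions, since the text specifies only zero mean and unit variance; the function names are hypothetical.

```python
from statistics import mean, pstdev

def autoscale(values):
    """Scale a list of numbers to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Assumption: constant values carry no information and are mapped to zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def autoscale_columns(vectors):
    """Auto-scale each component (column) of a list of equally long vectors."""
    columns = [autoscale(list(col)) for col in zip(*vectors)]
    return [list(row) for row in zip(*columns)]

# Per-molecule scaling of one atomic property (hypothetical values)
scaled_property = autoscale([1.0, 2.0, 3.0])

# Data-set-wide scaling of descriptor vectors (hypothetical values)
scaled_vectors = autoscale_columns([[1.0, 10.0], [3.0, 30.0]])
```

The first function corresponds to scaling an atomic property within one molecule, the second to scaling each autocorrelation component across the data set so that no single component dominates the Euclidean distance.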

4.2.3 Virtual screening methods

A schematic overview of the three virtual screening methods used in this work is presented in Figure 4.1. A detailed overview of each of these methods is given in the following sections.

[Figure 4.1: tree diagram. Ligand-based virtual screening methods: (1) similarity search with data fusion (SSDF); (2) SOM novelty detection (somND), subdivided into (2a) single structure representation (somNDs) and (2b) multiple structure representations (somNDm).]

Figure 4.1 Overview and abbreviations of the three methods for ligand-based virtual screening used in this work. See sections 4.2.3.1, 4.2.3.2.a and 4.2.3.2.b for details on each method.

4.2.3.1 Similarity search with subsequent data fusion

A schematic workflow for this type of virtual screening is presented in Figure 4.2.

[Figure 4.2: workflow diagram showing the query q, the screened set (structures 1…n) with their similarity scores, and the resulting ranked list.]

Figure 4.2 Workflow of a similarity search with a single query. 1) represent the chemical structures; 2) measure the similarity; 3) sort the list. See text for details.

In the first step, both the query and the screened data set are described by the same structure representation. In the second step, the representations of each of the n structures in the screened dataset are compared to the query representation by means of a similarity measure and a list with the similarity scores is obtained. In step 3 this list is sorted in descending order with regards to the similarity score and the final ranked list is produced. In the present study for this similarity ranking, the structures were described by 1024-dimensional binary Daylight fingerprints. The Tanimoto coefficient was used as a similarity measure and was calculated according to eq. 4.2:

$$ S_T = \frac{c}{a + b - c} \tag{4.2} $$

where a is the number of bits “on” in the representation of the query structure, b is the number of bits “on” in the screened structure, and c is the number of bits “on” in both, i.e., in the intersection of the two representations.
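As an illustration of eq. 4.2, the following minimal sketch computes the Tanimoto coefficient for two binary fingerprints. It assumes that each fingerprint is stored as the set of positions of its “on” bits; the function name `tanimoto` and the toy bit positions are hypothetical and are not taken from the Daylight toolkit.

```python
def tanimoto(query_bits, screened_bits):
    """Tanimoto coefficient S_T = c / (a + b - c) for two binary fingerprints.

    Each fingerprint is given as the set of positions of its "on" bits.
    """
    a = len(query_bits)                  # bits "on" in the query
    b = len(screened_bits)               # bits "on" in the screened structure
    c = len(query_bits & screened_bits)  # bits "on" in both
    if a + b - c == 0:                   # both fingerprints empty: define S_T as 0
        return 0.0
    return c / (a + b - c)


# Toy fingerprints (hypothetical bit positions, not real Daylight output).
query = {1, 5, 9, 200}
screened = {5, 9, 77}
print(tanimoto(query, screened))  # 2 / (4 + 3 - 2) = 0.4
```

Returning zero for two empty fingerprints is a convention chosen here only to avoid a division by zero.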

The workflow presented in Figure 4.2 applies to only one query structure. However, a set of known high-quality actives is often available. It has been shown by Whittle et al.11 that in such cases the fusion of the ranked lists obtained from each query structure enhances the results of the virtual screen. Thus, to obtain the highest performance of the similarity search, the m ranked lists (m being the number of query structures, i.e., the number of actives in the training set) were subjected to data fusion. The data fusion algorithm calculates a new score for each structure from the screened dataset by combining the scores which the structure has obtained in the m similarity lists. There are different methods to combine these scores, called “fusion rules”.12 In the present work the MAX rule was used, meaning that each screened structure j obtained a final score equal to the maximum of its individual scores collected from each of the lists, according to eq. 4.3:

$$ S_{FUS}(j) = \max_{i} \left[ S^{*}(i, j) \right] \tag{4.3} $$

where S* denotes the calculated similarity score between the query structure i and the screened structure j.
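The MAX rule of eq. 4.3 can be sketched as follows. This is an illustration only, assuming that each of the m similarity lists is available as a dictionary mapping a structure identifier to its score; the function name `max_fusion` and the identifiers `mol_A` to `mol_C` are hypothetical.

```python
def max_fusion(score_lists):
    """Fuse m similarity lists with the MAX rule and return a ranked list.

    score_lists[i][j] holds S*(i, j), the similarity of screened
    structure j to query structure i.
    """
    fused = {}
    for scores in score_lists:
        for structure, score in scores.items():
            fused[structure] = max(score, fused.get(structure, float("-inf")))
    # Sort in descending order of the fused score.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


lists = [
    {"mol_A": 0.70, "mol_B": 0.90, "mol_C": 0.20},  # scores against query 1
    {"mol_A": 0.85, "mol_B": 0.40, "mol_C": 0.30},  # scores against query 2
]
ranked = max_fusion(lists)
print(ranked)  # [('mol_B', 0.9), ('mol_A', 0.85), ('mol_C', 0.3)]
```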

4.2.3.2 Novelty detection with Self-Organizing Maps (somND)

4.2.3.2.1 Single structure representation (somNDs)

The process of using a SOM novelty detector for virtual screening with a single structure representation is shown in Figure 4.3.

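[Figure 4.3: schematic. The m structures of the training set and the n structures of the screened set are represented (step 1), a SOM is trained on the training set (step 2), the screened set is projected onto the trained SOM and scored (step 3; example scores 1.0, 0.01, ..., 0.2), and the scored list is sorted into the ranked list (step 4).]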

Figure 4.3 Workflow of a virtual screening with a SOM novelty detector. 1) represent the chemical structures; 2) train a SOM; 3) project the screened data set; 4) sort the list. See text for details.


In the first step, both the training set and the screened data set are described by the same structure representation. In the second step, a SOM is trained using only the known active compounds in the training set. Once the trained SOM is obtained, a local accuracy is assigned to each neuron by using the average distance between this neuron and its first-sphere neighbors.20,22 In step 3, each structure from the screened data set is projected onto the trained SOM and its best matching neuron (BMN) is found. The (Euclidean) distance between the screened structure and its BMN is then compared to the local accuracy at the BMN, and the screened structure j obtains a score SSOM(j) according to eq. 4.4:

$$ S_{SOM}(j) = 1 - \frac{d(j, BMN_j)}{\alpha_{BMN_j}} \tag{4.4} $$

where d(j, BMNj) denotes the distance between the structure j from the screened set and its BMN, BMNj is the best matching neuron for structure j on the trained SOM, and αBMNj is the local accuracy at BMNj.

The process of assigning the local accuracy and the scoring of the screened structures is visualized in Figure 4.4 for a SOM with rectangular topology and lattice.

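[Figure 4.4: a neuron with local accuracy α and three hypothetical structures 1, 2 and 3 mapped onto it, scored as 1 − (d1 / α), 1 − (d2 / α) and 1 − (d3 / α), respectively.]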

Figure 4.4 Assignment of local accuracy to a SOM with rectangular topology and lattice and scoring of three hypothetical structures. The local accuracy α is determined as the average distance between the neuron and its first-sphere neighbors. All screened structures falling into the neuron are scored as shown, with d denoting the distance between the neuron weight vector and the structure representation vector.

The distance d2 between the screened structure 2 and the neuron weight vector is larger than the local accuracy (see Figure 4.4); thus, its score becomes negative and consequently it is classified as novel (unlikely to share the biological activity).
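The scoring of eq. 4.4 can be sketched for an already trained map as shown below. The sketch makes several assumptions that are not stated in the text: the first sphere of a rectangular, non-toroidal lattice is taken to consist of the up to eight adjacent neurons, the weight vectors are toy values rather than the result of SOM training, and all function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def local_accuracy(weights, row, col):
    """Average distance between a neuron and its first-sphere neighbours.

    Assumption: up to eight adjacent neurons, no wrap-around at the edges.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    dists = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            if (dr, dc) != (0, 0) and 0 <= r < n_rows and 0 <= c < n_cols:
                dists.append(euclidean(weights[row][col], weights[r][c]))
    return sum(dists) / len(dists)


def som_score(weights, x):
    """Return S_SOM = 1 - d(x, BMN) / alpha_BMN for one screened structure."""
    cells = [(r, c) for r in range(len(weights)) for c in range(len(weights[0]))]
    bmn = min(cells, key=lambda rc: euclidean(x, weights[rc[0]][rc[1]]))
    d = euclidean(x, weights[bmn[0]][bmn[1]])
    return 1.0 - d / local_accuracy(weights, *bmn)


# 2 x 2 toy map with two-dimensional weight vectors (hypothetical values).
weights = [[(0.0, 0.0), (1.0, 0.0)],
           [(0.0, 1.0), (1.0, 1.0)]]
close = som_score(weights, (0.1, 0.0))   # near a neuron: positive score
far = som_score(weights, (-3.0, 0.0))    # far from every neuron: negative score
print(close, far)
```

A structure with a negative score, such as `far` here, corresponds to structure 2 in Figure 4.4 and would be classified as novel.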

4.2.3.2.2 Two or more structure representations (somNDm)

The above workflow makes the screening of a data set using a single representation of the chemical structures straightforward. Usually, however, more than one representation is available; in this study, autocorrelation vectors weighted by different atomic properties were used. The easiest way to combine such multiple structure representations is to concatenate all vectors, producing a higher-dimensional descriptor, and to apply the workflow described above.

However, when more than one representation of the chemical structures is available, another method for utilizing this information may be better suited to the task at hand. We adopted here the generalized method for multi-sensor environments described by Wong et al.22 The method relies on training a SOM for each individual chemical structure representation and projecting the training set through these networks. The outputs of each network for the training set, i.e., the distances between each pattern and its winning neuron, are collected in a matrix, which constitutes the system’s knowledge of the “normal” (active) case. When a structure is presented for screening, the outputs of all SOMs give a vector in the output space, which may be compared to the positions of the training data. This comparison is performed using the Mahalanobis distance (MD) between the screened structure and the training data in the output space. The MD measures the distance between a vector, x, and the mean vector, mx, scaled by the covariance matrix, Cxx. Using this notation, the (squared) MD is given in eq. 4.5:

$$ MD^{2} = (x - m_x)^{T} \, C_{xx}^{-1} \, (x - m_x) \tag{4.5} $$

In the context of this work, mx is a vector containing the column means of the “active” space matrix obtained when building the novelty detector (area C in Figure 4.5, left) and Cxx is the covariance matrix of this “active” space matrix; x denotes a vector containing the output of each individual SOM for the compound being tested (area C in Figure 4.5, right).

De Maesschalck et al.40 present a comprehensive tutorial about the MD and its applications. An important property of the squared MDs is that they are χ2 distributed. Thus, one can apply a statistically sound threshold when declaring patterns as novel (less likely to possess biological activity, given the previously known actives). The threshold may be calculated from the quantiles of the χ2 distribution: χ2(α=0.99, n) gives a value that encapsulates 99% of the data in the active cluster when n different structure representations are used. Figure 4.5 visualizes the four steps of the described procedure.

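[Figure 4.5: four-area schematic. Area A: the four atomic physicochemical properties used to weight the autocorrelation vectors (identity, electronegativity, polarizability, partial charge). Area B: one SOM trained per representation. Area C: left, the “active” space matrix of the distances dI, de, dp and dq for the training structures 1 to 3; right, the corresponding vector (dIT, deT, dpT, dqT) for the test pattern T, compared to the “active” space by the MD. Area D: the MD is compared to the threshold χ2(0.99, 4) and the test pattern is classified as active or inactive.]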

Figure 4.5 Novelty detection with four different structure representations. After the selected structure representations have been obtained (area A), a SOM is trained for each structure representation (area B). The training set is projected through the trained networks and the knowledge of the “active” space is collected in a matrix (area C, left). The screened structure is then projected through the same networks, collecting their output in a vector (area C, right). This vector is in turn compared to the “active” space by means of the Mahalanobis distance. Based on a statistically chosen threshold, the screened structure is classified as active or inactive (area D).
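The comparison in areas C and D can be sketched as follows. This is an illustration under stated assumptions, not the in-house program: the “active” space matrix holds toy distances for only two hypothetical representations, the sample covariance (divisor N − 1) is used, and the χ2 quantile is obtained by bisection on the closed-form χ2 distribution function, which is valid for an even number of degrees of freedom only. All function names are hypothetical.

```python
import math


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def covariance(rows):
    """Sample covariance matrix (divisor N - 1) of the row vectors."""
    n, m = len(rows), mean_vector(rows)
    k = len(m)
    return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows) / (n - 1)
             for j in range(k)] for i in range(k)]


def solve(a, b):
    """Solve a @ y = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            f = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= f * aug[col][c]
    y = [0.0] * k
    for i in reversed(range(k)):
        y[i] = (aug[i][k] - sum(aug[i][j] * y[j] for j in range(i + 1, k))) / aug[i][i]
    return y


def squared_md(x, rows):
    """Squared Mahalanobis distance of x from the training rows (eq. 4.5)."""
    m = mean_vector(rows)
    diff = [xi - mi for xi, mi in zip(x, m)]
    return sum(d * y for d, y in zip(diff, solve(covariance(rows), diff)))


def chi2_cdf_even(t, df):
    """Chi-squared distribution function for an even number of degrees of freedom."""
    half = t / 2.0
    return 1.0 - math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))


def chi2_quantile_even(p, df):
    """Quantile of the chi-squared distribution (even df) by bisection."""
    lo, hi = 0.0, 200.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Toy "active" space: distance to the winning neuron on two SOMs (hypothetical values).
active_space = [(0.2, 0.3), (0.4, 0.1), (0.3, 0.5), (0.5, 0.4), (0.1, 0.2)]
threshold = chi2_quantile_even(0.99, df=2)         # about 9.21 for two representations
is_novel = squared_md((1.5, 0.2), active_space) > threshold
print(round(threshold, 2), is_novel)
```

With the four representations used in this work, `chi2_quantile_even(0.99, df=4)` gives approximately 13.28 as the value of χ2(0.99, 4).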

The χ2 distribution of the Mahalanobis distance is based on the assumption that the data used to calculate it are multivariate normally distributed. Figure 4.6 a) shows a quantile-quantile (Q-Q) plot of the calculated Mahalanobis distances against the theoretical quantiles of the χ2 distribution with 4 degrees of freedom for the PDE 4 training set, while Figure 4.6 b) shows the same plot for the uPA set. The value of the threshold χ2(0.99, 4) is presented as a dotted line. As can be seen from Figure 4.6, the distribution of the Mahalanobis distances deviates from the χ2 distribution: the observed distribution is heavy tailed in comparison to the χ2 distribution. The Mahalanobis distance can still be used. However, due to the heavy tailing towards larger distances, the selected threshold is likely to be conservative, that is, to describe the active space very tightly. Such behavior is useful when a low false-positive ratio is needed. A false positive in the context of this paper is an inactive compound predicted as active.

[Figure 4.6: Q-Q plots in two panels, a) PDE 4 and b) uPA. x-axis: quantiles of the χ2 distribution with 4 degrees of freedom; y-axis: Mahalanobis distance.]

Figure 4.6 Q-Q plot of the calculated Mahalanobis distance of the training set of PDE 4 inhibitors – a), and for the training set of uPA inhibitors – b), against the quantiles of χ2 distribution with 4 degrees of freedom.

4.2.3.3 Selection of SOM size and time complexity

The complexity of projecting a dataset of size m, where each pattern has a dimensionality d, onto a SOM with n neurons is linear in all these sizes, O(mdn). Preliminary experiments have shown that increasing the number of neurons beyond 80% of the size of the training set does not lead to a significant improvement of the results. Thus, a SOM with a number of neurons roughly equal to 0.8 × s, where s is the number of patterns in the training set, was used. The complexity of the similarity search with subsequent data fusion, on the other hand, also scales linearly with m, d and the number of query structures. Since the 0.8 × s neurons take the place of the s query structures, the SOM-based method is about 20% faster than the similarity search for descriptors of equal dimensionality. Given that the autocorrelation vectors used in this study have a 23 times lower dimensionality than the fingerprints, the execution is faster when a single SOM is used. All run times were measured on a 64-bit Athlon 3700+ PC with 1 GB of RAM, running SuSE Linux and using an in-house written program.

4.2.3.4 Selection of the best SOM

The training of a SOM is a stochastic approximation process; thus, several maps usually need to be built and the best one selected. A wide variety of measures for the goodness of mapping has been proposed,41,42 but there is still no uniformly recognized method to evaluate the quality of a particular SOM. Thus, in the following experiments with a single structure representation, ten SOMs were trained and the obtained ranked lists were merged using eq. 4.3. When multiple representations in concert with the Mahalanobis distance were used, the SOM with the lowest quantization error was used for each representation. For a data set containing N patterns, the average quantization error is calculated according to eq. 4.6:

$$ \varepsilon_q = \frac{1}{N} \sum_{i=1}^{N} d(x_i, BMN_{x_i}) \tag{4.6} $$

where d(xi, BMNxi) denotes the distance between pattern xi and its best-matching neuron.
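Eq. 4.6 and the selection of the map with the lowest quantization error can be sketched as below. The sketch assumes that each candidate map is given simply as a flat list of neuron weight vectors and that the Euclidean distance is used; the function names and the toy values are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def quantization_error(patterns, neurons):
    """Average distance between each pattern and its best-matching neuron."""
    return sum(min(euclidean(x, w) for w in neurons) for x in patterns) / len(patterns)


def best_map(patterns, candidate_maps):
    """Index of the candidate map with the lowest average quantization error."""
    errors = [quantization_error(patterns, neurons) for neurons in candidate_maps]
    return min(range(len(errors)), key=errors.__getitem__)


patterns = [(0.0, 0.0), (1.0, 1.0)]
map_a = [(0.0, 0.0), (1.0, 0.0)]   # second pattern is 1.0 away from its BMN
map_b = [(0.0, 0.0), (1.0, 1.0)]   # both patterns coincide with a neuron
print(quantization_error(patterns, map_a))  # (0 + 1) / 2 = 0.5
print(best_map(patterns, [map_a, map_b]))   # 1
```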

4.2.4 Performance measures

4.2.4.1 Recall

Virtual screening involves the sorting of a data set of chemical compounds in order of decreasing probability of activity. Once the whole data set has been sorted a subset of the top ranked compounds is considered. The size of this subset is called rank.

The results of virtual screening experiments are usually reported by using a cumulative recall (most commonly referred to as recall).43 The recall at a given rank is calculated according to eq. 4.7:

$$ \mathit{recall}_r = \frac{F_{act}}{D_{act}} \tag{4.7} $$

where Dact is the number of compounds in the screened dataset which exhibit a given biological activity (i.e., the size of the test set), r is the rank, and Fact is the number of experimentally validated actives recovered at this rank. The recall gives the fraction of the known actives which was recovered in the similarity search at a given rank. It is bounded between zero and one, with one indicating perfect performance.
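A minimal sketch of eq. 4.7 is given below, assuming that the ranked list is a sequence of compound identifiers sorted by decreasing score and that the known actives are available as a set; the function name and the identifiers are hypothetical.

```python
def recall_at_rank(ranked_ids, active_ids, rank):
    """Fraction of the known actives found among the `rank` top-ranked compounds."""
    found = sum(1 for cid in ranked_ids[:rank] if cid in active_ids)
    return found / len(active_ids)


ranked = ["c7", "c2", "c9", "c4", "c1", "c3"]
actives = {"c2", "c4", "c3", "c8"}
print(recall_at_rank(ranked, actives, 4))  # 2 of the 4 actives are in the top 4 -> 0.5
```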

4.2.4.2 Enrichment factor (ef)

The enrichment factor (ef) gives the improvement in the retrieval of active structures at a given rank compared to a random selection with the same rank. It is calculated according to eq. 4.8:

$$ ef = \frac{F_{act}}{r} \times \frac{D_{all}}{D_{act}} \tag{4.8} $$

where Dall is the size of the data set, Dact is the number of compounds in the data set which exhibit a given biological activity, r is the rank, and Fact is the number of experimentally validated actives recovered at this rank. Any method which performs better than a random selection of r compounds has an enrichment factor greater than one.
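Eq. 4.8 can be checked with a short worked example. The numbers below are taken from the AChE entries of Tables 4.2 and 4.5 and from the size of the screened set quoted in section 4.3; that Dact = 284 and Dall = 135,877 were the inputs actually used for Table 4.5 is an assumption made here, and the function name is hypothetical.

```python
def enrichment_factor(f_act, rank, d_act, d_all):
    """Enrichment factor ef = (F_act / r) * (D_all / D_act), eq. 4.8."""
    return (f_act / rank) * (d_all / d_act)


# AChE: 151 actives among the 269 compounds of the intersection list,
# 284 actives in the test set, 135,877 screened compounds (assumed inputs).
ef = enrichment_factor(f_act=151, rank=269, d_act=284, d_all=135_877)
print(round(ef, 1))  # 268.6
```

The result is close to the enrichment of 268 listed for AChE in Table 4.5, which suggests, but does not prove, that the table values were obtained in this way.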

4.2.5 Validation

Two cross-validation-like approaches have been used to validate the performance of the studied virtual screening methods.

In the first approach, the known active structures were randomly split into ten parts. Each one of these parts was kept aside. The training set was selected from the remaining known active structures, and a virtual screening of the whole WOMBAT database was performed. The retrieval of the actives which were kept aside amongst the 1000 top-ranked compounds (after removing all other known actives from the ranked list) was considered. The whole procedure was repeated five times and the average recall over the fifty repetitions is reported. This approach ensures a realistic evaluation of the performance of the corresponding virtual screening method.
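The bookkeeping of this first approach can be sketched as below. The sketch covers only the splitting of the known actives, not the screening itself, and it assumes that all remaining actives form the training set, whereas in this work the training set was selected from them. The function name, the seed and the integer identifiers are hypothetical.

```python
import random


def repeated_tenfold_splits(active_ids, repeats=5, folds=10, seed=0):
    """Yield (training, held_out) pairs of active compounds.

    Each repetition shuffles the actives and deals them into `folds` parts;
    every part is held out once, giving repeats * folds splits in total.
    """
    rng = random.Random(seed)
    ids = list(active_ids)
    for _ in range(repeats):
        rng.shuffle(ids)
        parts = [ids[k::folds] for k in range(folds)]
        for k, held_out in enumerate(parts):
            training = [cid for j, part in enumerate(parts) if j != k for cid in part]
            yield training, held_out


splits = list(repeated_tenfold_splits(range(100)))
print(len(splits))  # 50 train/test splits
```

For each split, the recall of the held-out actives among the 1000 top-ranked compounds would then be computed and averaged over the fifty repetitions.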

The second approach was designed in a more aggressive way. Approximately ten per cent of the known actives were kept aside. However, instead of a random selection, these test sets contained complete clusters, as identified by the Taylor-Butina clustering algorithm. Therefore, the training sets selected from the remaining active structures contained compounds structurally dissimilar to the compounds in the test sets. Since there is only limited variability in the test set selection procedure due to the clustering, the above procedure was executed only once. In this manner the lower bound of the expected performance can be evaluated.

4.3 Results and Discussion

Table 4.2 Recall in per cent at rank 1000 for the five activity classes (the numbers in parentheses give the number of recovered active compounds).

| Activity class (d) | Dataset split | Fingerprints (a) | Autocorr. identity (b) | Autocorr. χσ (b) | Autocorr. α (b) | Autocorr. qtot (b) | Concatenated (c) |
|---|---|---|---|---|---|---|---|
| AChE (284) | random | 65.1 (182) | 29.3 | 28.2 | 36.3 | 36.4 | 70.4 (199) |
| | semi-random | 67.3 (191) | 23.9 | 22.6 | 33.3 | 38.2 | 66.3 (189) |
| | clustering | 73.1 (208) | 24.1 | 27.4 | 33.2 | 37.1 | 66.5 (189) |
| COX-2 (500) | random | 69.4 (344) | 44.1 | 26.3 | 35.1 | 42.7 | 68.4 (339) |
| | semi-random | 79.8 (400) | 48.3 | 26.4 | 32.9 | 40.3 | 66.8 (335) |
| | clustering | 61.6 (312) | 47.8 | 25.2 | 34.8 | 39.8 | 68.3 (341) |
| PDE 4 (229) | random | 92.8 (213) | 36.9 | 34.2 | 44.1 | 38.1 | 72.4 (166) |
| | semi-random | 88.5 (203) | 38.3 | 35.1 | 36.4 | 31.4 | 73.9 (169) |
| | clustering | 94.3 (216) | 33.1 | 30.3 | 40.7 | 33.8 | 74.1 (170) |
| Thrombin (1053) | random | 41.4 (430) | 28.4 | 20.1 | 27.2 | 34.3 | 46.4 (481) |
| | semi-random | 39.2 (409) | 32.7 | 19.9 | 29.4 | 35.2 | 50.2 (529) |
| | clustering | 41.7 (440) | 30.3 | 21.3 | 29.3 | 36.4 | 53.4 (560) |
| uPA (100) | random | 69.4 (69) | 34.2 | 15.9 | 33.1 | 32.5 | 52.3 (52) |
| | semi-random | 74.4 (74) | 32.4 | 19.2 | 36.6 | 46.2 | 70.2 (70) |
| | clustering | 79.7 (80) | 24.2 | 18.1 | 35.7 | 46.7 | 56.4 (56) |

(a) similarity search with subsequent data fusion (method 1 of Figure 4.1, cf. section 4.2.3.1); (b) SOM novelty detection with single structure representation (method 2a of Figure 4.1, cf. section 4.2.3.2.1), fused results of ten networks, descriptor: topological autocorrelation vectors, weighted by the corresponding atomic property; (c) same method as (b), descriptor: the four autocorrelation vectors concatenated; (d) total number of active compounds, given in parentheses after the class name


Table 4.2 gives the recall in per cent and, in parentheses, the absolute number of retrieved actives at rank 1000, i.e., after the thousand top-ranked structures have been retrieved from the screened set of 135,877 compounds. The results for the five activity classes are shown with regard to the descriptors, analysis method, and data set splitting method. The size of the test set, that is, the actives which were the aim of retrieval, is indicated in parentheses for each activity class (approximately half of the total number of all known active structures, cf. Table 4.1).

4.3.1 Training set selection

The utilization of the Taylor-Butina clustering algorithm for selecting the training set gives the best results when a similarity search with Daylight fingerprints and subsequent data fusion is carried out. This tendency is most obvious for the AChE and uPA classes, while in the COX-2 case the results with a clustering-based split are the worst. The semi-random selection offers significant improvements only in the COX-2 case, while the random selection is comparable with the clustering for PDE 4 and Thrombin. The SOM novelty detection was not much affected by the choice of training set. This hints that the compounds in the selected activity classes are relatively evenly distributed across the chemical space defined by the autocorrelation vectors, so that any subset is able to retrieve a similar number of actives. Because quantitative activity information is not always available, or there may not be enough activity spread in the set of actives, the semi-random split cannot always be applied. Except for the COX-2 case, it did not lead to significant improvements. Based on the results in Table 4.2, the split based on clustering is used for the rest of this work. Taylor-Butina clustering is generally accepted44 as the clustering method of choice for the selection of an optimally diverse set of compounds and can easily be integrated into the virtual screening process. However, as can be seen from Table 4.2, one can do reasonably well with random selection, provided that the selected training set is large enough. For splits that preserve only a minor part for training, the clustering method is preferred over random splitting. For example, using only 20% of the available COX-2 inhibitors as a training set, recalls (at rank 1000) of 38% and 57% were obtained with random and clustering-based splits, respectively, using Daylight fingerprints with data fusion.

The split based on Taylor-Butina clustering is used for the rest of this work since this method decreases the variability due to different randomly selected training sets and guarantees an even distribution of the compounds in the training set through the activity space.

4.3.2 Validation

Table 4.3 shows the recall values considering the top-ranked 1000 structures obtained with 5 times 10-fold cross-validation-like experiment based on random splitting.

Table 4.3 Cross-validated recall in per cent at rank 1000 using random test set selection (cf. section 4.2.5). Mean and standard deviation over 50 repetitions (5 times 10-fold cross-validation)

| Activity class | Fingerprints (a): recall | std. dev. | Autocorrelation, concatenated (b): recall | std. dev. | Autocorrelation, χ2 (c): recall | std. dev. |
|---|---|---|---|---|---|---|
| AChE | 77.0 | 6.0 | 72.3 | 6.0 | 61.7 | 6.2 |
| COX-2 | 58.9 | 7.1 | 84.6 | 2.9 | 78.0 | 6.2 |
| PDE 4 | 90.3 | 4.8 | 79.7 | 6.5 | 71.9 | 6.7 |
| Thrombin | 48.6 | 4.2 | 53.1 | 4.1 | 39.0 | 6.4 |
| uPA | 69.8 | 9.8 | 60.1 | 10 | 57.9 | 10.1 |

(a) similarity search with subsequent data fusion (method 1 of Figure 4.1, cf. section 4.2.3.1); (b) SOM novelty detection with single structure representation (method 2a of Figure 4.1, cf. section 4.2.3.2.1), fused results of ten networks, descriptor: the four autocorrelation vectors concatenated; (c) SOM novelty detection with multiple structure representation (method 2b of Figure 4.1, cf. section 4.2.3.2.2)

Values similar to those reported in Table 4.2 were observed. The performance of the similarity search method decreases slightly, while the averaged recall obtained with SOM novelty detection increases. Thus, the gap between the two methods is smaller than suggested solely by Table 4.2. In the case of uPA inhibitors the advantage of the similarity search is much less pronounced in the cross-validated results. For the other activity classes, performances similar to those in Table 4.2 are observed. The results obtained with multiple structure representations and the Mahalanobis distance in the output space (not included in Table 4.2) are in general lower than those obtained with concatenated autocorrelation vectors. The deviation of the SOM output from the expected χ2 distribution, as shown in section 4.2.3.2.2, might be the reason for this behavior.

Table 4.4 shows the recall values considering the top-ranked 1000 structures obtained with 10-fold cross-validation-like experiment, based on leaving out complete clusters.


Table 4.4 Cross-validated recall in per cent at rank 1000 using complete clusters as test set (cf. section 4.2.5). Mean and standard deviation over 10 repetitions (10-fold cross-validation)

| Activity class | Fingerprints (a): recall | std. dev. | Autocorrelation, concatenated (b): recall | std. dev. | Autocorrelation, χ2 (c): recall | std. dev. |
|---|---|---|---|---|---|---|
| AChE | 14.0 | 12.6 | 31.2 | 14.2 | 21.2 | 10.2 |
| COX-2 | 9.5 | 3.8 | 75.8 | 13.7 | 65.0 | 15.6 |
| PDE 4 | 30.6 | 26.1 | 38.9 | 20.7 | 24.5 | 19.6 |
| Thrombin | 3.8 | 3.6 | 21 | 5.5 | 12.1 | 2.7 |
| uPA | 27.5 | 23.0 | 30.8 | 18.2 | 31.8 | 23.7 |

(a) similarity search with subsequent data fusion (method 1 of Figure 4.1, cf. section 4.2.3.1); (b) SOM novelty detection with single structure representation (method 2a of Figure 4.1, cf. section 4.2.3.2.1), fused results of ten networks, descriptor: the four autocorrelation vectors concatenated; (c) SOM novelty detection with multiple structure representation (method 2b of Figure 4.1, cf. section 4.2.3.2.2)

The first thing to note from Table 4.4 is that the obtained recall values are two to ten times lower compared to the values in Table 4.2 and Table 4.3. Another observation is that the SOM novelty detection performed better for all activity classes. The decrease in the performance of the similarity search is hardly surprising. As the name implies, this method relies mainly on similarities between the structures in the training set and those in the screened database. By excluding whole clusters, an artificial situation is created in which a complete group of highly self-similar compounds is not represented at all in the training set. This has led to a more than ten-fold loss of performance in the case of Thrombin inhibitors (recall of 41.7, cf. Table 4.2, against 3.8, cf. Table 4.4). The SOM novelty detection with topological autocorrelation was affected less, leading only to a two-fold loss of performance on average. This observation confirms that by using topological autocorrelation vectors weighted by physicochemical properties additional aspects of the similarity between the structures are covered. Thus, while complete clusters of structurally self-similar molecules were not present in the training set, between 20 and 70 per cent of these excluded structures were retrieved. However, ultimately most machine learning methods depend on at least some similarities between the structures in the training and in the test set. Thus, the results presented in Table 4.4 have to be read as a pessimistic measure of the performance, since large groups of self-similar compounds were intentionally not represented in the training set.

Finally, the very high standard deviations shown in Table 4.4 deserve a note. A closer inspection of the performance in the individual runs has shown that the main source of these high standard deviations is the same for all activity classes. The run in which the largest homogeneous cluster was omitted from the training set always resulted in low recall values, thus creating a high variability in the ten runs. This observation once again demonstrates the dependency of both studied methods on at least some structural similarity between the structures in the training and in the test set.

We conclude this section noting that in a cross-validation-like experiment based on random splitting the obtained recall values shown in Table 4.3 match closely those in Table 4.2. Thus, the expected performance of the methods presented in this study is somewhere around the recall values in Table 4.3. On the other hand, the lower-bound of the performance is shown in Table 4.4. The novelty detection with autocorrelation vectors is expected to give better results in cases where compounds structurally dissimilar from those in the training set are to be retrieved.

4.3.3 Method comparison

Comparing the recall values obtained with the different methods (using the clustering-based split) shows that the SOM novelty technique performs better than or comparably to the Daylight fingerprints when the concatenated autocorrelation vectors are used to describe the structures. The SOM-based novelty detection performs better for COX-2 and Thrombin and comparably for AChE, while for the PDE 4 and uPA sets the fingerprint-based method gives better results. When an autocorrelation vector weighted by a single property is used, the results are generally of lower quality. However, taking into account the much lower dimensionality (11 compared to 1024), the fact that the SOM novelty detection is capable in most cases of recovering around 50% of the actives that are recovered with fingerprints is remarkable.

Using different physicochemical properties enables one to describe different aspects of the active set. Thus, while the identity accounts strictly for structural similarity, i.e. it is equivalent to a histogram of the bond distances in a molecule, the partial charge accounts for the electrostatic properties of the compound, while the polarizability descriptors are related to size and hydrophobic effects. The electronegativity values take into account the hydrogen bonding properties of the compound. Although all these are rather crude descriptions of the underlying effects, based on the connectivity matrix alone, they provide a good basis for this type of virtual screening experiment, in which no quantitative structure-activity correlations are made. Therefore, the different autocorrelation vectors are expected to recover different sets of actives, which explains the significant improvement when the concatenated vector is used. This 44-dimensional concatenated autocorrelation vector has an approximately 23 times lower dimensionality than the 1024-bit Daylight fingerprints. This, together with the smaller size of the SOM, offers a significant run time improvement. Thus, for the Thrombin subset, which is the largest one used in this study, the fingerprint-based method runs in approximately 130 seconds, while projecting the whole test set onto a single network is twice as fast, taking approximately 65 seconds. This advantage is offset by the fact that the recall is hardly related to any of the SOM quality measures, which leads to the requirement of using several networks for a better description of the “active” space.

4.3.4 Complementarity of the methods

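[Figure 4.7: Venn diagrams, panels a) to e), one per activity class. Each entry gives the number of compounds, with the number of actives in parentheses, unique to somNDs / in the intersection / unique to SSDF:
a) AChE: 731 (37) / 269 (151) / 731 (57)
b) COX-2: 687 (95) / 313 (246) / 687 (66)
c) PDE 4: 708 (6) / 292 (164) / 708 (52)
d) Thrombin: 594 (266) / 406 (294) / 594 (146)
e) uPA: 821 (9) / 179 (47) / 821 (33)]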

Figure 4.7 Venn diagrams of the intersection between the 1000 top ranked compounds retrieved by similarity search with subsequent data fusion and Daylight fingerprints (method 1 of Figure 4.1) and SOM novelty detection with concatenated autocorrelation vector (method 2a of Figure 4.1). The number of actives in the intersection and in the unique part of the lists is given in parentheses. Inside each circle the numbers on top add up to the rank value of 1000 and the numbers in parentheses add up to the values given in Table 4.2.

Looking at the recall values in Table 4.2, one can see that neither of the methods was able to recover the full set of known actives. This observation led us to investigate how different the ranked lists obtained by the two methods are. Figure 4.7 shows Venn45 diagrams of the intersection between the fingerprint-based similarity search (method 1 of Figure 4.1) and SOM novelty detection with concatenated autocorrelation vectors (method 2a of Figure 4.1).

4.3.4.1 Intersection

There is a clear difference between the two sets of recovered compounds, the highest intersection in the case of the Thrombin subset being equal to 40% of the total ranked list size. As expected, the majority of the recovered actives are found in the intersection. A look at Figure 4.7 reveals that the intersection between the ranked lists obtained with the two methods (the area shared by both circles) is highly enriched in active structures. The percentage of active structures amongst these compounds found in the intersection of the two lists as well as the corresponding enrichment factors are summarized in Table 4.5.

Table 4.5 Per cent active compounds and enrichment factors obtained when only the compounds found in the intersection between the ranked lists returned by similarity search with data fusion (method 1 in Figure 4.1) and SOM novelty detection with concatenated autocorrelation vectors (method 2a in Figure 4.1) are considered. The number of actives recovered in a ranked list of the same length (cf. column two) by each method alone is shown as well.

| Activity | Intersection: compounds | Intersection: active | Intersection: % actives | Intersection: enrichment | Actives recovered by (a): SSDF | Actives recovered by (a): somNDs |
|---|---|---|---|---|---|---|
| AChE | 269 | 151 | 56 | 268 | 86 | 126 |
| COX-2 | 313 | 246 | 79 | 213 | 166 | 204 |
| PDE 4 | 292 | 164 | 56 | 333 | 159 | 135 |
| Thrombin | 406 | 294 | 72 | 93 | 238 | 276 |
| uPA | 179 | 47 | 26 | 357 | 44 | 39 |

(a) number of active compounds recovered by the corresponding method considering the top-ranked compounds inside a list with the same length as the intersection (column 2).

From the values presented in Table 4.5 it is clear that, starting with the two ranked lists returned by the methods (each list containing 1000 structures) and taking their intersection, a very short list which is highly enriched in active structures can be obtained. In addition, the list obtained by considering the intersection contains from 3 to 75 per cent more active structures than the ranked lists of the same size obtained with each of the methods alone.

4.3.4.2 Union

Table 4.6 Per cent active compounds and enrichment factors obtained when the compounds found in the union between the ranked lists returned by similarity search with data fusion (method 1 in Figure 4.1) and SOM novelty detection with concatenated autocorrelation vectors (method 2a in Figure 4.1) are considered. The number of actives recovered in a ranked list of the same length (cf. column two) by each method alone is shown as well.

| Activity | Union: compounds | Union: active | Union: % actives | Union: enrichment | Actives recovered by (a): SSDF | Actives recovered by (a): somNDs |
|---|---|---|---|---|---|---|
| AChE | 1731 | 245 | 14 | 68 | 238 | 205 |
| COX-2 | 1687 | 407 | 24 | 65 | 366 | 378 |
| PDE 4 | 1708 | 222 | 13 | 77 | 220 | 186 |
| Thrombin | 1594 | 706 | 44 | 57 | 571 | 672 |
| uPA | 1821 | 89 | 5 | 66 | 88 | 64 |

(a) number of active compounds recovered by the corresponding method considering the top-ranked compounds inside a list with the same length as the union (column 2).

Each of the methods, however, is capable of recovering structures which the other method has missed. Thus, by concatenating the unique compounds in the lists a new list is obtained, which is, of course, longer, but still more enriched than the corresponding individual lists of the same length. Considering the Thrombin subset, summing up the numbers on top in both circles (594 + 406 + 594) gives the size of the combined ranked list, 1594. The number of known actives recovered in this combined ranked list, 706, is obtained by summing up the numbers in parentheses (266 + 294 + 146). To compare the result of the above union to the results obtained by each of the methods alone, a ranked list of the same size, 1594, was obtained with each method. The number of recovered actives in these lists (not shown in Figure 4.7) was 571 and 672 for the similarity search with Daylight fingerprints and SOM novelty detection with concatenated autocorrelation vector, respectively. Thus, a union of the two lists at rank 1000 gives a list which is more enriched with actives. The same holds true for the other four sets, as summarized in Table 4.6. Even when the fingerprint similarity search seems to recover almost all actives, as in the case of the PDE 4 subset where the recall is 94% (cf. Table 4.2, clustering split), the SOM novelty detection recovers six of the thirteen actives (6%, cf. Table 4.2) that were missed.
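The bookkeeping behind the intersection and union of two ranked lists can be sketched with plain set operations. The function name and the toy identifiers are hypothetical; the Thrombin counts at the end are those quoted from Figure 4.7 in the text above.

```python
def combine_ranked_lists(list_a, list_b, active_ids):
    """Return (number of compounds, number of actives) for each region of the Venn diagram."""
    set_a, set_b = set(list_a), set(list_b)
    regions = {
        "only_a": set_a - set_b,
        "both": set_a & set_b,
        "only_b": set_b - set_a,
        "union": set_a | set_b,
    }
    return {name: (len(ids), len(ids & active_ids)) for name, ids in regions.items()}


# Toy example with hypothetical identifiers.
som_list = ["m1", "m2", "m3", "m4"]
ssdf_list = ["m3", "m4", "m5", "m6"]
actives = {"m2", "m3", "m4", "m6"}
print(combine_ranked_lists(som_list, ssdf_list, actives))
# {'only_a': (2, 1), 'both': (2, 2), 'only_b': (2, 1), 'union': (6, 4)}

# Thrombin counts from Figure 4.7 as (compounds, actives) per region.
only_som, both, only_ssdf = (594, 266), (406, 294), (594, 146)
union_size = only_som[0] + both[0] + only_ssdf[0]      # 1594
union_actives = only_som[1] + both[1] + only_ssdf[1]   # 706
print(union_size, union_actives)
```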

4.3.5 Scaffold analysis

Ultimately, one of the highly desired properties of any virtual screening method is the ability to recover new active scaffolds. We remind the reader that the term “novelty” as used throughout this text does not refer to the discovery of novel classes of active compounds. SOM novelty detection, like any machine-learning method, relies on some commonalities between the active compounds. While the autocorrelation descriptor attempts to take the chemistry as well as structural features into account, it is still dependent on the underlying chemical structure. Therefore, it is unrealistic to expect the discovery of completely new active chemotypes.

Nevertheless, we were interested in the difference in terms of chemotypes between the structures obtained with SOM novelty detection and with similarity search with subsequent data fusion. To this end, we performed a graph-based scaffold analysis with the help of MeqiSuite (46).

MeqiSuite calculates 66 different graph-based indices. A detailed description of these indices and of the software itself can be found on the MeqiSuite website (http://www.pannanugget.com) and in the technical report “An Introduction to the MeqiSuite Indices” (47). In our work, we considered the ordering index CyclicSystemOrd. This is a composite hierarchical index meant to facilitate the browsing of diverse compound collections. The CyclicSystemOrd index is formed by the concatenation of eleven other MeqiSuite indices. The concatenation is done in a way which groups similar ring systems together. All compounds which do not contain a cyclic system are grouped together as well. Since almost all of the active compounds used in this study have at least one cycle, this particular index was chosen. For all classes of active compounds, a comparison between the unique active structures recovered by each method (the numbers in parentheses in the left- and right-hand circles on Figure 4.7) was performed. These compounds were also compared to the training set. An outline of the scaffold analysis procedure is presented in Figure 4.8 with AChE inhibitors as an example, and the results for all five activity classes are summarized in Table 4.7.

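[Workflow diagram, summarized from left (similarity search with data fusion, method 1 of Figure 4.1) to right (SOM novelty detection, method 2a of Figure 4.1):
training set (284) -> virtual screening model -> screening of WOMBAT (135,309) + test set (284) -> top-ranked 1000 structures per method
active compounds: 57 unique to similarity search, 151 in the intersection, 37 unique to SOM novelty detection
chemotypes of the unique actives: 41 (similarity search) and 28 (SOM novelty detection), intersection 8, leaving 33 and 20 unique chemotypes
comparison with the 136 chemotypes of the training set: similarity search 13 in the intersection and 20 new; SOM novelty detection 9 in the intersection and 11 new]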

Figure 4.8 Outline of the scaffold analysis workflow with AChE inhibitors as an example. Using the training set, a virtual screening model is developed and the whole WOMBAT database plus the test set is screened. The top-ranked 1000 structures are considered. Amongst them, the active structures recovered exclusively by each method are subjected to scaffold analysis by means of the CyclicSystemOrd MeqiSuite index. A comparison between the active chemotypes recovered by each method, and between the recovered chemotypes and the chemotypes contained in the training set, is made. The results for all activity classes are summarized in Table 4.7.


Table 4.7 Number of chemotypes amongst the active compounds recovered exclusively by one of the two methods (cf. Figure 4.8). The MeqiSuite CyclicSystemOrd index was used for the identification of chemotypes with the exception of the last column.

activity    method       unique active   chemotypes   chemotypes   chemotypes not in   chemotypes not in
class                    compounds(a)    total        unique       training set        training set, UnSkCyc(b)
AChE        somNDs(c)    37              28           20           11                  3
AChE        SSDF(d)      57              41           33           20                  6
COX-2       somNDs       95              57           48           26                  6
COX-2       SSDF         66              35           26           10                  2
PDE 4       somNDs       6               6            6            4                   2
PDE 4       SSDF         52              38           38           24                  12
Thrombin    somNDs       266             157          134          64                  18
Thrombin    SSDF         146             112          89           49                  30
uPA         somNDs       9               7            6            2                   1
uPA         SSDF         33              19           18           6                   3

(a) unique active structures recovered by each method (the numbers in parentheses in the left- and right-hand circles on Figure 4.7)
(b) the UnSkCyc****Mqn index ignores atom and bond types and considers the unextended cyclic system
(c) SOM novelty detection with concatenated autocorrelation vectors (method 2a on Figure 4.1)
(d) similarity search with Daylight fingerprints and subsequent data fusion (method 1 on Figure 4.1)

4.3.5.1 AChE inhibitors

A look at the third column of Table 4.7 shows that there were 37 AChE inhibitors recovered only by SOM novelty detection and 57 recovered only by similarity search (these are the same figures as given in parentheses on Figure 4.7 a). The CyclicSystemOrd MeqiSuite index distinguished 28 different chemotypes for SOM novelty detection and 41 for the similarity search (the fourth column in Table 4.7). Eight chemotypes were shared by both sets of recovered structures, thus leaving 20 unique chemotypes recovered by SOM novelty detection alone and 33 unique chemotypes recovered by similarity search alone (the fifth column in Table 4.7).
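The bookkeeping in this paragraph amounts to plain set operations on chemotype keys. The following Python sketch reproduces the AChE counts; the keys are synthetic placeholders (real keys would be CyclicSystemOrd index values), and the function name `compare_chemotypes` is hypothetical.

```python
def compare_chemotypes(keys_a, keys_b):
    """Return (shared, only_a, only_b) chemotype keys for two sets of compounds.

    Each input holds one chemotype key per compound, so compounds with the
    same key collapse into a single chemotype.
    """
    a, b = set(keys_a), set(keys_b)
    return a & b, a - b, b - a


# Synthetic keys: 8 chemotypes shared, 20 seen only by SOM novelty detection,
# 33 seen only by similarity search. Repeated keys stand for several
# compounds of the same chemotype (37 and 57 compounds in total).
shared_keys = [f"shared_{i}" for i in range(8)]
som_keys = shared_keys + [f"som_{i}" for i in range(20)]
som_keys += som_keys[:9]      # 37 compounds, 28 distinct chemotypes
ssdf_keys = shared_keys + [f"ssdf_{i}" for i in range(33)]
ssdf_keys += ssdf_keys[:16]   # 57 compounds, 41 distinct chemotypes

shared, only_som, only_ssdf = compare_chemotypes(som_keys, ssdf_keys)
print(len(shared), len(only_som), len(only_ssdf))  # 8 20 33
```

The same function can be applied to the keys of one method and the keys of the training set to obtain the "not in training set" counts of Table 4.7.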

Five of the twenty unique chemotypes recovered by SOM novelty detection are depicted in Chart 4.1a, while Chart 4.1b shows five of the chemotypes recovered by similarity search alone. As can be seen from Chart 4.1, the SOM novelty detection was able to recover chemotypes missed by the similarity search and vice versa.

Chart 4.1 AChE inhibitors illustrating some of the chemotypes (as perceived by CyclicSystemOrd MeqiSuite index) recovered a) exclusively by SOM novelty detection (method 2a on Figure 4.1) and b) exclusively by similarity search with subsequent data fusion (method 1 on Figure 4.1).

[Chart 4.1: structure drawings not reproduced; panel a shows chemotypes recovered only by SOM novelty detection, panel b chemotypes recovered only by similarity search.]

Another interesting question was whether the methods recover chemotypes not present in the set of query structures. As can be seen from the sixth column in Table 4.7, the CyclicSystemOrd MeqiSuite index distinguished 11 chemotypes amongst the 37 AChE inhibitors recovered exclusively by SOM novelty detection which were not present in the training set. The number of chemotypes not present in the training set was 20 for the set of 57 AChE inhibitors recovered exclusively by similarity search. These numbers suggest that both methods were able to recover scaffolds not present in the training set. A careful examination of the corresponding structures revealed that, although they do have ring systems absent from the training set, the differences are mainly due to the position of certain heteroatoms. Considering only the unextended cyclic-system skeleton index UnSkCyc****Mqn, which does not distinguish between atom and bond types, the number of chemotypes not present in the training set decreased to 3 for the SOM novelty detection and to 6 for the similarity search (the seventh column in Table 4.7). This shows that some structural similarities between the recovered compounds and the set of known actives used as training set do exist.

However, as already mentioned it is unrealistic to expect the recovery of completely new classes of active compounds based on a general structural descriptor. In the case of similarity search with binary fingerprints such new classes are not recoverable since by definition they are dissimilar to the training set. For the SOM novelty with autocorrelation vectors such compounds will be typically predicted as inactive (novel) since they may lie outside the space covered by the training set. From this perspective the recovery of chemotypes not present in the training set is remarkable.

4.3.5.2 COX-2 inhibitors

Chart 4.2a illustrates five of the 57 chemotypes unique to SOM novelty. Chart 4.2b shows five of the 36 chemotypes unique to similarity search.

The same analysis as for the AChE inhibitors was performed on the 95 compounds recovered exclusively by SOM novelty detection and on the 66 COX-2 inhibitors recovered exclusively by similarity search with subsequent data fusion. The corresponding numbers for the different chemotypes are shown in the COX-2 rows of Table 4.7. Observations similar to those for the AChE inhibitors can be made. However, considering that the COX-2 training set was the least diverse one in terms of unique chemotypes (on average 3 structures per chemotype as determined by the CyclicSystemOrd MeqiSuite index), the recovery of some additional chemotypes deserves a note.


Chart 4.2 COX-2 inhibitors illustrating some of the chemotypes (as perceived by CyclicSystemOrd MeqiSuite index) recovered a) exclusively by SOM novelty detection (method 2a on Figure 4.1) and b) exclusively by similarity search with subsequent data fusion (method 1 on Figure 4.1).

[Chart 4.2: structure drawings not reproduced; panel a shows chemotypes recovered only by SOM novelty detection, panel b chemotypes recovered only by similarity search.]

4.3.5.3 PDE 4 inhibitors

In this particular case, the similarity search with data fusion recovered almost all known actives amongst the top-ranked 1000 compounds. Thus, it was interesting to examine whether the six PDE 4 inhibitors recovered exclusively by SOM novelty detection contain different chemotypes than the rest. As can be seen from Table 4.7 (column five, PDE 4 rows), this was indeed the case. In addition, four of the chemotypes recovered by SOM novelty detection were not present in the PDE 4 training set.

Chart 4.3 PDE 4 inhibitors illustrating some of the chemotypes (as perceived by CyclicSystemOrd MeqiSuite index) recovered a) exclusively by SOM novelty detection (method 2a on Figure 4.1) and b) exclusively by similarity search with subsequent data fusion (method 1 on Figure 4.1).

[Chart 4.3: structure drawings not reproduced; panel a shows compounds 1 to 6, recovered only by SOM novelty detection, panel b five compounds recovered only by similarity search.]


Chart 4.3a shows all six PDE 4 inhibitors recovered exclusively by SOM novelty detection. Five compounds representing some of the chemotypes recovered only by similarity search are shown on Chart 4.3b.

A detailed description of the MeqiSuite indices is beyond the scope of this work. However, in the following we provide a short discussion centered on the compounds shown on Chart 4.3a, which should facilitate the understanding of this section. As already mentioned, all six compounds on Chart 4.3a are perceived as representing different chemotypes according to the composite CyclicSystemOrd index, a result which is hardly surprising given the structures shown. Two structures can have the same value for this composite index only when they have the same cyclic system. The CyclicSystemOrd index accounts for the atom and bond types as well; thus, it creates a relatively large number of different chemotypes, i.e., it has high resolution.

On the other hand, although structures 1 and 2 on Chart 4.3a are by all means different chemical entities (structure 1 being a xanthine sulfonamide and structure 2 being a pyrazolopyrimidine-2,4-dione sulfonamide), there is high similarity in their skeletons. To achieve a broader grouping, a MeqiSuite index with lower resolution should be used. Composite indices like the CyclicSystemOrd used here are formed by concatenating some of the other MeqiSuite indices. When a hierarchical ordering is needed, additional care has to be taken to follow the corresponding relationship between the individual indices; the reader is referred to reference 47 for details.

Deleting the individual indices from which such a hierarchical index is built from right to left leads to a decrease in the resolution, and subsequently larger groups of compounds are perceived as belonging to the same chemotype. This is exemplified in column seven in Table 4.7. To obtain the numbers shown in column seven, six indices were successively deleted from the CyclicSystemOrd index, starting with the right-most one. In this manner the resolution was effectively determined by the UnSkCyc***Mqn index. The UnSkCyc***Mqn index provides a description of the unextended cyclic-system skeleton. It treats all atoms as carbon and all bonds as single. The index is “unextended” because the bridging atoms, like the ether bridge in compound 4 on Chart 4.3a, are not taken into account. It should be apparent now that, based on the UnSkCyc***Mqn index, structures 1 and 2 from Chart 4.3a are perceived as identical. Of course, one can argue that structures 1 and 2 still differ in their side chains. While this is certainly true and MeqiSuite does provide the corresponding side-chain and functional-group indices, we will limit the discussion to the cyclic skeleton since, as can be seen from all charts shown, the cyclic skeleton can be seen as the main building block for all classes of active compounds.
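The effect of deleting components from the right of a hierarchical index can be illustrated with a small Python sketch. It assumes, purely for illustration, that a composite index can be represented as a tuple of component values; the three components and their values below are invented and do not correspond to actual MeqiSuite output (the real CyclicSystemOrd index concatenates eleven indices).

```python
from collections import defaultdict


def group_by_truncated_index(composite_index, n_components):
    """Group compounds by the first n_components of their composite index."""
    groups = defaultdict(list)
    for compound, components in composite_index.items():
        groups[tuple(components[:n_components])].append(compound)
    return dict(groups)


# Hypothetical three-component index: (cyclic skeleton, atom types, bond types).
composite_index = {
    "compound_1": ("bicycle_5_6", "atoms_a", "bonds_a"),
    "compound_2": ("bicycle_5_6", "atoms_b", "bonds_a"),
    "compound_3": ("tricycle_6_6_6", "atoms_c", "bonds_b"),
}

full = group_by_truncated_index(composite_index, 3)    # high resolution
coarse = group_by_truncated_index(composite_index, 1)  # skeleton only
print(len(full), len(coarse))  # 3 2
```

At full resolution every compound is its own chemotype, whereas truncation to the skeleton component places compound_1 and compound_2 in one group, which mirrors the behaviour described for structures 1 and 2 of Chart 4.3a.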

We conclude this section by noting that the chemotypes of structures 2, 3, 5, and 6 were not found in the complete set of known actives which were used to build the novelty detector, while structures 3 and 6 from Chart 4.3a were perceived as chemotypes missing from the training set even when the UnSkCyc***Mqn index was used.

4.3.5.4 Thrombin inhibitors

Five examples each of unique chemotypes discovered by SOM novelty detection and by similarity search with subsequent data fusion are shown on Chart 4.4a and Chart 4.4b, respectively.

Observations similar to those for the AChE, COX-2, and PDE 4 classes can be made. The SOM novelty detection has recovered more actives in the top-ranked 1000 structures. This has resulted in a higher number of unique chemotypes, as can be seen from Table 4.7. Compared to the similarity search with Daylight fingerprints, 134 additional chemotypes were recovered by SOM novelty detection.

Chart 4.4 Thrombin inhibitors illustrating some of the chemotypes (as perceived by CyclicSystemOrd MeqiSuite index) recovered a) exclusively by SOM novelty detection (method 2a on Figure 4.1) and b) exclusively by similarity search with subsequent data fusion (method 1 on Figure 4.1).

[Chart 4.4: structure drawings not reproduced; panel a shows chemotypes recovered only by SOM novelty detection, panel b chemotypes recovered only by similarity search.]


4.3.5.5 uPA inhibitors

Six uPA inhibitors representing the six unique chemotypes (cf. Table 4.7, column five) discovered by SOM novelty detection alone are depicted on Chart 4.5a. Five structures representing five of the 18 chemotypes discovered by similarity search alone are shown on Chart 4.5b.

As in the case of PDE 4 inhibitors, although the similarity search recovered almost all known actives, the SOM novelty detection was able to recover additional chemotypes. As can be seen from Chart 4.5, almost all of the structures shown have a terminal amino or amidine group. This is an example where the side-chain and functional-group MeqiSuite indices may be more informative in distinguishing different chemotypes with regard to uPA inhibitory activity.

Chart 4.5 uPA inhibitors illustrating some of the chemotypes (as perceived by CyclicSystemOrd MeqiSuite index) recovered a) exclusively by SOM novelty detection (method 2a on Figure 4.1) and b) exclusively by similarity search with subsequent data fusion (method 1 on Figure 4.1).

[Chart 4.5: structure drawings not reproduced; panel a shows chemotypes recovered only by SOM novelty detection, panel b chemotypes recovered only by similarity search.]


Based on the above discussion, it is clear that the SOM novelty detection method with a 44-dimensional concatenated autocorrelation vector has recovered a significant number of chemotypes which were missed by the similarity search with Daylight fingerprints. Thus, it is a useful tool for retrieving chemotypes which otherwise would have been missed. On the other hand, the similarity search with Daylight fingerprints has recovered chemotypes missed by the SOM novelty detection. Therefore, as already discussed in section 4.3.4, the two methods are by no means mutually exclusive; rather, they complement each other well.

4.3.6 Rejection rates

Until now a comparison between the ranked lists obtained with both methods was made. While this is the only kind of results available from a nearest-neighbor based similarity search, novelty detection has the additional capability of immediately classifying as inactive those structures that are outside the space covered by the training set. This one-class classifier offers the advantage of directly rejecting a significant number of patterns without the need of a numerical threshold that always carries some degree of randomness with it.

[Bar chart: per cent of compounds classified as inactive (y axis, 0 to 80%) for each activity class (x axis: AChE, COX-2, PDE 4, Thrombin, uPA).]

Figure 4.9 Per cent of compounds from the WOMBAT data set immediately classified as inactive by SOM novelty detection with concatenated autocorrelation vector (method 2a on Figure 4.1).

It is commonly accepted (48, 49) that a Tanimoto coefficient greater than 0.85, when using binary fingerprints, indicates similar compounds. Often, in designing targeted libraries, only one compound from such a pair is kept. The validity of this threshold, however, has been subject to criticism (50). By utilizing novelty detection techniques, no artificial threshold is needed, since compounds that are sufficiently far from the chemical space defined by the training set will automatically be classified as novel, i.e., probably inactive. Figure 4.9 shows the per cent of the compounds from the whole WOMBAT data set which were immediately classified as inactive when using SOM novelty detection with concatenated autocorrelation vector.
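For reference, the Tanimoto coefficient of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The Python sketch below represents a fingerprint as the set of its on-bit positions; the bit positions are invented for illustration and are not Daylight fingerprints.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    n_union = len(a | b)
    if n_union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / n_union


# Two invented fingerprints sharing six of their seven on-bits.
fp_1 = {1, 4, 7, 9, 15, 23, 42}
fp_2 = {1, 4, 7, 9, 15, 23, 57}

similarity = tanimoto(fp_1, fp_2)
print(similarity, similarity > 0.85)  # 0.75 False
```

With the 0.85 cutoff, this pair would not be counted as similar even though only one bit differs in each fingerprint, which illustrates how sharply a fixed threshold divides the compounds.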

The SOM ensemble trained on the PDE 4 training set classified 75% of the remaining WOMBAT structures as inactive. Thus, 101,791 compounds were directly classified as unlikely to be PDE 4 inhibitors. The smallest number of compounds directly perceived as inactive was obtained for the uPA subclass, which partially explains the lower recall values produced by the SOM novelty detection (cf. Table 4.2). The number of structures immediately perceived as inactive can be explained well when the structure of the corresponding enzyme is considered. Thus, for the enzymes with a well-defined binding pocket which accommodates fairly typical substrates (COX-2, PDE 4, and Thrombin), the number of structures immediately perceived as inactive is high. On the other hand, AChE is known to be inhibited by at least two different mechanisms. Here, the number of structures classified as inactive therefore decreases. In the extreme case of uPA, an enzyme with a large binding pocket which can accommodate ligands of different structural types, only a small amount (~22%) of the WOMBAT structures are immediately perceived as inactive.

4.3.7 Multiple structure representations with Mahalanobis distance

An interesting observation is the fact that this method provides a very tight description of the “active” space when applied to the five activity classes from Table 4.1. This confirms that the deviation of the distribution of the Mahalanobis distances from the χ² distribution (cf. section 4.2.3.2.2) has indeed led to a conservative threshold.


[Bar chart: per cent actives (y axis, 0 to 100) for each activity class (x axis: AChE, COX-2, PDE 4, Thrombin, uPA), with one bar each for somNDm, SSDF, and somNDs. Numbers printed above the bars: 139, 57, 104, 469, 39.]

Figure 4.10 SOM novelty detection with multiple structure representations and Mahalanobis distance in the output space as a measure of novelty (somNDm, method 2b of Figure 4.1). The number of structures predicted as active by somNDm is shown above the bars. The height of the bars gives the per cent of active compounds amongst those predicted as active and amongst the ranked lists of the same size obtained by the other two methods (SSDF, method 1 of Figure 4.1, and somNDs, method 2a of Figure 4.1). The value χ²(α = 0.99, 4) was used as threshold.

Figure 4.10 shows the number of accepted structures for each active subset above the bars, and the per cent of actives contained in ranked lists of the given size for the three methods discussed so far. The threshold used was χ²(α = 0.99, 4), i.e., the value that encapsulates 99% of the data in the active cluster.
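A threshold of this kind can be computed without external libraries, because the χ² distribution with 4 degrees of freedom has the closed-form cumulative distribution F(x) = 1 − exp(−x/2)(1 + x/2). The Python sketch below makes two assumptions that are not confirmed by the text above: that the squared Mahalanobis distance is the quantity compared with the threshold (the conventional choice), and that the inverse covariance matrix of the active cluster is available. All function names are hypothetical.

```python
import math


def chi2_cdf_df4(x):
    """Cumulative distribution of chi-square with 4 degrees of freedom."""
    return 1.0 - math.exp(-x / 2.0) * (1.0 + x / 2.0)


def chi2_quantile_df4(alpha, tol=1e-10):
    """Value below which a fraction alpha of a chi-square(4) variable falls."""
    low, high = 0.0, 100.0
    while high - low > tol:  # plain bisection on the monotonic CDF
        mid = 0.5 * (low + high)
        if chi2_cdf_df4(mid) < alpha:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


def mahalanobis_squared(x, mean, inv_cov):
    """Squared Mahalanobis distance of x from mean, given the inverse covariance."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    n = len(diff)
    return sum(diff[i] * inv_cov[i][j] * diff[j] for i in range(n) for j in range(n))


threshold = chi2_quantile_df4(0.99)
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
d2 = mahalanobis_squared([1.0, 1.0, 1.0, 1.0], [0.0] * 4, identity)
print(round(threshold, 2), d2 <= threshold)  # 13.28 True
```

A pattern whose squared distance exceeds the threshold would be declared novel, i.e., predicted as inactive; the identity matrix is used here only to keep the example easy to check by hand.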

The number of accepted structures is very low for every active subset; in the extreme case of uPA only 39 structures of the original 135,877 were classified as belonging to the active space. The percentage of actives amongst the accepted structures is always more than 50%; thus, the short lists so obtained are highly enriched in actives, with enrichment factors (ef) of 325 for AChE, 236 for COX-2, 489 for PDE 4, 92 for Thrombin, and 732 for uPA. A comparison with the similarity search with Daylight fingerprints and with the SOM novelty detection with concatenated autocorrelation vectors at these low ranks (given on top of the bars in Figure 4.10) also favors SOM novelty detection with multiple structure representations and Mahalanobis distance in the output space. Such a tight description, however, may not be desirable, since a significant number of actives were wrongly rejected and declared inactive, i.e., the method exhibits a rather high false-negative rate. On the positive side, the high enrichment and the short lists produced may be useful when only a limited number of compounds can be tested, for example in a limited-resource research environment.

In spite of the small number of accepted structures, these still contain some “false” positives (we put “false” in quotes since the majority of the structures have not been tested against the target enzyme; therefore, we do not actually know whether these molecules are inactive). A close examination of the lists revealed that most of the “false” positives are actually inhibitors with a low activity (above 30 μM) which were left out of consideration when building and testing the novelty detector (cf. Materials and Methods). A second group consisted of compounds active against the same target enzyme but in different or unspecified species. The third and probably most interesting group contained inhibitors of other enzymes. It should be stressed that in general such “false” positives are the main target of virtual screening since they are most likely to become the next lead.

To illustrate the above discussion, Chart 4.6 and Chart 4.7 show some of the “false” positives, i.e., structures which were accepted by the SOM novelty detection with multiple structure representations although they are not marked as actives in WOMBAT, for the COX-2 and Thrombin activity classes. All “false” positive structures were clustered using the k-medoids clustering method in the R environment (29) with k = 5, and the five cluster centers identified by the algorithm are shown for both activity classes.

Chart 4.6 COX-2 “false” positives (cluster centers) as identified by SOM novelty detection with multiple structure representations (method 2b of Figure 4.1).

[Chart 4.6: structure drawings not reproduced; the chart shows compounds 1 to 5.]


Chart 4.7 Thrombin “false” positives (cluster centers) as identified by SOM novelty detection with multiple structure representations (method 2b of Figure 4.1).

[Chart 4.7: structure drawings not reproduced; the chart shows compounds 6 to 10.]

The “false” positives shown on Chart 4.6 and the known actives shown on Chart 4.2 exhibit a rather high degree of structural similarity. Thus, it is not surprising that all structures from Chart 4.6 have actually been tested against COX-2. However, all of them were found to be less active than the 30 μM cutoff and therefore were not considered in this study (cf. Materials and Methods).

In the case of the Thrombin subset, compound 7 from the “false” positives shown on Chart 4.7 was found to be highly active against bovine thrombin, while compounds 8 and 9 have been found inactive against human thrombin. Compounds 6 and 10 have not been tested against Thrombin, according to the WOMBAT data set. Two conclusions follow from the above observation. First, the method is very good at recovering structurally similar compounds which appear promising in the eyes of a medicinal chemist. Second, there is still a lot to be desired in distinguishing pure structural similarity from the cause of a given biological activity. Although the autocorrelation vectors were weighted by different physicochemical properties, the topological nature of these descriptors together with their low dimensionality limits the approach. Thus, other chemical descriptor sets or better definitions of the active space may prove useful.

The proposed method shows promise as a fast and reliable alternative, especially when a short list of putative actives is sought, or as a complementary method to the similarity search with Daylight fingerprints. The SOM novelty detection with multiple structure representations also allows the ranking of structures using, e.g., their Mahalanobis distance to the training set. In this manner, structures that are not too distant from the training set can be considered, and a comparison with the other two methods at different ranks can be made.

4.3.8 Comparison at different ranks

Different virtual screening experiments may target a different number of potentially active compounds for a subsequent biological assay. Thus, rather than working at a fixed rank, we compare the methods at several ranks: Table 4.8 shows the enrichment factors obtained by the three methods at ranks 100 and 1000, while Figure 4.11 compares the results of the nearest-neighbor similarity search with Daylight fingerprints and the two types of SOM novelty detection with topological autocorrelation descriptors at different ranks.

Both SOM novelty detection methods exhibit similar performance in the case of the COX-2, PDE 4, and uPA activity classes, while the SOM novelty detection with concatenated autocorrelation vectors is better in the other two cases, as can be seen from Figure 4.11. A look at Table 4.8 confirms the observation (cf. Figure 4.10) that SOM novelty detection with multiple structure representations gives highly enriched lists at low ranks: it outperforms the other two methods in all cases at rank 100. However, its advantage is lost when the size of the ranked list is increased: at rank 1000 it is outperformed in all cases except for the COX-2 class. Thus, SOM novelty detection with multiple structure representations is preferable when a very small subset of lead candidates is required.


Table 4.8 Enrichment factors, obtained by the three methods at ranks 100 and 1000.

Activity class     Screening    Rank 100                    Rank 1000
(no. of actives)   method       # actives   enrichment      # actives   enrichment
                                            factor, ef                  factor, ef
AChE (284)         SSDF(a)      32          153             208         100
AChE (284)         somNDs(b)    62          297             188         90
AChE (284)         somNDm(c)    67          321             150         72
COX-2 (500)        SSDF         74          201             312         85
COX-2 (500)        somNDs       75          204             341         93
COX-2 (500)        somNDm       93          253             355         96
PDE 4 (229)        SSDF         67          398             216         128
PDE 4 (229)        somNDs       61          362             170         101
PDE 4 (229)        somNDm       79          469             162         96
Thrombin (1053)    SSDF         55          71              440         57
Thrombin (1053)    somNDs       76          98              560         72
Thrombin (1053)    somNDm       81          105             391         50
uPA (100)          SSDF         24          326             80          109
uPA (100)          somNDs       24          326             56          76
uPA (100)          somNDm       34          462             55          75

(a) similarity search with Daylight fingerprints followed by data fusion (method 1 of Figure 4.1)
(b) SOM novelty detection with concatenated autocorrelation vector (method 2a of Figure 4.1)
(c) SOM novelty detection with multiple autocorrelation vectors (method 2b of Figure 4.1)
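The enrichment factors in Table 4.8 are consistent with the usual definition, i.e., the fraction of actives in the ranked list divided by the fraction of actives in the whole database. The Python sketch below assumes this definition and a database size of 135,877 structures (the figure quoted in section 4.3.7); both are assumptions of the example rather than statements taken from the table itself.

```python
def enrichment_factor(actives_in_list, list_size, total_actives, database_size):
    """Hit rate in the ranked list divided by the hit rate in the whole database."""
    hit_rate_list = actives_in_list / list_size
    hit_rate_database = total_actives / database_size
    return hit_rate_list / hit_rate_database


DATABASE_SIZE = 135_877  # assumed size of the screened database

# AChE, SSDF, rank 100: 32 of the 284 known actives amongst the top 100.
ef_ache = enrichment_factor(32, 100, 284, DATABASE_SIZE)
# uPA, somNDm, rank 100: 34 of the 100 known actives amongst the top 100.
ef_upa = enrichment_factor(34, 100, 100, DATABASE_SIZE)
print(round(ef_ache), round(ef_upa))  # 153 462
```

Both values agree with the corresponding entries of Table 4.8, which supports the assumed definition for these two rows.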


[Recall plots, one panel per activity class (AChE, COX-2, PDE 4, Thrombin, uPA): recall (y axis, 0.0 to 1.0) versus rank (x axis, 0 to 2500) for SSDF, somNDs, and somNDm, together with the theoretical maximum and the random expectation.]

Figure 4.11 Recall plots for the nearest-neighbor similarity search with Daylight fingerprints (method 1 of Figure 4.1, green), SOM novelty detection with concatenated topological autocorrelation vectors (method 2a of Figure 4.1, blue), and SOM novelty detection based on multiple structural descriptors and Mahalanobis distance in the output space (method 2b of Figure 4.1, red). Also the theoretical maximum (solid black line) and the random expectation (dotted line) are shown.

Comparing the SOM novelty detection method using a single concatenated autocorrelation vector with the similarity search with Daylight fingerprints and subsequent data fusion, the SOM novelty detection performs better for the COX-2 and Thrombin classes, while for PDE 4 and uPA the similarity search was better. In the case of the AChE activity class, the tendency of the SOM novelty detection methods to recover a higher number of actives at low ranks can be clearly seen, with the fingerprint method catching up at rank 800. The terms “better” and “worse” are used for comparative purposes only; the two methods are not mutually exclusive. As has already been shown in section 4.3.4 (cf. Figure 4.7), merging the ranked lists is a valuable way of improving the virtual screening results. Moreover, as has been demonstrated in section 4.3.5, the SOM novelty detection succeeded in discovering chemotypes which are missed by the similarity search with data fusion and vice versa.
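Recall curves of the kind shown in Figure 4.11 can be computed from a ranked list and the set of known actives. The Python sketch below uses a toy ranking with invented identifiers; the function names are hypothetical, and the reference curves follow the usual definitions (all actives ranked first for the theoretical maximum, actives spread uniformly over the database for the random expectation).

```python
def recall_at_ranks(ranked_ids, actives, ranks):
    """Fraction of the known actives found within the top k entries, for each k >= 1."""
    actives = set(actives)
    found = 0
    cumulative = []
    for compound in ranked_ids:
        if compound in actives:
            found += 1
        cumulative.append(found)
    return {k: cumulative[min(k, len(cumulative)) - 1] / len(actives) for k in ranks}


def theoretical_maximum(rank, n_actives):
    """Recall if every active were ranked ahead of every inactive."""
    return min(rank, n_actives) / n_actives


def random_expectation(rank, database_size):
    """Expected recall when the actives are spread uniformly over the database."""
    return min(rank, database_size) / database_size


ranked = ["a", "x", "b", "y", "z", "c"]  # toy ranked list, best first
known_actives = {"a", "b", "c"}
recall = recall_at_ranks(ranked, known_actives, ranks=[1, 3, 6])
print({k: round(v, 2) for k, v in recall.items()})  # {1: 0.33, 3: 0.67, 6: 1.0}
```

Plotting these values against the rank for each method, together with the two reference curves, gives a plot of the same type as Figure 4.11.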

4.4 Conclusions

Two different methods for novelty detection with Self-Organizing Maps were used for a retrospective ligand-based virtual screening of the WOMBAT database. One method used a single structure representation and data fusion, while the other used multiple representations in concert with a Mahalanobis distance measure. The structures were described by topological autocorrelation functions weighted by atomic physicochemical properties. The results were compared with a traditional similarity search method based on Daylight fingerprints and subsequent data fusion. In addition, different methods for selecting an initial set of targets from a larger set were compared. Based on the presented results we conclude that:

1) Both SOM novelty detection techniques – with single structure representation (method 2a of Figure 4.1) and with multiple structure representations (method 2b of Figure 4.1) – based on topological autocorrelation descriptors are useful for ligand-based virtual screening.

2) The Taylor-Butina clustering is the method of choice for a subset selection, especially when a small representative set of actives is needed.

3) Using a 44-dimensional autocorrelation vector and SOM novelty detection gives better (COX-2, Thrombin) or comparable results to the similarity search with Daylight fingerprints and data fusion.

4) Using a 44-dimensional autocorrelation vector and SOM novelty detection is twice as fast as the similarity search with Daylight fingerprints and data fusion when a single network is used.

5) Small sets of compounds highly enriched in active structures can be obtained by considering the intersection between the ranked lists obtained by a combined application of SOM novelty detection with single structure representation and of similarity search with subsequent data fusion.

6) The SOM novelty detection method with a 44-dimensional concatenated autocorrelation vector complements the Daylight fingerprints based similarity search. Better enriched lists were obtained by merging these results.

7) The SOM novelty detection method with a 44-dimensional concatenated autocorrelation vector recovered a significant number of chemotypes which were missed by the similarity search.

8) The SOM novelty detection method with a 44-dimensional concatenated autocorrelation vector is applicable as a library design tool for discarding a large number of compounds which are unlikely to possess a given biological activity, without the need for an artificial threshold.

9) Using multiple structure representations in concert with a Mahalanobis distance recovers between 34% and 93% of the actives in the top 100 ranked structures. This corresponds to enrichment factors between 105 and 470. Thus, it is the recommended method when a short list of lead candidates is required.
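The enrichment factors quoted in the last point can be illustrated with a small calculation. The sketch below assumes the commonly used definition (hit rate in the top-ranked subset divided by the hit rate of the whole database), which may differ in detail from the one applied in this study. The function name and the toy numbers are hypothetical and are not taken from the results.

```python
def enrichment_factor(n_actives_in_top, n_top, n_actives_total, n_database):
    """Hit rate in the top-ranked subset divided by the hit rate of the whole database."""
    hit_rate_top = n_actives_in_top / n_top
    hit_rate_database = n_actives_total / n_database
    return hit_rate_top / hit_rate_database

# Toy numbers (assumed, not from the study): 50 of the top 100 compounds are active,
# and the database holds 200 actives among 100,000 structures.
ef = enrichment_factor(50, 100, 200, 100_000)
print(f"enrichment factor: {ef:.0f}")
```

With these toy numbers the top 100 list is about 250 times richer in actives than a random selection of the same size.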

4.5 Acknowledgments

Part of this work was funded by National Institutes of Health grant U54 MH074425-01 (National Institutes of Health Molecular Libraries Screening Center Network) and by the New Mexico Tobacco Settlement Fund (T.I.O.).

4.6 References

(1) Lyne, P. D. Structure-Based Virtual Screening: an Overview. Drug Discov. Today 2002, 7, 1047-1055.

(2) Taylor, R. D.; Jewsbury, P. J.; Essex, J. W. A Review of Protein-Small Molecule Docking Methods. J. Comput.-Aided Mol. Des. 2002, 16, 151-166.

(3) Willett, P.; Barnard, J. M.; Downs, G. M. Chemical Similarity Searching. J. Chem. Inf. Model. 1998, 38, 983-996.

(4) Kearsley, S. K.; Sallamack, S.; Fluder, E. M.; Andose, J. D.; Mosley, R. T.; Sheridan, R. P. Chemical Similarity Using Physiochemical Property Descriptors. J. Chem. Inf. Model. 1996, 36, 118-127.

(5) Fechner, U.; Franke, L.; Renner, S.; Schneider, P.; Schneider, G. Comparison of Correlation Vector Methods for Ligand-Based Similarity Searching. J. Comput.-Aided Mol. Des. 2003, 17, 687-698.

(6) Evers, A.; Hessler, G.; Matter, H.; Klabunde, T. Virtual Screening of Biogenic Amine-Binding G-Protein Coupled Receptors: Comparative Evaluation of Protein- and Ligand-Based Virtual Screening Protocols. J. Med. Chem. 2005, 48, 5448-5465.

(7) Nikolova, N.; Jaworska, J. Approaches to Measure Chemical Similarity - a Review. QSAR Combinat. Sci. 2003, 22, 1006-1026.

(8) Whittle, M.; Willett, P.; Klaffke, W.; van Noort, P. Evaluation of Similarity Measures for Searching the Dictionary of Natural Products Database. J. Chem. Inf. Model. 2003, 43, 449-457.

(9) Gillet, V. J.; Willett, P.; Bradshaw, J. Similarity Searching Using Reduced Graphs. J. Chem. Inf. Model. 2003, 43, 338-345.

(10) Sheridan, R. P.; Miller, M. D.; Underwood, D. J.; Kearsley, S. K. Chemical Similarity Using Geometric Atom Pair Descriptors. J. Chem. Inf. Model. 1996, 36, 128-136.

(11) Whittle, M.; Gillet, V. J.; Willett, P.; Alex, A.; Loesel, J. Enhancing the Effectiveness of Virtual Screening by Fusing Nearest Neighbor Lists: A Comparison of Similarity Coefficients. J. Chem. Inf. Model. 2004, 44, 1840-1848.

(12) Hert, J.; Willett, P.; Wilton, D. J.; Acklin, P.; Azzaoui, K.; Jacoby, E.; Schuffenhauer, A. Comparison of Fingerprint-Based Methods for Virtual Screening Using Multiple Bioactive Reference Structures. J. Chem. Inf. Model. 2004, 44, 1177-1185.

(13) Bender, A.; Jenkins, J. L.; Glick, M.; Deng, Z.; Nettles, J. H.; Davies, J. W. "Bayes Affinity Fingerprints" Improve Retrieval Rates in Virtual Screening and Define Orthogonal Bioactivity Space: When Are Multitarget Drugs a Feasible Concept? J. Chem. Inf. Model. 2006, 46, 2445-2456.

(14) Hert, J.; Willett, P.; Wilton, D. J.; Acklin, P.; Azzaoui, K.; Jacoby, E.; Schuffenhauer, A. Enhancing the Effectiveness of Similarity-Based Virtual Screening Using Nearest-Neighbor Information. J. Med. Chem. 2005, 48, 7049-7054.

(15) Marsland, S. Novelty Detection in Learning Systems. Neural Comput. Surv. 2003, 3, 157-195.

(16) Markou, M.; Singh, S. Novelty Detection: a Review - Part 1: Statistical Approaches. Signal Process. 2003, 83, 2481-2497.

(17) Markou, M.; Singh, S. Novelty Detection: a Review - Part 2: Neural Network Based Approaches. Signal Process. 2003, 83, 2499-2521.

(18) Bauknecht, H.; Zell, A.; Bayer, H.; Levi, P.; Wagener, M.; Sadowski, J.; Gasteiger, J. Locating Biologically Active Compounds in Medium-Sized Heterogeneous Datasets by Topological Autocorrelation Vectors: Dopamine and Benzodiazepine Agonists. J. Chem. Inf. Model. 1996, 36, 1205-1213.

(19) Zupan, J.; Gasteiger, J. Neural Networks in Chemistry and Drug Design; Wiley-VCH: Weinheim, 1999.

(20) Vesanto, J. SOM-Based Data Visualization Methods. Intell. Data Anal. 1996, 3, 111-126.

(21) Zhang, S.; Ganesan, R.; Xistris, G. D. Self-Organising Neural Networks for Automated Machinery Monitoring Systems. Mech. Syst. Signal. Pr. 1996, 10, 517-532.

(22) Wong, M. L. D.; Jack, L. B.; Nandi, A. K. Modified Self-Organising Map for Automated Novelty Detection Applied to Vibration Signal Monitoring. Mech. Syst. Signal. Pr. 2006, 20, 593-610.

(23) Noeske, T.; Sasse, B. C.; Stark, H.; Parsons, C. G.; Weil, T.; Schneider, G. Predicting Compound Selectivity by Self-Organizing Maps: Cross-Activities of Metabotropic Glutamate Receptor Antagonists. ChemMedChem 2006, 1, 1066-1068.

(24) Teckentrup, A.; Briem, H.; Gasteiger, J. Mining High-Throughput Screening Data of Combinatorial Libraries: Development of a Filter to Distinguish Hits From Nonhits. J. Chem. Inf. Model. 2004, 44, 626-634.

(25) Selzer, P.; Ertl, P. Applications of Self-Organizing Neural Networks in Virtual Screening and Diversity Selection. J. Chem. Inf. Model. 2006, 46, 2319-2323.

(26) Olah, M.; Mracec, M.; Ostopovici, L.; Rad, R.; Bora, A.; Hadaruga, N.; Olah, I.; Banda, M.; Simon, Z.; Mracec, M.; Oprea, T. I. WOMBAT: World of Molecular Bioactivity. In Cheminformatics in Drug Discovery; Oprea, T. I., Ed.; Wiley-VCH: New York, 2003, pp. 223-241

(27) Taylor, R. Simulation Analysis of Experimental Design Strategies for Screening Random Compounds As Potential New Drugs and Agrochemicals. J. Chem. Inf. Model. 1995, 35, 59-67.

(28) Butina, D. Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets. J. Chem. Inf. Model. 1999, 39, 747-750.

(29) R Development Core Team. R: A language and environment for statistical computing, version 2.2.1, 2005, http://www.r-project.org (accessed 01.2006).

(30) Daylight Chemical Information Systems Inc. http://www.daylight.com/dayhtml/doc/theory/theory.finger.html (accessed 06.2006).

(31) Sykora, V., Chemical Descriptors Library, http://cdelib.sourceforge.net (accessed 06.2006)

(32) Moreau, G.; Broto, P. Autocorrelation of a Topological Structure: A New Molecular Descriptor. New J. Chem. 1980, 4, 359-360.

(33) Spycher, S.; Nendza, M.; Gasteiger, J. Comparison of Different Classification Methods Applied to a Mode of Toxic Action Data Set. QSAR Combinat. Sci. 2004, 23, 779-791.

(34) PETRA - Parameter Estimation for the Treatment of Reactivity Applications, version 4.0, Molecular Networks GmbH, Erlangen, Germany, 2006, http://www.molecular-networks.com (accessed 06.2006).

(35) Hutchings, M. G.; Gasteiger, J. Residual Electronegativity - an Empirical Quantification of Polar Influences and Its Application to the Proton Affinity of Amines. Tetrahedron Lett. 1983, 24, 2541-2544.

(36) Gasteiger, J.; Hutchings, M. G. New Empirical Models of Substituent Polarisability and Their Application to Stabilisation Effects in Positively Charged Species. Tetrahedron Lett. 1983, 24, 2537-2540.

(37) Gasteiger, J.; Marsili, M. Iterative Partial Equalization of Orbital Electronegativity - a Rapid Access to Atomic Charges. Tetrahedron 1980, 36, 3219-3228.

(38) Hollas, B. An Analysis of the Autocorrelation Descriptor for Molecules. J. Math. Chem. 2003, 33, 91-101.

(39) ADRIANA.Code, version 1.0, Molecular Networks GmbH, Erlangen, Germany, 2006, http://www.molecular-networks.com (accessed 06.2006).

(40) De Maesschalck, R.; Jouan-Rimbaud, D.; Massart, D. L. The Mahalanobis Distance. Chemometr. Intell. Lab. 2000, 50, 1-18.

(41) Vesanto, J.; Sulkava, M.; Hollmén, J. On the Decomposition of the Self-Organizing Map Distortion Measure, In Proceedings of the Workshop on Self-Organizing Maps (WSOM'03), Kitakyushu, Japan, 2003, 11-16

(42) Pölzlbauer, G. Survey and Comparison of Quality Measures for Self-Organizing Maps. In Proceedings of the Fifth Workshop on Data Analysis (WDA'04), Sliezsky dom, Vysoké Tatry, Slovakia, 2004, Paralic, J., Pölzlbauer, G., and Rauber, A., Eds, Elfa Academic Press, Košice, 2004.

(43) Edgar, S. J.; Holliday, J. D.; Willett, P. Effectiveness of Retrieval in Similarity Searches of Chemical Databases: A Review of Performance Measures. J. Mol. Graph. Model. 2000, 18, 343-357.

(44) Olah, M.; Bologa, C.; Oprea, T. I. Strategies for Compound Selection. Curr. Drug Discov. Technol. 2004, 1, 211-220.

(45) Murdoch, D. Venn Diagrams in R. J. Statist. Soft. 2004, 11, Code Snippet 1.

(46) MeqiSuite, version 2.30, Pannanugget Consulting L.L.C., Kalamazoo, MI, USA, http://www.pannanugget.com (accessed 20.03.2007).

(47) Johnson, M. An Introduction to the MeqiSuite Indices. Technical Report 2006/001, Pannanugget Consulting, Inc., Kalamazoo, MI, USA, 2006

(48) Patterson, D. E.; Cramer, R. D.; Ferguson, A. M.; Clark, R. D.; Weinberger, L. E. Neighborhood Behavior: A Useful Concept for Validation of "Molecular Diversity" Descriptors. J. Med. Chem. 1996, 39, 3049-3059.

(49) Matter, H. Selecting Optimally Diverse Compounds From Structure Databases: A Validation Study of Two-Dimensional and Three-Dimensional Molecular Descriptors. J. Med. Chem. 1997, 40, 1219-1229.

(50) Martin, Y. C.; Kofron, J. L.; Traphagen, L. M. Do Structurally Similar Molecules Have Similar Biological Activity? J. Med. Chem. 2002, 45, 4350-4358.

Further comments and discussion

The study presented in this chapter has demonstrated that novelty detection techniques are useful in the field of ligand-based virtual screening. A few additional remarks about this class of machine learning techniques deserve attention.

Usually, novelty detection techniques are applied when there is an abundance of data for one of the possible states of the system while data for the other state are hard to obtain. Consider, for example, machine fault detection – an area well suited for the application of novelty detection techniques and investigated in a number of studies.1-4 In this case, data for normally working machines are very easy to collect. On the other hand, data from a faulty machine are hard or, in some cases (e.g. a completely broken machine), impossible to collect. Therefore, there is an abundance of data for the normally operating machine and these data provide the basis for novelty detection. Transferring this approach directly to the field of ligand-based virtual screening would mean setting the known active compounds aside and using all other compounds from the database to build a representation of the inactive class. This representation can later be used to decide if a new structure is inactive. However, this approach is hampered by the fact that, in most commercial databases, the compounds not marked as active against a given biological target have usually not been tested against this target. That is, strictly speaking, it is unjustified to consider them proven inactive compounds. As a result, any attempt to model the inactive space is likely to include a considerable number of compounds that are actually active, which may result in subsequent misclassification. For that reason, in the study presented in this chapter we have chosen to model the space covered by the known active structures, in spite of the fact that it is actually the less represented class. Thus, in a sense, the idea of novelty detection – to use the data from the most represented class – is inverted due to the peculiarities of the database used.
On the other hand, in proprietary databases of modern pharmaceutical companies there are a lot of structures which have been proven inactive in a high-throughput screening experiment. If such a database is to be screened by novelty detection techniques the natural approach is to use the data on these inactive structures for the development of a novelty detector.

The novelty detection techniques can additionally benefit from the availability of both active and proven inactive structures. While in such cases the building of full-blown classification models is in theory possible, it is still hampered by the highly unbalanced distribution of the data. In a database with 10,000 proven inactive structures and only 100 known active structures most of the standard classification methods will encounter problems in finding the decision border between the two classes. On the other hand, the novelty detection techniques can benefit from the limited amount of information for the other class. The improvement is usually so substantial that artificially generated data are often used to mimic the opposite class.5,6

Finally, a note of caution is given concerning the comparison between novelty detection and similarity search methods presented in this chapter. The main goal of this study was to examine the applicability of the novelty detection technique to the problem of ligand-based similarity searching. Since similarity search with different types of binary fingerprints is presently in daily use, our results were compared against this method. However, the use of the two methods with different structure representations raises the question whether the estimated difference in performance is due to the method or to the descriptor. This question, amongst others, is investigated in the next chapter.

Additional References

(1) Zhang, S.; Ganesan, R.; Xistris, G. D. Self-Organising Neural Networks for Automated Machinery Monitoring Systems. Mech. Syst. Signal. Pr. 1996, 10, 517-532.

(2) Ypma, A.; Duin, R. P. W. Novelty Detection Using Self-Organizing Maps. In Progress in Connectionist-Based Information Systems; Springer: London, 1997.

(3) Qing, Z.; Zhihan, X. Design of a Novel Knowledge-Based Fault Detection and Isolation Scheme. IEEE T. Syst. Man Cy. B 2004, 34, 1089-1095.

(4) Wong, M. L. D.; Jack, L. B.; Nandi, A. K. Modified Self-Organising Map for Automated Novelty Detection Applied to Vibration Signal Monitoring. Mech. Syst. Signal. Pr. 2006, 20, 593-610.

(5) Tax, D. M. J.; Duin, R. P. W. Uniform Object Generation for Optimizing One-Class Classifiers. J. Mach. Learn. Res. 2001, 2, 155-173.

(6) Markou, M.; Singh, S. A Neural Network-Based Novelty Detector for Image Sequence Analysis. IEEE T. Pattern Anal. 2006, 28, 1664-1677.

5 Virtual screening applications – a study of ligand-based methods and descriptors in four different scenarios

Overview

We continue our exploration in navigating through large chemical databases. In addition to the WOMBAT database, which was used in the previous chapter, in this chapter the MDL Drug Data Report (MDDR) database is used as well. Thus, the number of screened compounds is doubled.

The studies presented in the following section are focused mainly on the application side of virtual screening. Four practical scenarios in which ligand-based virtual screening can bring additional knowledge out of a large chemical database are identified. Each of these scenarios is exemplified and discussed in detail. The inclusion of MDDR allows us to study the difference between the chemical spaces covered by two chemical databases, which are expected to be relatively similar. In addition, the use of two databases allows the investigation of an important topic for any machine learning method – the bias towards retrieving compounds from the same database which was used to select the training set of compounds. While the methods used – similarity search with data fusion and SOM novelty detection – remain the same, an extensive study of their applicability in the four virtual screening scenarios is presented. An algorithm to handle binary data together with SOM novelty detection is presented. The similarity search with data fusion was performed with all structure representations under investigation. This allows a fair comparison between the performance of the two methods and the three structure representations which were studied. The utility of a vectorial structure representation based on the 3-dimensional structure of the molecules – the radial distribution function – was investigated. Finally, various ways of assessing the performance of a virtual screening experiment in each of the four scenarios are presented and discussed.

The remainder of this chapter corresponds exactly to the original paper submitted for publication in the Journal of Computer-Aided Molecular Design other than the numbering.

Original Article

Virtual screening applications – a study of ligand-based methods and different structure representations in four different scenarios

Dimitar Hristozov1, Tudor I. Oprea2, and Johann Gasteiger1*

1Computer-Chemie-Centrum, Universität Erlangen-Nürnberg, Nägelsbachstr. 25, D-91052 Erlangen, Germany

2Division of Biocomputing, University of New Mexico School of Medicine, MSC 11 6145, 1 University of New Mexico, Albuquerque, New Mexico 87131-0001, USA

Hristozov, D., Oprea, T.I., Gasteiger, J., 2007, submitted to J. Mol. Graphics Model.

Abstract

Four different ligand-based virtual screening scenarios are studied: 1) prioritizing compounds for subsequent high-throughput screening (HTS); 2) selecting a predefined (small) number of potentially active compounds from a large chemical database; 3) assessing the probability that a given structure will exhibit a given activity; 4) selecting the most active structure(s) for a biological assay. Each of the four scenarios is exemplified by performing retrospective ligand-based virtual screening for eight different biological targets using two large databases - MDDR and WOMBAT. A comparison between the chemical spaces covered by these two databases is presented. The performance of two techniques for ligand-based virtual screening – similarity search with subsequent data fusion (SSDF) and novelty detection with Self-Organizing Maps (ndSOM) – is investigated. Three different structure representations – 2048-dimensional Daylight fingerprints, topological autocorrelation weighted by atomic physicochemical properties (sigma electronegativity, polarizability, partial charge, and identity) and radial distribution functions weighted by the same atomic physicochemical properties – are compared. Both methods were found applicable in scenario one. The similarity search was found to perform slightly better in scenario two while the SOM novelty detection is preferred in scenario three. No method/descriptor combination achieved significant success in scenario four.

5.1 Introduction

Virtual screening of compound libraries is often employed for the selection of subsets of chemical structures, which are enriched in active compounds.1,2,3,4 Ligand-based methods are frequently applied when the 3D structure of the biological target of interest is unknown. The ligand-based methods rely on a representative set of reference structures, molecular descriptors, and an appropriate similarity measure.2,5 Usually the result of these virtual screening methods is a ranked list of the screened compounds. Highly ranked compounds in such a list are assumed to share the activity of the reference structures.
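The three ingredients named above (reference structures, descriptors, similarity measure) and the resulting ranked list can be sketched in a few lines. This is a minimal illustration under stated assumptions: fingerprints are represented as Python sets of on-bit positions, the similarity measure is the Tanimoto coefficient, and several references are combined with a MAX fusion rule. The function names and toy fingerprints are hypothetical, and the fusion rule is not fixed by the text at this point.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def fused_ranking(references, database):
    """Rank database entries by their best (MAX-fused) similarity to any reference."""
    scores = {
        name: max(tanimoto(fp, ref) for ref in references)
        for name, fp in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy fingerprints (assumed data, for illustration only).
references = [{1, 2, 3, 4}, {10, 11, 12}]
database = {
    "cpd_a": {1, 2, 3, 5},      # resembles the first reference
    "cpd_b": {10, 11, 12, 13},  # resembles the second reference
    "cpd_c": {20, 21},          # unrelated to both references
}
ranking = fused_ranking(references, database)
for name, score in ranking:
    print(name, round(score, 2))
```

Compounds at the head of `ranking` are the ones assumed to share the activity of the reference structures.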

The retrieval of relevant structures on the top of the ranked list is the broad aim of virtual screening experiments. This aim can be further split into more concrete scenarios, depending on the available resources and on the concrete goal of the experiments. In this work, we have identified and will discuss four such scenarios.

5.1.1 Scenario 1: Prioritizing compounds for a subsequent HTS (SC.1).

In this scenario a large database of potentially active compounds is screened and the compounds are ordered in descending order according to the score assigned by the virtual screening method. Afterwards a certain percentage of the top-ranked compounds are selected and evaluated in a high-throughput screening (HTS) campaign. The assumption here is, as the term high-throughput implies, that a large proportion of the compounds in the original database will be screened. Proprietary compound libraries containing millions of structures are nowadays common in the pharmaceutical industry.6 Therefore, eliminating even a small percentage may decrease the costs of an HTS campaign significantly. In such cases it is important that a virtual screening method is able to guarantee that no potentially active structures will be missed, i.e., a small false-negative ratio is required. Since a relatively large number of compounds will still be selected, it is important to determine the size of the ranked list which provides the best trade-off between recovered active structures (true-positives) and structures of unknown activity (false-positives). A useful virtual screening method in this scenario has to perform better than a random picking of compounds.

5.1.2 Scenario 2: Selecting compounds for a subsequent lead-optimization (SC.2).

In this scenario even halving the number of available compounds is not likely to be sufficient. In contrast to an HTS campaign the lead-optimization process requires a considerable amount of human intervention. Asking a medicinal chemist to work with half a million compounds is not reasonable. Therefore, a small number of possibly active compounds is required. In such a scenario a virtual screening method should not only guarantee a performance better than random selection but also that as many actives as possible are retrieved at the beginning of the ranked list – the so-called “early recognition” problem. This problem is even more pronounced in a university research laboratory where the resources are usually limited. In this scenario it is irrelevant if all actives have been retrieved in the remaining part of the list (after the predefined number of compounds has been reached) since this part will not be examined. The (relatively) low number of considered structures usually makes an examination of the retrieved compounds by a chemist possible. This scenario – using a different predefined number of the considered compounds, usually between 1 and 10 per cent of the screened database – has been by far the most common application of different virtual screening studies.7-12

5.1.3 Scenario 3: Is a given compound active? (SC.3)

Another possible use of a virtual screening method is to assess the probability that a given structure is active. That is, instead of screening a large dataset, can we provide an answer for a single, already available structure? This question arises often in library design problems. When deciding which structures to include into the chemical library being designed it is beneficial to assess the probability that these structures will exhibit a certain activity.13 Then, depending on the goal of the structure library which is being designed either structures close to the chemical space of the compounds known to be active or rather far from this space may be purchased.

Usually the above question is answered with the help of a qualitative or quantitative structure-activity relationship (QSAR) model, dedicated to the activity in question. It should be stressed that any QSAR model can be used as a virtual screening device. However, such models are usually built using a limited amount of training data and are unable to extrapolate too far from the chemical space covered by these training data. Therefore, it will be beneficial if a virtual screening method – which is usually faster and does not attempt to make quantitative predictions – can be applied in such a context. A common approach in this direction is to threshold the similarity metric used at a given (perhaps arbitrary) value.14,15 When more than one known active is used (through data fusion) the problem of determining the value of the threshold may become more pronounced. An attractive approach, which handles SC.3 implicitly, is based on so-called novelty detection techniques16,17 or one-class classifiers.18

5.1.4 Scenario 4: Identification of the most active compound (SC.4).

A possible application of a virtual screening method is the selection of a “best” structure out of a set of potentially active compounds. The “best” in this context is defined as the structure which will have the highest activity. This question is usually answered with the help of a quantitative structure-activity model. However, provided that a virtual screening experiment has been performed, it may be possible to determine such a “best” structure using only the returned ranked list. In this scenario, the actual ranking of the potentially active structures is important, with the assumption that the higher a compound is ranked the higher its activity is. This kind of experiment is common in structure-based virtual screening experiments based on docking. In these experiments the docking scores are usually correlated with the activity values.19 However, docking experiments require that the 3D structure of the target is known.

The aim of the present work was to test the applicability of different ligand-based virtual screening methods and chemical structure representations in each of the above scenarios. Thus, in the rest of this paper we first briefly describe the chemical databases used and two different methods for ligand-based virtual screening – similarity search with subsequent data fusion and novelty detection with self-organizing maps. Next, we introduce the measures which will be used to assess the success of our retrospective virtual screening experiments. Then, the results of the application of each of these methods for retrospective virtual screening for eight different biological targets in two large databases – MDL Drug Data Report (MDDR)20 and WOrld of Molecular BioAcTivity (WOMBAT)21 – using different types of structural descriptors are presented. Different aspects of the results are discussed: the optimal size of the training set, the difference in the chemical spaces covered by MDDR and WOMBAT, the bias introduced by the training set selection, and the differences in the compounds recovered by different methods and/or descriptors. Finally, the best method-descriptor combination is identified for each scenario.

5.2 Materials and Methods

5.2.1 Chemical databases

Two databases were used: MDL Drug Data Report (MDDR), version 2006.1, and World Of Molecular BioAcTivity (WOMBAT), version 2006.01.

MDDR, version 2006.1, contains 159,662 structures together with the associated activity classes. A total of 149,414 molecules were used after removal of duplicates and molecules that could not be processed by some of the used computer programs.

WOMBAT, version 2006.01, contains 154,236 chemical compounds, collected from articles in medicinal chemistry journals published between 1975 and 2006. In addition to the structural information, WOMBAT also contains the reported activity values, expressed as pKi values, information about the species in which the tests were performed, the biological role of the structure (inhibitor, antagonist, etc.), as well as additional properties of interest. A total of 118,346 chemical structures were used after removing duplicates, the structures which were also found in MDDR, the structures with reported activity less than 30 μmolar, and molecules that could not be processed by some of the computer programs used.

5.2.2 Biological targets

Eight different biological targets were subjected to retrospective virtual screening and are summarized in Table 5.1. The activity classes are referred to by using their WOMBAT activity IDs through the rest of this article.

Table 5.1 Subsets of active structures used in this study

                                  MDDR                      WOMBAT
activity name                     activity ID   # actives   activity ID   # actives
5HT3 antagonists                  06233         775         5-HT3         635
5HT1A agonists                    06235         953         5-HT1A        2524
D2 antagonists                    07701         487         D2            2877
renin inhibitors                  31420         1188        renin         583
angiotensin II AT1 antagonists    31432         2158        AT1           1361
thrombin inhibitors               37110         1122        thrombin      1841
HIV protease inhibitors           71523         971         HIV-1 P       2422
protein kinase C inhibitors       78374         545         PKC           152

5.2.3 Virtual screening protocol

The set of known active structures (referred to as “training set” from here on) was always selected from MDDR. The compounds selected as training set were removed from MDDR and were not considered when evaluating the performance. The protocol consisted of the following steps:

1) clustering the known actives in MDDR using the Taylor-Butina22,23 clustering algorithm;

2) randomly selecting a percentage of known actives from each cluster such that the total size of the training set equals a predefined number;

3) merging the structures from MDDR and WOMBAT and subjecting the resulting database (which contained 267,760 structures) to a similarity search with subsequent data fusion (SSDF) and to novelty detection with SOM (ndSOM) using the selected training set;

4) measuring separately the ability of each method/descriptor combination to recover the active structures from MDDR and from WOMBAT.

The results of a single experiment following the described procedure depend on the particular training set and may result in too optimistic (or too pessimistic) performance estimates. To eliminate this effect, steps 2 and 3 were repeated ten times and the means and standard deviations of the corresponding performance metrics were calculated. The actives from MDDR and from WOMBAT were separated in step 4 and the performance was assessed separately in order to investigate the bias introduced by selection of the training set from a particular database (MDDR in this case).
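The training set selection (clustering the actives, then drawing a share from each cluster until a predefined total is reached) can be sketched as follows. This is a simplified illustration, not the implementation used in the study: fingerprints are sets of on-bit positions, the clustering is a basic variant of the Taylor-Butina scheme with neighbour lists computed once, and the per-cluster share is allocated proportionally with a largest-remainder rule. The similarity threshold, the function names and the toy data are assumptions.

```python
import random

def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def butina_clusters(fps, threshold):
    """Simplified Taylor-Butina clustering; returns lists of indices into fps."""
    n = len(fps)
    neighbours = [
        {j for j in range(n) if j != i and tanimoto(fps[i], fps[j]) >= threshold}
        for i in range(n)
    ]
    # Visit candidate centroids from the longest to the shortest neighbour list.
    order = sorted(range(n), key=lambda i: len(neighbours[i]), reverse=True)
    assigned, clusters = set(), []
    for i in order:
        if i in assigned:
            continue
        members = [i] + sorted(neighbours[i] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters

def pick_training_set(clusters, size, rng):
    """Draw `size` members in total, spreading the picks proportionally over clusters."""
    total = sum(len(c) for c in clusters)
    shares = [len(c) * size / total for c in clusters]
    quotas = [int(s) for s in shares]
    # Hand out the picks lost to rounding, largest fractional part first.
    by_remainder = sorted(range(len(clusters)),
                          key=lambda k: shares[k] - quotas[k], reverse=True)
    for k in by_remainder[: size - sum(quotas)]:
        quotas[k] += 1
    return [m for c, q in zip(clusters, quotas) for m in rng.sample(c, q)]

# Hypothetical toy fingerprints of six "known actives" (assumed data).
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {10, 11, 12}, {10, 11, 13}, {20, 21}]
clusters = butina_clusters(fps, threshold=0.5)
training = pick_training_set(clusters, size=3, rng=random.Random(0))
print(clusters, training)
```

Repeating `pick_training_set` with different random seeds would give the ten different training sets over which the performance metrics are averaged.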

The above protocol was augmented with an additional bootstrapping step when evaluating the “early recognition” capabilities of the studied virtual screening methods. After a ranked list was obtained, a number of known active structures were removed at random in a way which leaves only 150 known actives for the subsequent evaluation, and this removal was repeated ten times. This procedure was suggested by Truchon et al.24 and ensures that the BEDROC evaluation (see below) is not prone to saturation effects. The saturation effect appears when there are too many known actives at the beginning of the ranked list, which leads to an early recognition metric with low discriminative power between methods. In addition, the bootstrapped evaluation gives a better estimate of the standard deviation associated with the BEDROC metric.
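The removal of actives described above can be sketched as a resampling of the ranked list. The sketch assumes that the ranked list is available as a sequence of booleans (True for a known active) ordered from best to worst score, and that removed actives are dropped from the list rather than relabelled. The function name and the toy list are hypothetical, and a value of 2 stands in for the 150 actives kept in the study.

```python
import random

def subsample_actives(ranked_labels, keep, repeats, rng):
    """Return copies of a ranked list in which only `keep` randomly chosen actives remain."""
    active_positions = [i for i, is_active in enumerate(ranked_labels) if is_active]
    samples = []
    for _ in range(repeats):
        kept = set(rng.sample(active_positions, keep))
        # Removed actives are dropped; all remaining compounds keep their relative order.
        samples.append([
            is_active for i, is_active in enumerate(ranked_labels)
            if not is_active or i in kept
        ])
    return samples

# Toy ranked list with four actives (assumed data); keep two of them, ten times.
ranked_labels = [True, True, False, True, False, False, True, False]
samples = subsample_actives(ranked_labels, keep=2, repeats=10, rng=random.Random(1))
print(len(samples), samples[0])
```

An early recognition metric evaluated on each element of `samples` then yields a mean and a standard deviation over the ten reduced lists.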

5.2.4 Assessing the performance

The performance of a virtual screening method in each of the discussed scenarios cannot be assessed by the same metric. A survey of the available performance measures focused mainly on SC.1 and SC.2 can be found in ref. 25. Recently, a detailed investigation of the performance metrics, focused on the “early recognition” problem (SC.2 in the present work) was reported.24 In the following we will limit ourselves to giving a short overview of the measure(s) we have selected for each of the discussed scenarios.

5.2.4.1 Scenario 1: Prioritizing compounds for a subsequent HTS

There are two important questions to be answered in this kind of virtual screening experiment: a) is the virtual screening method able to prioritize compounds better than a random ordering, and b) at which size can the ranked list be truncated so as to provide the best trade-off between the number of false-positives and recovered actives or, in other words, what portion of the ranked list should be examined to guarantee that a given amount of known actives has been recovered?

The area under the Receiver Operating Characteristic (ROC) curve26,27 provides an answer to the first of the above questions. At the same time the ROC curve itself can be used to find an answer to the second question in an easy to comprehend graphical manner.

The use of ROC curves to assess the performance of virtual screening experiments is well-established.28,29 A ROC curve plots the number of true active compounds (true positives) included in the sample, expressed as a percentage of the total number of known actives, on the vertical axis against the number of structures with unknown activity (false positives) included in the sample, expressed as a percentage of the total number of structures with unknown activity, on the horizontal axis. Sample ROC curves and their corresponding areas are presented in Figure 5.1.

[Figure 5.1: plot of true positives (active compounds, 0 to 100%) on the vertical axis against false positives (compounds with unknown activity, 0 to 100%) on the horizontal axis. Legend: single run ROC curve (area: 0.94); average ROC curve (area: 0.93).]

Figure 5.1 Sample ROC curves. The jagged ROC curve is obtained in a single run, while the smoothed ROC curve results from averaging ten runs with different training sets. The diagonal line shows the expected ROC curve for random picking. The selection of the desired number of compounds is illustrated assuming that 60% of the known actives should be recovered.

The jagged line represents a single run while the smoothed line is obtained after averaging the corresponding points over ten runs. The diagonal line represents the expected ROC curve for a random selection of active structures. As can be seen from Figure 5.1, one can easily answer both of the above questions using the ROC curve. The further the ROC curve lies above the diagonal, the better the virtual screening method performs compared to random selection. To determine the position at which to truncate the ranked list in a way providing a good compromise between recovered true actives and structures of unknown activity, one simply has to select the desired percentage of true actives on the vertical axis and read the corresponding value on the horizontal axis.

This is illustrated on Figure 5.1 – if we want to recover 60% of the known actives, the ranked list is likely to contain around 15% of the compounds with unknown activity. Since Figure 5.1 represents the results of a real virtual screening run – retrieval of WOMBAT 5-HT3 antagonists – we can translate these percentages into numbers – that is, to recover 381 of the 635 5-HT3 antagonists available in WOMBAT we need to consider approximately the 40,333 top-ranked compounds – 39,952 (ca. 15%) of unknown activity plus the 381 known actives.

However, answering the first question – how well we perform compared to random selection – from the ROC curve alone is prone to errors since it is hard to quantify by eye how far above the diagonal a given ROC curve lies. A quantitative measure is needed and the area under the ROC curve has been identified as such.26,27 Some of the desirable properties of the area under the ROC curve include: it has a well-defined statistical meaning; it has a value of 0.5 or less if the ranking method does not perform better than random selection; it can be interpreted as the probability that an active will be ranked before an inactive; it has a value between zero (worst performance) and one (best performance). The area under the ROC curve was claimed for a long time to be independent of the ratio of actives and inactives. However, this assumption was recently questioned.24 Nevertheless, it is a useful metric with an easy to grasp graphical representation, especially when no assumptions about the size of the ranked list are made a priori or when the best size in terms of the ratio of true actives to structures of unknown activity has to be determined.
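A minimal Python sketch of the ROC construction described above is given here. It assumes that the ranked list is available as a sequence of booleans (best-ranked first, `True` for a known active); the function names and the use of the trapezoidal rule are choices made for this illustration.

```python
def roc_points(labels):
    """Return ROC points (false positives %, true positives %) for a ranked list.

    `labels` is ordered from best to worst rank; True marks a known active,
    False a structure of unknown activity (an assumed input format).
    """
    n_active = sum(1 for is_active in labels if is_active)
    n_unknown = len(labels) - n_active
    tp = fp = 0
    points = [(0.0, 0.0)]
    for is_active in labels:
        if is_active:
            tp += 1
        else:
            fp += 1
        points.append((100.0 * fp / n_unknown, 100.0 * tp / n_active))
    return points


def roc_area(points):
    """Area under the ROC curve by the trapezoidal rule, rescaled to 0..1."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / 1e4  # both axes are expressed in per cent


def cutoff_for_recall(points, target_tp_percent):
    """False-positive percentage at which the desired recall is first reached."""
    for fp_percent, tp_percent in points:
        if tp_percent >= target_tp_percent:
            return fp_percent
    return 100.0
```

For the six-structure list `[True, True, False, True, False, False]`, eight of the nine active/inactive pairs are ordered correctly, and the computed area equals 8/9, in line with the probabilistic interpretation mentioned above.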

5.2.4.2 Scenario 2: Selecting compounds for a subsequent lead-optimization

In this scenario the required final size of the ranked list is known a priori. Furthermore, by definition this size is small enough to decrease significantly the costs of the subsequent experiments. Typically, 0.1 to 10 per cent of the top-ranked structures will be examined, depending on the size of the full database. Thus, the retrieval of actives as early as possible in the ranked list is desirable – the so-called “early recognition” problem. In such experiments the area under the ROC curve is not a useful measure since it gives the same weight to an active compound recovered at the very beginning of the ranked list and to another active compound recovered towards the end of the ranked list. In a recent survey on evaluating virtual screening experiments focused on the early recognition problem, Truchon et al.24 proposed the use of the Boltzmann-Enhanced Discrimination of ROC (BEDROC) metric. The BEDROC metric is calculated according to Equation 5.1:

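$$\mathrm{BEDROC} = \frac{\sum_{i=1}^{n} e^{-\alpha r_i / N}}{\frac{n}{N}\left(\frac{1 - e^{-\alpha}}{e^{\alpha / N} - 1}\right)} \times \frac{R_a \sinh(\alpha / 2)}{\cosh(\alpha / 2) - \cosh(\alpha / 2 - \alpha R_a)} + \frac{1}{1 - e^{\alpha (1 - R_a)}} \qquad (5.1)$$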

where n is the number of known active structures, N is the number of inactive structures (or structures with unknown activity), ri is the rank of the i-th active structure, Ra is the ratio of active to inactive structures n/N, and α is a weighting factor, which controls the “early recognition” element – higher α values move the region of importance towards the beginning of the ranked list.

The main advantage of BEDROC over the ROC area is that the recovered actives are exponentially weighted according to their rank. Thus, a much higher weight is given to actives recovered early in the list compared to actives recovered towards the end of the list. For a detailed description of this metric together with a mathematical derivation and proofs the reader is referred to the original work of Truchon et al.24 The authors of ref. 24 provide guidance for selecting the exponential term α such that a chosen share of the BEDROC score is determined by a chosen top percentage of the ranked list. In the present work an α value of 32.2 was used. Therefore, 80% of the corresponding BEDROC score was based on the top-ranked five per cent of the compounds in the original database.
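Equation 5.1 and the choice of α can be checked with a short Python sketch. Two assumptions are made here and are not taken from the text: ranks are 1-based, and N is taken as the length of the full ranked list, so that ri/N is a relative rank between zero and one (when actives are a small fraction of the database this is close to the number of structures with unknown activity). The helper `score_share` uses the exponential weighting to estimate which share of the score stems from a given top fraction of the list.

```python
import math


def bedroc(active_ranks, n_total, alpha=32.2):
    """BEDROC according to Equation 5.1.

    active_ranks: 1-based ranks of the known actives (assumed convention).
    n_total: length of the ranked list, used as N (an assumption, see lead-in).
    """
    n = len(active_ranks)
    ra = n / n_total
    observed = sum(math.exp(-alpha * r / n_total) for r in active_ranks)
    # Expected value of the sum for a uniformly random ranking of the actives.
    expected = ra * (1.0 - math.exp(-alpha)) / (math.exp(alpha / n_total) - 1.0)
    scale = (ra * math.sinh(alpha / 2.0)
             / (math.cosh(alpha / 2.0) - math.cosh(alpha / 2.0 - alpha * ra)))
    offset = 1.0 / (1.0 - math.exp(alpha * (1.0 - ra)))
    return observed / expected * scale + offset


def score_share(top_fraction, alpha=32.2):
    """Share of the exponential weight that falls into the top fraction of the list."""
    return (1.0 - math.exp(-alpha * top_fraction)) / (1.0 - math.exp(-alpha))
```

With α = 32.2, `score_share(0.05)` evaluates to approximately 0.80, which corresponds to the statement that 80% of the score is based on the top five per cent of the list. A ranking with all actives at the top yields a BEDROC of one, and a ranking with all actives at the bottom yields zero.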

5.2.4.3 Scenario 3: Is a given compound active?

In this scenario we are interested in whether a given structure is likely to share the activity of the set of query structures or not. This formulation brings the virtual screening method close to a classification task. In contrast to a standard classifier the training is performed using only known actives. This kind of machine learning is usually termed novelty detection or one-class classification.16,17 However, the classification error rate is unlikely to be a good measure of the performance since most of the tested structures are likely to be inactive. Therefore, a method which predicts every structure as inactive will achieve an error rate close to zero.

The standard metrics for assessing the performance of a classification task, recall and precision,30 can be applied in this case. These metrics can be calculated from a 2x2 contingency table, such as the one shown in Table 5.2.


Table 5.2 A 2 by 2 contingency table used to calculate different performance metrics

            Retrieved   Not retrieved   Total
Active      Nar         Nan             Na
Inactive    Nir         Nin             Ni
Total       Nr          Nn              N

Recall measures the percent of active structures retrieved at a given size of the ranked list (r) and is calculated according to Equation 5.2:

$$\mathrm{recall}_r = \frac{N_{ar}}{N_a} \qquad (5.2)$$

Consider, for example, a database of one thousand structures in which one hundred known actives are available. The database is screened, the top ranked five hundred structures are obtained, and the recall at 500 (r=500) is calculated according to Equation 5.2. If all one hundred known actives are contained in the top-ranked five hundred structures, the recall at 500 equals one and the virtual screening method has achieved perfect performance.

The precision is calculated according to Equation 5.3:

$$\mathrm{precision}_r = \frac{N_{ar}}{N_r} \qquad (5.3)$$

The precision (also known as positive predictive value) measures how many of the retrieved structures are actually active. Thus, a precision of one indicates that all structures predicted as active are indeed active, while a low value indicates too many false positives. In SC.3, the precision can be read as the probability that a structure predicted as active is indeed active, while the recall shows what fraction of the active structures was found; its complement is the fraction of actives missed, i.e. predicted as inactive.
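Equations 5.2 and 5.3 can be sketched in a few lines of Python. The function name and the representation of the ranked list as a list of identifiers are assumptions made for illustration.

```python
def recall_precision_at(ranked_ids, active_ids, r):
    """Recall (Eq. 5.2) and precision (Eq. 5.3) for the top r ranked structures."""
    retrieved = ranked_ids[:r]
    # N_ar: known actives among the retrieved structures
    n_ar = sum(1 for cid in retrieved if cid in active_ids)
    recall = n_ar / len(active_ids)   # N_ar / N_a
    precision = n_ar / len(retrieved)  # N_ar / N_r
    return recall, precision
```

For the example given above (one thousand structures, one hundred actives, all of them among the top five hundred) the recall at 500 is one, while the precision is only 0.2, since four out of five retrieved structures are not known actives.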

5.2.4.4 Scenario 4: Identification of the most active compound

This scenario requires a different approach altogether. The objective here is to compare two ranked lists – one in which the a priori known active structures are ranked according to their activity and a second one produced by the virtual screening method. Such a procedure is a common tool for evaluating different docking methods19 when a structure-based virtual screening is performed.

Consider, for example, a list of ten known active structures sorted according to a measured activity value. After the virtual screening run has been completed, these ten structures will be sorted according to the chosen similarity measure. If there is a perfect correspondence between this ranking and the activity, the two lists – the one sorted by the activity and the one sorted by similarity – will contain each structure at the same position.

There is a set of so-called rank agreement measures, which can be utilized to measure the agreement between two ranked lists. These include rank correlations, such as Spearman’s ρ or Kendall’s τ, the ndpm measure,31 etc. A detailed discussion of all rank agreement metrics is beyond the scope of this work. Suffice it to say that we have selected Kendall’s τ for the evaluation of SC.4. Kendall’s τ was selected mainly because it handles weak ordering slightly better than Spearman’s ρ. It is calculated according to Equation 5.4:

$$\tau = \frac{C - D}{\sqrt{(C + D + TR) \times (C + D + TP)}} \qquad (5.4)$$

In this equation, C stands for the number of concordant pairs – pairs of structures that the virtual screening method predicts in the properly ranked order. D stands for the number of discordant pairs – pairs that the virtual screening method predicts in a wrong order. TR is the number of pairs of structures in the true ordering (the ranking determined by the activity values) that have tied ranks (i.e., the same activity) while TP is the number of pairs of structures in the predicted ordering that have tied ranks (the same similarity coefficient).

Kendall’s τ calculated with Equation 5.4 has the following properties: 1) If the agreement between the two rankings is perfect (i.e., the two rankings are the same) the coefficient has the value of one; 2) If the disagreement between the two rankings is complete (i.e., one ranking is the reverse of the other) the coefficient has a value of minus one; 3) For all other arrangements the value lies between -1 and 1, and increasing values imply increasing agreement between the rankings. If the rankings are completely independent, the coefficient has a value of zero.
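A minimal Python sketch of Equation 5.4 follows. It assumes that the square-root form of the denominator is intended and that a pair tied in both orderings is counted neither in TR nor in TP, so that two identical rankings with ties still give a value of one, in line with property 1 above; both points are assumptions of this sketch. Higher activity and higher similarity are taken to mean a better rank.

```python
from itertools import combinations
from math import sqrt


def kendall_tau(true_values, predicted_values):
    """Kendall's tau following Equation 5.4.

    true_values: measured activities; predicted_values: similarity scores.
    Assumption: pairs tied in both lists are ignored (not added to TR or TP).
    """
    c = d = tr = tp = 0
    pairs = zip(true_values, predicted_values)
    for (t1, p1), (t2, p2) in combinations(pairs, 2):
        dt, dp = t1 - t2, p1 - p2
        if dt == 0 and dp == 0:
            continue          # tied in both orderings
        if dt == 0:
            tr += 1           # tied in the true ordering only
        elif dp == 0:
            tp += 1           # tied in the predicted ordering only
        elif dt * dp > 0:
            c += 1            # concordant pair
        else:
            d += 1            # discordant pair
    denominator = sqrt((c + d + tr) * (c + d + tp))
    # The coefficient is undefined if everything is tied; 0.0 is returned here.
    return (c - d) / denominator if denominator else 0.0
```

Calling `kendall_tau([1, 2, 3, 4], [1, 2, 3, 4])` gives one and reversing one of the lists gives minus one, as stated in properties 1 and 2.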

5.2.5 Virtual screening methods

5.2.5.1 Similarity search with subsequent data fusion (SSDF)

The similarity search starts with a known active structure, usually called “target” or “query”. After this structure has been described by a given representation, all compounds from the screened database are compared to it by means of a similarity coefficient. The screened database is then sorted in descending order according to the values of the similarity coefficient. The compounds most similar to the query end up on top of this list.

The procedure described so far requires a single known active. Data fusion is usually applied32,33 to adapt it to a case when more than one known active structure is available. Starting with n known actives (training set) a similarity search is carried out with each of them in turn. This results in n separate ranked lists. There are different methods to combine the similarity scores from these lists, called “fusion rules”. In the present work, the MAX rule was used, meaning that each screened structure j obtained a final score, equal to the maximum value of its individual scores, collected from each of the ranked lists, according to Equation 5.5:

$$S_{FUS}(j) = \max_{i}\left[S^{*}(i, j)\right] \qquad (5.5)$$

where S* denotes the calculated similarity score between the query structure i and the screened structure j. The similarity score used in this work was the Tanimoto coefficient. The Tanimoto coefficient for binary structure representation assumes values between zero and one and is calculated according to Equation 5.6:

$$S_T = \frac{c}{a + b - c} \qquad (5.6)$$

where a is the number of bits “on” in the representation of the query structure, b is the number of bits “on” in the screened structure, and c is the number of bits “on” in both, i.e., the intersection of the two representations.

The above formula can be adapted to real-valued structure representations as shown in Equation 5.7:

$$S(x; y) = \frac{\sum_{i=1}^{N} x_i y_i}{\sum_{i=1}^{N} (x_i)^2 + \sum_{i=1}^{N} (y_i)^2 - \sum_{i=1}^{N} x_i y_i} \qquad (5.7)$$

where x and y are the corresponding real valued vectors of size N describing structures X and Y. The coefficient thus calculated assumes values between -0.333 and one.
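The similarity coefficients and the MAX fusion rule can be sketched in Python as follows. Binary fingerprints are represented here as sets of “on” bit positions, which is an assumption made for brevity, as are the function names and the convention of returning zero when both representations are empty. The real-valued form uses the difference in the denominator, by analogy with the binary formula and in agreement with the stated lower bound of about -0.333.

```python
def tanimoto_binary(query_bits, screened_bits):
    """Tanimoto coefficient for two sets of 'on' bit positions."""
    a, b = len(query_bits), len(screened_bits)
    c = len(query_bits & screened_bits)  # bits 'on' in both representations
    return c / (a + b - c) if (a + b - c) else 0.0  # 0.0 if both are empty (assumed)


def tanimoto_real(x, y):
    """Tanimoto coefficient for two real-valued vectors of equal length."""
    xy = sum(xi * yi for xi, yi in zip(x, y))
    denominator = sum(xi * xi for xi in x) + sum(yi * yi for yi in y) - xy
    return xy / denominator if denominator else 0.0  # 0.0 for two zero vectors (assumed)


def max_fusion(queries, database, similarity):
    """Rank the database by the MAX rule.

    queries: representations of the known actives (training set).
    database: dict mapping a structure name to its representation.
    Returns (name, fused score) pairs sorted in descending order of the score.
    """
    fused = {name: max(similarity(q, rep) for q in queries)
             for name, rep in database.items()}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

As a check on the range of the real-valued coefficient, `tanimoto_real([1, 0], [-1, 0])` returns -1/3, the lower bound, while two identical non-zero vectors give one.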

5.2.5.2 Novelty detection with Self-Organizing Maps (ndSOM)

A detailed description of two variants of this virtual screening method has been reported recently.18 In this work, ndSOM with a single structure representation was used. Briefly, the method requires a set of known actives (training set) from which a Self-Organizing Map (SOM) is built. Once the SOM has been obtained a local accuracy is determined from the map and the screened database is projected onto this SOM. If the distance between a screened structure and its best matching neuron (BMN) is larger than the local accuracy this structure is deemed novel, that is, unlikely to share the biological activity of the training set.

In our previous work,18 the local accuracy was determined using the average distance between a neuron and its neighbors. However, further experiments have shown that this is prone to rather large fluctuations from map to map due to spots of low data density, which create “cliffs” in the map. Therefore, in the current work a global accuracy (ga) rather than a local accuracy was used. To obtain the global accuracy, the training set is projected onto the trained SOM and the distance between each training structure and its best-matching neuron is calculated. The largest such distance can be used as a global accuracy. However, if an outlier is present in the training set, the maximum distance may become very large. To avoid this, a global accuracy which classifies approximately 5% of the training structures as novel, i.e. inactive, was used. After the global accuracy was obtained, the screened structures were scored according to Equation 5.8:

$$S_{SOM}(j) = \frac{ga}{ga + d(j, BMN_j)} \qquad (5.8)$$

where ga is the global accuracy of the map and d(j,BMNj) denotes the distance between the structure j from the screened set and its best-matching neuron. The score obtained with Equation 5.8 is bounded between zero and one. Any screened structure for which the calculated score has a value below 0.5, i.e. whose distance to its best-matching neuron exceeds the global accuracy, is classified as inactive. This modification reduces the fluctuations from map to map significantly and allowed us to use a single SOM rather than ten, as described previously. All SOMs used had a hexagonal lattice and were trained using the batch algorithm.34
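The scoring step can be sketched in Python under the assumption that the distances between the structures and their best-matching neurons have already been computed by a SOM implementation. The function names and the exact way the 5% threshold is picked from the sorted training distances are illustrative choices, not the original implementation.

```python
def global_accuracy(training_distances, novel_fraction=0.05):
    """Distance threshold that leaves about `novel_fraction` of the training set outside.

    training_distances: distance of each training structure to its
    best-matching neuron (assumed to be precomputed).
    """
    ordered = sorted(training_distances)
    n_novel = int(round(len(ordered) * novel_fraction))
    index = max(0, len(ordered) - n_novel - 1)
    return ordered[index]


def som_score(distance_to_bmn, ga):
    """Score according to Equation 5.8."""
    return ga / (ga + distance_to_bmn)


def is_inactive(distance_to_bmn, ga):
    """A structure is deemed novel (inactive) if its score is below 0.5."""
    return som_score(distance_to_bmn, ga) < 0.5
```

A score below 0.5 is equivalent to a distance larger than ga, so the threshold on the score and the threshold on the distance select the same structures.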

5.2.5.2.1 Self-organizing map for binary structure descriptors

In addition to studying the applicability of different virtual screening methods in the aforementioned four different scenarios, we were also interested in comparing different structure representations. However, the classic SOM algorithm has been developed for working with real valued vectors. To provide a fair comparison with SSDF with binary fingerprints and to study the applicability of ndSOM with this kind of descriptor, we have implemented a version of the Self-Organizing Map algorithm capable of handling binary data.

The easiest way to train a SOM using binary fingerprints is to regard them as real valued descriptors, utilize the standard SOM algorithm and possibly transform the adapted weights back to binary strings at the end of the training. However, this algorithm can be very slow, given the high dimensionality of the binary fingerprints. Another possibility is to adapt a version of a batch SOM algorithm, initially designed for building SOMs on string data.34 The main idea is to update the weights based on the generalized median of the set of binary fingerprints, S, which form the Voronoi region of the winning neuron and its neighborhood.

The generalized median is a binary string which has the minimum sum of distances to all other binary strings in the set. That is, it minimizes:

$$p = \underset{p \in U}{\arg\min} \sum_{q \in S} d(p, q) \qquad (5.9)$$

where U is the set of all possible binary strings of the given length and d(p,q) is the distance between two binary patterns p and q. By using this algorithm, ndSOM can be easily applied to binary structure representations. We have used one minus the Tanimoto similarity coefficient as the distance measure, d, in Equation 5.9. Alternatively, one can use the Tanimoto similarity coefficient directly and replace argmin in the above equation with argmax.
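Equation 5.9 can be illustrated with a small Python sketch. An exhaustive search over U is only feasible for very short strings, so the sketch optionally restricts the candidates to a supplied collection, for example the set S itself (the so-called set median). This restriction is a technique swapped in for illustration and is not claimed to be the strategy used in this work; the function names are hypothetical.

```python
from itertools import product


def tanimoto_distance(p, q):
    """One minus the Tanimoto coefficient of two equal-length bit tuples."""
    both = sum(1 for pi, qi in zip(p, q) if pi and qi)
    either = sum(1 for pi, qi in zip(p, q) if pi or qi)
    similarity = both / either if either else 1.0  # two empty strings: identical (assumed)
    return 1.0 - similarity


def generalized_median(strings, candidates=None):
    """Return the candidate with the minimum sum of distances to `strings`.

    If `candidates` is None, all binary strings of the given length (the set U)
    are enumerated, which is only practical for short strings.
    """
    if candidates is None:
        candidates = product((0, 1), repeat=len(strings[0]))
    return min(candidates,
               key=lambda p: sum(tanimoto_distance(p, q) for q in strings))
```

For the three strings (1,1,0,0), (1,1,0,0), and (1,1,1,0) both the exhaustive search and the set median return (1,1,0,0).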

5.2.6 Structure representation

5.2.6.1 Binary fingerprints

2048-dimensional Daylight fingerprints (DFP) were generated with the Chemical Descriptors Library.35

5.2.6.2 Topological autocorrelation (AC2D)

Introduced by Moreau et al.,36 the topological autocorrelation descriptors have since then been applied in a number of studies.37,38,39,40,41 The descriptors are calculated according to Equation 5.10:

$$A(d) = \sum_{i=1}^{k} \sum_{j=i}^{k} p_i p_j \, \delta(d - d_{ij}) \qquad (5.10)$$

Here k is the number of atoms in the molecule, pi is some atomic property of atom i, dij is the topological distance (i.e., the number of bonds on the shortest path) between atoms i and j, and δ(x) is the binning function, with δ(x) = 1 for x = 0 and δ(x) = 0 for x ≠ 0. In the present study, the autocorrelation function was evaluated for topological distances from 0 to 10. Thus, the chemical structures were represented as 11-dimensional vectors for each atomic property used.

Three atomic properties were calculated by previously published empirical methods for all atoms in a molecule: sigma electronegativity (χσ),42 effective atom polarizability (αd),43 and partial atomic charge (qtot).44 In addition, the identity function, i.e., each atom was represented by 1, was used. The atomic properties for each molecule, with exception of the identity, were autoscaled to zero mean and unit variance before applying Equation 5.10. The scaling has been shown45 to diminish the correlations between the bins of autocorrelation and to better preserve the physicochemical information. The resulting autocorrelation vectors were additionally autoscaled to ensure that the values are comparable when a distance measure (such as the Euclidean distance, used by SOM) is calculated. The final, 44-dimensional description of the chemical structures was achieved by concatenation of the autocorrelation vectors thus calculated. The topological autocorrelation vectors were calculated with the descriptor calculation package ADRIANA.Code.46
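The calculation of Equation 5.10 can be sketched in Python for a molecule given as an adjacency list. This is an illustration only and not the ADRIANA.Code implementation; the function names, the input format, and the use of the population variance in `autoscale` are assumptions.

```python
from collections import deque


def topological_distances(adjacency):
    """All-pairs bond counts obtained by breadth-first search.

    adjacency: list in which entry i holds the indices of the atoms bonded to atom i.
    """
    k = len(adjacency)
    dist = [[None] * k for _ in range(k)]
    for start in range(k):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[start][v] is None:
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist


def autocorrelation(adjacency, props, max_distance=10):
    """Autocorrelation vector A(0)..A(max_distance) following Equation 5.10."""
    dist = topological_distances(adjacency)
    k = len(props)
    vector = [0.0] * (max_distance + 1)
    for i in range(k):
        for j in range(i, k):  # j starts at i, so A(0) sums the squared properties
            d = dist[i][j]
            if d is not None and d <= max_distance:
                vector[d] += props[i] * props[j]
    return vector


def autoscale(values):
    """Scale to zero mean and unit variance (population variance, an assumption)."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / sd if sd else 0.0 for v in values]
```

For a chain of three atoms with the identity property the sketch gives A(0) = 3, A(1) = 2, and A(2) = 1, i.e. the number of atoms, of bonds, and of atom pairs two bonds apart.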

5.2.6.3 Radial distribution function (RDF)

A radial distribution function is a transform of the 3-dimensional structure of a molecule.47 All structures were represented by their RDF codes weighted by the same physicochemical properties as the topological autocorrelation – sigma electronegativity (χσ), effective atom polarizability (αd), partial atomic charge (qtot), and identity (A1). Single, low energy 3D conformations were generated from the 2D constitution using the 3D structure generator CORINA.48,49 The RDF codes were calculated according to Equation 5.11:

$$g(r) = \sum_{i=1}^{N-1} \sum_{j>i}^{N} p_i p_j \, e^{-B (r - r_{ij})^2} \qquad (5.11)$$

where N is the number of atoms in a molecule, pi and pj are properties associated with the atoms i and j, respectively, rij represents the distance between atoms i and j, and B is a smoothing factor. The above formula was applied with the property p set to each of the four physicochemical properties in turn and 64-dimensional RDF codes were calculated. Analogously to the autocorrelation vectors, the atomic properties for each molecule, with exception of the identity, were autoscaled to zero mean and unit variance before applying Equation 5.11. The function g(r) was defined in the interval 0.8–12 Å. The resulting RDF codes were additionally autoscaled. The final, 256-dimensional description of the chemical structures was achieved by concatenation of the four RDF codes thus calculated. The RDF codes were calculated with the descriptor calculation package ADRIANA.Code.46
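Equation 5.11 can be sketched in Python as shown here. The value of the smoothing factor B and the sampling of g(r) at 64 equally spaced points including both interval ends are assumptions of this sketch, since neither is specified above; the code is an illustration and not the ADRIANA.Code implementation.

```python
import math


def rdf_code(coords, props, b=100.0, r_min=0.8, r_max=12.0, n_points=64):
    """RDF code following Equation 5.11.

    coords: list of (x, y, z) atom positions in Angstrom.
    props: one atomic property value per atom.
    b: smoothing factor B (the default of 100.0 is an assumed value).
    Returns the sampling radii and the corresponding values of g(r).
    """
    step = (r_max - r_min) / (n_points - 1)  # equally spaced grid (assumed)
    radii = [r_min + step * k for k in range(n_points)]
    code = [0.0] * n_points
    n = len(coords)
    for i in range(n - 1):
        for j in range(i + 1, n):
            r_ij = math.dist(coords[i], coords[j])
            weight = props[i] * props[j]
            for k, r in enumerate(radii):
                code[k] += weight * math.exp(-b * (r - r_ij) ** 2)
    return radii, code
```

For two atoms 1.5 Å apart with the identity property the code has a single peak at the grid point closest to 1.5 Å; a larger B makes this peak narrower.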

It should be stressed that the aim of this study was not the development of precise quantitative models. Using a single conformation in the manner described above is a gross simplification. Even if a conformation minimized by a quantum mechanical method or force field is used, it may be far away from the biologically active conformation. However, the goal here was to study whether the inclusion of 3D information, even if oversimplified, can improve the virtual screening results. The 3D structure generator CORINA has been shown50 to reproduce the PDB conformations of co-crystallized enzyme complexes reasonably well. In addition, CORINA is very fast; thus the conversion from topological constitution to a single conformation does not hamper the speed of the virtual screening procedure significantly.

5.3 Results and Discussion

For all virtual screening scenarios studied in this work the following questions were considered of interest:

• What is the optimal size of the training set?

• Which method to use?

• Which descriptor to use?

• Is there a difference when screening different chemical space?

• Is there an advantage of using a 3D descriptor?

5.3.1 Scenario 1: Prioritizing compounds for a subsequent HTS

In addition to the above questions, in this case the following additional questions are of interest:

• Is the virtual screening method able to prioritize compounds better than random ordering?

• At which size can a ranked list be truncated in a way which provides the best trade-off between the number of false positives and recovered actives? In other words, what portion of the ranked list should be examined to guarantee that a given number of known actives has been recovered?

5.3.1.1 What is the optimal size of the training set?

Figure 5.2 shows the obtained area under the ROC curve (cf. Materials and Methods) as a function of the size of the training set for all activity classes under investigation. The results with both methods when retrieving actives from the corresponding database with autocorrelation vectors as a descriptor are shown. Similar tendencies were observed for the other two investigated descriptor types.

Looking at Figure 5.2, the expected tendency of obtaining better results with a larger amount of training data is confirmed. This tendency is much stronger when the actives from the MDDR database are considered (remember that the training set is always selected from MDDR only, cf. Materials and Methods). On the other hand, when actives from an external database (WOMBAT in this study) are considered, increasing the size of the training set beyond 100 active compounds brings only marginal improvement. Thus, increasing the size of the training set as much as six times (from 50 to 300 training structures) brings, on average, less than 6% improvement in the obtained ROC areas when the retrieval of MDDR actives is considered and around 3% when the retrieval of active compounds from an external database (WOMBAT) is considered.

The obtained results show that using a large training set does not necessarily lead to significant improvements in the performance, especially when an attempt to retrieve actives from an external database is made. In some cases – for example the PKC inhibitors, Figure 5.2h – the use of a larger training set even leads to a decrease in the performance. This result suggests that a certain amount of noise is introduced by training sets that are too large. The results also hint that, in spite of the fact that both MDDR and WOMBAT contain biologically relevant compounds, there is a difference in the chemical space spanned by the two databases. Thus, a better sampling of the MDDR chemical space does not result in a significant increase in the retrieval of active compounds from WOMBAT.

[Figure 5.2: eight panels – a) 5-HT3, b) 5-HT1A, c) D2, d) Renin, e) AT1, f) Thrombin, g) HIV-1 P, h) PKC. Vertical axis: ROC area (x 100); horizontal axis: training set size (50 to 300). Legend: similarity search, MDDR; SOM novelty, MDDR; similarity search, WOMBAT; SOM novelty, WOMBAT.]

Figure 5.2 Area below the ROC curve (x 100) as a function of the training set size. The structures were described with 44-dimensional autocorrelation vectors.

Table 5.3 Areas under the ROC curve (x 100) obtained when retrieving MDDR active structures – mean and standard deviation over ten runs with different training sets of 100 active compounds

            DFP(a)                  AC2D(b)                 RDF(c)
            SSDF(d)     ndSOM(e)    SSDF        ndSOM       SSDF        ndSOM
activity    mean  sd    mean  sd    mean  sd    mean  sd    mean  sd    mean  sd
5-HT3       93.0  1.2   90.8  1.1   92.0  1.0   90.8  0.9   94.9  0.5   94.0  0.7
5-HT1A      93.7  0.9   91.3  0.7   91.6  0.5   90.3  1.0   93.3  1.1   92.7  0.9
D2          94.6  0.8   93.3  0.9   90.3  0.5   89.0  0.5   92.0  1.1   91.5  0.9
Renin       97.6  0.5   97.3  0.5   97.3  0.6   97.1  0.4   97.2  0.6   97.3  0.5
AT1         98.2  0.2   97.9  0.2   96.3  0.3   96.1  0.3   91.8  0.7   92.1  0.7
Thrombin    93.5  1.3   92.1  1.4   89.7  0.6   88.7  0.7   89.8  0.9   88.4  1.1
HIV-1 P     95.7  0.6   93.8  0.8   93.2  0.8   92.0  0.9   92.8  1.0   92.1  1.0
PKC         92.6  2.2   89.5  2.3   87.3  1.6   84.0  1.2   91.4  1.4   88.4  2.0

(a) Daylight fingerprints; (b) topological autocorrelation; (c) radial distribution function; (d) similarity search with data fusion; (e) SOM novelty detection

Table 5.4 Areas under the ROC curve (x 100) obtained when retrieving WOMBAT active structures – mean and standard deviation over ten runs with different training sets of 100 active compounds

            DFP                     AC2D                    RDF
            SSDF        ndSOM       SSDF        ndSOM       SSDF        ndSOM
activity    mean  sd    mean  sd    mean  sd    mean  sd    mean  sd    mean  sd
5-HT3       80.8  2.8   77.3  4.6   82.0  2.1   80.1  1.8   89.5  1.4   89.1  1.3
5-HT1A      89.6  0.7   87.9  0.9   86.8  0.8   86.8  0.7   88.2  1.1   88.8  0.7
D2          79.1  0.9   76.6  0.9   75.0  0.7   74.2  0.8   78.6  1.1   79.7  1.2
Renin       98.1  0.3   97.8  0.4   98.1  0.3   97.9  0.4   98.4  0.2   98.5  0.3
AT1         98.9  0.2   98.6  0.5   95.4  0.2   94.8  0.3   90.6  0.6   91.0  0.9
Thrombin    82.4  1.9   81.1  1.2   80.1  1.3   79.5  1.3   77.0  1.8   76.1  1.4
HIV-1 P     95.4  0.6   92.9  1.6   90.6  1.0   88.8  1.0   91.0  0.9   90.2  1.2
PKC         89.8  1.8   87.0  2.5   88.8  1.0   87.2  1.9   91.8  0.6   90.2  0.5

On the other hand, the improvement in the obtained ROC areas when retrieving active compounds from MDDR is not as large as the corresponding increase in the training set size. Therefore, the results obtained with a training set consisting of 100 known active compounds will be discussed in the rest of this section. This training set size provides virtually the same performance as using larger training sets when WOMBAT actives are retrieved. The obtained areas under the ROC curve when using a training set consisting of 100 active compounds are summarized in Table 5.3 and Table 5.4 when actives from MDDR and WOMBAT are retrieved, respectively.

5.3.1.2 Is there a difference when screening different chemical spaces?

Having identified the size of the training set, we will discuss the bias introduced by selecting the training set from the same database which is later screened. The activities depicted in Figure 5.2 cover three different cases. In the case of 5-HT3 antagonists (Figure 5.2a) both MDDR and WOMBAT contain a similar number of active compounds (cf. Table 5.1). WOMBAT, on the other hand, contains around two times more 5-HT1A agonists (Figure 5.2b) than MDDR. As can be seen from Figure 5.2, the results obtained when retrieving structures from WOMBAT were significantly lower in these two cases. In the case of PKC inhibitors (Figure 5.2h) MDDR contains almost four times as many PKC inhibitors as WOMBAT. MDDR seems to cover the chemical space (as defined by the 44-dimensional autocorrelation vectors) of WOMBAT relatively well and the bias towards retrieving compounds from the same database from which the training set was selected is less pronounced. As a result, a similar performance was obtained regardless of the origin of the known actives. Similar tendencies were observed with the other activity classes.

Figure 5.3 shows a graphical summary of the data shown in Table 5.3 and Table 5.4, i.e. the ROC areas (average over 10 runs) and the corresponding standard deviations obtained when retrieving actives from MDDR and from WOMBAT, respectively. The training set consisted of one hundred active compounds selected from MDDR only (cf. Materials and Methods, Screening protocol) and Daylight fingerprints (Figure 5.3a), topological autocorrelation (Figure 5.3b), and radial distribution functions (Figure 5.3c) were used to describe the structures.

[Figure 5.3: three panels – a) Daylight fingerprints, b) topological autocorrelation, c) radial distribution function. Vertical axis: ROC area (x 100), 70 to 100; horizontal axis: activity class (5-HT3, 5-HT1A, D2, Renin, AT1, Thrombin, HIV-1 P, PKC). Legend: similarity search, MDDR; SOM novelty, MDDR; similarity search, WOMBAT; SOM novelty, WOMBAT.]

Figure 5.3 Area under the ROC curve (x 100) obtained when retrieving actives from MDDR or from WOMBAT with the two virtual screening methods using the three structure representations investigated in this work.

There are two cases which deserve a note. In the first case, illustrated by the 5-HT1A, D2, Thrombin, and HIV-1 P activity classes, more active structures are available in WOMBAT than in MDDR (cf. Table 5.1). A lower performance was obtained when retrieving WOMBAT actives in most of these cases (except HIV-1 P inhibitors, cf. Figure 5.3). These results hint that the few actives available in MDDR are unable to cover the full activity space in WOMBAT. In the second case, illustrated by the 5-HT3, AT1, Renin, and PKC activity classes, the number of known active structures is higher in MDDR (cf. Table 5.1). In most of these cases (AT1, Renin, and PKC) similar performances were obtained regardless of the source of the active structures (cf. Figure 5.3). However, lower performances were obtained when retrieving 5-HT3 antagonists originating from WOMBAT with all descriptors (cf. Figure 5.3) and when retrieving PKC inhibitors originating from WOMBAT with Daylight fingerprints (cf. Figure 5.3a). Thus, a certain amount of bias towards recovering compounds from MDDR is introduced.

This bias is likely due to the fact that the two databases used in this study cover different aspects of the corresponding activity spaces. To investigate this assumption we have mapped the activity spaces covered by the actives in MDDR and WOMBAT on a plane using the Sammon projection algorithm.51 The distance matrix obtained with the corresponding descriptor using one minus the Tanimoto coefficient was used as an input to the R52 implementation53 of the Sammon mapping algorithm. Figure 5.4 shows the Sammon projections for the 5-HT3 antagonists (Figure 5.4a to Figure 5.4c) and for the HIV-1 P inhibitors (Figure 5.4d to Figure 5.4f) with the three structure representations under investigation.

Starting with Figure 5.4a, more than 35% of the 5-HT3 antagonists in WOMBAT are mapped with X-axis values higher than 0.25, while less than 10% of the 5-HT3 antagonists in MDDR are found in this region of the projection. This confirms the suspected difference between the chemical spaces covered by the 5-HT3 antagonists in MDDR and WOMBAT. The difference exists when using topological autocorrelation and RDF codes as well, as can be seen from Figure 5.4b and Figure 5.4c. However, the real-valued structure representations, especially the RDF codes (Figure 5.4c), have mapped the WOMBAT activity space better than the Daylight fingerprints, and this is reflected in the smaller difference in performance when retrieving actives from WOMBAT (Table 5.3, Table 5.4, and Figure 5.3).

[Figure 5.4, six scatter plots with the series MDDR and WOMBAT: a) 5-HT3, Daylight fingerprints; b) 5-HT3, topological autocorrelation; c) 5-HT3, RDF code; d) HIV-1 P, Daylight fingerprints; e) HIV-1 P, topological autocorrelation; f) HIV-1 P, RDF code.]

Figure 5.4 Sammon projection of the active compounds in MDDR and WOMBAT

In the case of the HIV-1 P inhibitors the actives from WOMBAT form two distinct clusters when using 2-dimensional descriptors (Figure 5.4d and Figure 5.4e), while no such clustering is observed with RDF codes (Figure 5.4f). However, in all cases the actives from MDDR are spread fairly evenly through the WOMBAT space; therefore the bias towards retrieving MDDR actives is lower (Table 5.3, Table 5.4, and Figure 5.3).

Based on the mappings shown in Figure 5.4 it is clear that for some activity classes there is a difference between the chemical spaces covered by MDDR and WOMBAT. This difference leads to optimistic performance estimates when the retrieved active structures come from the same database (MDDR) from which the training set has been selected.

5.3.1.3 Which method to use?

To work in a “bias-free” environment and to keep the text concise the discussion from here on will be based on the results obtained when retrieving actives from WOMBAT.

First we look at the performance of the two methods – similarity search with data fusion and SOM novelty detection. Figure 5.5 compares these two methods: a) when the structures were described with 2048-dimensional Daylight fingerprints; b) when the structures were described with 44-dimensional topological autocorrelation vectors; and c) when the structures were described with 256-dimensional RDF codes (cf. Table 5.3 and Table 5.4).

Both methods were able to rank the WOMBAT actives in a way superior to a random selection, as indicated by the high values obtained for the area under the ROC curve. The ndSOM performed slightly worse when a binary structure representation was used (Figure 5.5a). On the other hand, in the case of real-valued descriptors (Figure 5.5b and Figure 5.5c) there is virtually no difference between the results obtained with SSDF and ndSOM. Since each of the ROC areas shown is an average over ten independent runs, we have used a paired t-test to investigate whether there is a statistically significant difference between the results obtained by the two methods.

Before we discuss the results of the t-test it is worth noting that a method aimed exactly at comparing the areas under two ROC curves exists.27 However, it requires a correction factor based on the correlation between the two ranked lists. Since our discussion is based on an average ROC curve, to which no particular ranked list corresponds, we opted to use a t-test.

The paired t-test found a statistically significant difference in the mean ROC areas in almost all cases. This result is mainly due to the low standard deviations and is somewhat counter-intuitive considering the small differences between the two methods which can be seen in Figure 5.5.

[Figure 5.5, bar charts in three panels: a) Daylight fingerprints; b) topological autocorrelation; c) radial distribution function. Vertical axis: ROC area (x 100), 70 to 100. Horizontal axis: 5-HT3, 5-HT1A, D2, Renin, AT1, Thrombin, HIV-1 P, PKC. Series: similarity search, SOM novelty.]

Figure 5.5 Area under the ROC curve (x 100) obtained by the two virtual screening methods when retrieving WOMBAT active structures.


However, it is not uncommon for extremely small and unremarkable differences to be found statistically significant. In addition, statistical significance says nothing about the practical significance of a difference. Considering that the highest average difference in the ROC areas between the two methods was around 5% (when retrieving 5-HT3 antagonists from WOMBAT with binary fingerprints), we conclude that both methods perform similarly for any practical purposes. Based on this conclusion we favor the SOM novelty detection technique because it is twice as fast as the similarity search with data fusion.
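The effect described above, a small mean difference that is nevertheless statistically significant because the standard deviation is low, can be illustrated with a short sketch of the paired t statistic. The ROC areas below are hypothetical numbers invented for the illustration and are not results from this study; the function name is likewise an assumption.

```python
import math
import statistics


def paired_t(x, y):
    """Paired t statistic and degrees of freedom for two matched samples."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Hypothetical ROC areas (x 100) from ten runs; NOT values from this study
ssdf = [88.1, 88.3, 88.0, 88.2, 88.4, 88.1, 88.2, 88.3, 88.0, 88.2]
ndsom = [87.6, 87.9, 87.5, 87.8, 87.9, 87.7, 87.6, 87.9, 87.6, 87.7]

t_stat, dof = paired_t(ssdf, ndsom)
# With 9 degrees of freedom the two-sided 5% critical value is 2.262,
# so a mean difference of less than half a unit is "significant" here.
significant = abs(t_stat) > 2.262
```

The example makes the point of the paragraph concrete: the test statistic grows with the consistency of the difference across runs, not with its practical size.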

5.3.1.4 Which descriptor to use?

Following the conclusions from the previous section the discussion here is limited to SOM novelty detection for retrieving the active structures from WOMBAT.

[Figure 5.6, bar chart. Vertical axis: ROC area (x 100), 70 to 100. Horizontal axis: 5-HT3, 5-HT1A, D2, Renin, AT1, Thrombin, HIV-1 P, PKC. Series: Daylight fingerprints, topological autocorrelation, RDF code.]

Figure 5.6 Area under the ROC curve (x 100) obtained when retrieving WOMBAT actives with SOM novelty detection using different descriptors.


Figure 5.6 shows the areas under the ROC curve obtained with the three descriptors – Daylight fingerprints, topological autocorrelation vectors, and RDF codes. As expected, the obtained performance varies with the activity class under investigation. Thus, while the RDF codes gave better performance in the case of the 5-HT1A, D2, and PKC classes, the Daylight fingerprints performed better in the case of AT1 and Thrombin inhibitors. The topological autocorrelation vectors performed similarly to, or better than, at least one of the other two descriptors in all cases except the D2 antagonists, where the autocorrelation vectors achieved the worst performance. This is an interesting observation, bearing in mind that this was the descriptor with the lowest dimensionality used in this study. From the results depicted in Figure 5.6 no ultimate “good-for-all” descriptor can be identified. However, the performance obtained with topological autocorrelation vectors – although never the best one – shows that good results can be obtained even when a low-dimensional 2D descriptor is used.

Another interesting question with regard to the use of different descriptors is the type of active compounds which each descriptor retrieves. However, we remind the reader that in this scenario we were interested in the superiority of the obtained rankings over randomly picking compounds and in finding the best cut-off of the obtained list. No cut-off was defined a priori and therefore the comparison between the descriptors from the point of view of recovered chemotypes is deferred to the discussion of Scenario 2.

5.3.1.5 Is there an advantage of using a 3D descriptor?

As discussed in the section above, the correct answer to this question is activity dependent. However, coming back to Figure 5.6, in six out of eight cases no particular advantage of utilizing a 3D descriptor can be seen. Considering that the RDF codes used throughout this study were of approximately six times higher dimensionality than the topological autocorrelation vectors, the results seem to indicate that little, if any, value was added by moving to a 3D-based representation of the chemical structure. This is in agreement with previous studies on the subject.14,54 This observation suggests that the molecular features which play an important role for the biological activity can, in most cases, be deduced from the 2D representation of the chemical structure and do not require explicit accounting for conformational parameters. In addition, the behavior observed in this study may be due to the fact that a single low-energy conformation was used. However, no universally accepted approach to handling conformational flexibility exists. On the other hand, with the aim of screening millions of compounds, the trade-off between the execution speed and the improvement obtained through the use of a 3D descriptor is not attractive.

5.3.1.6 Scenario 1 specific questions

Based on the results already shown, the answer to the first question – is the virtual screening method able to prioritize compounds better than a random ordering – is clearly yes. The large areas under the ROC curve show that all method/descriptor combinations are able to prioritize compounds better than a random ordering.
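For readers who want to reproduce this kind of check, the area under the ROC curve of a single ranked list can be computed by counting, for every active, the inactives ranked below it. The sketch below is an illustration under the assumption of a strict ranking without ties; the function name and the toy rankings are hypothetical, and the averaging over ten runs used in this work is not shown.

```python
def roc_area(ranked_labels):
    """Area under the ROC curve of a ranked list (best rank first, True = active).

    Equals the fraction of (active, inactive) pairs in which the active is
    ranked above the inactive; ties are assumed not to occur.
    """
    n_act = sum(ranked_labels)
    n_inact = len(ranked_labels) - n_act
    inactives_seen = 0
    correctly_ordered_pairs = 0
    for is_active in ranked_labels:
        if is_active:
            correctly_ordered_pairs += n_inact - inactives_seen
        else:
            inactives_seen += 1
    return correctly_ordered_pairs / (n_act * n_inact)


perfect = roc_area([True, True, False, False])
mixed = roc_area([True, False, True, False])
worst = roc_area([False, False, True, True])
```

A value of 0.5 corresponds to a random ordering, which is the baseline against which the areas reported in this section are judged.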

Figure 5.7 shows the ROC curves obtained when the actives from WOMBAT were retrieved. To answer the second question – at which size a ranked list can be truncated in a way which provides the best trade-off between the number of recovered false and true positives – one simply needs to define the desired number of actives or the allowed number of structures with unknown activity and to read the corresponding numbers from the ROC curve (cf. Materials and Methods, section 5.2.4). In the next section we will discuss the retrieval of actives in the top 5% of the ranked list (the “early recognition” problem). The percentage of actives recovered in a list which contains 5% false positives is marked with dashed lines in Figure 5.7. Due to the low number of known actives, the size of the list obtained in this way is approximately 5% of the total number of compounds. Thus, considering the 5-HT3 antagonists as an example, in a list which contains 5% false positives one can expect to recover around 43% of the known actives by using Daylight fingerprints or topological autocorrelation and around 58% when using RDF codes.
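Reading the cut-off from a ROC curve amounts to walking down the ranked list until the tolerated share of inactives has been used up. The following sketch illustrates this under simple assumptions: the function name is hypothetical, compounds with unknown activity are treated as inactives, and the toy ranking is invented for the example.

```python
def actives_recovered_at_fpr(ranked_labels, max_fpr=0.05):
    """Walk down a ranked list until the tolerated share of inactives is used up.

    ranked_labels: booleans ordered from best to worst rank, True = known active.
    Returns the fraction of actives recovered and the size of the truncated list.
    """
    n_act = sum(ranked_labels)
    n_inact = len(ranked_labels) - n_act
    allowed = int(max_fpr * n_inact + 1e-9)  # tolerated number of false positives
    tp = fp = 0
    for is_active in ranked_labels:
        if is_active:
            tp += 1
        elif fp == allowed:
            break  # the next inactive would exceed the tolerated share
        else:
            fp += 1
    return tp / n_act, tp + fp


# Toy ranking (hypothetical): 4 actives among 104 compounds
ranked = [False] * 104
for pos in (0, 2, 6, 50):
    ranked[pos] = True

tpr, list_size = actives_recovered_at_fpr(ranked, max_fpr=0.05)
```

In this toy case three of the four actives are found in a list of eight compounds, five of which are false positives. Because actives are rare, the list size stays close to the tolerated share of the whole database, as noted in the text.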

[Figure 5.7, ROC curves in eight panels. Horizontal axis: inactive (%), 0 to 100. Vertical axis: active (%), 0 to 100. Values given in the panel legends:]

panel  activity   Daylight fingerprints   topological autocorrelation   RDF codes
a)     5-HT3      77.3                    80.1                          89.1
b)     5-HT1A     87.9                    86.8                          88.8
c)     D2         76.6                    74.2                          79.7
d)     Renin      97.8                    97.9                          98.5
e)     AT1        98.6                    94.8                          91
f)     Thrombin   81.1                    79.5                          76.1
g)     HIV-1 P    92.9                    88.8                          90.2
h)     PKC        87                      87.2                          90.2

Figure 5.7 ROC curves obtained with SOM novelty detection when retrieving WOMBAT active compounds. The area below the corresponding curve is shown in parentheses. A hypothetical cut-off of the ranked list comprising 5% false positives and the expected percentage of true positives is shown by dotted lines.


5.3.1.7 Scenario 1 summary

We conclude the discussion about scenario 1 – prioritizing compounds for a subsequent HTS experiment – by summarizing the main points from the above discussion.

• increasing the size of the training set beyond one hundred active compounds leads to marginal improvements for both virtual screening methods

• the above observation is especially true when actives from a database external to the one used to select the training set are screened

• although both MDDR and WOMBAT contain biologically relevant compounds, there is a difference in the chemical spaces covered by the two databases

• both virtual screening methods – similarity search with data fusion and novelty detection with self-organizing maps – are able to prioritize compounds better than random picking

• both methods gave similar performances; however, SOM novelty detection is to be preferred due to its faster execution times

• the three descriptors – 2048-dimensional Daylight fingerprints, 44-dimensional topological autocorrelation vectors, and 256-dimensional RDF codes – have led to good results, with the best performing descriptor being activity dependent

• based on the area below the ROC curve, no evidence supporting the use of a 3D descriptor (RDF code) based on a single low-energy conformation was found

• a simple way of estimating the expected trade-off between true and false positives at a given size of the ranked list with the help of a ROC curve has been illustrated

5.3.2 Scenario 2: Selecting compounds for a subsequent lead-optimization

In addition to the questions posed at the beginning of the discussion, the following questions were considered of interest in this case:

• How well is a virtual screening method able to retrieve actives in the very beginning of the ranked list (“early recognition”)?

• How different are the compounds retrieved by each method at the pre-specified size of the ranked list?

• How different are the compounds retrieved by each descriptor?

• Is the retrieval of new known active chemotypes (which are not present in the training set) possible?

• How diverse are the “false-positives” in terms of chemotypes? Do these “false-positives” look promising as new leads for the given activity?

In the discussion of Scenario 1 we have demonstrated the utility of a virtual screening method for choosing the number of compounds which have to be selected from a large dataset in a way which provides the desired compromise between false and true positives. Scenario 2 assumes that the desired number of compounds has already been decided – by applying the techniques described above and/or by considering other constraints, such as the resources available for subsequent processing of the selected compounds. In addition, in what follows we assume that a relatively small number of the original compounds is being considered – usually the top-ranked one to ten per cent. However, we want to emphasize that even 1% of a database which contains one million compounds still amounts to 10,000 compounds – a size which might be acceptable for a pharmaceutical company but is not likely to be of use to a university research laboratory. We concentrate mainly on the “early recognition” problem as perceived in the pharmaceutical industry. The performance of the methods in retrieving known active structures amongst the top-ranked 5% of the original database – corresponding to approximately 13,500 structures – is discussed. However, a short discussion of the one hundred false-positives is presented at the end of the section in an attempt to show the applicability of the proposed methods when limited resources are available.

In this section we will focus mainly on the structures actually retrieved. However, before going into a detailed discussion of the retrieved chemotypes, we will give a brief answer to some of the general questions posed at the beginning of the discussion, based on the obtained BEDROC scores.

5.3.2.1 BEDROC score analysis

The discussion in this section is based on the BEDROC scores obtained when retrieving WOMBAT actives. The BEDROC score, as briefly described in Materials and Methods, accounts for the “early recognition” capability of a virtual screening method by giving higher weight to active compounds recovered towards the beginning of the ranked list. Thus, it can be seen as a score of the “usefulness” of a given virtual screening method for the selected number of compounds which will be considered after the virtual screening has been performed.

Similarly to the area under the ROC curve, the BEDROC score can be interpreted as the probability that a given active compound will be ranked higher than a random compound drawn from an exponential distribution with a given α parameter. In the present study α=32.2 was used. This value ensures that 80% of the BEDROC score will be based on the top-ranked five per cent of the compounds. The theoretical expectation of the BEDROC score is 0.5. We want to stress that the BEDROC evaluation, like any evaluation of early recognition problems, ultimately depends on the selected parameters. That is, a given score always has to be interpreted taking into account the definition of “usefulness”. A method which retrieves, say, 70% of the actives amongst the top-ranked five per cent may result in a BEDROC score lower than 0.5, although in most cases such a performance will be considered rather good. We have provided the full ranking data used in this study as part of the supporting information which accompanies this article. These data allow the interested reader to evaluate the performance of each method/descriptor combination under her/his definition of “usefulness”.
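The role of the α parameter can be made concrete with a short sketch. The first function computes the share of the exponential weight that falls on the top fraction of the list, which reproduces the 80% figure quoted for α=32.2 and the top 5%. The second function computes a BEDROC score in the form that, to the best of our understanding, was published by Truchon and Bayly in 2007; the definition given in Materials and Methods takes precedence, and the function names as well as the toy rankings are assumptions made for this illustration.

```python
import math


def top_weight_fraction(alpha, z):
    """Share of the exponential weight that falls on the top fraction z of the list."""
    return (1.0 - math.exp(-alpha * z)) / (1.0 - math.exp(-alpha))


def bedroc(ranked_labels, alpha=32.2):
    """BEDROC score of a ranked list (best rank first, True = active).

    Assumed form: rescaled robust initial enhancement (RIE).
    """
    n_total = len(ranked_labels)
    ranks = [i + 1 for i, active in enumerate(ranked_labels) if active]
    n_act = len(ranks)
    if n_act == 0 or n_act == n_total:
        raise ValueError("the list must contain both actives and inactives")
    ra = n_act / n_total  # ratio of actives in the list

    # RIE: observed exponential average over the value expected for a random ranking
    observed = sum(math.exp(-alpha * r / n_total) for r in ranks) / n_act
    expected = (1.0 - math.exp(-alpha)) / (n_total * (math.exp(alpha / n_total) - 1.0))
    rie = observed / expected

    # Rescale the RIE so that the worst ranking gives 0 and the best ranking gives 1
    scale = ra * math.sinh(alpha / 2.0) / (
        math.cosh(alpha / 2.0) - math.cosh(alpha / 2.0 - alpha * ra)
    )
    offset = 1.0 / (1.0 - math.exp(alpha * (1.0 - ra)))
    return rie * scale + offset


weight_top5 = top_weight_fraction(32.2, 0.05)

# Toy rankings (hypothetical): 10 actives among 1000 compounds
best = bedroc([True] * 10 + [False] * 990)
worst = bedroc([False] * 990 + [True] * 10)
```

Changing `alpha` shifts the emphasis along the ranked list, which is why a score always has to be read together with the chosen definition of “usefulness”.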

5.3.2.1.1 What is the optimal size of the training set?

The same tendency as in Scenario 1 was observed. Like the area under the ROC curve, the obtained BEDROC scores (α=32.2) did not increase significantly once one hundred active structures were used as the training set. Therefore, the following discussion is still based on the results obtained with a training set consisting of one hundred known active compounds.

5.3.2.1.2 Is there a difference when screening different chemical spaces?

The bias towards retrieving actives from MDDR, already discussed in Scenario 1, was more pronounced when early retrieval is considered. Thus, in what follows only the retrieval of actives from an external database – WOMBAT – will be discussed. The BEDROC scores obtained when retrieving WOMBAT actives with a training set consisting of 100 active compounds are summarized in Table 5.5. Mean values and standard deviations over 100 bootstrapped repetitions (cf. Materials and Methods) are reported.


Table 5.5 BEDROC scores (x 100, α=32.2) obtained when retrieving WOMBAT active structures – mean and standard deviation over one hundred bootstrapped runs with ten different training sets of 100 active compounds

activity    DFP SSDF      DFP ndSOM     AC2D SSDF     AC2D ndSOM    RDF SSDF      RDF ndSOM
            mean   sd     mean   sd     mean   sd     mean   sd     mean   sd     mean   sd
5-HT3       42.9   8.2    36.0   8.8    40.0   5.4    34.9   4.8    48.0   6.9    43.3   6.5
5-HT1A      41.5   3.9    35.9   4.4    34.5   3.2    32.5   3.3    35.3   3.9    34.9   3.5
D2          27.6   3.1    25.1   2.9    20.9   2.5    19.0   2.7    23.6   2.7    23.8   2.9
Renin       73.4   4.4    69.6   6.4    74.9   4.7    74.2   4.3    73.5   4.1    77.7   4.4
AT1         86.1   3.0    85.7   3.2    68.7   3.4    66.2   3.4    53.3   4.5    52.6   4.0
Thrombin    36.5   4.4    34.6   4.4    24.8   2.9    24.2   3.1    24.8   3.4    23.0   3.4
HIV-1 P     62.5   4.6    59.7   5.7    41.8   4.5    38.1   4.2    45.3   4.7    46.0   5.3
PKC         73.8   2.5    72.1   2.1    51.8   5.7    46.0   7.4    73.2   2.7    71.8   3.0

5.3.2.1.3 Which method to use?

As can be seen from Table 5.5, the similarity search with subsequent data fusion produced better BEDROC (α=32.2) scores than the SOM novelty detection method in almost all cases, regardless of the descriptor. In three of the eight activity classes – D2, Renin, and HIV-1 P – the SOM novelty detection was slightly better when the RDF code was used to describe the structures. Analogously to Scenario 1, we have used a paired t-test to compare the BEDROC scores obtained by the two methods. Recall that each value in Table 5.5 is a mean over one hundred bootstrapped evaluations (cf. Materials and Methods). A statistically significant difference was found in almost all cases, the exceptions being the 5-HT1A and D2 activity classes based on RDF codes. In contrast to Scenario 1, in which the difference in the ROC areas was deemed irrelevant from a practical point of view, in the early recognition scenario the similarity search method produced up to 16% better BEDROC scores (5-HT3 activity class, for example, cf. Table 5.5). Therefore, it is the preferred method when early recognition is needed. However, once again we want to point out that a statistically significant difference does not necessarily imply a practically meaningful difference. Thus, for example, considering the AT1 class with binary fingerprints (cf. Table 5.5), the mean of the difference between the BEDROC scores (x 100) obtained by the two methods is 0.4 – a figure which hardly bears any practical consequences, but nevertheless the two means differ from a statistical point of view (paired t-test).

5.3.2.1.4 Which descriptor to use?

A comparison between the BEDROC scores obtained with similarity search (which we have just identified as the method of choice) is presented in Figure 5.8.

[Figure 5.8, bar chart. Vertical axis: BEDROC score (x 100), 0 to 100. Horizontal axis: 5-HT3, 5-HT1A, D2, Renin, AT1, Thrombin, HIV-1 P, PKC. Series: Daylight fingerprints, topological autocorrelation, RDF code.]

Figure 5.8 BEDROC scores (x 100) obtained when retrieving WOMBAT actives with similarity search using different descriptors.

Similar observations as for Scenario 1 can be made. The performance of the descriptors is once again activity dependent. The Daylight fingerprints performed best for the AT1, Thrombin, and HIV-1 P activity classes while for the rest of the activity classes, except PKC inhibitors, all three descriptors show similar performance. In the case of the PKC inhibitors the topological autocorrelation gave inferior performance compared to the other two structure representations. Thus, based on the obtained BEDROC scores no “best” descriptor can be identified as a general rule.

5.3.2.1.5 Is there an advantage of using a 3D descriptor?

Taking another look at Figure 5.8 and Table 5.5, the observations from Scenario 1 with regard to this question are confirmed – no gain in the early retrieval ability of either of the studied virtual screening methods was achieved by utilizing a 3D descriptor. For most of the investigated activity classes – 5-HT3, 5-HT1A, D2, Renin, AT1, Thrombin, and HIV-1 P – the use of RDF codes gave a performance similar to that achieved with topological autocorrelation vectors. Thus, based on the obtained BEDROC scores, no convincing reason to use RDF codes based on a single low-energy conformation was found.

5.3.2.1.6 How well did the studied virtual screening methods perform in the “early recognition” scenario?

According to the obtained BEDROC scores (Table 5.5), both similarity search and SOM novelty detection were able to group significant amounts of the known WOMBAT actives amongst the top 5% of the ranked list. However, the actual results were activity dependent. In the case of Renin, AT1, HIV-1 P, and PKC a good performance – BEDROC scores higher than 0.5 – was obtained, while for the rest of the activity classes the grouping of actives amongst the top-ranked 5% was less successful – BEDROC scores lower than 0.5. This is in clear contrast to the area under the ROC curve. As was shown in the discussion of Scenario 1, all methods and descriptors produced ROC area values much superior to the randomly expected area of 0.5. However, when early recognition is important (which is the case in nearly any virtual screening application) there are cases where the performance is not optimal. Thus, while an analysis based on the area under the ROC curve is useful for providing an overview of the ability of a virtual screening method to perform better than random (or, precisely speaking, uniform) picking, it is not directly indicative of the early recognition capabilities of that method. Of course, the actual ROC curve, which is useful when looking for a good trade-off between true and false positives, gives a rather accurate picture – the highest BEDROC scores obtained (Renin, AT1, HIV-1 P, PKC) correspond to the steepest ROC curves, cf. Figure 5.7.

5.3.2.2 Chemotype analysis

In this section, a comparison between the different virtual screening methods and descriptors with regard to the recovered active structures in terms of chemotypes is presented. The computer program MeqiLite55 – a freeware version of the full MeqiSuite program – was used to generate a variety of graph-based indices, which characterize different aspects of the chemical structure. From the wide range of graph-based indices with different degrees of sophistication we will center the discussion on the unextended cyclic-system skeleton (UnSkCycMqn) index. This index provides a basic description of the underlying chemotype considering only the connectivity of the original structure. It does not distinguish atom and bond types and thus produces relatively broad categories, which encompass a large number of chemical structures. Chart 5.1 shows an example of three chemical structures, their corresponding UnSkCycMqn indices, and the connectivity graphs used to calculate the indices.

Chart 5.1 Example of the MeqiSuite UnSkCycMqn index. The reduced graph used for the calculation is marked with thicker lines.

[Chart 5.1: structures 1, 2, and 3 with the UnSkCycMqn indices XAZG5, FXGQ9, and FXGQ9, respectively.]

Structures 2 and 3 from Chart 5.1 obtain the same UnSkCycMqn index because this particular index does not take atom and bond types into account. For a detailed description of all graph-based indices available in the MeqiSuite the reader is referred to the extensive technical report56 describing the MeqiSuite software.

5.3.2.2.1 How different are the active compounds retrieved by each descriptor?

As discussed above, the similarity search with data fusion provides a slight advantage in the early recognition scenario. Therefore, we will use the results obtained with this method to compare the studied structure representations. The ability of each descriptor to recover different structures and chemotypes is summarized in Table 5.6.


Table 5.6 WOMBAT active structures and chemotypes recovered by different structure representations. Mean values over ten similarity search runs with different training sets.

activity    DFP structures   DFP chemotypes   AC2D structures  AC2D chemotypes  RDF structures   RDF chemotypes
            total  unique    total  unique    total  unique    total  unique    total  unique    total  unique
5-HT3       332    34        107    7         320    28        115    10        380    50        114    9
5-HT1A      1393   320       316    58        1178   252       313    57        1216   179       287    30
D2          1087   291       298    54        856    202       266    47        962    235       285    52
Renin       528    6         131    1         518    7         136    3         551    9         139    1
AT1         1307   134       233    26        1110   11        206    3         891    2         156    1
Thrombin    962    209       230    41        689    80        204    29        662    61        154    13
HIV-1 P     1874   330       493    72        1341   72        396    16        1431   110       400    30
PKC         120    7         22     2         94     3         20     2         124    9         24     1

A few observations from Table 5.6 deserve attention. First, the correspondence between the BEDROC scores and the total number of retrieved actives amongst the top-ranked 5% can be seen: the higher the BEDROC score, the higher the number of retrieved actives. However, the relationship is not strictly linear since, as already discussed, the BEDROC score gives higher weight to compounds which are found early in the ranked list. Thus, considering the Thrombin inhibitors as an example, the retrieval with autocorrelation vectors and RDF codes produced equal BEDROC scores (cf. Table 5.5) while the average number of recovered compounds differs slightly (cf. Table 5.6).

Another observation from Table 5.6 is that, as expected, different descriptors cover different aspects of the activity space and consequently retrieve different compounds. This is illustrated in the “unique” columns for the active structures. These numbers were produced by taking the difference between the sets of active compounds recovered by each descriptor and averaging the size of the obtained sets over the ten repetitions with different training sets. Thus, considering 5-HT3 as an example, one can see from Table 5.6 that on average 34 compounds were recovered exclusively by Daylight fingerprints, 28 exclusively by autocorrelation vectors, and 50 exclusively by RDF codes. Depending on the activity class, between 1 and 25 per cent of the active structures recovered by each descriptor were missed by the other two. In addition, as can be seen from the “unique” columns for the chemotypes in Table 5.6, the active structures recovered uniquely by each descriptor more often than not contain additional chemotypes. Figure 5.9 shows the percentage of different chemotypes amongst the recovered active structures for each descriptor.
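The set differences behind the “unique” columns can be sketched as follows. The compound identifiers are hypothetical placeholders, the function name is an assumption, and the averaging over ten training sets described in the text is omitted for brevity.

```python
def unique_hits(hits_by_descriptor):
    """For each descriptor, the actives that none of the other descriptors retrieved."""
    unique = {}
    for name, hits in hits_by_descriptor.items():
        others = set().union(
            *(h for other, h in hits_by_descriptor.items() if other != name)
        )
        unique[name] = hits - others
    return unique


# Hypothetical identifiers of actives found in the top 5% (illustration only)
hits = {
    "DFP": {"c1", "c2", "c3", "c4"},
    "AC2D": {"c2", "c3", "c5"},
    "RDF": {"c3", "c5", "c6", "c7"},
}
unique = unique_hits(hits)
```

Mapping each compound in `unique` to its chemotype index and counting the indices that do not occur for the other descriptors would give the corresponding chemotype columns in the same way.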

[Figure 5.9, bar chart. Vertical axis: active chemotypes (%), 15 to 40. Horizontal axis: 5-HT3, 5-HT1A, D2, Renin, AT1, Thrombin, HIV-1 P, PKC. Series: Daylight fingerprints, topological autocorrelation, RDF code.]

Figure 5.9 Number of chemotypes expressed as percentage from the total number of recovered WOMBAT actives. Virtual screening method: similarity search with data fusion.

As can be seen from Figure 5.9 the autocorrelation vectors recovered the most diverse – in terms of UnSkCycMqn Meqi index – active structures regardless of the activity. Although the Daylight fingerprints recovered more actives, these structures were usually less diverse. This is not surprising keeping in mind that the Daylight fingerprints encode exclusively structural information while the autocorrelation and RDF codes make an attempt to account for various physico-chemical properties as well. The RDF codes usually recovered less diverse sets of actives compared to the topological autocorrelation. This may be attributed to the use of only a single low-energy conformation.

The results shown in Table 5.6 raise the question whether it is possible to further improve the retrieval of active structures by combining the ranked lists obtained with different structure representations. In an attempt to answer this question we have combined the ranked lists obtained with Daylight fingerprints, topological autocorrelation, and RDF codes using data fusion. A fusion algorithm based on the sum of the ranks from the individual ranked lists was used, as described by Ginn et al.57 Thus, each structure from the screened database obtained a new rank equal to the sum of its ranks in the ranked lists obtained with the particular structure representations. The results of these experiments are summarized in Table 5.7.
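The rank-sum fusion described above can be sketched in a few lines. This is an illustration only: the function name, the tie-breaking rule, and the toy rankings are assumptions and are not taken from the implementation used in this work.

```python
def fuse_by_rank_sum(ranked_lists):
    """Fuse several rankings of the same compounds by summing their ranks.

    ranked_lists: lists of compound identifiers, best first; every list is
    expected to contain the same identifiers.
    """
    rank_sum = {}
    for ranking in ranked_lists:
        for rank, compound in enumerate(ranking, start=1):
            rank_sum[compound] = rank_sum.get(compound, 0) + rank
    # Lowest summed rank first; ties are broken by identifier (assumption)
    return sorted(rank_sum, key=lambda c: (rank_sum[c], c))


# Toy rankings obtained with three structure representations (hypothetical)
dfp = ["a", "b", "c", "d"]
ac2d = ["b", "a", "d", "c"]
rdf = ["b", "d", "a", "c"]

fused = fuse_by_rank_sum([dfp, ac2d, rdf])
```

In the toy example compound "b" moves to the top of the fused list because it is ranked highly by two of the three representations, although it is only second in the fingerprint-based ranking.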

Table 5.7 WOMBAT active structures recovered by fusion of the ranked list obtained with different structure representations and improvement over similarity search using only Daylight fingerprints. Mean values over ten similarity search runs with different training sets.

activity    AC2D + DFP             AC2D + RDF             AC2D + DFP + RDF
            # actives  impr. (%)   # actives  impr. (%)   # actives  impr. (%)
5-HT3       403        21.4        399        20.2        431        29.8
5-HT1A      1612       15.7        1496       7.4         1715       23.1
D2          1243       14.4        1151       5.9         1348       24.0
Renin       551        4.4         566        7.2         568        7.6
AT1         1300       -0.5        1244       -4.8        1362       4.2
Thrombin    930        -3.3        883        -8.2        988        2.7
HIV-1 P     1830       -2.3        1783       -4.9        1954       4.3
PKC         121        0.8         128        6.7         130        8.3

impr. (%): improvement over DFP (%)

The total number of active structures (mean value over the ten repetitions) is shown together with the improvement compared to the commonly used similarity search with binary fingerprints alone. The results in Table 5.7 and the following discussion are based on the results obtained with the similarity search virtual screening method. This allows the estimation of the improvement relative to the common use of this method, in which only binary fingerprints are used. Similar tendencies were observed for the SOM novelty detection – the other virtual screening method under investigation. Figure 5.10 visualizes the improvement over Daylight fingerprints obtained in the three fusion experiments.
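The improvement figures can be checked with a short calculation. The sketch assumes that the improvement is the relative change of the mean number of recovered actives with respect to Daylight fingerprints alone; under this assumption the 5-HT3 entries of Table 5.7 are reproduced from the counts in Table 5.6 and Table 5.7. The function and variable names are hypothetical.

```python
def improvement_over_baseline(n_fused, n_baseline):
    """Relative change in the number of recovered actives, in per cent."""
    return 100.0 * (n_fused - n_baseline) / n_baseline


# Mean numbers of recovered 5-HT3 actives: Table 5.6 (DFP alone) and Table 5.7 (fused lists)
dfp_only = 332
fused_counts = {"AC2D + DFP": 403, "AC2D + RDF": 399, "AC2D + DFP + RDF": 431}

gains = {
    name: round(improvement_over_baseline(count, dfp_only), 1)
    for name, count in fused_counts.items()
}
```

A negative value, as found for some activity classes in Table 5.7, simply means that the fused list recovered fewer actives than Daylight fingerprints alone.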

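[Figure 5.10: plot not reproduced. The vertical axis ticks run from -10 to 35; the three groups on the horizontal axis are AC2D + DFP, AC2D + RDF, and AC2D + DFP + RDF; the labelled activity classes are 5-HT3, 5-HT1A, D2, Renin, PKC, AT1, HIV-1 P, and Thrombin.]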

Figure 5.10 Improvement over similarity search with Daylight fingerprints when fusing the ranked lists obtained with single structure representations. AC2D: topological autocorrelation; DFP: Daylight fingerprints; RDF: radial distribution function.

As a first attempt, the lists obtained with similarity search using Daylight fingerprints and topological autocorrelation were fused. As can be seen from Table 5.7 and Figure 5.10, roughly 1 to 21 per cent more actives were recovered for the 5-HT3, 5-HT1A, D2, Renin, and PKC activity classes. A slight decrease in the number of recovered actives compared to Daylight fingerprints was observed for the AT1, Thrombin, and HIV-1 P activity classes. However, the decrease, where it occurred, never exceeded about 3% (Thrombin inhibitors), while an improvement as high as 21% (5-HT3 antagonists) was observed. Therefore, the use of autocorrelation vectors based on atomic physico-chemical properties adds value to the commonly used similarity search with binary fingerprints.

In a second attempt, the lists obtained with topological autocorrelation and RDF codes were fused. In three cases (AT1, Thrombin, and HIV-1 P) five to eight per cent fewer actives were recovered compared to Daylight fingerprints (cf. Table 5.7 and Figure 5.10). However, for the rest of the activity classes the fusion of the lists obtained with topological autocorrelation and RDF codes resulted in the recovery of roughly 6 to 20 per cent more active structures compared to Daylight fingerprints. This result shows that the use of RDF codes, even though based on a single low-energy conformation, leads to better coverage of the activity space.

Finally, the results obtained with all three structure representations under investigation were fused. As can be seen from Table 5.7 and Figure 5.10, this resulted in the recovery of roughly 3 to 30% more active structures compared to the case in which only Daylight fingerprints were used.

To summarize, the fusion of the results obtained with different descriptors is most effective when the individual structure representations show similar performance, as demonstrated by the 5-HT3 and 5-HT1A activity classes (cf. Table 5.6 and Table 5.7). The fusion of the lists obtained with two real-valued descriptors (topological autocorrelation and RDF codes weighted by atomic physico-chemical properties) recovered a similar or higher number of actives compared to the use of Daylight fingerprints alone. This shows the utility of the real-valued descriptors and the corresponding physico-chemical properties in a ligand-based virtual screening experiment, especially when the much lower dimensionality of the real-valued descriptors is considered. In addition, the application of RDF codes, even when based on a single low-energy conformation, allows better coverage of the corresponding activity space.

The discussion so far has been based on the average number of active structures and chemotypes recovered from the ten runs with different training sets. In an attempt to translate some of the numbers discussed above into chemical language, we randomly selected a single run out of the ten repetitions for each activity class. This allows us to analyze the actual chemical structures. The results for two of the eight activity classes, 5-HT3 and D2 antagonists, are used as examples. The virtual screening for these activity classes resulted in average (BEDROC scores between 40 and 48, cf. Table 5.5) and low (BEDROC scores between 20 and 28, cf. Table 5.5) early recognition performance, respectively.

Chart 5.2 shows some of the 5-HT3 antagonists from WOMBAT recovered by only one of the structure representations when using similarity search with data fusion. Each structure represents a different chemotype. Five structures which represent some of the chemotypes recovered by all descriptor types are shown as well.

Chart 5.2 5-HT3 antagonists from WOMBAT

a) recovered exclusively with Daylight fingerprints: structures 3 to 7

b) recovered exclusively with topological autocorrelation: structures 8 to 12

c) recovered exclusively with RDF code: structures 13 to 17

d) recovered with all descriptor types: structures 18 to 22

[Structure drawings not reproduced.]


The structures shown in Chart 5.2 illustrate the ability of each individual descriptor to recover structures from different parts of the activity space. Amongst the structures recovered exclusively by Daylight fingerprints, a common ring system (piperazinopyridopyrroloquinoxaline) can be seen in structures 5 and 6 and a similar one (piperazinopyridopyrrolopyrazine) in structure 4. For the compounds discovered by the vectorial descriptors no such common structural building block can be distinguished. Thus, the compounds obtained with the real-valued vectorial descriptors are somewhat more diverse, as already discussed. A certain preference for recovering relatively small ligands is observed with RDF vectors (structures 13 and 16). Considering that structure 13 is not very flexible, it is surprising that it was overlooked by the topological autocorrelation. This artifact can be attributed to the lower dimensionality of the autocorrelation vectors, which may lead to a slight preference for larger structures. This preference results from the summation in Equation 5.10, which inevitably gives more weight to structures with more atoms.

While the RDF code is prone to the same problem, its higher dimensionality can alleviate the size effect to some extent. The conformation dependence of the RDF code, on the other hand, makes the result dependent on the particular 3D conformation, a fact which significantly complicates the interpretation of the results. However, the use of a single low-energy CORINA conformation leads to reasonable results, as can be seen from the structures recovered by all descriptors.
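The size effect of the autocorrelation summation can be illustrated with a small sketch. It assumes the common form of a topological autocorrelation component, a sum of products p_i * p_j over all atom pairs separated by a given number of bonds; Equation 5.10 is not reproduced here, so this form, the unbranched-chain toy molecule, and the function name are assumptions made only for illustration.

```python
def chain_autocorrelation(n_atoms, distance, atom_property=1.0):
    """Sum p_i * p_j over atom pairs of a linear chain `distance` bonds apart.

    Assumption: every atom carries the same property value, and the
    molecule is an unbranched chain, so the topological distance between
    atoms i and j is simply j - i.
    """
    total = 0.0
    for i in range(n_atoms):
        for j in range(i + 1, n_atoms):
            if j - i == distance:
                total += atom_property * atom_property
    return total


small = chain_autocorrelation(4, 1)   # 3 pairs at distance 1
large = chain_autocorrelation(8, 1)   # 7 pairs at distance 1
print(small, large)
```

With identical atomic properties the component grows with the number of atom pairs, so the larger chain obtains the larger value without any normalization.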

Chart 5.3 shows some of the D2 antagonists recovered by only one of the structure representations. Each structure represents a different chemotype which was recovered only by the corresponding descriptor. Five structures representing some of the chemotypes recovered by all descriptor types are shown as well.

In contrast to Chart 5.2, no common fragment can be discerned amongst the structures recovered by Daylight fingerprints. The chemotypes discovered exclusively by one of the descriptors are quite diverse in all cases. This is not surprising, bearing in mind the relatively low early recognition performance in this case (BEDROC scores between 0.2 and 0.28, cf. Table 5.5). Thus, it clearly demonstrates that each descriptor covers different parts of the activity space.


Chart 5.3 D2 antagonists from WOMBAT

a) recovered exclusively with Daylight fingerprints: structures 23 to 27

b) recovered exclusively with topological autocorrelation: structures 28 to 32

c) recovered exclusively with RDF code: structures 33 to 37

d) recovered with all descriptor types: structures 38 to 42

[Structure drawings not reproduced.]


5.3.2.2.2 How different are the compounds retrieved by each method?

In this section we compare both virtual screening methods in terms of recovered chemotypes. Since similar tendencies were observed with all structure representations, the following discussion concentrates on the results obtained with autocorrelation vectors, which recovered the most diverse set of actives (see the previous section). The results are summarized in Table 5.8.

Table 5.8 WOMBAT active structures and chemotypes recovered by different virtual screening methods. Mean values over ten similarity search runs with different training sets and topological autocorrelation vectors as descriptor.

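SSDF: similarity search with data fusion; ndSOM: SOM novelty detection.

| activity | SSDF active structures, total | SSDF active structures, unique | SSDF chemotypes, total | SSDF chemotypes, unique | ndSOM active structures, total | ndSOM active structures, unique | ndSOM chemotypes, total | ndSOM chemotypes, unique |
|---|---|---|---|---|---|---|---|---|
| 5-HT3 | 320 | 64 | 115 | 12 | 288 | 31 | 110 | 7 |
| 5-HT1A | 1178 | 222 | 313 | 43 | 1128 | 172 | 308 | 38 |
| D2 | 856 | 201 | 266 | 43 | 795 | 140 | 259 | 36 |
| Renin | 518 | 16 | 136 | 3 | 512 | 10 | 134 | 2 |
| AT1 | 1110 | 69 | 206 | 8 | 1081 | 39 | 203 | 4 |
| Thrombin | 689 | 119 | 204 | 26 | 655 | 85 | 202 | 24 |
| HIV-1 P | 1341 | 230 | 396 | 50 | 1228 | 118 | 377 | 32 |
| PKC | 94 | 15 | 20 | 1 | 85 | 6 | 20 | 2 |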

A small advantage of the similarity search is indicated by the slightly higher absolute number of recovered actives (“active structures” columns in Table 5.8). However, in terms of recovered chemotypes both methods show a similar performance (“chemotypes” columns in Table 5.8). Considering the unique structures and chemotypes recovered by each method, it is clear that, even though the same descriptor and training set were used, the two methods recovered different active structures. This observation comes as no surprise considering previous work in this field.58 Thus, whenever possible, the use of more than one virtual screening method is preferable. Combining the lists of compounds returned by each separate method usually leads to an improvement in the obtained results, as has been shown in a number of studies,39,58 including our previous work on SOM novelty detection.18
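The "total" and "unique" counts compared above amount to simple set operations on the items recovered by each method. The following sketch shows one way to compute them; the function name and the toy chemotype identifiers are hypothetical and serve only as an illustration.

```python
def compare_methods(found_a, found_b):
    """Count total, unique, and shared items recovered by two methods."""
    a, b = set(found_a), set(found_b)
    return {
        "total_a": len(a),
        "unique_a": len(a - b),   # recovered only by method A
        "total_b": len(b),
        "unique_b": len(b - a),   # recovered only by method B
        "shared": len(a & b),     # recovered by both methods
    }


# Hypothetical chemotype identifiers recovered by each method.
similarity_hits = ["c1", "c2", "c3", "c5"]
som_hits = ["c2", "c3", "c4"]

summary = compare_methods(similarity_hits, som_hits)
print(summary)
```

The union of both sets (here five chemotypes) is larger than either set alone, which is the reason why combining the two methods can be expected to help.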

Chart 5.4 5-HT3 antagonists from WOMBAT

a) recovered exclusively with similarity search: structures 43 to 47

b) recovered exclusively with SOM novelty detection: structures 48 to 51

c) recovered with both methods: structures 52 to 56

[Structure drawings not reproduced.]

When only a single method has to be used, the numbers shown in Table 5.8 do not show a preference for either of the two methods investigated in this study. However, the similarity search recovers a slightly larger number of actives, while the SOM novelty detection is twice as fast. Thus, we suggest that SOM novelty detection be used when execution speed is of concern. Since approximately the same number of different chemotypes is recovered by both methods (even if the particular chemotypes are not 100% the same), the retrieved compounds will still cover significant parts of the activity landscape.

Using the same activity class – 5-HT3 antagonists – as in the previous section the ability of both methods – similarity search with data fusion and SOM novelty detection – to discover different chemotypes is illustrated. Chart 5.4 shows five of the seven 5-HT3 active structures recovered exclusively by similarity search and the four active structures unique to SOM novelty detection. Each structure represents an active chemotype found only by the corresponding method. The results of the same particular run as in the previous section with topological autocorrelation as a structure representation are used. Five active structures recovered with both methods are shown as well.

5.3.2.2.3 Is the retrieval of new active chemotypes possible?

The clustering procedure used for selecting the training set out of the MDDR actives (cf. Materials and Methods, Screening protocol) ensured a diverse training set. Around 70 different chemotypes were found amongst the one hundred MDDR active compounds used as a training set. However, the number of recovered WOMBAT chemotypes exceeded the number of chemotypes in the training set by a factor as large as six in the case of HIV-1 P inhibitors (396 different chemotypes were recovered with topological autocorrelation, cf. Table 5.8, HIV-1 P row). As demonstrated by structures 4, 5, and 6 in Chart 5.2, some of the chemotypes, as identified by the Meqi UnSkCycMqn index, still possess a high degree of structural similarity. However, this observation alone cannot account for the much higher number of chemotypes recovered from WOMBAT. Thus, the virtual screening methods investigated in this study are able to retrieve chemotypes different from those contained in the training set.

5.3.2.2.4 Do the “false positives” look promising as new leads?

Until now we have concentrated our attention exclusively on the recovered known active structures. However, while providing a way to assess the performance of a virtual screening method in a retrospective fashion, the recovery of structures which are already known to be active is not very interesting. After all, it was known beforehand that these are active structures. On the other hand, although in the retrospective experiments we use the term false positive for any structure which does not belong to our set of known actives, very few, if any, of these structures have been tested against the screened enzyme or receptor. Therefore, these are structures with unknown activity rather than strictly false positives.

In this section, again using the results for 5-HT3 and D2 antagonists, we provide a short analysis of these “false positives”. We focus on the one hundred top-ranked false positives in an attempt to show the applicability of the proposed methods in a situation where very limited resources for subsequent biological assays are available, e.g., in a university research laboratory. In this respect, we want to stress that the actual one hundred top-ranked structures were predominantly known actives (between 50 and 95%, depending on the activity class). However, to simulate a prospective screening situation we concentrate here on the top-ranked false positives. The same virtual screening runs as in the previous section were used and the results with topological autocorrelation vectors are discussed.

The one hundred top-ranked structures of unknown activity represented a large number of different chemotypes: 64 for similarity search and 70 for SOM novelty detection when screening for 5-HT3 antagonists, and 54 for similarity search and 69 for SOM novelty detection when screening for D2 antagonists. For the 5-HT3 screen, Chart 5.5 shows the five active structures which represent the most frequently occurring chemotypes (at least three occurrences) amongst the one hundred top-ranked false positives for the similarity search with data fusion (Chart 5.5a) and for the SOM novelty detection (Chart 5.5b).

Some similarities between the structures in Chart 5.5 and some of the known 5-HT3 antagonists depicted in Chart 5.2 and Chart 5.4, such as the presence of different aza-bicyclic systems, can be seen. In addition, most of the false positives shown in Chart 5.5 have been tested and found to be active against different receptors which are closely related to 5-HT3. Compounds 57, 60, 61 and 63 are antagonists for different members of the serotonin family of receptors, while compound 65 has been found to act as a dopamine transporter (DAT) antagonist. Of course, not all of the shown compounds have been tested against the members of the serotonin or related families of receptors. Compound 59, for example, belongs to the activity class of bronchodilators in MDDR, while compound 64 is marked as a potent (pKi of 7.5) HIV-1 RT inhibitor in WOMBAT. However, more often than not the top-ranked false positives are related to the activity under investigation. Thus, considering the top-ranked false positives is likely to result in the discovery of previously unknown active structures.


Chart 5.5 Active structures which represent the most frequently (at least three times) occurring chemotypes amongst the one hundred top-ranked false positives

a) recovered with similarity search: structures 57 to 61

b) recovered with SOM novelty detection: structures 62 to 66

[Structure drawings not reproduced.]

Similar observations can be made when examining the one hundred top-ranked false positives for the seven remaining activity classes. Taking the D2 antagonists activity class as an example, most of the false-positive compounds belong to the antipsychotic MDDR activity class or are marked as antagonists of different alpha adrenergic receptors. On the one hand, the dopamine antagonists are members of the broader antipsychotic MDDR activity class. On the other hand, a large percentage of the D2 antagonists in WOMBAT act as antagonists of different alpha adrenergic receptors as well. Therefore, the top-ranked false-positive compounds recovered by the virtual screening methods investigated in this study are likely to share the target activity.

5.3.2.3 Scenario 2 summary

We conclude the discussion about scenario 2 – selecting compounds for a subsequent lead-optimization – by summarizing the main points from the above discussion.

• using a training set consisting of more than one hundred active compounds did not improve the results

• a high bias towards retrieving actives from the database used for building the training set was found

• similarity search with data fusion gave slightly better performance

• all three descriptor types – Daylight fingerprints, topological autocorrelation, and RDF codes – gave reasonable performance and no descriptor was identified as best for all activity classes

• the fusion of the ranked lists obtained with different structure representations improves the results of a virtual screening experiment

• while no particular advantage of using only a 3D descriptor – RDF code based on single low-energy conformation – was found, this descriptor covers different aspects of the activity spaces compared to the 2D descriptors and brought new information in the fusion experiments

• the most diverse sets of active compounds – in terms of graph-based chemotype index – were recovered by using topological autocorrelation to describe the structures

• the recovery of chemotypes not present in the training set was possible with both methods and with all descriptor types

• an analysis of the one hundred top-ranked structures of unknown activity (false positives) reveals that this set contains chemotypes which are highly likely to share the target activity.

5.3.3 Scenario 3: Is a given compound active?

The problem investigated in this scenario is very similar to a classification task in which only active structures are used as a training set. It is well known30 that for most machine-learning techniques a larger training set usually gives better results. In addition, a bias towards performing better on the database used to select the training set exists naturally. Therefore, we will not discuss in detail the optimal size of the training set or the difference in predicting compounds from different chemical spaces. Instead, the results obtained with training sets consisting of one hundred active compounds when classifying actives from an external database (WOMBAT) will be discussed. In addition, for reasons which will become clear in the next section, the results obtained with SOM novelty detection will be the main focus of our discussion.

5.3.3.1 Which method to use?

The task of turning a similarity search into a classifier is immediately confronted with the question of determining a threshold value for the used similarity measure. Screened structures which obtain similarity scores lower than this threshold are then considered unlikely to share the activity of the training set.

There are many reports2,9,13,54 which utilize different kinds of binary fingerprints for similarity search or for library design. However, linking the value of the Tanimoto coefficient to the probability of activity has been somewhat difficult. Initially, two molecules with a pairwise Tanimoto coefficient of 0.85 were expected to have an 80% chance of sharing the same activity. However, in later studies this probability was re-evaluated and brought down to 30%. A Tanimoto coefficient as low as 0.55 was found necessary in order to include most examples in a patent.15 Thus, the value of such a threshold is very hard to determine and the optimal threshold value may vary from one activity class to another. In addition, the problem is even more pronounced when data fusion is used. This is demonstrated by the fact that in our experiments a threshold of 0.75 on the binary Tanimoto coefficient classified, on average, 823, 768, 809, 1909, 587, 3121, 1835, and 926 structures as belonging to the 5-HT3, 5-HT1A, D2, Renin, AT1, Thrombin, HIV-1 P, and PKC activity classes, respectively.

In the case of real-valued structure representations the problem is even more pronounced since far fewer prior studies exist. Thus, we decided not to search for a “best” threshold value of the real-valued Tanimoto coefficient. Such a value is bound to be activity and descriptor specific and, therefore, of limited practical use. As an example illustrating this point, with a threshold of 0.75 and topological autocorrelation almost all screened structures were classified as actives when screening for 5-HT3 antagonists. The same threshold with RDF codes classified, on average, 1074 of the WOMBAT structures as 5-HT3 antagonists.
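The threshold problem discussed above can be made concrete with a short sketch of a similarity search turned into a classifier. The binary Tanimoto coefficient is written in its usual form c / (a + b - c); for real-valued vectors the common continuous form, the dot product divided by the sum of the squared norms minus the dot product, is assumed here and may differ in detail from the definition used elsewhere in this work. The function names, the toy fingerprints, and the default threshold are illustrative assumptions.

```python
def tanimoto_binary(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    a, b = set(bits_a), set(bits_b)
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def tanimoto_real(x, y):
    """Continuous Tanimoto coefficient of two real-valued vectors (assumed form)."""
    dot = sum(p * q for p, q in zip(x, y))
    denom = sum(p * p for p in x) + sum(q * q for q in y) - dot
    return dot / denom if denom else 0.0


def predicted_active(score, threshold=0.75):
    """Classify a screened structure as active if its score reaches the threshold."""
    return score >= threshold


# Hypothetical fingerprints: three common bits, five bits in the union.
fp_query = {1, 4, 7, 9}
fp_candidate = {1, 4, 9, 12}
score = tanimoto_binary(fp_query, fp_candidate)
print(score, predicted_active(score))
```

The same threshold can admit very different numbers of structures for binary and real-valued descriptors, because the two coefficients are distributed differently over the screened database.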

The SOM novelty detection handles the threshold determination implicitly. Given the difficulties encountered in determining a reasonable threshold value, which is needed to turn a similarity search method into a classifier, the SOM novelty detection was identified as the method of choice in this scenario.

5.3.3.2 Which descriptor to use?

The precision and recall values obtained with SOM novelty detection and the different structure representations when retrieving WOMBAT actives are summarized in Table 5.9. Mean values over the ten repetitions with different training sets are reported.

Table 5.9 Precision and recall values obtained with SOM novelty detection when retrieving WOMBAT actives.

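| activity | DFP: predicted as active | DFP: recall (%) | DFP: precision (%) | AC2D: predicted as active | AC2D: recall (%) | AC2D: precision (%) | RDF: predicted as active | RDF: recall (%) | RDF: precision (%) |
|---|---|---|---|---|---|---|---|---|---|
| 5-HT3 | 30065 | 53.5 | 1.9 | 7175 | 35.7 | 6 | 645 | 15.1 | 19.8 |
| 5-HT1A | 26307 | 58 | 7.5 | 10420 | 40.7 | 10.7 | 1527 | 13.2 | 26.2 |
| D2 | 10483 | 30.3 | 9.7 | 7447 | 21.2 | 9.1 | 1565 | 9.1 | 18.9 |
| Renin | 6737 | 73.1 | 6.6 | 2561 | 61.1 | 18.3 | 325 | 26.1 | 57.3 |
| AT1 | 1611 | 64.3 | 56.5 | 3125 | 58.5 | 28.7 | 369 | 13.7 | 58.7 |
| Thrombin | 19592 | 54.4 | 5.7 | 12661 | 36 | 5.5 | 3360 | 14.4 | 10.5 |
| HIV-1 P | 24784 | 76.8 | 12.3 | 13542 | 52.6 | 13.2 | 1142 | 16.6 | 47.9 |
| PKC | 35669 | 77.6 | 0.8 | 15421 | 61.7 | 0.8 | 3104 | 67.4 | 5.9 |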

A relatively broad description of the active space was achieved by using Daylight fingerprints. As many as 10% of all screened structures were classified as actives in some cases (5-HT3, 5-HT1A, HIV-1 P, PKC). Consequently, the highest recall values were obtained. However, this comes at the price of many false positives. In other words, the confidence that a structure is correctly predicted as active is low.

At the other extreme, a very tight description of the active space is achieved when RDF codes are used to describe the structures. Consequently, the highest precision values are obtained: on average, there is around a 30% chance that a structure predicted as active with RDF codes is indeed active. This high precision comes at the price of a relatively high false negative ratio, as indicated by the low recall values. This tight description of the chemical space spanned by the query structures can be attributed to different reasons. One such reason is that the RDF code descriptor is conformation sensitive. This fact, combined with the aforementioned use of a single low-energy conformation, may narrow the active space, since in addition to structural and physicochemical features the location of the atoms in 3D space imposes further constraints. Another possible reason is that the selected dimensionality (four combined 64-dimensional RDF codes) is actually too high and the underlying SOM is overfitted to the particular set of query structures.

As can be seen from Table 5.9, on average around 3.5% of all screened structures were predicted as actives when using topological autocorrelation. The obtained results lie between those of SOM novelty detection with binary fingerprints and with RDF codes. A slightly higher false positive ratio than the one obtained with RDF codes and a slightly lower true positive ratio than the one obtained with Daylight fingerprints were observed. Thus, the use of autocorrelation vectors provided the best precision/recall trade-off and is the recommended descriptor in this scenario.
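For reference, the precision and recall values discussed above follow from the set of structures predicted as active and the set of known actives. The sketch below uses the standard definitions and reports both values in per cent; the function name and the toy identifiers are hypothetical.

```python
def precision_recall(predicted_active, known_active):
    """Return (precision, recall) in per cent for a set of predictions."""
    predicted, known = set(predicted_active), set(known_active)
    true_positives = len(predicted & known)
    # Precision: fraction of the predicted actives that are known actives.
    precision = 100.0 * true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of the known actives that were predicted as active.
    recall = 100.0 * true_positives / len(known) if known else 0.0
    return precision, recall


# Hypothetical example: four structures predicted as active, three known actives.
precision, recall = precision_recall({"a", "b", "c", "d"}, {"b", "d", "e"})
print(round(precision, 1), round(recall, 1))
```

A descriptor that predicts many structures as active tends to raise recall and lower precision, which is the trade-off visible when comparing the Daylight fingerprint and RDF code columns of Table 5.9.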

5.3.3.3 Scenario 3 summary

We conclude the discussion about scenario 3 – is a given structure active – by summarizing the main points from the above discussion.

• the determination of a good threshold value for the similarity search with data fusion is activity and descriptor dependent

• the SOM novelty detection was identified as the method of choice

• the topological autocorrelation was identified as the descriptor providing the best precision/recall trade-off

5.3.4 Scenario 4: Identification of the most active compound

The objective in this scenario was to investigate to what extent the ranking produced by a virtual screening method correlates with known activity values. To achieve this, the known active structures from the WOMBAT database for which activity values are available were ranked using similarity search with data fusion and SOM novelty detection. The correspondence between the obtained virtual screening scores and the activity values was measured by Kendall’s τ correlation coefficient. The results are summarized in Table 5.10.


Table 5.10 Kendall’s rank correlation between the rank obtained in the virtual screening and the rank according to the activity values when retrieving WOMBAT actives. Mean and standard deviation over ten runs with different training sets.

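| activity | DFP SSDF mean | DFP SSDF sd | DFP ndSOM mean | DFP ndSOM sd | AC2D SSDF mean | AC2D SSDF sd | AC2D ndSOM mean | AC2D ndSOM sd | RDF SSDF mean | RDF SSDF sd | RDF ndSOM mean | RDF ndSOM sd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5-HT3 | 0.13 | 0.05 | 0.16 | 0.04 | 0.18 | 0.02 | 0.14 | 0.02 | 0.15 | 0.03 | 0.18 | 0.04 |
| 5-HT1A | 0.18 | 0.03 | 0.17 | 0.03 | 0.18 | 0.02 | 0.17 | 0.02 | 0.12 | 0.02 | 0.11 | 0.03 |
| D2 | 0.15 | 0.02 | 0.16 | 0.02 | 0.14 | 0.01 | 0.13 | 0.01 | 0.07 | 0.02 | 0.09 | 0.01 |
| Renin | 0.14 | 0.04 | 0.09 | 0.06 | 0.03 | 0.05 | 0.02 | 0.05 | 0.10 | 0.05 | 0.12 | 0.04 |
| AT1 | 0.01 | 0.10 | 0.01 | 0.09 | 0.10 | 0.03 | 0.10 | 0.04 | 0.12 | 0.04 | 0.10 | 0.05 |
| Thrombin | 0.22 | 0.02 | 0.22 | 0.03 | 0.21 | 0.03 | 0.19 | 0.04 | 0.23 | 0.03 | 0.24 | 0.04 |
| HIV-1 P | 0.15 | 0.09 | 0.17 | 0.08 | 0.14 | 0.06 | 0.11 | 0.05 | 0.09 | 0.06 | 0.09 | 0.05 |
| PKC | 0.37 | 0.04 | 0.34 | 0.08 | 0.29 | 0.04 | 0.28 | 0.05 | 0.29 | 0.08 | 0.25 | 0.06 |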

As can be seen from Table 5.10, no significant difference between the two virtual screening methods was found. For none of the screened activity classes was a significant correlation found (Kendall’s τ between -0.1 and 0.2). The best results were obtained for the PKC activity class. Even in this case no Kendall’s τ larger than 0.37 was achieved. The low correlation was not unexpected, since neither virtual screening method takes the activity of the query structures into account. Recall that the query structures were drawn from MDDR, a database in which no measured activity values are available. Therefore, it is not justified to make assumptions regarding the potency of a given structure based on its position in the ranked list produced by either similarity search or SOM novelty detection.
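To make the correlation measure concrete, a minimal sketch of Kendall's τ is given below. It implements the simple τ-a variant, in which tied pairs count as neither concordant nor discordant; which variant was used to produce Table 5.10 is not specified here, so this choice, the function name, and the toy scores and activity values are assumptions.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equally long sequences of values."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1      # pair ordered the same way in both sequences
        elif product < 0:
            discordant += 1      # pair ordered in opposite ways
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Hypothetical screening scores and measured activities (e.g. pKi values).
scores = [0.9, 0.8, 0.7, 0.6]
activities = [7.0, 8.0, 6.0, 5.5]
tau = kendall_tau(scores, activities)
print(round(tau, 3))
```

In this toy example five of the six pairs are concordant and one is discordant, giving τ = 4/6; values of the magnitude reported in Table 5.10 correspond to only a slight excess of concordant over discordant pairs.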

5.4 Conclusions

Summaries containing scenario-specific conclusions were presented throughout the discussion. In short, the applicability of two different virtual screening methods – similarity search with subsequent data fusion and novelty detection with Self-Organizing Maps – was tested in four different virtual screening scenarios. Three different ways of representing chemical structures – binary fingerprints, topological autocorrelation and radial distribution function (RDF) codes – were examined in combination with both virtual screening methods. Both virtual screening methods were found applicable for scenario 1 – prioritizing compounds for subsequent high-throughput screening – and scenario 2 – selecting a predefined (small) number of potentially active compounds from a large chemical database. The SOM novelty detection is preferred for scenario 3 – assessing the probability that a given structure will exhibit a given activity. Both methods were found inapplicable for scenario 4 – selecting the most active structure(s) for a biological assay. The performance of the different descriptors was found to be dependent on the activity class. While no “best-for-all” descriptor was identified, it was found that the topological autocorrelation usually offers the best dimensionality/performance ratio. The use of a 3D vectorial descriptor based on a single low-energy conformation (RDF codes) alone gave similar results to the 2D descriptors. However, it covered different parts of the activity spaces under investigation. Consequently, the fusion of the ranked lists obtained with RDF codes and a 2D descriptor improved the results. A bias towards retrieving compounds from the same database which was used to select the training set was found. Increasing the size of the training set beyond one hundred compounds did not bring a significant improvement in any of the scenarios. The studied virtual screening methods were able to recover chemotypes not present in the training set. In addition, an analysis of the top-ranked false positive structures revealed that these structures are likely to share the target activity. Therefore, the proposed methods are likely to work well in prospective virtual screening experiments.

Finally, all ranked lists used throughout this study with the actual similarity scores are freely available at http://www2.chemie.uni-erlangen.de/people/Dimitar_Hristozov/sprt_info. These lists can be used to calculate any performance metric for comparative purposes. The trained self-organizing maps, used as novelty detectors for each activity, are included as well. These can be used to screen any database of chemical compounds, provided that the structures are described with the same descriptor. A simple Python script is provided for this purpose.
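As an illustration of how a performance metric might be computed from such a ranked list, the sketch below calculates an enrichment factor for the top k positions. This is not the script mentioned above: the enrichment factor is only one example metric, the file format of the published lists is not assumed, and the function name and toy identifiers are hypothetical.

```python
def enrichment_factor(ranked_ids, active_ids, top_k):
    """Ratio of the active fraction in the top k to that in the whole list."""
    actives = set(active_ids)
    if not ranked_ids or not actives or top_k <= 0:
        return 0.0
    top = ranked_ids[:top_k]
    hits_top = sum(1 for s in top if s in actives)
    hits_all = sum(1 for s in ranked_ids if s in actives)
    if hits_all == 0:
        return 0.0
    return (hits_top / len(top)) / (hits_all / len(ranked_ids))


# Hypothetical ranked list (best first) with two known actives.
ranked = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s10"]
known_actives = {"s1", "s4"}
ef = enrichment_factor(ranked, known_actives, top_k=5)
print(ef)
```

Here both actives fall within the top five of ten structures, so the top half of the list is enriched twofold relative to a random ordering.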

A full-featured application allowing the rapid screening of large databases and including the described ligand-based virtual screening methods together with different data fusion techniques is under development and will be available soon.


5.5 References

(1) Walters, W. P.; Stahl, M. T.; Murcko, M. A. Virtual Screening - an Overview. Drug Discov. Today 1998, 3, 160-178.

(2) Bajorath, J. Selected Concepts and Investigations in Compound Classification, Molecular Descriptor Analysis, and Virtual Screening. J. Chem. Inf. Model. 2001, 41, 233-245.

(3) Bajorath, J. Integration of Virtual and High-Throughput Screening. Nat. Rev. Drug Discov. 2002, 1, 882-884.

(4) Oprea, T. I.; Matter, H. Integrating Virtual Screening in Lead Discovery. Curr. Opin. Chem. Biol. 2004, 8, 349-358.

(5) Willett, P.; Barnard, J. M.; Downs, G. M. Chemical Similarity Searching. J. Chem. Inf. Model. 1998, 38, 983-996.

(6) Bleicher, K. H.; Böhm, H. J.; Müller, K.; Alanine, A. Hit and Lead Generation: Beyond High-Throughput Screening. Nat. Rev. Drug Discov. 2003, 2, 369-378.

(7) Kearsley, S. K.; Sallamack, S.; Fluder, E. M.; Andose, J. D.; Mosley, R. T.; Sheridan, R. P. Chemical Similarity Using Physiochemical Property Descriptors. J. Chem. Inf. Model. 1996, 36, 118-127.

(8) Bologa, C.; Revankar, C. M.; Young, S. M.; Edwards, B. S.; Arterburn, J. B.; Kiselyov, A. S.; Parker, M. A.; Tkachenko, S. E.; Savchuck, N. P.; Sklar, L. A.; Oprea, T. I.; Prossnitz, E. R. Virtual and Biomolecular Screening Converge on a Selective Agonist for GPR30. Nat. Chem. Biol. 2006, 2, 207-212.

(9) Hert, J.; Willett, P.; Wilton, D. J.; Acklin, P.; Azzaoui, K.; Jacoby, E.; Schuffenhauer, A. Comparison of Fingerprint-Based Methods for Virtual Screening Using Multiple Bioactive Reference Structures. J. Chem. Inf. Model. 2004, 44, 1177-1185.

(10) Hert, J.; Willett, P.; Wilton, D. J.; Acklin, P.; Azzaoui, K.; Jacoby, E.; Schuffenhauer, A. Comparison of Topological Descriptors for Similarity-Based Virtual Screening Using Multiple Bioactive Reference Structures. Org. Biomol. Chem. 2004, 2, 3256-3266.

(11) Bender, A.; Jenkins, J. L.; Glick, M.; Deng, Z.; Nettles, J. H.; Davies, J. W. "Bayes Affinity Fingerprints" Improve Retrieval Rates in Virtual Screening and Define Orthogonal Bioactivity Space: When Are Multitarget Drugs a Feasible Concept? J. Chem. Inf. Model. 2006, 46, 2445-2456.

(12) Chen, B.; Harrison, R. F.; Papadatos, G.; Willett, P.; Wood, D. J.; Lewell, X. Q.; Greenidge, P.; Stiefl, N. Evaluation of Machine-Learning Methods for Ligand-Based Virtual Screening. J. Comput. Aid. Mol. Des. 2007, 21, 53-62.

(13) Martin, Y. C.; Kofron, J. L.; Traphagen, L. M. Do Structurally Similar Molecules Have Similar Biological Activity? J. Med. Chem. 2002, 45, 4350-4358.

(14) Matter, H. Selecting Optimally Diverse Compounds From Structure Databases: A Validation Study of Two-Dimensional and Three-Dimensional Molecular Descriptors. J. Med. Chem. 1997, 40, 1219-1229.

(15) Martin, Y. C. What Works and What Does Not: Lessons From Experience in a Pharmaceutical Company. QSAR Combinat. Sci. 2006, 25, 1192-1200.

(16) Markou, M.; Singh, S. Novelty Detection: a Review - Part 1: Statistical Approaches. Signal Process. 2003, 83, 2481-2497.

(17) Markou, M.; Singh, S. Novelty Detection: a Review - Part 2: Neural Network Based Approaches. Signal Process. 2003, 83, 2499-2521.

(18) Hristozov, D.; Oprea, T. I.; Gasteiger, J. Ligand-based Virtual Screening by Novelty Detection with Self-Organizing Maps, 2007, submitted to Journal of Chemical Information and Modeling.

(19) Kitchen, D. B.; Decornez, H.; Furr, J. R.; Bajorath, J. Docking and Scoring in Virtual Screening for Drug Discovery: Methods and Applications. Nat. Rev. Drug Discov. 2004, 3, 935-949.

(20) MDL Information Systems, Inc. MDL Drug Data Report, version 2006.1.

(21) Olah, M.; Mracec, M.; Ostopovici, L.; Rad, R.; Bora, A.; Hadaruga, N.; Olah, I.; Banda, M.; Simon, Z.; Mracec, M.; Oprea, T. I. WOMBAT: World of Molecular Bioactivity. In Cheminformatics in Drug Discovery; Oprea, T. I., Ed.; Wiley-VCH: New York, 2003.

(22) Taylor, R. Simulation Analysis of Experimental Design Strategies for Screening Random Compounds As Potential New Drugs and Agrochemicals. J. Chem. Inf. Model. 1995, 35, 59-67.

(23) Butina, D. Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets. J. Chem. Inf. Model. 1999, 39, 747-750.

(24) Truchon, J. F.; Bayly, C. I. Evaluating Virtual Screening Methods: Good and Bad Metrics for the "Early Recognition" Problem. J. Chem. Inf. Model. 2007, 47, 488-508.

(25) Edgar, S. J.; Holliday, J. D.; Willett, P. Effectiveness of Retrieval in Similarity Searches of Chemical Databases: A Review of Performance Measures. J. Mol. Graph. Model. 2000, 18, 343-357.

(26) Hanley, J. A.; McNeil, B. J. The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve. Radiology 1982, 143, 29-36.

(27) Hanley, J. A.; McNeil, B. J. A Method of Comparing the Areas Under Receiver Operating Characteristic Curves Derived From the Same Cases. Radiology 1983, 148, 839-843.

(28) Triballeau, N.; Acher, F.; Brabet, I.; Pin, J. P.; Bertrand, H. O. Virtual Screening Workflow Development Guided by the "Receiver Operating Characteristic" Curve Approach. Application to High-Throughput Docking on Metabotropic Glutamate Receptor Subtype 4. J. Med. Chem. 2005, 48, 2534-2547.

(29) Cleves, A. E.; Jain, A. N. Robust Ligand-Based Modeling of the Biological Targets of Known Drugs. J. Med. Chem. 2006, 49, 2921-2938.

(30) Witten, I. H.; Frank, E. Data Mining: Practical Machine Learning Tools and Techniques; Morgan Kaufmann: San Francisco, 2000.

(31) Yao, Y. Y. Measuring Retrieval Effectiveness Based on User Preference of Documents. J. Am. Soc. Inform. Sci. 1995, 46, 133-145.

(32) Whittle, M.; Gillet, V. J.; Willett, P.; Alex, A.; Loesel, J. Enhancing the Effectiveness of Virtual Screening by Fusing Nearest Neighbor Lists: A Comparison of Similarity Coefficients. J. Chem. Inf. Model. 2004, 44, 1840-1848.

(33) Hert, J.; Willett, P.; Wilton, D. J.; Acklin, P.; Azzaoui, K.; Jacoby, E.; Schuffenhauer, A. Enhancing the Effectiveness of Similarity-Based Virtual Screening Using Nearest-Neighbor Information. J. Med. Chem. 2005, 48, 7049-7054.

(34) Kohonen, T. Self-Organizing Maps; Springer: Berlin, 2001.

(35) Sykora, V. Chemical Descriptors Library, 2007, http://cdelib.sourceforge.net (accessed 06.2006)

(36) Moreau, G.; Broto, P. Autocorrelation of a Topological Structure: A New Molecular Descriptor. New J. Chem. 1980, 4, 359-360.

(37) Bauknecht, H.; Zell, A.; Bayer, H.; Levi, P.; Wagener, M.; Sadowski, J.; Gasteiger, J. Locating Biologically Active Compounds in Medium-Sized Heterogeneous Datasets by Topological Autocorrelation Vectors: Dopamine and Benzodiazepine Agonists. J. Chem. Inf. Model. 1996, 36, 1205-1213.

(38) Spycher, S.; Pellegrini, E.; Gasteiger, J. Use of Structure Descriptors To Discriminate Between Modes of Toxic Action of Phenols. J. Chem. Inf. Model. 2005, 45, 200-208.

(39) Fechner, U.; Franke, L.; Renner, S.; Schneider, P.; Schneider, G. Comparison of Correlation Vector Methods for Ligand-Based Similarity Searching. J. Comput. Aid. Mol. Des. 2003, 17, 687-698.

(40) Spycher, S.; Nendza, M.; Gasteiger, J. Comparison of Different Classification Methods Applied to a Mode of Toxic Action Data Set. QSAR Combinat. Sci. 2004, 23, 779-791.

(41) Teckentrup, A.; Briem, H.; Gasteiger, J. Mining High-Throughput Screening Data of Combinatorial Libraries: Development of a Filter to Distinguish Hits From Nonhits. J. Chem. Inf. Model. 2004, 44, 626-634.

(42) Hutchings, M. G.; Gasteiger, J. Residual Electronegativity - an Empirical Quantification of Polar Influences and Its Application to the Proton Affinity of Amines. Tetrahedron Lett. 1983, 24, 2541-2544.

(43) Gasteiger, J.; Hutchings, M. G. New Empirical Models of Substituent Polarisability and Their Application to Stabilisation Effects in Positively Charged Species. Tetrahedron Lett. 1983, 24, 2537-2540.

(44) Gasteiger, J.; Marsili, M. Iterative Partial Equalization of Orbital Electronegativity--a Rapid Access to Atomic Charges. Tetrahedron 1980, 36, 3219-3228.

(45) Hollas, B. An Analysis of the Autocorrelation Descriptor for Molecules. J. Math. Chem. 2003, 33, 91-101.

(46) ADRIANA.Code, version 1.0. Molecular Networks GmbH, Erlangen, Germany, http://www.molecular-networks.com (accessed 06.2007)

(47) Hemmer, M. C.; Steinhauer, V.; Gasteiger, J. Deriving the 3D Structure of Organic Molecules From Their Infrared Spectra. Vib. Spectrosc. 1999, 19, 151-164.

(48) Sadowski, J.; Gasteiger, J. From Atoms and Bonds to Three-Dimensional Atomic Coordinates: Automatic Model Builders. Chem. Rev. 1993, 93, 2567-2581.

(49) CORINA, version 3.2. Molecular Networks GmbH, Erlangen, Germany, http://www.molecular-networks.com (accessed 06.2007).

(50) Renner, S.; Schwab, C. H.; Gasteiger, J.; Schneider, G. Impact of Conformational Flexibility on Three-Dimensional Similarity Searching Using Correlation Vectors. J. Chem. Inf. Model. 2006, 46, 2324-2332.

(51) Sammon, J. R. A Nonlinear Mapping for Data Structure Analysis. IEEE T. Comput. 1969, C-18, 401-409.

(52) R Development Core Team. R: A language and environment for statistical computing, version 2.2.1, 2005, http://www.r-project.org (accessed 01.2006).

(53) Venables, W. N.; Ripley, B. D. Modern Applied Statistics With S; Springer: New York, USA, 2002.

(54) Brown, R. D.; Martin, Y. C. The Information Content of 2D and 3D Structural Descriptors Relevant to Ligand-Receptor Binding. J. Chem. Inf. Model. 1997, 37, 1-9.

(55) MeqiSuite, version 2.30, Pannanugget Consulting L.L.C., Kalamazoo, MI, USA, http://www.pannanugget.com (accessed 03.2007)

(56) Johnson, M. An Introduction to the MeqiSuite Indices. Technical Report 2006/001, Pannanugget Consulting, Inc., Kalamazoo, MI, USA, 2006.

(57) Ginn, C.; Willett, P.; Bradshaw, J. Combination of Molecular Similarity Measures Using Data Fusion. Persp. Drug Discov. Des. 2000, 20, 1-16.

(58) Sheridan, R. P.; Kearsley, S. K. Why Do We Need So Many Chemical Similarity Search Methods? Drug Discov. Today 2002, 7, 903-911.

6 Conclusion and Outlook

The number of known chemical compounds has exceeded 30 million, and another 800,000 new compounds are added each year.1 These overwhelming numbers make it hard to assimilate and understand the data, even for highly qualified scientists working in their own specialty. As a result, the field of machine learning has emerged as an attempt to use the constantly increasing computational power in a way that helps humans discover some of the knowledge hidden in these data.

The main goal of this work was to apply some of the machine learning techniques in an attempt to discover knowledge from chemical data. Two different fields – chemotaxonomy and virtual screening – were investigated.

Our exploration of the relationships between the taxonomic classification of the plant family Asteraceae and the secondary metabolism of the Asteraceae species has demonstrated how different classification techniques can be used to gain knowledge from relatively small chemical data sets. As a first attempt, described in Chapter 2, different classification methods were applied to the assignment of a class of plant secondary metabolites – sesquiterpene lactones (STLs) – to seven tribes of the plant family Asteraceae. A classification model capable of assigning the sesquiterpene lactones to seven Asteraceae tribes was developed. The applicability of the model for (1) identifying plant sources for a given STL and (2) studying the relationship in the secondary metabolism across different tribes of individual plant species was shown. Good agreement with the taxonomic division proposed by Bremer2 was found. A k-nearest neighbour classifier with k = 1 gave the best results, regardless of the structural descriptor used. Two chemical structure descriptors were investigated: a histogram of atom counts augmented with stereo information, and RDF codes. The RDF codes gave better results, and the difference in performance was found to be statistically significant. Two approaches to identifying patterns that are likely to be misclassified were studied: (1) a distance metric based on principal component analysis and the Hotelling T² statistic3 and (2) a rejection rule based on the estimated a posteriori probabilities and the distances to the nearest neighbors.4 While the former did not bring any significant improvement, the latter provided a useful way to reject patterns whose classification the machine learning technique was not sufficiently confident about.
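
The combination of a 1-nearest-neighbour classifier with a rejection option can be illustrated as follows. This is a deliberately simplified sketch: it rejects a pattern only when its nearest neighbour lies farther away than a fixed threshold, whereas the rule used in Chapter 2 also draws on a posteriori probability estimates. The function name, the threshold, the tribe labels, and the descriptor values are all hypothetical.

```python
import math

def classify_1nn(query, training_set, max_distance):
    """Return the label of the nearest training pattern, or None if rejected.

    training_set: list of (descriptor_vector, tribe_label) pairs.
    max_distance: hypothetical rejection threshold; a query whose nearest
    neighbour lies farther away than this is left unclassified.
    """
    distance, label = min(
        (math.dist(query, vec), lab) for vec, lab in training_set
    )
    return label if distance <= max_distance else None

# Toy training set with placeholder tribe labels and made-up descriptors.
training_set = [([0.0, 0.0], "tribe_A"), ([4.0, 4.0], "tribe_B")]
near = classify_1nn([0.5, 0.2], training_set, max_distance=1.0)  # accepted
far = classify_1nn([2.0, 2.0], training_set, max_distance=1.0)   # rejected
```

Returning `None` instead of a forced label makes the rejected patterns easy to count and inspect separately.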

Chapter 3 described a logical extension of the previous study. The possibility of assigning sesquiterpene lactones to more than one Asteraceae tribe simultaneously was investigated. To achieve this, the concept of multi-labeled classification was introduced. This approach overcame the problem of secondary metabolites which appear in several taxa and helped us to model the reality as closely as possible. Two multi-labeled classification methods – cross-training with a support vector machine5 as the classifier and multi-labeled k-nearest neighbor6 – were successfully applied to the assignment of sesquiterpene lactones to more than one Asteraceae tribe simultaneously. The utility of the proposed classification model for a targeted collection of plant material with the aim of finding a particular natural compound was demonstrated. The SVM model yielded better results, outperforming both the multi-labeled k-nearest neighbor and the single-labeled k-NN classifier described in Chapter 2. Thus, the use of a more sophisticated classification technique resulted in better classification performance compared to the relatively simple k-NN classifier described in Chapter 2. For the STLs which have been isolated from more than one tribe, i.e., the multi-labeled STLs, both methods (cross-training with support vector machines and multi-labeled k-nearest neighbor) performed reasonably, although not as well as when separating the individual tribes. The proposed methodology was found valuable (1) for studying the relationships between the secondary metabolism of the plant family Asteraceae and its current taxonomic classification and (2) for assisting in the targeted collection of plant material with the aim of isolating particular sesquiterpene lactones.
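
The data preparation behind cross-training can be sketched as follows, assuming the usual formulation in which a multi-labeled example serves as a positive example for every class it belongs to. One binary classifier, for instance an SVM, would then be trained on each resulting set. The classifier itself is omitted here, and the function name, labels, and descriptor values are hypothetical.

```python
def cross_training_sets(samples, classes):
    """Build one binary training set per class.

    samples: list of (descriptor_vector, set_of_labels) pairs.
    Returns a dict mapping each class to a list of
    (descriptor_vector, is_positive) pairs. A multi-labeled sample is
    positive in every class it belongs to.
    """
    binary_sets = {}
    for cls in classes:
        binary_sets[cls] = [(vec, cls in labels) for vec, labels in samples]
    return binary_sets

# Toy data with placeholder tribe labels.
samples = [
    ([0.1, 0.9], {"tribe_A"}),
    ([0.8, 0.2], {"tribe_B"}),
    ([0.5, 0.5], {"tribe_A", "tribe_B"}),  # isolated from both tribes
]
sets = cross_training_sets(samples, ["tribe_A", "tribe_B"])
```

At prediction time, every class whose binary classifier responds positively would be assigned, which is what allows a compound to receive more than one tribe label.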

The studies presented in Chapter 2 and Chapter 3 provided an exploration in the field of chemotaxonomy and secondary metabolism of plant species from Asteraceae. However, the importance of secondary metabolites as taxonomic markers is sometimes questioned.7 In addition, the taxonomic classification of Asteraceae is disputed even amongst botanists.8,9 Therefore, to fully support a given taxonomic division, a comparative approach is needed. A future study may, for example, build classification models that use different taxonomic classifications to assign a class label to each STL. The quality of each classification model can then be interpreted as the degree of agreement between the secondary metabolism of the Asteraceae plants and the corresponding taxonomic classification.

Starting with Chapter 4, an attempt to explore the vast amount of data stored in current chemical databases was presented. The aims of the work described in Chapter 4 were to examine the applicability of novelty detection with Self-Organizing Maps (SOMs), a new method devised for ligand-based virtual screening, and to compare it with the most commonly used approach, similarity search with binary fingerprints. Both of the investigated variations of the SOM novelty detection technique, with a single structure representation and with multiple structure representations based on topological autocorrelation descriptors, were found useful for ligand-based virtual screening. Small sets of compounds highly enriched in active structures were obtained by considering the intersection between the ranked lists obtained by a combined application of SOM novelty detection with a single structure representation and of similarity search with subsequent data fusion. The SOM novelty detection method with a 44-dimensional concatenated autocorrelation vector was found to complement the similarity search based on Daylight fingerprints. Better enriched lists were obtained by merging these results. The SOM novelty detection method with a 44-dimensional concatenated autocorrelation vector recovered a significant number of chemotypes that are missed by the similarity search. The SOM novelty detection method was found applicable as a library design tool for discarding a large number of compounds which are unlikely to possess a given biological activity, without the need for an artificial threshold. Using multiple structure representations in concert with a Mahalanobis distance recovered between 34% and 93% of the actives in the top 100 ranked structures. This corresponds to enrichment factors between 105 and 470. Thus, it is the recommended method when a short list of lead candidates is required.
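
For readers who want to compute such figures from a ranked list, the following sketch uses a commonly used definition of the enrichment factor: the hit rate among the top n ranked structures divided by the hit rate in the whole database. Whether this matches the exact definition used in Chapter 4 is an assumption. The function name and the toy data are hypothetical, and all actives are assumed to be present in the ranked list.

```python
def enrichment_factor(ranked_ids, active_ids, top_n):
    """Enrichment factor of the top_n entries of a ranked list.

    EF = (hits in top_n / top_n) / (all actives / database size),
    computed as one fraction to avoid rounding in the intermediate rates.
    """
    actives = set(active_ids)
    hits = sum(1 for cid in ranked_ids[:top_n] if cid in actives)
    return (hits * len(ranked_ids)) / (top_n * len(actives))

# Toy example: 1000 ranked compounds, 10 actives, 5 of them in the top 100.
ranked_ids = [f"c{i}" for i in range(1000)]
active_ids = [f"c{i}" for i in range(5)] + [f"c{i}" for i in range(500, 505)]
ef_top100 = enrichment_factor(ranked_ids, active_ids, top_n=100)
```

An enrichment factor of 1 corresponds to random selection, so values in the hundreds indicate that the top of the list is very densely populated with actives relative to the database as a whole.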

Chapter 5 continued our exploration of ligand-based virtual screening in more concrete practical scenarios. Four scenarios were examined: prioritizing compounds for a subsequent high-throughput screening experiment (scenario 1), selecting compounds for a subsequent lead optimization (scenario 2), assessing the probability that a given structure will exhibit a particular biological activity (scenario 3), and identifying the most active structure (scenario 4). Three different ways of representing chemical structures – binary fingerprints, topological autocorrelation, and radial distribution function (RDF) codes – were examined in combination with similarity search and SOM novelty detection. Both virtual screening methods were found applicable for scenario 1 and scenario 2. The SOM novelty detection is preferred for scenario 3. Both methods were found inapplicable for scenario 4. The performance of the different descriptors was found to depend on the activity class. However, it was found that the topological autocorrelation usually offers the best dimensionality/performance ratio. The use of a 3-dimensional vectorial descriptor, RDF codes, brought a limited amount of new information. A bias towards retrieving compounds from the same database that was used to select the training set was found. Thus, any results produced in this fashion should be considered optimistic estimates of the true performance of a virtual screening method. Increasing the size of the training set beyond one hundred compounds did not bring a significant improvement in all scenarios. The studied virtual screening methods were able to recover chemotypes not present in the training set. In addition, an analysis of the top-ranked false-positive structures revealed that these structures are likely to share the target activity. Therefore, the proposed methods are likely to work well in prospective virtual screening experiments.

The extensive studies in Chapter 4 and Chapter 5 have shown how machine learning techniques can help in the discovery of knowledge from large chemical databases. The concept of novelty detection proved useful in this context. Self-Organizing Maps are only one of numerous novelty detection techniques.10,11 It will no doubt be interesting and enlightening to investigate the performance of other techniques, for example support vector machines.12 The use of a vectorial descriptor based on a single low-energy 3-dimensional conformation did not bring significant improvements. A more detailed investigation of the reasons for this behavior, especially a study that accounts for conformational flexibility, will be valuable.

References

(1) Wengenmayr, R. The Global Archive of Science. MaxPlanckResearch 2006, 66-69.

(2) Bremer, K. Asteraceae: Cladistics and Classification; Timber Press: Portland, 1994.

(3) Eriksson, L.; Jaworska, J.; Worth, A. P.; Cronin, M. T. D.; McDowell, R. M.; Gramatica, P. Methods for Reliability and Uncertainty Assessment and for Applicability Evaluations of Classification- and Regression-Based QSARs. Environ. Health. Persp. 2003, 111, 1361-1375.

(4) Arlandis, J.; Perez-Cortez, J. C.; Cano, J. Rejection Strategies and Confidence Measures for a K-NN Classifier in an OCR Task. 16th International Conference on Pattern Recognition (ICPR'02) 2002, 1, 10576-10580.

(5) Boutell, M. R.; Luo, J.; Shen, X.; Brown, C. M. C. Learning Multi-Label Scene Classification. Pattern. Recogn. 2004, 37, 1757-1771.

(6) Zhang, M.-L.; Zhou, Z.-H. A k-Nearest Neighbor Based Algorithm for Multi-label Classification. In 2005 IEEE International Conference on Granular Computing, 2005, Vol. 2, pp. 718-721.

(7) Wink, M. Evolution of Secondary Metabolites From an Ecological and Molecular Phylogenetic Perspective. Phytochemistry 2003, 64, 3-19.

(8) Wagenitz, G. Systematics and Phylogeny of the Compositae (Asteraceae). Plant Syst. Evol. 1976, 125, 29-46.

(9) Jansen, R. K.; Holsinger, K. E.; Michaels, H. J.; Palmer, J. D. Phylogenetic Analysis of Chloroplast DNA Restriction Site Data at Higher Taxonomic Levels: An Example From the Asteraceae. Evolution 1990, 44, 2089-2105.

(10) Markou, M.; Singh, S. Novelty Detection: a Review - Part 1: Statistical Approaches. Signal Process. 2003, 83, 2481-2497.

(11) Markou, M.; Singh, S. Novelty Detection: a Review - Part 2: Neural Network Based Approaches. Signal Process. 2003, 83, 2499-2521.

(12) Schölkopf, B.; Williamson, R. C.; Smola, A. J.; Shawe-Taylor, J.; Platt, J. C. Support Vector Method for Novelty Detection. In Advances in Neural Information Processing Systems 12; Solla, S. A., Leen, T. K., Müller, K. R., Eds.; MIT Press: 2000.

7 Summary

This work demonstrated the applicability of different machine learning techniques for extracting knowledge from chemical databases of different size. Two different fields, chemotaxonomy and ligand-based virtual screening, were studied. The former demonstrated how a relatively small chemical data set can be coupled with different machine learning techniques to improve our understanding of the relationships between plants' secondary metabolism and their taxonomic classification. The latter demonstrated how the large amount of chemical data stored across different large chemical databases can be used in a knowledge-driven way for the discovery of new potential drugs.

Chapter 2 presented the application of different classification techniques to the assignment of sesquiterpene lactones – important secondary metabolites in the plant family Asteraceae – to the Asteraceae tribe from which they have been isolated. The performance of different machine learning – more precisely classification – techniques was investigated. Good agreement with the taxonomic division proposed by Bremer was obtained. In addition, the problem of the applicability domain of the built models was investigated and some practical guidance was given.

Chapter 3 extended the study presented in Chapter 2 in a logical way. The simultaneous occurrence of secondary metabolites in different taxa was taken into account. A machine learning area, known as multi-labeled classification, was introduced in an attempt to model the reality as closely as possible. With this approach interesting relationships between the studied Asteraceae tribes were discovered. In addition, the practical application of the built classification models to targeted collection of plants with the aim of finding natural products with desired properties was shown.

Chapter 4 demonstrated how machine learning techniques can help in the navigation of large chemical spaces. A new approach to ligand-based similarity searching based on a machine learning technique known as novelty detection was described. Its applicability for the knowledge-driven selection of chemical compounds with potential biological activity was demonstrated in comparison with the most common ligand-based virtual screening approach, similarity searching.

Chapter 5 extended the work described in Chapter 4 to more concrete practical scenarios. Four such scenarios were examined: prioritizing compounds for a subsequent high-throughput screening experiment; selecting compounds for a subsequent lead optimization; assessing the probability that a given structure will exhibit a particular biological activity; and identifying the most active structure. The applicability of different ligand-based virtual screening methods and chemical structure representations in each of these scenarios was tested. Different measures of the success of the virtual screening experiment in each scenario were presented and discussed. The discussion covered the optimal size of the training set, the difference in the chemical spaces covered by two large databases of biologically active compounds, MDL Drug Data Report (MDDR) and World Of Molecular BioAcTivity (WOMBAT), the bias introduced by the training set selection, and the differences in the compounds recovered by different methods and/or descriptors. The best method-descriptor combination was identified for each scenario.

The findings of this work can be used as guidance for future studies, including investigations in both chemotaxonomy and ligand-based virtual screening, as well as in other chemistry-related areas. Concerning chemotaxonomy, a comparative study using the different existing taxonomic divisions of Asteraceae will no doubt discover new interesting relationships between Asteraceae plant species. With regard to ligand-based virtual screening, the investigation of other novelty detection techniques and a study that accounts for the conformational flexibility of the ligands will be valuable. Beyond these two fields, the definition of the applicability domain of a machine learning model, discussed in Chapter 2, is of benefit for any machine learning method that is used for predictive purposes. The multi-labeled classification presented in Chapter 3 may benefit other chemoinformatics fields, for example the prediction of multi-target drugs. The novelty detection technique presented and discussed in Chapter 4 and Chapter 5 offers an alternative for any case where information about only one of the possible states of a given (chemical) system is known. As such, it may help in discovering knowledge from data in various situations where the classic classification algorithms are not applicable.

The studies presented in this work have shown the applicability of machine learning techniques to different chemistry related problems. We have demonstrated how, with the help of different machine learning techniques, knowledge can be gathered from both small and large chemical databases. This knowledge is of great value in the modern, data-rich world.

8 Zusammenfassung (Summary, translated from the German)

This work demonstrates the suitability of different machine learning techniques for extracting knowledge from chemical databases of various sizes. Two different fields, chemotaxonomy and ligand-based virtual screening, were investigated for this purpose. The former shows how a relatively small chemical data set, combined with different machine learning techniques, can be used to improve our understanding of the relationships between the secondary metabolism of plants and their taxonomic classification. Ligand-based virtual screening shows how large amounts of chemical data, distributed across several large chemical databases, can be used in a knowledge-based approach to discover new potential drugs.

Chapter 2 demonstrates the application of different classification techniques to the assignment of sesquiterpene lactones, important secondary metabolites in the plant family Asteraceae, to the Asteraceae tribe from which they were isolated. The performance of different classification techniques was investigated. Good agreement with the taxonomic division proposed by Bremer was obtained. In addition, the applicability domains of the models built were investigated and some practical guidance was given.

Chapter 3 extends the study presented in Chapter 2 in a logical way. The simultaneous occurrence of secondary metabolites in different taxa was taken into account. A machine learning method known as multi-labeled classification was used to reproduce reality as closely as possible. With this approach, interesting relationships between the Asteraceae tribes studied were identified. In addition, the practical application of the classification model to the targeted collection of plants, with the aim of finding natural products with particular properties, was shown.

Chapter 4 shows how machine learning techniques can be used to navigate large chemical spaces. A new approach to ligand-based similarity searching, based on a machine learning technique known as novelty detection, was tested. The performance of novelty detection in the knowledge-based search for chemical compounds with potential biological activity was shown in a comparative study with the method most frequently used for ligand-based virtual screening, similarity searching.

Chapter 5 applies the findings of Chapter 4 to practice-oriented applications. Four scenarios were investigated: prioritizing chemical compounds for a subsequent high-throughput screening; selecting chemical compounds for a subsequent lead optimization; estimating the probability that a chemical compound exhibits a particular biological activity; and identifying the chemical compound with the highest activity. Furthermore, the suitability of different ligand-based virtual screening methods and of different chemical structure representations was tested for each of the four scenarios. Different criteria for assessing the quality of the virtual screening experiment were examined and discussed for each scenario. The optimal size of the training set, the different coverage of chemical space by two large databases of biologically active compounds, MDDR (MDL Drug Data Report) and WOMBAT (World of Molecular BioAcTivity), the bias caused by the selection of the training set, and the differences between the chemical compounds found by the different methods and/or descriptors were discussed. In addition, the best combination of method and descriptor was determined for each scenario.

The findings of this work can serve as guidance for further studies, both in chemotaxonomy and ligand-based virtual screening and in other areas of chemistry. In chemotaxonomy, for example, a comparative study based on the existing taxonomic division of the plant family Asteraceae could bring to light new insights into the relationships between individual Asteraceae species. For ligand-based virtual screening, a study that examines the influence of the conformational flexibility of the ligand with further novelty detection techniques would be interesting. Furthermore, the definition of the applicability domain of a machine learning model, discussed in Chapter 2, is also of interest for all machine learning methods used for prediction. The multi-labeled classification described in Chapter 3 is also useful in other areas of chemoinformatics, for example in predicting drugs that bind to different receptors. The novelty detection technique presented in Chapters 4 and 5 offers alternatives for applications in which information is available about only one possible state of a (chemical) system. The novelty detection method can therefore also be used to gain knowledge in many cases where the classical classification algorithms are not applicable.

The studies presented in this work show the applicability of machine learning techniques to various problems in chemistry. Using different machine learning techniques, it was shown how knowledge can be extracted from both small and large chemical databases. This knowledge is of great value in our modern, information-rich world.

Appendix A. Publications

• Dimitar Hristozov, Tudor I. Oprea, and Johann Gasteiger Virtual screening applications – a study of ligand-based methods and different structure representations in four different scenarios submitted to J. Comput.-Aided Mol. Des.

• Dimitar Hristozov, Tudor I. Oprea, and Johann Gasteiger. Ligand-based Virtual Screening by Novelty Detection with Self-Organizing Maps. Accepted for publication in J. Chem. Inf. Model.

• Dimitar Hristozov, Johann Gasteiger, and Fernando B. Da Costa. Multi-labeled Classification Approach to Find a Plant Source for Terpenoids. Submitted to J. Chem. Inf. Model.

• Dimitar Hristozov, Johann Gasteiger, and Fernando B. Da Costa. Sesquiterpene Lactones-based Classification of the Family Asteraceae Using Neural Networks and k-Nearest Neighbours. J. Chem. Inf. Model., 2007, 47, 1, 9-19

• Claudia E. Domini, Dimitar Hristozov, Beatriz Almagro, Iván P. Román, Soledad Prats, and Antonio Canals. Sample Preparation for Chromatographic Analysis of Environmental Samples. In “Chromatographic Analysis of the Environment”, Nollet, L.M.L. (Ed.), 2006, 31-131

• Dimitar Hristozov, Claudia E. Domini, Veselin Kmetov, Violeta Stefanova, Deana Georgieva, and Antonio Canals. Direct Ultrasound-assisted Extraction of Heavy Metals from Sewage Sludge Samples for ICP-OES Analysis. Anal. Chim. Acta, 2004, 516, 1-2, 187-196

• Veselin Kmetov, Violeta Stefanova, Dimitar Hristozov, Deana Georgieva, and Antonio Canals. Determination of Calcium, Iron and Manganese in Moss by Automated Discrete Sampling Flame Atomic Absorption Spectrometry as an Alternative to the ICP–MS Analysis. Talanta, 2003, 59, 1, 123-136

• Plamen Penchev, Dimitar Hristozov, and Georgi Andreev. Searching in UV/VIS Library. University of Plovdiv "P. Hilendarski" Scientific Works, vol. 30, book 5, 2001 - chemistry

• Veselin Kmetov, Dimitar Hristozov, Violeta Stefanova, Stojan Tenev, and Ljubomir Futekov. Computer Automated System for Micro-sampling in FAAS, Software for Signal Acquisition and Treatment. University of Plovdiv "P. Hilendarski" Scientific Works, vol. 30, book 5, 2001 - chemistry


Appendix B. Curriculum Vitae

Name: Dimitar Panayotov Hristozov
Date of birth: 9 April 1977
Place of birth: Varna (Bulgaria)
Nationality: Bulgarian
Parents: Panayot Dimitrov Hristozov and Mariya Ljubomirova Hristozova, née Hristeva

School education
09/1983 – 06/1990: Primary school in Plovdiv, Bulgaria
09/1990 – 06/1995: Mathematical High School “acad. Kiril Popov”, Plovdiv, Bulgaria

University education
10/1995 – 06/2002: Chemistry studies at the University “Paisii Hilendarski”, Plovdiv, Bulgaria. Diploma thesis in the Atomic Spectroscopy group under Prof. Dr. Antonio Canals, Dr. Veselin Kmetov, and Dr. Violeta Stefanova. Topic: “Comparison between microwave digestion and ultrasound-assisted extraction of heavy metals from sewage sludge. Optimization by experimental design.”
09/2002: Graduated as Diplom-Chemiker
since 01/2003: Doctoral research under Prof. Dr. Johann Gasteiger at the Computer-Chemie-Centrum and Institute of Organic Chemistry, Friedrich-Alexander-Universität Erlangen-Nürnberg

Stay abroad
02/2002 – 09/2002: Socrates-Erasmus student with Prof. Dr. Antonio Canals at the University of Alicante, Alicante, Spain