Universität des Saarlandes, Zentrum für Bioinformatik, Bachelor's Program in Bioinformatics

Bachelor's Thesis

Inferring virological response to antiretroviral combination therapy based on past treatment lines

submitted by Fabian Müller

on 20 March 2008

prepared under the supervision of Prof. Dr. Thomas Lengauer, Ph.D.
advised by André Altmann
reviewed by Prof. Dr. Thomas Lengauer, Ph.D., and Prof. Dr. Hans-Peter Lenhof

Declaration

I hereby declare that I have written this thesis independently and have cited all sources used.

Saarbrücken, 20 March 2008

Given the increasing number of possible drug combinations and the genetic diversity of HIV, it is unlikely that simple hand-crafted rules will capture the complex interplay between drug cocktails and mutational patterns that determine response to antiretroviral therapy.

 Altmann et al. [1]

Abstract

Although several antiretroviral agents are available, therapy failure is still a major issue in fighting HIV. The most promising regimens today are combination therapies comprising multiple compounds from different drug classes. However, complete viral eradication is still not achievable with current strategies. A major cause is viral resistance to antiretroviral compounds. Bioinformatics approaches can use statistics to estimate the probability of success of different promising drug combinations for an individual patient and thus help with choosing an appropriate regimen. Current predictive methods often rely on the genotype of the virus and the drug composition of the therapy. However, genotypes are not always available, and even if they are, resistance might be latent in the viral population. Including information on a patient's past treatments might cover resistance (especially hidden resistance) indirectly and thus improve prediction. Methods based on both the viral genotype and the therapy history might enhance prediction further.

Furthermore, it could be useful to have a visual representation of potential therapy changes in a database. This might make it possible to identify general habits in drug prescription and observe their effect on the outcome of treatment. With this additional information, the prediction of therapy success could be improved.

In this work, therapy success is assessed using several encodings for a patient's past treatments. Graph representations prove to be useful for the analysis of therapy sequences and the development of new representations for prediction. The results suggest that prediction via therapy history encodings is quite effective, especially when combined with genotypic information.

Contents

1 Introduction
  1.1 Human Immunodeficiency Virus (HIV)
    1.1.1 Structure
    1.1.2 Replication Cycle
    1.1.3 Pathogenic Mechanisms and AIDS
  1.2 Current HIV-Therapies
    1.2.1 Therapy Compounds
    1.2.2 Viral Resistance and Therapy Failure
    1.2.3 Highly Active AntiRetroviral Therapy (HAART)
  1.3 Predicting Therapy Success
    1.3.1 Related Approaches in Predicting Therapy Success
    1.3.2 Motivation for History Based Encodings

2 Materials and Methods
  2.1 Statistical Learning Methods
    2.1.1 Logistic Regression
    2.1.2 Random Forests
    2.1.3 Validation
  2.2 The EuResist Database
  2.3 Feature Encodings
    2.3.1 Base Encoding for Current Therapy
    2.3.2 Binary Compound Variables for Therapy History
    2.3.3 Continuous Representation for Therapy History
    2.3.4 Second Order Variables for Past and Current Compounds
  2.4 Clustering Compound Combinations
  2.5 Therapy Sequence Graphs
    2.5.1 Transitions-Occurrence Graphs
    2.5.2 N-gram Graphs
  2.6 Graph Related Feature Encodings
    2.6.1 Similarity to Previous Therapy
    2.6.2 Transitions from Previous Therapy Clusters
  2.7 Analysis of Therapy Success based on Single Compound Replacements
  2.8 Consolidation of Genotype and Therapy History-based Methods
  2.9 Realization

3 Results and Discussion
  3.1 Feature Encodings
  3.2 Therapy Sequence Graphs
  3.3 Graph Related Feature Encodings
  3.4 Analysis of Therapy Success based on Single Compound Replacements
  3.5 Consolidation of Genotype and Therapy History-based Methods

4 Conclusion

1 Introduction

1.1 Human Immunodeficiency Virus (HIV)

To understand current treatment methods for HIV, it is essential to have some basic knowledge of the virus' molecular biology and replication cycle. HIV belongs to the family Retroviridae and is a member of the genus Lentivirus. Its primary targets are cells of the human immune system such as lymphocytes (CD4+ T-cells), monocytes (macrophages) and dendritic cells. HIV is thought to be descended from the simian immunodeficiency virus (SIV), which infects non-human primates. There are two major types of HIV, designated HIV-1 and HIV-2, which can be subdivided into several subtypes (Figure 1). Those subtypes differ in their geographic distribution: for instance, HIV-1 subtypes A and D predominate in Africa, while subtype B occurs most often in Europe.

Figure 1: HIV Types and Subtypes [2]

1.1.1 Structure

HIV's molecular structure is considered complex. The virus particle is roughly spherical and about 100 nm in diameter. Its genetic information is located on diploid RNA and contains approximately 9,200 base pairs (HIV-1). It comprises the genes for the capsid, envelope, regulatory, accessory and replication-associated proteins. Both ends of the RNA strands consist of long terminal repeats, which are important for transcription control.

Figure 2: Schematic of the HIV particle [3]

As shown in Figure 2, the virion is enveloped by a lipid membrane acquired from the host cell. Integrated in this membrane are trimeric viral complexes of the proteins gp41 (transmembrane protein) and gp120 (surface glycoprotein). The next inner layer, designated matrix, comprises the protein p17, followed by another shell called the capsid, which is built from approximately 2,000 copies of the p24 protein. The viral RNA itself is stabilized by a ribonucleoprotein complex of p7. Also enclosed in the envelope are the viral proteins reverse transcriptase, protease and integrase as well as accessory proteins.

1.1.2 Replication Cycle

Like all retroviruses, HIV has a replication cycle that involves recognition of the host cell, integration into the host DNA and the construction and release of new virions. A schematic overview of HIV's replication can be found in Figure 3.

In the early phase, the viral surface protein gp120 specifically recognizes the surface of the host cell (e.g. CD4+ T-cells). The virus particle fuses with the cell's membrane (mediated by the gp41 protein), and its RNA is released into the host cell and uncoated from the stabilizing protein. The RNA can then be reverse-transcribed into DNA. This process is catalyzed by reverse transcriptase. After construction of the complementary DNA strand, the viral DNA is transported into the host cell's nucleus, where it is integrated into the host DNA by the viral enzyme integrase.

In the late phase, the viral genes are transcribed along with the host cell's genes, and after splicing and translation the viral protein chains (polyproteins) are ready for further processing such as glycosylation of the transmembrane protein complexes. Furthermore, cleavage of the polyproteins, catalyzed by the viral protease, is necessary for the proteins to function and to form mature, infectious virus particles.

Unspliced viral RNA is packed into the new viral structures consisting of these translated viral proteins. Gp120-gp41 protein complexes are integrated into the host cell's membrane, and during budding from the cell the newly constructed virions are enveloped by fragments of this enriched cell membrane. Replication is estimated to take 1-2 days. The time span between release of a new generation of virus particles and infection is believed to be 1-3 days.

Figure 3: HIV replication cycle [4]

1.1.3 Pathogenic Mechanisms and AIDS

During viral replication, the HIV particles cause severe damage to the host's immune system. Proposed pathogenic mechanisms include lysis of infected cells during viral replication, the disruption of the host's lymphoid architecture as well as autoimmune responses, such as superantigens or self-induced apoptosis of infected cells. The course of infection with HIV can be divided into three phases:

• In the acute phase, which accounts for the first 5 to 10 weeks of infection, the virus population grows rapidly and the host's immune system is activated.

• During the asymptomatic phase, which lasts between 2 and 20 years, CD4+ T-cell counts decrease steadily at a low rate while viral replication stays constant at a low level.

• The end stage is the symptomatic phase: CD4+ T-cell counts fall below 200 cells per microliter and the host's tissue is damaged.

HIV is strongly associated with the Acquired Immunodeficiency Syndrome (AIDS). The current definition of AIDS is given by the U.S. Centers for Disease Control and Prevention [5], which diagnose AIDS when a person is HIV-positive and fulfills at least one of the following criteria:

• The person has a CD4+ T-cell count below 200 cells/µl

• The person's CD4+ T-cells account for fewer than 14% of all lymphocytes

• The person has been diagnosed with at least one of 25 AIDS-defining illnesses, including opportunistic infections, certain cancers, brain and nerve diseases and the HIV wasting syndrome.

Table 1 sums up current regional statistics on the prevalence of HIV and AIDS. Especially alarming are the numbers for developing countries such as those in the Sub-Saharan region. Approximately one third of the population infected with HIV lives in this African region. The prevalence in those developing countries is also estimated to be very high. Although the numbers for the central European countries appear less alarming, much more clinical data is available for them. The data for this work also originates from Europe.

1.2 Current HIV-Therapies

Although viral eradication is still not achievable with current methods, therapies encompassing several drugs (Highly Active AntiRetroviral Therapies) lead to a reduction of HIV-related morbidity and mortality [7].

Table 1: Regional statistics for HIV and AIDS, end of 2007 [6]

1.2.1 Therapy Compounds

Several drugs are available that interfere with different steps in HIV's replication cycle and thus inhibit the construction of infectious viral particles:

Reverse Transcriptase Inhibitors (RTIs) impede the construction of viral DNA by inhibiting the viral reverse transcriptase. RTIs can be subdivided into two classes:

  Nucleoside Reverse Transcriptase Inhibitors (NRTIs) resemble nucleosides. When incorporated by the reverse transcriptase during the construction of viral DNA, they terminate DNA chain elongation and thus lead to an incomplete transcription of the viral RNA into DNA.

  Non-Nucleoside Reverse Transcriptase Inhibitors (NNRTIs) bind specifically to a pocket close to the active site of reverse transcriptase and thus inhibit its catalytic function.

Protease Inhibitors (PIs) hinder the maturation of viral proteins by specifically inhibiting the viral enzyme protease. This is achieved by binding directly to its binding pocket. Impaired protease activity results in immature virions.

Integrase Inhibitors (IIs) are currently being developed to prevent the integration of the viral DNA into the host cell's genome by directly blocking the viral integrase.

Fusion and Entry Inhibitors (FIs and EIs) interfere with the binding of virus particles to the host cell's surface.

This work will concentrate on RTIs and PIs since they are most widely used and thus most data is available for them. A list of the analyzed compounds is given in Table 2.

Class   Drug          Abbreviation   Year of Approval
NRTI    Zidovudine    ZDV            1987
NRTI    Didanosine    ddI            1991
NRTI    Zalcitabine   ddC            1992
NRTI    Stavudine     d4T            1994
NRTI    Lamivudine    3TC            1995
NRTI    Abacavir      ABC            1998
NRTI    Tenofovir     TDF            2001
NNRTI   Nevirapine    NVP            1996
NNRTI   Delavirdine   DLV            1997
NNRTI   Efavirenz     EFV            1998
PI      Saquinavir    SQV            1995
PI      Ritonavir     RTV            1996
PI      Indinavir     IDV            1996
PI      Nelfinavir    NFV            1997
PI      Amprenavir    APV            1999
PI      Lopinavir     LPV            2000
PI      Atazanavir    ATV            2003

Table 2: A selection of therapy compounds and their abbreviations (RTIs and PIs) with their years of approval by the U.S. Food and Drug Administration [8]

1.2.2 Viral Resistance and Therapy Failure

The ability of the virus to replicate despite the presence of a given drug is referred to as drug resistance. It is a common reason for treatment failure.

Viral resistance arises from random mutations in the viral genes coding for the molecular structure of the particular drug target (i.e. reverse transcriptase, protease, integrase, etc.). Usually more than one mutation is necessary to cause resistance. A number of associated mutations have already been identified by phenotypic and genotypic resistance testing [9]. The interplay of the lack of proofreading in viral replication, genetic diversity and selection accounts for the spread of resistance mutations in the viral population [9]. An altered base sequence in the affected genes leads to an altered amino acid sequence, which in turn results in an altered structure of the target protein. Thus, a certain drug may no longer be able to inhibit the enzyme's function. This results in uninhibited viral replication. Furthermore, the pre-existence of some resistant genotypes in the viral population of a particular patient is quite common.

Another problem is posed by cross-resistance: certain mutations may render the virus resistant to more than one drug. For instance, resistance to a particular PI is likely to be accompanied by resistance to other drugs from that class because of the structural similarity of the compounds. It is also assumed that when a viral genotype is determined for a patient, some resistant genotypes in the viral population elude detection and thus remain hidden.

Therapy failure is defined as the inability to suppress viral replication completely. There are two established definitions of failure [10]: immunological failure is manifested by low CD4+ T-cell counts, while virological failure refers to an increase in the viral load (VL; copies of virus particles per milliliter of blood). Virological failure usually precedes immunological failure. Although complete suppression of viral replication is not possible with current methods, therapies can have an effect on the VL. Therefore, in this work, a therapy is considered a failure when the lowest VL measurement during therapy (at least 14 days after therapy start) is larger than 500 viral copies/ml. If it is below 500 copies/ml, the therapy is considered successful; if no such measurement exists, the outcome is undefined.

Viral resistance poses a major threat to therapy success; however, it is not the only cause of therapy failure [7]. Subinhibitory drug levels, host immune failure and incomplete adherence are other causes.
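
The outcome definition above can be written down as a small labeling function. The following is only a minimal sketch: the function name, the input format (pairs of days since therapy start and viral load) and the handling of a measurement of exactly 500 copies/ml (counted as failure here) are assumptions made for illustration and are not specified in the text.

```python
from typing import Optional, Sequence, Tuple

VL_THRESHOLD = 500.0  # copies/ml, threshold from the definition above
MIN_DAYS = 14         # measurements earlier than this are ignored


def therapy_outcome(measurements: Sequence[Tuple[int, float]]) -> Optional[bool]:
    """Label a therapy as success (True), failure (False) or undefined (None).

    measurements: (days since therapy start, viral load in copies/ml) pairs,
    assumed to be restricted to measurements taken during the therapy.
    """
    eligible = [vl for day, vl in measurements if day >= MIN_DAYS]
    if not eligible:
        return None  # no usable measurement: outcome undefined
    # Assumption: a lowest value of exactly 500 copies/ml counts as failure.
    return min(eligible) < VL_THRESHOLD


print(therapy_outcome([(20, 12000.0), (60, 400.0)]))  # success
print(therapy_outcome([(7, 100.0)]))                  # undefined
```

Returning `None` for the undefined case keeps such therapies easy to filter out before training.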

1.2.3 Highly Active AntiRetroviral Therapy (HAART)

In order to minimize the probability of pre-existing resistant viral genotypes and to impede the development of resistance, it is common practice today to combine several drugs, possibly from different drug classes, into one therapy and to start treatment as early as possible. These combination therapies are designated Highly Active AntiRetroviral Therapies (HAARTs). Formally, a therapy is considered a HAART if it is one of the following drug combinations [11]:

• two NRTIs plus one PI or NNRTI

• one NRTI in combination with at least one PI and at least one NNRTI

• Ritonavir and Saquinavir combined with one NRTI and no NNRTIs

• Abacavir and Tenofovir plus another NRTI without any NNRTIs and PIs
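
The four criteria above can be checked mechanically once each compound is assigned to its class (Table 2). The sketch below is illustrative only: the function name is hypothetical, and the counts in the criteria ("two NRTIs", "one NRTI", "another NRTI") are read as "at least", which is an assumption, since the text does not state whether exact counts are meant.

```python
# Drug classes by abbreviation, taken from Table 2.
NRTI = {"ZDV", "ddI", "ddC", "d4T", "3TC", "ABC", "TDF"}
NNRTI = {"NVP", "DLV", "EFV"}
PI = {"SQV", "RTV", "IDV", "NFV", "APV", "LPV", "ATV"}


def is_haart(compounds):
    """Return True if the compound set matches one of the four HAART criteria.

    Assumption: numeric counts in the criteria are interpreted as "at least".
    """
    drugs = set(compounds)
    n_nrti = len(drugs & NRTI)
    n_nnrti = len(drugs & NNRTI)
    n_pi = len(drugs & PI)

    # two NRTIs plus one PI or NNRTI
    rule1 = n_nrti >= 2 and (n_pi >= 1 or n_nnrti >= 1)
    # one NRTI with at least one PI and at least one NNRTI
    rule2 = n_nrti >= 1 and n_pi >= 1 and n_nnrti >= 1
    # Ritonavir and Saquinavir with one NRTI and no NNRTIs
    rule3 = {"RTV", "SQV"} <= drugs and n_nrti >= 1 and n_nnrti == 0
    # Abacavir and Tenofovir plus another NRTI, no NNRTIs and no PIs
    rule4 = {"ABC", "TDF"} <= drugs and n_nrti >= 3 and n_nnrti == 0 and n_pi == 0
    return rule1 or rule2 or rule3 or rule4


print(is_haart(["ZDV", "3TC", "EFV"]))  # two NRTIs plus one NNRTI
print(is_haart(["ZDV", "3TC"]))         # dual NRTI therapy
```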

In very acute cases of infection, therapies containing a large number of compounds (up to 11 in the studied dataset), possibly from all drug classes, are prescribed in order to gain control of viral replication. Those therapies are referred to as salvage therapies. It should be noted that HIV therapies in general are known to cause severe side effects such as diarrhea, headache and, in the long term, even disorders of lipid metabolism, nerve inflammation and organ damage.

The strategy of interrupting treatment for a certain amount of time in order to re-establish non-resistant viral genotypes in the population and to recover from therapy-induced disorders is sometimes applied. However, this has been shown to be difficult, since resistant variants can be stored in latently infected cells and thus lead to fast re-development of viral resistance.

Current treatment strategies also schedule changes in the compound combinations in order to apply more drug pressure to the virus population: Figure 4 shows that when a drug combination is applied, therapy starts to fail after a certain amount of time (indicated by an increase in VL). Thus, a new drug combination is introduced to suppress viral replication. This new therapy also fails after a while, resulting in the prescription of a new therapy, and so on. The figure shows clearly that treating HIV involves constant effort in assembling the most promising compound combinations and keeping the viral population and the immune system under constant surveillance. HIV treatment is a lifelong therapy. Therefore, improving the prediction of success of possible future treatments for an individual patient is also necessary.

Figure 4: Viral load isolates with respect to different therapy changes for an individual patient [9]

1.3 Predicting Therapy Success

The common workflow in predicting an output variable (in our case whether a given therapy is a success¹) is the following:

1. Prepare the input data, which will be used to assess the output. These inputs may have multiple dimensions accounting for the different features encoded.
2. Train a statistical model on the input data (training set) via a chosen statistical learning method.
3. Use this model to predict unseen data.
4. For validation purposes, predict the output for data with known output (test set). This data should have no overlap with the training set. A variety of encodings for the error of the model is conceivable.
5. The trained model and its parameters can be analyzed and interpreted. Importance can be assigned to the different inputs.

1.3.1 Related Approaches in Predicting Therapy Success

In an approach by Altmann et al. [1], multiple statistical learning methods were applied to several viral evolutionary and phenotypic feature encodings. The predicted output was binary (whether a given therapy was successful or not). The input data consisted of 6,337 encodings of drug combinations and viral genotypic information. The findings state that the feature encoding has a much larger impact on the quality of prediction than the statistical learning method used. However, the inputs did not contain any information on the therapies prescribed before the one to be predicted (therapy history).

Larder et al. used artificial neural networks as the statistical learning method for a real-valued therapy response prediction (change in viral load) [12]. Their input data consisted of encodings for just the genotype and therapy compounds on the one hand, and additional CD4+ T-cell counts and very rudimentary therapy history information on the other hand. The dataset comprised 1,154 datapoints. They inferred from their results that the inclusion of those additional features improved prediction quite significantly. Their explanation was that resistant viral genotypes present at low levels escape detection in genotyping. They concluded that the inclusion of more comprehensive treatment history information may further enhance prediction. By subdividing their data according to the number of drugs in the therapy, they also found that prediction improved when therapies with fewer than three compounds were included.

¹ Recall: A therapy is considered successful when there is a VL measurement during therapy at least 14 days after therapy start and this measurement is smaller than 500 viral copies/ml.

1.3.2 Motivation for History Based Encodings

The motivation for using history-based feature encodings rather than genotypic ones is the vast amount of data available in the database (approximately 35,000 therapies in the dataset used; see section 2.2) in comparison to the number of datapoints (around 4,000) that also include a genotypic encoding (sequence of viral proteins). The idea is to capture hidden resistance developments in the patient's viral population indirectly through past treatments: Figure 4 shows that modifications in treatment can be due to an increase in viral load. Thus, it is plausible that a therapy prescribed in the patient's past led to the development of viral resistance against compounds in the current therapy. This resistance might be hidden in the active and inactive viral population of a patient and thus affect the current therapy without being detectable by genotyping.

2 Materials and Methods

2.1 Statistical Learning Methods

In this work, two statistical learning methods are used for the binary classification task of predicting therapy success:

2.1.1 Logistic Regression

A possible model for classification is given by logistic regression. Here the binary case is presented [13]: Let the output classes be in $G = \{1, 2\}$ (in our case therapy success/failure) and let $X$ be the $p$-dimensional input vector, with a particular input datapoint given by $x$. The probabilities of $x$ belonging to a specific class can then be described by the following equations:

$$P(G = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)}$$

$$P(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta_0 + \beta^T x)}$$

The logit transformation $\log\left(\frac{p}{1-p}\right)$ yields a linear decision boundary, which determines classification:

$$\log \frac{P(G = 1 \mid X = x)}{P(G = 2 \mid X = x)} = \beta_0 + \beta^T x$$

Accordingly, the model assigns a datapoint $x$ to class 1 if this linear expression yields a value greater than 0 for $x$, and to class 2 for a value less than 0. The parameters in the $(p+1)$-dimensional vector $\beta = (\beta_0, \ldots, \beta_p)^T$ need to be fit based on a training set (say $N$ datapoints). Appropriate values for $\beta$ can be determined by maximizing the log-likelihood function

$$l(\beta) = \sum_{i=1}^{N} \log P(G = g_i \mid X = x_i, \beta)$$

by setting its derivatives to zero and solving the resulting equations with Newton iteration. Here the $x_i$ are the datapoints in the training set with corresponding classes $g_i$. Because class probabilities are estimated directly, this linear model is highly interpretable. It is also computationally relatively inexpensive.
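
To make the formulas concrete, the following sketch evaluates the class probabilities, the decision rule and the log-likelihood for given parameters. It is a plain-Python illustration, not the implementation used in this work; the function names are hypothetical, the Newton iteration for fitting is not shown, and assigning a value of exactly 0 to class 2 is an assumption, since the text leaves that case open.

```python
import math


def linear_score(beta0, beta, x):
    """beta_0 + beta^T x for one datapoint x."""
    return beta0 + sum(b * xi for b, xi in zip(beta, x))


def class_probabilities(beta0, beta, x):
    """Return (P(G=1 | x), P(G=2 | x)) under the logistic model."""
    z = linear_score(beta0, beta, x)
    p1 = 1.0 / (1.0 + math.exp(-z))  # equals exp(z) / (1 + exp(z))
    return p1, 1.0 - p1


def classify(beta0, beta, x):
    """Class 1 if the linear score is positive, else class 2.

    Assumption: a score of exactly 0 is assigned to class 2.
    """
    return 1 if linear_score(beta0, beta, x) > 0 else 2


def log_likelihood(beta0, beta, xs, gs):
    """l(beta) = sum_i log P(G = g_i | X = x_i, beta) over a training set."""
    total = 0.0
    for x, g in zip(xs, gs):
        p1, p2 = class_probabilities(beta0, beta, x)
        total += math.log(p1 if g == 1 else p2)
    return total


print(class_probabilities(0.0, [1.0, -1.0], [2.0, 2.0]))
print(classify(0.5, [1.0], [1.0]))
```

For very large negative scores `math.exp(-z)` can overflow; a production implementation would guard against that.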

2.1.2 Random Forests

Another model chosen for this work is a non-linear one: Random forests are based on decision trees. Using the same notation as above, a decision tree for a random forest is constructed in the following way (see also Figure 5):

1. Draw a set N_0 by bootstrapping². This set represents the root node.
2. Choose a number m, much smaller than the dimension of the inputs (p).
3. For each constructed node (until the output is distinct):
   • Choose m input variables at random.
   • Construct new nodes according to the split that minimizes some measure of impurity (e.g. entropy of the output, some estimated model error, ...).
   • Iterate step 3 over all newly constructed nodes.

Figure 5: Construction of a tree for random forests

² Draw N samples from the original training set with replacement.

A number n of those trees is combined into a forest. This forest then classifies according to the majority vote of its trees. In contrast to ordinary decision trees, the trees in a random forest are fully grown and remain unpruned. Random forests are highly accurate in many cases and allow interpretation. More detailed information on random forests can be found in [14].
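
The construction described above can be sketched in a few lines of plain Python. This is a simplified illustration rather than the implementation used in this work: it assumes binary (0/1) input variables, uses entropy as the impurity measure, and turns a node into a majority-vote leaf when none of the m sampled variables separates its datapoints. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def grow_tree(rows, labels, m, rng):
    """Grow an unpruned tree; returns a class label (leaf) or a dict (inner node)."""
    if len(set(labels)) == 1:
        return labels[0]  # output is distinct: stop
    best = None
    for j in rng.sample(range(len(rows[0])), m):  # m random input variables
        left = [i for i, r in enumerate(rows) if r[j] == 0]
        right = [i for i, r in enumerate(rows) if r[j] == 1]
        if not left or not right:
            continue  # this variable does not split the node
        score = sum(len(part) / len(rows) * entropy([labels[i] for i in part])
                    for part in (left, right))
        if best is None or score < best[0]:
            best = (score, j, left, right)
    if best is None:  # simplification: fall back to a majority leaf
        return Counter(labels).most_common(1)[0][0]
    _, j, left, right = best
    return {"feature": j,
            0: grow_tree([rows[i] for i in left], [labels[i] for i in left], m, rng),
            1: grow_tree([rows[i] for i in right], [labels[i] for i in right], m, rng)}


def predict_tree(tree, x):
    """Follow the splits until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        tree = tree[x[tree["feature"]]]
    return tree


def grow_forest(rows, labels, n_trees, m, seed=0):
    """Grow n_trees trees, each on its own bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # N draws with replacement
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx], m, rng))
    return forest


def predict_forest(forest, x):
    """Majority vote over all trees in the forest."""
    return Counter(predict_tree(t, x) for t in forest).most_common(1)[0][0]
```

A forest would be built with, for example, `grow_forest(rows, labels, n_trees=100, m=4)` and queried with `predict_forest(forest, x)`; the values 100 and 4 are arbitrary illustrations.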

2.1.3 Validation

In this work, the quality of the statistical models is validated by Receiver Operating Characteristic (ROC) curves, which plot the True Positive Rate (here: the rate of successful therapies classified as successful) against the False Positive Rate (the rate of failing therapies classified as successful) as a function of the cutoff used for assigning a datapoint to one of the two classes. This cutoff parameter can be seen as a tradeoff parameter between sensitivity and specificity. The larger the Area Under the Curve (AUC), the better the prediction. Random guessing, for instance, would result in an AUC of approximately 0.5. Figure 9 shows ROC curves for the task of classifying the control inputs of just the drug combination of the therapy (left) and therapy cluster transitions (right). In addition, 10-fold cross validation is applied for the validation of each of the prediction cases.
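
The AUC can be computed without drawing the curve: it equals the fraction of (successful, failing) therapy pairs in which the successful therapy receives the higher predicted score, with ties counted as one half. The sketch below uses this pairwise formulation; the function name and the input format are assumptions made for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted probabilities of success
    labels: True for successful therapies (positive class), False for failures
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(auc([0.9, 0.4, 0.6, 0.1], [True, True, False, False]))  # 0.75
```

The pairwise version is quadratic in the number of datapoints; for large test sets a rank-based computation would be preferable.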

Figure 6: Histograms: Number of therapies received by a patient (left); number of compounds in therapies (right)

2.2 The EuResist Database

The input data for this work originated from the EuResist Project [15]. In the version of the database used here (August 2007), three databases from independent European institutes were combined:

• ARCA

• AREVIR

• KAROLINSKA

This results in a large amount of data: 59,982 therapies originating from 17,078 patients. Figure 6 (left) shows the distribution of the number of therapies per patient. It can be seen that patients with few therapies predominate. The database contains over 214,500 viral load measurements. Of the therapies, 35,149 have a defined outcome according to the definition of success given above. Approximately 22,000 viral genotypes have been collected, but only about 4,000 can be directly associated with a change in therapy. The therapies encompass 1,689 different drug combinations of PIs and RTIs. The distribution of the number of compounds per therapy is shown in Figure 6 (right). Judging by the number of compounds alone, most therapies would be considered HAARTs (3-4 compounds). The large number of therapies containing one or two compounds corresponds mainly to older therapies. There is a subset of 7,416 patients (approx. 28,814 therapies) for whom all treatments have been recorded in the database.

2.3 Feature Encodings

In this section, the different input encodings for therapies that are used in this work to predict a binary output variable (therapy success or failure) are described. According to the definition of success given above, only therapies with a duration of more than 14 days and a viral load measurement in the specified time period remain as input. Logistic regression and random forests are applied to each of the encodings for prediction purposes.

2.3.1 Base Encoding for Current Therapy

For all therapies, the compound composition of the particular therapy is encoded as binary variables. A 17-digit vector holds input variables for the presence of each of the compounds of interest in the current therapy³. In this work, the compounds of interest are the NRTIs, NNRTIs and PIs listed in Table 2, since they have been in clinical use for a while and thus a certain amount of data was available. Therapies in the database that include other compounds were excluded from the input data, because other drugs might distort prediction.
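
A minimal sketch of this base encoding is given below. The order of the compounds within the vector and the function name are assumptions made for illustration; the abbreviations are those of Table 2.

```python
# The 17 compounds of interest (Table 2); the ordering is an assumption.
COMPOUNDS = ["ZDV", "ddI", "ddC", "d4T", "3TC", "ABC", "TDF",   # NRTIs
             "NVP", "DLV", "EFV",                               # NNRTIs
             "SQV", "RTV", "IDV", "NFV", "APV", "LPV", "ATV"]   # PIs


def encode_current_therapy(therapy):
    """Return the 17-digit indicator vector, or None if the therapy is excluded.

    therapy: iterable of compound abbreviations in the current therapy.
    """
    drugs = set(therapy)
    if drugs - set(COMPOUNDS):
        return None  # therapies containing other compounds are excluded
    return [1 if c in drugs else 0 for c in COMPOUNDS]


print(encode_current_therapy(["ZDV", "3TC", "EFV"]))
```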

2.3.2 Binary Compound Variables for Therapy History

In addition to the binary encoding for the current therapy, the compounds of the therapies preceding that therapy are encoded as inputs. This additional 17-digit history indicator vector is constructed similarly to the base encoding: the binary variable for a compound in the therapy history has the value 1 if the compound was included in any of the previous therapies of the particular patient, and 0 otherwise.

An additional approach for those features is to apply a cutoff for the duration of a therapy in the history. This means that the history compound indicators are only set to 1 if the corresponding therapy lasted longer than the cutoff duration. The idea behind this approach is that the longer a past therapy lasted, the more probable it is that the viral population developed resistance against the compounds in that therapy. The expectation is that for small cutoffs, insignificant therapies are stored in the history variables. As the cutoff increases, those insignificant therapies are excluded from the history and prediction improves. When the cutoff increases further, relevant history information is also filtered from the inputs, prediction relies mostly on the current therapy compounds, and accuracy decreases. The objective is to find the cutoff that yields the highest AUC in validation and is thus best suited for prediction. In this work, models were built for cutoffs ranging from 0 to 196 days (step size of 7 for the first 70 days and 14 for days 70 to 196).
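
The history indicators and the cutoff grid described above could be built as follows. This is a sketch under assumptions: the input format (compound lists with therapy durations in days), the compound ordering and the function name are chosen for illustration only.

```python
# The 17 compounds of interest (Table 2); the ordering is an assumption.
COMPOUNDS = ["ZDV", "ddI", "ddC", "d4T", "3TC", "ABC", "TDF",
             "NVP", "DLV", "EFV",
             "SQV", "RTV", "IDV", "NFV", "APV", "LPV", "ATV"]

# Cutoffs in days: steps of 7 up to day 70, then steps of 14 up to day 196.
CUTOFFS = list(range(0, 70, 7)) + list(range(70, 197, 14))


def encode_history(past_therapies, cutoff_days=0):
    """Return the 17-digit history indicator vector.

    past_therapies: iterable of (compounds, duration_in_days) for all
    therapies preceding the current one. Only therapies that lasted
    longer than cutoff_days contribute to the indicators.
    """
    seen = set()
    for compounds, duration in past_therapies:
        if duration > cutoff_days:
            seen.update(compounds)
    return [1 if c in seen else 0 for c in COMPOUNDS]


history = [(["ZDV", "3TC"], 10), (["EFV"], 100)]
print(encode_history(history, cutoff_days=0))
print(encode_history(history, cutoff_days=14))
```

One model would then be trained and validated per entry of `CUTOFFS`, keeping the cutoff with the highest mean AUC.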

2.3.3 Continuous Representation for Therapy History

A plausible assumption is that the viral population, and thus the outcome of a particular therapy, is less influenced by therapies that lie further back in the treatment's past than by those issued shortly before the current therapy. Therefore, this feature encoding aims at weighting the history indicators according to how recently the compounds were issued before the current therapy. The binary variables are multiplied by a weight and thus transformed into real-valued inputs. Therapies dating further back should receive a weight close to 0, while recent ones should be fully weighted (value near 1). It would also be useful to have a parameter controlling the descent of this inflation function over time. A function that fulfills those criteria is given by

$$f(t, k) = \begin{cases} -\dfrac{1}{4300^k}\, t^k + 1 & 0 \le t \le 4300 \\ 0 & \text{else} \end{cases}$$

Here $t$ denotes the time difference between the end of the therapy in the history and the start of the current therapy. $k$ controls the curvature of the function (see Figure 7): values $k > 1$ ensure that a past therapy remains relatively important over the time difference, while values $0 < k < 1$ downweight more rapidly. The intercept on the t-axis of 4,300 days was determined by taking the 95th percentile of the time differences between the last and the first therapy for all patients in the database. If multiple therapies in the treatment history of a patient contained the same compound, the most recent date was taken. To find the value of $k$ that maximizes the mean AUC of the cross validation runs, logistic regression models and random forests were trained using values for $k$ ranging from 0.1 to 10.

Figure 7: f(t, k) for different values of k

³ In the following, current therapy denotes the particular therapy for which the input is constructed, rather than the chronologically most recent therapy of a patient.
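
As an illustration, the weighting function and its application to a treatment history can be sketched as follows. The function names and the input format are assumptions; compounds that never occurred in the history are simply absent from the result, which corresponds to a weight of 0.

```python
T_MAX = 4300.0  # intercept on the t-axis in days, as given above


def history_weight(t, k):
    """f(t, k) = 1 - t^k / 4300^k for 0 <= t <= 4300, and 0 otherwise."""
    if k <= 0:
        raise ValueError("k must be positive")
    if 0 <= t <= T_MAX:
        return 1.0 - (t / T_MAX) ** k
    return 0.0


def weighted_history(past_therapies, k):
    """Map each past compound to its weight.

    past_therapies: iterable of (compounds, t), where t is the number of days
    between the end of the past therapy and the start of the current one.
    If a compound occurs in several past therapies, the most recent one
    (smallest t) is used.
    """
    latest = {}
    for compounds, t in past_therapies:
        for c in compounds:
            latest[c] = min(t, latest.get(c, t))
    return {c: history_weight(t, k) for c, t in latest.items()}


print(history_weight(2150, 2))    # 0.75: large k keeps old therapies important
print(history_weight(2150, 0.5))  # about 0.29: small k downweights quickly
```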

2.3.4 Second Order Variables for Past and Current Compounds

The motivation for this encoding is that a compound of one particular drug class is especially likely to influence the effectiveness of other compounds from the same class (e.g. via cross-resistance). The 17 compounds of interest can be grouped into PIs and RTIs (NRTIs and NNRTIs). The idea is to connect historical compounds of one group with current drugs of the same group: for each of the two groups, each of the treatment history's compound variables (see encodings above) is multiplied by the (binary) variable of each compound of the same group in the current therapy. This results in 49 (= 7 × 7) second order variables for the PI group and 100 (= 10 × 10) for the RTI group. The final input for a particular therapy consists of those 149 variables in addition to the 17 current compound indicators. This approach was carried out with the binary as well as with the continuous representation of the history variables.
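
The following sketch shows how the 149 second order variables and the final 166-dimensional input could be assembled. The function names, the ordering of the variables and the input format (a dictionary of history weights and a set of current compounds) are assumptions made for illustration.

```python
# Compound groups by abbreviation (Table 2); the ordering is an assumption.
RTI = ["ZDV", "ddI", "ddC", "d4T", "3TC", "ABC", "TDF", "NVP", "DLV", "EFV"]
PI = ["SQV", "RTV", "IDV", "NFV", "APV", "LPV", "ATV"]


def second_order_features(history, current):
    """Products of history variables and current indicators within each group.

    history: dict mapping compound -> history value (1 for the binary
             encoding, a weight in [0, 1] for the continuous one)
    current: set of compounds in the current therapy
    """
    features = []
    for group in (PI, RTI):
        for past in group:
            for now in group:
                features.append(history.get(past, 0.0) * (1.0 if now in current else 0.0))
    return features  # 7 * 7 + 10 * 10 = 149 values


def full_input(history, current):
    """149 second order variables followed by the 17 current indicators."""
    base = [1.0 if c in current else 0.0 for c in RTI + PI]
    return second_order_features(history, current) + base


x = full_input({"IDV": 1.0, "ZDV": 0.5}, {"LPV", "RTV", "ZDV", "3TC"})
print(len(x))  # 166
```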

2.4 Clustering Compound Combinations

Despite the abundance of therapies in the database, some compound compositions appear much more frequently than others. Consequently, the occurrences of specific transitions from one drug combination to another also vary strongly. A certain number of occurrences for each combination is important, especially for the graph approaches (section 2.5). Grouping drug combinations according to their similarity results in fewer therapy classes, which are more balanced with respect to the number of therapies they represent.

The first step of grouping was to account for the boosting ability of Ritonavir: the effect of PIs in a therapy can be increased by adding Ritonavir. After the year 2000, many physicians did this without it being recorded in the database. Therefore, for assessing the similarity of two treatments, as a precaution, Ritonavir was added to all compound compositions that included a PI and started after the year 2000. For therapies with a start date before 2000, RTV was only added to the compositions containing Lopinavir, which essentially always includes Ritonavir.

For those Ritonavir-enriched compound combinations, André Altmann provided a matrix holding their pairwise similarities (similarity matrix) as real values in the interval [0, 1]. The combinations were finally grouped as follows:

• The compound combinations comprising one or two compounds were grouped manually such that each group represented a certain number of therapies. For the drug combinations containing only one compound, all NNRTIs and PIs formed one cluster, Zidovudine as the only compound in a therapy formed a further one, and the remaining NRTIs another. The combinations of two compounds were grouped in the following way: one cluster contained all combinations comprised of two PIs. Another one featured all combinations of drugs from two different classes. The 2-NRTI combinations were subdivided into the groups Lamivudine-Zidovudine (3TC-ZDV), Lamivudine-Stavudine (3TC-D4T), Didanosine-Zidovudine (DDI-ZDV), Zalcitabine-Zidovudine (DDC-ZDV) and other combinations of two NRTIs (2NRTIs).

• For the two subsets of combinations containing three and four compounds a kind of hierarchical clustering was applied to group them:

1. Begin with a starting cluster C_0 containing all compound combinations.

2. Apply a binary clustering that groups the combinations into two subclusters C_1, C_2 based on the dissimilarity matrix (1 - similarity matrix).

3. If the two subclusters C_1, C_2 do not violate the termination criterion (see below), iterate steps 2 and 3 over C_1 and C_2.

The binary subclustering (step 2) is done using the Partitioning Around Medoids (PAM) approach. The termination criterion in step 3 is considered violated if either the number of therapies in the database represented by one of the subclusters is smaller than a certain threshold T_theras, or the quotient of the number of represented therapies and the number of represented compound compositions (number of elements in the cluster) is smaller than a threshold T_quot. T_quot prevents the average number of therapies per compound composition from getting too small, while T_theras accounts for the absolute number of therapies in a cluster.

In practice, an approach with T_theras = 100 and T_quot = 5 resulted in 45 compound composition clusters for the combinations holding three or four compounds (22 containing three compounds and 23 containing four).

• All combinations containing more than four compounds were grouped in a salvage cluster.

All in all, this clustering-procedure resulted in 56 clusters, each representing a certain amount of therapies to work with.
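The recursive splitting used for the three- and four-compound combinations can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: instead of calling a PAM routine, the binary split is found by exhaustively trying all medoid pairs, which minimizes the same objective (total dissimilarity to the nearest medoid) that PAM optimizes heuristically and is only feasible for small inputs. All function names and the `n_therapies` lookup are hypothetical.

```python
from itertools import combinations


def two_medoid_split(items, dist):
    """Split items into two groups around the medoid pair with minimal total distance."""
    best = None
    for m1, m2 in combinations(items, 2):
        cost = sum(min(dist[i][m1], dist[i][m2]) for i in items)
        if best is None or cost < best[0]:
            best = (cost, m1, m2)
    _, m1, m2 = best
    c1 = [i for i in items if dist[i][m1] <= dist[i][m2]]
    c2 = [i for i in items if dist[i][m1] > dist[i][m2]]
    return c1, c2


def acceptable(cluster, n_therapies, t_theras, t_quot):
    """Termination criterion: enough therapies in total and per combination."""
    if not cluster:
        return False
    total = sum(n_therapies[i] for i in cluster)
    return total >= t_theras and total / len(cluster) >= t_quot


def recursive_clustering(items, dist, n_therapies, t_theras=100, t_quot=5):
    """items: indices of compound combinations; dist: dissimilarity matrix (1 - similarity)."""
    if len(items) < 2:
        return [items]
    c1, c2 = two_medoid_split(items, dist)
    if not (acceptable(c1, n_therapies, t_theras, t_quot)
            and acceptable(c2, n_therapies, t_theras, t_quot)):
        return [items]  # splitting would violate the criterion: keep the parent cluster
    return (recursive_clustering(c1, dist, n_therapies, t_theras, t_quot)
            + recursive_clustering(c2, dist, n_therapies, t_theras, t_quot))
```

A split is only accepted if both subclusters satisfy the criterion; otherwise the parent cluster is kept, which is what bounds the number of resulting clusters.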

2.5 Therapy Sequence Graphs

Therapy sequence graphs might be a useful tool for analyzing habits in prescribing different therapies in certain sequences and the effect of those habits on the therapy outcome, since graphs generally have the advantage of being much clearer and more visual than tables.

2.5.1 Transitions-Occurrence Graphs

The 1,689 compound compositions form the nodes of this graph representation of the database. They are weighted according to their frequency in the database. The edges represent the occurrences of changes from one drug combination to another: there is an edge from combination A to combination B if there is an occurrence in the database where the follow-up therapy for a therapy with compound combination A contains the compounds represented by B. The edge's weight is the number of those occurrences in the database. Further edge labels are the numbers of successful and failing therapies among the therapies in node B that belong to the edge.

If a drug combination occurs as the starting therapy for a patient, an edge from an additional node representing therapy-naivety is introduced with a corresponding weight. Analogously, a node indicating that a combination is used for the current treatment of a patient (the therapy has not ended yet according to the database) is included in the graph. Edges to another additional node indicate that a therapy does have a stop date, but that there is no follow-up drug combination. In a further approach, the nodes did not represent all compound compositions, but the 56 combination clusters (see section 2.4).

The transition-occurrence representation aims to find out whether there are favored therapies (represented by hubs in the graph), whether there are common therapy pathways, and how drug combinations are changed in general. Findings can possibly be integrated into new feature encodings for prediction.
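The construction described above might be sketched like this. The input layout (one list of `(combination, success)` pairs per patient plus an `ongoing` flag), the string labels for combinations and the names of the three additional nodes are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical names for the three additional nodes.
NAIVE, STOPPED, ONGOING = "naive", "stopped", "ongoing"


def transition_graph(patients):
    """patients: list of (therapies, ongoing) tuples.

    therapies: chronologically ordered list of (combination, success) pairs,
    where success is True, False or None (outcome undefined).
    ongoing: True if the last therapy has no stop date.
    """
    node_weight = Counter()
    edges = defaultdict(lambda: {"count": 0, "success": 0, "failure": 0})
    for therapies, ongoing in patients:
        previous = NAIVE
        for combination, success in therapies:
            node_weight[combination] += 1
            edge = edges[(previous, combination)]
            edge["count"] += 1
            if success is True:
                edge["success"] += 1
            elif success is False:
                edge["failure"] += 1
            previous = combination
        if therapies:
            # Edge to one of the two artificial sinks.
            edges[(previous, ONGOING if ongoing else STOPPED)]["count"] += 1
    return node_weight, dict(edges)
```

The success and failure counts on an edge always refer to the target therapy of the transition, so the ratio success / (success + failure) gives an empirical success proportion for that change.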

2.5.2 N-gram Graphs

Originally, N-gram graphs come from the field of language modeling. Nonetheless, it is likely that this representation of the database contains information on the impact of previous treatments on the outcome of another one following in the timeline of a particular patient's therapies.

Figure 8: Scheme of a bi-gram graph for therapy sequences. The th_i represent compound compositions or therapy clusters.

Each node of the graph represents N subsequent drug combinations. For an edge from node A to node B to exist, there needs to be a therapy th that is represented by the last compound composition of node B. In addition, the previous N therapies of th need to match the drug combinations represented by node A (in the same order). Hence, the last N − 1 drug combinations of node A are identical to the first N − 1 of node B. The number of those transitions is the weight of the edge. The edge is labeled with the percentage of successful therapies th represented by node B (P_succ). As in the transition-occurrence graphs, three nodes accounting for therapy-naivety, the last therapy in a patient's treatment line and ongoing therapies are introduced. Figure 8 illustrates the topology of a bi-gram graph (N = 2) for therapy sequences. The uni-gram representation (N = 1) can be seen as an extension of the transition-occurrence graphs in section 2.5.1.

In this work, uni-gram and bi-gram graphs were constructed and visualized. The nodes represented two subsequent composition clusters (section 2.4), or just one cluster in the uni-gram case, rather than drug combinations. Reasons for this clustered representation were an explosion in the number of nodes as well as a low data density for many transitions (edges), despite the large database.

2.6 Graph Related Feature Encodings

2.6.1 Similarity to Previous Therapy

The most recent treatment in a therapy's past in particular should affect its outcome. An analysis of therapy sequence graphs suggested that in many cases in the database, treatments were not altered significantly: often only one compound was replaced, removed or added. Therefore, the similarity of the current therapy to the previous one might be a useful input. In this work, the similarity was encoded in two ways:

Categorical encoding: This encoding takes into account especially minor changes in treatment. Here the difference to the previous therapy is assumed to be one of the following:

• One compound was replaced by another one

• One compound was removed from the previous therapy

• One compound was added to the previous therapy

• The compound compositions are identical

• Any other treatment change (major change in therapy)⁴

The categorical input is encoded by a binary variable for each category.

Real-valued encoding: A matrix containing similarities (real-valued in [0, 1]) for all different drug combinations in the database was provided by André Altmann for this work. It was computed based on compound classes and the number of compounds in the therapy. The entries from this matrix are used directly as the similarity measure between current and previous therapy.

The two similarity measures were independently combined with the 34-digit binary vector of the binary history encoding. For comparison, prediction was also applied to an input encoding with the binary history variables removed.
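The categorical encoding lends itself to a short sketch based on set differences. The category labels, their order and the function names are assumptions; the treatment of Lopinavir plus Ritonavir as Lopinavir alone follows the footnote to the category list.

```python
# Hypothetical labels for the five categories, in the order listed in the text.
CATEGORIES = ("replaced", "removed", "added", "identical", "other")


def normalise(combination):
    """Treat LPV + RTV like LPV alone (RTV only acts as a booster there)."""
    compounds = set(combination)
    if "LPV" in compounds:
        compounds.discard("RTV")
    return compounds


def change_category(previous, current):
    prev, cur = normalise(previous), normalise(current)
    gone, new = prev - cur, cur - prev
    if not gone and not new:
        return "identical"
    if len(gone) == 1 and len(new) == 1:
        return "replaced"
    if len(gone) == 1 and not new:
        return "removed"
    if not gone and len(new) == 1:
        return "added"
    return "other"


def one_hot(category):
    """One binary variable per category."""
    return [1 if category == c else 0 for c in CATEGORIES]
```

For example, a change from 3TC-ZDV to 3TC-D4T falls into the "replaced" category, while a change that swaps two or more compounds at once is counted as a major change ("other").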

2.6.2 Transitions from Previous Therapy Clusters

In this approach, transition information from the N-gram graphs is introduced into the feature encodings: a binary indicator vector is constructed that codes for the clusters of the current and the previous n − 1 therapies. In the case n = 2, each position in this vector codes for two clusters (one edge) in the uni-gram graph. In other words, the variable clust1 → clust2 for therapy th has the value 1 if the preceding therapy of th is assigned to cluster clust1 and th itself is a member of clust2. In order to prevent the indicator vector from getting too large, only cluster transitions occurring at least 50 times in the database are integrated. This corresponds to an edge weight ≥ 50 in the uni-gram cluster graph. This cutoff resulted in a 226-digit vector, which also includes the transitions naive → clust2. The indicator vector constructed in this manner is appended to the binary history encoding for the final inputs.

This feature representation was also altered such that each transition variable represented the previous two therapies plus the current one (n = 3). Thus, here the transitions of the bi-gram graph are encoded. For an edge weight of at least 50, this resulted in 110 additional inputs. In a third representation, the transition encodings from the uni-gram and the bi-gram graphs described above were concatenated.
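A sketch of how the transition vocabulary and the indicator vector could be derived from per-patient cluster sequences is shown below. The helper names are hypothetical, and the handling of the sequence start (a single "naive" symbol is prepended, so for n = 3 the first therapy of a patient yields no transition) is an assumption that the text does not specify.

```python
from collections import Counter


def ngrams(sequence, n):
    """All runs of n consecutive clusters, with a leading 'naive' symbol."""
    padded = ["naive"] + list(sequence)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


def frequent_transitions(sequences, n=2, min_count=50):
    """Vocabulary of transitions occurring at least min_count times."""
    counts = Counter(g for seq in sequences for g in ngrams(seq, n))
    return sorted(g for g, c in counts.items() if c >= min_count)


def transition_indicator(transition, vocabulary):
    """Binary vector with a 1 at the position of the observed transition, if frequent."""
    return [1 if transition == v else 0 for v in vocabulary]
```

A transition that is too rare to be part of the vocabulary is encoded as an all-zero vector, so rare transitions are indistinguishable from each other in this encoding.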

⁴ In practice, any combination of Lopinavir (LPV) and Ritonavir (RTV) is considered equal to LPV on its own here, because Ritonavir is always used as a booster in a minor dose together with LPV. Treating Lopinavir and Ritonavir as independent compounds in those cases might distort the results.

2.7 Analysis of Therapy Success based on Single Compound Replacements

The goal of this approach is to estimate the success probabilities of treatment when one compound in the previous therapy is replaced by another drug in the current one. Assume that

$$P_{succ}(\text{current therapy} = Comb_1 \mid \text{previous therapy} = Comb_0) = P_{succ}(Comb_1 \mid Comb_0)$$

denotes the proportion of successful therapies of the compound combination Comb_1 under the premise that the compound combination of the previous therapy is Comb_0. Let X be some compound combination, C_old and C_new be some compounds, A be some base compound combination that does not contain compound C_new and B be some base combination that contains neither compound C_old nor C_new. Then, for each pair of compounds (C_old, C_new), the probability proportion of success of replacing C_old by C_new in the follow-up therapy can be approximated:

$$Q^{change}_{succ}(C_{old}, C_{new}) = \frac{\mathrm{avg}_{B}\big(P_{succ}(B + C_{new} \mid B + C_{old})\big)}{\mathrm{avg}_{A}\big(\mathrm{avg}_{X}\big(P_{succ}(A + C_{new} \mid X)\big)\big)}$$

Here avg_I() denotes averaging over all defined occurrences of compound-composition transitions in the database with respect to combination I in that context. Note that if C_old or C_new represents no compound (NULL compound), the addition or removal of one compound can be encoded. In the final implementation, it comes down to finding the edges corresponding to the particular therapy transition and averaging over the probabilities of success labeling those edges in the transition-occurrence graph (see section 2.5.1).

It might be very useful to see which drug replacements are the most promising in the general case. The goal is to assess the probability proportion Q^change_succ for each pair of compounds. Which change in drug combination is most likely to be successful for a particular patient certainly needs to be decided individually.

2.8 Consolidation of Genotype and Therapy History-based Methods

In order to improve prediction, combining history-based feature encodings with those based on the viral genotype might be a promising approach.

Therefore, in a first step, genotypic and historical prediction were performed separately for the subset of 3,910 therapies for which a valid genotype measurement was available and a therapy response was defined. 10-fold cross validation was applied on this dataset, with the difference that for the history-based models, the training set in each run comprised the 31,239 therapies with no valid genotype available. For genotypic prediction, training data was drawn from the genotype subset. Again, logistic regression and random forest models were trained. The respectively best performing history encoding was used: the binary second order variables for logistic regression and the continuous representation for random forests (see Table 3). The genotypic representation contained 49 inputs for mutations in reverse transcriptase and protease plus the 17 drug indicators together with their genetic barrier⁵. Both representations were combined in three different ways:

⁵ The genetic barrier is an encoding for the probability that a compound remains effective, derived from mutagenetic trees [1].

Feature Concatenation: The genotypic features were simply appended to the vector of historical ones.

Genotypic Encoding + Historical Prediction: Here the genotypic input vector was extended with the predicted probability of success from the historical model for the particular therapy.

Mean Prediction: Probabilities of therapy success resulting from the independent prediction above were averaged for assessing a therapy's response.

Validation for the combination predictions was carried out in the same manner as described above.

2.9 Realization

The EuResist database version for this work was provided as a MySQL database [16]. Several scripts for data analysis, generation of the input data from the database and graph data were developed using Python [17]. The model training and prediction as well as the provided plots are realized in R [18]. For graph visualization and analysis, Cytoscape [19], which was originally developed for the analysis of biological networks, was used. An extension to Cytoscape, the Network Analyser [20], which provides several additional graph-analysis methods, has been quite useful too.

3 Results and Discussion

3.1 Feature Encodings

An overview of all used encodings with corresponding parameters is given in Table 3. Mean AUC values and standard deviations from 10-fold cross validation runs for logistic regression and random forests are shown. Where the cross validation subsets are equal, a paired Wilcoxon-Mann-Whitney test can be applied for testing whether the AUCs originate from the same probability distribution. In the one-sided test used for this work, the hypothesis that the values from one AUC population are significantly smaller than the ones originating from another one is tested. The p-value resulting from this test gives a measure for the significance of the difference between the two sets of AUCs: the smaller the p-value, the more significant the difference. Statistical significance is often defined by a p-value < 0.05. Table 4 shows the p-values of paired Wilcoxon-Mann-Whitney tests for the encodings using the same cross validation subsets.
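To make the test concrete, the following sketch computes an exact one-sided p-value by enumeration. It assumes that the paired test corresponds to the Wilcoxon signed-rank test (which is what R's `wilcox.test` computes when `paired = TRUE`) and that there are no ties or zero differences; the function name is hypothetical and the thesis does not state which routine was actually used.

```python
from itertools import product


def signed_rank_p_less(x, y):
    """Exact one-sided p-value for the alternative 'x tends to be smaller than y'.

    Assumes paired samples without ties or zero differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    # Test statistic: sum of ranks of the positive differences.
    observed = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null hypothesis every sign pattern is equally likely.
    hits = 0
    for signs in product((0, 1), repeat=len(diffs)):
        if sum(r for r, s in zip(ranks, signs) if s) <= observed:
            hits += 1
    return hits / 2 ** len(diffs)
```

Under these assumptions, with 10 cross validation folds the smallest attainable p-value is 1/1024 ≈ 0.00098 (all ten AUC differences point in the same direction), and the next possible value is 2/1024 ≈ 0.00195.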

Feature | Parameters | AUC (LR) | SD (LR) | AUC (RF) | SD (RF)
-------------------------------------------------------------------------------------------------------
No History | 100 trees | 0.6704 | 0.0113 | 0.6696 | 0.0084
Binary | duration ≥ 0, 100 trees | 0.7128 | 0.0081 | 0.7294 | 0.0065
No History - All treatments recorded | 100 trees | 0.7016 | 0.0149 | 0.7 | 0.0159
Binary - All treatments recorded | duration ≥ 0, 100 trees | 0.747 | 0.0172 | 0.7625 | 0.0105
Continuous | k = 1, 100 trees | 0.7142 | 0.0085 | 0.734 | 0.0076
2nd Order Variables | 100 trees | 0.7258 | 0.0083 | 0.7187 | 0.0059
2nd Order Variables | 200 trees | - | - | 0.7196 | 0.0069
Continuous 2nd Order Variables | 200 trees | 0.7067 | 0.0105 | 0.7268 | 0.0067
Cont. Similarity to previous therapy | 100 trees | 0.6724 | 0.0112 | 0.6916 | 0.0048
Bin. Hist + Cont. similarity to prev. therapy | 100 trees | 0.7116 | 0.0098 | 0.7266 | 0.0089
Bin. Hist + Bin. similarity to prev. therapy | 100 trees | 0.7162 | 0.0078 | 0.7239 | 0.007
Bin. Hist + Cluster trans. (prev. therapy) | w > 50, 200 trees | 0.721 | 0.0075 | 0.7297 | 0.0061
Bin. Hist + Cluster trans. (prev. 2 therapies) | w > 50, 200 trees | 0.7166 | 0.0072 | 0.7287 | 0.006
Bin. Hist + Cluster trans. (prev. 1+2 therapies) | w > 50, 200 trees | 0.7205 | 0.0075 | 0.7297 | 0.0065

Table 3: Mean AUC values and standard deviations for the prediction via logistic regression (LR) and random forests (RF) for different feature encodings

Base Encoding: For the encoding of just the current therapy compounds, the ROC curves resulting from the 10 cross validation runs for logistic regression can be seen in Figure 9 (left). The mean AUC was 0.6704. Thus, this encoding performed worse than any history encoding (see Table 3). When examining the fitted coefficients from logistic regression, almost all compound variables in the therapy appear to be highly significant⁶. Amprenavir and Delavirdine seem not to be significant. An explanation is their limited occurrence in the input dataset (Figure 10). Stavudine varies between significant and highly significant in the 10 cross validation runs. Didanosine is not considered significant in three runs. Measured by the mean AUC, random forests (100 trees) perform slightly worse in this encoding (mean AUC: 0.6696).

LR, RF | No History | Binary | Cont. | 2nd Order (bin) | 2nd Order (cont) | Sim. prev. th. (bin) | Cluster trans. (prev. 1) | Cluster trans. (prev. 2) | Cluster trans. (prev. 1+2)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
No History | - | 0.00098, 0.00098 | 0.00098, 0.00098 | 0.00098, 0.00098 | 0.00098, 0.00098 | 0.00098, 0.00098 | 0.00098, 0.00098 | 0.00098, 0.00098 | 0.00098, 0.00098
Binary | 1, 1 | - | 0.00488, 0.00195 | 0.00098, 1 | 0.999, 0.8623 | 0.001953, 0.9932 | 0.00098, 0.09668 | 0.00195, 0.2783 | 0.00098, 0.09668
Cont. | 1, 1 | 0.997, 0.999 | - | 0.00098, 1 | 1, 1 | 0.06543, 1 | 0.00098, 0.9756 | 0.00684, 0.9902 | 0.00195, 0.9814
2nd Order (bin) | 1, 1 | 1, 0.00098 | 1, 0.00098 | - | 1, 0.00098 | 1, 0.00977 | 0.9902, 0.00098 | 1, 0.00098 | 0.9902, 0.00098
2nd Order (cont) | 1, 1 | 0.00195, 0.1611 | 0.00098 | 0.00098, 1 | - | 0.00098, 0.9346 | 0.00098, 0.1162 | 0.00098, 0.1875 | 0.00098, 0.09668
Sim. prev. th. (bin) | 1, 1 | 0.999, 0.00977 | 0.9473, 0.00098 | 0.00098, 0.9932 | 1, 0.08008 | - | 0.00293, 0.00977 | 0.2783, 0.03223 | 0.00684, 0.01367
Cluster trans. (prev. 1) | 1, 1 | 1, 0.92 | 1, 0.03223 | 0.01367, 1 | 1, 0.9033 | 0.998, 0.9932 | - | 0.997, 0.8623 | 0.9033, 0.3477
Cluster trans. (prev. 2) | 1, 1 | 0.999, 0.7539 | 0.9951, 0.01367 | 0.00098, 1 | 1, 0.8389 | 0.7539, 0.9756 | 0.00488, 0.1611 | - | 0.00684, 0.1377
Cluster trans. (prev. 1+2) | 1, 1 | 1, 0.92 | 0.999, 0.02441 | 0.01367, 1 | 1, 0.92 | 0.9951, 0.9902 | 0.1162, 0.6875 | 0.9951, 0.8838 | -

Table 4: Pairwise significances (p-values, paired one-sided Wilcoxon-Mann-Whitney test) of AUC values for prediction via different feature encodings. Each cell lists the p-value for logistic regression followed by the one for random forests. Hypothesis: the AUC population indicated by the encoding in the rows is less than the values represented by the column encodings. The prediction was validated using the same cross validation subsets.

Figure 9: ROC curves for logistic regression (10-fold cross validation) for base encoding (left) and cluster transition encoding (right). Black bars indicate standard deviations.

Figure 10: Occurrences of the 17 compounds in therapies with a defined outcome

Figure 11: Boxplot: AUC values (10-fold cross validation) for different cutoffs (in days) for therapy duration in the binary history encoding. Red circles indicate the mean AUC. The bars represent standard deviations, the horizontal lines in the bars medians and black circles outliers. Left: logistic regression; Right: random forests

Binary History Encoding improved prediction quite significantly (see Tables 3 and 4). The fitted coefficients in logistic regression for the history variables in general had a negative influence on the therapy outcome being a success. As previously stated, compounds in a therapy's past are likely to evoke viral resistances. Thus, this result is quite reasonable. Accordingly, there are cases from all drug classes (Zalcitabine, Saquinavir, Delavirdine) where the history variable for a compound was considered highly significant while the current variable was not. Reverse scenarios also occurred. For instance, Zidovudine was not significant as a history input, probably because of its early approval and wide use. The analysis of different cutoff durations for introducing a compound into a therapy's history led to results that deviated from the expectation described in section 2.3.2. The mean AUCs with their standard errors for different choices of the cutoff can be seen in Figure 11. In both applied prediction methods, performance decreases almost linearly as the cutoff increases. This implies that no matter for how long a drug is applied in therapy history, it affects the current therapy. Therefore, in the other encodings, no duration cutoff was applied for introducing a compound into the treatment history.

⁶ For the coefficient analysis, significance is defined by p-values < 0.01. Variables with p-values of less than 0.001 are considered highly significant.

Figure 12: Boxplot: AUC values (10-fold cross validation) for different powers (k) in the inflation function for the continuous history encoding. Left: logistic regression; Right: random forests

Continuous Representation for Therapy History resulted in slightly better AUC values for logistic regression (mean AUC: 0.7142, k = 1). For random forest prediction, it yielded the best performance (mean AUC: 0.734, k = 1) for the dataset of therapies from all patients. Logistic regression coefficients showed similar significances as in the binary history encoding. Figure 12 shows the AUCs for various choices of the parameter k, which controls the descent in weight for compounds applied further in a therapy's past. Although the plot's AUC range is quite small, it is recognizable that a linear decrease in weight yielded the best prediction for logistic regression (left): a kind of plateau in mean AUCs is reached around k = 1. Apparently, therapies further in the past play a significant role for assessing the current therapy, since choices of k < 1 lead to worse prediction. For random forests, performances are more scattered. However, one can observe that with increasing power k, the mean AUC approaches the one for the binary history encoding (mean AUC: 0.7294), which is logical, since for large k the weight for a compound stays near a value of 1 for quite a long time difference (see Figure 7). Again, the best performance is achieved with choices around k = 1.

During realization, at one point, the inflation function was by mistake not applied to the time difference, but to the duration of application of the compound in the therapy history. This resulted in respectable mean AUCs of 0.7138 (logistic regression) and 0.755 (random forests) for k = 1. The maximum in the k-AUC plots was even more distinct. A possible explanation for this performance is that many therapies with short durations (e.g. one day) originate from the KAROLINSKA subset of the EuResist database. This Swedish subset, in comparison to ARCA and AREVIR, has a quite high rate of therapy success. Therefore, a short therapy duration is more likely to be from KAROLINSKA and thus affects the current therapy positively. The model is probably too much adapted to the dataset and does not represent general prediction.

Second Order Variables yielded an even further significant improvement in prediction for logistic regression (mean AUC: 0.7258). For random forests of 100 trees, the mean AUC was 0.7187. Increasing the number of trees in the forest to 200 resulted in better AUCs. An attempt with 400 trees was made, but failed due to a lack of computing capacity. In general, since the second order variables represented drugs in the history and current therapy both being either just RTIs or PIs, they influenced the prediction outcome rather negatively. The significance was - as expected - diverse. Highly significant variables with negative coefficients in the logistic regression fit were for instance IDV-LPV⁷ (both PIs) or D4T-ABC (NRTIs). Saquinavir in the history with Indinavir, Ritonavir or Nelfinavir (all PIs) in the current therapy was also significant in a negative way. One of the few positively influencing examples was EFV-ZDV (NNRTI-NRTI). Although logic suggests otherwise, EFV-NVP (both NNRTIs) was assigned a positive effect on therapy success. However, vice versa, NVP-EFV had, as expected, a significantly negative effect.

The continuous version improved random forest performance (mean AUC: 0.7268) but resulted in a decrease of the mean AUC for logistic regression. Again, coefficients turned out to be rather negative. An example is Lopinavir applied in history with Amprenavir, Atazanavir or itself in current therapy. The reverse cases were significant as well. Strangely, compound repetitions like DDI-DDI, ABC-ABC, TDF-TDF and NVP-NVP seem to have a positive influence. A possible explanation is strategic interruptions in therapy, which have been shown to be useful for treatment for short terms.

Subset of Therapies with Completely Recorded History: The simple binary history encoding (duration cutoff = 0) was also applied to the dataset containing only treatments from patients whose history was completely recorded. This improved prediction significantly: mean AUCs were 0.747 for logistic regression and 0.7625 for random forests. The control prediction, based on inputs for just the current therapy's compounds, yielded higher mean AUCs (logistic regression: 0.7016; random forests: 0.7) as well. However, the number of therapies with a defined outcome shrank to 17,720.

⁷ Notation: historical compound - current compound

3.2 Therapy Sequence Graphs

Figure 13: Transition Occurrence Graph. Nodes represent at least 50 therapies in the database. Edges account for at least 10% of a node's outgoing edges. Isolated nodes are filtered out. Edge labels represent the occurrences of the treatment change in the database.

Transition-Occurrence Graph: The graph including all treatments containing the 17 compounds of interest had dimensions of 1,692 nodes (1,689 drug combinations + nodes for therapy-naivety, last recorded and current therapy) and 17,551 edges. On average, a node had 17.49 neighbors (not counting self-loops). Because of this huge size, filtering was necessary in order to extract valuable information. Combinations of the following filters were applied:

• only include nodes (therapies) that show a certain frequency in the database

• exclude nodes with fewer than a certain number of outgoing or ingoing edges

• only include nodes with a certain degree (number of ingoing edges + number of outgoing edges)

• exclude edges with a weight below a certain cutoff

• for each node: exclude edges accounting for less than p percent of the sum of weights of all outgoing edges

For instance, quite effective filtering was achieved by filtering out nodes with an occurrence below 50 and removing all edges accounting for fewer than 10 percent of the sum of a node's outgoing edge weights. After removing the three additional nodes and all isolated nodes, the size shrank to 90 nodes and 120 edges. The corresponding graph is shown in Figure 13. From the graph-topological point of view, general therapy pathways don't seem to exist. Paths in the graph contain more than one edge in only a few cases (for example the paths starting in node DDI). Transitions from therapies consisting of only one compound (blue nodes in the figure) to combination therapies are quite common. Obviously, the reverse transitions are not considered to be efficient treatment changes. In many cases, only one compound is added or changed. For instance, in the two transitions starting in node D4T-DDC-SQV (light green node), Zalcitabine is replaced either by Didanosine or Zidovudine. Other examples are the smaller connected components on the figure's right side. Instances of the addition of one compound are given by the edges originating in 3TC-NFV (dark green). In some cases, compounds are removed (D4T-DDI-EFV-NFV - yellow) - probably due to adverse effects. Self-loops, which represent treatment interruptions, are relatively frequent too. These findings motivated the encodings for the similarities to the previous therapy in section 2.6.1 and the analysis of therapy success based on single compound replacements (section 2.7).
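The filter combination used for Figure 13 could look roughly like the sketch below. The data layout (dictionaries of node and edge weights) and all names are assumptions; in particular, the text does not say whether the 10 percent share is computed before or after nodes are removed, so the sketch uses the outgoing weight sums of the unfiltered graph.

```python
def filter_graph(node_weight, edge_weight, min_node=50, min_share=0.10,
                 extra_nodes=("naive", "stopped", "ongoing")):
    """node_weight: dict node -> number of therapies.
    edge_weight: dict (source, target) -> number of transitions.

    Keeps frequent nodes, drops edges below min_share of the source node's
    outgoing weight, removes the additional nodes and isolated nodes.
    """
    # Outgoing weight sums on the unfiltered graph (assumption, see lead-in).
    out_total = {}
    for (source, _), weight in edge_weight.items():
        out_total[source] = out_total.get(source, 0) + weight

    kept_nodes = {
        node for node, weight in node_weight.items()
        if weight >= min_node and node not in extra_nodes
    }
    kept_edges = {
        (source, target): weight
        for (source, target), weight in edge_weight.items()
        if source in kept_nodes and target in kept_nodes
        and weight >= min_share * out_total[source]
    }
    connected = {node for edge in kept_edges for node in edge}
    return kept_nodes & connected, kept_edges
```

Note that in this sketch a node whose only remaining edge is a self-loop is still kept, since self-loops are counted as edges.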

Uni-gram graph: The basic uni-gram graph consisted of 59 nodes (56 clusters + 3 additional nodes) and 2,466 edges. The average number of neighbors was 50.58. Despite the relatively small number of nodes, its unfiltered topology was nonetheless complex. Therefore, a filtering was applied according to the edge occurrences. Figure 14 shows the uni-gram graph with the edges representing at least 50 therapy transitions in the database. After removing isolated nodes and the additional nodes for current therapies and therapy stop, the graph consisted of 35 cluster nodes plus the naive node and 215 edges. Apparently, the drug combinations in the clusters 3TC-D4T-IDV-RTV and 3TC-IDV-RTV-ZDV (blue nodes in the figure) occur quite often in therapy pathways. The graph and the probabilities of success on the edges are quite plausible: for instance, the transition NRTI→2NRTIs (green nodes) is successful in only about 35 percent of the cases. The transition DDI-ZDV→3TC-D4T (both therapies consist of two NRTIs; orange edge) is not that successful either (P_succ = 0.36). Switching from 2NRTIs to a combination therapy (dark red edges), on the other hand, yields probabilities of success of approximately 0.6 to 0.8. Transitions from therapies with only one or two compounds to combination therapies make sense in general: for example, 3TC-D4T→3TC-D4T-IDV-RTV (green edge) represents a probability of success of 0.63, while the reverse edge (red edge) has a label of 0.367. The outgoing edges from the salvage cluster (red) are, in terms of occurrences, mainly self-loops. Edges originating in the naive node (yellow) plausibly show a high success rate on average.

In order to analyze the predictive power of those cluster transitions, the cluster transition encoding introduced in section 2.6.2 seems reasonable.

Figure 14: Uni-gram graph. Edges account for at least 50 transitions in the database. Isolated nodes are filtered out. Edge labels represent probabilities of success.

Bi-gram graph: Since each node represents two subsequent clusters of therapies, the size of the graph grows drastically in comparison to the uni-gram version: The bi-gram graph contains 2,358 nodes and 16,843 edges (14.1 edges per node on average - not counting self-loops). Filtering edges with an occurrence of less than 50 and removing isolated nodes and the two articial sinks afterwards resulted in a graph with 76 nodes and 108 edges (Figure 15). Again, transitions from the naive-node hold relatively high probabilities of success.

Starting with a therapy in the 2NRTIs-cluster (green) seems to be more successful (Psucc = 0.48) than starting with one in the NRTI-cluster (blue, Psucc = 0.28). Therapy effectiveness can be increased by starting with a combination therapy as in the 3TC-IDV-RTV-ZDV-cluster (yellow, Psucc = 0.8), which consists mainly of therapies containing Lamivudine, Zidovudine and two PIs. Salvage therapies as starting treatment seem to have relatively high chances of success, too (turquoise node). From here, in almost all cases, a second salvage therapy followed (dark red node), which is quite reasonable, since salvage therapies are often applied to patients with heavily pre-treated viral populations. The purple-colored node in the figure, for instance, represents the 3TC-IDV-RTV-ZDV self-loop of the uni-gram version. Apparently, treatment interruptions or changes to other drug combinations within this cluster seem to be successful (Psucc = 0.82). Interestingly, after initially receiving a therapy from 3TC-IDV-RTV-ZDV (yellow node), it is more promising to switch directly to a therapy from cluster DDC-EFV-ZDV or DDI-EFV-ZDV (grey; Psucc = 0.88 and 0.68) than to receive another therapy from 3TC-IDV-RTV-ZDV (purple) and then change treatment (Psucc = 0.81 and 0.61). Note that only a few examples are discussed in this section, since a full analysis would be beyond the scope of this thesis.
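How bi-gram nodes and their edges arise from a sequence of therapy clusters can be sketched in a few lines of Python. The representation of a patient's history as a plain list of cluster names starting with "naive", as well as the function name, are assumptions for illustration and do not necessarily match the construction used for Figure 15.

```python
from collections import Counter


def bigram_edges(sequence):
    """Return the edges between bi-gram nodes for one cluster sequence."""
    # Each node holds two subsequent clusters of the sequence.
    nodes = list(zip(sequence, sequence[1:]))
    # Consecutive bi-gram nodes overlap in exactly one cluster.
    return list(zip(nodes, nodes[1:]))


# Hypothetical cluster sequences of two patients.
sequences = [
    ["naive", "2NRTIs", "3TC-IDV-RTV-ZDV", "3TC-IDV-RTV-ZDV"],
    ["naive", "2NRTIs", "3TC-IDV-RTV-ZDV", "DDI-EFV-ZDV"],
]
edge_counts = Counter(e for seq in sequences for e in bigram_edges(seq))
shared = (("naive", "2NRTIs"), ("2NRTIs", "3TC-IDV-RTV-ZDV"))
```

Both toy sequences share the edge stored in `shared`, so its count is 2, whereas the two continuations (a self-loop within 3TC-IDV-RTV-ZDV and a switch to DDI-EFV-ZDV) are each counted once. Since every node is a pair of clusters, the number of possible nodes grows quadratically with the number of clusters, which is in line with the size of the unfiltered bi-gram graph.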

Figure 15: Bi-gram graph. Edges account for at least 50 transitions in the database. Isolated nodes are filtered out. Edge labels represent probabilities of success.

3.3 Graph Related Feature Encodings

Similarity to Previous Therapy: Since for each therapy the previous one needed to be valid, i.e. consist only of combinations of the 17 compounds of interest, or the patient needed to be therapy-naive, the dataset for the continuous similarity encoding shrank to 34,879 therapies. For the input containing only the current compound composition and the real-valued similarity to the previous therapy, performance was poor (mean AUCs: logistic regression: 0.6724; random forests: 0.6916), but at least improved in relation to the basic encoding for current treatment. The newly introduced variable, which holds the corresponding entry of the provided similarity matrix, was found to negatively influence therapy success and to be highly significant in all cross validation runs for logistic regression. When the binary history encoding was added, the similarity variable was not significant in any of the cases. In comparison with the simple binary encoding, predictive potential decreased (see Table 3).

The addition of the binary similarity indicator to the history variables improved prediction significantly in the case of logistic regression (mean AUC: 0.7162) and resulted in a decrease of performance for random forests (mean AUC: 0.7239). The two consistently highly significant variables in logistic regression were the one indicating whether one compound changed in comparison to the previous therapy and the one indicating whether the two therapies consisted of the same compounds. Both had a positive effect on therapy success. The other three additional variables were assigned no or low significance in most cross validation cases.

Transitions from Previous Therapy Clusters: Performance was best for the encoding capturing the cluster transition from just the previous therapy, corresponding to the edges in the uni-gram graph: this encoding also achieved one of the best overall predictions for the complete dataset with mean AUCs of 0.721 (logistic regression, see Figure 9) and 0.7297 (random forests). Inspecting the fitted coefficients from logistic regression, one finds that the new variables make sense in most of the cases. As expected, their significances differ. In the following, a few examples of highly significant variables are examined: apparently, adding Didanosine or Lamivudine to a therapy consisting only of Zidovudine leads to fewer effective therapies. Also, replacing Zalcitabine in DDC-ZDV by Lamivudine seems to affect therapy success negatively. The transition NRTI→DDI-ZDV is seemingly not a good choice. On the other hand, replacing compounds in the PI-containing cluster 3TC-D4T-IDV-RTV by NNRTIs, such as in the ABC-D4T-NVP-cluster, seems to be one. Interestingly, a follow-up therapy from 3TC-ABC-D4T (NRTIs) to one from 3TC-IDV-RTV-ZDV (NRTIs and PIs) can enhance therapy success. The variables containing the salvage cluster as the current one are in general considered significant in the direction of therapy failure. This makes sense, since salvage therapies are often introduced when the patient shows severe symptoms and there are few chances of success with a normal therapy. Historical and current compound variables are still highly significant in most cases.

Additionally taking into account the predecessor's preceding therapy-cluster (edges in the bi-gram graph) reduced prediction accuracy in comparison to the uni-gram-cluster-transitions (see Table 3). In the case of logistic regression prediction, this reduction was significant (see Table 4). Again, the coefficients for logistic regression were analyzed.
A few significant examples are given here: a therapy-naive patient who received Zidovudine as first treatment and a combination of Didanosine and Zidovudine afterwards is likely not to respond to that follow-up therapy. Salvage therapies as last cluster in the sequence often indicate a negative outcome, with the same explanation as above. Self-loops, as in the D4T-DDI-IDV-RTV and the 3TC-IDV-RTV-ZDV clusters, representing treatment interruptions, can again enhance therapy success.

In the combined input of uni-gram and bi-gram features, basically the same variables were significant as in the uni-gram-cluster-transition encoding. The bi-gram variables were not considered significant in most of the cases. Thus, it seems that the directly preceding therapy has much more effect on the outcome of the current therapy. Therefore, performance was quite similar to the encoding taking into account just the previous therapy's cluster.

Combination                    AUC (LR)   SD (LR)   AUC (RF)   SD (RF)
No combination - History       0.7215     0.0265    0.7455     0.0232
No combination - Genotype      0.766      0.0245    0.7708     0.0215
Feature concatenation          0.7784     0.0253    0.7925     0.0183
Genotype + Hist. Prediction    0.7809     0.0243    0.7873     0.0199
Mean prediction                0.7798     0.0254    0.7938     0.0219

Table 5: Mean AUC values and standard deviations for the consolidation of genotypic and history-based prediction. The genotypic encoding was the genetic-barrier encoding; the history encodings were binary second order variables for logistic regression and the continuous history representation for random forests.

LR, RF                 No Comb. - History   No Comb. - Genotype   Feat. concat.      GT + Hist. Pred.   Mean Pred.
No Comb. - History     -                    0.00195, 0.00195      0.00098, 0.00098   0.00098, 0.00098   0.00098, 0.00098
No Comb. - Genotype    0.999, 0.999         -                     0.03223, 0.00195   0.00098, 0.00293   0.00195, 0.00098
Feat. concat.          1, 1                 0.9756, 0.999         -                  0.4609, 0.8838     0.5391, 0.4229
GT + Hist. Pred.       1, 1                 1, 0.998              0.5771, 0.1377     -                  0.8389, 0.06543
Mean Pred.             1, 1                 0.999, 1              0.5, 0.6152        0.1875, 0.9473     -

Table 6: Pairwise significances (p-values, paired one-sided Wilcoxon-Mann-Whitney test) of AUC values for prediction via different combinations of historical and genotypic encoding. Each cell lists the p-values for logistic regression and random forests (LR, RF). Hypothesis: the AUC population of the combination in the row is less than that of the combination in the column. The predictions were validated using the same cross validation subsets.
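To make the kind of test behind Table 6 more tangible, the following Python sketch computes an exact one-sided p-value for paired AUC values. It assumes that the paired test corresponds to the Wilcoxon signed-rank test, which is the usual choice for paired samples. The fold-wise AUC values are invented for illustration and are not results of this thesis.

```python
from itertools import product


def signed_rank_p_less(x, y):
    """Exact one-sided p-value for the alternative 'x tends to be less than y'."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):  # tied absolute differences share their average rank
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null hypothesis all 2^n sign patterns are equally likely.
    hits = sum(
        1
        for signs in product((0, 1), repeat=len(ranks))
        if sum(r for r, s in zip(ranks, signs) if s) <= w_plus
    )
    return hits / 2 ** len(ranks)


# Hypothetical AUC values on ten shared cross validation subsets.
history = [0.70, 0.72, 0.69, 0.74, 0.71, 0.73, 0.75, 0.68, 0.72, 0.70]
combined = [0.77, 0.78, 0.76, 0.80, 0.79, 0.77, 0.81, 0.75, 0.78, 0.79]
p_less = signed_rank_p_less(history, combined)
p_reverse = signed_rank_p_less(combined, history)
```

If one method is better on all ten subsets, the exact p-value is 1/1024, i.e. about 0.00098, and the reverse hypothesis yields a p-value of 1. The smallest entries in Table 6 would be consistent with such a setting of ten paired values, although the number of subsets is an assumption in this sketch.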

3.4 Analysis of Therapy Success based on Single Compound Replacements

Unfortunately, after extraction of the necessary information, it became clear that for most possible replacements of compounds not enough different base combinations were present in the database. In other words, the transition-occurrence graph did not include a sufficient number of edges originating in a compound composition B + C_old with target node B + C_new for most pairs of compounds (C_old, C_new). Interestingly, this was not the case for self-loops and the addition or removal of compounds. Nonetheless, assessing the success of a replacement would rely too heavily on too few drug combination transitions. Therefore, a general conclusion for single compound replacements, which was the goal of this approach, could not be drawn from the data at hand.
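The notion of a single compound replacement B + C_old → B + C_new can be made explicit with a small Python sketch. Treating therapies as sets of compound abbreviations is an assumption for illustration, as are the function names and the toy transitions.

```python
from collections import defaultdict


def single_replacement(previous, current):
    """Return (c_old, c_new) if exactly one compound was swapped, else None."""
    prev, cur = frozenset(previous), frozenset(current)
    removed, added = prev - cur, cur - prev
    if len(removed) == 1 and len(added) == 1:
        return next(iter(removed)), next(iter(added))
    return None  # self-loop, addition, removal or multiple changes


def replacement_bases(transitions):
    """Collect the distinct base combinations B for every pair (c_old, c_new)."""
    bases = defaultdict(set)
    for previous, current in transitions:
        pair = single_replacement(previous, current)
        if pair is not None:
            bases[pair].add(frozenset(previous) & frozenset(current))
    return dict(bases)


# Hypothetical transitions between compound compositions.
toy = [
    ({"DDC", "ZDV"}, {"3TC", "ZDV"}),  # DDC replaced by 3TC, base ZDV
    ({"DDC", "D4T"}, {"3TC", "D4T"}),  # DDC replaced by 3TC, base D4T
    ({"ZDV"}, {"3TC", "ZDV"}),         # addition, no replacement
    ({"3TC", "ZDV"}, {"3TC", "ZDV"}),  # self-loop, no replacement
]
bases = replacement_bases(toy)
```

Counting the distinct base combinations per pair, here two for the replacement of DDC by 3TC, shows directly whether a pair is supported by enough different bases to allow a general statement.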

3.5 Consolidation of Genotype and Therapy History-based Methods

The performances of different ways of combining historical and genotypic inputs for logistic regression (historical encoding: binary second order variables) and random forests (continuous history representation) are shown in Table 5. As for the feature encodings, pairwise significances were computed as well (Table 6).

Historical prediction on the genotype-subset (3,910 therapies) performed quite similarly to prediction on the whole dataset in the case of logistic regression and slightly better for random forests (mean AUC: 0.7455 as opposed to 0.734 for the complete data). The genotype-based encoding performed decisively better with mean AUCs of 0.766 (logistic regression) and 0.7708 (random forests).

All three different modes of combining inputs boosted prediction accuracy significantly (Table 6). Thus, it is possible to enhance genotypic prediction by adding features for therapy history. Analysis of the logistic regression coefficients also supports this: for the feature concatenation, variables from both input sources were significant for prediction. In the second combination, the additional variable for the historical predicted probability of success was considered highly significant in all cross validation models, having a positive effect on therapy success.

Best performances were reached by adding the historical predicted probability of success for logistic regression (mean AUC: 0.7809) and by simply averaging the predicted therapy outcomes in the case of random forests (mean AUC: 0.7938). However, the choice of combination method does not seem that important: the differences in the AUCs were not statistically significant (i.e. the p-value of the Wilcoxon-Mann-Whitney test was larger than 0.05; Table 6). Even the simple approach of assessing therapy success via genotype and therapy history independently and averaging the results leads to a significantly enhanced prediction.
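The three modes of combination can be summarized in a short Python sketch, together with the AUC computed as a rank statistic. The function names, the feature vectors and all numbers are hypothetical. The sketch only illustrates the data flow and makes no statement about the models trained in this work.

```python
def concat_features(genotype_features, history_features):
    """Mode 1: one model is trained on the concatenated feature vector."""
    return list(genotype_features) + list(history_features)


def add_history_prediction(genotype_features, p_history):
    """Mode 2: the history model's predicted probability is one extra feature."""
    return list(genotype_features) + [p_history]


def mean_prediction(p_genotype, p_history):
    """Mode 3: average the two independently predicted probabilities."""
    return [(g + h) / 2 for g, h in zip(p_genotype, p_history)]


def auc(labels, scores):
    """Probability that a success is ranked above a failure (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes (1 = success) and predicted probabilities of success.
labels = [1, 1, 0, 0]
p_history = [0.8, 0.4, 0.5, 0.2]
p_genotype = [0.6, 0.7, 0.3, 0.65]
p_mean = mean_prediction(p_genotype, p_history)
```

In this constructed example both single predictors reach an AUC of 0.75, while the averaged prediction ranks all successes above all failures. Such a gain is not guaranteed in general; it depends on the two predictors making different errors.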

4 Conclusion

To conclude, it is possible to assess the success of antiretroviral therapy based on information on past treatment lines. Compounds from previous regimens show a significant influence on the effectiveness of the current drug combination. This is the case even if the past treatment was prescribed for only a short time period. Apparently, hidden developments in the viral population, such as hidden resistances, can be captured by taking therapy history into account. Therefore, for high prediction accuracy, it is important to have as much therapy history data available as possible. With increasing distance in time, therapies in the therapy history seem to lose at least some influence on current treatment. The corresponding devaluation of past therapies in the features should be neither too weak nor, in particular, too drastic. Particularly, the predecessor-regimen impacts the current therapy's outcome.

Although prediction accuracy is somewhat inferior to that of methods based on the viral genotype, the amount of data for therapy history prediction vastly exceeds the available genotypic data: especially in the developing countries, reliable genotype measurements are available only in a minority of cases.

In particular, the feature encodings combining both therapy history and current treatment (second order variables and therapy cluster transitions) lead to a respectable performance. Consolidating genotypic and historical methods yields a promising approach for increasing prediction accuracy. The method of combining both encodings seems not that important.

Therapy transition graphs offer an interpretable visual representation of common treatment lines with high information content. Filtering rare occurrences of treatments and transitions and combining therapies into clusters improves the clarity of the graphs. By analyzing different graph types, it is possible to extract knowledge and integrate it into promising encodings for enhancing the estimation of therapy success.
Many new encodings with high information content are imaginable.

The prevalence of HIV infection in the world is alarmingly high. Especially the developing countries are affected. However, complete viral eradication is not possible with current methods. The need for highly active antiretroviral therapies, which are specifically fitted to the particular patient, is highly urgent. Computational methods can help with assessing promising compound compositions, evading viral resistances. Knowledge of the patient's therapy history can help with assessing the effectiveness of a drug combination and thus with choosing an appropriate regimen.

References

[1] André Altmann et al. Improved prediction of response to antiretroviral combination therapy using the genetic barrier to drug resistance. Antiviral Therapy, 12:169-178, 2007.

[2] Christopher J. Wills. Topiary pruning of the HIV and SIV phylogenetic tree. AIDS Res. Hum. Retroviruses, 11:10707-10713, 1995.

[3] US National Institute of Health. How HIV causes AIDS, 2004. http://www.niaid.nih.gov/factsheets/howhiv.htm.

[4] Wikipedia. HIV - wikipedia, the free encyclopedia, 2008. http://en.wikipedia.org/wiki/HIV/.

[5] U.S. Centers of Disease Control and Prevention. 1993 revised classification system of HIV infection and expanded surveillance case definition for AIDS among adolescents and adults, 1992. MMWR.

[6] AVERT. AIDS & HIV information from the AIDS charity AVERT, 2007. http://www.avert.org/.

[7] Luc Perrin and Amalio Telenti. HIV treatment failure: Testing for HIV resistance in clinical practice. Science, 280:1871-1872, 1998.

[8] U.S. Food and Drug Administration. FDA HIV/AIDS program, 2007. http://www.fda.gov/oashi/aids/hiv.html.

[9] Niko Beerenwinkel. Computational Analysis of HIV Drug Resistance Data. PhD thesis, Universität des Saarlandes, 2003.

[10] Sophie Grabar et al. Clinical outcome of patients with HIV-1 infection according to immunologic and virologic response after 6 months of highly active antiretroviral therapy. Ann Intern Med, 133:401-410, 2000.

[11] Panel on Clinical Practices for Treatment of HIV Infection convened by the Department of Health, Human Services, and the Henry J. Kaiser Family Foundation. Guidelines for the use of antiretroviral agents in HIV-1-infected adults and adolescents, 2007. http://aidsinfo.nih.gov/guidelines.

[12] Brendan Larder et al. The development of artificial neural networks to predict virological response to combination HIV therapy. Antiviral Therapy, 12:15-24, 2007.

[13] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2001.

[14] Leo Breiman. Random forests. Machine Learning, 45:5-32, 2001.

[15] EuResist Integration of viral genomics with clinical data to predict response to anti HIV treatment. Home page, 2007. http://www.euresist.org/.

[16] MySQL AB. Mysql ab :: The world's most popular open source database, 2008. http://www.mysql.com/.

[17] Python Software Foundation. Python programming language - official website, 2008. http://www.python.org/.

[18] The R Project for Statistical Computing. R, 2008. http://www.r-project.org/.

[19] Cytoscape. Cytoscape: Analysing and visualizing network data, 2007. http://www.cytoscape.org/.

[20] Yassen Assenov et al. Computing topological parameters of biological networks. Bioinformatics, 24:282-284, 2007. Applications Note.