Research Collection

Doctoral Thesis

Network Perturbation Analysis for Inferring Compound Targets from Expression Profiles

Author(s): Noh, Heeju

Publication Date: 2018

Permanent Link: https://doi.org/10.3929/ethz-b-000254342

Rights / License: In Copyright - Non-Commercial Use Permitted


DISS. ETH NO. 24852

Network Perturbation Analysis for Inferring Compound Targets from Gene Expression Profiles

A thesis submitted to attain the degree of

DOCTOR OF SCIENCES of ETH ZURICH

(Dr. sc. ETH Zurich)

Presented by

HEEJU NOH

MSc Chemical and Bioengineering, ETH Zurich

born on 10.09.1988

citizen of South Korea

accepted on the recommendation of

Prof. Dr. Rudiyanto Gunawan, examiner

Prof. Dr. Massimo Morbidelli, co-examiner

2018


Acknowledgements

My PhD study reminds me a lot of the long-distance race I took part in five years ago: I had to keep running regardless of the moments when I felt down or was out of breath. Without the people who supported and helped me so much throughout my PhD study, I would not have been able to finish the longest run I have ever had. Foremost, my deep gratitude goes to Prof. Rudiyanto Gunawan for giving me the opportunity to work on a fascinating project on network perturbation analysis, which opened the way for my career in the life sciences. I appreciate his dedicated scientific guidance, his help with writing, and his great patience and support for my work. I am grateful that I was able to learn invaluable research skills under his direct supervision. I also appreciate having Prof. Massimo Morbidelli as co-examiner of my PhD work. Ever since I started my studies at ETH Zurich, I have felt grateful for and encouraged by his comments acknowledging my work. Regarding my research, I would also like to thank Prof. Jason E. Shoemaker at the University of Pittsburgh for sharing his expertise and providing guidance on the topic of influenza virus studies. In addition, I appreciate the time spent working with the master's students Moritz Benisch and Hua Ziyi. Their work provided me with good insight into method development and application. I especially thank Ziyi for her help in preparing R scripts of the developed methods. For IT support, I thank my colleagues, Dr. Minhaz Ud-dean and Nan Papili Gao, and the IT coordinators, Dr. Erol Dedeoglu and Dr. Mathias Egloff. I also acknowledge the support of an ETH research grant for funding my research. I would further like to thank my colleagues, Dr. Erica Manesso, Dr. Manuel Garcia-Albornoz, Dr. Minhaz Ud-dean, Dr. Yang Liu, Dr. Sandro Hutter, Nan Papili Gao, Sudharshan Ravi, and Nadia Sarait Vertti Quintero for their help and fruitful discussions during my study.
Special thanks go to my female colleagues, Erica, Nan, and Nadia, for creating a good atmosphere for me, and to Sandro for his help with my teaching assistant duties in the Chemistry laboratory course during the first two years. In addition, I would particularly like to thank Ms. Alexia Berchtold, Mr. Nikita Kobert, and Dr. Mathias Egloff for their warm words about my work, which gave me much strength in the difficult moments I encountered during my study. Needless to say, my deep thanks go to my parents and my brother for their everlasting support and for being the source of my strength. Many thanks also go to my friends for their great trust in me. Among them, I am very thankful to Jihye Suh for her tremendous support and warm encouragement throughout my study. I can never forget the deep friendship she has shown. Finally, a very special and important thank you goes to my true mentor as well as boyfriend, George Zerveas. Not only did he share his knowledge and wise counsel, but he also always stood by me and expressed an unfaltering faith in me. The time of my study gained unique value as a result.


Abstract

Identifying molecular targets of pharmacologically relevant compounds is of great importance in drug discovery and development, both for understanding the mechanism of action of drugs and for finding new applications of existing drugs (i.e. drug repurposing). Driven by advances in and wide usage of gene expression profiling, many computational strategies have previously been developed for inferring drug targets from data on the differential gene expression caused by a drug treatment. However, the accuracy of drug target prediction using existing methods strongly relies on either an extensive and accurate target annotation of some reference gene expression profiles or an accurate model of the gene regulatory network (GRN). In this PhD thesis, three novel network-based analytical methods for drug target inference from gene expression profiles are developed using mechanistic models of the gene transcriptional processes. Compared to state-of-the-art algorithms, the methods developed in this thesis provide more accurate predictions of drug targets and mechanisms of action. Besides drug target identification, the application of the same methods to elucidate the mechanism of action of diseases, specifically influenza A infection, is also demonstrated.

Several network filtering methods have previously been created to predict the gene targets of drugs from gene expression data based on an ordinary differential equation (ODE) model of the GRN under the pseudo steady state assumption. A critical step in these network filtering methods involves inferring the structure of the GRN from the gene expression data, a very challenging problem that is known to be underdetermined or ill-posed. In addition, existing network filtering methods require computationally intensive parameter tuning, gene expression data from experiments with known genetic perturbations, or both. For these reasons, a novel network filtering method called DeltaNet was developed, in which the drug targets are identified directly from the gene expression data without a separate GRN inference step (see Chapter 2). The inference of drug targets by DeltaNet involves solving an underdetermined linear regression problem using either least angle regression (LAR) or Least Absolute Shrinkage and Selection Operator (LASSO) regularization. The predictions of gene targets by DeltaNet for gene expression data of Escherichia coli, yeast, fruit fly and human were significantly more accurate than those by existing network filtering methods, such as mode of action by network identification (MNI) and sparse simultaneous equation model (SSEM). Furthermore, DeltaNet using LAR did not require any parameter tuning and could provide higher robustness and computational speed-up over existing methods.

Despite the efficacious and robust performance of DeltaNet, a direct application of steady state models to time series expression data often leads to poor target prediction accuracy. In Chapter 3, DeltaNet was extended for analyzing time-series transcriptional profiles by incorporating additional constraints based on the time derivatives of the gene expression data. The new method called DeltaNeTS employs LASSO regularization when the GRN structure is not available, or ridge regression when the GRN structure is available. In the case studies using yeast and in silico gene expression data, DeltaNeTS outperformed a time-series analysis method called Time Series Network Inference (TSNI) and the original DeltaNet, by having the greatest area under receiver operating characteristic curve (AUROC) and the greatest area under precision-recall curve (AUPR). Additionally, applying DeltaNeTS to Calu-3 human lung cancer cell gene expression data shed light on network perturbations caused by interferon and influenza A virus treatments.

In order to take advantage of extensive online databases of protein-protein and protein-DNA interactions, another strategy was developed based on a protein-gene regulatory model for inferring drug-induced protein-gene dysregulation from gene expression profiles (see Chapter 4). In the new method named ProTINA (Protein Target Inference by Network Analysis), a candidate protein target is scored based on the network dysregulation caused by drug treatments. More specifically, enhancement and attenuation of the transcriptional regulatory activity of transcription factors and their protein partners are considered. In the application to benchmark datasets from three drug treatment studies, ProTINA was able not only to provide highly accurate target predictions (average AUROC = 0.82, where 1 is the perfect score), but also to reveal the mechanism of action of compounds with high sensitivity and specificity. Moreover, the application of ProTINA to gene expression profiles of influenza A viral infection led to new insights into the early events of the infection.

Zusammenfassung (Summary)

Identifying the molecular targets of pharmacologically relevant compounds is particularly important in the discovery and development of drugs, both for understanding a drug's mechanism of action and for finding new applications of existing drugs (i.e. drug repurposing). Driven by advances in and the wide use of gene expression profiling, many computational strategies have already been developed to infer drug targets from data on the differential gene expression caused by a drug treatment. However, the accuracy of drug target prediction with existing methods relies heavily on either an extensive and accurate target annotation of some reference gene expression profiles or an accurate model of a gene regulatory network (GRN). In this dissertation, three new network-based analytical methods for inferring drug targets from gene expression profiles are developed using mechanistic models of the gene transcriptional processes. Compared with state-of-the-art algorithms, the methods developed in this work are able to provide more accurate predictions of drug targets and mechanisms of action. Besides the identification of drug targets, the application of the same methods to elucidate the mechanism of action of diseases, in particular influenza A infection, is also shown.

Several network filtering methods have previously been created to predict the gene targets of drugs from gene expression data, based on an ordinary differential equation (ODE) model of the GRN under the assumption of a pseudo steady state. A critical step in these network filtering methods involves inferring the structure of the GRN from the gene expression data, a very difficult problem that is known to be underdetermined or ill-posed. In addition, existing network filtering methods require computationally intensive parameter tuning, gene expression data from experiments with known genetic perturbations, or both. For these reasons, a new network filtering method called DeltaNet was developed, in which the drug targets are identified directly from the gene expression data, without a separate GRN inference step (see Chapter 2). The inference of drug targets by DeltaNet involves solving an underdetermined linear regression problem using least angle regression (LAR) or Least Absolute Shrinkage and Selection Operator (LASSO) regularization. The predictions of gene targets by DeltaNet for gene expression data of Escherichia coli, yeast, fruit fly and human were significantly more accurate than those of existing network filtering methods, such as mode of action by network identification (MNI) and sparse simultaneous equation model (SSEM). Furthermore, DeltaNet using LAR required no parameter tuning and could provide higher robustness and computational speed than existing methods.

Despite the efficient and robust performance of DeltaNet, a direct application of steady state models to time series expression data often leads to poor target prediction accuracy. In Chapter 3, DeltaNet was extended to analyze time-series transcriptional profiles by incorporating additional constraints based on the time derivatives of the gene expression data. The new method, called DeltaNeTS, uses LASSO regularization when the GRN structure is not available, or ridge regression when the GRN structure is available. In the case studies using yeast and in silico gene expression data, DeltaNeTS outperformed the time-series analysis method called Time Series Network Inference (TSNI) and the original DeltaNet, by having the greatest area under the receiver operating characteristic curve (AUROC) and the greatest area under the precision-recall curve (AUPR). The application of DeltaNeTS to data from human Calu-3 lung cancer cells also helped to elucidate network perturbations caused by interferon and influenza A virus treatments.

In order to make use of extensive online databases of protein-protein and protein-DNA interactions, a further strategy was developed, based on protein-gene regulatory models, to infer drug-induced protein-gene dysregulation from gene expression profiles (see Chapter 4). In the new method named ProTINA (Protein Target Inference by Network Analysis), a candidate protein target is scored based on the network dysregulation caused by drug treatments. More specifically, the enhancement and attenuation of the transcriptional regulatory activity of transcription factors and their protein partners are considered. In the application to benchmark datasets from three drug treatment studies, ProTINA was able not only to provide highly accurate target predictions (average AUROC = 0.82, close to the perfect score of 1), but also to reveal the mechanism of action of compounds with high sensitivity and specificity. In addition, the application of ProTINA to gene expression profiles of influenza A virus infection led to new insights into the early events of the infection.

Table of Contents

Acknowledgements
Abstract
Zusammenfassung
1. Introduction
  1.1. A statement of the problem
  1.2. Literature review
    1.2.1. Comparative analysis
    1.2.2. Network analysis: strategies based on statistical analysis
    1.2.3. Network analysis: strategies based on mechanistic models
  1.3. Research objectives
  1.4. Thesis outline
2. Inferring gene targets of drugs and chemical compounds from gene expression profiles
  2.1. Method and Materials
    2.1.1. The mathematical models and implementation of DeltaNet
    2.1.2. Implementation of MNI, SSEM and z-score methods
    2.1.3. Performance assessment
    2.1.4. Gene expression data
  2.2. Results
    2.2.1. Predicting network perturbations
    2.2.2. Predicting transcription factor targets of chemical compounds
  2.3. Discussion
  2.4. Summary
  2.5. Remark
3. Inferring causal gene targets from time course expression data
  3.1. Method and materials
    3.1.1. DeltaNeTS formulation
    3.1.2. DeltaNeTS implementation
    3.1.3. Gene regulatory network for human epithelial lung cancer cells
    3.1.4. Gene expression data
  3.2. Results
    3.2.1. Predicting constant perturbations under time-course gene expressions
    3.2.2. Predicting time-varying perturbations
  3.3. Discussion
  3.4. Summary
  3.5. Remark
4. Network perturbation analysis of gene transcriptional profiles reveals protein targets and mechanism of action of drugs and influenza A viral infection
  4.1. Method and Materials
    4.1.1. Protein target identification using ProTINA
      4.1.1.1. Protein-gene network
      4.1.1.2. Gene transcription model
      4.1.1.3. Protein target scoring
    4.1.2. Gene expression data
    4.1.3. DeMAND and differential expression analysis
    4.1.4. Gene set enrichment analysis
    4.1.5. Reference protein targets
  4.2. Results
    4.2.1. New protein target prediction strategy
    4.2.2. Prediction of known targets of drugs
    4.2.3. Mechanism of action of drugs
    4.2.4. Application of ProTINA for predicting pathogen-host interactions
  4.3. Discussion
  4.4. Summary
  4.5. Remarks
5. Conclusions
6. Outlook
REFERENCES
APPENDICES
  Appendix A: Inferring gene targets of drugs and chemical compounds from gene expression profiles
    A.1. Procedure of q-value calculation for predictions from DeltaNet and SSEM
  Appendix B: Inferring causal gene targets from time course expression data
  Appendix C: Network perturbation analysis of gene transcriptional profiles reveals protein targets and mechanism of action of drugs and influenza A viral infection
List of Figure and Table


1. Introduction

1.1. A statement of the problem

Identifying molecular targets of pharmacologically relevant compounds is a vital task in drug discovery and development. Knowledge of the targets of a drug is indispensable for identifying the drug's therapeutic efficacy and side effects, for understanding its mechanism of action (MoA), and for exploring new applications of the drug in treating other diseases (i.e., drug repurposing) [1]. While the definition of a target can be quite arbitrary, the term generally refers to a molecule whose interaction with the compound is connected to the compound’s effects [2].

Among existing technologies for drug target discovery (e.g., biochemical affinity purification, RNAi knockdown or gene knockout experiments) [3], gene expression profiling has received much recent attention due to its relative ease of implementation as well as the availability of large-scale public databases, well-established experimental protocols and data analytical methods. Gene expression profiles give the measurements of messenger RNA (mRNA) molecules for a set of genes, and are consequently used as indicators of gene transcriptional activities. The majority of gene expression profiles available in online databases were produced using the DNA microarray technology, first introduced in the mid-1990s. As the readouts of a DNA microarray chip are fluorescence intensities, the final results comprise relative or differential expressions of genes between two samples, for example the logarithmic fold change between treated and control samples. With the precipitous drop in the cost of next generation sequencing, RNA sequencing (RNA-Seq) has practically replaced the DNA microarray. While RNA-Seq is able to provide the copy numbers of mRNA, differential expression of genes is still a commonly generated output of gene expression profiling analysis.
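As a minimal sketch of the differential expression values described above, the logarithmic fold change between a treated and a control sample can be computed per gene as follows. The gene names, the expression levels and the optional `pseudocount` argument are hypothetical and serve only as an illustration.

```python
import math

def log2_fold_change(treated, control, pseudocount=0.0):
    """Return log2(treated / control) for every gene in `control`.

    `treated` and `control` map gene names to positive expression levels
    (e.g. normalized microarray intensities or RNA-Seq counts).
    `pseudocount` is a hypothetical guard against zero counts.
    """
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
    }

# Hypothetical expression levels for three genes (illustrative values only).
control = {"geneA": 100.0, "geneB": 250.0, "geneC": 80.0}
treated = {"geneA": 400.0, "geneB": 250.0, "geneC": 20.0}

log2fc = log2_fold_change(treated, control)
for gene, value in log2fc.items():
    print(f"{gene}: log2FC = {value:+.2f}")
```

A positive value indicates a higher mRNA level in the treated sample, a negative value a lower one, and zero no change.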

A complication when using gene expression profiles for target discovery is that the data give only indirect indications of the drug’s action. As illustrated in Figure 1-1, the interaction between a compound and its protein target(s) is expected to result in the differential expression of downstream genes that are directly or indirectly regulated by the protein target(s). However, the expression of the protein targets themselves may not, and often does not, change [4]. Consequently, drug target discovery using gene expression profiles requires computational methods to infer the (upstream) targets from the (downstream) effects.

Figure 1-1. An example of drug mechanism of action. In this example, the drug interacts with or binds to a protein, which subsequently causes the downstream genes that are regulated directly or indirectly by the protein to be differentially expressed. Transcription factors (TFs) are proteins that can bind to the regulatory regions of the DNA and regulate gene transcription. The non-directed edges indicate protein-protein interactions, while the directed edges indicate the regulation of gene expression (pointed arrows: gene activation, blunt arrows: gene repression).

In the present PhD thesis, I address the problem of inferring molecular targets of compounds from gene transcriptional profiles by developing several computational methods based on inferring drug-induced perturbations in the GRN. The inference of molecular targets from genome-scale gene expression profiles has many important applications, not only in drug development and repurposing, but also in human disease studies. The identification of GRN perturbations in diseases may lead to a better understanding of the disease mechanism, and eventually to the formulation of potential or better intervention and treatment. In the following section, the state-of-the-art methods for determining drug targets using gene expression data are reviewed.


1.2. Literature review

A number of computational algorithms have been developed to analyze gene expression profiles for drug target predictions. Briefly, there exist two main groups of strategies: comparative analysis and network analysis [5]. The subsections below describe the state of current methods belonging to each of the above categories.

1.2.1. Comparative analysis

Comparative analysis methods use the gene expression profiles as drug signatures. This strategy generally involves gathering a compendium of expression profiles from genetic perturbation experiments (knock-out/knock-down) and from chemical compound treatments with known mechanisms, followed by an association analysis of the expression profiles from drugs (e.g. using clustering, distance or connectivity score) [6-8]. Here, the similarity between the differential gene expression of a drug treatment and that of reference compounds or experiments with known targets implies a closeness in the molecular targets and the MoA (commonly referred to as “guilt by association”). A notable example of such an approach is the Connectivity Map (C-Map) [9], which provides gene expression profiles of human cell lines treated with ~5000 small molecule compounds as searchable signatures for evaluating drug-drug similarities [10]. However, the obvious drawback of comparative analysis methods is their dependence on an extensive and accurate target annotation of the reference gene expression profiles.
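The "guilt by association" idea can be illustrated with a small sketch that ranks annotated reference signatures by their similarity to a query signature. Pearson correlation is used here purely as a stand-in similarity measure; it is not the connectivity score of C-Map, and the reference names and log2FC values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long expression signatures."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def rank_references(query, references):
    """Sort reference signatures by decreasing similarity to the query."""
    scores = {name: pearson(query, sig) for name, sig in references.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical log2FC signatures over the same four genes.
references = {
    "ref_compound_1": [2.0, -1.0, 0.5, 0.0],
    "ref_compound_2": [-1.5, 1.2, -0.3, 0.1],
    "ref_knockout_X": [0.1, 0.0, 2.0, -2.0],
}
query = [1.8, -0.9, 0.7, -0.1]

ranking = rank_references(query, references)
for name, score in ranking:
    print(f"{name}: r = {score:+.2f}")
```

The targets annotated for the top-ranked references would then be proposed as candidate targets of the query compound, which is why the approach depends so strongly on the quality of the reference annotation.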

1.2.2. Network analysis: strategies based on statistical analysis

Network analysis methods rely on cellular networks, typically the GRNs, to identify the molecular targets of compounds using the gene expression profiles. Consequently, the performance of any network-based analysis depends on the fidelity of the network. One type of network analysis method is based on statistical tests or enrichment analysis. For example, a strategy called reverse causal reasoning uses literature-mined GRNs to generate a family of hypotheses of drug-induced differential gene expressions, which are subsequently scored against the measured gene expression profiles [11-13]. Another set of methods employs a transcription factor (TF) enrichment analysis of differentially expressed genes (DEGs) to identify TFs that are over-represented among the set of DEGs. Transcription factors are proteins that bind to DNA and regulate the transcriptional rate of genes. A subsequent upstream analysis is performed to search for proteins that are highly connected to the over-represented TFs in a signal transduction or protein-protein interaction network (PIN) [14-17]. Meanwhile, another method called Master Regulator Inference algorithm (MARINa) applies a gene set enrichment analysis using a transcriptional regulatory network to identify TFs whose regulons (i.e. the sets of genes regulated by the TFs) are enriched for differentially expressed genes [18].
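A TF enrichment analysis of the kind described above can be sketched with a one-sided hypergeometric test for over-representation of a TF's regulon among the DEGs. This is a generic illustration rather than the procedure of any of the cited methods; the gene identifiers, regulons and background set are hypothetical.

```python
from math import comb

def hypergeom_enrichment_pvalue(n_genes, n_degs, n_regulon, n_overlap):
    """P(X >= n_overlap) when n_degs genes are drawn without replacement
    from n_genes genes, of which n_regulon belong to the TF's regulon."""
    total = comb(n_genes, n_degs)
    upper = min(n_degs, n_regulon)
    tail = sum(
        comb(n_regulon, k) * comb(n_genes - n_regulon, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

def tf_enrichment(degs, regulons, background):
    """Rank TFs by the p-value of regulon over-representation among DEGs."""
    results = {}
    for tf, targets in regulons.items():
        targets = targets & background  # ignore targets that were not measured
        overlap = len(targets & degs)
        results[tf] = hypergeom_enrichment_pvalue(
            len(background), len(degs), len(targets), overlap
        )
    return sorted(results.items(), key=lambda item: item[1])

# Hypothetical example: 20 measured genes, 4 DEGs, two TF regulons.
background = {f"g{i}" for i in range(1, 21)}
degs = {"g1", "g2", "g3", "g4"}
regulons = {
    "TF_A": {"g1", "g2", "g3", "g10"},
    "TF_B": {"g5", "g6", "g7", "g8"},
}

ranked = tf_enrichment(degs, regulons, background)
for tf, pval in ranked:
    print(f"{tf}: p = {pval:.4f}")
```

In practice the p-values would also need a multiple-testing correction across all TFs, which is omitted here for brevity.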

More recent methods used a combination of different types of cellular networks. Notably, a method called Detecting Mechanism of Action by Network Dysregulation (DeMAND) combined the GRN and PIN information to create a molecular interaction network. The drug targets were scored based on drug-induced alterations in the joint gene expression distribution between two connected genes in the molecular interaction network [19]. In detail, DeMAND first calculates the statistical dysregulation score (p-value) for each interaction in the given network, based on the difference of the gene co-expression relationships in the control experiment and in a drug treatment. Then, an overall p-value for each protein target candidate is obtained by combining the dysregulation scores from all genes connected to that protein in the network. In DeMAND, the target candidates are ranked in increasing magnitude of the overall p-values (i.e. decreasing order of statistical significance). While DeMAND is able to integrate different biological networks in the analysis of gene expression, the method still has some limitations. For example, DeMAND could not be used to predict the mode of the drug’s effects (e.g. enhancement or attenuation).

1.2.3. Network analysis: strategies based on mechanistic models

A number of network-based analytical methods were previously developed based on dynamic models of the GRN to infer network perturbations caused by drug treatment [20-23]. Several notable examples from this class of network analysis include network identification by multiple regression (NIR) [20], MNI [21], and SSEM [22]. Briefly, these methods are based on an ODE model of the gene transcriptional process [24], as follows:

$$\frac{dr_k}{dt} = u_k \prod_{j=1}^{n} r_j^{a_{kj}} - d_k r_k \tag{1-1}$$

where $r_k$ denotes the mRNA concentration of gene k, $u_k$ and $d_k$ denote the mRNA transcription and degradation rate constants of gene k, respectively, $a_{kj}$ denotes the regulatory control of gene j on gene k, and n denotes the total number of genes. The sign of $a_{kj}$ describes the nature of the regulatory control, where a positive (negative) value represents activation (inhibition/repression). Meanwhile, the magnitude of $a_{kj}$ corresponds to the strength of the regulation. Using the pseudo steady state assumption, the concentration change of mRNA over time $dr_k/dt$ can be set to 0, which simplifies the model above into

$$r_k = \frac{u_k}{d_k} \prod_{j=1}^{n} r_j^{a_{kj}} = g_k \prod_{j=1}^{n} r_j^{a_{kj}} \tag{1-2}$$

where $g_k = u_k / d_k$ is the ratio between mRNA transcriptional and degradation rate constants.

One can rewrite the model above for gene expression ratios between the drug treatment and control samples, by dividing both sides of Eq. 1-2 by the mRNA level in the control experiment, as follows:

$$\frac{r_{ki}}{r_{kb_i}} = \frac{g_{ki}}{g_{kb_i}} \prod_{j=1}^{n} \left( \frac{r_{ji}}{r_{jb_i}} \right)^{a_{kj}} \tag{1-3}$$

where $r_{ki}$ and $r_{kb_i}$ denote the mRNA levels of gene k in treatment sample i and in the corresponding control experiment $b_i$, respectively. Taking the logarithm of both sides of Eq. 1-3 leads to the following linear expression:

$$c_{ki} = \sum_{j=1}^{n} a_{kj} c_{ji} + p_{ki} \tag{1-4}$$

where $c_{ki} = \log_2(r_{ki} / r_{kb_i})$ denotes the log2 fold change (log2FC) of the mRNA level of gene k and $p_{ki} = \log_2(g_{ki} / g_{kb_i})$ denotes the effects of treatment in sample i. Typically, a base-2 logarithm is employed in the analysis of gene expression data [25]. According to Eq. 1-4, the log2FC of gene transcript k in a given sample comes from the contribution of two factors: (a) changes in the expression of genes that regulate gene k, and (b) a direct perturbation on the effective transcription (i.e. the ratio between transcription and degradation) of gene k by the treatment. A positive (negative) perturbation variable $p_{ki}$ indicates that the effective transcription of gene k is increased (decreased) by the treatment. Note that the formulation of drug-induced perturbations in Eq. 1-4 implicitly assumes that the drug treatment affects only mRNA transcriptional and/or degradation rate constants without causing any changes in the GRN structure, i.e. $a_{kj}$ is unaffected.
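The step from the steady-state model (Eq. 1-2) to the linear log2FC relation (Eq. 1-4) can be checked numerically. The following sketch assumes a hypothetical three-gene network and hypothetical rate ratios; it solves Eq. 1-2 in log space for a control and a treated sample and then verifies Eq. 1-4.

```python
import numpy as np

# Hypothetical 3-gene network: A[k, j] is the control of gene j on gene k
# (zero diagonal, as in the model).
A = np.array([[0.0, 0.5, 0.0],
              [-0.4, 0.0, 0.3],
              [0.0, 0.6, 0.0]])

def steady_state(A, g):
    """Solve r_k = g_k * prod_j r_j**a_kj (Eq. 1-2) in log space:
    log2 r = log2 g + A log2 r, i.e. (I - A) log2 r = log2 g."""
    n = A.shape[0]
    log_r = np.linalg.solve(np.eye(n) - A, np.log2(g))
    return 2.0 ** log_r

# Hypothetical ratios g_k = u_k / d_k; the treatment raises g of gene 2 four-fold.
g_control = np.array([1.0, 2.0, 0.5])
g_treated = g_control * np.array([1.0, 4.0, 1.0])

r_control = steady_state(A, g_control)
r_treated = steady_state(A, g_treated)

# Residual of Eq. 1-2 for the treated sample (should be numerically zero).
residual_ss = r_treated - g_treated * np.prod(r_treated ** A, axis=1)

c = np.log2(r_treated / r_control)  # log2 fold changes c_ki
p = np.log2(g_treated / g_control)  # perturbations p_ki

print("log2FC c:        ", np.round(c, 3))
print("perturbation p:  ", np.round(p, 3))
print("Eq. 1-4 residual:", np.round(c - (A @ c + p), 12))
```

In this example only gene 2 is perturbed directly, yet all three genes show a non-zero log2FC, which illustrates why the fold changes alone do not identify the target.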

Rewriting Eq. 1-4 in matrix-vector format gives the following:

$$C = AC + P \tag{1-5}$$

where C is the n × m matrix of log2FC gene expression data of n transcripts from m samples, A is the n × n matrix of the coefficients $a_{kj}$ (with zero diagonal entries), and P is the n × m matrix of treatment effects or perturbations $p_{ki}$. In the existing algorithms mentioned above, the gene target predictions are obtained by first inferring the GRN (i.e. matrix A) using a compilation of gene expression data from the same species or cell line. The inferred GRN is subsequently used to compute the perturbations caused by drug treatments as follows:

$$P_f = C_f - \hat{A} C_f \tag{1-6}$$

where $\hat{A}$ is the GRN inferred from the training data set, and $C_f$ and $P_f$ are the log2FC and predicted perturbations from sample f of the drug treatment of interest. Here, the drug targets are ranked in decreasing magnitude of $P_f$, i.e. the larger the magnitude of the predicted perturbation $p_{ki}$, the more likely gene k is to be targeted by drug i.
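The scoring step of Eq. 1-6 can be sketched as below. The example assumes that the inferred GRN is exact, which is rarely the case in practice; the network, the gene names and the perturbation are hypothetical.

```python
import numpy as np

def rank_targets(A_hat, c_f, gene_names):
    """Compute P_f = C_f - A_hat C_f (Eq. 1-6) and rank genes by |p|."""
    p_f = c_f - A_hat @ c_f
    order = np.argsort(-np.abs(p_f))
    return [(gene_names[k], float(p_f[k])) for k in order]

gene_names = ["gene1", "gene2", "gene3"]

# Hypothetical GRN: gene1 activates gene2 weakly and gene3 strongly.
A_hat = np.array([[0.0, 0.0, 0.0],
                  [0.5, 0.0, 0.0],
                  [2.0, 0.0, 0.0]])

# Simulate the log2FC of a sample in which only gene1 is perturbed (p = 1),
# using C = AC + P, i.e. C = (I - A)^{-1} P.
p_true = np.array([1.0, 0.0, 0.0])
c_f = np.linalg.solve(np.eye(3) - A_hat, p_true)

ranked = rank_targets(A_hat, c_f, gene_names)
largest_fold_change = gene_names[int(np.argmax(np.abs(c_f)))]

print("log2FC:", c_f)
print("gene with largest |log2FC|:", largest_fold_change)
print("top-ranked target by |p|:  ", ranked[0])
```

Here the downstream gene3 has the largest fold change, while removing the network-mediated contribution points back to gene1 as the perturbed gene.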

The main difference between NIR, MNI, and SSEM is the manner of the GRN inference. For example, in NIR, the GRN was inferred using gene expression data with known perturbations, by deriving a linear least-squares solution to Eq. 1-5. NIR assumes a maximum number of regulators for each gene; this number is set below m so that Eq. 1-5 becomes overdetermined. Then, the GRN model giving the smallest error is selected among all possible combinations of regulatory inputs with fewer than the defined maximum number for each gene, which is computationally expensive for a typical genome-scale system (m>>1000). Meanwhile, MNI and SSEM do not require knowledge of the gene targets of each training perturbation. Therefore, data from varied treatment types, for example RNAi, gene mutations, and compounds, can be included in the training set for both methods. The two methods use different strategies to tackle the underdetermined GRN inference problem: a dimensional reduction by singular value decomposition in MNI and a LASSO regression [26] in SSEM. Briefly, in MNI, the GRN matrix A is inferred from Q (< m) principal components (PCs) of the log2FC matrix C, by solving Eq. 1-5 for A and P recursively starting from the initial guess P = 0. In SSEM, the perturbation P for the training set was set to 0, and the GRN was inferred from the matrix C by minimizing an L1-norm penalized least-squares problem.
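The L1-norm penalized inference used in SSEM can be sketched as a row-wise LASSO regression of Eq. 1-5 with P = 0. The code below is a generic coordinate-descent LASSO written for illustration; it is not the SSEM implementation, and the penalty value `lam` and the random training data are hypothetical.

```python
import numpy as np

def soft_threshold(x, lam):
    """Soft-thresholding operator induced by the L1 penalty."""
    return np.sign(x) * max(abs(x) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize 0.5 * ||y - X w||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n_features = X.shape[1]
    w = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0)
    residual = y - X @ w
    for _ in range(n_iter):
        for j in range(n_features):
            if col_sq[j] == 0.0:
                continue
            residual += X[:, j] * w[j]   # remove the contribution of feature j
            rho = X[:, j] @ residual     # correlation with the partial residual
            w[j] = soft_threshold(rho, lam) / col_sq[j]
            residual -= X[:, j] * w[j]   # add the updated contribution back
    return w

def infer_grn_lasso(C, lam):
    """Row-wise sparse fit of C = A C (P = 0), keeping the diagonal of A at zero.

    C is the n x m log2FC matrix; row k of A is obtained by regressing the
    log2FC of gene k on the log2FC of all other genes across the m samples.
    """
    n = C.shape[0]
    A_hat = np.zeros((n, n))
    for k in range(n):
        others = [j for j in range(n) if j != k]
        A_hat[k, others] = lasso_cd(C[others, :].T, C[k, :], lam)
    return A_hat

# Hypothetical training data: 5 genes, 4 samples (an underdetermined problem).
rng = np.random.default_rng(0)
C_train = rng.normal(size=(5, 4))
A_hat = infer_grn_lasso(C_train, lam=0.5)
print(np.round(A_hat, 3))
```

The penalty `lam` controls the sparsity of the inferred network and would normally be chosen by cross-validation, which is one source of the parameter tuning cost of such methods.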

For drug target identification using time series gene expression profiles, a related method called TSNI [23] was developed based on a dynamical model of the gene transcriptional process:

$$\frac{dr_k}{dt} = \sum_{j=1}^{n} a_{kj} r_j(t) + \sum_{s=1}^{n_p} p_{ks} u_s(t) \tag{1-7}$$

where $r_k$ again denotes the mRNA concentration of gene k, $a_{kj}$ describes the influence of gene j on gene k, $p_{ks}$ describes the effect of perturbation s on gene k, and $u_s(t)$ denotes the s-th perturbation at time t. TSNI further reformulates the linear dynamical system model above into a discrete time model:

$$\begin{aligned} R(t_{k+1}) &= A_d R(t_k) + P_d U(t_k) \\ &= \begin{bmatrix} A_d & P_d \end{bmatrix} \begin{bmatrix} R(t_k) \\ U(t_k) \end{bmatrix} \end{aligned} \tag{1-8}$$

where the subscript d denotes the discrete-time equivalent of the continuous-time matrices. Note that the inference of $A_d$ and $P_d$ based on the linear model above is underdetermined. TSNI implements a two-step strategy to resolve the underdetermined inference. In the first step, TSNI applies a data smoothing algorithm, specifically a cubic smoothing spline filter [27], to reduce data noise, followed by the use of piecewise cubic splines to generate artificial data with a higher number of time points. In the second step, TSNI employs a principal component analysis to reduce the problem dimensionality (by keeping only a small number of PCs). The estimated $A_d$ and $P_d$ are finally transformed back to the continuous form A and P using the bilinear transformation [28]. In the same manner as NIR, MNI and SSEM, the target candidates from TSNI can be ranked based on the magnitudes of the estimated coefficients in P.
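The stacked regression form of Eq. 1-8 can be illustrated on a small, well-determined synthetic example. The sketch below is not TSNI itself: it omits the spline smoothing, the interpolation and the PCA step, uses hypothetical matrices, and applies the standard bilinear (Tustin) map only to the state matrix. Whether TSNI uses exactly this form of the transformation is an assumption.

```python
import numpy as np

def fit_discrete_model(R, U):
    """Least-squares estimate of [A_d P_d] in
    R(t_{k+1}) = [A_d P_d] [R(t_k); U(t_k)]  (Eq. 1-8).

    R is the n x T matrix of expression time series and U the n_p x T
    matrix of perturbation inputs.
    """
    n = R.shape[0]
    Z = np.vstack([R[:, :-1], U[:, :-1]])  # stacked regressors at t_k
    Y = R[:, 1:]                           # responses at t_{k+1}
    H = Y @ np.linalg.pinv(Z)              # minimum-norm least-squares solution
    return H[:, :n], H[:, n:]

def bilinear_to_continuous(A_d, dt):
    """Standard bilinear (Tustin) map from a discrete to a continuous state matrix."""
    I = np.eye(A_d.shape[0])
    return (2.0 / dt) * (A_d - I) @ np.linalg.inv(A_d + I)

# Hypothetical 2-gene system with one constant (step) perturbation.
A_d_true = np.array([[0.8, 0.1],
                     [-0.2, 0.7]])
P_d_true = np.array([[0.5],
                     [0.0]])
T = 10
U = np.ones((1, T))
R = np.zeros((2, T))
R[:, 0] = [1.0, -0.5]
for k in range(T - 1):
    R[:, k + 1] = A_d_true @ R[:, k] + P_d_true @ U[:, k]

A_d_hat, P_d_hat = fit_discrete_model(R, U)
A_hat = bilinear_to_continuous(A_d_hat, dt=1.0)

print("estimated A_d:\n", np.round(A_d_hat, 6))
print("estimated P_d:\n", np.round(P_d_hat, 6))
```

With noise-free data and more time points than unknowns per gene, the least-squares fit recovers the discrete-time matrices; for genome-scale data with few time points the same problem is underdetermined, which motivates the interpolation and dimensionality reduction steps described above.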


1.3. Research objectives

As mentioned above, many of the existing network-based analysis methods involve a GRN inference step, which for a whole-genome network is known to be challenging as the inference problem is severely underdetermined [29-30]. Unfortunately, the prediction accuracy of these methods depends strongly on the result of the GRN inference, i.e. the fidelity of the inferred GRN. Furthermore, in the case of MNI and TSNI, the target predictions are often sensitive to the number of PCs used in the dimensionality reduction. MNI also requires gene expression data from experiments with known perturbations for parameter tuning, which may not always be available. Besides, the targets of drugs are often proteins rather than genes. Complicating the analysis further, the mRNA level of the drug targets is often unaffected by the drug treatment. Therefore, the targets inferred using the aforementioned network analysis methods, which are based on mechanistic modeling of the gene transcriptional process, would often not include such protein targets. A post-analysis of the resulting target predictions, for example using upstream analysis or TF enrichment analysis, may therefore be necessary.

The main goal of the present PhD thesis is to develop new strategies to overcome the limitations of the existing network analysis methods and to improve the accuracy of compound target predictions. More specifically, in this thesis I describe the development of three advanced computational network analysis algorithms, namely DeltaNet, DeltaNeTS and ProTINA. The algorithms are based on mechanistic mathematical modeling of the gene transcriptional response to a drug treatment and other types of perturbations. Dynamical information in time series gene expression profiles is also explicitly considered in two of the three methods in the thesis. The most advanced and powerful algorithm, ProTINA, takes further advantage of extensive online databases on protein-protein and protein-DNA interactions as prior information for defining constraints on the network structure.

1.4. Thesis outline

In Chapter 2, I describe the development of a single-step network analysis method called DeltaNet. In this strategy, target predictions are obtained directly from the data, while the GRN is only inferred implicitly. DeltaNet relies on least angle regression [31] and LASSO regularization [26] to tackle the curse of dimensionality, that is, when the underlying regression problem in DeltaNet becomes underdetermined and prone to overfitting [32]. The performance of DeltaNet in predicting gene targets was demonstrated using gene expression profiles from Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, and human breast cancer cells (MCF-7). Notably, DeltaNet outperformed MNI and SSEM, not only in the accuracy of the target predictions, but also in the computational speed.

In Chapter 3, I present an extension of DeltaNet, named DeltaNeTS, which is able to incorporate dynamical information of the gene transcriptional responses contained in time series gene expression data. I demonstrate that the pseudo steady state assumption employed in DeltaNet and other existing network analysis methods could cause a previously unreported complication when inferring the GRN and network perturbations using time series data. For this reason, DeltaNeTS incorporates an additional network constraint that accounts for the dynamics of the gene transcriptional responses. DeltaNeTS tackles the curse of dimensionality in the perturbation inference problem using LASSO regularization [26] when the GRN structure is unavailable, or using ridge regression [33] when the GRN structure is available. The advantages of DeltaNeTS over DeltaNet and TSNI were demonstrated in the application to time series gene expression datasets from in silico GRN simulations and microarray data of yeast and cultured human lung cancer cells (Calu-3). In addition, DeltaNeTS was applied to H1N1- and H5N1-type influenza A viral infection studies to elucidate the differences between the two viruses using prior information of the GRN.

Finally, in Chapter 4, the DeltaNeTS method was further adapted for identifying protein targets. The new method, called ProTINA, computes protein perturbation scores based on a tissue or cell type-specific protein-gene regulatory network (PGRN) model. ProTINA leverages the availability of comprehensive maps of protein-protein and TF-DNA interactions for the construction of the PGRN graph, and employs ridge regression to infer the protein-gene regulatory activity from differential gene expression profiles. The target scores are based on the dysregulations of the protein-gene network, specifically the enhancement and attenuation of the gene regulatory activity of TFs and their protein partners. The superiority of ProTINA over the state-of-the-art method DeMAND and differential gene expression analysis (DE) was demonstrated by applying these strategies to three drug treatment studies. In addition to compound target identification, ProTINA was further applied to the influenza A viral infection study for elucidating the targets of influenza A viral proteins.

CHAPTER2: DELTANET | 11

2. Inferring gene targets of drugs and chemical compounds from gene expression profiles

As mentioned in Chapter 1, network analysis methods for drug target prediction using gene expression profiles often require inferring a model of the GRN from the expression data. The accuracy of the target predictions depends on the fidelity of the inferred GRN model. As the GRN inference has been shown to be underdetermined, there is unfortunately much uncertainty in the resulting model. In order to alleviate this issue, a single-step method called DeltaNet is developed in this chapter. DeltaNet infers the gene targets of compounds directly from gene expression data, without requiring an explicit GRN inference step. DeltaNet uses LAR [31] and LASSO regularization [26] to tackle the curse of dimensionality of the underlying underdetermined inference problem. In the following, the performance of DeltaNet is demonstrated by applying the method to the analysis of gene expression data from Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, and the MCF-7 human breast cancer cell line.

2.1. Method and Materials

2.1.1. The mathematical models and implementation of DeltaNet

The formulation of DeltaNet starts with Eq. 1-5. While MNI and SSEM perform a separate step of inferring the network A from a training gene expression dataset, DeltaNet obtains the target prediction in a single step by solving for the matrices A and P simultaneously based on the following equation:

\[
C^T = \begin{bmatrix} C^T & I_m \end{bmatrix} \begin{bmatrix} A^T \\ P^T \end{bmatrix} \tag{2-1}
\]

where $I_m$ is the $m \times m$ identity matrix. Since the dimension of the unknowns is larger than the number of samples, the regression problem above is underdetermined. Here, two different strategies for solving Eq. 2-1 were employed. The first involved LAR, a particularly efficacious model variable selection algorithm for low-sample high-dimensional data [31]. The other strategy was LASSO regularization by constraining the L1-norm of the solution [26].

Simply, Eq. 2-1 can be seen as a general linear regression problem:

\[
Y = XB \tag{2-2}
\]

where $X = [\, C^T \;\; I_m \,]$, $Y = C^T$ and $B = [\, A \;\; P \,]^T$. The columns of X and Y are further centered to have zero mean, while those of X are also normalized to have a unit Euclidean norm. The matrix B can be solved one column at a time, i.e. the matrices A and P are obtained one gene at a time. Thereby, DeltaNet involves solving multiple independent linear regression problems of the type $y_k = X\beta_k$, where $y_k$ and $\beta_k$ are the k-th columns of Y and B, and this can be easily parallelized for computational speed-up. DeltaNet further enforces the assumption of no self-regulatory loop (i.e. $a_{kk} = 0$), and thus the k-th row of the data matrix C, corresponding to gene k, is set to zero when solving for $\beta_k$. While this assumption may appear limiting, the case studies below show that DeltaNet is able to accurately predict network perturbations across different species. The matrix A, if desired, can be computed by rescaling the appropriate submatrix of B. Meanwhile, the matrix P is taken from the solution for B without rescaling.
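The construction of the per-gene regression problem in Eq. 2-2 can be sketched as follows. This is a minimal pure-Python illustration, not the DeltaNet MATLAB code: the function name `build_regression` and the nested-list data layout (C stored as genes by samples) are assumptions made here, and every column of X, including the identity block, is centered and scaled literally as stated in the text.

```python
import math

def build_regression(C, k):
    """Return (X, y) for gene k from an n-by-m log2FC matrix C (genes x samples).

    Hypothetical helper: X = [C^T  I_m] and y is the k-th column of Y = C^T.
    """
    n, m = len(C), len(C[0])
    # One row per sample: n gene columns followed by m identity columns.
    X = [[C[g][s] for g in range(n)] + [1.0 if t == s else 0.0 for t in range(m)]
         for s in range(m)]
    y = [C[k][s] for s in range(m)]
    # No self-regulation (a_kk = 0): zero the predictor column of gene k.
    for s in range(m):
        X[s][k] = 0.0
    # Centre y to zero mean.
    y_mean = sum(y) / m
    y = [v - y_mean for v in y]
    # Centre each column of X and scale it to unit Euclidean norm.
    for j in range(n + m):
        col = [X[s][j] for s in range(m)]
        mu = sum(col) / m
        col = [v - mu for v in col]
        norm = math.sqrt(sum(v * v for v in col))
        for s in range(m):
            X[s][j] = col[s] / norm if norm > 0 else 0.0
    return X, y
```

Because each gene k has its own (X, y) pair, the n regressions are independent and could be dispatched to separate workers.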

Two versions of DeltaNet are available: DeltaNet-LAR and DeltaNet-LASSO. As the name suggests, DeltaNet-LAR uses the LAR algorithm to solve the underdetermined regression problem above. LAR is an algorithm developed for creating sparse linear models [31]. Like the traditional forward selection method, LAR starts with a zero vector as the initial solution (i.e. no active variables), and adds a new predictor variable (i.e. an active variable) at every step. LAR employs a less greedy algorithm than the forward selection method in calculating the coefficients of the active variables. Briefly, in the first iteration, the predictor that correlates most with the data (i.e. one that forms the least angle with the residual vector) is chosen and is added to the active set. The solution is updated along the direction of equal angles with respect to all variables in the active set, until the residuals become equally correlated with another predictor which is outside the active set. In the next iteration, this predictor is added to the active set, and the process is repeated until completion or until a desired number of active variables is reached.

The current implementation of DeltaNet-LAR uses the MATLAB toolbox SpaSM (Sparse Statistical Modeling) [34]. In a typical scenario, LAR terminates after m or fewer steps, since the number of samples m is far smaller than the number of genes in the dataset. The output of LAR consists of a series of solution vectors $\beta_k^i$, $i = 1, 2, \ldots, I$, where I is the total number of steps. In DeltaNet-LAR, the steps are carried out until the relative norm error $\| y_k - X\beta_k^i \| / \| y_k \|$ falls below a user-defined stopping criterion δr between 0 and 1. Setting a higher δr leads to fewer steps taken in LAR and thus fewer non-zero coefficients in the solution vector βk. The case studies below show that the accuracy of DeltaNet predictions does not depend strongly on δr in the range 1% ≤ δr ≤ 10%. A higher δr has the benefit of reducing computational time at the cost of slightly reduced prediction accuracy (see Results).
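The stopping rule can be illustrated with a short sketch that scans a sequence of solution vectors and returns the first one whose relative norm error falls below δr. The sketch assumes the Euclidean norm and takes the path of solutions as given; it does not implement LAR itself, and the function names are hypothetical.

```python
import math

def relative_error(X, y, beta):
    """Relative norm error ||y - X beta|| / ||y|| (Euclidean norm assumed)."""
    resid = [yi - sum(xij * bj for xij, bj in zip(row, beta))
             for row, yi in zip(X, y)]
    return math.sqrt(sum(r * r for r in resid)) / math.sqrt(sum(v * v for v in y))

def first_step_below(X, y, path, delta_r):
    """Return (step, beta) for the first solution on the path with error < delta_r.

    Steps are numbered from 1; if no step qualifies, the last step is returned,
    which corresponds to running the path to completion.
    """
    for i, beta in enumerate(path, start=1):
        if relative_error(X, y, beta) < delta_r:
            return i, beta
    return len(path), path[-1]
```

With a larger `delta_r` the scan stops at an earlier, sparser solution, which mirrors the trade-off between computational time and accuracy described above.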

For DeltaNet-LASSO, the LASSO algorithm from GLMNET [35] is used to solve the following penalized minimization problem:

\[
\min_{\beta_k} \; \| y_k - X\beta_k \|_2 \quad \text{subject to} \quad \| a_k^T \|_1 \le T
\]

where $a_k$ is the k-th row vector of the A matrix. Briefly, GLMNET uses a cyclical coordinate descent algorithm, which successively minimizes the objective function one parameter at a time and cycles over the parameters until convergence. GLMNET generates a regularization path for the above problem with much shorter computational time than the LASSO algorithm [31]. The MATLAB subroutines for GLMNET were obtained from the authors' website, http://www.stanford.edu/~hastie/glmnet_Matlab/.
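To make the idea of cyclical coordinate descent concrete, the sketch below solves a LASSO problem in the penalized (Lagrangian) form, minimizing (1/(2n))·||y − Xβ||² + λ·||β||₁, for a single value of λ. This is an illustrative pure-Python sketch and not the GLMNET code: the Lagrangian form replaces the constrained form with bound T shown above, the penalty here applies to all coefficients rather than only to the a_k part, and the function names are hypothetical.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_coordinate_descent(X, y, lam, max_sweeps=1000, tol=1e-10):
    """Minimize (1/(2n)) * ||y - X beta||^2 + lam * ||beta||_1 by cyclic updates."""
    n, p = len(y), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta for the initial beta = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(max_sweeps):
        largest_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero predictor keeps a zero coefficient
            # Correlation of predictor j with the partial residual (beta_j removed).
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, n * lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
                largest_change = max(largest_change, abs(delta))
        if largest_change < tol:
            break  # no coefficient moved during this sweep
    return beta
```

Running the routine over a decreasing grid of `lam` values, each time starting from the previous solution, would give a regularization path of the kind mentioned above.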

In DeltaNet-LASSO, a k-fold cross validation (CV) is used to determine the optimal T value. In detail, the data are divided into k parts of equal sample size during CV. For each CV trial, k – 1 parts are assigned as the training set and used for generating the regularization path by GLMNET. Then, the remaining part is used as a test set for evaluating the prediction errors as a function of T, using ak from the obtained regularization path. Here, the optimal T corresponds to the minimum average test error over the k CV trials. In the case studies below, a 10-fold cross validation was implemented for the analyses using DeltaNet-LASSO.
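The selection of T by cross validation can be sketched as below. The sketch only covers the bookkeeping: how samples are split into k folds and how the value of T with the minimum average test error is picked. The per-fold test errors are taken as given, the folds are consecutive rather than shuffled, and the function names are hypothetical.

```python
def kfold_indices(n_samples, k):
    """Split sample indices 0..n_samples-1 into k nearly equal consecutive folds."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for f in range(k):
        size = base + (1 if f < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def select_T(T_grid, fold_errors):
    """Pick the T with the minimum test error averaged over the CV trials.

    fold_errors[f][g] is the test error of fold f at T_grid[g].
    """
    k = len(fold_errors)
    mean_err = [sum(fold_errors[f][g] for f in range(k)) / k
                for g in range(len(T_grid))]
    best = min(range(len(T_grid)), key=mean_err.__getitem__)
    return T_grid[best], mean_err
```

For each fold, the remaining k − 1 folds would be used to fit the regularization path, and the held-out fold to fill one row of `fold_errors`.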

2.1.2. Implementation of MNI, SSEM and z-score methods

The MATLAB subroutines for MNI were downloaded from http://dibernardo.tigem.it/softwares/mode-of-action-by-network-inference-mni. To determine the set of optimal parameters Q and thP for each data set, a grid search optimization was used. Here, the optimal values were chosen as the parameter set giving the minimum average rank error for samples with known perturbations over 5-fold CV. Meanwhile, the parameter KEEPFRAC was set such that more than 200 genes are retained after the last round of tournament (KEEPFRAC = 0.37 for E. coli, 0.35 for yeast and 0.33 for Drosophila) following the published protocol [36]. The remaining parameters (NROUNDS and ITER) were set to their default values. Finally, instead of the sample standard deviation, a unit standard deviation was used as an input because this setting gave better predictions.

For the SSEM implementation, a two-step procedure was conducted as described in Ref. [22]. First, Eq. 1-5 was solved using LASSO regression with 10-fold CV. Then, the perturbation matrix P was calculated as the residuals of the linear least squares regression, as described in Eq. 1-6. Similar to DeltaNet, LASSO regression was implemented using the MATLAB GLMNET package, where the optimal parameter T was chosen as the one giving the minimal prediction error over 10-fold CV. Also, the same input data as for DeltaNet were used for the network inference (i.e. the inference of the matrix A) in SSEM.

Lastly, z-scores were computed according to:

\[
z_{ki} = \frac{c_{ki} - \bar{c}_k}{\sigma_k} \tag{2-3}
\]

where $c_{ki}$ is the log2FC for gene transcript k in sample i, and $\bar{c}_k$ and $\sigma_k$ are the average and standard deviation of transcript k across all samples, respectively.
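A minimal sketch of Eq. 2-3 is given below. It assumes that the standard deviation is the sample standard deviation (n − 1 denominator), which the text does not specify, and that no transcript is constant across samples. The function name is hypothetical.

```python
import statistics

def z_scores(C):
    """Row-wise z-scores of a log2FC matrix C (genes x samples).

    Assumes the sample standard deviation and non-constant rows.
    """
    Z = []
    for row in C:
        mu = statistics.mean(row)
        sigma = statistics.stdev(row)
        Z.append([(c - mu) / sigma for c in row])
    return Z
```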

2.1.3. Performance assessment

For comparing the performance of different methods, the AUROC and AUPR were computed, i.e. the areas under the plot of true positive rate against false positive rate and the plot of precision against recall (= true positive rates) respectively, following the procedure adopted in DREAM challenges [37-38]. In the AUROC and AUPR calculation, a gene list in order of decreasing magnitudes of non-zero pki in each sample was used for DeltaNet, SSEM, and MNI, and the gene list ranked by the magnitude of z-scores was used for the z-score method.
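The two metrics can be sketched for a single ranked gene list as follows. The AUROC is computed with the trapezoidal rule, while the AUPR is approximated here by the average precision, which is an assumption of this sketch; the DREAM evaluation procedure [37-38] may interpolate the precision-recall curve differently. The function name is hypothetical, and the list is assumed to contain at least one true target and one non-target.

```python
def auroc_aupr(ranked_genes, true_targets):
    """Return (AUROC, average precision) for one ranked list of genes.

    ranked_genes is ordered from the most to the least confident prediction.
    """
    P = sum(1 for g in ranked_genes if g in true_targets)
    N = len(ranked_genes) - P
    tp = fp = 0
    auroc = 0.0
    ap = 0.0
    prev_fpr = prev_tpr = 0.0
    for g in ranked_genes:
        if g in true_targets:
            tp += 1
            ap += tp / (tp + fp)  # precision at this recall level
        else:
            fp += 1
        tpr, fpr = tp / P, fp / N
        auroc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0  # trapezoid area
        prev_fpr, prev_tpr = fpr, tpr
    return auroc, ap / P
```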

2.1.4. Gene expression data

For the evaluation of DeltaNet performance, four different microarray data sets were compiled from public databases. During data collection, care was taken to compile the data from the same cell line as much as possible, in order to avoid confounding effects from differences in regulatory pathways among dissimilar tissues and cell types.

For E. coli, the microarray data ‘E_coli_v4_Build_3_chips524probes4217.tab’, normalized by the robust multi-array average (RMA) method, were obtained from the Many Microbe Microarrays Database (M3D, as of 29th October 2007) (http://m3d.mssm.edu) [39]. As summarized in Figure 2-1, the data comprised 4217 genes and 524 samples, with 319 samples from gene perturbation experiments, 12 samples from chemical treatments, 55 samples from wild-type control experiments and 138 samples from other conditions (e.g., different growth phases, nutrient feeding strategies). The log2FC expression data were computed by subtracting the average of the wild-type control experiments from the log2 RMA intensity data.
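The log2FC computation described here amounts to a per-gene subtraction of the control average, as in the following sketch. The function name and the representation of controls as a list of column indices are assumptions for illustration.

```python
def log2fc(log2_intensity, control_cols):
    """Subtract, for each gene, the mean log2 intensity over the control samples.

    log2_intensity is a genes x samples matrix of log2 RMA intensities;
    control_cols lists the column indices of the wild-type control samples.
    """
    out = []
    for row in log2_intensity:
        ref = sum(row[j] for j in control_cols) / len(control_cols)
        out.append([v - ref for v in row])
    return out
```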

For S. cerevisiae, raw microarray data from the Affymetrix GeneChip Yeast Genome S98 array platform were compiled from ArrayExpress [40] and Gene Expression Omnibus (GEO) [41]. As shown in Figure 2-1, the compiled yeast data set consisted of 566 samples, among which 383 samples were from gene perturbation experiments, 36 samples from chemical treatments, 140 samples from wild-type control experiments, and 7 samples from other conditions. The raw data were first RMA-normalized using the justRMA function in the affy package of Bioconductor [42], which provided log2 normalized intensities. Then, the log2FC expression data were again calculated by subtracting the average of all wild-type control samples from the log2 RMA intensity. In this data set, only the 5117 probe sets, out of the initial 9335, that mapped to official gene symbols in the ygs98.db package (Saccharomyces Genome database as of 9th March 2014) in Bioconductor were selected.

Figure 2-1. Workflow of gene target prediction using DeltaNet. The performance of DeltaNet in predicting known gene perturbations was evaluated using gene expression data of E. coli, S. cerevisiae and D. melanogaster.

For multicellular organisms, like Drosophila and human, the gene expression data should ideally come from the same cell lineage, as the GRN can vary across cell lines. Therefore, for D. melanogaster, 80% of the microarray samples were compiled from experiments on Schneider 2 (S2) cells, and the remaining 20% were from whole-body homogenates. In total, 330 microarray samples from the Affymetrix GeneChip Drosophila Genome 2.0 array platform were obtained from 5 different studies involving gene knock-down (KD) and overexpression experiments in ArrayExpress and GEO. In particular, 242 samples came from genetic perturbations and 88 samples were from wild-type control experiments. The raw data were again pre-processed using justRMA to give log2 normalized intensities. Then, log2FC was calculated by subtracting the control data for each publication separately, because of an experimental bias in the measurements among the publications. Among the probe sets mapped to the GenBank accession numbers in drosophila2.db (Gene database as of 17th March 2015) in Bioconductor, only the 5879 genes showing significant differential expression (log2FC ≥ 1) were selected to reduce computational complexity. In the case of multiple probe sets mapped to the same gene symbol, median log2FC values were taken.

For the same reason as for Drosophila, gene expression data for human were also collected from a single cell line, namely MCF-7 human breast cancer cells. First, 2537 microarray samples from the Affymetrix GeneChip HT U133 array platform were compiled from C-Map version 2 [7], and the data were pre-processed using justRMA. Log2FC values were calculated using mean-centering within batches as recommended in a previous study [43], and the probe sets were mapped to the GenBank accession numbers in hthgu133a.db (Entrez Gene database as of 17th March 2015) in Bioconductor. Again, only the 3153 genes showing significant differential expression (|median log2FC| ≥ 1) in at least one of the samples were chosen to reduce computational complexity. Besides, among the total 2537 samples, only the samples from experiments that were replicated at least three times were used for data analysis. Therefore, the final MCF-7 data set included 569 samples from 142 different compound treatment experiments.

2.2. Results

2.2.1. Predicting network perturbations

In the first application, the performance of DeltaNet was compared with MNI, SSEM and z-scores in predicting the targets of gene perturbation experiments in E. coli, yeast, and Drosophila datasets. The experiments involved known gene knockouts (KO), over-expression and mutations, which provided the gold standard data for comparing the methods. For this case study, a threshold criterion δr = 10% was used for DeltaNet-LAR.

The test samples of E. coli came from 85 experiments with known perturbations, while the test samples of yeast comprised 109 experiments. For Drosophila, the test set came from the study of cell cycle regulators using S2 cells [44], comprising 91 different perturbation experiments. Figure 2-1 gives the numbers of combined gene targets in the E. coli, yeast, and Drosophila test samples, which were slightly higher than the numbers of samples since a few experiments involved more than one gene perturbation. Except for MNI, each replicate sample was first analyzed separately, and then the median value in P over the replicate samples was taken as the final prediction. In the analysis using MNI, average gene expression values over replicates were used as the input data, as described in the MNI protocol [36].

For each method and each sample from the test data sets, genes were ranked in decreasing magnitudes of the perturbation variables pki. The ranking reflects the confidence level that a gene is directly perturbed in the corresponding experiment, while the sign of pki indicates the nature of the perturbation. In evaluating the performance of the methods, except for MNI, a true positive (TP) requires not only a high confidence prediction (i.e. large magnitude in pki), but also the correct sign of perturbations (a positive sign for gene induction and a negative sign for gene repression). As MNI did not provide the sign of the perturbations, only the gene ranking was compared in evaluating its performance.
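The ranking and sign criteria can be sketched as follows, together with a true positive rate defined as the fraction of known perturbations recovered within the top r ranks. The data layout (one perturbation vector per sample, and known perturbations given as (sample, gene, sign) triples) and the function names are assumptions made for illustration.

```python
def rank_genes(p_column):
    """Return gene indices sorted by decreasing |p_ki|, skipping zero entries."""
    nonzero = [g for g, p in enumerate(p_column) if p != 0.0]
    return sorted(nonzero, key=lambda g: -abs(p_column[g]))

def tpr_at_rank(P_cols, known, r, check_sign=True):
    """Fraction of known perturbations found within the top r ranked genes.

    P_cols[i] is the perturbation vector of sample i; known is a list of
    (sample, gene, sign) triples with sign +1 for induction, -1 for repression.
    Setting check_sign=False compares rankings only, as done for MNI.
    """
    hits = 0
    for i, gene, sign in known:
        top = rank_genes(P_cols[i])[:r]
        if gene in top and (not check_sign or P_cols[i][gene] * sign > 0):
            hits += 1
    return hits / len(known)
```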

Figure 2-2 compares the true positive rate (TPR) as a function of the gene rank according to DeltaNet, SSEM, MNI and z-scores. The TPR was computed as the fraction of the known gene perturbations that appear above a given rank (shown on the x-axis). As shown in Figure 2-2, DeltaNet significantly outperformed SSEM, MNI and z-scores for all three test datasets. DeltaNet-LAR and DeltaNet-LASSO gave similar TPRs. For the top 10 genes, the TPR of DeltaNet was on average 14%, and up to 19% (for D. melanogaster), higher than that of the next best method, SSEM. MNI gave the worst TPRs among the methods considered, which could be caused by suboptimal tuning of the parameters. The need to optimize the method parameters for different datasets is a known drawback of MNI, since the tuning of these parameters can be computationally demanding [22].

Figure 2-2. True positive rates of gene target predictions from DeltaNet, SSEM, MNI, and z-scores. The results of DeltaNet-LAR came from analyses using a threshold δr = 10%.

As shown in Table 2-1, DeltaNet-LAR finished faster than DeltaNet-LASSO and SSEM. The computational time of DeltaNet-LAR decreased with increasing δr, as expected. Meanwhile, the TPR of DeltaNet-LAR did not vary significantly for δr between 1% and 10% (see Appendix Figure A1). DeltaNet-LASSO and SSEM had similar computational times since both methods used the same LASSO regularization with 10-fold CV. If the optimal method parameters were known beforehand, then MNI finished quicker than DeltaNet and SSEM. However, as mentioned above, the parameter tuning could lead to a high total computational requirement (see Table 2-1). Finally, Table 2-2 gives the AUROC and AUPR for each method, which further confirm the higher accuracy of DeltaNet predictions over those from the other strategies.


Table 2-1. Computational times of DeltaNet, SSEM and MNI.

                                     Computational times¹ (hours)
Method            Setting            E. coli    Yeast    Drosophila
DeltaNet-LAR      δr = 20%           4.34       9.6      5.7
                  δr = 10%           9.77       18.8     9.2
                  δr = 5%            13.76      24.6     11.4
                  δr = 1%            17.20      29.1     12.9
                  completion         17.90      29.9     13.2
DeltaNet-LASSO                       30.5       43.8     42.9
SSEM                                 33.8       48.6     43.1
MNI               Single run         0.16       0.19     0.14
                  Parameter tuning²  15.58      15.55    11.83

¹ Computational times were determined based on a single CPU run on a workstation with an AMD Opteron 6282 SE processor and 256 GB RAM.
² The parameter tuning for E. coli, yeast, and Drosophila was performed by a grid search using 99, 96 and 89 grid points, respectively.

Table 2-2. AUROC and AUPR of DeltaNet, SSEM, MNI and z-scores.

Data set      Method            AUROC    AUPR
E. coli       DeltaNet-LAR¹     0.951    0.694
              DeltaNet-LASSO    0.942    0.717
              SSEM              0.921    0.558
              MNI               0.906    0.252
              Z-scores          0.860    0.262
Yeast         DeltaNet-LAR¹     0.890    0.432
              DeltaNet-LASSO    0.903    0.402
              SSEM              0.893    0.360
              MNI               0.876    0.085
              Z-scores          0.897    0.233
Drosophila    DeltaNet-LAR¹     0.966    0.534
              DeltaNet-LASSO    0.957    0.527
              SSEM              0.882    0.352
              MNI               0.846    0.224
              Z-scores          0.950    0.243

¹ The DeltaNet-LAR result was obtained using δr = 10%.

2.2.2. Predicting transcription factor targets of chemical compounds

In the second application, the predicted gene targets were used to identify TFs which interact with drugs or chemical compounds in the yeast and human MCF-7 data sets. For the yeast data, DeltaNet, SSEM, MNI and z-scores were applied to rank gene targets. For DeltaNet-LAR, δr = 1% was used in order to keep enough non-zero pki coefficients. For each chemical treatment sample, a TF enrichment analysis was performed on the top 100 genes using YEASTRACT [45], and TFs were ranked in increasing order of the adjusted p-values calculated from the enrichment analysis (i.e., TFs with lower p-values are ranked higher).

For evaluating the accuracy of the TF prediction, reference TF targets were retrieved from the Search Tool for Interactions of Chemicals (STITCH) (http://stitch.embl.de) [46], a public database of chemical-protein interactions supported by experiments and literature mining. For acetate and rapamycin, two other publications [47-48] were also considered for retrieving their TF targets. Overall, only 5 compounds in the yeast compendium were found to have TF targets in STITCH (with a confidence score > 0.7) and the publications. Figure 2-3(a) compares the rankings of known TF targets of these 5 chemical compounds, according to the adjusted p-values from YEASTRACT for DeltaNet, SSEM, MNI and z-scores predictions (see Appendix Table A1 for more details). Here, DeltaNet gave the best median ranking (69.5), followed by SSEM (85), MNI (92.5), and lastly z-scores (127). However, the differences among the methods were not statistically significant (see Appendix Table A2).

For the MCF-7 dataset, DeltaNet, SSEM, and z-scores were applied to generate the gene target predictions. MNI was not applied because the MCF-7 data set contains no samples with known perturbations for MNI parameter tuning. In this dataset, only 21 compounds have at least one reported TF target in DrugBank [49] and STITCH. From the top 100 predicted gene targets for each compound, a ranked list of TFs was obtained in decreasing order of enrichment scores, using Enrichr [50] with the option of position weight matrices. In Figure 2-3(b), the rankings from Enrichr for the known TF targets of the 21 compounds were compared among DeltaNet, SSEM and z-scores predictions (see Appendix Table A3 for more details). Again, DeltaNet gave the best median ranking (65), followed by z-scores (83) and SSEM (105). Here, the difference in the median rankings between DeltaNet and SSEM was statistically significant (see Appendix Table A2). Taken together, the outcomes of the TF enrichment analyses above demonstrated that DeltaNet could provide gene target predictions which agreed better with previously reported TFs than the other methods.

Figure 2-3. Rankings of known TF targets of chemical compounds based on TF enrichment analysis of DeltaNet, SSEM, MNI, and z-scores predictions. The TFs are ranked according to (a) the adjusted p-values of YEASTRACT for the yeast dataset and (b) the combined enrichment scores of Enrichr for the human MCF-7 dataset.

Unfortunately, TF enrichment analysis of E. coli data could not be performed because the chemical compounds, namely ampicillin, kanamycin, norfloxacin, and spectinomycin in the dataset do not have any TF interactions with high confidence (score > 0.7) in STITCH.

2.3. Discussion

The DeltaNet formulation uses an ODE model of the gene transcription process under the steady state condition. Two options are available in DeltaNet to solve the underdetermined linear regression problem: LAR (DeltaNet-LAR) and LASSO regularization (DeltaNet-LASSO). One can relax the assumption akk = 0 in DeltaNet by treating the predicted pki as the sum of the self-regulatory feedback and the perturbations caused by the treatment. In such a case, instead of using the magnitude of the perturbation coefficients pki to rank genes, one can use the q-values of pki [51]. However, for the case studies above, no improvement in accuracy was observed when using gene rankings from the q-values (see Appendix 7.1.1. and Appendix Figure A2).

The output of DeltaNet comprises a ranked list of gene target predictions. Such a list is amenable to further enrichment analysis to identify other types of molecular targets, such as TFs in the second case study above. An upstream analysis can also be applied to find protein partners of enriched TFs, for example by using Expression2Kinase [15] and Enrichr [50]. Beyond TF and protein targets, one can also apply functional enrichment analysis to obtain the functional relevance of the gene target predictions. Several web-server tools exist for such a purpose, notably ToppCluster [52], which provides 17 categories of human-ortholog gene annotations such as pathways, microRNAs and human phenotypes.

Although DeltaNet uses the same ODE model as MNI and SSEM, there is a key difference in the manner by which the target predictions are inferred from the data. Specifically, the first phase of both MNI and SSEM involves the identification of the GRN matrix A using a compendium of gene expression data. The perturbation matrix P is subsequently obtained for the treatment samples of interest by network filtering, which in essence uses the difference P = C − AC. In MNI, the matrix A is estimated from training data together with the unknown matrix P, using a procedure that resembles the Expectation Maximization algorithm [21]. The convergence of this procedure is however not guaranteed, and the solution often varies with the initial guess. Also, the performance of MNI is known to depend sensitively on the tuning of method parameters, which often leads to numerically intensive optimization [53]. Moreover, MNI requires data from known genetic perturbations for parameter tuning.
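The network filtering step can be illustrated with a small sketch that computes the residual of the linear model for a given GRN matrix. It assumes the convention C = AC + P implied by Eq. 2-2 (rows of C are genes, columns are samples), so that the residual is C − AC; the function name is hypothetical and the sketch does not reproduce the MNI or SSEM code.

```python
def network_filter(A, C):
    """Residual P = C - A C for an n x n GRN matrix A and n x m data matrix C.

    Row k of the result holds the part of gene k's log2FC that is not
    explained by the other genes through the network.
    """
    n, m = len(C), len(C[0])
    return [[C[k][i] - sum(A[k][j] * C[j][i] for j in range(n))
             for i in range(m)]
            for k in range(n)]
```

In the toy case where gene 0 is driven only by gene 1 with weight 0.5, the residual of gene 0 vanishes and the whole perturbation is attributed to gene 1.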

In contrast, SSEM uses LASSO regularization to identify the matrix A using the complete gene expression data, where the perturbation matrix P is initially set to zero. By doing so, SSEM ignores the treatment or perturbation effects when inferring the GRNs. The matrix P is subsequently obtained from the residuals of the regression above. The LASSO regularization enforces a limit on the model complexity, an assumption which is based on the observed sparsity of GRNs [54-55, 20]. The implementation of LASSO requires selecting the appropriate constraint parameter T for model complexity. As the optimal value is not known a priori and is also problem dependent, a cross-validation is often used for setting T. As shown in Table 2-1, analyses using LASSO, including DeltaNet-LASSO and SSEM, were the slowest among the methods considered. Here, the majority of the computational time was contributed by the cross validation step.

One can view DeltaNet as a hybrid between MNI and SSEM. The inference of the matrices A and P in DeltaNet is performed simultaneously, which resembles the first step of MNI. But, like in SSEM, the GRN is assumed to be sparse by way of LAR and LASSO regularization to tackle the underdetermined problem. Nevertheless, DeltaNet does not involve an explicit network filtering step. Instead, the perturbation matrix P for the treatment samples is obtained together with the other samples in the compendium. DeltaNet predictions could therefore fully use the information contained in the available data (training and treatment sets). As demonstrated in the case studies, DeltaNet offers a significant improvement in the accuracy of target prediction over MNI and SSEM. Furthermore, DeltaNet-LAR has better numerical efficiency and robustness with respect to parameter tuning than DeltaNet-LASSO, MNI and SSEM. Generally, DeltaNet-LAR with a threshold parameter δr of 10% would be a good choice because of the balance between target prediction accuracy and computational performance, as demonstrated in the above case studies.

While the difference between DeltaNet and the existing network filtering methods may appear slight, this deviation is nevertheless important and fundamental. There were two key factors motivating the single-step inference in DeltaNet. First, the inference of GRNs from typical gene expression data has been shown to be underdetermined [29-30]. Thus, any method relying on the solution of such an inference problem could be sensitive to the associated uncertainty. Second, despite the underdetermined issue, it is often possible to predict the effects of network perturbations from existing gene expression data with reasonable accuracy [56]. Therefore, DeltaNet was formulated based on the premise that the available gene expression data, while lacking information for the accurate inference of the GRN, have enough information to identify the network perturbations caused by a treatment.

Considering that both DeltaNet-LASSO and SSEM employ the same LASSO regularization, the differences between the gene target predictions from these two methods were quite interesting. In the first case study for the yeast and E. coli data sets, DeltaNet-LASSO produced sparser GRNs than SSEM (see Appendix Figure A3(a)). This trend is expected since, in comparison to SSEM, the DeltaNet formulation has additional degrees of freedom that come from the perturbation vector. However, the difference in sparsity between DeltaNet and SSEM in the Drosophila dataset was not significant.

Furthermore, when looking at the ak predictions (a row vector of the A matrix) for the target genes that were ranked within the top 10 by DeltaNet-LASSO but below the top 10 by SSEM in the E. coli, yeast, and Drosophila data sets, far fewer non-zero coefficients were obtained in ak by DeltaNet-LASSO than by SSEM (see Appendix Figure A3(b)). The observations above indicate a possibility of overfitting in SSEM, because its regression problem does not consider perturbations on the genes.

Finally, time series expression data have become routine and increasingly available in public databases. In the above case studies, the prediction accuracy of DeltaNet for E. coli and yeast did not differ significantly between including a small number of time series samples and including none of them (see Appendix Figure A4). Nevertheless, applying DeltaNet or any other method based on a steady-state model to time series data should generally be done with caution because the steady state assumption is violated.

2.4. Summary

DeltaNet is a network-based analysis tool for predicting causal gene targets of drug treatment using a compendium of gene expression data. The method is based on a linear regression model derived from an ODE for mRNA transcription rates. In DeltaNet, both the GRN and the perturbations for drug treatment samples are inferred at the same time. The underdetermined problem in DeltaNet can be solved using either LAR or LASSO regression. The MATLAB subroutine for DeltaNet is available at the following webpage: http://www.cabsel.ethz.ch/tools/deltanet.html. In the case studies with E. coli, yeast, Drosophila, and human MCF-7 data sets, both strategies in DeltaNet showed higher prediction accuracy than the other network filtering methods, MNI and SSEM. However, due to the underlying steady-state assumption, DeltaNet is of limited use for the extensive analysis of time series data. Chapter 3 is devoted to an extension of DeltaNet for time series data.

2.5. Remark

The work presented in this chapter has been published in Bioinformatics [57].

CHAPTER3: DELTANETS | 25

3. Inferring causal gene targets from time course expression data

Time series experiments are commonly performed to elucidate the dynamical responses of gene transcription to intra- and extracellular stimuli. Microarray data from such studies provide gene transcriptional profiles at multiple time points after a stimulus, for example after the drug treatment starts. To the best of my knowledge, the complications arising from a direct application of DeltaNet, or any of the methods based on the pseudo steady-state assumption, to time series data have not been thoroughly investigated.

Figure 3-1 illustrates one important complication when applying methods based on the pseudo steady-state assumption to time series gene expression profiles, using a simple three-gene network with a perturbation to the upstream gene (gene A). The complication in this example involves reversals of the causal directions in the inferred gene regulatory network due to the violation of the steady state condition. When one applies steady-state model methods to the time series measurements of gene expression depicted in Figure 3-1(a), the prediction from those data would be more consistent with a GRN with the reverse regulatory directions under time-varying network perturbations (see Figure 3-1(b)). Consequently, the accuracy of the gene target predictions based on the inferred GRN could deteriorate.

Figure 3-1. A simple three-gene network with a perturbation to the upstream gene (gene A). The nodes represent genes, while arrows represent gene regulations. (a) The dynamics of gene expression over three time points due to a perturbation on gene A. Genes showing differential expression are drawn as filled nodes. (b) Inference of the GRN from time series differential expression under the steady state assumption. Data at each time point are viewed as independent steady state samples. The differential expression is consistent with a GRN with the reverse regulatory interactions under time-varying perturbations.

In this chapter, DeltaNet is extended to incorporate an additional network constraint that accounts for the dynamics in time series expression profiles. The improved method, called DeltaNeTS, is also able to utilize prior information on the GRN structure. Any prior information on the existence of gene-gene regulatory relationships may alleviate the curse of dimensionality in DeltaNeTS by limiting the degrees of freedom, specifically the number of nonzero coefficients akj. Without any prior information on the GRN structure, DeltaNeTS adopts LASSO regularization under the assumption that the GRN is sparse [26]. When prior information on the GRN structure is given, DeltaNeTS uses ridge regression [33]. In this chapter, the advantage of DeltaNeTS over another time series data analysis tool called TSNI and over the steady-state model based DeltaNet is demonstrated by application to in silico and yeast time series gene expression data. Lastly, DeltaNeTS is applied to elucidate dynamic network perturbations caused by H1N1 (influenza A/CA/04/09) and H5N1 (influenza A/VN/1203/04) type influenza A viral infections.

3.1. Method and materials

3.1.1. DeltaNeTS formulation

The formulation of gene target prediction in DeltaNeTS is the same as in DeltaNet, but with an additional constraint to accommodate time series gene expression data. More specifically, we assume that the logarithm of the mRNA concentrations varies linearly between sampling time points (i.e. the log-transformed profiles are piecewise linear), such that

$$\log r_k^{l+1} = s_k^{l}\,\Delta t + \log r_k^{l} \tag{3-1}$$

where $r_k^{l}$ and $r_k^{l+1}$ are the mRNA concentrations of gene k at the l-th and (l+1)-th time points, $s_k^{l}$ is the slope between the two time points, and $\Delta t$ is the time elapsed since the l-th time point. Using Eq. 1-1, the first order derivative of the logarithm of rk can be rewritten as follows:

$$\frac{d \log r_k}{dt} = \frac{u_k}{r_k} \prod_{\substack{j=1 \\ j \neq k}}^{n} r_j^{a_{kj}} - d_k = u_k \exp\!\Bigg( \sum_{\substack{j=1 \\ j \neq k}}^{n} a_{kj} \log r_j - \log r_k \Bigg) - d_k \tag{3-2}$$

By substituting Eq. 3-1 for rk in Eq. 3-2, the following equation can be derived:

$$\frac{d \log r_k}{dt} = u_k \exp\!\Bigg( \sum_{\substack{j=1 \\ j \neq k}}^{n} a_{kj} \big( s_j^{l}\,\Delta t + \log r_j^{l} \big) - \big( s_k^{l}\,\Delta t + \log r_k^{l} \big) \Bigg) - d_k \tag{3-3}$$

where rk in Eq. 3-3 is the mRNA concentration of gene k between the l-th and (l+1)-th time points.

The derivative of Eq. 3-3 (i.e. the second derivative of log(rk)) becomes zero due to the assumption made in Eq. 3-1:

$$\frac{d^{2} \log r_k}{dt^{2}} = u_k \Bigg( \sum_{\substack{j=1 \\ j \neq k}}^{n} a_{kj} s_j^{l} - s_k^{l} \Bigg) \exp\!\Bigg( \sum_{\substack{j=1 \\ j \neq k}}^{n} a_{kj} \big( s_j^{l}\,\Delta t + \log r_j^{l} \big) - \big( s_k^{l}\,\Delta t + \log r_k^{l} \big) \Bigg) = 0 \tag{3-4}$$

Here, uk and dk are assumed to be the same between sampling times l and l+1, i.e. the perturbations are constant in this time window. Since uk and the exponential term in Eq. 3-4 are always positive, the remaining term should be zero. Therefore, the following relationship between the slope $s_k^{l}$ and the network coefficients akj can be obtained:

$$s_k^{l} = \sum_{\substack{j=1 \\ j \neq k}}^{n} a_{kj}\, s_j^{l} \tag{3-5}$$

The matrix form of Eq. 3-5 for n genes and m samples is as follows:

$$S = AS \tag{3-6}$$

where S is the n × m matrix of the slopes calculated from the time series log2FC data. Finally, in DeltaNeTS, Eq. 1-5 and Eq. 3-6 are solved together:

$$\begin{bmatrix} C^{T} \\ S^{T} \end{bmatrix} = \begin{bmatrix} C^{T} & I_m \\ S^{T} & 0 \end{bmatrix} \begin{bmatrix} A^{T} \\ P^{T} \end{bmatrix} \tag{3-7}$$

where 0 is the zero matrix of appropriate dimensions.

3.1.2. DeltaNeTS implementation

In DeltaNeTS, the slopes of the time series gene expression profiles were calculated using 2nd-order accurate finite difference approximations at each sampling time point [58]. Here, non-uniform time gaps among the sampling points were also considered in the finite difference calculation. For the first and last time points, forward and backward finite differences were used, respectively, and for the middle time points, the centered approximation was used.
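A minimal sketch of such a slope calculation is given below. It uses the standard three-point, 2nd-order accurate stencils for non-uniform grids (forward at the first point, centered in the interior, backward at the last point). The function name `finite_difference_slopes` is hypothetical, and the exact stencils used in the DeltaNeTS packages may differ from the ones assumed here.

```python
def finite_difference_slopes(t, y):
    """Estimate dy/dt at every time point of a possibly non-uniform grid.

    Uses three-point stencils that are exact for quadratic profiles.
    Requires at least three time points.
    """
    n = len(t)
    if n != len(y) or n < 3:
        raise ValueError("need at least three matching time points")
    slopes = [0.0] * n

    # Forward difference at the first time point.
    h1, h2 = t[1] - t[0], t[2] - t[1]
    slopes[0] = (-(2 * h1 + h2) / (h1 * (h1 + h2)) * y[0]
                 + (h1 + h2) / (h1 * h2) * y[1]
                 - h1 / (h2 * (h1 + h2)) * y[2])

    # Centered difference at the interior time points.
    for i in range(1, n - 1):
        h1, h2 = t[i] - t[i - 1], t[i + 1] - t[i]
        slopes[i] = (-h2 / (h1 * (h1 + h2)) * y[i - 1]
                     + (h2 - h1) / (h1 * h2) * y[i]
                     + h1 / (h2 * (h1 + h2)) * y[i + 1])

    # Backward difference at the last time point.
    h1, h2 = t[-2] - t[-3], t[-1] - t[-2]
    slopes[-1] = (h2 / (h1 * (h1 + h2)) * y[-3]
                  - (h1 + h2) / (h1 * h2) * y[-2]
                  + (h1 + 2 * h2) / (h2 * (h1 + h2)) * y[-1])
    return slopes
```

Applied to each gene's log2FC profile in turn, the returned values would fill one row of the slope matrix S. Because the stencils are exact for quadratics, a profile such as y = t² sampled on a non-uniform time mesh gives slopes equal to 2t up to rounding.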

In the implementation of DeltaNeTS, the problem can be seen as the following linear regression problem:

$$Y = XB, \qquad Y = \begin{bmatrix} C^{T} \\ S^{T} \end{bmatrix}, \qquad X = \begin{bmatrix} C^{T} & I_m \\ S^{T} & 0 \end{bmatrix} \tag{3-8}$$

In Eq. 3-8, the unknown matrix $B = [A \;\; P]^{T}$ can be solved one column at a time, as in DeltaNet. Thus, for each gene, yk = Xβk is solved, where yk and βk are the k-th columns of Y and B, respectively.

The variables yk and X are scaled by the following multiplication factor:

$$W = \begin{bmatrix} I_m & 0 \\ 0 & \rho\, I_{M-m} \end{bmatrix} \tag{3-9}$$

where ρ is a scaling factor given by

$$\rho = \frac{\lVert y_k[1 \ldots m] \rVert_2^{2} \,/\, m}{\lVert y_k[m+1 \ldots M] \rVert_2^{2} \,/\, (M-m)} \tag{3-10}$$

where yk[1…m] is a subvector of yk with the first m elements, and yk[m+1…M] contains the remaining elements corresponding to the slopes.
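The scaling step can be sketched as follows. The sketch assumes that ρ in Eq. 3-10 is the ratio of the mean squared log2FC entries to the mean squared slope entries of yk, and that the weighting multiplies only the slope rows of yk and X. The function names `scaling_factor` and `apply_scaling` are hypothetical.

```python
def scaling_factor(y, m):
    """Ratio of mean squares: first m entries (log2FC) over the rest (slopes)."""
    head, tail = y[:m], y[m:]
    if not head or not tail:
        raise ValueError("y must contain both log2FC and slope entries")
    mean_sq_head = sum(v * v for v in head) / len(head)
    mean_sq_tail = sum(v * v for v in tail) / len(tail)
    if mean_sq_tail == 0.0:
        raise ValueError("all slopes are zero; scaling factor undefined")
    return mean_sq_head / mean_sq_tail


def apply_scaling(y, X, m):
    """Multiply the slope rows of y and X by rho; leave the first m rows as is."""
    rho = scaling_factor(y, m)
    y_scaled = list(y[:m]) + [rho * v for v in y[m:]]
    X_scaled = [list(row) for row in X[:m]] + [[rho * v for v in row] for row in X[m:]]
    return y_scaled, X_scaled, rho
```

For example, with y = [2, 2, 1, 1] and m = 2 the mean squares are 4 and 1, so ρ = 4 and the two slope entries are rescaled to 4. The intent of such a weighting is to keep the log2FC and slope blocks on comparable scales in the regression.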

The regression problem in Eq. 3-8 is typically underdetermined as the number of genes is usually larger than the number of samples. In DeltaNeTS, two strategies are available to tackle the underdetermined problem. The first strategy applies LASSO regularization [26] by assuming the network A is sparse. The other strategy employs ridge regression [33], which is used when prior information of the GRN structure is available. For LASSO regularization, the following penalized least square objective function is minimized, as done in DeltaNet-LASSO:

$$\min_{\beta_k} \; \lVert y_k - X \beta_k \rVert_2^{2} \quad \text{subject to} \quad \lVert a_k \rVert_1 \le T$$

where ak is the subvector of βk corresponding to the GRN (i.e. the first n elements of βk). Similar to DeltaNet, the k-th element of βk was excluded when calculating, to obtain a matrix A with zero diagonal entries.
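As a rough illustration of this regression, the sketch below solves the penalised (Lagrangian) counterpart of the constrained problem, 0.5·‖y − Xb‖² + lam·‖a‖₁, by cyclic coordinate descent. This is a swapped-in stand-in rather than the GLMNET routine used in the thesis. The function names and the parameters `lam`, `n_penalized` (number of leading coefficients that form ak) and `skip` (indices held at zero, such as the k-th coefficient) are hypothetical.

```python
def soft_threshold(z, g):
    """Soft-thresholding operator used by LASSO coordinate updates."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def lasso_cd(X, y, lam, n_penalized, skip=(), n_iter=200):
    """Minimise 0.5*||y - X b||^2 + lam*sum(|b[j]| for j < n_penalized).

    Coefficients whose index is in `skip` stay at zero, which mimics the
    exclusion of the k-th element (zero diagonal of A).
    """
    n_rows, n_cols = len(X), len(X[0])
    b = [0.0] * n_cols
    resid = list(y)  # residual y - X b, updated in place
    col_sq = [sum(X[i][j] ** 2 for i in range(n_rows)) for j in range(n_cols)]
    for _ in range(n_iter):
        for j in range(n_cols):
            if j in skip or col_sq[j] == 0.0:
                continue
            # Inner product of column j with the partial residual.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * b[j]) for i in range(n_rows))
            penalty = lam if j < n_penalized else 0.0
            new_bj = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n_rows):
                    resid[i] -= X[i][j] * delta
                b[j] = new_bj
    return b
```

A call such as `lasso_cd(X, y_k, lam, n_penalized=n, skip={k})` would return an estimate of βk whose first n entries play the role of ak. In practice lam would be chosen by cross validation, and a fixed iteration count is used here only to keep the sketch short.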

For the ridge regression strategy with a given GRN structure, the dimension of the problem is reduced to:

$$C_k = A_k C_{TF_k} + P_k \tag{3-11}$$

$$S_k = A_k S_{TF_k} \tag{3-12}$$

where TFk is the set of TFs (n = $n_{TF_k}$) regulating gene k in the given GRN, Ck, Sk, and Pk are the k-th rows of the matrices C, S, and P respectively, Ak is the 1 × $n_{TF_k}$ vector of akj for j ∈ TFk, and $C_{TF_k}$ and $S_{TF_k}$ are the submatrices consisting of the rows of C and S corresponding to TFk. The objective function in this case is:

$$\min_{\beta_k} \; \lVert y_k - X \beta_k \rVert_2^{2} + \lambda \lVert \beta_k \rVert_2^{2}$$

where $y_k = [\, C_k \;\; S_k \,]^{T}$, $X = [\, [\, C_{TF_k} \;\; S_{TF_k} \,]^{T}, \; [\, I_m \;\; 0 \,]^{T} \,]$, $\beta_k = [\, A_k \;\; P_k \,]^{T}$, and λ is a shrinkage parameter for the L2-norm penalty. Again, the GLMNET algorithm [35] was employed to solve the LASSO or ridge regression above. Here, a 10-fold cross validation was used to determine the optimal shrinkage parameter T or λ in the same way as described in section 2.1.1.
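The reduced ridge problem can be sketched with the closed-form solution b = (XᵀX + λI)⁻¹Xᵀy. This is an illustrative stand-in for GLMNET, not the thesis implementation. It assumes that the slope block has the same number of columns m as the log2FC block, and the names `solve_linear`, `ridge` and `ridge_for_gene` are hypothetical.

```python
def solve_linear(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + [v[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[piv][c]) < 1e-12:
            raise ValueError("singular system")
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def ridge(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X + lam*I) b = X^T y."""
    p = len(X[0])
    gram = [[sum(row[a] * row[b] for row in X) + (lam if a == b else 0.0)
             for b in range(p)] for a in range(p)]
    rhs = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    return solve_linear(gram, rhs)


def ridge_for_gene(C_tf, S_tf, C_k, S_k, lam):
    """Build y_k and X from Eqs. 3-11 and 3-12 for one gene and return (A_k, P_k)."""
    n_tf, m = len(C_tf), len(C_k)
    X = []
    for i in range(m):  # rows for C_k = A_k C_TF + P_k
        X.append([C_tf[j][i] for j in range(n_tf)]
                 + [1.0 if i == q else 0.0 for q in range(m)])
    for i in range(m):  # rows for S_k = A_k S_TF
        X.append([S_tf[j][i] for j in range(n_tf)] + [0.0] * m)
    y = list(C_k) + list(S_k)
    beta = ridge(X, y, lam)
    return beta[:n_tf], beta[n_tf:]
```

In this sketch λ is passed in directly; in the workflow described above it would be selected by 10-fold cross validation.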

3.1.3. Gene regulatory network for human epithelial lung cancer cells

For the Calu-3 data analysis, DeltaNeTS was implemented using a prior GRN structure with ridge regression. For the GRN structure, TF-gene interactions specific to human epithelial lung cancer cells were obtained from the Regulatory Circuits database [59]. Here, only TF-gene interactions with a confidence score greater than 0.1 were included. The confidence score indicates the normalized promoter activity level in a given cell type (0: not active, 1: maximally active) [59]. The GRN structure for Calu-3 consisted of 42,145 edges pointing from 515 TFs to 7,125 genes.

3.1.4. Gene expression data

In this chapter, the performance of DeltaNeTS was evaluated using in silico transcriptional expression profiles generated by GeneNetWeaver (GNW version 3.1) [60], as well as using yeast microarray gene expression datasets compiled from public databases. For the in silico data, five replicates of single-gene KO and control (wild-type) experiments were simulated using a random E. coli subnetwork with 1000 nodes and 2249 edges, generated by GNW. Among them, the single-gene KO experiments for 114 transcription factors (i.e. genes regulating at least one other gene) were selected. For each KO experiment, two time series datasets were considered, one set using a uniform time mesh with t = 0, 0.2, 0.4, 0.6, 0.8 and 1.0 (GNW1 dataset), and another set using a non-uniform time mesh with t = 0, 0.1, 0.2, 0.4, 0.7, and 1.0 (GNW2 dataset). For the analysis, the log2FC values were calculated by subtracting the averaged expressions over the replicates of the control experiments from those of the KO experiments.
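The log2FC calculation described above amounts to a difference of replicate averages. A minimal sketch, assuming the expression values are already on a log2 scale and using the hypothetical function names `mean_profile` and `log2_fold_change`, is:

```python
def mean_profile(replicates):
    """Average expression per gene over replicate profiles (lists of equal length)."""
    n_genes = len(replicates[0])
    return [sum(rep[g] for rep in replicates) / len(replicates) for g in range(n_genes)]


def log2_fold_change(ko_replicates, control_replicates):
    """log2FC per gene: mean KO expression minus mean control expression.

    Assumes all inputs are log2-scale expression values.
    """
    ko = mean_profile(ko_replicates)
    ctrl = mean_profile(control_replicates)
    return [a - b for a, b in zip(ko, ctrl)]
```

For a time series experiment, the same function would be applied separately at each time point to build one column of the matrix C.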

For yeast, 531 raw microarray samples of Affymetrix Yeast Genome S98 array were compiled from 6 batch groups (GSE25582, GSE3076, GSE1934, E-MEXP-2354, GSE26169, and GSE9482) available in ArrayExpress [40] and Gene Expression Omnibus (GEO) [41]. The dataset contained 396 time series samples from 23 perturbation experiments, such as gene KO and gene activations, and 135 time series samples from wild-type strain (control experiments). The raw intensity data were normalized in log2 scale using justRMA function in the affy package of Bioconductor [42]. 5168 probe sets in the microarray (out of a total 9334 probe sets) were matched to gene symbols in ygs98.db package (Saccharomyces Genome database as of 26th September 2015). After taking averages over replicates and calculating the differences between perturbation and control experiments, the final yeast time series dataset comprised 137 samples of log2FC data for 5168 genes. In addition to time series data, 325 steady state microarray samples were further compiled from ArrayExpress and GEO, related to genetic perturbation experiments on yeast. These data were processed in the same manner as the time series data described above, leading to 118 samples of steady state log2FC data for 5168 genes.

For the interferon signaling and influenza A infection study, the raw Agilent Whole Human Genome 4×44K microarray data were compiled from 12 batch experiments, of which 11 groups came from viral infection experiments, and the remaining group was from interferon-alpha (IFN-α) and interferon-gamma (IFN-γ) treatment experiments. The raw data were background-corrected and normalized using the normexp and quantile methods in the limma package of Bioconductor. The log2FCs and their p-values adjusted by the Benjamini-Hochberg method were calculated using limma by comparing the samples between virus (or interferon) and mock conditions at each time point. The probe sets were mapped to the official gene symbols in the hgug4112a.db package (Entrez Gene database as of 27th September 2015). For a gene with multiple probe sets, the log2FCs from the probe set with the smallest average adjusted p-value were selected. The resulting set of pre-processed data includes 139 time series samples for 19,112 genes.
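The probe set selection rule for genes with multiple probe sets can be sketched as follows. The function name `select_probe_per_gene` and the dictionary layout (probe to gene symbol, probe to list of adjusted p-values across samples) are hypothetical choices made for this illustration.

```python
def select_probe_per_gene(probe_to_gene, adj_pvalues):
    """For each gene, keep the probe set with the smallest average adjusted p-value.

    probe_to_gene: dict mapping probe id -> gene symbol
    adj_pvalues:   dict mapping probe id -> list of adjusted p-values over samples
    Returns a dict mapping gene symbol -> selected probe id.
    """
    best = {}
    for probe, gene in probe_to_gene.items():
        pvals = adj_pvalues[probe]
        avg = sum(pvals) / len(pvals)
        # Strict '<' keeps the first probe encountered when averages tie.
        if gene not in best or avg < best[gene][1]:
            best[gene] = (probe, avg)
    return {gene: probe for gene, (probe, _) in best.items()}
```

The log2FC rows of the selected probe sets would then be carried forward as the gene-level profiles.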

3.2. Results

3.2.1. Predicting constant perturbations under time-course gene expressions

The performance of DeltaNeTS was compared with that of the original method DeltaNet and of another method for analysing time series data called TSNI [23]. In addition, the z-scores of the log2FC data (Eq. 2-3) were used to rank gene targets. For DeltaNet and DeltaNeTS, we employed LASSO with a 10-fold cross validation as described in section 3.1.2. For TSNI, the MATLAB subroutines from the website of the original authors (URL: http://dibernardo.tigem.it/softwares/time-series-network-identification-tsni) were utilized. Unlike DeltaNeTS, the TSNI subroutine could only be applied to one time series experiment at a time, as the method performs data smoothing for each set of time series experiments. In the DeltaNeTS and TSNI analyses, an additional constraint was enforced, such that the perturbations are constant over time (i.e. the same P for the different time points of the same experiment).

In assessing the performance of each method, for each sample, the genes were first ranked in decreasing magnitudes of the inferred perturbation coefficients pki or the z-scores. A higher ranked gene can be seen as a more likely target of the treatment than those ranked lower. The prediction accuracy metrics, namely the numbers of TP, false positive (FP), true negative (TN), and false negative (FN) predictions were calculated, as a function of the increasing rank number. Here, targets from different time points were counted separately. The signs of the predicted perturbations were also considered when computing the metrics above.
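The ranking and counting procedure for a single sample can be sketched as below. For brevity the sketch ranks by magnitude only and does not implement the sign check mentioned above, which is a simplification. The function name `confusion_by_rank` is hypothetical.

```python
def confusion_by_rank(scores, true_targets):
    """Rank genes by decreasing |score| and tally TP/FP/TN/FN at each cutoff.

    scores:       list of perturbation coefficients (or z-scores), one per gene
    true_targets: set of gene indices that are known targets
    Returns (order, counts), where counts[r] describes the top r+1 genes.
    """
    order = sorted(range(len(scores)), key=lambda g: -abs(scores[g]))
    n_pos = len(true_targets)
    n_neg = len(scores) - n_pos
    counts, tp, fp = [], 0, 0
    for g in order:
        if g in true_targets:
            tp += 1
        else:
            fp += 1
        counts.append({"TP": tp, "FP": fp, "FN": n_pos - tp, "TN": n_neg - fp})
    return order, counts
```

The true positive rate among the top r genes then follows as TP/(TP+FN) from `counts[r - 1]`.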

Figure 3-2 compares the true positive rates (TPR = TP/(TP+FN)) of the four methods. For TSNI, Figure 3-2 shows only the best TPRs among the different numbers of PCs that were tested (PC number = 1 for the GNW datasets, and PC number = 3 for the yeast dataset). In general, the methods performed better in analysing the in silico datasets than the actual microarray data from yeast. Importantly, on the in silico datasets, the two time series analysis methods, namely DeltaNeTS and TSNI, performed equally well, while both methods significantly outperformed DeltaNet and z-scores. For the yeast dataset, DeltaNeTS was the best performer, placing 95% of the true targets among the top 10 genes in the ranked list. This true positive rate was significantly better than that of TSNI, which placed 74% of the true targets among the top 10 genes.

Figure 3-2. True positive rates of target predictions by DeltaNeTS, DeltaNet, TSNI and z-scores for the in silico time series datasets, using (a) uniform time sampling and (b) non-uniform time sampling, and for (c) the yeast time series dataset.

Furthermore, the AUPR and AUROC [29] for each method were compared. The ROC was constructed by plotting the true positive rates (TP/(TP+FN)) vs. false positive rates (FP/(FP+TN)) as a function of the increasing rank number (i.e. genes were sequentially included from the ranked list). Meanwhile, the prediction precision was calculated as the ratio TP/(TP+FP), while the recall was equal to the true positive rate. As shown in Table 3-1, the AUROC stayed relatively high for all methods, except for DeltaNet on the yeast dataset. Meanwhile, in terms of AUPR, the predictions of DeltaNeTS outperformed TSNI and the other methods, especially when using the yeast time series data. The AUPR of TSNI deteriorated significantly when using the time series dataset with uneven time points (GNW2), in comparison to the dataset with a uniform time mesh (GNW1). In contrast, the AUPR values of the DeltaNeTS predictions for the GNW1 and GNW2 datasets did not differ significantly. Finally, DeltaNet gave the lowest AUROC on every dataset, demonstrating the need to analyse time series data using an appropriate strategy.
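As an illustration of how such areas can be computed from a ranked list, the following sketch accumulates ROC and precision-recall points while walking down the ranking and integrates them with the trapezoidal rule. Several AUPR conventions exist, and the one assumed here (trapezoidal integration, with the first precision value extended back to zero recall) may differ from the one used in the thesis. The function names are hypothetical.

```python
def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points in x order."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def auroc_aupr(labels_in_rank_order):
    """Return (AUROC, AUPR) for labels listed from the top-ranked gene downwards.

    labels_in_rank_order: sequence of booleans (True = true target).
    """
    n_pos = sum(1 for lab in labels_in_rank_order if lab)
    n_neg = len(labels_in_rank_order) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    tp = fp = 0
    roc = [(0.0, 0.0)]
    pr = []
    for lab in labels_in_rank_order:
        if lab:
            tp += 1
        else:
            fp += 1
        recall = tp / n_pos
        roc.append((fp / n_neg, recall))      # (FPR, TPR)
        pr.append((recall, tp / (tp + fp)))   # (recall, precision)
    # Extend the first precision value back to recall = 0 (assumed convention).
    pr = [(0.0, pr[0][1])] + pr
    return trapezoid_area(roc), trapezoid_area(pr)
```

A perfect ranking, with all true targets ahead of all non-targets, gives 1.0 for both areas under this convention.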

Figure 3-3. True positive rates of target predictions by DeltaNeTS and DeltaNet for (a) time series samples and (b) steady-state samples using yeast microarray datasets.

Table 3-1. AUROCs and AUPRs of DeltaNeTS, TSNI, DeltaNet, and z-scores

Metric  Dataset  DeltaNeTS  TSNI   DeltaNet  Z-scores
AUROC   GNW1     1.000      1.000  0.765     0.839
AUROC   GNW2     0.998      0.999  0.755     0.824
AUROC   Yeast    0.980      0.998  0.567     0.919
AUPR    GNW1     0.662      0.637  0.491     0.545
AUPR    GNW2     0.523      0.373  0.420     0.364
AUPR    Yeast    0.580      0.330  0.055     0.035

In addition, the performance of DeltaNeTS was also tested in combining steady state and time series transcriptional expression profiles. In this case, the matrix C in Eq. 3-7 and Eq. 3-8 contained both types of data, while the matrix S was computed only for the time series samples. Figure 3-3 compares the TPRs of DeltaNeTS with those of DeltaNet when using the combination of time series and steady state datasets from yeast. The addition of steady state data improved DeltaNet’s target predictions for the time series experiments. Importantly, DeltaNeTS outperformed DeltaNet when predicting the targets of steady state experiments.

3.2.2. Predicting time-varying perturbations

Next, DeltaNeTS was applied to the gene expression profiles of Calu-3 to predict the targets of IFN-α and IFN-γ. Interferons activate different TFs through distinct signalling pathways and trigger the transcription of genes involved in immune responses. The binding sites are known as IFN-stimulated response elements (ISRE) for IFN-α signalling and IFN-γ activated sites (GAS) for IFN-γ signalling pathways [61]. Before analysing the data, the log2FC values that were not statistically significant (Benjamini-Hochberg adjusted p-value > 0.05) were replaced with linearly interpolated values between the adjacent time points with a significant log2FC. If there was no statistically significant log2FC over the time points, the log2FC for the gene was set to zero. In addition, TF-gene interactions for human epithelial lung cancer cells were used as prior information on the GRN (see Sections 3.1.2 and 3.1.3). As a result, DeltaNeTS produced predictions pki only for the genes existing in the prior GRN structure (n = 7125). For the purpose of comparison, TSNI and DeltaNet were also implemented to give perturbations related to the expression of the same genes as in DeltaNeTS. In the analysis using DeltaNeTS, the network perturbations were allowed to vary over time (i.e. an individual prediction for each time point). Once the P matrix was generated by DeltaNeTS, an enrichment analysis for transcription factor binding sites (TFBS) was performed with Enrichr [62] using the top 50 genes according to the P coefficients averaged over the time points. The enrichment analyses of the results of TSNI, DeltaNet and log2FC were also performed using the top 50 genes according to the values averaged over the time points. Finally, the ranking and significance of the enriched TFBS for ISRE for IFN-α and for STAT5, which includes the GAS sequence [63], for IFN-γ were compared.
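The replacement of non-significant log2FC values can be sketched for one gene as follows. The source does not state how non-significant values before the first or after the last significant time point were handled, so leaving them unchanged is an assumption of this sketch. The function name `fill_nonsignificant` and the parameter `alpha` are hypothetical.

```python
def fill_nonsignificant(times, values, adj_p, alpha=0.05):
    """Replace non-significant log2FCs of one gene by linear interpolation.

    A value is treated as significant when its adjusted p-value is <= alpha.
    Values between two significant time points are linearly interpolated.
    If no time point is significant, the whole profile is set to zero.
    ASSUMPTION: values outside the first/last significant point are kept as is.
    """
    significant = [i for i, p in enumerate(adj_p) if p <= alpha]
    if not significant:
        return [0.0] * len(values)
    filled = list(values)
    for left, right in zip(significant, significant[1:]):
        span = times[right] - times[left]
        for i in range(left + 1, right):
            w = (times[i] - times[left]) / span
            filled[i] = values[left] + w * (values[right] - values[left])
    return filled
```

For example, with times 0, 2, 4 and 8 hours, significant log2FCs of 1.0 and 5.0 at the first and last time points, and non-significant values in between, the two middle entries become 2.0 and 3.0.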
As shown in Table 3-2, the binding sites of ISRE, STAT5A, and STAT5B were all significantly enriched (adjusted p-value < 0.01) among the top 50 target predictions from DeltaNeTS. Meanwhile, not all of the expected binding sites were enriched among the top 50 target predictions from TSNI and log2FC (adjusted p-value < 0.01). Note that none of the TFBS of interferons were significantly enriched among the top 50 targets predicted by DeltaNet.

Furthermore, DeltaNeTS was used for analyzing dynamic network perturbations caused by H1N1 (influenza A/CA/04/09) and H5N1 (influenza A/VN/1203/04) type viral infections in the Calu-3 dataset. H1N1 and H5N1 influenza viruses are also known as the swine and avian influenza viruses, respectively. These viruses have recently posed a great threat to public health worldwide [64]. Despite both being classified as influenza A type, the two viral strains have distinct characteristics: H5N1 is highly pathogenic but less transmissible to humans, while H1N1 is the opposite.

DeltaNeTS was applied in the same manner as the above analysis of interferon targets. The influenza infection period was divided into 3 time phases: phase 1 from 0 to 7 hours, phase 2 from 7 to 18 hours, and phase 3 from 18 hours onwards. A gene set enrichment analysis (GSEA) using Reactome pathways [65] was finally performed using the averaged P predictions over the time points within each phase.

Table 3-2. Transcription factor binding sites enriched in the top 50 target predictions

           IFN-α           IFN-γ
Method     ISRE            STAT5A          STAT5B
DeltaNeTS  2 (4.7E-03)*    3 (2.36E-03)    6 (8.97E-03)
TSNI       3 (1.7E-04)     8 (6.25E-02)    14 (6.62E-01)
DeltaNet   134 (5.9E-01)   - (-)           - (-)
log2FC     3 (1.9E-04)     4 (1.37E-03)    9 (3.88E-02)

*: rank (adjusted p-value) from Enrichr

The GSEA results by DeltaNeTS are depicted in Figure 3-4, clearly showing the key differences in the biological pathways modulated by the two influenza A viral infections. In the H5N1 infection, the pathways of cell cycle, DNA repair, and programmed cell death were prominently over-represented, whereas in the infection by the H1N1 virus, cell survival-related pathways including G protein-coupled receptor (GPCR) and mitogen activated protein kinase (MAPK) family cascades were distinctively modulated. Cell survival pathways such as the GPCR and MAPK signaling pathways are typically known to be hijacked by influenza viruses for improving viral replication [66]. Meanwhile, in the case of severe influenza A virus infection, the apoptosis process is known to be triggered [67-68]. Besides apoptosis, H5N1 has been shown to cause excessive secretion of cytokine molecules [69]. Although the cytokine signaling pathway was mildly enriched (p-value > 0.01) in the analysis of the H1N1 infection data, the highest enrichment (p-value < 0.01) appeared in the H5N1 infection between 7 and 18 hours. Furthermore, the enrichment of TGF-β (transforming growth factor-beta) among the targets predicted by DeltaNeTS for H1N1 is in agreement with previous findings that showed persistent induction of the inflammatory response through TGF-β2 by H1N1 type influenza virus [70-71]. Note that the GSEA using the log2FC data did not produce results that were consistent with existing knowledge about these infections (see Appendix Figure B1).


Figure 3-4. Enriched pathways resulting from the gene set enrichment analysis of DeltaNeTS predictions using Reactome pathways. The size and color of the dots indicate the negative base-10 logarithm of the p-values for the enriched terms. The influenza infection period was divided into 3 time phases: phase 1 = 0-7 hours, phase 2 = 7-18 hours, and phase 3 ≥ 18 hours post-infection.

3.3. Discussion

Time series gene expression data are routinely measured in biological and clinical studies, and have also become increasingly available in public databases. To take advantage of such informative data, DeltaNeTS extends the previous method DeltaNet, originally created for steady state expression data. More specifically, DeltaNeTS incorporates an additional constraint on the GRN matrix A, using information contained in the slopes of the expression data, to prevent reversals of the causal directions in the GRN. Similar to DeltaNet, the magnitudes of the perturbation coefficients in P indicate the likelihood that the corresponding gene is a direct target of the treatment, while the signs suggest the types of the perturbation (activation/repression). In comparison to DeltaNet, TSNI and z-scores, DeltaNeTS could produce more accurate gene target predictions, when applied to in silico datasets and especially to yeast time series expression profiles.

There exist obvious similarities between the TSNI and DeltaNeTS methods. While the starting ODE model in TSNI differs from the ODE model in DeltaNeTS, the final inference equations in the two strategies bear a strong resemblance. Also, in both strategies, the GRN and perturbation matrices A and P are inferred simultaneously. The key difference is in how DeltaNeTS and TSNI resolve the curse of dimensionality. In the case of TSNI, the original time series data are first smoothed, and the smoothing function is subsequently used to interpolate and produce data with much higher temporal resolution. By using only the most important PCs, the inference is finally carried out in a lower dimensional space for computational efficiency. In contrast, DeltaNeTS finds a solution using LASSO or ridge regression. Without prior information on the GRN structure, a sparse GRN is assumed, and predictions for each gene are obtained from the gene expression profiles of a subset of other genes that are chosen by LASSO regression. When the GRN structure is given, the perturbation inference is performed for the genes existing in this structure. In this case, the coefficients for all the variables are calculated using ridge regression.

Because of its dependence on data smoothing for interpolation, TSNI could become sensitive to any bias in this step. In the case study, it was observed that the performance of TSNI dropped significantly for the time series dataset with non-uniform time sampling (see AUPR in Table 3-1). It was further noted that the accuracy of the gene target predictions from TSNI depended sensitively on the number of PCs, where using too few or too many PCs would lead to poor prediction performance. Unfortunately, the authors of TSNI did not provide clear guidance or a strategy to select the optimal number of PCs. Finally, the biggest drawback of the current TSNI implementation was its inability to combine data from different perturbation experiments and to use steady state data. In contrast, DeltaNeTS can naturally accommodate both steady state and time series expression profiles. As illustrated in Figure 3-3, DeltaNeTS could even outperform DeltaNet in predicting the gene targets of steady state experiments, by combining steady state and time series datasets.

Furthermore, the application of DeltaNeTS to the Calu-3 data demonstrated the efficacy of DeltaNeTS for target prediction in the larger and more complex regulatory system of human cells. One of the main issues in analyzing data from human cells (or metazoans in general) is the large number of genes. In particular, the computational time of DeltaNeTS is expected to increase with the dimension of the dataset. One way to alleviate this issue is to use a dimensionality reduction method such as singular value decomposition (SVD) or clustering of highly correlated genes, as done in TSNI. However, dimensionality reduction using SVD comes, as expected, at the cost of decreased prediction accuracy (data not shown). Fortunately, the implementation of DeltaNeTS is ‘embarrassingly parallel’, since the regression is solved independently for each gene, and therefore lends itself to parallel computing.
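Because each gene's regression is independent of the others, the per-gene solves can be distributed over workers. The sketch below uses a thread pool from the Python standard library purely for illustration; for CPU-bound regressions in CPython a process pool or a cluster scheduler would usually be the more effective choice. The function `solve_all_genes` and its `solve_gene` callback are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_all_genes(solve_gene, n_genes, max_workers=4):
    """Run an independent per-gene solver for genes 0..n_genes-1.

    solve_gene: callable taking a gene index k and returning that gene's
                coefficient vector (for example, the k-th column of B).
    Results are returned in gene order, since Executor.map preserves order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(solve_gene, range(n_genes)))
```

Here `solve_gene` would wrap whichever LASSO or ridge routine is applied to yk = Xβk, so that the columns of B are assembled from the returned list.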

3.4. Summary

For inferring targets from time series gene expression data, DeltaNet was extended to incorporate an additional network constraint using the slopes of the log2FC data. In the new method, called DeltaNeTS, the underdetermined problem is solved using LASSO or ridge regression. For ridge regression, however, prior information on the structure of the GRN is required to alleviate the curse of dimensionality, i.e. to reduce the degrees of freedom in the inference problem. MATLAB and R packages of DeltaNeTS are available in a GitHub repository (https://github.com/CABSEL/DeltaNeTS). In the case studies of in silico and yeast data, DeltaNeTS outperformed the original DeltaNet and another method called TSNI in inferring the gene targets from time series transcriptional expression profiles. In particular, for yeast, the ability of DeltaNeTS to seamlessly combine steady state and time series expression data led to accurate target predictions, not only for time series experiments, but also for steady state samples. Moreover, in the Calu-3 data analysis, DeltaNeTS was able to shed light on the dynamic network perturbations, and eventually the biological pathways, involved in the response to interferons and influenza A infections.

3.5. Remark

A part of the work presented in this chapter has been published in the conference proceedings of the 6th Foundations of Systems Biology in Engineering (FOSBE) [72].


CHAPTER4: PROTINA | 41

4. Network perturbation analysis of gene transcriptional profiles reveals protein targets and mechanism of action of drugs and influenza A viral infection

The differential expression of genes induced by a drug treatment represents the downstream effects of the interactions between the drug and the intracellular molecules that control gene expression. In this case, the action of the drug compounds alters the regulatory activity of molecules that directly or indirectly control the gene transcriptional process, such as TFs or proteins that interact with TFs. Note that the expression (or the level) of the target molecules themselves is often not affected by their interactions with the drug [4]. In this chapter, the focus shifts from identifying gene targets to inferring protein targets of drugs. More specifically, a new network perturbation analysis method called ProTINA (Protein Target Inference by Network Analysis) is developed for identifying protein targets using gene transcriptional profiles. The two main steps in ProTINA are: (a) the creation of a model of the tissue or cell type-specific PGRN by leveraging the availability of comprehensive maps of protein-protein and protein-DNA (TF-gene) interactions, and (b) the calculation of protein target scores based on the enhancement or attenuation of the protein-gene regulatory activity. Here, the strength and mode of the protein-gene regulations are inferred from gene expression data using the same procedure as in DeltaNeTS. Using gene expression profiles from the NCI-DREAM drug synergy challenge, a genotoxicity study, and a drug targeting study, the superiority of ProTINA over the state-of-the-art method DeMAND and over DE in predicting the protein targets of drugs is demonstrated. Besides the protein targets of compounds, the application of ProTINA to study host-pathogen interactions is also presented.

4.1. Method and Materials

4.1.1. Protein target identification using ProTINA

4.1.1.1. Protein-gene network. The protein-gene regulatory network (PGRN) is a bipartite graph with weighted, directed edges pointing from a protein to a gene (see Figure 4-1(a)). The edges in the PGRN describe the regulation of gene expression by TFs and their protein partners. The PGRN is constructed by combining two types of networks, namely the TF-gene network and the PIN. For the construction of human cell type-specific PGRNs, the Regulatory Circuit resource, which provides 394 cell type and tissue-specific TF-gene interaction networks [59], was used. More specifically, for the analysis of the NCI-DREAM drug synergy, genotoxic compound study, and influenza A viral infection study datasets, the TF-gene networks of human lymphoma cells, pleomorphic hepatocellular carcinoma cells, and epithelium lung cancer cells were used, respectively. Here, only TF-gene interactions with a Regulatory Circuit confidence score greater than 0.1 were included. The confidence score indicates the normalized promoter activity level in a given cell type (0: not active, 1: maximally active) [59]. For the analysis of the mouse pancreatic cell dataset, the mouse pancreatic TF-gene interactions were obtained from CellNet [73]. In the construction of the PGRNs, any TF-gene interactions involving unmeasured genes were excluded. In summary, the TF-gene networks for the human lymphoma, hepatocellular carcinoma, and epithelium lung cancer cell lines included 31,392 edges pointing from 515 TFs to 5,153 genes, 3,868 edges pointing from 413 TFs to 953 genes, and 42,145 edges pointing from 515 TFs to 7,125 genes, respectively. The mouse pancreatic TF-gene network included 2,922 edges, involving 95 TFs and 588 genes.

Figure 4-1. Protein target prediction by ProTINA. (a) The protein-gene regulatory network (PGRN) describes direct and indirect regulations of gene expression by TFs and their protein partners (P), respectively. A drug interaction with a protein is expected to cause differential expression of the downstream genes in the PGRN. (b) Based on a kinetic model of the gene transcriptional process, ProTINA infers the weights of the protein-gene regulatory edges, denoted by $a_{kj}$, using gene expression data. The variable $a_{kj}$ describes the regulation of gene k by protein j, where the magnitude and sign of $a_{kj}$ indicate the strength and mode ($a_{kj} > 0$: activation, $a_{kj} < 0$: repression) of the regulatory interaction, respectively. (c) A candidate protein target is scored based on the deviations in the expression of downstream genes from the PGRN model prediction ($P_j$: log2FC expression of protein j, $G_k$: log2FC expression of gene k). The colored dots in the plots illustrate the log2FC data of a particular drug treatment, while the lines show the expression of gene k predicted by the (linear) PGRN model. The variable $z_k$ denotes the z-score of the deviation of the expression of gene k from the PGRN model prediction. A drug-induced enhancement of protein-gene regulatory interactions is indicated by a positive (negative) $z_k$ in the expression of genes that are activated (repressed) by the protein (i.e. $a_{kj} z_k > 0$). Vice versa, a drug-induced attenuation is indicated by a negative (positive) $z_k$ in the expression of genes that are activated (repressed) by the protein (i.e. $a_{kj} z_k < 0$). (d) The score of a candidate protein target is determined by combining the z-scores of the set of regulatory edges associated with the protein in the PGRN. A positive (negative) score indicates a drug-induced enhancement (attenuation). The larger the magnitude of the score, the more consistent are the drug-induced perturbations (enhancement/attenuation) on the protein-gene regulatory edges.

For the human PIN, the protein-protein interactions were combined from two databases, namely Enrichr [62] and STRING [74]. For mouse pancreatic cells, the mouse (Mus musculus) PIN was obtained from the STRING database. For each TF, its protein partners were defined as proteins that are within a network distance of 2 from the TF in the PIN. When using the STRING database, all direct protein partners of TFs were included, and proteins with a network distance of 2 from TFs were included only if their confidence score reported on STRING was larger than 0.5. For human lymphoma, hepatocytes, and lung cancer cells, the PINs include 11,090 protein partners for a subset of 499 TFs (out of 515 TFs), 10,834 protein partners for a subset of 403 TFs (out of 413 TFs), and 6,175 protein partners for a subset of 504 TFs (out of 515 TFs), respectively. For mouse pancreatic cells, the PIN includes 6,620 protein partners for a subset of 89 TFs (out of 95 TFs).

Finally, in the construction of the PGRNs, a directed edge was assigned from a TF, or from a protein partner of a TF, to every gene regulated by the TF. In summary, the cell type-specific PGRN for human lymphoma cells included 21,488,617 regulatory edges among 11,161 TFs/proteins and 5,153 genes. For hepatocellular carcinoma cells, the PGRN comprised 3,726,393 edges among 10,893 TFs/proteins and 953 genes. For human lung cancer cells, the PGRN comprised 30,656,861 edges among 11,346 TFs/proteins and 7,125 genes. For mouse pancreatic cells, the PGRN consisted of 1,417,972 edges among 6,661 TFs/proteins and 588 genes. Increasing the size of the PGRN, for example by including less confident TF-gene and protein-protein interactions or by including proteins with a network distance from TFs larger than 2, would allow the scoring of a higher number of proteins, but such a strategy often lowers the accuracy of the protein target predictions.
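The construction rule above (a TF and every protein within a network distance of 2 from it point to each gene regulated by that TF) can be illustrated with a minimal Python sketch. The function names and the toy identifiers are hypothetical, and the sketch deliberately omits the confidence-score filters of Regulatory Circuit and STRING described in the text; it is not the implementation used in this work.

```python
from collections import defaultdict

def partners_within_two(tf, pin):
    """Proteins at network distance 1 or 2 from `tf` in an undirected PIN."""
    direct = pin.get(tf, set())
    second = set()
    for protein in direct:
        second |= pin.get(protein, set())
    return (direct | second) - {tf}

def build_pgrn(tf_gene_edges, ppi_edges, measured_genes):
    """Return the set of directed (regulator, gene) edges of a toy PGRN."""
    # Undirected adjacency sets for the protein-protein interaction network
    pin = defaultdict(set)
    for a, b in ppi_edges:
        pin[a].add(b)
        pin[b].add(a)
    # TF -> regulated genes, excluding interactions with unmeasured genes
    targets = defaultdict(set)
    for tf, gene in tf_gene_edges:
        if gene in measured_genes:
            targets[tf].add(gene)
    # The TF and each of its partners get an edge to every gene of the TF
    pgrn = set()
    for tf, genes in targets.items():
        for regulator in {tf} | partners_within_two(tf, pin):
            for gene in genes:
                pgrn.add((regulator, gene))
    return pgrn

# Toy example with made-up identifiers
tf_gene = [("TF1", "G1"), ("TF1", "G2"), ("TF2", "G3"), ("TF1", "GX")]
ppi = [("TF1", "P1"), ("P1", "P2"), ("P2", "P3")]
pgrn = build_pgrn(tf_gene, ppi, measured_genes={"G1", "G2", "G3"})
```

In this toy case P1 and P2 inherit the targets of TF1, whereas P3 (distance 3) and the unmeasured gene GX are left out.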

4.1.1.2. Gene transcription model. The edges in the PGRN have weights, whose magnitudes represent the strength of the gene regulation and whose signs indicate the direction or the mode of the regulation: positive for gene activation and negative for gene repression. The weights are inferred from the gene expression dataset by adapting the procedure described in section 3.1.1.

While the regulatory edges in the model above usually describe TF-gene interactions, ProTINA further accounts for the (indirect) regulation of a gene by proteins that interact with the TFs. For this purpose, a modified version of the ODE model in Eq. 1-1 is considered as follows:

$$\frac{dr_k}{dt} = u_k \prod_{j=1}^{n_{TF}} \left[ r_j^{a_{kj}} \prod_{q=1}^{n_P} \left( r_j r_q \right)^{b_{kjq}} \right] - d_k r_k \tag{4-1}$$

where a positive (negative) $b_{kjq}$ describes the activation (repression) of the k-th gene by a protein q through its interaction with the TF protein j. The variables $n_{TF}$ and $n_P$ denote the numbers of TFs and their protein partners, respectively. The multiplication of the two variables $r_j$ and $r_q$ implies that the regulation of gene k by protein q requires the TF protein j (a non-zero $r_j$). The ODE model in Eq. 4-1 can be simplified into:

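$$\begin{aligned}
\frac{dr_k}{dt} &= u_k \prod_{j=1}^{n_{TF}} r_j^{\,a_{kj} + \sum_{q} b_{kjq}} \prod_{q=1}^{n_P} r_q^{\,\sum_{j} b_{kjq}} - d_k r_k \\
&= u_k \prod_{j=1}^{n_{TF}} r_j^{a_{kj}} \prod_{q=1}^{n_P} r_q^{a_{kq}} - d_k r_k \\
&= u_k \prod_{j=1}^{n_{TF}+n_P} r_j^{a_{kj}} - d_k r_k
\end{aligned} \tag{4-2}$$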

Here, the exponents collected for each TF j and for each protein partner q are relabeled as $a_{kj}$ and $a_{kq}$, so that in the last line $a_{kj}$ denotes the overall regulatory influence of each protein j, including TFs and their protein partners, on the expression of gene k. Note that the model in Eq. 4-2 is mathematically equivalent to that in Eq. 1-1.

By taking the pseudo-steady state assumption, the above model equation can be linearized using a logarithmic transformation (see section 1.2.3). The inference of the weights from the gene expression dataset involved the following linear regression problem:

nnTF P  (4-3) cki a kj c ji p ki j1 where cki denotes the log2FC expression for gene k in sample i. The variable pki represents the part of log2FC of gene k expression in sample i that cannot be accounted for by the log2FC of its protein regulators. In other words, pki indicates the perturbations to the expression of gene k. As detailed below, ProTINA relies on the magnitude and directions of such network perturbations (dysregulations) to identify proteins with altered gene regulatory activity.

The dynamical information contained in time series gene expression profiles could greatly improve the inference of the edge weights above. As previously described in Chapter 3, such information could be accounted for by adding the following linear constraint on the linear regression problem:

nnTF P  (4-4) ski  a kj s ji j1 where ski is the time derivatives (slope) of the log2FC of gene k in sample i. The slopes of the log2FC at each sampling time point were computed using a second-order accurate finite difference approximation [58]. In summary, the estimation of edge weights in ProTINA involved the following linear regression problem:

$$C_k = A_k C_{R_k} + P_k \tag{4-5}$$

$$S_k = A_k S_{R_k} \tag{4-6}$$

where $C_k$ and $S_k$ are the $1 \times m$ vectors of log2FC expressions and time derivatives of gene k across m samples, the subscript $R_k$ refers to the set of $(n_{TF,k} + n_{P,k})$ protein regulators of gene k in the cell type-specific PGRN, $C_{R_k}$ and $S_{R_k}$ denote the $(n_{TF,k} + n_{P,k}) \times m$ matrices of log2FCs and their slopes across m samples, $A_k$ is the $1 \times (n_{TF,k} + n_{P,k})$ vector of weights for edges in the PGRN pointing to gene k, and $P_k$ is the $1 \times m$ vector of dysregulation impacts of gene k over m samples.

In ProTINA, the vectors Ak and Pk for each gene k in Equations (4-5) and (4-6) were estimated by ridge regression. The ridge regression provides a solution to an underdetermined linear regression problem of the standard form: yk = Xβk + ε, using a penalized least square objective function:

$$\min_{\beta_k} \; \lVert y_k - X \beta_k \rVert_2^2 + \lambda \lVert \beta_k \rVert_2^2$$

where λ is a shrinkage parameter for the L2-norm penalty. Equations (4-5) and (4-6) are rewritten into the standard linear regression problem with $y_k = [\, C_k \;\; S_k \,]^T$, $X = [\, [\, C_{R_k} \;\; S_{R_k} \,]^T , \; [\, I_m \;\; 0 \,]^T \,]$, and $\beta_k = [\, A_k \;\; P_k \,]^T$. Before applying the ridge regression, the vectors of log2FCs and slopes were scaled as described in section 3.1.2 using Eq. 3-9 and 3-10. Self-loops were excluded in the regression, and thus the self-regulation weights $a_{kk}$ (the diagonal entries of the weight matrix) were set to 0. In the applications of ProTINA, 10-fold cross validation was employed to determine the optimal λ, i.e. the one that gives the minimum average prediction error. Here, the GLMNET package [35] was used for both the MATLAB and R versions of ProTINA.
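The stacking of Equations (4-5) and (4-6) into a single ridge regression can be sketched in Python with NumPy. This is only an illustration: ProTINA itself relies on GLMNET, whereas the sketch uses the closed-form ridge solution and a plain k-fold search over a user-supplied list of λ values, so the numerical value of λ is not comparable with GLMNET's parametrization. All function names and the synthetic data are assumptions, and the scaling of Eq. 3-9 and 3-10 is omitted.

```python
import numpy as np

def build_system(C_k, S_k, C_R, S_R):
    """Stack Eqs. (4-5) and (4-6) as y = X beta with beta = [A_k, P_k]^T."""
    m = C_k.size
    y = np.concatenate([C_k, S_k])                      # length 2m
    X = np.block([[C_R.T, np.eye(m)],                   # C_k = A_k C_R + P_k
                  [S_R.T, np.zeros((m, m))]])           # S_k = A_k S_R
    return X, y

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: argmin ||y - X b||^2 + lam * ||b||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_lambda(X, y, lambdas, n_folds=10, seed=0):
    """Pick the lambda with the smallest mean held-out squared error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    mean_errors = []
    for lam in lambdas:
        errs = []
        for held_out in folds:
            train = np.setdiff1d(np.arange(len(y)), held_out)
            b = ridge_fit(X[train], y[train], lam)
            errs.append(np.mean((y[held_out] - X[held_out] @ b) ** 2))
        mean_errors.append(np.mean(errs))
    return lambdas[int(np.argmin(mean_errors))]

# Synthetic example: 4 regulators of gene k, 12 samples
rng = np.random.default_rng(1)
n_reg, m = 4, 12
C_R, S_R = rng.normal(size=(n_reg, m)), rng.normal(size=(n_reg, m))
A_true = np.array([1.0, -0.5, 0.0, 2.0])
P_true = np.zeros(m)
P_true[3] = 1.5                                         # one perturbed sample
C_k, S_k = A_true @ C_R + P_true, A_true @ S_R

X, y = build_system(C_k, S_k, C_R, S_R)
lambdas = [1e-3, 1e-2, 1e-1, 1.0]
best_lam = cv_lambda(X, y, lambdas)
beta = ridge_fit(X, y, best_lam)
A_hat, P_hat = beta[:n_reg], beta[n_reg:]
```

Note that the identity block in X gives every sample its own perturbation unknown, so the penalty on $P_k$ and the slope constraint are what keep the weights $A_k$ identifiable.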

4.1.1.3. Protein target scoring. In ProTINA, each candidate protein target is assigned a score based on the deviation of the expression of its downstream genes. More specifically, the residuals of the linear regression problem in Equation (4-5) were computed for each gene k, i.e.

$$r_k = C_k - A_k C_{R_k} \tag{4-7}$$

where $r_k$ is the $1 \times m$ vector of residuals for m samples. For each drug treatment, there often exist multiple gene expression profiles, taken at different time points or different doses.

Correspondingly, the z-score zlk was evaluated for each drug treatment l and for each gene k, according to

$$z_{lk} = \frac{\bar{r}_{lk}}{\sigma_k / \sqrt{n_l}} \tag{4-8}$$

where $\bar{r}_{lk}$ denotes the average residual of gene k among the drug treatment samples, $\sigma_k$ denotes the sample standard deviation of the residuals in all samples besides the drug treatment, and $n_l$ denotes the number of samples from the drug treatment. A positive (negative) z-score indicates that the expression of gene k in the particular sample was higher (lower) than expected based on the expression of its regulators. The greater the magnitude of the z-score, the more significant is the gene dysregulation.
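A small Python sketch of the z-score in Eq. (4-8) for one gene and one drug treatment is given below. The function name and the toy residuals are hypothetical; the sketch assumes that the residuals of Eq. (4-7) have already been computed.

```python
import math
import statistics

def gene_zscore(residuals, treated_idx):
    """z-score of Eq. (4-8): mean treated residual over its standard error.

    residuals   : residuals r_k of one gene across all m samples
    treated_idx : indices of the samples belonging to drug treatment l
    """
    treated_set = set(treated_idx)
    treated = [residuals[i] for i in treated_idx]
    others = [r for i, r in enumerate(residuals) if i not in treated_set]
    # Sample standard deviation over all samples besides the drug treatment
    sigma_k = statistics.stdev(others)
    n_l = len(treated)
    return statistics.fmean(treated) / (sigma_k / math.sqrt(n_l))

# Toy residuals: four background samples, two samples of the treatment
residuals = [0.1, -0.1, 0.2, -0.2, 1.0, 1.2]
z = gene_zscore(residuals, treated_idx=[4, 5])
```

Here the treated samples lie well above the background spread, so the gene receives a large positive z-score, i.e. its expression is higher than expected from its regulators.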

The target score of a TF or protein for a drug is calculated by combining the z-scores of the target genes in the PGRN, as follows [75]:

$$s_{ji} = \frac{\sum_{k=1}^{n} w_{kj} z_{ki}}{\sqrt{\sum_{k=1}^{n} w_{kj}^2}} \tag{4-9}$$

where $z_{ki}$ denotes the z-score of gene k and $s_{ji}$ denotes the score of the TF/protein j in the drug treatment sample i. The weighting coefficients $w_{kj}$ are set equal to the edge weights $a_{kj}$ divided by the maximum magnitude of $a_{kj}$ across all j. In other words, the weight $w_{kj}$ reflects the fraction of the regulation of gene k expression that could be attributed to protein j. When $w_{kj}$ (or $a_{kj}$) and $z_{ki}$ have the same sign, $w_{kj} z_{ki}$ takes a positive value. As illustrated in Figure 4-1(c), a positive $w_{kj} z_{ki}$ implies an enhanced regulatory activity of protein j on gene k, since the activation (inhibition) of gene k expression by protein j is stronger in this sample than expected by the PGRN model. In contrast, a negative $w_{kj} z_{ki}$ indicates an attenuation of the regulatory influence of protein j on gene k, since the activation (inhibition) of gene k expression by protein j is weaker than predicted by the PGRN model. Consequently, a highly positive (negative) score $s_{ji}$ is an overall indicator of strongly enhanced (attenuated) regulatory activity of protein j by the drug treatment in sample i (see Figure 4-1(d)). The protein targets in each drug treatment sample are ranked in decreasing magnitude of the scores $s_{ji}$.
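The weighted combination in Eq. (4-9) can be sketched in plain Python as below. The function name and the toy numbers are hypothetical, and returning a score of 0 for a protein without any non-zero edge is an assumption of this sketch rather than a documented behaviour of ProTINA.

```python
import math

def target_scores(A, z):
    """Protein scores of Eq. (4-9) for one drug treatment sample.

    A[k][j] : inferred weight a_kj of the edge from protein j to gene k
    z[k]    : z-score of gene k in the sample
    """
    n_genes, n_proteins = len(A), len(A[0])
    # w_kj = a_kj / max_j |a_kj|, i.e. a row-wise normalization per gene
    W = []
    for row in A:
        scale = max(abs(a) for a in row)
        W.append([a / scale if scale > 0 else 0.0 for a in row])
    scores = []
    for j in range(n_proteins):
        numerator = sum(W[k][j] * z[k] for k in range(n_genes))
        denominator = math.sqrt(sum(W[k][j] ** 2 for k in range(n_genes)))
        # Assumption: proteins without any edge receive a zero score
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores

# Toy example: 3 genes (rows) and 2 candidate proteins (columns)
A = [[1.0, -0.5],
     [-2.0, 0.0],
     [0.5, 0.5]]
z = [2.0, -3.0, 1.0]
scores = target_scores(A, z)
# Rank proteins by decreasing magnitude of the score
ranking = sorted(range(len(scores)), key=lambda j: -abs(scores[j]))
```

For the first protein every product $w_{kj} z_{ki}$ is positive, giving a consistently enhanced regulation and a large positive score, while the contributions for the second protein cancel out.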

4.1.2. Gene expression data

ProTINA was applied to three datasets of drug treatments from the NCI-DREAM drug synergy challenge [76], a genotoxicity study [77] and a chromatin-targeting compound study [78], and to gene expression data of the human lung cancer cell line Calu-3 from influenza A viral infection studies [71, 79-81]. For the NCI-DREAM drug synergy challenge, the raw Affymetrix Human Genome U219 microarray data were compiled from the Gene Expression Omnibus (GEO) database (Barrett et al., 2013) (accession number: GSE51068). The raw data were first normalized and transformed into log2-scaled expressions using the justRMA function in the affy package of Bioconductor [42]. Then, the log2FC differential expressions and their statistical significance (Benjamini-Hochberg adjusted p-values) were calculated using a linear fit model and the empirical Bayes method in the limma package of Bioconductor. Three samples from the drug treatment using the low dose of Aclacinomycin A were dropped because all of the log2FC expressions were close to 1 and thus not statistically significant. The probe sets were mapped to gene symbols using the hgu219.db annotation package (Entrez Gene database as of 27th September 2015). In the case of multiple probe sets mapping to a gene symbol, the log2FC was taken from the probe set with the smallest average adjusted p-value over the samples.
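The probe-set selection rule in the last step can be illustrated with a few lines of Python. The actual preprocessing was carried out with Bioconductor packages in R; the function and the toy identifiers below are hypothetical and only mirror the selection logic.

```python
def select_probe_sets(probe_to_gene, adj_pvalues, log2fc):
    """Keep, for each gene symbol, the probe set with the smallest
    average adjusted p-value across samples and return its log2FC profile."""
    best = {}  # gene symbol -> (mean adjusted p-value, probe set id)
    for probe, gene in probe_to_gene.items():
        mean_p = sum(adj_pvalues[probe]) / len(adj_pvalues[probe])
        if gene not in best or mean_p < best[gene][0]:
            best[gene] = (mean_p, probe)
    return {gene: log2fc[probe] for gene, (_, probe) in best.items()}

# Toy example: two probe sets map to TP53, one to PCNA
probe_to_gene = {"ps1": "TP53", "ps2": "TP53", "ps3": "PCNA"}
adj_pvalues = {"ps1": [0.2, 0.4], "ps2": [0.01, 0.03], "ps3": [0.5, 0.5]}
log2fc = {"ps1": [0.1, 0.2], "ps2": [1.5, 1.1], "ps3": [-0.3, 0.0]}
gene_log2fc = select_probe_sets(probe_to_gene, adj_pvalues, log2fc)
```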

The raw microarray data from the genotoxicity study [77] in the human HepG2 cell line were obtained from GEO (accession numbers: GSE28878 using the Affymetrix GeneChip Human Genome U133 Plus 2.0 array and GSE58235 using the Affymetrix HT Human Genome U133+ PM array). As with the drug synergy data, the microarray data were first normalized using justRMA, and the log2FCs and their adjusted p-values were calculated using limma in Bioconductor. Because the data came from different microarray platforms, the gene symbols were matched separately for each platform using the hgu133plus2.db annotation package (Entrez database of 27th September 2015) and the HT_HG-U133_Plus_PM annotation file from Affymetrix, respectively. Likewise, in the case of multiple probe sets matching a gene symbol, the probe set with the smallest average adjusted p-value across all samples was chosen.

The raw data from the chromatin-targeting study using mouse pancreatic alpha and beta cells [78] were also obtained from the GEO database (accession number: GSE36379). Again, the raw data were normalized using justRMA, and the log2FCs and their adjusted p-values were calculated by limma. The probe sets were mapped to the corresponding gene symbols using the moe430a.db package (Entrez Gene database as of 27th September 2015) in Bioconductor. In the case of multiple probe sets mapping to a gene symbol, the probe set with the smallest average adjusted p-value among the samples was selected.

For the influenza A infection analysis, the raw microarray data of four influenza studies [71, 79-81] were obtained from the GEO database (accession numbers: GSE40844, GSE37571, GSE33142, and GSE28166). The raw data were background-corrected and normalized using the normexp and quantile methods in the limma package of Bioconductor. The log2FCs and their adjusted p-values were again calculated by limma. The probes were mapped to the corresponding gene symbols using the hgug4112a.db package (Entrez Gene database as of 27th September 2015). As before, for genes with multiple probe sets, the log2FC value corresponding to the probe set with the smallest average adjusted p-value was chosen.

4.1.3. DeMAND and differential expression analysis

For DeMAND analysis, the public R subroutines were obtained from the website: http://califano.c2b2.columbia.edu/demand. Following the procedure detailed in the original publication [19], the RMA normalized gene expression values were used as inputs to the analysis. In DeMAND analysis, the same cell type-specific PGRNs as those in ProTINA were utilized. For each candidate protein target, DeMAND evaluated the p-value of the deviations in the gene expression relationship between the protein target and each of the genes connected to this protein in the PGRN. The drug targets were ranked in increasing magnitude of the combined p-values.

In DE analysis, the log2FC values computed as described in section 4.1.2 were directly used as the target scores. Correspondingly, the candidate protein targets were ranked in decreasing magnitude of the log2FC gene expression values.

4.1.4. Gene set enrichment analysis

For influenza A virus study, a GSEA of the protein target predictions from ProTINA, DeMAND and DE analysis was performed for the KEGG biological pathways [82], using the R package GAGE (Generally Applicable Gene-set/pathway Enrichment analysis) with Kolmogorov-Smirnov tests [83]. In the case of ProTINA and DeMAND, target proteins with zero score were excluded from the GSEA.

4.1.5. Reference protein targets

The reference protein targets of compounds in the drug treatment studies were compiled from 5 different public databases of chemical-protein interactions: DrugBank [84], Therapeutic Target Database (TTD) [85], MATADOR [86], Comparative Toxicogenomics Database (CTD) [87], and STITCH [88]. DrugBank and TTD provided information on the mechanism of drug actions as well as the proteins that have physical binding interactions with drugs. Meanwhile, MATADOR, CTD, and STITCH gave interactions between proteins and chemical compounds, curated from text mining and experimental evidence. When retrieving the protein targets of drugs from these databases, only the proteins that directly bind to the queried drugs were considered. The reference targets for each dataset in this study are provided in Appendix Table C1-C3. Meanwhile, the reference protein targets for the influenza A virus study were obtained from ref. 89, where 1,292 host proteins that likely physically bind to viral proteins of influenza type A/WSN/33 in human embryonic kidney cells (HEK293) were identified by whole-genome co-immunoprecipitation assays.

4.2. Results

4.2.1. New protein target prediction strategy

ProTINA takes advantage of the availability of comprehensive protein-protein and protein-DNA interaction databases to construct, when possible, a tissue or cell type-specific PGRN. The method considers a PGRN with weighted directed edges (see Figure 4-1(a)), describing direct and indirect gene transcriptional regulation by TFs and their protein partners. The edge weights are determined by applying ridge regression to the gene expression data based on a kinetic model of the gene transcriptional process (see Figure 4-1(b) and section 4.1.1.2). Here, a positive weight indicates gene activation, while a negative weight implies gene repression. Because of the underlying kinetic model, ProTINA is able to incorporate dynamical gene expression data, a common type of data from drug treatment studies [9, 76-78]. The scoring of drug targets is based on the enhancement or attenuation of protein-gene regulatory interactions caused by the drug treatment. A drug-induced gene regulatory enhancement occurs when the expression of genes that are positively (negatively) regulated by a candidate target becomes higher (lower) in drug-treated samples than what is predicted by the PGRN model (see Figure 4-1(c)). A drug-induced attenuation describes the opposite scenario, where the expression of positively (negatively) regulated genes of a target is lower (higher) than expected from the model. For any given differential gene expression sample, a candidate protein target is scored based on the overall enhancement and/or attenuation of its regulatory influence on the downstream genes (see Figure 4-1(d) and section 4.1.1.3). Thus, a protein with a more positive (negative) score is considered a more likely target of the drug, whose gene regulatory activity is enhanced (attenuated) by the drug treatment.

4.2.2. Prediction of known targets of drugs

The performance of ProTINA was tested in predicting drug targets using gene expression data from three drug treatment studies employing human and mouse cell lines. The first dataset came from the NCI-DREAM drug synergy study using human diffuse large B cell lymphoma OCI-LY3 [76], the second from the compound genotoxicity study using human liver cancer cells HepG2 [77], and the third from the chromatin-targeting compound study using mouse pancreatic cells [78]. ProTINA was compared to the state-of-the-art network-based analytical method DeMAND [19], and to the traditional DE. For the analysis of datasets from human cell lines, cell-type specific PGRNs were constructed by combining human PIN from STRING [74] and Enrichr database [62] and human cell-type specific protein-DNA networks from Regulatory Circuit resource [59]. Meanwhile, for the construction of mouse pancreatic cell type-specific PGRN, mouse (Mus musculus) PIN from STRING [74] and mouse protein-DNA interactions from CellNet [73] were utilized (see details in Material and Methods).

In assessing the performance of ProTINA and the other methods, the ranked list of protein target predictions for each compound was compared with the reference drug targets compiled from the literature (see section 4.1.5 and Appendix Table C1-C3). In the ranked list for each method, protein target candidates were ranked according to decreasing magnitudes of the protein scores in ProTINA, increasing p-values of network dysregulation from DeMAND, and decreasing magnitudes of log2FC gene expression from DE analysis. Figure 4-2 (also see Appendix Table C4-C6) summarizes the AUROCs of the target predictions from ProTINA, DeMAND, and DE analysis, calculated in a similar manner as in section 2.1.3, showing ProTINA significantly outperforming DeMAND and DE analysis for all three datasets. Here, the drug target predictions from DE analysis had the poorest AUROCs, with an overall average below 0.664 (AUROC range: 0.393 – 0.982). The target predictions of DeMAND were slightly better than those of DE analysis, averaging 0.743 (AUROC range: 0.405 – 0.989) for the three datasets. ProTINA gave the highest AUROCs among the methods, with an average of 0.825 (AUROC range: 0.425 – 0.991).
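To make the evaluation concrete, the sketch below computes an AUROC from a list of target scores and a set of reference targets, using the rank-based (Mann-Whitney) formulation and ranking by score magnitude as done for ProTINA. The exact procedure of section 2.1.3 is not reproduced here, and the function name and toy values are hypothetical.

```python
def auroc_by_magnitude(scores, reference_targets):
    """AUROC of a ranking by decreasing |score| against reference targets.

    Equals the probability that a randomly chosen reference target is
    ranked above a randomly chosen non-target (ties count one half).
    """
    pos = [abs(s) for p, s in scores.items() if p in reference_targets]
    neg = [abs(s) for p, s in scores.items() if p not in reference_targets]
    if not pos or not neg:
        raise ValueError("need at least one target and one non-target")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy scores for four candidate proteins; B is the only reference target
scores = {"A": 3.0, "B": -2.5, "C": 0.4, "D": -0.1}
auc = auroc_by_magnitude(scores, reference_targets={"B"})
```

In the toy case the reference target B is outranked by one of the three non-targets, so the AUROC equals 2/3. For DeMAND, the same function could be applied to negated p-values so that smaller p-values rank first.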

Figure 4-2. Prediction of known targets of drugs. AUROCs of protein target predictions from ProTINA, DeMAND and DE methods for the NCI-DREAM drug synergy (human B-cell lymphoma), the compound genotoxicity (human HepG2) and the chromatin-targeting study (mouse pancreatic cell) datasets (*: p-value < 0.01, **: p-value < 0.001 by paired t-test).

4.2.3. Mechanism of action of drugs

Besides high AUROCs, ProTINA also provided accurate and specific indications of the MoA of the compounds. In the NCI-DREAM synergy study, roughly half of the compounds are known to cause DNA damage, including DNA topoisomerase inhibitors (camptothecin, doxorubicin and etoposide), a DNA crosslinker (mitomycin C), an oxidative DNA damaging agent (methotrexate), and a histone deacetylase (HDAC) inhibitor (trichostatin A). In demonstrating ProTINA’s ability to reveal the compound MoA, I focused on the canonical p53 DNA damage response pathway [19], as illustrated in Figure 4-3. Here, the activation of p53 in response to DNA damage is expected to induce the transcription of Cyclin Dependent Kinase Inhibitor 1A (CDKN1A) and Growth Arrest and DNA Damage Inducible Alpha (GADD45A) [90-91]. In turn, CDKN1A and GADD45A, through their interactions with Proliferating Cell Nuclear Antigen (PCNA), regulate the DNA replication and repair process [92]. GADD45A also inhibits the catalytic activity of Aurora Kinase A (AURKA) [93], leading to a lowered activation of Polo-like Kinase 1 (PLK1) and Cyclin B1 (CCNB1) in a phosphorylation cascade [94-95]. As shown in Figure 4-4(a), except for trichostatin A, the six proteins in the canonical p53 pathway above were ranked highly by ProTINA among the genotoxic compounds in the study (median rank <500), consistent with their known MoA. Note that the same six proteins were ranked much lower among the non-DNA damaging compounds (median rank >500), signifying a high specificity of ProTINA predictions (see also Appendix Figure C1). Equally important, ProTINA was able to accurately identify the direction of the drug-induced alterations caused by the DNA damaging compounds. The signs of protein target scores from ProTINA indicated drug-induced enhancement (positive scores) of CDKN1A, PCNA, and GADD45A, and attenuation (negative scores) of CCNB1, AURKA, and PLK1 (see Appendix Table C7), consistent with the expected response of these proteins to DNA damage in Figure 4-3.

Figure 4-3. Canonical p53 DNA damage response pathway. In response to DNA damage, GADD45A, CDKN1A, and PCNA are activated, while AURKA, CCNB1, and PLK1 proteins are inhibited [19].

As illustrated in Figure 4-4(a), DeMAND and DE analysis also performed reasonably well in predicting the compounds’ MoA. However, the directions of the perturbations predicted by DE analysis were not always consistent with the expected response to DNA damage (see Appendix Table C8-C9). Meanwhile, DeMAND did not provide any information on the directions of the drug perturbations. In addition, the protein target scores of ProTINA provided a clearer demarcation between the genotoxic and the non-genotoxic agents among the compounds in the dataset than DeMAND and DE analysis (see Appendix Figure C1). Besides the canonical p53 response pathway, the rankings of proteins involved in the overall DNA damage repair (DDR) and its associated pathways [96] were further examined (see Appendix Table C10-C11). As depicted in Figure 4-4(b), ProTINA ranked these proteins much higher than DeMAND and DE analysis, with DE performing the poorest among the methods considered.

Figure 4-4. Mechanism of action of compounds based on target predictions by ProTINA. (a) The rank distribution of the canonical p53 DNA damage response proteins in the drug target predictions of ProTINA, DeMAND and DE for the NCI-DREAM drug synergy dataset. (b) The rank distribution of proteins involved in the core DNA damage repair (DDR) and DDR-associated pathways [96] in the target predictions of ProTINA, DeMAND, and DE for the DNA damaging compounds in the NCI-DREAM drug synergy study (**: p-value < 0.001 by Wilcoxon signed rank tests).

In comparison to DeMAND and DE analysis, ProTINA was further able to detect a specific MoA of mitomycin C, whose DNA crosslinking activity is expected to prompt a particular DNA repair process called the Fanconi anemia pathway [97]. The Fanconi anemia pathway relies on a specific protein complex to ubiquitinate Fanconi Anemia Group D2 Protein (FANCD2) and Fanconi Anemia Group I Protein (FANCI), as well as two homologous recombination (HR) repair proteins, namely Breast Cancer Type 1 Susceptibility Protein (BRCA1) and RAD51 Recombinase (RAD51) [98]. In the ProTINA analysis, the average rank of FANCD2, FANCI, BRCA1, and RAD51 was within the top 100 for mitomycin C, while the average rank of those proteins was much greater than 100 for the other DNA damaging agents (see Appendix Table C12). In contrast, the specific activation of the Fanconi anemia pathway by mitomycin C was not detected by DeMAND or DE analysis. Thus, ProTINA provided more sensitive and specific indications of the mechanism of action of compounds than DeMAND and DE.

4.2.4. Application of ProTINA for predicting pathogen-host interactions

Next, ProTINA was applied to time-course gene expression profiles of human lung cancer cells (Calu-3) under influenza A virus infection, with the goal of identifying host factors that interact with the viral proteins. The gene expression data came from four studies of influenza A viruses, including A/Netherlands/602/2009 (H1N1), A/CA/04/2009 (H1N1), and A/Vietnam/1203/2004 (H5N1) [71, 79-81]. Here, ProTINA was employed to compute the overall protein target scores using the gene expression data of Calu-3 from the four studies above, by averaging the scores from the early phase of the influenza infection between 0 and 12 hours. The target predictions of ProTINA were compared to the findings from a genome-wide co-immunoprecipitation analysis of host and viral protein interactions [89]. More specifically, the aforementioned study reported 1,292 host proteins that co-immunoprecipitated with viral proteins of influenza A/WSN/33 using human embryonic kidney cells (HEK293). Despite the discrepancy in the cell types and influenza viral strains between the co-immunoprecipitation analysis and the gene expression profiling, influenza A viruses share similar features and common protein interactions [99-100]. Besides ProTINA, the accuracy of viral target predictions from DeMAND and DE was also evaluated for the same dataset.

Figure 4-5 gives the receiver operating characteristic (ROC) curves of the target predictions from ProTINA, DeMAND and DE analysis. ProTINA outperformed the two other methods, providing the highest AUROC (ProTINA: 0.758 vs. DeMAND: 0.687 and DE: 0.647). Furthermore, a GSEA for the target predictions from each of the methods (see section 4.1.4) was performed to elucidate the key pathways involved in the viral infection and the accompanying host response. The results of the GSEA are summarized in Figure 4-6. Both DeMAND and DE target predictions were enriched for only a few pathways (q-value < 0.01), while the ProTINA predictions had a much higher number of overrepresented pathways.

Figure 4-5. Prediction of targets of influenza A virus. The receiver operating characteristic curves give the true positive rate versus the false positive rate relationship of the protein target predictions from ProTINA, DeMAND, and DE against proteins that co-immunoprecipitate with influenza A viral proteins. The AUROCs for ProTINA, DeMAND and DE analysis are 0.758, 0.687 and 0.647, respectively.

The common enriched pathways among ProTINA, DeMAND and DE (top of Figure 4-6) included known mechanisms related to viral entry, replication and assembly, including endocytosis [101], protein processing in endoplasmic reticulum [102], ubiquitin mediated proteolysis [103-104] and RIG-I-like receptor signaling pathway [105-106]. Both ProTINA and DE analysis indicated the modulation of host cell cycle [107], mRNA surveillance [108] and DNA damage response [109]. Only the ProTINA predictions were significantly enriched for focal adhesion and actin cytoskeleton, which have been shown to regulate influenza virus entry at the early stage of infection [110]. In addition, ProTINA target predictions were also enriched for a broad set of host response pathways to viral infection, including host defense mechanisms (e.g., T- and B-cell receptor pathways, phagocytosis, leukocyte migration, chemokine signaling pathways), DNA damage repair (e.g., nucleotide excision repair, p53 signaling pathway, homologous recombination) and apoptosis. As several influenza proteins are known to interfere with interferon production (which in turn regulates several cytokines) [105-106], these findings suggest that, overall, ProTINA provided a broader picture of the early events in the influenza A viral infection than DeMAND and DE analysis.

Figure 4-6. Gene set enrichment analysis for KEGG pathways for the influenza A protein target predictions from ProTINA, DeMAND, and DE. The size and color of dots correspond to the –log10 of the q-values. Only pathways with q-value < 0.01 are shown.

4.3. Discussion

ProTINA is a novel and highly effective network-based analytical method for inferring the protein targets of compounds from gene expression profiling data. The target scoring is based on quantifying perturbations in the protein-gene regulatory network (PGRN), specifically the enhancement or attenuation of gene regulatory interactions caused by the compound treatment. ProTINA combines information on TF-gene and protein-protein interactions with differential gene expression data to create a tissue- or cell type-specific PGRN model. In the applications to three benchmark drug treatment datasets using human and mouse cell lines, ProTINA significantly outperformed the state-of-the-art algorithm DeMAND, which also relies on network dysregulation scoring, and the standard DE analysis. The target predictions of ProTINA also provide indications for the MoA of compounds, including the directions of the network perturbations, with high sensitivity and specificity.

Both ProTINA and DeMAND score the protein targets of compounds based on gene regulatory network perturbations. In particular, DeMAND calculates protein dysregulation scores (p-values) for a given gene regulatory network by statistical comparison between samples from drug treatment and from control experiments. Thus, DeMAND requires only a few samples to generate its prediction (provided that the network can be defined a priori). On the other hand, ProTINA makes use of available differential gene expression profiles from a study or a cell line (i.e., not only for a particular drug) to assign the edge weights of the PGRN by ridge regression. Importantly, in the regression analysis, the PGRN model used in ProTINA accounts for the network perturbations. The ability of ProTINA to incorporate data from other drug treatments or conditions in the scoring of protein targets makes this method particularly suited to take advantage of the extensive and still growing number of gene transcriptional profiles from publicly accessible databases such as GEO.
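To illustrate how ridge regression can assign weights when there are more candidate regulators than samples, the following is a minimal sketch for a single target gene. It is not the ProTINA implementation and does not reproduce the PGRN model; the function name `ridge_weights`, the data layout (rows of `X` are samples, columns are candidate regulators) and the penalty value are assumptions made for illustration only.

```python
def ridge_weights(X, y, lam):
    """Illustrative ridge regression: solve (X^T X + lam * I) w = X^T y.

    X   : list of samples, each a list of regulator activities (hypothetical)
    y   : list of target-gene log2 fold changes, one per sample (hypothetical)
    lam : ridge penalty (> 0), normally chosen by cross validation
    """
    n, m = len(X), len(X[0])
    # Build the regularized normal equations.
    A = [[sum(X[s][i] * X[s][j] for s in range(n)) + (lam if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    b = [sum(X[s][i] * y[s] for s in range(n)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    w = [0.0] * m
    for i in reversed(range(m)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, m))) / A[i][i]
    return w


# One sample, two candidate regulators: ordinary least squares has no unique
# solution here, whereas the ridge penalty makes the solution unique.
weights = ridge_weights([[1.0, 1.0]], [2.0], lam=1.0)
```

In this toy case the two regulators are indistinguishable from the data, and the ridge penalty splits the weight equally between them (2/3 each). Increasing `lam` shrinks both weights toward zero.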

Besides its intended use to predict targets of compounds, the analysis of network perturbations using ProTINA could provide insights into the mechanisms of diseases. In the application to gene expression profiles of Calu-3 cells from influenza A infection studies, ProTINA again outperformed DeMAND and DE analysis in identifying host factors that bind to viral proteins. Furthermore, the GSEA of ProTINA target predictions revealed the spectrum of cellular processes involved in the early phase of influenza A infection, including pathways involved in viral entry, replication and assembly, and those related to cellular response to viral infection. Among the pathways with the highest significance (lowest q-value) was focal adhesion, which has been shown to regulate influenza viral entry as well as replication [110]. Meanwhile, the target predictions of DeMAND and DE analysis had fewer enriched pathways, and thus were less informative than the target analysis by ProTINA.

4.4. Summary

ProTINA is a general network-based analysis tool for identifying protein targets of compounds, using a protein-gene network model and genome-scale gene expression profiles. In both drug and pathogen target analysis, ProTINA outperformed the state-of-the-art method DeMAND and differential expression analysis with regard to the accuracy of predicting known targets of drugs and influenza A viruses. Besides known target predictions, ProTINA could provide sensitive and specific predictions of a compound’s mechanism of action. The MATLAB and R packages of ProTINA are available in a GitHub repository (https://github.com/CABSEL/ProTINA).

4.5. Remarks

The work presented in this chapter has been published in the journal Nucleic Acids Research. For presenting this work, I won the prize for Foundations in Mathematics and Informatics for Computer Simulations in Science and Engineering (FoMICS) at PASC 2017 in Lugano, Switzerland, and the best poster award at the 15th ICMSB in Raitenhaslach, Germany.

5. Conclusions

The identification of the molecular targets of compounds has great importance not only in drug discovery, but also in disease studies. For economic and technical reasons, gene expression profiling has been increasingly used in recent times for drug target identification. Many computational methods have concomitantly been developed for inferring the targets from gene expression profiles. In my PhD work, I have focused on the development of strategies based on a network analysis approach using a generalized mass action model of the gene transcriptional process. The three methods presented in this thesis, namely DeltaNet, DeltaNeTS and ProTINA, provide significant advances over existing methods, such as the ability to accommodate different types of gene expression data (i.e. steady-state, time series, or combined data) and comprehensive knowledge of protein-protein and TF-gene interactions, resulting in much improved accuracy in drug target predictions.

The curse of dimensionality is a prominent issue that affects the inference of drug targets from gene expression profiles. This issue leads to an underdetermined inverse problem. In this thesis, I employed several strategies, including regularization, stepwise variable selection and prior information on network structure, to tackle the curse of dimensionality. Regularization strategies enforce constraints on the drug target inference problem, for example by putting bounds on the L1 and L2 norms of the solution as in LASSO and ridge regression, respectively. As the optimal bound is problem specific and is set by cross validation, regularization can incur high computational cost. Variable selection strategies, specifically stepwise algorithms such as LAR, are more computationally favorable than regularization. However, stepwise selection algorithms often produce models that are too small and suboptimal [111]. Besides adopting regularization and variable selection algorithms, I also leveraged extensive online databases on protein-protein and TF-gene interactions to reduce the degrees of freedom in the inference problem. DeltaNet, DeltaNeTS and ProTINA incorporate one or a combination of the above strategies to provide not only accurate target predictions, but also high computational efficiency.
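For concreteness, the two norm constraints mentioned above can be written for a generic linear regression problem. The symbols used here (response vector $y$, regressor matrix $X$, coefficient vector $a$, bound $T$ and penalty weight $\lambda$) are generic placeholders chosen for illustration and are not necessarily the notation of the earlier chapters.

```latex
\begin{align}
\text{LASSO:}\quad & \hat{a} = \arg\min_{a}\ \lVert y - Xa \rVert_2^2
    \quad \text{subject to}\quad \lVert a \rVert_1 \le T \\
\text{Ridge:}\quad & \hat{a} = \arg\min_{a}\ \lVert y - Xa \rVert_2^2
    \quad \text{subject to}\quad \lVert a \rVert_2^2 \le T
\end{align}
```

Each bound $T$ corresponds to a penalty weight $\lambda \ge 0$ in the equivalent penalized forms, $\lVert y - Xa \rVert_2^2 + \lambda \lVert a \rVert_1$ and $\lVert y - Xa \rVert_2^2 + \lambda \lVert a \rVert_2^2$. For $\lambda > 0$ the ridge problem has a unique solution even when there are fewer samples than unknowns, which is what makes it usable for an underdetermined problem.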

Besides the curse of dimensionality, an important limitation of existing methods for drug target identification using gene expression profiles is the inability to integrate steady state and time series data. Existing methods, such as MNI and SSEM, apply the pseudo steady state assumption in formulating the target inference problem. As I have shown in Chapter 3, the application of these strategies, including also DeltaNet, to time series data may be affected by complications related to reverse causality in the GRN inference. Meanwhile, methods such as TSNI are specifically targeted at time series datasets, but are not able to utilize any available steady state data. The ability to seamlessly incorporate both steady state and time series gene expression profiles represents a key advantage of the more advanced methods in this thesis, including DeltaNeTS and ProTINA. As different types of data carry different kinds of information about the network, the combination of such data is highly desirable. As demonstrated in section 3.2.1, by combining steady state and time series datasets, DeltaNeTS outperformed DeltaNet in prediction accuracy, not only for time series experiments, but also for steady state samples.

While the methods presented in this thesis shared the same starting point, specifically a generalized mass action model of the gene transcriptional process, the scoring of the targets was nevertheless different. In DeltaNet and DeltaNeTS, the target scoring was based on the deviation in the log2FC expression between the gene expression profiles from a drug treatment experiment and the model prediction. If the GRN is represented as a directed graph with genes as its nodes and regulatory interactions as its directed edges, then the scoring in DeltaNet and DeltaNeTS was based on nodal perturbations. Meanwhile, in ProTINA, the candidate targets were scored based on deviations in the regulatory interactions, i.e. enhancement and attenuation of protein-gene regulatory activity. In other words, ProTINA considered perturbations to the edges of the GRN. For this reason, ProTINA was able to identify protein targets of drugs whose expression may not be affected by the drug treatment. The aforementioned difference between DeltaNet/DeltaNeTS and ProTINA implies that the information generated by the two groups of methods is complementary. While combining the results from these groups of strategies was not explored in this thesis, such a strategy should be explored in a future study.

Finally, as demonstrated in the applications to gene expression profiles from different types of experiments, including drug treatment studies, and from different organisms and cell types, the methods developed in this thesis significantly outperformed the state-of-the-art algorithms in the literature. Each method in the thesis was compared against existing method(s) that shared a similar approach, so as to keep the comparison as fair as possible. For example, DeltaNet was compared with MNI and SSEM, strategies that used the same mechanistic model of the gene transcriptional process. DeltaNeTS was compared with TSNI, a method specifically developed for handling time series gene expression data. Finally, ProTINA was compared with DeMAND, a method that relies on perturbations to the gene regulatory network to identify drug targets. Besides drug target identification, several case studies further illustrated the application of the methods in this thesis to elucidate disease mechanisms, such as influenza A viral infection.

6. Outlook

In this PhD work, the developed methods have been mainly tested using gene expression datasets from drug treatment and genetic perturbation studies. The accuracy of the target predictions was evaluated by comparing the predictions against known molecular targets of drugs or known genetic mutations. Therefore, the experimental validation of predicted genes/proteins that were highly ranked but previously unreported to be the targets of a drug would be of great interest, especially for identifying novel drug targets and for drug repurposing. The experimental validation may involve drug treatment using cells with the candidate target knocked out or down (i.e. a pharmacogenomics study), to confirm whether the drug effects are abrogated. Besides novel drug targets, the experimental investigation of highly ranked genes/proteins predicted from gene expression profiles of diseases may lead to new interventions or treatments. For example, in influenza A viral infection, highly enriched pathways such as focal adhesion and actin cytoskeleton in the ProTINA target prediction (see Chapter 4) should be explored as an avenue to treat influenza A infection.

Another important future application of the methods developed in this thesis would be drug-drug similarity analysis. In spite of different mechanisms of action, compounds may induce similar differential gene expressions. Thus, instead of log2FC, ProTINA or DeltaNet/DeltaNeTS scores could be used for computing drug-drug similarity based on the predicted molecular targets. For this purpose, the integration of protein target predictions from ProTINA and gene target predictions from DeltaNet/DeltaNeTS should be explored. The creation of a drug-drug similarity network could provide a great tool for drug repurposing, since the mechanisms of a large number of compounds with unknown mechanism of action could be predicted more efficiently based on their similarity to drugs with known mechanisms and indications.
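As a rough illustration of this idea, the sketch below computes pairwise drug-drug similarities from per-drug target score vectors. Cosine similarity is only one possible choice and is an assumption here, since the thesis does not specify a similarity measure; the drug names and score values are hypothetical placeholders, not results.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equally long score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention: an all-zero score vector is similar to nothing
    return dot / (norm_u * norm_v)


def similarity_network(scores):
    """Return {(drug_a, drug_b): similarity} for all unordered drug pairs.

    scores maps a drug name to its vector of target scores, with every
    vector indexed by the same ordered list of candidate targets.
    """
    drugs = sorted(scores)
    return {
        (a, b): cosine_similarity(scores[a], scores[b])
        for i, a in enumerate(drugs)
        for b in drugs[i + 1:]
    }


# Hypothetical target scores for three drugs over four candidate targets.
example_scores = {
    "drug_A": [2.0, 0.0, 1.0, 0.0],
    "drug_B": [4.0, 0.0, 2.0, 0.0],   # same target profile as drug_A, scaled
    "drug_C": [0.0, 3.0, 0.0, 1.0],   # disjoint target profile
}
network = similarity_network(example_scores)
```

Edges of the resulting network could then be thresholded so that only drug pairs with high target-level similarity remain connected.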

In disease studies as well as drug treatment studies, phenotype is a key factor determining the efficacy of therapeutic targets. Thus, an improvement to network perturbation analysis strategies would be to take phenotypic information into account. Inferring network perturbations in the context of a specific phenotype or phenotypes would improve the prediction of phenotypically relevant molecular targets. For example, considering a phenotype that can be represented as a simple binary or continuous variable, such as cellular viability, would be a good starting point to advance the existing methods. Once established, the phenotypic variable can be incorporated in the regression problem in the target inference.

REFERENCES

[1] Ashburn,T.T. and Thor,K.B. (2004) Drug repositioning: identifying and developing new uses for existing drugs. Nat. Rev. Drug Discov., 3, 673–683. [2] Imming,P., Sinning,C. and Meyer,A. (2006) Drugs, their targets and the nature and number of drug targets. Nat. Rev. Drug Discov., 5, 821–834. [3] Schenone,M., Dančík,V., Wagner,B.K. and Clemons,P.A. (2013) Target identification and mechanism of action in chemical biology and drug discovery. Nat. Chem. Biol., 9, 232– 240. [4] Isik,Z., Baldow,C., Cannistraci,C.V. and Schroeder,M. (2015) Drug target prioritization by perturbed gene expression and network information. Sci. Rep., 5. [5] Chua, H. N., and F. P. Roth. (2011). Discovering the targets of drugs via computational systems biology. J. Biol. Chem. 286, 23653–23658. [6] Hughes, T. R., M. J. Marton, a R. Jones, C. J. Roberts, R. Stoughton, C. D. Armour, H. a Bennett, et al. (2000). Functional discovery via a compendium of expression profiles. Cell 102, 109–126. [7] Lamb, J., E. Crawford, D. Peck, and J. Modell. (2006). The Connectivity Map: using gene- expression signatures to connect small molecules, genes, and disease. Science, 313, 1929– 1936. [8] Iorio, F., R. Bosotti, E. Scacheri, V. Belcastro, P. Mithbaokar, R. Ferriero, L. Murino, et al. (2010). Discovery of drug mode of action and drug repositioning from transcriptional responses. Proc. Natl. Acad. Sci. U. S. A., 107, 14621–14626. [9] Lamb,J. (2007) The Connectivity Map: a new tool for biomedical research. Nat. Rev. Cancer, 7, 54–60. [10] Iorio,F., Rittman,T., Ge,H., Menden,M. and Saez-Rodriguez,J. (2013) Transcriptional data: A new gateway to drug repositioning? Drug Discov. Today, 18, 350–357. [11] Chindelevitch,L., Ziemek,D., Enayetallah,A., Randhawa,R., Sidders,B., Brockel,C. and Huang,E.S. (2012) Causal reasoning on biological networks: Interpreting transcriptional changes. Bioinformatics, 28, 1114–1121. [12] Martin,F., Thomson,T.M., Sewer,A., Drubin,D. a, Mathis,C., Weisensee,D., Pratt,D., Hoeng,J. 
and Peitsch,M.C. (2012) Assessment of network perturbation amplitudes by applying high-throughput data to causal biological networks. BMC Syst. Biol., 6. [13] Belcastro,V. et al. (2013) Systematic verification of upstream regulators of a computable cellular proliferation network model on non-diseased lung cells using a dedicated dataset. Bioinform. Biol. Insights, 7, 217–30.

[14] Lachmann,A. and Maayan,A. (2009) KEA: Kinase enrichment analysis. Bioinformatics, 25, 684–686. [15] Chen,E.Y. et al. (2012) Expression2Kinases: mRNA profiling linked to multiple upstream regulatory layers, Bioinformatics, 28, 105-111. [16] Laenen,G. et al. (2015) Galahad: a web server for drug effect analysis from gene expression. Nucleic Acids Res., 43, W208–W212. [17] Koschmann,J. et al. (2015) Upstream Analysis: An Integrated Promoter-Pathway Analysis Approach to Causal Interpretation of Microarray Data. Microarrays, 4, 270–286. [18] Lefebvre,C. et al. (2010) A human B-cell interactome identifies MYB and FOXM1 as master regulators of proliferation in germinal centers. Mol. Syst. Biol., 6. [19] Woo,J.H., Shimoni,Y., Yang,W.S., Subramaniam,P., Iyer,A., Nicoletti,P., Rodríguez Martínez,M., López,G., Mattioli,M., Realubit,R., et al. (2015) Elucidating Compound Mechanism of Action by Network Perturbation Analysis. Cell, 162, 441–451. [20] Gardner,T.S. (2003) Inferring Genetic Networks and Identifying Compound Mode of Action via Expression Profiling. Science, 301, 102–105. [21] di Bernardo,D., Thompson,M.J., Gardner,T.S., Chobot,S.E., Eastwood,E.L., Wojtovich,A.P., Elliott,S.J., Schaus,S.E. and Collins,J.J. (2005) Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nat. Biotechnol., 23, 377–383. [22] Cosgrove, E. J., Y. Zhou, T. S. Gardner, and E. D. Kolaczyk. (2008). Predicting gene targets of perturbations via network-based filtering of mRNA expression compendia. Bioinformatics, 24, 2482–2490. [23] Bansal, M., G. Della Gatta, and D. di Bernardo. (2006). Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics, 22, 815–822. [24] Liao, J. C., R. Boscolo, Y.-L. Yang, L. M. Tran, C. Sabatti, and V. P. Roychowdhury. (2003). Network component analysis: reconstruction of regulatory signals in biological systems. Proc. Natl. Acad. Sci. U. S. A., 100, 15522–15527. 
[25] Tarca,A. et al. (2006) Analysis of microarray experiments of gene expression profiling. Am. J. Obstet. Gynecol., 195, 373–388. [26] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B, 58, 267–288. [27] De Boor, C. (2001). A Practical Guide To Splines. Springer, NY. [28] Ljung, L. (1999) Systems Identification: Theory for the User. Prentice Hall, Upper Saddle River, NJ.

[29] Szederkényi,G. et al. (2011) Inference of complex biological networks: distinguishability issues and optimization-based solutions. BMC Syst. Biol., 5, 177. [30] Ud-Dean,S.M.M. and Gunawan,R. (2014) Ensemble inference and inferability of gene regulatory networks. PLoS One, 9. [31] Efron,B.B. et al. (2004) Least angle regression. Ann. Stat., 32, 407–499. [32] Duda,R.O., Hart,P.E., Stork,D.G. (2011) Pattern Classification, Wiley, NY, 2nd-ed. [33] Hoerl,A.E. and Kennard,R.W. (1970) Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12, 55–67. [34] Sjöstrand,K. et al. (2012) Spasm: A matlab toolbox for sparse statistical modeling. J. Stat. Softw., 1–24. [35] Friedman,J. et al. (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw., 33, 1–22. [36] Xing,H. and Gardner,T.S. (2006) The mode-of-action by network identification (MNI) algorithm: a network biology approach for molecular target identification. Nat. Protoc., 1, 2551–4. [37] Stolovitzky,G., Prill,R.J. and Califano,A. (2009) Lessons from the DREAM2 challenges: A community effort to assess biological network inference. Ann. N. Y. Acad. Sci., 1158, 159–195. [38] Prill,R.J., Marbach,D., Saez-Rodriguez,J., Sorger,P.K., Alexopoulos,L.G., Xue,X., Clarke,N.D., Altan-Bonnet,G. and Stolovitzky,G. (2010) Towards a rigorous assessment of systems biology models: The DREAM3 challenges. PLoS One, 5. [39] Faith,J.J. et al. (2008) Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata. Nucleic Acids Res., 36, D866–70. [40] Parkinson,H. et al. (2007) ArrayExpress--a public database of microarray experiments and gene expression profiles. Nucleic Acids Res., 35, D747–50. [41] Barrett,T. et al. (2013) NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res., 41, D991–5. [42] Gentleman,R.C. et al. 
(2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 5, R80. [43] Iskar,M. et al. (2010) Drug-induced regulation of target expression. PLoS Comput. Biol., 6. [44] Bonke,M. et al. (2013) Transcriptional networks controlling the cell cycle. G3 (Bethesda), 3, 75–90.

[45] Teixeira,M.C. et al. (2014) The YEASTRACT database: an upgraded information system for the analysis of gene and genomic transcription regulation in Saccharomyces cerevisiae. Nucleic Acids Res., 42, D161–6. [46] Kuhn,M. et al. (2014) STITCH 4: integration of protein-chemical interactions with user data. Nucleic Acids Res., 42, D401–D407. [47] Giannattasio,S. et al. (2013) Molecular mechanisms of Saccharomyces cerevisiae stress adaptation and programmed cell death in response to acetic acid. Front. Microbiol., 4. [48] Jacinto,E. and Hall,M.N. (2003) TOR signalling in bugs, brain and brawn. Nat. Rev. Mol. Cell Biol., 4, 117–26. [49] Wishart,D.S. et al. (2006) DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res., 34, D668–D672. [50] Chen,E.Y. et al. (2013) Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics, 14. [51] Storey,J.D. (2002) A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B, 64, 479–498. [52] Kaimal,V. et al. (2010) ToppCluster: A multiple gene list feature analyzer for comparative enrichment clustering and networkbased dissection of biological systems. Nucleic Acids Res., 38, 96–102. [53] Bevilacqua,V. and Pannarale,P. (2013) Scalable high-throughput identification of genetic targets by network filtering. BMC Bioinformatics, 14. [54] Luscombe,N. et al. (2004) Genomic analysis of regulatory network dynamics reveals large topological changes. Nature, 431, 308–312. [55] Tegner,J. et al. (2003) Reverse engineering gene networks : Integrating genetic perturbations with dynamical modeling. Proc. Natl. Acad. Sci. U. S. A., 100. [56] Maathuis,M.H. et al. (2010) Predicting causal effects in large-scale systems from observational data. Nat. Methods, 7, 247–248. [57] Noh, H. and Gunawan, R. (2016) Inferring gene targets of drugs and chemical compounds from gene expression profiles. Bioinformatics, 32, 2120-7. [58] Lynch, D. R. (2005). 
Finite Difference Calculus. In Numerical Partial Differential Equations for Environmental Scientists and Engineers: A First Practical Course, Springer. [59] Marbach,D., Lamparter,D., Quon,G., Kellis,M., Kutalik,Z. and Bergmann,S. (2016) Tissue-specific regulatory circuits reveal variable modular perturbations across complex diseases. Nat. Methods, 13, 366–370.

[60] Schaffter, T., D. Marbach, and D. Floreano. (2011). GeneNetWeaver: In silico benchmark generation and performance profiling of network inference methods. Bioinformatics, 27, 2263–2270. [61] Leonidas C. Platanias. (2005). Mechanisms of type- I- and type-II-interferon-mediated signalling. Nat. Rev. Immunol. 5, 375–386. [62] Kuleshov, M. V, M. R. Jones, A. D. Rouillard, N. F. Fernandez, Q. Duan, Z. Wang, S. Koplev, et al. (2016). Enrichr : a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res., gkw377. [63] Soldaini, E., S. John, S. Moro, J. Bollenbacher, U. Schindler, and W. J. Leonard. (2000). DNA Binding Site Selection of Dimeric and Tetrameric Stat5 Proteins Reveals a Large Repertoire of Divergent Tetrameric Stat5a Binding Sites. Mol. Cell. Biol., 20, 389–401. [64] Korteweg C and Gu J, (2010) Pandemic influenza A (H1N1) virus infection and avian influenza A (H5N1) virus infection: a comparative analysis, Biochem Cell Biol., 88, 575- 87. doi: 10.1139/O10-017 [65] Croft,D., OKelly,G., Wu,G., Haw,R., Gillespie,M., Matthews,L., Caudy,M., Garapati,P., Gopinath,G., Jassal,B., et al. (2011) Reactome: A database of reactions, pathways and biological processes. Nucleic Acids Res., 39, 691–697. [66] Gaur,P., Munjhal,A. and Lal,S.K. (2011) Influenza virus and cell signaling pathways. Med. Sci. Monit., 17, RA148-54. [67] Parnell,G., McLean,A., Booth,D., Huang,S., Nalos,M. and Tang,B. (2011) Aberrant cell cycle and apoptotic changes characterise severe influenza a infection - a meta-analysis of genomic signatures in circulating leukocytes. PLoS One, 6. [68] Ueda,M., Daidoji,T., Du,A., Yang,C.-S., Ibrahim,M.S., Ikuta,K. and Nakaya,T. (2010) Highly pathogenic H5N1 avian influenza virus induces extracellular Ca2+ influx, leading to apoptosis in avian cells. J. Virol., 84, 3068–3078. [69] Oslund,K.L. and Baumgarth,N. (2011) Influenza-induced innate immunity: regulators of viral replication, respiratory tract pathology & adaptive immunity. 
Future Virol., 6, 951–962. [70] Lam,W.-Y., Yeung,A.C.-M., Ngai,K.L.-K., Li,M.-S., To,K.-F., Tsui,S.K.-W. and Chan,P.K.-S. (2013) Effect of avian influenza A H5N1 infection on the expression of microRNA-141 in human respiratory epithelial cells. BMC Microbiol., 13. [71] Li,Y., Zhou,H., Wen,Z., Wu,S., Huang,C., Jia,G., Chen,H. and Jin,M. (2011) Transcription analysis on response of swine lung to H1N1 swine influenza virus. BMC Genomics, 12. [72] Noh, H., Hua, Z., and Gunawan, R. (2016) Inferring causal gene targets from time course expression data. IFAC-PapersOnLine, 49(26), 350-356.

[73] Cahan,P., Li,H., Morris,S.A., Lummertz Da Rocha,E., Daley,G.Q. and Collins,J.J. (2014) CellNet: Network biology applied to stem cell engineering. Cell, 158, 903–915. [74] Szklarczyk,D., Franceschini,A., Wyder,S., Forslund,K., Heller,D., Huerta-Cepas,J., Simonovic,M., Roth,A., Santos,A., Tsafou,K.P., et al. (2015) STRING v10: Protein- protein interaction networks, integrated over the tree of life. Nucleic Acids Res., 43, D447– D452. [75] Whitlock,M.C. (2005) Combining probability from independent tests: The weighted Z- method is superior to Fishers approach. J. Evol. Biol., 18, 1368–1373. [76] Bansal,M., Yang,J., Karan,C., Menden,M.P., Costello,J.C., Tang,H., Xiao,G., Li,Y., Allen,J., Zhong,R., et al. (2014) A community computational challenge to predict the activity of pairs of compounds. Nat. Biotechnol., 32, 1213–1222. [77] Magkoufopoulou,C., Claessen,S.M.H., Tsamou,M., Jennen,D.G.J., Kleinjans,J.C.S. and Van delft,J.H.M. (2012) A transcriptomics-based in vitro assay for predicting chemical genotoxicity in vivo. Carcinogenesis, 33, 1421–1429. [78] Kubicek,S., Gilbert,J.C., Fomina-yadlin,D., Gitlin,A.D. and Yuan,Y. (2012) Chromatin- targeting small molecules cause class-specific transcriptional changes in pancreatic endocrine cells. Proc Natl Acad Sci U S A, 109, 5364–5369. [79] McDermott,J.E., Shankaran,H., Eisfeld,A.J., Belisle,S.E., Neuman,G., Li,C., McWeeney,S., Sabourin,C., Kawaoka,Y., Katze,M.G., et al. (2011) Conserved host response to highly pathogenic avian influenza virus infection in human cell culture, mouse and macaque model systems. BMC Syst. Biol., 5. [80] Mitchell,H.D., Eisfeld,A.J., Sims,A.C., McDermott,J.E., Matzke,M.M., Webb- Robertson,B.J.M., Tilton,S.C., Tchitchek,N., Josset,L., Li,C., et al. (2013) A Network Integration Approach to Predict Conserved Regulators Related to Pathogenicity of Influenza and SARS-CoV Respiratory Viruses. PLoS One, 8. 
[81] Menachery,V.D., Eisfeld,A.J., Schäfer,A., Josset,L., Sims,A.C., Proll,S., Fan,S., Li,C., Neumann,G., Tilton,S.C., et al. (2014) Pathogenic influenza viruses and coronaviruses utilize similar and contrasting approaches to control interferon-stimulated gene responses. MBio, 5, 1–11. [82] Kanehisa,M., Furumichi,M., Tanabe,M., Sato,Y. and Morishima,K. (2017) KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res., 45, D353–D361. [83] Luo,W., Friedman,M.S., Shedden,K., Hankenson,K.D. and Woolf,P.J. (2009) GAGE: generally applicable gene set enrichment for pathway analysis. BMC Bioinformatics, 10.

[84] Law,V., Knox,C., Djoumbou,Y., Jewison,T., Guo,A.C., Liu,Y., MacIejewski,A., Arndt,D., Wilson,M., Neveu,V., et al. (2014) DrugBank 4.0: Shedding new light on drug metabolism. Nucleic Acids Res., 42, 1091–1097. [85] Zhu,F., Shi,Z., Qin,C., Tao,L., Liu,X., Xu,F., Zhang,L., Song,Y., Liu,X., Zhang,J., et al. (2012) Therapeutic target database update 2012: A resource for facilitating target-oriented drug discovery. Nucleic Acids Res., 40, 1128–1136. [86] Günther,S., Kuhn,M., Dunkel,M., Campillos,M., Senger,C., Petsalaki,E., Ahmed,J., Urdiales,E.G., Gewiess,A., Jensen,L.J., et al. (2008) SuperTarget and Matador: Resources for exploring drug-target relationships. Nucleic Acids Res., 36, 919–922. [87] Davis,A.P., Grondin,C.J., Johnson,R.J., Sciaky,D., King,B.L., McMorran,R., Wiegers,J., Wiegers,T.C. and Mattingly,C.J. (2017) The Comparative Toxicogenomics Database: Update 2017. Nucleic Acids Res., 45, D972–D978. [88] Szklarczyk,D., Santos,A., Von Mering,C., Jensen,L.J., Bork,P. and Kuhn,M. (2016) STITCH 5: Augmenting protein-chemical interaction networks with tissue and affinity data. Nucleic Acids Res., 44, D380–D384. [89] Watanabe,T., Kawakami,E., Shoemaker,J.E., Lopes,T.J.S., Matsuoka,Y., Tomita,Y., Kozuka-Hata,H., Gorai,T., Kuwahara,T., Takeda,E., et al. (2014) Influenza virus-host interactome screen as a platform for antiviral drug development. Cell Host Microbe, 16, 795–805. [90] Cazzalini,O., Scovassi,A.I., Savio,M., Stivala,L.A. and Prosperi,E. (2010) Multiple roles of the cell cycle inhibitor p21CDKN1A in the DNA damage response. Mutat. Res. - Rev. Mutat. Res., 704, 12–20. [91] Zhan,Q. (2005) Gadd45a, a p53- and BRCA1-regulated stress protein, in cellular response to DNA damage. Mutat. Res. - Fundam. Mol. Mech. Mutagen., 569, 133–143. [92] Kelman,Z. (1997) PCNA: structure, functions and interactions. Oncogene, 14, 629–640. [93] Shao,S., Wang,Y., Jin,S., Song,Y., Wang,X., Fan,W., Zhao,Z., Fu,M., Tong,T., Dong,L., et al. 
(2006) Gadd45a interacts with aurora-A and inhibits its kinase activity. J. Biol. Chem., 281, 28943–28950. [94] Macůrek,L., Lindqvist,A., Lim,D., Lampson,M.A., Klompmaker,R., Freire,R., Clouin,C., Taylor,S.S., Yaffe,M.B. and Medema,R.H. (2008) Polo-like kinase-1 is activated by aurora A to promote checkpoint recovery. Nature, 455, 119–123. [95] Toyoshima-morimoto,F., Taniguchi,E., Shinya,N., Iwamatsu,A. and Nishida,E. (2001) Polo-like kinase 1 phosphorylates cyclin B1 and targets it to the nucleus during prophase. Nature, 410, 215–220. [96] Pearl,L.H., Schierz,A.C., Ward,S.E., Al-Lazikani,B. and Pearl,F.M.G. (2015) Therapeutic opportunities within the DNA damage response. Nat. Rev. Cancer, 15, 166–180.

[97] Deans,A.J. and West,S.C. (2013) DNA interstrand crosslink repair and cancer. Nat. Rev. Cancer, 11, 467–480. [98] Andreassen,P. and Ren,K. (2009) Fanconi Anaemia Proteins, DNA Interstrand Crosslink Repair Pathways, and Cancer Therapy. Curr. Cancer Drug Targets, 9, 101–117. [99] de Chassey,B., Meyniel-Schicklin,L., Aublin-Gex,A., Navratil,V., Chantier,T., André,P. and Lotteau,V. (2013) Structure homology and interaction redundancy for discovering virus–host protein interactions. EMBO Rep., 14, 938–944. [100] Shaw,M.L., Stone,K.L., Colangelo,C.M., Gulcicek,E.E. and Palese,P. (2008) Cellular proteins in influenza virus particles. PLoS Pathog., 4, 1–13. [101] Matsuoka,Y., Matsumae,H., Katoh,M., Eisfeld,A.J., Neumann,G., Hase,T., Ghosh,S., Shoemaker,J.E., Lopes,T.J., Watanabe,T., et al. (2013) A comprehensive map of the influenza A virus replication cycle. BMC Syst. Biol., 7, 97. [102] Braakman,I., Hoover-Litty,H., Wagner,K.R. and Helenius,A. (1991) Folding of influenza hemagglutinin in the endoplasmic reticulum. J. Cell Biol., 114, 401–411. [103] Rodriguez,A., Perez-Gonzalez,A. and Nieto,A. (2007) Influenza Virus Infection Causes Specific Degradation of the Largest Subunit of Cellular RNA Polymerase II. J. Virol., 81, 5315–5324. [104] Rudnicka,A. and Yamauchi,Y. (2016) Ubiquitin in influenza virus entry and innate immunity. Viruses, 8, 1–15. [105] Varga,Z.T., Grant,A., Manicassamy,B. and Palese,P. (2012) Influenza Virus Protein PB1-F2 Inhibits the Induction of Type I Interferon by Binding to MAVS and Decreasing Mitochondrial Membrane Potential. J. Virol., 86, 8359–8366. [106] Hale,B.G., Randall,R.E., Ortin,J. and Jackson,D. (2008) The multifunctional NS1 protein of influenza A viruses. J. Gen. Virol., 89, 2359–2376. [107] Shoemaker,J.E., Fukuyama,S., Eisfeld,A.J., Muramoto,Y., Watanabe,S., Watanabe,T., Matsuoka,Y., Kitano,H. and Kawaoka,Y.
(2012) Integrated network analysis reveals a novel role for the cell cycle in 2009 pandemic influenza virus-induced inflammation in macaque lungs. BMC Syst. Biol., 6. [108] Cho,H., Ahn,S.H., Kim,K.M. and Kim,Y.K. (2013) Non-structural protein 1 of influenza viruses inhibits rapid mRNA degradation mediated by double-stranded RNA-binding protein, staufen1. FEBS Lett., 587, 2118–2124. [109] Li,N., Parrish,M., Chan,T.K., Yin,L., Rai,P., Yoshiyuki,Y., Abolhassani,N., Tan,K.B., Kiraly,O., Chow,V.T.K., et al. (2015) Influenza infection induces host DNA damage and dynamic DNA damage responses during tissue regeneration. Cell. Mol. Life Sci., 72, 2973– 2988. REFERENCES | 75

[110] Elbahesh,H., Cline,T., Baranovich,T., Govorkova,E.A., Schultz-Cherry,S. and Russell,C.J. (2014) Novel Roles of Focal Adhesion Kinase in Cytoplasmic Entry and Replication of Influenza A Viruses. J. Virol., 88, 6714–6728.
[111] Ratner,B. (2010) Variable selection methods in regression: Ignorable problem, outing notable solution. J. Targeting, Meas. Anal. Mark., 18, 65–75.
[112] Iglewicz,B. and Hoaglin,D.C. (1993) Volume 16: How to detect and handle outliers. ASQC Quality Press.
[113] Pike,N. (2011) Using false discovery rates for multiple comparisons in ecology and evolution. Methods Ecol. Evol., 2, 278–282.


APPENDICES

Appendix A: Inferring gene targets of drugs and chemical compounds from gene expression profiles

A.1. Procedure of q-value calculation for predictions from DeltaNet and SSEM

For each gene $k$, we first compute the median and the median absolute deviation (MAD) of the perturbation coefficients $p_{ki}$ over all samples $i$ in which $p_{ki}$ is non-zero. The MAD is calculated by:

$$\mathrm{MAD} = \operatorname{median}\left(\left|p_{ki} - \bar{p}_k\right|\right)$$

where $\bar{p}_k$ denotes here the median of the non-zero $p_{ki}$. Subsequently, we evaluate the modified z-score $M_i$ as follows [112]:

$$M_i = \frac{0.6745\,\left(p_{ki} - \bar{p}_k\right)}{\mathrm{MAD}}$$

and flag any sample with $|M_i| > 3.5$ as an outlier. After removing the outliers, we transform the (non-zero) values of $p_{ki}$ into standard z-scores. Assuming a normal (Gaussian) distribution, we then convert the standard z-scores into p-values. Finally, we calculate the q-values [51] from the p-values using the MATLAB subroutine mafdr(). The q-value provides a measure of statistical significance based on the False Discovery Rate (FDR) for multiple hypothesis tests [51, 113]. For the purpose of gene ranking, we set the q-value of every zero-valued $p_{ki}$ to 1, the maximum possible q-value. Correspondingly, a gene with a lower q-value is ranked higher.


Figure A1. True positive rates of gene target predictions using DeltaNet-LAR until completion and with δr = 1, 5, 10 and 20% for E. coli, S. cerevisiae and D. melanogaster datasets.

Table A1. Ranking of known TFs based on the p-values of TF enrichment analysis using the top 100 gene target predictions by DeltaNet, SSEM, MNI and z-scores in the yeast compendium.

Compound | TF¹ | DeltaNet² rank (p-value) | SSEM rank (p-value) | MNI rank (p-value) | Z-scores rank (p-value)
Acetate | haa1³ | 5 (2.79e-09) | 20 (4.47e-08) | 78 (2.19e-07) | 9 (2.68e-05)
Acetate | hsf1 | 21 (3.43e-07) | 17 (2.74e-08) | 47 (1.02e-08) | 158 (5.96e-02)
Acetate | rtg3³ | 41 (1.18e-04) | 102 (1.79e-03) | 79 (2.47e-07) | 279 (7.20e-01)
Acetate | rtg1³ | 50 (2.42e-04) | 153 (1.39e-02) | 107 (1.26e-05) | 142 (3.70e-02)
Acetate | rim101 | 118 (2.12e-02) | 101 (1.62e-03) | 99 (3.60e-06) | 139 (3.37e-02)
Acetate | hog1 | 197 (1.43e-01) | 205 (7.10e-02) | 108 (1.70e-05) | 124 (2.62e-02)
Acetate | swi6 | 207 (1.84e-01) | 113 (2.91e-03) | 190 (3.62e-03) | 128 (2.79e-02)
Acetate | ime1 | 265 (4.77e-01) | 269 (5.05e-01) | 266 (2.14e-01) | 126 (2.64e-02)
3-Amino-1,2,4-triazole | gcn4 | 2 (3.99e-07) | 2 (1.00e-15) | 1 (0.00e+00) | 1 (0.00e+00)
Ethanol | flo8 | 97 (9.24e-02) | 47 (2.75e-04) | 102 (3.87e-03) | 107 (5.02e-02)
Ethanol | snf1 | 247 (7.44e-01) | 262 (7.38e-01) | 117 (1.04e-02) | 134 (1.04e-01)
Ethanol | spt4 | 277 (9.95e-01) | 278 (9.74e-01) | 286 (9.68e-01) | 231 (5.52e-01)
H2O2 | skn7 | 1 (2.30e-09) | 10 (5.49e-06) | 24 (6.40e-11) | 158 (4.10e-01)
H2O2 | yap1 | 116 (1.83e-02) | 139 (8.23e-02) | 72 (2.80e-06) | 113 (2.52e-01)
H2O2 | hog1 | 166 (7.38e-02) | 252 (5.75e-01) | 255 (5.55e-01) | - (-)
H2O2 | dot5 | 212 (1.84e-01) | - (-) | 240 (4.27e-01) | 170 (4.37e-01)
Rapamycin | tec1 | 3 (2.43e-09) | 43 (6.13e-05) | 25 (4.80e-09) | 214 (1.00e+00)
Rapamycin | gln3 | 4 (5.02e-09) | 7 (1.06e-10) | 260 (9.84e-01) | 36 (0.00e+00)
Rapamycin | gcn4 | 5 (1.38e-08) | 9 (4.51e-10) | 3 (0.00e+00) | 59 (2.67e-07)
Rapamycin | msn2 | 17 (3.89e-05) | 26 (2.93e-06) | 31 (2.70e-09) | 166 (6.19e-01)
Rapamycin | put3 | 29 (2.64e-04) | 39 (1.80e-05) | 108 (3.57e-02) | 69 (1.44e-05)
Rapamycin | msn4⁴ | 40 (7.83e-04) | 15 (6.38e-08) | 86 (3.08e-03) | 121 (1.50e-01)
Rapamycin | rtg3⁴ | 41 (9.14e-04) | 40 (2.47e-05) | 83 (1.74e-03) | 54 (1.90e-08)
Rapamycin | rtg1⁴ | 61 (5.42e-03) | 85 (2.78e-03) | 32 (3.00e-09) | 184 (8.41e-01)
Rapamycin | gat1⁴ | 78 (1.03e-02) | 174 (7.20e-02) | 223 (6.26e-01) | 175 (7.20e-01)
Rapamycin | rgt1 | 117 (4.51e-02) | 121 (1.31e-02) | 128 (1.01e-01) | 64 (6.80e-07)
Rapamycin | pip2 | 151 (9.50e-02) | 41 (2.56e-05) | 57 (3.90e-05) | 11 (0.00e+00)
Rapamycin | rpd3 | 180 (1.60e-01) | 115 (1.07e-02) | 28 (3.00e-10) | - (-)

¹ Unless indicated otherwise, the reference TFs are taken from STITCH with significance scores > 0.7.
² Gene ranking was generated using DeltaNet-LAR with δr = 1%.
³ TFs taken from ref. 47.
⁴ TFs taken from ref. 48.


Table A2. Statistical significance of the median differences in the rankings of enriched TF between two methods. The p-values were calculated by Wilcoxon rank sum test in MATLAB.

Yeast (p-values of a Wilcoxon rank sum test)
Method | median rank | DeltaNet | SSEM | MNI | Z-score
DeltaNet | 69.5 | - | - | - | -
SSEM | 85 | 0.8663 | - | - | -
MNI | 92.5 | 0.4460 | 0.5613 | - | -
Z-score | 127 | 0.1741 | 0.1421 | 0.3070 | -

MCF-7 (p-values of a Wilcoxon rank sum test)
Method | median rank | DeltaNet | SSEM | Z-score
DeltaNet | 65 | - | - | -
SSEM | 105 | 0.0389 | - | -
Z-scores | 83 | 0.3612 | 0.3971 | -


Table A3. Ranking of TFs enriched among the top 100 gene target predictions by DeltaNet, SSEM, and z-scores in MCF-7 compendium from C-Map.

DeltaNet SSEM Z-scores TF rank (c-score) rank (c-score) rank (c-score) Acacetin JUN 80 (1.36) 132 (-0.04) 207 (-0.32) SP1 27 (3.61) 166 (-0.09) 12 (6.10) NFKB1 28 (3.48) 25 (3.69) 95 (0.86) Acetylsalicylic acid TP53 39 (2.47) 105 (0.33) 238 (-0.37) PPARA 127 (0.04) 112 (0.24) 83 (1.23) HIF1A 111 (0.27) 45 (1.77) 33 (3.61) SREBF1 29 (2.85) 62 (2.04) 51 (2.32) Clenbuterol VDR 54 (1.81) 88 (1.19) 76 (1.56) Clioquinol HIF1A 47 (2.68) 209 (-0.56) 31 (3.82) Deferoxamine HIF1A 35 (4.23) 191 (-0.41) 26 (3.62) SP1 47 (2.43) 13 (9.80) 17 (7.97) RUNX2 152 (-0.15) 203 (-0.49) 215 (-0.49) Diethylstilbestrol JUN 165 (-0.21) 150 (-0.25) 48 (3.44) ESR1 223 (-0.43) 98 (0.70) 58 (2.60) ESR2 - (-) 252 (-1.47) 251 (-1.42) CREB1 83 (1.12) 87 (1.17) 26 (4.10) NR1I2 159 (-0.36) 80 (1.42) 108 (0.59) ESR1 111 (0.08) 91 (0.99) 120 (0.19) Estradiol NR1I2 159 (-0.36) 108 (0.59) 80 (1.42) HIF1A 201 (-0.48) 99 (0.73) 199 (-0.32) RUNX2 208 (-0.50) 173 (-0.24) 35 (3.52) ESR2 247 (-1.28) - (-) - (-) Flunisolide NR3C1 81 (1.33) 124 (0.11) 209 (-0.29) Flurbiprofen TP53 3 (7.20) 115 (0.32) 175 (-0.22) WT1 27 (3.31) 107 (0.51) 12 (7.45) Fulvestrant ESR1 29 (3.16) 99 (0.78) 68 (1.86) Geldanamycin HIF1A 61 (1.71) 210 (-0.21) 177 (-0.32) SREBF1 19 (3.62) 36 (3.87) 111 (0.54) JUN 86 (0.85) 145 (-0.15) 76 (1.60) Genistein HIF1A 114 (0.24) 71 (1.63) 168 (-0.22) PPARA - (-) 146 (-0.13) - (-) SP1 8 (7.91) 44 (2.81) 12 (8.01) LY-294002 JUN 60 (1.94) 31 (3.48) 190 (-0.31) HIF1A 87 (0.82) 207 (-0.40) 196 (-0.38) PGR 24 (4.96) 64 (2.14) 65 (2.05) Levonorgestrel ESR1 150 (-0.12) 130 (0.10) 193 (-0.25) Mexiletine AHR 192 (-0.45) 70 (1.61) 184 (-0.30) Naloxone ESR1 82 (1.15) 105 (0.53) 83 (1.18) Probucol YY1 65 (1.36) 81 (1.32) 56 (1.52) Rapamycin JUN 25 (4.62) 93 (1.05) 52 (2.47) SREBF1 28 (4.51) 143 (-0.07) 28 (4.19) HIF1A 53 (2.41) 29 (5.14) 1 (8.60) SREBF2 179 (-0.21) 147 (-0.06) 185 (-0.29) JUN 54 (1.69) 41 (2.91) 44 (2.83) PPARA 57 (1.52) 152 (-0.12) 208 (-0.46) Rosiglitazone PPARG 127 (0.04) 80 (1.41) 118 (0.24) KLF4 155 (-0.13) 226 (-0.49) 238 (-0.48) XBP1 192 (-0.22) 172 (-0.24) 40 (3.01) Troglitazone JUN 25 (4.19) 93 (1.05) 102 (0.65) SREBF2 58 (2.46) 180 (-0.30) 205 (-0.38) Valproic acid SREBF1 63 (1.77) 24 (5.86) 26 (4.54)


Figure A2. Comparison of true positive rates using gene ranking sorted by pki magnitudes and by q-values from DeltaNet. The analyses using DeltaNet-LAR were performed using δr=1%.

Figure A3. Comparison of the average numbers of nonzero coefficients in ak, the row vector of matrix A, inferred by DeltaNet-LASSO and SSEM over (a) all genes and (b) the target genes ranked within the top 10 only by DeltaNet-LASSO. The numbers of target genes within the top 10 predictions by DeltaNet-LASSO but not by SSEM were 10, 15, and 13 for the E. coli, yeast, and Drosophila data sets, respectively.

Figure A4. Comparison of true positive rates for gene target predictions using DeltaNet-LAR (δr=10%) with and without time-series data for E. coli and yeast. Since the Drosophila dataset consists of only steady-state data, we compared the results only for E. coli and S. cerevisiae.

Appendix B: Inferring causal gene targets from time course expression data

Figure B1. Enriched pathways resulting from gene set enrichment analysis of log2FCs for Reactome pathway terms. The size and color of the dots indicate the negative base-10 logarithm of the p-values of the enriched terms. Each influenza infection was divided into 3 time phases: phase 1 = 0-7 hours, phase 2 = 7-18 hours, and phase 3 = more than 18 hours post-infection.


Appendix C: Network perturbation analysis of gene transcriptional profiles reveals protein targets and mechanism of action of drugs and influenza A viral infection

Table C1. Known targets for NCI-DREAM drug synergy challenge compounds

Compounds | Direct targets (source)
Aclacinomycin A | TOP2A (STITCH*), TOP1 (STITCH), TOP2B (STITCH)
Blebbistatin | MYH14 (DrugBank, STITCH), MYH2 (STITCH), MYH9 (STITCH), MYH10 (STITCH)
Camptothecin | TOP1 (DrugBank, TTD), CHEK1 (STITCH), ALB (CTD), AR (CTD), ESR1 (CTD), ESR2 (CTD), PGR (CTD), THRA (CTD), THRB (CTD)
Doxorubicin Hydrochloride | TOP2A (DrugBank, TTD, STITCH), ABCB1 (STITCH), ERBB2 (STITCH), AURKA (STITCH), ALB (CTD), AR (CTD), TF (CTD), ESR1 (CTD), ESR2 (CTD), NOLC1 (CTD), PGR (CTD), THRA (CTD), THRB (DTD)
Etoposide | TOP2A(DrugBank, TTD, CTD, STITCH), ABCC3(MATADOR, STITCH), TOP2B(MATADOR, DrugBank, TTD, STITCH), ABCC1(STITCH), ABCB1(MATADOR, STITCH), ABCC2(STITCH), H2AFX(STITCH), ABCB5(MATADOR), D020168(MATADOR), D004250(MATADOR), CASP3(STITCH), ABCG2(STITCH)
Geldanamycin | HSP90AA1(DrugBank, TTD, STITCH), HSP90AB1(DrugBank, TTD, STITCH), HSP90B1(TTD, CTD, STITCH)
Methotrexate | DHFR(DrugBank, MATADOR, TTD, STITCH), ABCC3(MATADOR, STITCH), SLC19A1(MATADOR, STITCH), SLC22A6(MATADOR, STITCH), ABCB1(MATADOR, STITCH), FPGS(MATADOR, STITCH), ABCC2(MATADOR, STITCH), ABCC4(MATADOR, STITCH), GGH(MATADOR, STITCH), TYMS(STITCH), ABCG2(STITCH), ABCB5(MATADOR), SLC10A2(MATADOR), SLC22A9(MATADOR), SLC13A3(MATADOR), SLC10A4(MATADOR), SLCO4C1(MATADOR), SLCO4A1(MATADOR), SLCO2B1(MATADOR), SLC16A14(MATADOR), SLC22A11(MATADOR), SLC23A1(MATADOR), SLCO2A1(MATADOR), SLC16A11(MATADOR), SLC22A8(MATADOR), SLCO3A1(MATADOR), ABCC5(MATADOR), ABCC6(MATADOR), SLC23A2(MATADOR), SLC10A1(MATADOR), SLCO1C1(MATADOR), SLCO1B1(MATADOR), SLCO1B3(MATADOR), ABCB4(MATADOR), SLC10A6(MATADOR), AR(CTD), PGR(CTD)
Mitomycin C | NQO1(MATADOR, STITCH), POR(MATADOR, STITCH)
Monastrol | KIF11(DrugBank, TTD, STITCH)
Rapamycin | FKBP1A(DrugBank, MATADOR, TTD, CTD, STITCH), FGF2(DrugBank), OPRK1(TTD), FKBP3(MATADOR, STITCH), FKBP4(MATADOR, STITCH), FKBP1B(MATADOR, STITCH), ABCB1(MATADOR, STITCH), FKBP2(MATADOR, STITCH), FKBP5(MATADOR, STITCH), MTOR(DrugBank, TTD, STITCH), CYP3A4(STITCH), RYR1(STITCH), PPIA(STITCH), PIN4(MATADOR), PPIF(MATADOR), FKBP8(MATADOR), FKBP14(MATADOR), FKBP9(MATADOR), FKBP7(MATADOR), FKBP15(MATADOR), FKBP6(MATADOR), PIN1(MATADOR), FKBP11(MATADOR), PPIG(MATADOR), RPS6KA6(MATADOR), RPS6KA2(MATADOR), PPIL3(MATADOR), PPIB(MATADOR), PPIE(MATADOR), PPIC(MATADOR), PPID(MATADOR), PAK2(MATADOR), FKBP10(MATADOR)
Trichostatin A | HDAC1(STITCH), HDAC2(STITCH), HDAC3(STITCH), HDAC8(STITCH), HDAC4(STITCH), HDAC5(STITCH), HDAC7(STITCH), HDAC9(STITCH), HDAC6(STITCH), HDAC10(STITCH), F3(STITCH), HDAC11(STITCH)
Vincristine | TUBB(DrugBank), TUBA4A(DrugBank, MATADOR), TUBB1(MATADOR), TUBG1(MATADOR), TUBB2B(MATADOR), TUBA8(MATADOR), TUBD1(MATADOR), ESR1(CTD)

*For protein targets reported in STITCH, targets with less than 0.7 evidence score were not considered.


Table C2. Known targets for HepG2 genotoxicity study compounds

Compounds | Direct targets (source)
Azathioprine | HPRT1(STITCH*), IMPDH1(STITCH)
Diazinon | ACHE(STITCH), BCHE(STITCH)
Tetradecanoyl phorbol acetate | PRKCA(STITCH), PRKCB(STITCH), PRKCG(STITCH), PRKCE(STITCH), PRKCD(STITCH)
D-Mannitol | NOS1(STITCH), NOS2(STITCH), NOS3(STITCH)
2-Acetylaminofluorene | CYP1A2(STITCH)
8-Hydroxyquinoline | HIF1AN(STITCH)
4-Aminobiphenyl | ALB(STITCH), TP53(STITCH)
Ampicillin | TEK(STITCH), SLC15A2(STITCH), SLC15A1(STITCH)
o-Anthranilic acid | DAO(STITCH), IL4I1(STITCH), GRB2(STITCH), PREP(STITCH)
Acetaminophen | PTGS1(STITCH), PTGS2(STITCH), FAAH(STITCH), TRPV1(STITCH), CYP2E1(STITCH), CYP1A2(STITCH), CYP2D6(STITCH), CYP1A1(STITCH), NR1I3(STITCH), SULT1A1(STITCH), BAZ2B(STITCH), INS-IGF2(STITCH), GPER1(STITCH), CREBBP(STITCH), MIF(STITCH), LPO(STITCH), TF(STITCH), LTF(STITCH), ABCB1(STITCH), EPX(STITCH)
Benzo(a)pyrene | ADRB2(STITCH), AHR(STITCH), ALB(STITCH), AR(STITCH), CYP1A1(STITCH), NR1I2(STITCH), ESR1(STITCH), CYP1A2(STITCH)
Carbon tetrachloride | CCR1(STITCH), CCR5(STITCH)
Cyclophosphamide | NR1I2(STITCH)
Chlorambucil | GSTP1(STITCH)
Cyclosporine A | ABCB1(STITCH), CAMLG(STITCH), PPP3R2(STITCH), PPIA(STITCH), PPIF(STITCH)
Curcumin | ABCC5(STITCH), CBR1(STITCH), VDR(STITCH), GSTP1(STITCH), PPARG(STITCH), ALOX5(STITCH), APP(STITCH), PRKCE(STITCH)
1,1,1-Trichloro-2,2-di-(4-chlorophenyl) ethane | AHR(STITCH), AR(STITCH), ESR1(STITCH), ESR2(STITCH), GPER1(STITCH), NR1I2(STITCH), NR1I3(STITCH), PGR(STITCH), TSHR(STITCH)
Bis(2-ethylhexyl) phthalate | AHR(STITCH), ESR1(STITCH), ESR2(STITCH), NR1I2(STITCH), NR1I3(STITCH), PPARA(STITCH), PPARB(STITCH), PPARG(STITCH), RXRA(STITCH), RXRB(STITCH)
Diethylstilbestrol | AHR(STITCH), AR(STITCH), ESR1(STITCH), ESR2(STITCH), ESRRA(STITCH), ESRRB(STITCH), ESRRG(STITCH), NR1I2(STITCH), SHBG(STITCH), PGR(STITCH), NR3C1(STITCH), SRC(STITCH), ALB(STITCH), PELP1(STITCH), GPER1(STITCH), EP300(STITCH), CREBBP(STITCH), CARM1(STITCH), ABCG2(STITCH), GRIP1(STITCH), HTR2A(STITCH), TTR(STITCH)
Diclofenac | ALB(STITCH), PTGS2(STITCH), PTGS1(STITCH), ALOX5(STITCH), SCN4A(STITCH), ASIC1(STITCH), KCNQ2(STITCH), KCNQ3(STITCH), PLA2G2A(STITCH), CYP2C9(STITCH), CYP2C8(STITCH), CYP3A4(STITCH), CYP2C19(STITCH), CYP46A1(STITCH), ZADH2(STITCH), CYP2E1(STITCH), TF(STITCH), CYP2C18(STITCH), CYP2D6(STITCH)
Estradiol (17beta-estradiol) | ESR1(STITCH), ESR2(STITCH), NR1I2(STITCH), AR(STITCH), ATP6(STITCH), CHRNA4(STITCH), CYP3A4(STITCH), ESRRG(STITCH), GPER1(STITCH), HSD17B2(STITCH), NR1I2(STITCH), PGR(STITCH), SHBG(STITCH), SHC1(STITCH), HSD17B1(STITCH), SULT1E1(STITCH), SULT1A1(STITCH), NR3C1(STITCH), ESRRA(STITCH), RARA(STITCH), BRCA1(STITCH), ESRRB(STITCH)
Ethylbenzene | MAOB(STITCH), CRK(STITCH), CRKL(STITCH)
Eugenol | AR(STITCH), ESR1(STITCH), ESR2(STITCH), TRPV3(STITCH)
Lindane | ESR1(STITCH), GABRB1(STITCH), GABRB3(STITCH), GABRR1(STITCH), NR1I2(STITCH), PGR(STITCH), GLRA1(STITCH), GLRA2(STITCH), GLRA3(STITCH), GLRB(STITCH)
Mitomycin C | NQO1(STITCH), POR(STITCH)
Pentachlorophenol | ESR1(STITCH), SULT1C4(STITCH), SDHB(STITCH), SDHC(STITCH), ENSG00000255292(STITCH), SDHD(STITCH), SDHA(STITCH)
p-cresidine | SOD1(STITCH)
Phenobarbital | NR1I2(STITCH), GABRA1(STITCH), CHRNA4(STITCH), CHRNA7(STITCH), GRIA2(STITCH), GRIK2(STITCH), GRIN1(STITCH), GRIN2A(STITCH), GRIN2B(STITCH), GRIN2C(STITCH), GRIN2D(STITCH), GRIN3A(STITCH), GRIN3B(STITCH), NR1I3(STITCH), CYP1A1(STITCH), CHRFAM7A(STITCH)
Progesterone | AHR(STITCH), AR(STITCH), ESR2(STITCH), NR1I2(STITCH), NR3C1(STITCH), ORM1(STITCH), PGR(STITCH), SHBG(STITCH), ESR1(STITCH), NR3C2(STITCH), CYP17A1(STITCH), OPRK1(STITCH), AKR1D1(STITCH), AKR1C2(STITCH), AKR1C1(STITCH), CYP3A4(STITCH), SERPINA6(STITCH), SRC(STITCH), PAQR5(STITCH), GNB1(STITCH), GNA1(STITCH), CATSPER4(STITCH), CATSPERG(STITCH), CATSPERD(STITCH), CATSPER1(STITCH), CATSPER2(STITCH), CATSPER3(STITCH), CATSPERB(STITCH), GNGT1(STITCH), VDR(STITCH), APOD(STITCH), OPRK1(STITCH)
Quercetin | ACTB(STITCH), AHR(STITCH), ALB(STITCH), ATP5A1(STITCH), ATP5B(STITCH), CBR1(STITCH), CYP1B1(STITCH), ESR1(STITCH), ESR2(STITCH), NQO2(STITCH), NR1I2(STITCH), SHBG(STITCH), HCK(STITCH), PIM1(STITCH), CYP1A1(STITCH), HIBCH(STITCH), STK17B(STITCH), ATP5C1(STITCH), MMP9(STITCH), SRC(STITCH), CYP19A1(STITCH), PIK3CG(STITCH), UGT3A1(STITCH), UGT1A3(STITCH), UGT1A1(STITCH), UGT2B15(STITCH), UGT1A9(STITCH), AOX1(STITCH), UGT1A8(STITCH), ALOX12(STITCH), NT5E(STITCH), GSK3B(STITCH), EGFR(STITCH), AKR1B1(STITCH)
Reserpine | BIRC5(STITCH), SLC18A2(STITCH), SLC18A1(STITCH), ABCG2(STITCH), DRD2(STITCH), DRD3(STITCH), ABCB1(STITCH)
Resorcinol | ESR1(STITCH), TPO(STITCH), NQO2(STITCH), CSNK2A1(STITCH), PTGS1(STITCH), PTGS2(STITCH), INS-IGF2(STITCH), CA2(STITCH), PNMT(STITCH)
Simazine | ESR1(STITCH)
2,3,7,8-Tetrachloro dibenzo-p-dioxin | AHR(STITCH), ARNT(STITCH), AHRR(STITCH), FLT1(STITCH)
L-Ascorbic acid | P4HA1(STITCH), SLC23A1(STITCH), PLOD1(STITCH), PLOD2(STITCH), PHYH(STITCH), PLOD3(STITCH), BBOX1(STITCH), DBH(STITCH), PAM(STITCH), P3H1(STITCH), OGHOD2(STITCH), ALKBH2(STITCH), P3H2(STITCH), P3H3(STITCH), OGFOD1(STITCH), EGLN2(STITCH), ALKBH3(STITCH), KDM5D(STITCH), EGLN1(STITCH), EGLN3(STITCH), TMLHE(STITCH), P4HTM(STITCH), LCT(STITCH), GSTO1(STITCH), CYBRD1(STITCH), HMOX1(STITCH), CYBASC3(STITCH), P4HB(STITCH), CYB561(STITCH), LEPREL2(STITCH), LEPREL1(STITCH), LEPRE1(STITCH), LPO(STITCH), HMOX2(STITCH), ADRA2B(STITCH), OGFOD2(STITCH), EPX(STITCH), MPO(STITCH), TPO(STITCH)
Pirinixic acid | PPARA(STITCH), PPARD(STITCH), PPARG(STITCH), NR1I3(STITCH)
Cisplatin | A2M(STITCH), ALB(STITCH), ATOX1(STITCH), TF(STITCH)
Phenacetin | PTGS1(STITCH), CYP1A2(STITCH), CYP1A1(STITCH), CYP2A6(STITCH)
Phenol | ALB(STITCH), TMPRSS11D(STITCH), RAN(STITCH), INS-IGF2(STITCH), SNUPN(STITCH), PRKACA(STITCH), NUF2(STITCH), NDC80(STITCH), EIF2C2(STITCH), XPO1(STITCH), PRSS3(STITCH), PRSS1(STITCH)

*For protein targets reported in STITCH, targets with less than 0.7 evidence score were not considered.


Table C3. Known targets for mouse chromatin targeting study compounds

Compounds | Direct targets (source)
PXD101 | Hdac3(STITCH*), Hdac6(STITCH), Hdac10(STITCH), Hdac4(STITCH), Hdac11(STITCH), Hdac9(STITCH), Hdac1(STITCH), Hdac2(STITCH), Hdac5(STITCH), Hdac8(STITCH), Gm10093(STITCH), Hdac7(STITCH)
Pyroxamide | Hdac1(STITCH), Hdac2(STITCH), Hdac3(STITCH), Gm10093(STITCH), Hdac6(STITCH), Hdac8(STITCH), Hdac10(STITCH)
ITF-2357 | Hdac3(STITCH), Hdac2(STITCH), Hdac6(STITCH), Hdac1(STITCH), Hdac4(STITCH), Hdac10(STITCH), Hdac9(STITCH), Hdac5(STITCH), Hdac11(STITCH), Hdac8(STITCH), Gm10093(STITCH), Hdac7(STITCH)
CRA-024781 | Hdac1(DrugBank)
Trichostatin A | Hdac1(STITCH), Hdac6(STITCH), Hdac8(STITCH), Hdac5(STITCH), Hdac2(STITCH), Hdac3(STITCH), Hdac10(STITCH), Hdac4(STITCH), Hdac11(STITCH), Hdac7(STITCH), Hdac9(STITCH), Gm10093(STITCH)
Scriptaid | Hdac5(STITCH), Hdac6(STITCH), Hdac10(STITCH), Hdac3(STITCH), Hdac9(STITCH), Hdac4(STITCH), Hdac11(STITCH), Hdac1(STITCH), Hdac2(STITCH), Hdac8(STITCH), Hdac7(STITCH), Gm10093(STITCH)
LBH-589 | Hdac4(STITCH), Hdac6(STITCH), Hdac5(STITCH), Hdac3(STITCH), Hdac9(STITCH), Hdac11(STITCH), Hdac10(STITCH), Hdac8(STITCH), Hdac1(STITCH), Hdac2(STITCH), Hdac7(STITCH), Gm10093(STITCH)
SAHA | Hdac8(STITCH), Hdac1(STITCH), Hdac6(STITCH), Hdac3(STITCH), Hdac11(STITCH), Hdac9(STITCH), Hdac10(STITCH), Hdac4(STITCH), Hdac2(STITCH), Hdac7(STITCH), Hdac5(STITCH), Gm10093(STITCH)
MGCD-0103 | Hdac3(STITCH), Hdac11(STITCH), Hdac2(STITCH), Hdac1(STITCH), Hdac9(STITCH), Hdac4(STITCH), Hdac8(STITCH), Hdac10(STITCH), Hdac5(STITCH), Hdac6(STITCH), Gm10093(STITCH), Hdac7(STITCH)
CI-994 | Hdac1(STITCH), Hdac2(STITCH), Gm10093(STITCH), Hdac3(STITCH), Hdac8(STITCH), Hdac11(STITCH), Hdac6(STITCH), Hdac7(STITCH)
MS-275 | Hdac4(STITCH), Hdac3(STITCH), Hdac11(STITCH), Hdac8(STITCH), Hdac9(STITCH), Hdac1(STITCH), Hdac5(STITCH), Hdac2(STITCH), Hdac10(STITCH), Hdac6(STITCH), Hdac7(STITCH), Gm10093(STITCH)
Apicidin | Hdac3(STITCH), Hdac8(STITCH), Hdac1(STITCH), Hdac2(STITCH), Hdac4(STITCH), Hdac9(STITCH), Hdac5(STITCH), Hdac11(STITCH), Hdac10(STITCH), Hdac6(STITCH), Gm10093(STITCH), Hdac7(STITCH)
BIX-01294 | Ehmt2(STITCH), Ehmt1(STITCH)

*For protein targets reported in STITCH, targets with less than 0.7 evidence score were not considered.


Table C4. AUROCs of target predictions for NCI-DREAM drug synergy challenge dataset. Compounds without known targets are excluded.

Compounds | ProTINA | DeMAND | DE*
Aclacinomycin A | 0.877 | 0.839 | 0.714
Blebbistatin | 0.917 | 0.865 | 0.397
Camptothecin | 0.707 | 0.483 | 0.477
Doxorubicin Hydrochloride | 0.831 | 0.678 | 0.658
Etoposide | 0.818 | 0.707 | 0.562
Geldanamycin | 0.870 | 0.933 | 0.83
Methotrexate | 0.425 | 0.405 | 0.429
Mitomycin C | 0.923 | 0.622 | 0.72
Monastrol | 0.952 | 0.738 | 0.841
Rapamycin | 0.690 | 0.616 | 0.641
Trichostatin A | 0.802 | 0.666 | 0.65
Vincristine | 0.880 | 0.773 | 0.683
Average | 0.808 | 0.694 | 0.633

*DE: differential gene expression (log2FC) analysis
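An AUROC of this kind measures how highly the known targets are ranked among all candidate proteins: it equals the probability that a randomly chosen known target scores above a randomly chosen non-target. The following is a minimal sketch of that calculation, not the evaluation code used in this work. The function name, the dictionary input, the convention that a larger score means a stronger predicted target, and the toy protein names are all assumptions made for illustration.

```python
def auroc(scores, known_targets):
    """Area under the ROC curve from per-protein scores.

    scores: dict mapping protein name -> score (assumed: higher = more likely target).
    known_targets: set of protein names treated as true positives.
    """
    positives = [s for name, s in scores.items() if name in known_targets]
    negatives = [s for name, s in scores.items() if name not in known_targets]
    if not positives or not negatives:
        raise ValueError("need at least one known target and one non-target")
    wins = 0.0
    # Count target/non-target pairs in which the target scores higher;
    # tied pairs contribute one half.
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example with hypothetical proteins A-D, of which A and C are known targets.
toy_scores = {"A": 0.9, "B": 0.8, "C": 0.3, "D": 0.1}
toy_auroc = auroc(toy_scores, {"A", "C"})  # 3 of 4 pairs are ordered correctly
```

The pairwise loop is quadratic in the number of proteins and is chosen here for transparency; a rank-based formulation gives the same value more efficiently.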


Table C5. AUROCs of target predictions for the compound genotoxicity study (human HepG2). Compounds without known targets are excluded.

Compounds | ProTINA | DeMAND | DE*
Azathioprine | 0.828 | 0.78 | 0.612
Diazinon | 0.754 | 0.753 | 0.497
Tetradecanoyl phorbol acetate | 0.869 | 0.764 | 0.674
D-Mannitol | 0.922 | 0.757 | 0.646
2-Acetylaminofluorene | 0.832 | 0.741 | 0.982
8-Hydroxyquinoline (8-Quinolino) | 0.978 | 0.856 | 0.769
4-Aminobiphenyl | 0.792 | 0.766 | 0.901
Ampicillin (Ampicillin trihydrate) | 0.669 | 0.63 | 0.578
o-Anthranilic acid | 0.904 | 0.659 | 0.393
Acetaminophen | 0.733 | 0.647 | 0.547
Benzo(a)pyrene | 0.795 | 0.804 | 0.856
Carbon tetrachloride | 0.843 | 0.79 | 0.639
Cyclophosphamide | 0.792 | 0.782 | 0.564
Chlorambucil | 0.791 | 0.768 | 0.593
Cyclosporine A | 0.782 | 0.675 | 0.692
Curcumin | 0.804 | 0.696 | 0.7
1,1,1-Trichloro-2,2-di-(4-chlorophenyl) ethane | 0.707 | 0.691 | 0.52
Bis(2-ethylhexyl) phthalate | 0.883 | 0.711 | 0.666
Diethylstilbestrol | 0.799 | 0.746 | 0.534
Diclofenac | 0.597 | 0.654 | 0.53
Estradiol (17beta-estradiol) | 0.716 | 0.64 | 0.526
Ethylbenzene | 0.807 | 0.841 | 0.614
Eugenol | 0.755 | 0.565 | 0.582
Lindane | 0.737 | 0.631 | 0.406
Mitomycin C | 0.866 | 0.723 | 0.692
Pentachlorophenol | 0.694 | 0.68 | 0.731
p-cresidine | 0.979 | 0.989 | 0.788
Phenobarbital | 0.746 | 0.83 | 0.552
Progesterone | 0.674 | 0.658 | 0.444
Quercetin | 0.645 | 0.651 | 0.592
Reserpine | 0.699 | 0.808 | 0.756
Resorcinol | 0.543 | 0.711 | 0.47
Simazine | 0.830 | 0.783 | 0.818
2,3,7,8-Tetrachloro dibenzo-p-dioxin | 0.952 | 0.932 | 0.918
L-Ascorbic acid | 0.566 | 0.58 | 0.455
Pirinixic acid (wy 14643) | 0.666 | 0.825 | 0.63
Cisplatin | 0.821 | 0.756 | 0.695
Phenacetin | 0.707 | 0.697 | 0.56
Phenol | 0.637 | 0.663 | 0.683
Average | 0.772 | 0.734 | 0.636

*DE: differential gene expression (log2FC) analysis

Table C6. AUROCs of target predictions for the chromatin targeting study (mouse pancreatic α- and β-cells). Compounds without known targets are excluded.

Compounds | ProTINA α-cell | ProTINA β-cell | DeMAND α-cell | DeMAND β-cell | DE* α-cell | DE* β-cell
PXD101 | 0.914 | 0.924 | 0.739 | 0.785 | 0.658 | 0.731
Pyroxamide | 0.907 | 0.966 | 0.722 | 0.738 | 0.602 | 0.71
ITF-2357 | 0.922 | 0.918 | 0.77 | 0.867 | 0.634 | 0.75
CRA-024781 | 0.989 | 0.989 | 0.947 | 0.974 | 0.958 | 0.943
BIX-01294 | 0.983 | 0.991 | 0.941 | 0.957 | 0.933 | 0.942
Trichostatin A | 0.894 | 0.905 | 0.685 | 0.819 | 0.572 | 0.683
Scriptaid | 0.874 | 0.876 | 0.757 | 0.698 | 0.674 | 0.678
LBH-589 | 0.873 | 0.870 | 0.77 | 0.785 | 0.73 | 0.691
MGCD-0103 | 0.883 | 0.886 | 0.71 | 0.715 | 0.691 | 0.687
CI-994 | 0.913 | 0.934 | 0.702 | 0.666 | 0.718 | 0.66
SAHA | 0.865 | 0.892 | 0.782 | 0.746 | 0.701 | 0.672
MS-275 | 0.890 | 0.911 | 0.774 | 0.719 | 0.713 | 0.691
Apicidin | 0.896 | 0.889 | 0.756 | 0.745 | 0.624 | 0.684
Average (per method) | ProTINA 0.914 | DeMAND 0.78 | DE* 0.72

*DE: differential gene expression (log2FC) analysis


Table C7. Ranks and signs (+: enhancement, -: attenuation) of ProTINA target scores of the canonical DNA-damage response proteins.

Compound | GADD45A | CDKN1A | PCNA | AURKA | CCNB1 | PLK1
Camptothecin | 58 + | 91 + | 174 + | 651 - | 142 - | 110 -
Doxorubicin Hydrochloride | 134 + | 15 + | 465 + | 210 - | 61 - | 10 -
Etoposide | 78 + | 4 + | 501 + | 20 - | 5 - | 1 -
Methotrexate | 117 + | 52 + | 44 + | 10067 + | 1835 - | 435 -
Mitomycin C | 35 + | 1 + | 73 + | 1272 - | 357 - | 39 -
Trichostatin A | 3741 + | 4705 + | 1563 - | 352 - | 900 - | 2451 -
Aclacinomycin A | 5636 - | 2830 - | 5659 - | 7125 + | 2006 + | 4654 +
Blebbistatin | 2214 + | 7086 + | 8674 - | 7040 - | 3574 + | 7486 +
Cycloheximide | 5479 + | 2604 + | 1348 - | 293 + | 3998 + | 7675 -
Geldanamycin | 768 - | 738 - | 6008 + | 3275 - | 4959 - | 8516 -
H-7, dihydrochloride | 1752 - | 7040 + | 7417 - | 8747 - | 3055 - | 9052 +
Monastrol | 3787 + | 10792 - | 117 - | 44 + | 93 + | 126 +
Rapamycin | 3116 - | 2798 - | 1127 - | 1028 - | 749 - | 3987 -
Vincristine | 1571 + | 2289 + | 1383 - | 54 + | 20 + | 90 +


Table C8. Ranks of DeMAND network dysregulation scores of the canonical DNA-damage response proteins. DeMAND scores are strictly positive.

Compound | GADD45A | CDKN1A | PCNA | AURKA | CCNB1 | PLK1
Camptothecin | 2 | 1 | 42 | 5 | 9 | 17
Doxorubicin Hydrochloride | 1 | 2 | 873 | 114 | 43 | 23
Etoposide | 3 | 6 | 159 | 9 | 3 | 13
Methotrexate | 501 | 46 | 3905 | 3359 | 6809 | 111
Mitomycin C | 3 | 1 | 13 | 338 | 8 | 14
Trichostatin A | 2140 | 78 | 10369 | 227 | 12 | 70
Aclacinomycin A | 9746 | 9332 | 9494 | 5092 | 7262 | 8973
Blebbistatin | 587 | 7874 | 9625 | 6243 | 6543 | 1833
Cycloheximide | 9697 | 5502 | 9483 | 69 | 9310 | 9430
Geldanamycin | 8799 | 720 | 1000 | 3890 | 828 | 10295
H-7, dihydrochloride | 9249 | 95 | 9904 | 9801 | 9804 | 10266
Monastrol | 476 | 9530 | 103 | 12 | 8 | 123
Rapamycin | 10013 | 10046 | 793 | 2643 | 251 | 1336
Vincristine | 136 | 633 | 611 | 21 | 2 | 39


Table C9. Ranks and signs (+: enhancement, -: attenuation) of log2 fold-change gene expressions of the canonical DNA-damage response proteins.

Compound | GADD45A | CDKN1A | PCNA | AURKA | CCNB1 | PLK1
Camptothecin | 50 + | 82 + | 241 + | 661 - | 84 - | 29 -
Doxorubicin Hydrochloride | 47 + | 33 + | 755 + | 255 - | 48 - | 4 -
Etoposide | 59 + | 27 + | 766 + | 77 - | 14 - | 2 -
Methotrexate | 42 + | 138 + | 153 + | 9848 + | 4465 - | 257 -
Mitomycin C | 39 + | 24 + | 99 + | 397 - | 68 - | 10 -
Trichostatin A | 3073 + | 3225 + | 3681 - | 1188 - | 712 - | 672 -
Aclacinomycin A | 5845 + | 8914 - | 5759 - | 10614 - | 4442 + | 8468 +
Blebbistatin | 3494 + | 7506 + | 3338 - | 5057 - | 5718 - | 3915 -
Cycloheximide | 5200 + | 3163 + | 3112 - | 684 + | 6796 + | 6738 -
Geldanamycin | 922 - | 962 - | 10272 + | 2213 - | 2732 - | 7896 -
H-7, dihydrochloride | 3275 - | 6441 + | 7317 - | 7095 - | 1662 - | 8997 -
Monastrol | 834 + | 4474 + | 57 - | 130 + | 88 + | 82 +
Rapamycin | 8457 + | 10471 + | 720 - | 2125 - | 275 - | 988 -
Vincristine | 1193 + | 2609 + | 3999 - | 170 + | 24 + | 39 +


Figure C1. Hierarchical clustering of the protein scores by ProTINA, DeMAND, and DE analysis for 14 drugs of the NCI-Dream drug synergy challenge data set. Labels in red show DNA damaging agents. The hierarchical clustering of ProTINA prediction was based on Pearson


Table C10. Proteins involved in the overall DNA damage repair (DDR) pathways

DNA damage repair (DDR) pathways | Proteins
Base excision repair | ALKBH1, APEX1, APEX2, APLF, APTX, FEN1, GADD45A, GADD45G, HMGB1, HMGB2, HUS1, LIG1, LIG3, MBD4, MPG, MUTYH, NEIL2, NEIL3, NTHL1, OGG1, PARG, PARP1, PARP2, PARP3, PARP4, PCNA, PNKP, POLB, POLD1, POLD2, POLD3, POLD4, POLE, POLH, POLL, SMUG1, TDG, TDP1, UNG, WRN, XRCC1
Fanconi anemia pathway | BARD1, BLM, BRCA1, BRCA2, BRE, BRIP1, DNA2, FAAP100, FAN1, FANCA, FANCB, FANCC, FANCD2, FANCE, FANCF, FANCG, FANCI, FANCL, FANCM, HELQ, HES1, KAT5, PALB2, RAD51, RAD51C, RMI2, STRA13, TELO2, TOP3A, TOP3B, UBE2T, USP1, WDR48
Homologous recombination | BLM, BRCA1, BRCA2, EID3, EME1, EME2, GEN1, H2AFX, HELQ, HFM1, KAT5, MRE11A, MUS81, NBN, NDNL2, NFATC2IP, NSMCE1, NSMCE2, NSMCE4A, PARG, PAXIP1, PPP4C, PPP4R1, PPP4R4, RAD50, RAD51, RAD51B, RAD51C, RAD51D, RAD52, RAD54B, RAD54L, RDM1, RECQL, RECQL4, RECQL5, RMI2, RPA1, RPA2, RPA3, RPA4, SHFM1, SLX4, SMC5, SMC6, SPO11, TOP3A, TOP3B, UIMC1, WRN
Mismatch repair | EXO1, HMGB1, LIG1, MLH1, MLH3, MSH2, MSH3, MSH4, MSH5, MSH6, PCNA, PMS1, PMS2, RFC1, RFC2, RFC3, RFC4, RFC5, RPA1, RPA2, RPA3, RPA4
Non-homologous end joining | APLF, APTX, ATM, DCLRE1C, DNTT, LIG4, MDC1, MRE11A, NBN, PARG, PARP3, PNKP, POLB, POLL, POLM, PRKDC, RAD50, RNF168, RNF8, TP53BP1, XRCC2, XRCC3, XRCC4, XRCC5, XRCC6
Nucleotide excision repair | CDK7, CETN2, CUL3, CUL4A, CUL5, DDB1, DDB2, ERCC1, ERCC2, ERCC3, ERCC4, ERCC6, ERCC8, GADD45A, GADD45G, GTF2H1, GTF2H3, GTF2H4, GTF2H5, LIG1, LIG3, MNAT1, POLD1, POLD2, POLD3, POLD4, POLE, POLE2, POLE3, POLE4, POLK, POLR2A, POLR2B, POLR2C, POLR2D, POLR2E, POLR2F, POLR2G, POLR2H, POLR2I, POLR2J, POLR2K, POLR2L, RAD23A, RAD23B, RBX1, RFC1, RFC2, RFC3, RFC4, RFC5, RPA1, RPA2, RPA3, RPA4, TCEB1, TCEB2, TCEB3, UVSSA, XAB2, XPA, XPC, XRCC1
Translesion synthesis | HLTF, MAD2L2, PCNA, POLH, POLI, POLK, POLM, POLN, POLQ, RAD18, REV1, REV3L, SHPRH, UBE2A, UBE2B, UBE2N, UBE2V2, USP1, WDR48
Direct repair | ALKBH2, ALKBH3, MGMT


Table C11. Proteins involved in the DDR-associated pathways

DDR-associated pathways | Proteins
Checkpoint factors | ABL1, AMN1, ATM, ATR, BRCA1, BRCC3, CCNA1, CCNA2, CCNB1, CCNB2, CCNB3, CCNC, CCND1, CCND2, CCND3, CCNH, CDC25A, CDC25B, CDK2, CDK4, CDKN1A, CDKN2D, CHEK1, CHEK2, CLSPN, GADD45A, HUS1, HUS1B, KAT2A, MDC1, MMS22L, PER2, PER3, POLA1, RAD1, RAD17, RAD50, RAD9A, RAD9B, RBBP8, RFC2, RFC3, RFC4, RFC5, TIMELESS, TIPIN, TONSL, TOPBP1, TP53BP1, TP73, WEE1
Chromatin remodelling | ACTL6A, ACTR5, ACTR8, ARID1A, ARID1B, ARID2, BAZ1A, BRD7, CHRAC1, INO80, INO80B, INO80C, INO80D, INO80E, MCRS1, NFRKB, PBRM1, POLE3, RUVBL1, RUVBL2, SMARCA2, SMARCA4, SMARCA5, SMARCB1, SMARCC1, SMARCC2, SMARCD1, SMARCE1, TFPT
Chromosome segregation | NCAPD2, NCAPD3, NCAPG, NCAPG2, NCAPH, NCAPH2, PDS5A, PDS5B, RAD21, SMC1A, SMC1B, SMC2, SMC3, SMC4, STAG1, STAG2
P53 pathway | CCND1, CCNE1, CDK2, CDK4, CDKN1A, CDKN2A, MDM2, TP53, TP73
Telomere maintenance | ACD, ATRX, BLM, CTC1, DAXX, DCLRE1B, GAR1, MRE11A, NBN, NHP2, NOP10, OBFC1, PARP4, POT1, RAD50, SMG6, TELO2, TEP1, TERF1, TERF2, TERF2IP, TERT, TINF2, TNKS, WRAP53
Ubiquitin response | COPS2, COPS3, COPS4, COPS5, COPS6, COPS7A, COPS7B, COPS8, GPS1, MDM2, MDM4, RAD18, RNF168, RNF4, RNF8, SUMO1, SUMO3, SUMO4, UBA1, UBA2, UBB, UBC, UBE2B, UBE2N, UBE2NL, UBE2V2


Table C12. The ranks of FANCD2, FANCI, RAD51 and BRCA1 proteins of DNA-crosslinking damage repair for DNA-damage agents.

Ranks by ProTINA score
Compound | FANCD2 | FANCI | RAD51 | BRCA1
Camptothecin | 3527 | 3924 | 9057 | 506
Doxorubicin Hydrochloride | 2362 | 2013 | 206 | 1720
Etoposide | 4449 | 1614 | 1493 | 3272
Methotrexate | 148 | 96 | 551 | 262
Mitomycin C | 63 | 123 | 32 | 70
Trichostatin A | 328 | 279 | 2045 | 171

Ranks by DeMAND score
Compound | FANCD2 | FANCI | RAD51 | BRCA1
Camptothecin | 10372 | 9274 | 9209 | 3370
Doxorubicin Hydrochloride | 8671 | 6035 | 1939 | 9550
Etoposide | 8751 | 6230 | 7895 | 5153
Methotrexate | 8204 | 9546 | 317 | 10363
Mitomycin C | 10254 | 5439 | 170 | 9981
Trichostatin A | 10190 | 8145 | 9302 | 16

Ranks by DE
Compound | FANCD2 | FANCI | RAD51 | BRCA1
Camptothecin | 3457 | 3440 | 6030 | 836
Doxorubicin Hydrochloride | 2290 | 1753 | 545 | 1421
Etoposide | 10466 | 1926 | 1946 | 3314
Methotrexate | 342 | 628 | 1242 | 2124
Mitomycin C | 824 | 2115 | 886 | 2225
Trichostatin A | 1531 | 1118 | 3601 | 1142


List of Figures and Tables

FIGURES

Figure 1-1. An example of drug mechanism of action. In this example, the drug interacts with or binds to a protein, which subsequently causes the downstream genes that are regulated directly or indirectly by the protein to be differentially expressed. TFs are proteins that can bind to the regulatory regions of the DNA and regulate gene transcription. The non-directed edges indicate protein-protein interactions, while the directed edges indicate the regulation of gene expression (pointed arrows: gene activation, blunt arrows: gene repression) ...... 2

Figure 2-1. Workflow of gene target prediction using DeltaNet. The performance of DeltaNet in predicting known gene perturbations was evaluated using gene expression data of E. coli, S. cerevisiae and D. melanogaster...... 15

Figure 2-2. True positive rates of gene target predictions from DeltaNet, SSEM, MNI, and z-scores. The results of DeltaNet-LAR came from analyses using a threshold δr = 10%...... 18

Figure 2-3. Rankings of known TF targets of chemical compounds based on TF enrichment analysis of DeltaNet, SSEM, MNI, and z-score predictions. The TFs are ranked according to (a) the adjusted p-values of YEASTRACT for the yeast dataset and (b) the combined enrichment scores of Enrichr for the human MCF-7 dataset...... 21

Figure 3-1. A simple three-gene network with a perturbation to the upstream gene (gene A). The nodes represent genes, while arrows represent gene regulations. (a) The dynamics of gene expression over three time points due to a perturbation on gene A. Genes showing differential expression are drawn as filled nodes. (b) Inference of the GRN from time series differential expression under the steady-state assumption. Data at each time point are viewed as independent steady-state samples. The differential expressions are consistent with a GRN with the reverse regulatory interactions under time-varying perturbations...... 25

Figure 3-2. True positive rates of target predictions by DeltaNeTS, DeltaNet, TSNI and z-scores for the in silico time series dataset, using (a) uniform time sampling and (b) non-uniform time sampling, and for (c) the yeast time series dataset...... 32

Figure 3-3. True positive rates of target predictions by DeltaNeTS and DeltaNet for (a) time series samples and (b) steady-state samples using yeast microarray datasets...... 33

Figure 3-4. Enriched pathways resulting from gene set enrichment analysis of DeltaNeTS predictions using Reactome pathways. The size and color of dots indicate the negative base-10 logarithm of p-values for the enriched terms. The influenza infection period was divided into 3 time phases: phase 1 = 0-7 hours, phase 2 = 7-18 hours, and phase 3 ≥ 18 hours post-infection...... 36

Figure 4-1. Protein target prediction by ProTINA. (a) The protein-gene network (PGN) describes direct and indirect regulations of gene expression by TFs and their protein partners (P), respectively. A drug interaction with a protein is expected to cause differential expression of the downstream genes in the PGN. (b) Based on a kinetic model of the gene transcriptional process, ProTINA infers the weights of the protein-gene regulatory edges, denoted by akj, using gene expression data. The variable akj describes the regulation of protein j on gene k, where the magnitude and sign of akj indicate the strength and mode (+akj: activation, -akj: repression) of the regulatory interaction, respectively. (c) A candidate protein target is scored based on the deviations in the expression of downstream genes from the PGN model prediction (Pj: log2FC expression of protein j, Gk: log2FC expression of gene k). The colored dots in the plots illustrate the log2FC data of a particular drug treatment, while the lines show the predicted expression of gene k by the (linear) PGN model. The variable zk denotes the z-score of the deviation of the expression of gene k from the PGN model prediction. A drug-induced enhancement of protein-gene regulatory interactions is indicated by a positive (negative) zk in the expression of genes that are activated (repressed) by the protein (i.e. akjzk > 0). Vice versa, a drug-induced attenuation is indicated by a negative (positive) zk in the expression of genes that are activated (repressed) by the protein (i.e. akjzk < 0). (d) The score of a candidate protein target is determined by combining the z-scores of the set of regulatory edges associated with the protein in the PGN. A positive (negative) score indicates a drug-induced enhancement (attenuation). The larger the magnitude of the score, the more consistent are the drug-induced perturbations (enhancement/attenuation) on the protein-gene regulatory edges...... 42

Figure 4-2. Prediction of known targets of drugs. AUROCs of protein target predictions from ProTINA, DeMAND and DE methods for the NCI-DREAM drug synergy (human B-cell lymphoma), the compound genotoxicity (human HepG2) and the chromatin targeting study (mouse pancreatic cell) datasets (*: p-value < 0.01, **: p-value < 0.001 by paired t-test)...... 52

Figure 4-3. Canonical p53 DNA damage response pathway. In response to DNA damage, GADD45A, CDKN1A, and PCNA are activated, while AURKA, CCNB1, and PLK1 proteins are inhibited [19]...... 53

Figure 4-4. Mechanism of action of compounds based on target predictions by ProTINA. (a) The rank distribution of the canonical p53 DNA damage response proteins in the drug target predictions of ProTINA, DeMAND and DE for the NCI-DREAM drug synergy dataset. (b) The rank distribution of proteins involved in the core DNA-damage repair (DDR) and DDR-associated pathways [96] in the target predictions of ProTINA, DeMAND, and DE for the DNA damaging compounds in the NCI-DREAM drug synergy study (**: p-value < 0.001 by Wilcoxon signed rank tests)...... 54

Figure 4-5. Prediction of targets of influenza A virus. The receiver operating characteristic curves give the true positive rate versus the false positive rate relationship of the protein target predictions from ProTINA, DeMAND, and DE against proteins that co-immunoprecipitate with influenza A viral proteins. The AUROCs for ProTINA, DeMAND and DE analysis are 0.758, 0.687 and 0.647, respectively...... 55

Figure 4-6. Gene set enrichment analysis for KEGG pathways for the influenza A protein target predictions from ProTINA, DeMAND, and DE. The size and color of dots correspond to the –log10 of the q-values. Only pathways with q-value < 0.01 are shown...... 57

APPENDIX FIGURES

Figure A1. True positive rates of gene target predictions using DeltaNet-LAR until completion and with δr = 1, 5, 10 and 20% for E. coli, S. cerevisiae and D. melanogaster datasets...... 78

Figure A2. Comparison of true positive rates using gene ranking sorted by pki magnitudes and by q-values from DeltaNet. The analyses using DeltaNet-LAR were performed using δr=1%...... 83

Figure A3. Comparison of the average numbers of nonzero coefficients in ak, the row vector of matrix A, inferred by DeltaNet-LASSO and SSEM over (a) all genes and (b) target genes predicted within the top 10 only by DeltaNet-LASSO. The numbers of target genes within the top 10 predictions by DeltaNet-LASSO but not by SSEM were 10, 15, and 13 for the E. coli, yeast, and Drosophila data sets, respectively...... 84

Figure A4. Comparison of true positive rates for gene target predictions using DeltaNet-LAR (δr=10%) with and without time-series data for E. coli and yeast. Since the Drosophila dataset consists of only steady-state data, we compared the results only for E. coli and S. cerevisiae...... 85

Figure B1. Enriched pathways resulting from gene set enrichment analysis of log2FCs for Reactome pathway terms. The size and color of dots indicate the negative base-10 logarithm of p-values for enriched terms. Each influenza infection was divided into 3 time phases: phase 1 = 0-7 hours, phase 2 = 7-18 hours, and phase 3 = more than 18 hours post-infection...... 86

Figure C1. Hierarchical clustering of the protein scores by ProTINA, DeMAND, and DE analysis for 14 drugs of the NCI-DREAM drug synergy challenge data set. Labels in red show DNA damaging agents. The hierarchical clustering of ProTINA prediction was based on Pearson ...... 101

TABLES

Table 2-1. Computational times of DeltaNet, SSEM and MNI ...... 19

Table 2-2. AUROC and AUPR of DeltaNet, SSEM, MNI and z-scores...... 19

Table 3-1. AUROCs and AUPRs of DeltaNeTS, TSNI, DeltaNet, and z-scores ...... 33

Table 3-2. Transcription factor binding sites enriched in the top 50 target predictions ...... 35

APPENDIX TABLES

Table A1. Ranking of known TFs based on the p-values of TF enrichment analysis using the top 100 gene target predictions by DeltaNet, SSEM, MNI and z-scores in the yeast compendium...... 79

Table A2. Statistical significance of the median differences in the rankings of enriched TFs between two methods. The p-values were calculated by the Wilcoxon rank sum test in MATLAB...... 80

Table A3. Ranking of TFs enriched among the top 100 gene target predictions by DeltaNet, SSEM, and z-scores in the MCF-7 compendium from C-Map...... 81

Table C1. Known targets for NCI-DREAM drug synergy challenge compounds ...... 87

Table C2. Known targets for HepG2 genotoxicity study compounds ...... 89

Table C3. Known targets for mouse chromatin targeting study compounds ...... 93

Table C4. AUROCs of target predictions for NCI-DREAM drug synergy challenge dataset. Compounds without known targets are excluded...... 95

Table C5. AUROCs of target predictions for the compound genotoxicity study (human HepG2). Compounds without known targets are excluded...... 96

Table C6. AUROCs of target predictions for the chromatin targeting study (mouse pancreatic α- and β-cells). Compounds without known targets are excluded...... 97

Table C7. Ranks and signs (+: enhancement, -: attenuation) of ProTINA target scores of the canonical DNA-damage response proteins...... 98

Table C8. Ranks of DeMAND network dysregulation scores of the canonical DNA-damage response proteins. DeMAND scores are strictly positive...... 99

Table C9. Ranks and signs (+: enhancement, -: attenuation) of log2 fold-change gene expressions of the canonical DNA-damage response proteins...... 100

Table C10. Proteins involved in the overall DNA damage repair (DDR) pathways ...... 102

Table C11. Proteins involved in the DDR-associated pathways ...... 103

Table C12. The ranks of FANCD2, FANCI, RAD51 and BRCA1 proteins of DNA-crosslinking damage repair for DNA-damage agents...... 104