Data-driven approaches to understand development, diseases and identify therapeutics

A dissertation presented by

Yunguan Wang

M.S., University of Cincinnati, United States

B.E., Dalian University of Technology, China

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy

in Pathobiology and Molecular Medicine in the College of Medicine of the University of Cincinnati, Ohio

Committee:

Bruce J. Aronow, PhD, Chair

Anil G. Jegga, DVM, MRes

Jeffrey A. Whitsett, MD

Harinder Singh, PhD

Kathryn A. Wikenheiser-Brokamp, MD, PhD

Abstract

Recent technological advances in the biomedical, genomic, and computational fields have brought exponential growth in both the amount and accessibility of biological data. These include health records, medical imaging data, and omics data including genomics, proteomics, metabolomics, phenomics, disease, and small molecule data. The resulting biological big data presents both great opportunities and great challenges. For example, the large amount of heterogeneous data not only allows researchers to pursue investigations in health and disease from an unprecedentedly wide perspective but also enables novel discoveries that were previously obscured by the lack of comprehensive and in-depth analysis.

In this dissertation, I use various data-driven approaches to generate testable hypotheses and actionable biological insights related to lung development, disease, and candidate therapeutic discovery. In the first part of this thesis, I used unsupervised machine learning to identify novel cell types and subtypes in the developing mouse lung from embryonic day (E) 16 to postnatal day (PND) 28 and discovered functionally distinct gene modules associated with each cell population. These gene modules are analyzed further to identify their roles in lung development, specifically, how they contribute to cell-cell communication during lung development. In the second part of the thesis, I focus on a lethal rare lung disease, idiopathic pulmonary fibrosis (IPF), and identify molecular signatures that explain not only the heterogeneous nature of IPF but also the potential molecular basis of disease severity. I analyzed a large cohort of IPF data and identified clinically significant subgroups using only transcriptomic data. Lastly, I combined connectivity mapping and systems biology-based approaches to identify and prioritize candidate therapeutics for another rare lung disorder, cystic fibrosis (CF). We identified PP-2, a Src kinase inhibitor, as a novel CFTR modulator that could potentially correct F508del-CFTR in CF. We validated our finding in various in vitro cell-based assays.

In summary, I developed a pipeline for single-cell RNA-seq analysis, a generalizable workflow for connectivity-based drug screening, and a web database for single-cell data. With these tools and methods, I identified potential therapeutic drugs for CF and drug repurposing candidates for IPF.

Keywords: Machine learning, Data mining, Lung development, Idiopathic pulmonary fibrosis, Cystic fibrosis, Drug discovery, Single-cell, IPF subgroup, CFTR corrector.

Chapters 3 and 4 of this dissertation are adaptations of manuscripts published in BMC Pulmonary Medicine and bioRxiv (pre-print), respectively. Both are under the Creative Commons Attribution License (CC BY), which grants users the right to unrestricted dissemination and re-use of the work, as long as proper attribution is given to the original authors.

Acknowledgements

First, I would like to express my deepest gratitude to my mentors, Dr. Bruce Aronow and Dr. Anil Jegga, for giving me guidance and encouragement throughout my graduate studies. Dr. Aronow’s unparalleled enthusiasm for science and insightful vision in systems biology have been my source of inspiration and enlightenment. I am particularly grateful for Dr. Aronow’s confidence in me, without which I would not be where I am today. Dr. Jegga introduced me to the field of translational bioinformatics; his remarkable understanding of modern drug discovery and his resourcefulness ensured the success of our research projects. In addition, Dr. Jegga provided insightful advice and generous support for my career development in many forms, and for that, I will be forever grateful. It has truly been an honor to work with these two mentors in the BMI department at Cincinnati Children’s Hospital.

I also want to express my sincere gratitude to Dr. Jeffrey Whitsett, Dr. Kathryn Wikenheiser-Brokamp and Dr. Harinder Singh for serving on my dissertation committee and providing valuable advice on various projects during presentations and meetings. I would also like to thank Dr. David Askew, who was my first-year advisor. Dr. Askew’s advice and encouragement were among the driving forces that motivated me to make the transition from experimental biology to computational biology.

Thanks to Dr. Naren’s team for providing experimental validation of our drug candidates and Dr. Madala’s team for their valuable insights. Thanks to Dr. Jing Chen, Phil Dexheimer, Eric Bardes and Scott Tabar in the BMI department for advice on data analysis and programming. I enjoyed working with Eric Bardes on the ToppCell database, which proved to be both educational and fun.

I have been very lucky to be part of the Pathobiology and Molecular Medicine graduate program at the College of Medicine at the University of Cincinnati. I am grateful for all the feedback, suggestions, and jokes from faculty and students during seminars, classes and hallway chats.

Special thanks to Ms. Heather Anderson, the best program coordinator, for her always prompt replies to my questions and concerns, and for being an incredible listener and friend.

Finally, I wish to thank my parents, Wang Bing and Gao Shuling, for their love, support, guidance and continued encouragement throughout the years. They ensured that I got the best education possible while also providing enough freedom for me to explore my interests to the fullest. I wish to thank my son Jifeng for the joys he continues to bring me. I would also like to thank my wife, Dan Wang, for her love, care and understanding during the ups and downs in these five years. This dissertation could not have been completed without the support of my family.

Table of Contents

Abstract ...... i

Acknowledgements ...... iv

Table of Contents ...... vi

List of Figures ...... xi

List of Tables ...... xv

1. Chapter 1. Introduction ...... 1

1.1. Overview of lung development and key questions to be answered ...... 1

1.2. Understand Lung development at the single cell level ...... 3

1.3. Pathobiology and heterogeneity in idiopathic pulmonary fibrosis ...... 5

1.4. In silico drug screening methods for disease therapeutics ...... 8

2. Chapter 2. Single-cell-Based Gene Module Analysis of Differentiation Pathways and Cell Interactions that Drive Lung Development in Mouse ...... 11

2.1. Introduction ...... 11

2.2. Methods and Materials ...... 13

2.2.1. Isolation of Single Lung Cells for RNA-Seq Analysis ...... 13

2.2.2. Single-cell RNA-Seq Analysis Pipeline ...... 13

2.2.3. Reiterative Cell Type and Subtype Identification ...... 14

2.2.4. Cluster Validation Based on Functional Enrichment of Cluster Derived Genes ...... 15

2.3. Results ...... 16

2.3.1. An iterative single-cell data analysis pipeline for cell population detection, cell-population-specific signature gene prioritization, and cell identity learning ...... 16

2.3.2. In-depth analysis of mouse distal lung cell types and subtypes at E16.5 ...... 20

2.3.3. Extracting biological insights from predicted cell populations ...... 32

2.3.4. The cellular landscape of developing lung from E16.5 to PND28 ...... 38

2.3.5. ToppCell: a database for modular single-cell data analysis ...... 39

2.4. Discussion ...... 43

3. Chapter 3. Unsupervised gene expression analyses identify IPF-severity correlated signatures, associated genes and biomarkers ...... 47

3.1. Abstract ...... 49

3.2. Background ...... 51

3.3. Methods ...... 52

3.3.1. Cohort selection ...... 52

3.3.2. Clustering, principal component analysis (PCA), and differential expression analysis ...... 54

3.3.3. Validation of IPF gene sets with logistic classifiers ...... 54

3.3.4. Functional enrichment analysis and candidate gene prioritization ...... 55

3.4. Results ...... 55

3.4.1. Gene expression profiles of UIP/IPF patients are highly heterogeneous and are not consistent within clinical FVC or DLCO categories ...... 55

3.4.2. Clustering analysis identifies UIP/IPF patient subgroups correlating with IPF-severity ...... 57

3.4.3. Functional characterization of IPF subgroups ...... 60

3.4.4. Validation of IPF subgroups with independent IPF cohorts ...... 62

3.4.5. Functional prioritization of novel IPF-associated genes ...... 65

3.4.6. Prioritization of putative bronchoalveolar lavage fluid biomarkers for IPF ...... 67

3.5. Discussion ...... 68

3.6. Conclusions ...... 70

3.7. List of abbreviations ...... 71

3.8. Availability of data and material ...... 71

4. Chapter 4. PP-2, a src-kinase inhibitor, is a potential corrector for F508del-CFTR in cystic fibrosis ...... 72

4.1. Abstract ...... 74

4.2. Background ...... 75

4.3. Methods ...... 77

4.4. Results ...... 81

4.4.1. Identifying candidate anti-CF small molecules through integrated gene expression profiling and systems biology ...... 81

4.4.2. Screening of candidate compounds in mouse ΔF508/ΔF508-cftr enteroids for mutant CFTR functional rescue ...... 83

4.4.3. Screening of PP2 and its non-src-kinase-inhibitor analog PP3 using enteroids derived from CF patients (ΔF508/ΔF508-CFTR) ...... 85

4.4.4. PP2 corrected ΔF508-CFTR mutant protein ...... 86

4.4.5. Computational modeling of ∆F508-CFTR and PP2 ...... 87

4.4.6. Characterization of PP2 as candidate therapeutic for CF ...... 88

4.5. Discussion ...... 91

4.6. Conclusions ...... 93

4.7. Abbreviations ...... 94

4.8. Declarations ...... 94

4.9. Authors' contributions ...... 95

4.10. Acknowledgement ...... 96

5. Chapter 5. Integrative in silico screening of candidate therapeutic discovery for Idiopathic pulmonary fibrosis ...... 97

5.1. Introduction ...... 97

5.2. Methods ...... 98

5.2.1. Cohort selection ...... 98

5.2.2. Differential analysis of IPF gene expression datasets ...... 99

5.2.3. Permutation analysis to estimate significance of drug-disease connectivity across datasets and cell lines ...... 99

5.3. Result ...... 101

5.3.1. Differential analysis of 8 IPF datasets ...... 101

5.3.2. Connectivity analysis and permutation tests ...... 101

5.3.3. Prioritization based on drug targets dysregulated in IPF ...... 103

5.4. Discussion ...... 106

Summary and future directions ...... 107

References ...... 108

List of Figures

Figure 2-1: Schematic of single-cell RNA-seq workflow...... 17

Figure 2-2: Heatmap overview of seven major lung cell populations and associated subpopulations...... 19

Figure 2-3: Representative functional enrichment terms and genes of each cell class...... 20

Figure 2-4: Expression of 36 AT1 or AT2 marker genes and 32 cell cycle genes in five epithelial subtypes...... 22

Figure 2-5: Overview heatmap of E16.5, E18.5 and P107 epithelial cell subtypes...... 24

Figure 2-6: Representative functional enrichment terms of each epithelial subtype...... 26

Figure 2-7: Heatmap overview of fibroblast populations...... 28

Figure 2-8: Representative functional enrichment terms of each fibroblast population...... 30

Figure 2-9: Minimal spanning tree (MST) of five epithelial subtypes...... 32

Figure 2-10: AT1 and AT2 markers in MST and in confocal images...... 33

Figure 2-11: Representative predicted marker gene for each epithelial subtype...... 34

Figure 2-12: Interaction networks of adhesion molecules expressed by lung cell types...... 35

Figure 2-13: Epithelial growth factor interaction network...... 36

Figure 2-14: Epithelial growth factor receptor interaction network...... 37

Figure 2-15: Endothelial growth factor interaction network...... 37

Figure 2-16: Endothelial growth factor receptor interaction network...... 37

Figure 2-17: Legend of growth factor/receptor interaction network...... 38

Figure 2-18: ToppCell Navigation page ...... 40

Figure 2-19: Interactive heatmaps and available analysis ...... 41

Figure 2-20: Expression profile of a custom gene list visualized by ToppCell...... 42

Figure 2-21: Cell-cell interaction analysis by ToppCell...... 43

Figure 3-1: Gene expression profiles in lung tissues taken from IPF patients were highly heterogeneous...... 56

Figure 3-2: Principal Component analysis (PCA) plot characterized separation of IPF sample by three grouping methods...... 57

Figure 3-3: UIP/IPF patient subgroups stratified by disease severity by FEV1, FVC and DLCO have different gene expression profiles...... 58

Figure 3-4: Unsupervised clustering followed by differential analysis recovered almost all the DEG identified by other methods and discovered additional DEG...... 59

Figure 3-5: Comparison of DEGs of each patient cluster revealed genes commonly dysregulated in IPF and genes associated with severe lung function decline...... 61

Figure 3-6: Enriched biological processes in each gene category revealed commonly and high-severity-associated biological pathways perturbed in IPF...... 62

Figure 3-7: Heat maps of Core and advanced IPF gene set...... 64

Figure 3-8: The core IPF gene set robustly differentiated IPF patients from normal controls in three independent validation cohorts...... 65

Figure 3-9: The advanced IPF gene set could differentiate end-stage IPF but not AEIPF from usual IPF...... 65

Figure 3-10: Performance of Logistic Regression Classifier built on up to 50 top ranked putative BALF biomarkers...... 68

Figure 4-1: Schematic representation of workflow to identify CF correctors...... 82

Figure 4-2: FIS in ΔF508/ΔF508-cftr intestinal organoids...... 84

Figure 4-3: PP2 Dose response experiments in mice...... 85

Figure 4-4: Preclinical validation of PP2...... 86

Figure 4-5: CF pathway enrichment network for genes that are reciprocally connected in CF and PP2 treatment...... 89

Figure 4-6: PP2 reversed the expression profiles of genes dysregulated in F508del-CFTR bronchial epithelial cells...... 90

Figure 4-7: Functional enrichment network of differentially expressed genes following treatment with PP2 (2 µM; 48 h) in human CFBE (ΔF508/ΔF508-CFTR)...... 91

Figure 5-1: Workflow of integrated in silico drug screening for IPF...... 98

Figure 5-2: Heatmap of compounds with positive IPF connectivity reported at least once...... 102

Figure 5-3: Heatmap of 82 compounds that were significantly connected with IPF...... 103

Figure 5-4: Network of 31 drug candidates whose targets were dysregulated in IPF...... 104

List of Tables

Table 3-1: Patient demographics and clinical characteristics of the LGRC IPF cohort...... 53

Table 3-2: Top 10% prioritized genes in the core and advanced IPF gene sets...... 66

Table 5-1: Summary of 8 datasets comparing IPF lung tissue with healthy controls

Table 5-2: Literature annotation for 24 drug candidates based on PubMed search

1. Chapter 1. Introduction

1.1. Overview of lung development and key questions to be answered

The lung is one of the most important organs in terrestrial animals because of its indispensable function in gas exchange. Gas exchange is the simultaneous diffusion of oxygen from inhaled air in the lung airspaces into the bloodstream and the movement of carbon dioxide from the bloodstream into the airspaces, where it mixes with the oxygen-depleted air and is subsequently exhaled. The lung's structure optimizes gas exchange through: (1) a massive number of alveoli residing in the distal lung, about 700 million with roughly 75 m² of total surface area (1), and (2) an extensive vascular system accompanying these alveoli that brings blood close to the gas exchange interface.

The process of lung development can be divided into distinct stages based on various criteria: for example, embryonic, fetal and postnatal stages based on developmental time, or embryonic, pseudoglandular, canalicular, saccular and alveolarization stages based on morphology (2). Lung development depends primarily on two distinct biological mechanisms occurring at different stages, namely branching morphogenesis and alveolarization. Branching morphogenesis is a process that remodels an epithelial or endothelial sheath into a branched, tubular, tree-like structure. This process is responsible for generating the complex airway networks in the lung, and it also drives the development of other organs and tissues such as the mammary glands and the kidneys (3). Alveolarization is a process in the lung where terminal airspaces are divided into many smaller compartments (alveoli), resulting in a great increase in the surface area of the lung (4). In this work, we will mainly use the morphology-defined lung development stages.

During the embryonic stage, the specification of the lung is driven by a ventral dominant expression of the transcription factor Homeobox Protein Nkx-2.1 (Nkx2.1) (5). Morphologically, this is demonstrated as the separation of anlage of the trachea and two lung buds from the foregut. Notably, the two lung buds develop separately from the foregut rather than branching from a common lung bud (2).

Following the embryonic stage, during the pseudoglandular stage, the two lung buds undergo extensive branching morphogenesis to form a tubular, gland-like bronchial tree. In this process, the lung tubules extend via directed division of proximal epithelial cells, and the terminal epithelial cells invade the surrounding lung mesenchyme and bifurcate into new branches (2, 6). Lung branching morphogenesis is highly stereotypical and depends on Fgf10 signaling, whose expression is region-specific and regulated by Bmp4 and Sonic hedgehog (Shh) (7-9).

During the canalicular stage, the airway undergoes further morphological changes that involve epithelial cells and mesenchymal cells. Lung epithelial cells at this stage begin to show the morphology of mature alveolar epithelial cells. Some epithelial cells become alveolar type I (AT1) cells, which are squamous and form a thin lining around the surface of alveolar ducts and sacculi. Another population of epithelial cells differentiates into alveolar type II (AT2) cells, which are cuboidal cells located between AT1 cells and often carry surfactant-rich lamellar bodies. Extensive angiogenesis in the mesenchyme between airways also occurs at this stage, leading to a multiplication of capillaries surrounding sacculi. The distance between capillaries and airways is reduced, leading to the formation of the future air-blood exchange interface (6).

After the canalicular stage, the biological processes in the developing lung start to transition from branching morphogenesis to alveolarization, and this intermediate stage is called the saccular stage (10). At this stage, the terminals of the airway (acini) expand and form large airspaces named sacculi. Primary septa are also formed during this stage as a result of condensing mesenchyme between two expanding sacculi (11).

The last stage of lung development is the alveolarization stage, featuring development of alveoli and maturation of vasculature in the septa. Alveoli develop through the process of “septation”, where new septa (secondary septa) are formed from primary septa derived from the saccular stage. Septation starts with accumulation of smooth muscle precursor cells, elastic fibers and collagen in the primary septa at the sites of future septa (12). Then, “upfoldings” form at these sites and subdivide preexisting airspaces into separate compartments, which are called alveoli. Maturation of the microvasculature continues at this stage and forms a thin single-layered interface between alveoli. Septation can also occur on preexisting secondary septa to further divide alveoli (6).

1.2. Understand Lung development at the single cell level

Recent technological advances have rapidly broadened our understanding of the molecular biology of lung development through the integration of data-driven approaches and systems biology methods, and among these, expression profiling is a major contributor of new knowledge and insights (13). RNA analyses of bulk lung tissue prior to birth (14-16) provide a global understanding of cellular differentiation, pathways and gene networks regulating lung maturation; however, an understanding of how intercellular interactions drive organ-level tissue morphogenesis is lacking. Since the mature and developing lung consists of dozens of different cell types, the use of pooled cell populations limits insights into the specialized gene programs and mechanisms controlling cell lineages, interactions, and functions during organ formation. More recent reports explored pneumocyte differentiation with single-cell RNA-seq data of the lung and shed additional light on the roles of individual cell types during lung development. Analysis of single-cell RNA-seq data derived from CD45-Epcam+ cells identified a common bipotent epithelial progenitor that expresses both AT1 and AT2 markers (17). This study is notable because it was the first to show that information on intermediate progenitor cells, lineage, and potential gene markers can be learned by mining single-cell transcriptional profiles at multiple developmental time points. Another study of postnatal AT1 cells using single-cell RNA-seq data identified two distinct populations of AT1 cells separable by the expression of insulin-like growth factor-binding protein 2 (Igfbp2); the Igfbp2-positive subset represents the terminally differentiated AT1 cells, which remained non-proliferative and maintained their AT1 phenotype during alveolar regeneration (18). This study provided a new marker for the terminally differentiated, ‘mature’ AT1 cells and raised a further question about the Igfbp2-negative AT1 cells. Given the recent finding that Hopx+ alveolar epithelial cells could differentiate into both AT1 and AT2 cells (19), the role of such AT1-like progenitor cells during alveolarization and lung regeneration remains to be elucidated.

Despite these advances in understanding epithelial development in the lung, information on the molecular profiles of other cell types (including endothelial cells, fibroblasts, pericytes, neuronal cells, myeloid and lymphoid cells) during lung development is still limited. Such knowledge is important for understanding the genes and pathways that regulate different processes in lung development, and how molecular and physical interactions among the cells and their microenvironment guide formation of individual functional units of the lung such as alveoli and capillaries. Motivated by this gap in knowledge, the molecular atlas of the developing lung (LungMAP) consortium was formed by a set of research centers with the goal of collecting and sharing detailed structural and molecular data regarding normal perinatal and postnatal lung development in the mouse and human (20). The LungMAP consortium has so far collected a massive amount of data across a broad scope: (1) imaging data including H&E staining, quantitative multicolor immunohistochemistry, in situ hybridization, 3-dimensional (3D) confocal imaging and mass spectrometry; (2) molecular profiles including genomic, proteomic and lipidomic profiles; and (3) mRNA, miR, and methylation profiles. Among these tasks, the single-cell RNA-seq profiling of the developing lung is carried out at CCHMC, and the analysis and interpretation of these single-cell data are discussed in the second chapter of this thesis.

1.3. Pathobiology and heterogeneity in idiopathic pulmonary fibrosis

Idiopathic pulmonary fibrosis (IPF) is a chronic and fatal fibrotic lung disease that occurs primarily in people over 50 years old, and it is estimated to affect 14 to 42.7 per 100,000 people (21). This disease is characterized by progressive subpleural and paraseptal fibrosis, heterogeneous honeycomb cysts (honeycombing), and clusters of fibroblasts and myofibroblasts (22). The median survival time of patients with IPF is 2.5 to 3.5 years, with a 5-year survival rate of around 20% (21). Currently, two anti-fibrotic drugs, pirfenidone and nintedanib, are approved for IPF for their efficacy in slowing the lung function decline caused by disease progression. However, neither of the two drugs can restore lost lung function (23, 24). For advanced IPF, lung transplantation is the only treatment option that can prolong survival.

Despite recent advances in the understanding of IPF, the exact cause of IPF remains unclear (25, 26). The onset of IPF has been associated with environmental hazards including cigarette smoke, dusts (metal, wood, silica) and viruses, as well as genetic factors including mutations in the SFTPC, MUC5B, TERT and TOLLIP genes (27). It is believed that IPF is a consequence of abnormal wound healing following recurrent micro-injuries to alveolar epithelial cells (25, 27). In IPF, micro-injury to epithelial cells leads to dysregulation of alveolar cell homeostasis, abnormal epithelial-fibroblast communication, activation of matrix-producing myofibroblasts, interstitium remodeling and ultimately fibrosis (27). Dysregulation of alveolar cell homeostasis is largely caused by failure of the regenerative capacity of AT2 cells. IPF AT2 cells suffer from premature telomere shortening, which leads to pulmonary fibrosis in mouse models (28). Single-cell RNA-seq data from IPF patients showed that IPF distal epithelial cells were very different from normal AT1 or AT2 cells and were sources of pro-fibrotic growth factors such as TGF-beta and platelet-derived growth factor (29). In addition, a subset of epithelial cells that resembled airway basal cells was also found in these IPF patients (29).

In addition to the complexity of the pathobiology of IPF, significant heterogeneity in disease progression exists among IPF patients. Typical IPF progression features a slow decline in lung function and worsening dyspnea, measured as a decline in forced vital capacity (FVC) of 0.13 L to 0.21 L annually (30). In IPF patients who were also heavy cigarette users, IPF progression was accelerated with a higher mortality rate (31). In some patients, IPF progresses rapidly, causing acute deterioration in respiratory function with no known cause (acute exacerbations of IPF); 5–10% of patients with diagnosed IPF suffer from acute exacerbations annually and die within months (22, 32). Sometimes, IPF progression within the same patient can fluctuate between relatively stable phases and periodic acute exacerbations (33). These differences in disease progression make predicting the clinical outcome of IPF very difficult.

Recent studies have connected various risk factors with IPF progression, partially explaining the heterogeneity observed in clinical studies. These risk factors fall into several categories, such as the immune system, fibroblasts, and biomarkers. Cytokines such as IL-13 and CXCL13 have been shown to be involved in IPF progression and poor prognosis (34, 35), and alternatively activated macrophages were shown to be dominant in IPF lungs and were associated with IPF progression (26). Fibroblast populations in IPF patients were also heterogeneous in their response to IL-6 and TLR-9 signaling (36, 37), and fibroblasts from rapidly progressing IPF accelerated the fibrotic response in mice compared with those from slowly progressing IPF (38). Recent advances in IPF research have led to the identification of various biomarkers that are associated with poor IPF prognosis or survival. These biomarkers include genetic factors, blood proteins, epigenetic profiles, bronchoalveolar lavage fluid (BALF) proteins and the lung microbiome (39).

Gene signatures derived from transcriptomic studies have also been reported to be highly heterogeneous in IPF patients. Comparison of gene signatures of healthy controls with ungrouped IPF patients revealed extensive genetic heterogeneity in the disease samples, and differential gene expression profiles in IPF subgroups have been reported in several studies (40-42). This calls for the development of computational approaches to resolve heterogeneity and identify IPF-specific transcriptomes that may help to predict disease progression. However, none of these works attempted to identify subgroups within IPF before applying differential expression analysis. This could bias the derived differentially expressed genes toward those more uniformly expressed across all IPF samples, thus limiting the power of such analysis to uncover potentially distinct molecular subtypes of IPF. We therefore reasoned that unsupervised machine learning approaches could be applied prior to differential gene expression analysis to facilitate recognition of potential IPF subgroups with novel gene signatures that have predictive or prognostic value. The methods and results of this project will be discussed in Chapter 3 of this thesis.
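The dilution effect described in this paragraph can be illustrated with a small numerical sketch. All expression values, the subgroup labels, and the helper name `welch_t` are hypothetical and chosen only for illustration; in practice the subgroup labels would come from an unsupervised clustering step rather than being known in advance, and this is not the analysis pipeline used in Chapter 3.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Hypothetical log2 expression values of a single gene (made-up numbers).
controls = [5.0, 5.2, 4.9, 5.1, 5.0, 4.8]
ipf_mild = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0]    # gene unchanged in this subgroup
ipf_severe = [7.0, 7.3, 6.8, 7.1, 6.9, 7.2]  # gene elevated only in this subgroup

# Ungrouped comparison: all IPF samples pooled against controls.
pooled_t = welch_t(ipf_mild + ipf_severe, controls)

# Subgroup-aware comparison: only the severe subgroup against controls.
severe_t = welch_t(ipf_severe, controls)

print(f"pooled IPF vs control: t = {pooled_t:.2f}")
print(f"severe IPF vs control: t = {severe_t:.2f}")
```

In this toy setting, the gene that is altered in only one subgroup yields a much weaker test statistic when all IPF samples are pooled, which is the kind of signal that subgroup identification prior to differential analysis is meant to recover.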

1.4. In silico drug screening methods for disease therapeutics

Drug discovery today is an expensive, time-consuming and resource-intensive process. According to recent studies, it takes about 15 years and more than 2 billion USD to bring a new drug to market (43, 44). On the other hand, there are currently around 7,000 rare diseases (45) affecting about 400 million people worldwide (46); however, only a few of these diseases have therapeutics. Taken together, these drive the need for innovative approaches to drug discovery that are not only less expensive but also less risky for pharmaceutical companies.

In recent years, technological advances in experimental and computational biology have resulted in several rapidly expanding genomic and biomedical databases. These include transcriptomic data (e.g., gene expression profiles from human patients, animal models of human diseases, small molecule treatments, etc.), protein-small molecule or protein-protein interactions, disease-associated phenotypes and drug-induced side effects (47-49). To cope with the increased availability of data, several computational approaches have also been developed to enable data analysis and the discovery of candidate genes and therapeutics (50). These methods can be classified, based on how the drug-disease links are defined, into: i) structural similarity-based approaches that screen for new drugs among candidates that are structurally similar to a known drug (51); ii) network-based methods that predict connections between drugs and diseases based on shared targets, pathways, disease phenotypes, drug indications and/or side effects (52-56); and iii) gene expression profile-based similarity, or connectivity mapping, which searches for drugs whose gene expression profiles are negatively correlated with the disease-associated gene expression profile, on the assumption that a drug that ‘reverses’ the disease gene expression could potentially relieve the disease by driving gene expression towards the normal state (57). My work is focused on connectivity mapping-based drug discovery.

The concept of connectivity mapping between a drug and a disease was first introduced as the

ConnectivityMap (CMap) platform encompassing more than 7000 microarray-based gene expression profiles from 1309 FDA-approved drugs (57). The similarity (connectivity) between a disease-derived gene expression signature (query set) and a drug-derived signature (reference set) is calculated by Kolmogorov-Smirnov statistic like algorithm (58, 59). Here, genes in the reference set need to be ranked based on their differential expression compared to the control in the order of the most up-regulated genes to the most down-regulated genes. The genes in the query set, however, do not have to be ordered. The connectivity score is calculated by comparing the query set against a reference set, and a positive connectivity score is given if the up-regulated genes in the query set are near the top of the reference set and the down-regulated genes in the query set are near the bottom of the reference set, and vice versa for negative connectivity score.

This process is repeated for each reference set to obtain a connectivity score between the query disease and each drug in the database, and the drugs are then ranked accordingly. Drugs at the bottom of the ranking are considered potential therapeutics for the disease, since they could potentially revert the disease gene expression profile back to its normal state.
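The scoring and ranking procedure described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the Kolmogorov-Smirnov-like statistic of the original CMap publication (57), without the normalization across reference profiles; the function names, gene identifiers, and the two toy drug profiles are hypothetical and are not taken from the CMap software.

```python
def ks_score(ranked_genes, tag_genes):
    """KS-like enrichment of tag_genes in a reference list that is ranked
    from the most up-regulated gene (rank 1) to the most down-regulated."""
    n = len(ranked_genes)
    position = {gene: i + 1 for i, gene in enumerate(ranked_genes)}
    ranks = sorted(position[g] for g in tag_genes if g in position)
    t = len(ranks)
    if t == 0:
        return 0.0
    a = max((j + 1) / t - v / n for j, v in enumerate(ranks))
    b = max(v / n - j / t for j, v in enumerate(ranks))
    return a if a > b else -b


def connectivity_score(ranked_genes, up_genes, down_genes):
    """Positive if the drug mimics the query signature, negative if it reverses it."""
    ks_up = ks_score(ranked_genes, up_genes)
    ks_down = ks_score(ranked_genes, down_genes)
    if ks_up * ks_down > 0:  # both tag lists enriched at the same end: no connection
        return 0.0
    return ks_up - ks_down


# Toy data (hypothetical): a disease signature and two drug reference profiles.
disease_up, disease_down = ["g1", "g2"], ["g9", "g10"]
reference = [f"g{i}" for i in range(1, 11)]
drug_profiles = {"mimic": reference, "reverser": reference[::-1]}

scores = {drug: connectivity_score(profile, disease_up, disease_down)
          for drug, profile in drug_profiles.items()}
# Most negative score first: these drugs are the candidate therapeutics.
ranked_drugs = sorted(scores, key=scores.get)
print(ranked_drugs, scores)
```

In this toy case the drug whose profile is the exact reverse of the disease signature receives the most negative score and is therefore ranked first.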

The CMap project rapidly gained popularity among the drug discovery community and has more than 18,000 active users today (60). Despite its success, the potential impact of the original CMap on drug repurposing research is inevitably limited by its narrow small molecule and cell line coverage: there are only 164 drugs in 3 cancer cell lines in the CMap database. Thus, a new platform for drug-disease connectivity mapping was developed, utilizing the L1000 high-throughput platform (61), as the successor of the CMap platform. The new ConnectivityMap platform, CLUE, holds more than a million L1000 profiles covering more than 2,800 small molecules in more than 72 cell lines (60). In Chapter 4, we will discuss a project focusing on screening small molecules for F508del CFTR correctors in cystic fibrosis using an integrative approach of connectivity mapping and systems biology.

2. Chapter 2. Single-cell-Based Gene Module Analysis of Differentiation Pathways and Cell Interactions that Drive Lung Development in Mouse

2.1. Introduction

Mammalian lung development is the process by which endodermally-derived epithelial cells interact with mesenchymal stromal cells to generate a vast air-blood interface of complex branched and tubuloepithelial structures surrounding a unique vascular network. This process has served as a strong model to unravel the molecular and cellular basis of multilineage-determined organ morphogenesis. However, uncertainty about the states and relationships among individually differentiating lineages has limited our ability to define their roles in driving the formation of specific cell types and tissue structure. Foregut endodermal cells serve as progenitors of the respiratory epithelial cells that produce distinct conducting and peripheral regions of the lung tubules during branching morphogenesis (7, 62). Mesenchymal and epithelial cells proliferate, migrate, and differentiate in a precisely timed and ordered manner along the proximal-peripheral axis of the lung as the conducting airways and peripheral saccules are formed (7). During late lung development, extensively branched bronchiolar tubules give rise to the acinar structures, which will develop into the acinar saccules. Prior to birth, alveolar epithelial progenitors differentiate into mature alveolar type 2 epithelial (AT2) cells, which synthesize and secrete surfactant proteins and lipids needed to reduce surface tension in the alveoli (63), and squamous type 1 (AT1) cells, which are in close apposition to capillary endothelial cells, creating the gas exchange region needed for efficient transport of oxygen and carbon dioxide at birth (64). As peripheral lung saccules dilate during late gestation and after birth, the interstitial mesenchyme thins to maximize gas-exchanging surface area and minimize the molecular distance between the air and blood. During early alveolarization, the septae still have a double capillary layer. After alveolarization is completed, epithelial cells, fibroblasts, and endothelial cells form a thin layer, laced together by an inflatable, fibro-elastic extracellular matrix (ECM). While mechanisms controlling early embryonic lung formation are increasingly understood (65-67), the cell and molecular processes mediating later stages of lung morphogenesis, as peripheral tubules sacculate and alveolar septae form, are less well defined. Processes forming and maintaining the peripheral lung are of considerable clinical interest related to lung maturation and function in preterm infants.

Recent advances in cell lineage tracing and RNA sequencing from single cells have provided new insights into the heterogeneity of cell types during lung development. Multiple subtypes of epithelial cells and fibroblasts have been identified in developing lung (17, 68, 69), as well as additional complexity in both function and marker gene expression of adult alveolar epithelial cells (AECs) (19, 70). In our previous study, we generated an RNA-seq expression profile of 148 single cells from the embryonic mouse lung in the canalicular phase (E16.5). These analyses identified major cell types including epithelial, endothelial and myeloid cells as well as several fibroblastic cell populations (69).

Heterogeneity in RNA expression was present within major cell types but was not adequate to precisely identify subtype clustering. In the present study, we extend cell numbers and provide a new analytic platform to define the heterogeneity of major cell types and to expand potentially under-represented cell populations. We also predict biological functions and interactions of cell populations through gene module enrichment and construction of a lung-specific interactome that predicts protein-protein interactions among these cell populations.

2.2. Methods and Materials

2.2.1. Isolation of Single Lung Cells for RNA-Seq Analysis

Lungs were dissected from time-mated E16.5 (CD-1) embryos and digested with 0.5% trypsin-EDTA. Single-cell suspensions were loaded onto a Fluidigm C1 10-17 chip and captured according to the manufacturer’s protocol. Chambers (96) were microscopically examined and those with single cells recorded. Cells were lysed for RNA isolation. RNA was dT primed for cDNA synthesis. cDNA products were harvested, diluted, barcoded, and ‘tagmented’ using Illumina Nextera, and sequenced on a single lane with Illumina HiSeq2500, generating approximately 250 million single-end 50-100 base pair reads per 96-well plate.

2.2.2. Single-cell RNA-Seq Analysis Pipeline

Three independent RNA-Seq experiments were performed, resulting in whole genome transcriptome profiles from 218 individual lung cells. An analytic pipeline was developed to identify unique gene expression signatures for lung cell types, as outlined in Figure 2-1. Briefly, the analytic pipeline consisted of four components: 1) transcript quantification using Kallisto v0.42.2.1 (71) with the UCSC transcript model from the mm10 genome, 2) gene-level count filtering and normalization, 3) reiterative cell type and subtype identification with cluster validation based on enrichment, and 4) prioritization of cell type- or subtype-specific genes and construction of cell-cell interactions. Items 3 and 4 are described further below.

2.2.3. Reiterative Cell Type and Subtype Identification

Cell clustering was carried out by a reiterative and resampling cluster analysis pipeline constructed in Python. Briefly, this pipeline works iteratively in the following 3 steps: dimension reduction; clustering and signature gene identification; and cluster validation.

Dimension reduction was performed either by principal component analysis (PCA) or by selecting known lung development genes. PCA was performed using the Scikit-learn package in Python, retaining the number of principal components needed to explain at least 80% of the variance in the data. Alternatively, a curated list of genes (training set) that are known or implicated by previous microarray and RNA-seq data to be involved in lung development was compiled and optionally expanded by adding genes in the expression data with the highest Pearson correlation to the training set. After dimension reduction by either approach, a subset of the original expression data was generated and then used in the following clustering step.
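The PCA option can be illustrated with Scikit-learn, which accepts a fractional `n_components` and then keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The snippet below is a sketch on synthetic data; the matrix dimensions, the random seed, and the variable names are assumptions made for the example, not the data or code used in this study.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic cells x genes matrix standing in for normalized expression values.
rng = np.random.default_rng(0)
expr = rng.normal(size=(60, 200))
expr[:30, :20] += 3.0  # give half of the cells a distinct expression program

# A float n_components (with the full SVD solver) keeps just enough
# principal components to explain more than 80% of the variance.
pca = PCA(n_components=0.80, svd_solver="full")
reduced = pca.fit_transform(expr)
explained = pca.explained_variance_ratio_.sum()

print(reduced.shape, round(float(explained), 3))
```

The reduced matrix, rather than the full expression matrix, would then be passed to the clustering step.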

Clustering of the single cells in the dimension-reduced expression data was carried out using the K-means and hierarchical clustering algorithms, with 1 minus the Pearson correlation between cells as the distance metric. Optionally, a second round of clustering was applied to single cells in a major class to divide them into biologically relevant and functionally distinct subclasses, which will be elaborated in the next section.
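As an illustration of the hierarchical option, SciPy's `correlation` metric is defined as 1 minus the Pearson correlation, so it matches the distance described above. The sketch below uses synthetic cells; the average linkage method and the choice of two clusters are assumptions for the example, since the text does not specify them.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Two synthetic cell groups with opposite expression programs plus noise.
rng = np.random.default_rng(1)
n_genes = 50
profile_a = np.r_[np.full(25, 5.0), np.zeros(25)]
profile_b = profile_a[::-1]
cells = np.vstack(
    [profile_a + rng.normal(scale=0.5, size=n_genes) for _ in range(10)]
    + [profile_b + rng.normal(scale=0.5, size=n_genes) for _ in range(10)]
)

# Pairwise distance between cells: 1 - Pearson correlation.
distances = pdist(cells, metric="correlation")
# Average linkage is an assumption; the source does not name a linkage method.
tree = linkage(distances, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")

print(labels)
```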

Following cell clustering, genes highly expressed in each clustered cell population were identified by integrating up to 4 gene rankings using the Rank Product algorithm. These rankings were based on the Wilcoxon rank-sum test p-values and expression fold change (FC) values comparing the target cell cluster or subcluster with other clusters or subclusters, respectively. Top-ranked genes from each cell population were then pooled and used as the input training set for the next clustering iteration.
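A simplified version of this ranking step is sketched below. It combines only two rankings (a one-sided rank-sum p-value and a log fold change) through the geometric mean of ranks, whereas the pipeline integrates up to four; the function name, the toy data, and the use of SciPy's Mann-Whitney U test (which is equivalent to the Wilcoxon rank-sum test) are assumptions of this example.

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata


def signature_genes(log_expr, in_cluster, gene_names, top_n=2):
    """Rank genes for one cluster by the rank product of two criteria."""
    target, rest = log_expr[in_cluster], log_expr[~in_cluster]
    # Criterion 1: one-sided rank-sum p-value per gene (smaller is better).
    pvals = np.array([
        mannwhitneyu(target[:, g], rest[:, g], alternative="greater").pvalue
        for g in range(log_expr.shape[1])
    ])
    # Criterion 2: log fold change, target vs. rest (larger is better).
    log_fc = target.mean(axis=0) - rest.mean(axis=0)
    ranks = np.vstack([rankdata(pvals), rankdata(-log_fc)])
    # Rank product expressed as the geometric mean of the per-criterion ranks.
    rank_product = np.exp(np.log(ranks).mean(axis=0))
    order = np.argsort(rank_product)
    return [gene_names[i] for i in order[:top_n]]


# Toy log-scale expression: 20 cells x 6 genes, gene0 is up in the cluster.
rng = np.random.default_rng(2)
log_expr = rng.normal(size=(20, 6))
in_cluster = np.arange(20) < 10
log_expr[in_cluster, 0] += 5.0
genes = [f"gene{i}" for i in range(6)]

top = signature_genes(log_expr, in_cluster, genes)
print(top)
```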

2.2.4. Cluster Validation Based on Functional Enrichment of Cluster Derived Genes

The biological relevance of each cell cluster and subcluster, which is used as the cluster validation score, is estimated from the functional enrichment results of the highly expressed genes derived from it. Functional enrichment analysis evaluates whether an input gene set contains significantly more members of a predefined group of genes, which summarizes a common function, pathway or phenotype, than expected by chance. There are many collections of such gene annotations, each emphasizing a different aspect of biological information, such as the Gene Ontology for biological function and localization, KEGG for pathways, and MGI Mammalian Phenotype for phenotypes. In this study, we limited the annotation terms to the “Gene Ontology Biological Process”, “Gene Ontology Cellular Component” and “Mouse Phenotype” categories.

One limitation of functional enrichment analysis is term over-representation, which needs to be addressed before enrichment results are used as a method of clustering validation.

Term over-representation is the phenomenon that a term is annotated with significantly more genes than the rest of the terms. To address the term over-representation problem, we counted the number of genes associated with each term for each annotation category and then removed terms whose gene counts deviated from the average of all term gene counts of the same category.

A total of 212 over-represented terms from the “Gene Ontology Biological Process”, “Gene Ontology Cellular Component” and “Mouse Phenotype” categories were removed.
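The filtering idea can be sketched as follows. The text does not state how far a term's gene count had to deviate from the category average to be removed, so the cutoff used here (mean plus two standard deviations) is purely an assumption, as are the function name and the toy term sizes.

```python
from statistics import mean, pstdev


def drop_overrepresented_terms(term_gene_counts, n_sd=2.0):
    """Keep terms whose gene count is not far above the category average.

    The cutoff of mean + n_sd * SD is an assumed rule for illustration only.
    """
    counts = list(term_gene_counts.values())
    cutoff = mean(counts) + n_sd * pstdev(counts)
    return {term: n for term, n in term_gene_counts.items() if n <= cutoff}


# Hypothetical gene counts per annotation term within one category.
toy_category = {
    "term_a": 10, "term_b": 12, "term_c": 11, "term_d": 9, "term_e": 10,
    "term_f": 13, "term_g": 8, "term_h": 11, "term_i": 10,
    "very_broad_term": 500,
}
kept = drop_overrepresented_terms(toy_category)
print(sorted(kept))
```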

After removing these over-represented terms, the overall biological relevance of cell clustering is calculated from significance scores of all meta-terms from all signature gene sets:

$$E = \sum_{c=1}^{M} \sum_{j=1}^{N_c} -\log_{10}\left(P_{cj}\right)$$

where $M$ is the number of cell populations, $N_c$ is the total number of enriched annotation terms associated with cell population $c$, and $P_{cj}$ is the enrichment p-value of term $j$ in cell population $c$.

A larger E indicates higher biological relevance of a given cell clustering result.
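The score is straightforward to compute once enrichment p-values are available for each cell population. The sketch below assumes the p-values are held in a dictionary keyed by population; the function name and the toy p-values are illustrative only.

```python
import math


def clustering_relevance(enrichment_pvalues):
    """E: sum of -log10(p) over all enriched terms of all cell populations."""
    return sum(
        -math.log10(p)
        for pvalues in enrichment_pvalues.values()
        for p in pvalues
    )


# Hypothetical enrichment p-values for two cell populations.
toy_pvalues = {
    "population_1": [1e-3, 1e-2],
    "population_2": [1e-5],
}
E = clustering_relevance(toy_pvalues)
print(E)
```

Because each term contributes a non-negative amount, a clustering that yields more enriched terms, or more significant ones, receives a larger E.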

2.3. Results

2.3.1. An iterative single-cell data analysis pipeline for cell population detection, cell-population-specific signature gene prioritization, and cell identity learning

We utilized single-cell RNA-seq to determine 1) emerging lineage subtypes, 2) relationships of these subtypes to cell cycle, and 3) differentiation states, as cells transition into more mature subtypes.

We designed an iterative cluster analysis pipeline that integrates principal component analysis (PCA), clustering, differential gene extraction algorithms, and cluster refining through functional enrichment (Figure 2-1). In brief, the pipeline works in the following steps. Step 1: obtain an initial segmentation of cells by PCA followed by hierarchical clustering. Step 2: extract the most differentially expressed genes (signature genes) of each cell population, and evaluate the biological relevance of the cell clustering based on the overall significance of enriched functional annotation terms derived from the signature genes (72). Step 3: perform additional rounds of clustering based on different ensembles of signature genes from each cell population from the previous clustering and repeat Step 2, keeping the new cell segmentation if it yields a higher overall significance of enriched functional annotation terms. Step 4: after optimization of the cell segmentation, repeat Steps 2 and 3 to search for substructures within any major cell population that improve statistical significance.
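The accept-if-better logic of Steps 2 and 3 can be summarized as a small generic loop. This is only a schematic: `propose` and `score` are hypothetical callables standing in for the re-clustering and enrichment-scoring stages, and the toy example uses integers in place of real cell segmentations.

```python
def refine_clustering(initial, propose, score, max_rounds=10):
    """Keep a newly proposed clustering only when its score improves."""
    best, best_score = initial, score(initial)
    for _ in range(max_rounds):
        candidate = propose(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins: a "clustering" is an integer and the score peaks at 3.
best, best_score = refine_clustering(
    initial=0,
    propose=lambda current: current + 1,
    score=lambda current: -(current - 3) ** 2,
)
print(best, best_score)
```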

Figure 2-1: Schematic of single-cell RNA-seq workflow.

Our pipeline identified seven major cell clusters and multiple subclusters within each. Genes that are highly selectively expressed in each cell population were prioritized by aggregating gene ranks from multiple ranking criteria examining both expression fold changes and p-values from differential analysis. A summary view of the cell classification and the associated top 200 ranked genes for each cell population showed highly distinct gene expression patterns between different cell clusters and, to a lesser extent, among subclusters (Figure 2-2).

Figure 2-2: Heatmap overview of seven major lung cell populations and associated subpopulations.

24 gene modules representing key genes of each cell population or sub-population are shown. Cell types and subtypes were determined by marker gene expression and enrichment analysis. From left to right: endothelial, epithelial, proliferative fibroblast, matrix fibroblast, proliferative fibroblast, myofibroblast, pericyte, and myeloid cells. Genes and cells are sorted by 2D hierarchical clustering. Each row represents a gene, and each column represents a single cell. Row side colored bar corresponds to gene modules. Column side colored or greyscale bar corresponds to cell types or subtypes, respectively. TPM, transcripts per million.

Modular enrichment analysis of the top 500 genes from each cell cluster and subcluster revealed both unique and shared functional modules among these diverse cell populations. For example, modules representing ECM, lung morphogenesis, respiratory tube development, and mesenchyme development were enriched in all but one cell cluster. On the other hand, modules involved in lung epithelium differentiation, endothelial differentiation, immune response, collagen trimers and elastic tissues were more selectively enriched (Figure 2-3). According to these enrichment results of associated gene modules, we named the seven major cell clusters endothelial, epithelial, myeloid, and four fibroblastic cell types termed matrix fibroblast, myofibroblast, pericyte, and proliferative fibroblast.

Figure 2-3: Representative functional enrichment terms and genes of each cell class.

Terms were selected from enrichment analysis of top 500 most differentially expressed genes of each cell class. Top ten of these genes that are also involved in lung, epithelium or vascular development were selected. Enrichment analysis was performed using the Toppgene Suite with a p-value threshold of 0.05 and FDR < 0.05. Representative genes were selected from representative enrichment terms.

2.3.2. Detailed analysis of mouse distal lung cell types and subtypes at E16.5

Distinct epithelial cell subtypes express genes enriched in distinct biological functions

The epithelial cell cluster consisted of 49 out of 218 individual cells that were marked by selective expression of Nkx2-1, Sox9, Ager, and Pdpn, and by large-scale adoption of the transcriptional machinery critical for epithelial specialization, including formation of bicellular tight junctions, apical-basal orientations, and ‘respiratory’ tube formation, Figure 2-2. A recent study identified three distal airway epithelial sub-populations from fetal mouse lung at E18.5 that are not present at E14 or E16 (17); however, in this study, we identified five epithelial cell subtypes with distinct gene expression patterns and functional enrichments at E16.5. These epithelial subtypes expressed distinct levels of E18.5 AT1, E18.5 AT2 (17) and cell cycle genes, Figure 2-4.

Figure 2-4: Expression of 36 AT1 or AT2 marker genes and 32 cell cycle genes in five epithelial subtypes.

Epithelial subtypes (left to right as shown in the bottom color bar) are: “Uncommitted AEC precursor”, “Proliferative AT2 like precursor”, “Non-cycling AT2 like precursor”, “Proliferative AT1 like precursor” and “Non-cycling AT1 like precursor”. Left color bar indicates the type of gene, AT1 (red), AT2 (blue), and cell cycle (green).

In order to gain deeper insights into these E16.5 epithelial subtypes, we compared their signature genes with those from E18.5 and P107 epithelial cells (64), Figure 2-5. ‘Non-cycling AT1 precursor’ (NcAT1) cells expressed the highest level of AT1 genes, including Ager, Cldn18, Emp2, Pdpn, and Hopx (73, 74). They also expressed a low level of AT2 genes but no cell cycle genes. The ‘proliferative AT1 progenitor’ (PlfAT1) cells were like NcAT1 cells in their AT1 and AT2 gene expression but moderately expressed cell cycle genes. ‘Non-cycling AT2 precursor’ (NcAT2) cells expressed the highest level of AT2 genes, including Sftpc, Sftpa1, Cxcl15, Lamp3, and Slc34a2, a low level of AT1 genes and no cell cycle genes. The ‘proliferative AT2 progenitor’ (PlfAT2) cells were similar to NcAT2 cells except for their highly expressed cell cycle genes. The ‘uncommitted AT progenitor’ (UcAT) cells expressed AT1, AT2, and cell cycle genes, indicating that they represent an intermediate epithelial population with both AT1 and AT2 cell characteristics. Our results demonstrate that at this developmental stage, epithelial cells are rather heterogeneous and the distinction between AT1 and AT2 cells is already prominent.

Figure 2-5: Overview heatmap of E16.5, E18.5 and P107 epithelial cell subtypes.

Each row corresponds to an individual gene, and each column corresponds to a single cell. Column side color bars represent cell types. From left to right: PlfAT2, NcAT2, PlfAT1, NcAT1, E18 AT1, E18AT2 and P107AT2 cell.

Modular enrichment analysis of epithelial signature genes revealed the distinct functional focus of E16.5 epithelial subtypes. Pathways such as ‘cell-cell junction organization’, ‘epithelial tube morphogenesis’ and ‘proteinaceous extracellular matrix’ were highly enriched in AT1 subtypes, while pathways like ‘glycerophospholipid metabolic process’, ‘secretory granule’ and ‘cholesterol metabolic process’ were highly enriched in AT2 subtypes. PlfAT1 and PlfAT2 were both enriched in cell cycle processes and ‘regulation of epithelial cell proliferation’, but only the PlfAT1 genes were enriched in ‘connective tissue’ terms, suggesting that these cells are functionally similar to fibroblasts in promoting ECM, Figure 2-6.

Figure 2-6: Representative functional enrichment terms of each epithelial subtype.

Enrichment analysis was performed on the top 500 signature genes of each epithelial subtype using the Toppgene Suite with p-value threshold of 0.05 and FDR < 0.05. Biological processes that are important to lung development are highlighted.

Diversity of Lung Fibroblasts

Extensive complexity was found in the fibroblast cell clusters and subclusters, Figure 2-7. ‘Matrix fibroblasts’ expressed RNAs associated with ‘extracellular matrix’ and ‘cell adhesion,’ e.g. Fn1, Eln, Vcam, and Fgf10, the latter a growth factor essential for early lung morphogenesis and alveolarization. Three subtypes of ‘matrix fibroblasts’ were identified: ‘intermediate fibroblast’ (MFBIf), ‘mature fibroblast 1’ (MFBMtr1) and ‘mature fibroblast 2’ (MFBMtr2). MFBMtr1 was potentially the most mature subtype since it was most highly enriched in pathways involved in cell-matrix adhesion and proteinaceous ECM, compared with the less mature subtypes MFBMtr2 and MFBIf. Between the latter two, MFBIf was uniquely enriched in pathways involved in cell cycle checkpoints and posttranscriptional control of gene expression.

Figure 2-7: Heatmap overview of fibroblast populations.

Top 200 genes for each population are shown. Each row corresponds to an individual gene, and each column corresponds to a single cell. Row side color bar corresponds to different gene modules. Column side color bars represent different fibroblast populations: from left to right, early fibroblast progenitor, intermediate fibroblast, mature fibroblast 1, mature fibroblast 2, proliferative myofibroblast progenitor, migrating myofibroblast, smooth-muscle-like myofibroblast, and pericytes.

‘Myofibroblasts’ were identified by shared RNAs involved in ‘proteinaceous extracellular matrix’ and ‘muscle phenotype,’ e.g. Adamtsl2, Bgn, Col1a2, Col24a1, and Acta2, Myocd and Tagln. Two subtypes of ‘myofibroblasts’ were identified: the ‘migrating myofibroblast’ (MyoFBMig), enriched in ‘morphogenesis of a branching epithelium’ and ‘canonical Wnt signaling pathway’, and the ‘smooth-muscle-like myofibroblast’ (MyoFBSM), enriched in ‘sarcolemma’ and ‘regulation of muscle contraction’.

‘Proliferative fibroblasts’ were enriched in ‘mitotic cell cycle’ and ‘chromosome modification’ associated processes and could be divided into two subtypes. The ‘early fibroblast progenitor’ (PlfFBe) was functionally close to matrix fibroblasts and the ‘proliferative myofibroblast’ (PlfMyoFB) was functionally similar to myofibroblasts. ‘Pericyte’-like cells expressed signature genes including Pdgfrb, Dlk1, Rgs5, Cspg4, and Mcam, and were enriched in pathways involved in angiogenesis, Figure 2-8.

Figure 2-8: Representative functional enrichment terms of each fibroblast population.

Enrichment analysis was performed on the top 500 signature genes of each fibroblast population using the Toppgene Suite with p-value threshold of 0.05 and FDR < 0.05. Biological processes that are important to lung development are highlighted.

Diversity of Other Pulmonary Cells

The sources and roles of mesenchymal cells in lung morphogenesis are less well understood than those of the respiratory epithelium. There is evidence that the proximal-peripheral patterning of the embryonic lung is regulated by functionally and spatially distinct fibroblast subpopulations (75). Our cluster analysis pipeline identified distinct subsets of lung stromal cells that were represented by 169/218 cells, including endothelial, myofibroblast, matrix associated fibroblast and myeloid cells (Figure 2) and cell selective marker genes that will be useful to further identify their anatomic sites and functions among the various non-epithelial cell populations, Table S3.

‘Endothelial’ cells were identified by RNAs defining ‘blood vessel development/morphogenesis,’ e.g. Cav1/2, Cd34, Fli1/4, Kdr, Nrp1, Pecam1, Tek, Emcn, Sox17, and Tie1. ‘Myeloid’ cells shared ‘immune/defense response,’ ‘lysosome,’ and ‘phagocytosis’ related RNAs, e.g. Fcgr1a/2b/3, Mpeg1, Mrc1, Ly86, and Cyba/b, RNAs expressed by macrophages and monocytes that are not abundant in the normal fetal lung (76-78). Further clustering of each major cell type identified subtypes with distinct gene expression profiles and functional enrichment. Three endothelial subtypes, termed ‘maturing endothelial’ (EndMtr), ‘matrix endothelial’ (EndMtx), and ‘proliferative endothelial’ (EndPlf), involved in vascular development, ECM, and proliferation, respectively, were identified. Two myeloid cell subtypes, termed ‘mature myeloid’ (MyeloMtr) and ‘proliferative myeloid’ (MyeloPlf), were also identified.

2.3.3. Extracting biological insights from predicted cell populations

Prediction of epithelial lineage progression and stage-specific epithelial subtype markers

In order to predict the transition of epithelial subtypes at E16.5, we selected top signature genes for each epithelial subtype and constructed a differentiation pathway using a minimal spanning tree (MST). The analysis predicted a transition from UcAT -> AT2 subtypes -> AT1 subtypes, Figure 2-9. In addition, a heatmap of the combined dataset of E16.5 plus E18.5 and PND107 epithelial cells showed that 1) most of the widely expressed genes of P107 AT2 cells (median TPM >= 3) were increasingly expressed in E16.5 and E18.5 AT2 cells, and 2) expression of these genes was minimal in E16.5 AT1 cells, Figure 2-5. These results support the concepts that 1) AT2 cells mature in a direction from PlfAT2 -> NcAT2 -> E18 AT2 and 2) AEC lineage determination occurs around or before E16.5.
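For readers unfamiliar with the approach, a spanning tree over single cells can be built from a cell-by-cell distance matrix with SciPy. The sketch below computes a standard minimum spanning tree on synthetic data, using 1 minus the Pearson correlation as the distance; that distance choice, the function name, and the toy matrix are assumptions for illustration, since the text does not specify how the tree was computed.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree


def cell_mst(expr):
    """Return MST edges (cell_i, cell_j, distance) for a cells x genes matrix."""
    # Correlation distance between cells; clipped to guard against rounding.
    dist = np.clip(1.0 - np.corrcoef(expr), 0.0, None)
    # Zero entries are read as "no edge", which is what we want on the diagonal.
    np.fill_diagonal(dist, 0.0)
    tree = minimum_spanning_tree(dist).tocoo()
    return [(int(i), int(j), float(w))
            for i, j, w in zip(tree.row, tree.col, tree.data)]


# Synthetic signature-gene expression for 12 cells.
rng = np.random.default_rng(3)
toy_cells = rng.normal(size=(12, 40))
edges = cell_mst(toy_cells)
print(len(edges))
```

A tree over n cells has n - 1 edges; coloring its nodes by subtype or by the expression of a marker gene gives views comparable to those in the figures of this section.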

Figure 2-9: Minimal spanning tree (MST) of five epithelial subtypes.

Each node represents an epithelial single cell, and each edge connects an epithelial cell to the most similar epithelial cell among the rest. Node colors were assigned based on subtype name.

At E16.5, Sox9 and Sftpc were highly expressed in cuboidal epithelial cells at distal regions of peripheral acinar tubules; in contrast, Hopx was more highly expressed in peripheral bronchioles and to a lesser extent in proximal regions of acinar tubules. This spatial difference in Sftpc and Hopx expression was represented in the predicted epithelial subtypes. However, Sftpc and Hopx expression alone was unable to differentiate peripheral epithelial subtypes, and the two markers were occasionally co-localized within individual cells, Figure 2-10. We therefore predicted additional subtype-specific marker genes to further identify these cells, Figure 2-11, which could be useful in visualizing and characterizing subsets of epithelial progenitors.

Figure 2-10: AT1 and AT2 markers in MST and in confocal images.

Left, MST of AT1 marker Hopx or AT2 marker Sftpc. Each node represents an epithelial single cell, and each edge connects an epithelial cell to the most similar epithelial cell among the rest. Node color was assigned based on expression of target gene. Right, immunofluorescence confocal microscopy of fetal mouse lungs at E16.5 using antibodies against known alveolar cell markers (Nkx2-1, Sftpc, and Hopx). Magnification was 20X.

Figure 2-11: Representative predicted marker gene for each epithelial subtype.

From left to right, Cym for UcAEC, Cthrc1 for PlfAT2, Slc34a2 for NcAT2, Mmp23 for PlfAT1 and Smoc2 for NcAT1. Each node represents an epithelial single cell, and each edge connects an epithelial cell to the most similar epithelial cell among the rest. Node color was assigned based on expression of target gene.

Dissecting biological processes based on cell type and subtype-specific expression patterns

Lung morphogenesis is the result of orchestrated signaling that instructs cell differentiation, proliferation, and motion within dynamically generated cell layers to form a consistently patterned organ structure. This massive, parallel and orchestrated process is dependent on precise temporal and spatial information exchange among cells via interactions among proteins, e.g. growth factors and receptors, ECM and adhesion/junction molecules (79). To begin to unravel the complexity of cell-cell communications and the distinct role of each predicted lung cell subtype in this process, we combined biological knowledge, protein-protein interactions (PPI), and predicted cell types to provide a comprehensive lung single-cell interactome useful for generating hypotheses for experimental validation. An intersection of the lung interactome with genes with known involvement in adhesion, ECM or growth factor signaling revealed complex interacting patterns between different cell types and between different cells of the same cell type. Epithelial and fibroblastic cells were most involved in these interactions, while pericytes had the fewest interactions with other cell types, Figure 2-12.
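The core of such an interactome is a join between known protein-protein interaction pairs and the signature genes of each cell population. The sketch below shows that join on toy data; the function name, the two interaction pairs, and the assignment of genes to cell types are illustrative assumptions and are not results of this study.

```python
def predict_cell_interactions(ppi_pairs, signatures):
    """List (cell_a, gene_a, cell_b, gene_b) for every PPI pair whose two
    partners are signature genes of cell populations a and b."""
    edges = []
    for gene_a, gene_b in ppi_pairs:
        for cell_a, genes_a in signatures.items():
            if gene_a not in genes_a:
                continue
            for cell_b, genes_b in signatures.items():
                if gene_b in genes_b:
                    edges.append((cell_a, gene_a, cell_b, gene_b))
    return edges


# Toy inputs (hypothetical assignments, for illustration only).
toy_ppi = [("Fgf10", "Fgfr2"), ("Vegfa", "Kdr")]
toy_signatures = {
    "matrix fibroblast": {"Fgf10", "Eln"},
    "epithelial": {"Fgfr2", "Vegfa"},
    "endothelial": {"Kdr", "Pecam1"},
}
interactions = predict_cell_interactions(toy_ppi, toy_signatures)
for edge in interactions:
    print(edge)
```

Because the inner loop also visits the sending population itself, the same join reports candidate autocrine interactions when both partners are expressed by one cell population.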

Figure 2-12: Interaction networks of adhesion molecules expressed by lung cell types.

Cell type-specific genes that are highly enriched in biological adhesion, extracellular matrix, cell-cell junctions, and growth factor and receptor-based functions are shown. Genes were classified based on expression as being specific to a single lineage (clusters around the periphery) or shared by two or more cell classes (central clusters). The network was constructed using Cytoscape V3.1 and the layout was manually created.

Cell-cell communications are also mediated via paracrine and autocrine signaling pathways, and their functional dependence on ligand-specific receptors makes them useful in demonstrating how single-cell data can be used to identify cell-cell communications. We thus subsetted the lung interactome and constructed a network view of growth factor interactions focusing on epithelial and endothelial cells. The networks comprehensively summarized all potential signaling pathways between endothelial cells and epithelial cells, fibroblasts, and pericytes in E16.5 lungs.

Known growth factor signaling pathways between mesenchymal and epithelial/endothelial cells were re-discovered in these networks, including fibroblast growth factor (FGF), SHH, bone morphogenetic protein (BMP) and WNT signaling in epithelial cells, Figure 2-13 and Figure 2-14 (80-84), and vascular endothelial growth factor (VEGF), FGF and WNT in endothelial cells, Figure 2-15 and Figure 2-16 (84). The growth factor interactome integrated these known signaling patterns with predicted ones that are either 1) known signaling pathways in new cell types; 2) known signaling pathways in known cell types, but with a distinct pattern in cell subtypes; or 3) new signaling pathways predicted by PPI. Thus, it provides new insights into communication among cell types and subtypes, and more importantly, how such communications could be influenced by other signaling events in the micro-environment.

Figure 2-13: Epithelial growth factor interaction network.

Figure 2-14: Epithelial growth factor receptor interaction network.

Figure 2-15: Endothelial growth factor interaction network.

Figure 2-16: Endothelial growth factor receptor interaction network.

Figure 2-17: Legend of growth factor/receptor interaction network.

Single-cell gene expression data analysis of candidate autocrine and paracrine growth factor regulation pathways of different cells at E16.5 lung development is shown. (a, b) Known growth factor signaling pathways that are responsible for the epithelial-mesenchymal crosstalk (a) or regulation of vasculogenesis (b) are re-discovered in the growth factor interactome. (c) Complete growth factor profile of matrix fibroblasts and myofibroblasts. Growth receptors (rectangles) are shown connected to growth factors (ovals) that they are known to interact with from each different cell type (rows) and subtype (alternate family color) that were identified as being present at E16.5. Size of the nodes is proportional to their average level of expression as log2(TPM+1) in associated cell class or subclass. The network was constructed using Cytoscape V3.1.

2.3.4. The cellular landscape of developing lung from E16.5 to PND28

Using the clustering analysis pipeline discussed previously, we identified cell types and subtypes in additional lung development stages including E18.5, postnatal day (PND) 1, PND3, PND7, PND10, PND14, and PND28. Epithelial, endothelial, fibroblastic and immune cells were consistently found in E16.5, E18.5, PND1, PND3 and PND7, while PND10, PND14 and PND28 single cells were dominated by immune system derived cells. Cell subtypes in these major cell types with distinct expression profiles were also discovered. Epithelial cells could mainly be divided into AT1, AT2, bi-potential and ciliated cell categories, while fibroblastic cells were mainly divided into matrix fibroblasts, myofibroblasts and pericytes. Subtypes in immune cells were not evident until later stages.

2.3.5. ToppCell: a database for modular single-cell data analysis

To facilitate access to analyzed single-cell data and enable real-time visualization and reanalysis, we have constructed a web portal, ToppCell. The web portal provides two easy-to-use functions: (1) an interactive gene expression heatmap featuring genes highly expressed by each cell type and subtype from E16.5 to PND28; and (2) the cell interaction analyzer that predicts cell-cell interactions based on their expressed genes.

Dynamic organization of data

The ToppCell database features all single-cell gene expression profiles of the developing lung from E16.5 to PND28 generated by the LungMAP consortium. The minimal component of our database is the cell population-specific gene module. Each module is annotated with metadata covering the sequencing platform, data analysis protocol, associated cell population, tissue origin, and the age and species of the animal source. A hierarchical navigation tree generated from these metadata helps users explore the data; it is currently ordered as Project=>Species=>Tissue=>Age=>Cell type and subtype. The organization of the gene modules can be easily adjusted by changing the order of the metadata types. Users can also filter on these metadata to narrow their searches (Figure 2-18).
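The reorderable hierarchy described above can be illustrated with a small sketch. This is not the ToppCell implementation; the function names, metadata field names and module records below are hypothetical and serve only to show how one set of annotated gene modules can be nested in any metadata order and filtered.

```python
def build_navigation_tree(modules, order):
    """Nest gene-module names by the metadata fields listed in `order`."""
    tree = {}
    for module in modules:
        node = tree
        # Walk (and create) one nested level per metadata field except the last.
        for field in order[:-1]:
            node = node.setdefault(module[field], {})
        # The last field maps to the list of module names at that leaf.
        node.setdefault(module[order[-1]], []).append(module["name"])
    return tree


def filter_modules(modules, **criteria):
    """Keep only modules whose metadata match every given field=value pair."""
    return [m for m in modules if all(m.get(k) == v for k, v in criteria.items())]


# Hypothetical records; field names and values are illustrative only.
modules = [
    {"name": "E16.5_AT1", "project": "LungMAP", "species": "mouse",
     "tissue": "lung", "age": "E16.5", "cell_type": "AT1"},
    {"name": "E16.5_AT2", "project": "LungMAP", "species": "mouse",
     "tissue": "lung", "age": "E16.5", "cell_type": "AT2"},
    {"name": "PND1_AT1", "project": "LungMAP", "species": "mouse",
     "tissue": "lung", "age": "PND1", "cell_type": "AT1"},
]

default_order = ["project", "species", "tissue", "age", "cell_type"]
tree = build_navigation_tree(modules, default_order)

# Changing the order of metadata types regroups the same modules,
# here by cell type first and then by age.
by_cell_type = build_navigation_tree(modules, ["cell_type", "age"])
```

Because the tree is derived from flat records on demand, changing the navigation order requires no change to the stored modules themselves.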

Figure 2-18: ToppCell Navigation page

The hierarchical navigation tree on the homepage of ToppCell. It shows all cell population-specific gene modules from the LungMAP data.

Visualize gene expression profiles of any cell population in the developing lung

Once a user selects a gene module, an interactive Morpheus heatmap containing the genes in the selected module is generated and visualized in real time. It allows users to see expression patterns of the selected genes in cells within the same dataset. In addition, the interactive heatmap enables basic analyses, including data transformation, selection, filtering and grouping, as well as more advanced utilities such as clustering and differential expression analysis (Figure 2-19).

Figure 2-19: Interactive heatmaps and available analysis

An overview of interactive heatmaps in ToppCell and the available browser-based analysis tools powered by Morpheus. Counterclockwise: indexing genes and/or cells; grouping based on gene/cell annotation and aggregation; hierarchical clustering with several distance metrics; box, line and scatter plots; enrichment analysis by ToppFun; and statistical analysis comparing one cell population against another.

Visualization and analysis of custom gene lists

Users can also visualize a custom gene list in any one of the single-cell datasets in the ToppCell database, which can provide insights into cell population-specific expression of genes of interest at a given lung development stage. Custom gene lists can be uploaded or queried based on gene annotation terms from Gene Ontology. For example, the expression profiles of 1136 genes included in the GO term “Epithelium development” in E16.5 mouse lung are shown in Figure 2-20.

Figure 2-20: Expression profile of a custom gene list visualized by ToppCell.

Expression profiles of 1136 genes included in the GO term “Epithelium development” are shown in the dataset of E16.5 mouse lung. Genes and cells are represented as rows and columns, respectively. Top color bar represents cell type and subtypes.

Cell-cell interaction analyzer

ToppCell also enables cell-cell interaction prediction by mining protein-protein interactions between signature gene lists from cell populations. Users can build a cell-cell interaction network by selecting two or more cell population-derived gene lists from the gene list navigator; ToppCell then constructs a bipartite network with gene-gene and gene-cell edges. Optionally, a gene functional group (for example, adhesion, ECM, or growth factor+receptor) can be selected to pre-filter the genes shown in the network (Figure 2-21). Cell-cell interaction analysis results are available as images or .xgmml files, which can be further analyzed in software such as Cytoscape or Gephi.

Figure 2-21: Cell-cell interaction analysis by ToppCell.

Example of cell-cell interaction analysis looking for E16.5 epithelial-fibroblast interactions through biological adhesion genes. Gene modules corresponding to E16.5 epithelial cells and matrix fibroblasts were selected and intersected with genes involved in biological adhesion. Then, protein-protein interactions among gene products in both gene modules were extracted and visualized. Predicted cell-cell interactions through protein-protein interactions are shown as blue edges in the network.
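The network construction described in this section can be sketched as follows. This is an illustrative outline under stated assumptions, not the ToppCell code: the function name and toy inputs are hypothetical, and the rule of keeping only protein-protein interactions that can link two different cell populations is an assumption made for this example.

```python
def build_interaction_network(modules, ppi_pairs, functional_group=None):
    """Return (gene-cell edges, gene-gene edges) for the selected gene modules.

    modules: dict mapping a cell population name to a set of signature genes.
    ppi_pairs: iterable of (gene_a, gene_b) known protein-protein interactions.
    functional_group: optional set of genes used to pre-filter every module.
    """
    if functional_group is not None:
        modules = {cell: genes & functional_group for cell, genes in modules.items()}

    # Gene-cell edges connect each retained gene to its cell population.
    gene_cell_edges = {(gene, cell) for cell, genes in modules.items() for gene in genes}

    gene_to_cells = {}
    for cell, genes in modules.items():
        for gene in genes:
            gene_to_cells.setdefault(gene, set()).add(cell)

    # Gene-gene edges: keep a known interaction only if its two partners can be
    # placed in two different cell populations (assumption of this sketch).
    gene_gene_edges = set()
    for gene_a, gene_b in ppi_pairs:
        cells_a = gene_to_cells.get(gene_a, set())
        cells_b = gene_to_cells.get(gene_b, set())
        if any(ca != cb for ca in cells_a for cb in cells_b):
            gene_gene_edges.add(tuple(sorted((gene_a, gene_b))))
    return gene_cell_edges, gene_gene_edges


# Toy inputs for illustration only; they are not taken from the LungMAP data.
toy_modules = {
    "epithelial": {"Cdh1", "Itgb1"},
    "matrix_fibroblast": {"Fn1", "Col1a1"},
}
toy_ppi = [("Itgb1", "Fn1"), ("Cdh1", "Cdh1"), ("Col1a1", "Tp53")]
toy_adhesion_genes = {"Cdh1", "Itgb1", "Fn1"}

gene_cell, gene_gene = build_interaction_network(toy_modules, toy_ppi, toy_adhesion_genes)
```

Storing gene-gene edges as sorted tuples keeps the network undirected, so the same interaction is not counted twice when it is listed in both orientations.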

2.4. Discussion

In this study, we elucidated cell types and subtypes in mouse lung from E16.5 to PND28 and constructed a web database for visualizing and reanalyzing single-cell data of the developing lung. We identified both major cell types and novel, functionally distinct cell subtypes. These cell types and “learned” subtypes were identified by distinct gene expression profiles and functional enrichment. Most importantly, these cell types are distinct in genes that are known to play major roles in lung morphogenesis. We combined known protein-protein interaction data with our learned cell populations to construct a lung-specific protein-protein interactome containing 122,498 pairs of interactions amongst 8760 interactants expressed in 24 lung cell populations. Subsetting this interactome with specific proteins of interest enables the study of cell-cell interactions mediated through these protein-protein interactions in the developing lung. Our analysis provides a valuable resource to explore both epithelial and stromal cells of E16.5 lung, including 1) expression profiles of both known and novel cell populations and their associated cell markers, and 2) inferences regarding cell-cell interactions, including, but not limited to, growth factor signaling and direct cell-cell matrix and adhesion/junction interactions. We also demonstrate that complicated biological processes involving multiple cell types, such as morphogenesis and organ development, can be deconvoluted from single-cell genomic data using modular gene set representation combined with annotation feature enrichment and interaction analyses.

The extraction of novel biological insights and knowledge from single-cell transcriptomic analysis relies upon unsupervised machine learning methods, especially cluster analysis techniques. The high-dimensional and noisy nature of single-cell RNA-seq data makes clustering very challenging. Uncertainty regarding the most suitable distance functions, clustering algorithms and feature-selection methods further complicates the task (85). For these reasons, an unsupervised cluster validation step is required after clustering and before interpretation. In previous studies involving clustering of single cells, unsupervised cluster validation was often combined with result interpretation or omitted altogether. Our cluster analysis incorporated a cluster validation step based on prior biological knowledge, i.e., gene set enrichment analysis. Enrichment analysis results are prone to weaknesses related to term redundancy and over-representation, and thus cannot directly validate unsupervised clustering. For this reason, we clustered enriched terms into metagroups and calculated an overall p-value for each enriched-term metagroup based on a hypergeometric test (72). This process removed biases towards redundant and over-represented terms. Iterative clustering improved the biological relevance of each prediction. To the best of our knowledge, this is the first work that incorporates cluster validation through enrichment analysis in the clustering process. This methodology can be adopted in other high-dimensional expression data analyses to facilitate biologically relevant clustering.
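To make the hypergeometric step concrete, the following minimal sketch computes an upper-tail hypergeometric p-value and applies it to a metagroup of terms. How the overall metagroup p-value was obtained is not detailed here, so pooling the genes of all terms in a metagroup before testing is an assumption of this example; the function names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_overlap, n_annotated, n_query, n_background):
    """Upper-tail hypergeometric p-value, P(X >= n_overlap).

    n_background: number of genes in the background.
    n_annotated: background genes carrying the annotation.
    n_query: genes in the query list (for example, a cluster signature).
    n_overlap: query genes that carry the annotation.
    """
    upper = min(n_annotated, n_query)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_annotated, i) * comb(n_background - n_annotated, n_query - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


def metagroup_pvalue(query_genes, term_gene_sets, n_background):
    """Pool the genes of all terms in one metagroup, then test the pooled overlap.

    Assumes every pooled gene and every query gene belongs to the background.
    """
    query = set(query_genes)
    pooled = set().union(*term_gene_sets)
    overlap = len(pooled & query)
    return hypergeom_enrichment_pvalue(overlap, len(pooled), len(query), n_background)
```

Testing the pooled gene set once, rather than each redundant term separately, is one way to avoid counting the same underlying signal several times.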

Cells in the developing lung exhibit a high degree of heterogeneity, making identification of cell populations from single-cell data challenging. During the canalicular stage of lung development, epithelial cells start to differentiate toward AT2 and AT1. Parallel differentiation has not been previously described but must occur in cell types that form the vasculature and mesenchyme. By increasing the number of cell clusters beyond major cell types, our analysis predicted a cell type heterogeneity that can account for differentiation amongst mesenchymal cells. At present, the analytic pipeline depends on human decisions to interpret the biological meaning and is not solely data-driven. Forcing separation of closely related cell populations can exclude recognition of important genes that are specific to a new subset of cells but not as specific to one particular cluster. We used a two-step clustering protocol in our analysis pipeline. We first segregated cells into major cell types such as epithelial, endothelial, and fibroblastic cells. We then performed a second round of clustering within each major cell type, enabling identification of “natural” subtypes that are defined by subtle distinctions in gene expression profiles that are otherwise hidden. This clustering step took each population's relation to its ‘parent’ and ‘sibling’ cell populations into account to reflect cellular differentiation within one cell type. To avoid generating unnecessary or false subtypes, we applied the aforementioned cluster validation step after clustering. Such a strategy for analyzing single-cell data and characterizing heterogeneous cell populations will be useful for discovering previously unknown cell subtypes during the formation of other organs.

3. Chapter 3. Unsupervised gene expression analyses identify IPF-severity correlated signatures, associated genes and biomarkers

Unsupervised gene expression analyses identify IPF-severity correlated signatures, associated genes and biomarkers

Yunguan Wang1, Jaswanth Yella1, Jing Chen1, Francis X. McCormack2, Satish K. Madala†,3,4, Anil G. Jegga†,1,4,5

1Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, USA

2Division of Pulmonary, Critical Care and Sleep Medicine, University of Cincinnati, Cincinnati, Ohio, USA

3Division of Pulmonary Medicine, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, USA

4Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, Ohio, USA

5Department of Computer Science, University of Cincinnati College of Engineering, Cincinnati, Ohio, USA

Citation: Yunguan Wang, Jaswanth Yella, Jing Chen, Francis X. McCormack, Satish K. Madala and Anil G. Jegga. Unsupervised gene expression analyses identify IPF-severity correlated signatures, associated genes and biomarkers. BMC Pulmonary Medicine 2017; 10.1186/s12890-017-0472-9

3.1. Abstract

Background

Idiopathic Pulmonary Fibrosis (IPF) is a fatal fibrotic lung disease occurring predominantly in middle-aged and older adults. The traditional diagnostic classification of IPF is based on clinical, radiological, and histopathological features. However, the considerable heterogeneity in IPF presentation suggests that differences in gene expression profiles can help to characterize and distinguish disease severity.

Methods

We used data-driven unsupervised clustering analysis, combined with a knowledge-based approach to identify and characterize IPF subgroups.

Results

Using transcriptional profiles of lung tissue from 131 patients with IPF/UIP and 12 non-diseased controls, we identified six subgroups of IPF that generally correlated with disease severity and lung function decline. Network-informed clustering identified the most severe subgroup of IPF, which was enriched with genes regulating inflammatory processes, blood pressure and branching morphogenesis of the lung. The differentially expressed genes in the six subgroups of IPF compared to healthy controls include transcripts related to extracellular matrix, epithelial-mesenchymal cell cross-talk, calcium ion homeostasis, and oxygen transport. Further, we compiled differentially expressed gene signatures to identify unique gene clusters that can segregate IPF from normal, and severe from mild IPF. Additional validations of these signatures were carried out in three independent cohorts of IPF/UIP. Finally, using knowledge-based approaches, we identified several novel candidate genes which may also serve as potential biomarkers of IPF.

Conclusions

Discovery of unique and redundant gene signatures for subgroups in IPF can be greatly facilitated through unsupervised clustering. Findings derived from such gene signatures may provide insights into pathogenesis of IPF and facilitate the development of clinically useful biomarkers.

Keywords

Idiopathic pulmonary fibrosis, IPF, gene expression analysis, gene signature, IPF subtyping

3.2. Background

The clinical course of idiopathic pulmonary fibrosis (IPF), a chronic and fatal fibrotic lung disease, is highly variable. With a median survival of about 3 years, it ranges from a slow, steady loss of lung function over 5 or more years to a rapidly progressive state and death within 1–3 years post-diagnosis. The typically slowly progressive course of IPF can be punctuated by intermittent episodes of precipitous decline in lung function termed acute exacerbation (AEIPF) (42, 86), which often lead to a new, worsened baseline of respiratory impairment. The mechanisms underlying AEIPF continue to be poorly understood (42, 87). Further, the lack of a robust means of identifying biological heterogeneity and selecting patient cohorts at risk for outcomes of interest continues to limit the scope and design of interventional clinical studies in IPF (42).

The current approach to IPF diagnosis is limited to clinical assessment based on imaging and histology features. Efforts are underway, however, to develop genomic signatures and blood-specific or lung-specific biomarkers (88). Gene signatures derived from transcriptomic studies have been reported to differentiate IPF patients from patients with other interstitial lung diseases (89, 90) and from healthy controls (91, 92). Comparison of gene signatures of healthy controls with ungrouped IPF patients revealed extensive genetic heterogeneity in the disease samples, and differential gene expression profiles in IPF subgroups have been reported in several studies (40-42). This demands the development of computational approaches to resolve heterogeneity and identify IPF-specific transcriptomes that may help to predict disease progression. We therefore reasoned that unsupervised machine learning approaches could be applied prior to differential gene expression analysis to facilitate recognition of potential IPF subgroups with novel gene signatures that have predictive or prognostic value.

We postulated that data-driven and knowledge-based approaches using gene expression profiling of a large set of IPF/UIP cases would both allow us to identify novel patient subgroups with shared molecular characteristics and reveal novel candidate genes. Using transcriptional profiles of lung tissue from 131 patients with IPF/UIP and 12 non-diseased controls, we identified six subtypes of IPF that reflect disease severity. We further identified molecular signatures that are capable of differentiating (a) IPF from normal controls and (b) severe from mild IPF. These signatures were subsequently validated in three independent cohorts of IPF/UIP. Finally, using knowledge-based approaches, we identified several novel candidate genes and potential biomarkers for IPF.

3.3. Methods

3.3.1. Cohort selection

We used the microarray data from the IPF cohort on the Lung Genomics Research Consortium (LGRC) website (http://www.lung-genomics.org; also deposited in the GEO data repository as GSE47460 (93)). Among 582 subjects in dataset GSE47460, 12 had clinical and pathological designations as “controls”, and 131 had clinical and pathological diagnoses of “UIP/IPF”. We selected these 143 subjects for our cluster analysis and differential analysis, and to train classifiers. Demographic and clinical characteristics of the selected cohort are summarized in Table 3-1. There was no statistically significant difference in age between control and IPF patients, but there were more males in the IPF group. The predicted forced expiratory volume in one second (FEV1), forced vital capacity (FVC), and diffusing capacity of the lung for carbon monoxide (DLCO) were significantly lower in UIP/IPF patients than in the control group. To evaluate classifier performance and assess the relevance of our identified IPF subtypes, we used three independent IPF cohorts (GSE24206 (92), GSE10667 (40) and GSE53845 (42)).

Table 3-1: Patient demographics and clinical characteristics of the LGRC IPF cohort.

Disease Group            UIP/IPF          Control          p Value
Number                   131              12
Age, mean (SD)           62.6 (12.2)      64.1 (8.2)       0.5631*
% predicted FEV1 (SD)    71.37 (19.00)    94.33 (9.86)     6.3E-5*
% predicted FVC (SD)     64.78 (17.41)    91.75 (7.44)     4.3E-7*
% predicted DLCO (SD)    49.33 (18.14)    97.00 (21.30)    1.1E-11*
Gender, % male           67.2             25               0.0375†

*By two-tailed Student's t-test.

†By χ2 test.

IPF, idiopathic pulmonary fibrosis; LGRC, Lung Genomics Research Consortium; UIP, usual interstitial pneumonia; FVC, forced vital capacity; FEV1, forced expiratory volume in 1 second; DLCO, diffusing capacity of the lung for carbon monoxide; SD, standard deviation.

3.3.2. Clustering, principal component analysis (PCA), and differential expression analysis

We used the Scikit-learn (94) package in Python for clustering analysis and PCA, and the limma (95) package in R for differential analysis. Data were first preprocessed by aggregating redundant transcripts, log2-transforming, and median-normalizing across each gene, resulting in an expression data matrix of 14110 genes by 143 samples (or subjects). For PCA, the principal components were calculated from the expression data of IPF samples only, and expression data from both control and IPF patients were then projected onto these principal components. Hierarchical clustering using Ward linkage on Euclidean distance was then performed on the transformed matrix. The number of clusters was chosen as the smallest number that allowed maximal difference in average FEV1, FVC, and DLCO values among clusters. Following clustering, differential analysis was performed across IPF clusters and control with Benjamini–Hochberg false discovery rate (FDR) correction. Differentially expressed genes (DEG) were defined as those with FDR-adjusted p-value ≤ 0.05 and absolute log2 fold-change ≥ 1 when compared to control.
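The DEG definition above can be illustrated with a short sketch of the Benjamini–Hochberg adjustment followed by the two thresholds. The analysis itself used limma in R; this standalone Python version is only an illustration, and the function names and inputs are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg FDR-adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values are forced to be monotone in rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(genes, pvalues, log2_fold_changes, fdr=0.05, min_abs_lfc=1.0):
    """Keep genes with adjusted p-value <= fdr and |log2 fold-change| >= min_abs_lfc."""
    adjusted = benjamini_hochberg(pvalues)
    return [
        gene
        for gene, q, lfc in zip(genes, adjusted, log2_fold_changes)
        if q <= fdr and abs(lfc) >= min_abs_lfc
    ]
```

Applying the fold-change filter after the FDR adjustment, as here, keeps the multiple-testing correction based on all tested genes rather than on a pre-filtered subset.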

3.3.3. Validation of IPF gene sets with logistic classifiers

We used the Scikit-learn package in Python to build logistic regression classifiers and evaluate the classification power of each IPF gene set. The datasets from the training and validation cohorts were median normalized and scaled to (0, 1) across each gene. To assess classifier accuracy and reduce over-fitting, we included a 2-fold cross-validation step before training the final classifier using all samples in the training cohort. The classifiers were then used to predict IPF status in each validation cohort. The receiver operating characteristic (ROC) curve, overall accuracy, sensitivity and specificity were used to evaluate classifier performance.

3.3.4. Functional enrichment analysis and candidate gene prioritization

We used ToppFun of the ToppGene Suite (96) for functional enrichment analysis and ToppGene for candidate gene prioritization. For candidate gene prioritization, we used ‘GO: Biological Process’, ‘GO: Cellular Component’, ‘Human Phenotype’, ‘Mouse Phenotype’, ‘Pathway’ and ‘Disease’ as features to compute the similarity. The significance threshold was set as FDR-adjusted p-value ≤ 0.05. We used “known” IPF genes from the Orphanet (97) and DisGenet (98) databases as training sets to rank the differentially expressed genes in IPF and to identify and prioritize novel candidate genes for IPF.

3.4. Results

3.4.1. Gene expression profiles of UIP/IPF patients are highly heterogeneous and are not consistent within clinical FVC or DLCO categories

To examine genes associated with UIP/IPF, we first performed differential gene expression analysis comparing 131 UIP/IPF patients with 12 control subjects and identified 988 differentially expressed genes. However, among these genes there were several distinctive gene expression patterns that defined distinct subsets of IPF patients (Figure S1a). To determine whether this molecular heterogeneity correlated with disease severity, we grouped UIP/IPF patients based on their available clinical FVC and DLCO measurements (FVC ≥ 55% or DLCO ≥ 40%: mild-to-moderate IPF; otherwise: severe IPF) and repeated the differential analysis comparing each of the phenotype-based UIP/IPF subgroups with the control group. This resulted in 1175 and 1167 DEGs in IPF patients grouped by DLCO and FVC measurements, respectively. Surprisingly, we observed that even among UIP/IPF patients within the same FVC or DLCO subgroup, expression of the genes was still highly variable (Figure 3-1). This finding was further corroborated by principal component analysis (PCA) of all patient samples using the 3657 most variable genes (top quartile), wherein FVC or DLCO subgroups could not be separated from each other by the first two principal components (Figure 3-2 a, b). The heterogeneity within the gene expression profiles of UIP/IPF patients and their poor concordance with markers of IPF severity motivated us to take an unbiased approach of clustering the IPF patients based on their gene expression profiles to (a) detect potential IPF subgroups and (b) identify DEGs that correlate with lung function.
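The phenotype-based grouping used in this comparison amounts to a simple threshold rule applied separately to each lung function measure. The sketch below restates that rule in code; the function name, the dictionary of cut-offs and the handling of missing measurements are assumptions made for illustration.

```python
# Cut-offs (% predicted) as stated in the text: values at or above the
# cut-off are treated as mild-to-moderate IPF, lower values as severe IPF.
SEVERITY_CUTOFFS = {"FVC": 55.0, "DLCO": 40.0}


def severity_group(value, measure):
    """Assign a patient to a severity group from one lung function measure.

    Returns None when the measurement is unavailable (an assumption of this
    sketch, reflecting that only available measurements were used).
    """
    if value is None:
        return None
    if value >= SEVERITY_CUTOFFS[measure]:
        return "mild-to-moderate"
    return "severe"
```

Because the rule is applied per measure, the same patient can fall into different groups under the FVC-based and DLCO-based schemes.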

Figure 3-1: Gene expression profiles in lung tissues taken from IPF patients were highly heterogeneous.

IPF samples were pooled (a) or grouped based on FVC (b) or DLCO (c). Differentially expressed genes were then extracted from each condition with an FDR-adjusted p-value cut-off of 0.05 and a fold-change cut-off of 2. Genes (rows) and samples (columns) were ordered using hierarchical clustering with Pearson correlation distance metric and complete linkage.

Figure 3-2: Principal Component analysis (PCA) plot characterized separation of IPF sample by three grouping methods.

Distribution of IPF samples along the first two principal components derived from the top 25% most variable genes is shown. Sample grouping was based on FVC (a), DLCO (b), or Ward clustering (c). Each point represents an IPF sample.

3.4.2. Clustering analysis identifies UIP/IPF patient subgroups correlating with IPF-severity

We performed PCA followed by Ward clustering on the gene expression profiles of 131 UIP/IPF patients and identified 6 distinct patient clusters (C1 through C6) (Figure 3-2c). Subgroups were arranged in descending order of IPF severity, as reflected by the average of clinical measures (DLCO, FEV1, and FVC), from patient cluster C6 to C1. DLCO values in patient clusters C5 or C6 were significantly lower than those in C1, C2, C3, or C4, and DLCO values in patient clusters C1 or C2 were significantly higher than those in C3, C4, C5, or C6. On the other hand, FVC and DLCO values did not differ significantly between C1 and C2, between C3 and C4, or between C5 and C6 (Figure 3-3 a-c). These results suggested that patient subgroups C1 and C2 had modest changes, while C5 and C6 had a significant decline in lung function compared to control. Patient subgroups C3 and C4 exhibited intermediate changes in lung function compared to those with mild (C1 and C2) and severe (C5 and C6) disease phenotypes based on lung function tests.

Figure 3-3: UIP/IPF patient subgroups stratified by disease severity by FEV1, FVC and DLCO have different gene expression profiles.

Patient subgroups were identified using hierarchical clustering with Euclidean distance metric and Ward’s linkage (panels a, b, and c). Average DLCO (a), FEV1 (b), and FVC (c) in six UIP/IPF patient subgroups. Panel d shows a heat map representation of 2968 DEG (rows) in 143 controls and UIP/IPF patients (columns). Genes and patients were ordered using hierarchical clustering. Color bars represent patient subgroups (1st row in heat map), DLCO (2nd row), FEV1 (3rd row), FVC (4th row) and smoking status (5th row). Data are expressed as the mean ± SD. Differential expression analysis was performed using limma. Comparison of lung function measures was carried out using two-tailed Student’s t-test.

To examine transcriptomic differences between these patient clusters, we performed differential analysis comparing IPF patient clusters with control using the R package ‘limma’ and identified 2968 DEGs. Interestingly, these DEGs included nearly all DEGs identified by earlier grouping methods using (a) pooled IPF patients; (b) patients grouped based on DLCO measurements; and (c) patients grouped based on FVC measurements (Figure 3-4). In addition, the expression of DEGs within each of the six patient clusters was more homogeneous than with the earlier methods.

Figure 3-4: Unsupervised clustering followed by differential analysis recovered almost all the DEG identified by other methods and discovered additional DEG.

Comparison of differentially expressed genes identified based on different IPF patient grouping methods. Pooled-IPF, IPF patients were not divided into subgroups; FVC or DLCO grouping, IPF patients were divided into subgroups based on the FVC or DLCO categories, respectively; Clustering, IPF patients were divided into subgroups using PCA and Ward clustering.

Substantial differences in gene expression profiles were found between patient clusters that had similar disease severity, namely C1 vs. C2, C3 vs. C4 and C5 vs. C6 (Figure 3-3d). Three major gene expression modules (Gm) can be identified from the total 2968 DEGs. Gm1 was up-regulated in patient clusters C1, C3, C4, C6 and moderately in C5; Gm2 was up-regulated in patient clusters C1, C3, C5 and C6; and Gm3 was down-regulated in patient clusters C3, C5, C6 and moderately in C1 and C4. To gain functional insight into these genes, we performed enrichment analysis on them and found that Gm1 was enriched in processes such as ‘extracellular matrix organization’, ‘regulation of cell migration’ and ‘collagen catabolic process’, Gm2 was enriched in processes such as ‘cilium’ and ‘cilium assembly’, and Gm3 was enriched in ‘angiogenesis’ and ‘lung alveolar morphology’.

Taken together, these results show that UIP/IPF patient subgroups stratified by disease severity can be distinguished using gene expression profiles-based clustering, which in turn reveals the involvement of different molecular pathways in the pathogenesis and severity of fibrotic lung disease.

3.4.3. Functional characterization of IPF subgroups

The number of DEG found in each patient cluster ranged from 262 genes (patient cluster C2) to 2117 genes (C5). About 34% of the total DEGs were unique to one patient subgroup, while the remainder overlapped with other subgroups (Figure 3-5). All IPF subgroups shared a set of 145 DEGs, which were named the IPF core gene set. Among them, genes involved in ‘proteinaceous extracellular matrix’ and ‘regulation of epithelial to mesenchymal transition (EMT)’ were up-regulated, while hemoglobin genes such as HBA1, HBA2, HBD, HBG1 and HBQ1 were down-regulated (Figure 3-6). The most severe patient subgroups C5 and C6 shared DEGs with C1 and C3, which were enriched in processes including ‘mitotic nuclear division’, ‘cilium assembly’, ‘epithelial/endothelial migration’ and ‘tube development’. Patient clusters C5 and C6 also uniquely expressed 840 DEG, with 448 genes in C5 and 392 genes in C6. Subsets of C5 unique genes were enriched in pathways such as ‘cilium assembly’ and ‘tube development’, which were also perturbed in the less severe IPF subgroups C1 and C3. This suggests a potential IPF progression path of C1→C3→C5 characterized by increasing expression of cilium-associated genes, and is consistent with a previous study that reported a positive correlation between cilium-associated gene expression and increased IPF severity (41). The C6-specific gene set included inflammatory response genes such as HMOX1, IL1R1, IL20RB, IL36G, SELE, SERPINF2, TNFRSF21 and TNFRSF6B, but not genes enriched in cilium-associated pathways (Figure 3-6). This suggests that IPF can alternatively progress via up-regulation of inflammation genes without further up-regulation of cilium-associated genes, and is consistent with a recent report showing increased inflammation in rapidly progressive IPF (99).
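The partition of DEGs into a shared core set and cluster-unique sets can be sketched with basic set operations. This is an illustration with hypothetical names and toy inputs, not the code used for the analysis; it assumes the DEGs of each cluster are provided as a set.

```python
from collections import Counter


def partition_degs(degs_by_cluster):
    """Split DEGs by the number of patient clusters in which they are differential.

    degs_by_cluster: dict mapping a cluster name to the set of its DEGs.
    Returns (core, unique, counts): genes differential in every cluster, the
    genes unique to each cluster, and the per-gene cluster count.
    """
    counts = Counter(gene for genes in degs_by_cluster.values() for gene in genes)
    n_clusters = len(degs_by_cluster)
    core = {gene for gene, c in counts.items() if c == n_clusters}
    unique = {
        cluster: {gene for gene in genes if counts[gene] == 1}
        for cluster, genes in degs_by_cluster.items()
    }
    return core, unique, counts


# Toy example with placeholder gene names, for illustration only.
toy_degs = {"C1": {"A", "B"}, "C2": {"A", "C"}, "C3": {"A", "B", "D"}}
toy_core, toy_unique, toy_counts = partition_degs(toy_degs)
```

The per-gene cluster count also supports grouping DEGs by the number of clusters in which they are differentially expressed, as in Figure 3-5.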

Figure 3-5: Comparison of DEGs of each patient cluster revealed genes commonly dysregulated in IPF and genes associated with severe lung function decline.

DEG were divided into six groups based on the number of patient clusters in which a gene was differentially expressed. Panel (a) shows a schematic representation of 2968 DEG in six IPF patient clusters. DEGs along with their group designation are shown in the same order along the outer rim of each circular plot. The center of each circular plot represents a patient cluster, with the color intensity representing the average % predicted DLCO in that cluster. Each colored edge (red: up-regulated; green: down-regulated) from a patient cluster to a gene in the rim indicates differential expression of that gene in the connected patient cluster. Panel (b) is a heat map representation of the 2968 DEG. Up- or down-regulated genes in each group that are involved or implicated in IPF are highlighted.

Figure 3-6: Enriched biological processes in each gene category revealed commonly and high-severity-associated biological pathways perturbed in IPF.

Selected enrichment terms derived from gene lists in DEG groups are shown. A connection from a gene (rectangular node) to a biological process (purple oval node) indicates involvement of that gene in the connected process. The differential expression status of a gene in each patient subgroup is shown as a mini heat map (orange: up-regulation; turquoise: down-regulation; gray: not differentially expressed; patient subgroup order: C1, C2, C3, C4, C5 and C6). The network was made in Cytoscape 3.5, and layout was performed using AllegroLayout v2 Professional with manual curation.

3.4.4. Validation of IPF subgroups with independent IPF cohorts

To further validate and assess the relevance of our identified IPF subgroups, we used three independent IPF cohorts (GSE24206, GSE10667 and GSE53845). By utilizing multiple testing datasets, we determined whether the gene sets reveal key differences in gene expression between IPF and normal, and between severe IPF (explant) and usual IPF (biopsy).

Out of 145 genes in the core IPF gene set (Figure 3-7a), 133 were found in all three validation cohorts. We used the LGRC dataset with 12 controls and 131 IPF patients as the training set and trained a logistic regression classifier for classification of IPF patients. Then, IPF status (normal or IPF) was predicted by applying the classifier to each validation dataset, where the decision threshold was set to provide at least 90% sensitivity. The ROC curve, sensitivity, specificity, and overall accuracy in each of the validation datasets are shown in Figure 3-8. Specificity, sensitivity and accuracy in all three validation datasets were >90%. Similarly, we validated the C5 and C6 unique gene sets (Figure 3-7b). The C5 unique gene set (448 genes) performed poorly in differentiating severe IPF from usual IPF in all validation datasets (Figure 3-9). Hence, this gene set was not considered for further analysis. However, although classifiers built on the C6 unique gene set failed to distinguish AEIPF from IPF, they could moderately differentiate IPF explant from IPF biopsy, indicating that the unique gene expression profile of patient cluster C6 was also present in severe, end-stage IPF. Taken together, these results demonstrate that the core IPF gene set is a robust gene signature to separate IPF from control. The advanced IPF gene set (C6 unique gene set), on the other hand, can differentiate advanced IPF, but not AEIPF, from stable IPF.
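The choice of a decision threshold that provides at least a target sensitivity, and the resulting performance measures, can be sketched as follows. This is a minimal illustration with hypothetical function names; it takes classifier scores as given (for example, predicted probabilities from a logistic regression model), does not reproduce the Scikit-learn workflow, and assumes both classes are present.

```python
import math


def threshold_for_sensitivity(scores, labels, min_sensitivity=0.90):
    """Return the highest threshold whose sensitivity is at least min_sensitivity.

    scores: classifier scores, higher meaning more likely IPF.
    labels: 1 for IPF, 0 for control.
    """
    positive_scores = sorted(
        (s for s, y in zip(scores, labels) if y == 1), reverse=True
    )
    if not positive_scores:
        raise ValueError("at least one positive sample is required")
    # Number of positives that must score at or above the threshold.
    needed = max(1, math.ceil(min_sensitivity * len(positive_scores)))
    return positive_scores[needed - 1]


def evaluate(scores, labels, threshold):
    """Compute sensitivity, specificity and overall accuracy at a threshold."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(labels),
    }
```

Fixing sensitivity first and then reporting specificity makes explicit the trade-off behind a single operating point on the ROC curve.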

Figure 3-7: Heat maps of Core and advanced IPF gene set.

145 core IPF genes (a) and 392 advanced IPF genes (b) were ordered using hierarchical clustering with Pearson correlation distance and complete linkage method. Patients (columns) were ordered in the same way in each heat map.

Figure 3-8: The core IPF gene set robustly differentiated IPF patients from normal controls in three independent validation cohorts.

Logistic regression models were trained on the core IPF gene sets using the training cohort with 2-fold cross-validation and tested with each validation cohort. The decision threshold was set to provide at least 90% sensitivity for IPF discovery. ROC curves were shown in the left column, and classification scatter plots of IPF and control samples were shown in the right column.

Figure 3-9: The advanced IPF gene set could differentiate end-stage IPF but not AEIPF from usual IPF.

Logistic regression models were trained on the advanced IPF gene sets using the training cohort and tested using each validation cohort. The decision threshold was set to provide at least 90% sensitivity for IPF discovery. ROC curves were shown in the left column, and classification scatter plots of IPF and control samples were shown in the right column.

3.4.5. Functional prioritization of novel IPF-associated genes

To identify genes that were most functionally relevant to biologic processes perturbed in IPF, we employed a systems biology approach to rank each gene in the core and advanced gene sets based on their functional similarity to one of the two training sets comprising genes known or implicated to be involved in IPF (97, 98) and their gene expression fold change in IPF compared to control. Specifically, functional similarity was calculated using the ToppGene Suite’s gene prioritization tool (96). Genes in the core and advanced IPF gene sets that were also in the training set (“known” IPF genes) were removed from ranking. The remaining 133 and 382 genes from the core and advanced gene sets, respectively, were then ranked separately based on either similarity score or fold change compared with normal. The rankings of each of the genes were aggregated using the rank product method (100). The top 10% of ranked genes in the advanced and core IPF gene sets are shown in Table 3-2. Twenty-two of these genes had been shown to be differentially expressed in IPF patients compared with healthy controls or to be involved in IPF pathogenesis.
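The rank aggregation step can be sketched in Python as follows. This is a minimal illustration that takes the rank product to be the geometric mean of a gene's ranks across the two lists; the gene names and scores are hypothetical placeholders, and the details of the cited method (100) may differ.

```python
import math


def ranks(values, descending=True):
    """Map each gene to its 1-based rank (ties broken by input order)."""
    ordered = sorted(values, key=values.get, reverse=descending)
    return {gene: i + 1 for i, gene in enumerate(ordered)}


def rank_product(*rankings):
    """Geometric mean of each gene's ranks across all rankings."""
    k = len(rankings)
    genes = rankings[0].keys()
    return {g: math.prod(r[g] for r in rankings) ** (1.0 / k) for g in genes}


# Hypothetical inputs: a functional similarity score and an absolute
# log fold change for three placeholder genes.
similarity = {"GENE_A": 0.9, "GENE_B": 0.5, "GENE_C": 0.7}
abs_log_fc = {"GENE_A": 1.2, "GENE_B": 2.0, "GENE_C": 0.4}

combined = rank_product(ranks(similarity), ranks(abs_log_fc))
# A smaller rank product means a higher overall priority.
prioritized = sorted(combined, key=combined.get)
```

A gene must rank well in both lists to obtain a small rank product, which is why this aggregation favors genes that are both functionally similar to known IPF genes and strongly dysregulated.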

Enrichment analysis of the novel candidates showed these genes were involved in pathways often perturbed in IPF. For example, up-regulated genes in the advanced set such as SERPINF2, MMP14, DMP1 and CTSL were enriched in ‘extracellular matrix organization’. On the other hand, genes involved in leukocyte activation such as BLM, RAG1, PRKCZ, LBP, and MMP14 were only present in the advanced gene set.

Table 3-2: Top 10% prioritized genes in the core and advanced IPF gene sets.

Core IPF gene set
Down-regulated genes: HBA1, HBEGF
Up-regulated genes: CCL11, CCL13, CCL19, CDH2, COL17A1, GREM1, IL13RA2, KRT5, PLA2G2A, RPS4Y1, SCG5, SFRP2, WNT10A

Advanced IPF gene set
Down-regulated genes: CDKN1C, CEBPA, ELN, ESR1, FASLG, HLA-DQA2, HLA-DQB1, HLA-DQB2, HLA-DRB4, MAPK3, PRKCZ, RAG1, SFRP5, TCF7L2
Up-regulated genes: ACE2, AREG, BLM, CTSL, DMBT1, DMP1, FST, GJB6, GNL3, GPC1, HAS1, HMOX1, KLK8, LBP, LIF, LMNB1, LOX, MMP14, NPPA, SDS, SELE, SERPINF2, SPINK13, THBS1, TNFRSF6B

3.4.6. Prioritization of putative bronchoalveolar lavage fluid biomarkers for IPF

Among the genes in the core IPF gene set, 60 encode secreted proteins (based on UniProt (101) annotation) or were previously found in bronchoalveolar lavage fluid (BALF). Given these genes’ classification power and their potential clinical utility, we decided to prioritize candidate IPF BALF biomarkers among them. We first ranked these genes based on the magnitude of their coefficients from a logistic regression model. Then, we built a series of logistic regression models, each trained on up to 50 top-ranked genes, to determine the threshold for marker selection (Figure 3-10). In the end, we identified 11 putative biomarkers in the core IPF gene set, namely HMGCS2, CHL1, DAO, CRTAC1, EDN1, WNT10A, HBEGF, IL6, CCK, EPHA3 and SEMA3E, which form the smallest biomarker set that allowed >0.8 specificity and 0.9 sensitivity in distinguishing IPF from healthy control. Notably, three core IPF markers, HMGCS2, CHL1 and SEMA3E, were also differentially expressed compared with control in BALF of bleomycin-treated mice, and their direction of dysregulation was consistent with our study (102).
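The selection of the smallest biomarker set can be sketched in Python as below. This is an illustration only: `evaluate` is a hypothetical callback standing in for "train a logistic regression model on this panel and report its sensitivity and specificity on validation data", and treating the thresholds as specificity > 0.8 and sensitivity >= 0.9 is an assumption about how the criteria were applied.

```python
def rank_by_coefficient(coefficients):
    """Order genes by the absolute value of their model coefficient."""
    return sorted(coefficients, key=lambda gene: abs(coefficients[gene]), reverse=True)


def smallest_panel(ranked_genes, evaluate, min_sensitivity=0.9,
                   min_specificity=0.8, max_genes=50):
    """Return the shortest top-k panel that meets both performance criteria.

    evaluate(panel) is a hypothetical callback returning
    (sensitivity, specificity) for a classifier trained on `panel`.
    Returns None when no panel of up to max_genes genes qualifies.
    """
    for k in range(1, min(max_genes, len(ranked_genes)) + 1):
        panel = ranked_genes[:k]
        sensitivity, specificity = evaluate(panel)
        if sensitivity >= min_sensitivity and specificity > min_specificity:
            return panel
    return None


# Hypothetical coefficients and performance values for placeholder genes.
toy_coefficients = {"GENE_A": -2.0, "GENE_B": 0.5, "GENE_C": 1.5, "GENE_D": -0.1}
toy_performance = {1: (0.95, 0.60), 2: (0.92, 0.75), 3: (0.91, 0.85), 4: (0.93, 0.90)}

toy_ranked = rank_by_coefficient(toy_coefficients)
toy_panel = smallest_panel(toy_ranked, lambda panel: toy_performance[len(panel)])
```

Growing the panel one gene at a time in coefficient order and stopping at the first panel that meets the criteria yields the smallest qualifying set under that ordering.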

Figure 3-10: Performance of logistic regression classifiers built on up to 50 top-ranked putative BALF biomarkers.

Putative BALF biomarkers were ranked based on the magnitude of their decision function coefficient derived from a logistic classifier trained using the training cohort. A series of logistic classifiers trained on up to 50 top-ranked genes using the training cohort were tested using each validation cohort. The decision threshold was set to provide the highest prediction accuracy.

3.5. Discussion

In this study of patients with UIP/IPF, we stratified subgroups based on lung function measures and applied unsupervised analysis on gene expression data. Genes enriched in cilium or lung alveolar morphology were expressed at different levels in two distinct transcriptomic profiles from patient clusters with moderate disease (cluster average DLCO: 40-60%), but not severe disease (cluster average DLCO: <40%). Comparison of DEG from each patient cluster revealed additional gene signatures that robustly differentiated IPF from normal lung, and advanced IPF from usual IPF. Finally, using knowledge-based approaches, we identified several novel gene candidates and potential BALF biomarkers for IPF.

The uniqueness of the current study is that we used unsupervised, data-driven approaches to discover potential subgroups within IPF patient samples prior to extracting IPF-specific gene signatures, which allowed us to identify genes commonly involved in IPF or only associated with advanced IPF. In contrast, gene signatures of previous studies were all derived from comparing pooled IPF samples with healthy controls (40, 89-92). As a result, we identified an additional 1981 DEG along with genes discoverable without incorporating clustering steps prior to differential analysis, and 382 out of 392 advanced IPF genes were among the additional genes.

Our results indicate that gene expression profiles from IPF patients are heterogeneous. Grouping patients according to lung function measures such as FVC and DLCO reduced such heterogeneity and allowed discovery of more DEG. However, different gene expression profiles could still be found within the lung function group defined based on FVC or DLCO measurements, and several genes were expressed at similar levels in different patient groups. Gene expression heterogeneity not yet explainable by lung function measures suggested activation of disease-driving pathways that could potentially be informative in efforts to improve the therapeutic response and outcome.

On the other hand, genes expressed at comparable levels across patient subgroups of different severity suggest potential involvement of these genes and linked biological processes in distinct stages of IPF. In this study, we validated these clusters by cross referencing with clinical data to avoid generating clusters that are less relevant clinically. Clustering patients based on gene expression prior to differential analysis may also help to circumvent some of the limitations we encountered.

A recent study reported that cilium-associated genes were associated with more extensive microscopic honeycombing in IPF patients, although no difference in lung function measures was found between patient groups defined by these genes (41). Our results are consistent with these data in that we found that cilium-associated genes are most highly expressed in patient cluster C5 with more severe IPF. These genes include MUC5B and DSP, which were known to be involved in IPF (41, 103), matrix metalloproteinases that are implicated in IPF such as MMP1, MMP3 and MMP7 (104), and collagens involved in ECM organization. However, cilium-associated genes were also highly expressed, although to a lesser extent, in the less severe patient clusters C1 and C3. More importantly, patient cluster C4 with low cilium-associated gene expression had DLCO values that were comparable to those of C3, suggesting potential additional driver genes underlying IPF severity.

Our analyses revealed novel IPF-associated genes and biomarkers. Among the 55 prioritized genes, 22 were previously shown to be dysregulated in IPF or involved in IPF pathogenesis. For example, up-regulated expression of candidate genes including CTHRC1, CTSE, GREM1, NELL1 and PLA2G2A in the core set, and AREG, FST, LOX, THBS1 and SELE in the advanced set, was also found in IPF animal models or in human IPF patients (91, 105-109). On the other hand, ACE2, SFRP2 and WNT10A were known to be associated with fibrosis in IPF animal models and survival in IPF patients (110-112). The presence of these genes in the candidate list supports the validity and robustness of our prioritization, although further studies are needed to validate the remaining novel candidate genes identified. In addition to novel candidate IPF genes, we also identified putative BALF biomarkers that can potentially differentiate IPF patients from healthy normal volunteers. The high consistency of the expression of these biomarker genes with their corresponding protein expression in BALF (102) suggests that classifiers built on them could achieve predictive power comparable to that observed in our study. Thus, our biomarker list may inform future efforts to identify diagnostic, predictive and prognostic biomarkers in BALF that could obviate the need for more invasive diagnostic maneuvers and be used in decision making for IPF care.

3.6. Conclusions

In conclusion, our results show that discovery of robust gene signatures for IPF diagnosis can be greatly facilitated through integration of unsupervised and systems biology approaches. Findings derived from gene signatures may provide insights into pathogenesis of IPF and facilitate the development of clinically useful biomarkers.

3.7. List of abbreviations

IPF: idiopathic pulmonary fibrosis

FEV1: forced expiratory volume in the first second

FVC: forced vital capacity

DLCO: diffusing capacity of the lung for carbon monoxide

DEG: differentially expressed gene

PCA: principal component analysis

FDR: Benjamini–Hochberg false discovery rate

SD: standard deviation

3.8. Availability of data and material

The microarray data sets analyzed in this study were obtained from the National Center for Biotechnology Information Gene Expression Omnibus repository (GSE47460). All other data supporting the findings of this study are made available as Supplementary Information files. All the data sets and results generated including the IPF clusters are made available as a Web-based resource (https://ipf.research.cchmc.org/) using Morpheus software (https://software.broadinstitute.org/morpheus). Users can also export gene lists of interest to the ToppGene Suite (96) to perform functional enrichment analysis.

4. Chapter 4. PP-2, a src-kinase inhibitor, is a potential corrector for F508del-CFTR in cystic fibrosis

PP-2, a src-kinase inhibitor, is a potential corrector for F508del-CFTR in cystic fibrosis

Yunguan Wang†,1, Kavisha Arora†,2, Fanmuyi Yang2, Woong-Hee Shin3, Jing Chen1, Daisuke Kihara3,4,5, Anjaparavanda P. Naren*,2, Anil G. Jegga*,1,5

1Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, USA

2Division of Pulmonary Medicine, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, USA

3Department of Biological Sciences, Purdue University, West Lafayette, Indiana, USA

4Department of Computer Science, Purdue University, West Lafayette, Indiana, USA

5Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, Ohio, USA

†Equal contribution

Citation: Yunguan Wang, Kavisha Arora, Fanmuyi Yang, Woong-Hee Shin, Jing Chen, Daisuke Kihara, Anjaparavanda P Naren, Anil G Jegga. PP-2, a src-kinase inhibitor, is a potential corrector for F508del-CFTR in cystic fibrosis. bioRxiv; doi: https://doi.org/10.1101/288324

4.1. Abstract

Cystic fibrosis (CF) is an autosomal recessive disorder caused by mutations in the CF transmembrane conductance regulator (CFTR) gene. The most common mutation in CF, an in-frame deletion of phenylalanine 508, leads to a trafficking defect and endoplasmic reticulum retention of the protein, where it becomes targeted for degradation. Successful clinical deployments of ivacaftor and the ivacaftor/lumacaftor combination have been an exciting translational development in treating CF. However, their therapeutic effects are variable between subjects and remain insufficient. We used the Library of Integrated Network-based Cellular Signatures (LINCS) database as our chemical pool to screen for candidates. For in silico screening, we integrated connectivity mapping and CF systems biology to identify candidate therapeutic compounds for CF. Following in silico screening, we validated our candidate compounds with (i) an enteroid-based compound screening assay using CF (ΔF508/ΔF508-CFTR) patient-derived enteroids, (ii) short-circuit current analysis using polarized CF primary human airway epithelial cells and (iii) Western blots to measure F508del-CFTR protein maturation. We identified 184 candidate compounds with in silico screening and tested 24 of them with an enteroid-based forskolin-induced swelling (FIS) assay. The top hit compound was PP2, a known src-kinase inhibitor that induced swelling in enteroids comparable to that induced by a known CF corrector (lumacaftor). Further validation with Western blot and short-circuit current analysis showed that PP2 could correct mutant CFTR misfolding and restore CFTR-mediated transmembrane current. We have identified PP2, a known src-kinase inhibitor, as a novel corrector of ΔF508-CFTR. Based on our studies and previous reports, src kinase inhibition may represent a novel paradigm of multi-action therapeutics (corrector, anti-inflammatory, and anti-infective) in CF.

Keywords

Cystic fibrosis, F508del-CFTR, ΔF508-CFTR, src-kinase inhibitor, CFTR corrector, in silico drug screening, drug discovery, organoids

4.2. Background

Cystic fibrosis (CF) is a life-limiting genetic disorder affecting 70,000 individuals worldwide (about 30,000 in the United States). Although the recent drug approvals for CF (potentiator ivacaftor alone and in combination with corrector lumacaftor) are promising, the pursuit of additional therapeutics for CF needs to continue (113) for the following reasons. In vitro studies have shown that the approved combination has limited clinical efficacy, with potential interference of the potentiator (which augments CFTR channel function) with the actions of the corrector (which promotes the read-through of nonsense mutations or facilitates the translation, folding, maturation, and trafficking of mutant CFTR to the cell surface) and destabilization of corrected ΔF508-CFTR, the most common mutant in CF (114-116). Additionally, apart from being cost-prohibitive (>$250K/year), the approved combination will not work for all CF patients (114, 115, 117, 118). The context of various CFTR mutations (>2000), the complexity of the underlying pathways, and pathway crosstalk suggest that a computational big data approach that uses high-throughput experimental data and systems biology attributes could lead to the discovery of hitherto unanticipated potential targets and therapeutic candidates for CF. Indeed, mining some of the genomics and small-molecule data silos (e.g., disease/drug gene expression profiles or signatures) in an integrative manner has already been shown to be of translational benefit, such as finding novel drug–drug and drug–disease relationships (57, 119-128). However, a majority of such signature comparison-based approaches treat cells as black boxes; gene expression patterns from small-molecule treatments and from patients or disease models are the inputs, and a ranked list of compounds is the output. Further, the output from such computational analysis is often large, necessitating further prioritization. Thus, there is a critical and unmet need to triage small molecules discovered from high-throughput screens for further development so that investigators can focus on a smaller number with greater success and lower cost. By incorporating additional layers of systems biology attributes (e.g., wild type and mutant CFTR-specific protein interactions and CF-relevant signaling networks and pathways), the current study sought to unveil the internal configuration of the black box underlying the connected small molecules, their targets, and CF disease gene signatures. Such understanding could lead to identifying new therapeutics, drug targets, and target pathways and novel mechanisms of action for CF. Additionally, doing these studies earlier in the drug discovery pipeline could help in potentially foreseeing or avoiding late-stage clinical trial failures. Since the compound screening framework in the current study includes approved drugs, drug repositioning candidates for CF might also be found.

The CF gene expression signature derived from rectal epithelia (RE) of human CF patients with ∆F508-CFTR mutation (129) was used to search the Library of Integrated Network-based Cellular Signatures (LINCS) database (60, 130) to identify compounds that are anti-correlated with the CF signature. We then complemented this unbiased chemical signature-based screening method with CF knowledge-based systems biology approaches to characterize and infer mechanistic insights. Finally, we tested computationally prioritized small molecules from this integrated approach in vitro for their effect on fluid secretion (in the presence and absence of CF-corrector VX-809) by using intestinal organoids (enteroids or miniguts) generated from the intestines of ΔF508-CFTR homozygous mice and CF patients. Additional candidate compounds were validated by measuring the CFTR-dependent short circuit currents (Isc) by using polarized CF primary human airway epithelial cells (131, 132).

4.3. Methods

Differentially expressed genes in CF

Microarray data of rectal epithelial cells from healthy volunteers and from CF patients with ∆F508-CFTR mutation (homozygous) were downloaded from the National Center for Biotechnology Information Gene Expression Omnibus (GEO) (133). Differential expression analysis was performed using the R package limma with a P-value threshold of 0.05 and a fold change threshold of 1.5 (95).
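The filtering step applied to the differential analysis output can be illustrated with a short Python sketch. The original analysis was run with limma in R; the sketch below only shows the thresholding logic and assumes that each gene has a log2 fold change and a P-value, that the P-value cutoff is strict, and that the fold change threshold applies in both directions. The gene names and values are hypothetical.

```python
import math


def select_degs(results, p_cutoff=0.05, fold_change_cutoff=1.5):
    """Split genes passing both thresholds into up- and down-regulated lists.

    results: dict mapping gene -> (log2_fold_change, p_value)
    """
    # A 1.5-fold change corresponds to |log2 FC| >= log2(1.5), about 0.585.
    min_abs_log2fc = math.log2(fold_change_cutoff)
    up, down = [], []
    for gene, (log2fc, p_value) in results.items():
        if p_value < p_cutoff and abs(log2fc) >= min_abs_log2fc:
            (up if log2fc > 0 else down).append(gene)
    return up, down


# Hypothetical results for four placeholder genes.
toy_results = {
    "G1": (1.0, 0.01),     # passes, up-regulated
    "G2": (-0.8, 0.001),   # passes, down-regulated
    "G3": (0.3, 0.0001),   # significant, but fold change too small
    "G4": (2.0, 0.2),      # large change, but not significant
}
up_genes, down_genes = select_degs(toy_results)
```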

Computational small-molecule compound screening

We used the LINCS cloud web tool (60) and gene set enrichment analysis (GSEA) to identify small molecules from the NIH’s LINCS library that could potentially reverse the CF gene expression profile. LINCS signatures (460,000) were downloaded from LINCScloud (www.lincscloud.org) using its API. Each signature consisted of a list of the 100 most up-regulated and 100 most down-regulated probes. The original ConnectivityMap method (57) was applied to a CF disease signature against each of the 460,000 LINCS signatures to calculate a connectivity score. This resulted in 460,000 LINCS Connectivity Scores for each CF disease signature. Because the same treatment was applied in multiple cell lines with different durations and dosages in the LINCS experiments, all such signatures were considered biological replicates of the same treatment in the current analysis. Therefore, a non-parametric two-sample Kolmogorov–Smirnov (KS) test was used to compare the Connectivity Scores of signatures from one treatment against the background Connectivity Scores of all 460,000 LINCS signatures. Since we were interested in compounds whose signatures were inversely correlated with the CF disease signature, a one-tailed KS test was applied wherein the alternative hypothesis was that the Connectivity Scores of the underlying treatment were higher than the background Connectivity Scores (note that a negative Connectivity Score indicates a reverse relation between the treatment and the CF disease signature). This one-tailed two-sample KS test was performed for each unique compound treatment, and the resulting p-value was reported as the significance of the compound. This p-value was then used to rank all compound treatments in LINCS. The complete process was performed in the R environment. LINCS signatures were downloaded using the “jsonlite” package (134).
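The per-treatment test and ranking can be sketched in Python as follows. This is an illustration under stated assumptions rather than the original R code: the one-sided statistic is oriented so that a treatment scores highly when its Connectivity Scores are shifted toward more negative values than the background (one reading of the one-tailed test described above), the p-value uses the classical large-sample approximation, and the compound names and scores are hypothetical.

```python
import bisect
import math


def ks_one_sided(sample, background):
    """One-sided two-sample KS test.

    Returns (D_plus, p_value), where D_plus is the largest amount by which
    the empirical CDF of `sample` exceeds that of `background`. A large
    D_plus means `sample` is shifted toward smaller (more negative) values.
    The p-value is the asymptotic approximation exp(-2 * D^2 * m*n/(m+n)).
    """
    xs, ys = sorted(sample), sorted(background)
    m, n = len(xs), len(ys)
    d_plus = 0.0
    for t in xs + ys:  # empirical CDFs only change at observed values
        cdf_x = bisect.bisect_right(xs, t) / m
        cdf_y = bisect.bisect_right(ys, t) / n
        d_plus = max(d_plus, cdf_x - cdf_y)
    p_value = math.exp(-2.0 * d_plus ** 2 * m * n / (m + n))
    return d_plus, p_value


def rank_treatments(scores_by_treatment):
    """Rank treatments by p-value against the pooled background of all scores."""
    background = [s for scores in scores_by_treatment.values() for s in scores]
    p_values = {name: ks_one_sided(scores, background)[1]
                for name, scores in scores_by_treatment.items()}
    return sorted(p_values, key=p_values.get)


# Hypothetical connectivity scores for three placeholder compounds.
toy_scores = {
    "cmpd_reverser": [-0.9, -0.8, -0.7, -0.85],
    "cmpd_neutral": [-0.1, 0.0, 0.1, 0.05],
    "cmpd_mimic": [0.6, 0.7, 0.8, 0.9],
}
toy_ranking = rank_treatments(toy_scores)
```

Pooling every signature into the background mirrors the description above, in which each treatment's replicate scores are compared with the scores of the entire signature collection.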

Computational modeling of ∆F508-CFTR and docking of PP2 and VX-809

The open-state WT CFTR structure was obtained from a previous work by Dalton et al. (135). The ∆F508-CFTR mutant structure was predicted by Modeller9v18 (136) using the wild-type structure as a template. Three-dimensional ligand structures of VX-809 and PP2 were obtained from PubChem (137). Hydrogens were added to the receptor, and Gasteiger charges were assigned to the protein and the compounds, using AutoDockTools (138).

Reagents and antibodies

PP2 and other compounds tested were purchased from Tocris Bioscience (Ellisville, Missouri) while VX-809 and VX-661 were obtained from Selleck (Houston, Texas).

Human studies

Human intestinal biopsy and lung tissues were obtained from CFTRWT/WT non-CF and ΔF508/ΔF508-CFTR CF individuals under the protocol and consent form approved by the Institutional Review Board at the Cincinnati Children’s Hospital Medical Center (IRB # 2011-2616). All adult participants provided informed consent, and a parent or guardian of any child participant provided informed permission on the child’s behalf. All consent was obtained in written form.

Enteroid cultures

Mouse intestinal organoids were cultured as described previously (25). Human duodenal crypt isolation and enteroid expansion were performed as described previously (https://www.jove.com/video/52483/establishment-human-epithelial-enteroids-colonoids-from-whole-tissue) with some adaptations. Briefly, fresh biopsy tissue was rinsed in ice-cold Dulbecco’s phosphate-buffered saline without Ca2+ and Mg2+ (DPBS, Gibco), then mounted and immersed in DPBS in a silica gel-coated petri dish using minutien pins, with the mucosal side facing up. The mucosa was gently scraped with curved forceps to remove villi and debris, followed by 3-4 washes with DPBS. Crypts were dissociated using 2 mM EDTA (30 min, 4°C with gentle shaking) followed by gentle scraping of the mucosa. The crypt suspension was filtered through a 150 µm nylon mesh twice and pelleted at 50 × g, 4°C. The crypt pellet was resuspended in Matrigel matrix (200 to 500 crypts/50 µl Matrigel per well of a 24-well plate). Matrigel was allowed to polymerize by placing the plate in a 37°C, 5% CO2 incubator for 30 min, followed by addition of complete growth factor-supplemented human minigut medium (Advanced DMEM/F12 medium with 2 mM glutamine, 10 mM HEPES, 100 U/mL penicillin, 100 µg/mL streptomycin, 1× N2 supplement, 1× B27 supplement and 1% BSA, supplemented with 50% Wnt-3A-conditioned medium, 1 µg/ml R-spondin 1, 100 ng/ml Noggin, 50 ng/mL EGF, 500 nM A-83-01, 10 µM SB202190, 10 nM [Leu]15-Gastrin 1, 10 mM nicotinamide and 1 mM N-acetylcysteine).

Fluid secretion measurement in intestinal spheroids

Isolation of intestinal spheres and measurement of fluid secretion were performed as described previously (139). Day 1-4 intestinal organoids were treated with 0.1-10 µM of the test compound for 24 h before stimulation of CFTR function using forskolin (10 µM). Fluid secretion measurements were done before and after 30 min of a stimulation period for mouse organoids and 120 min for human organoids. Quantitation of fluid secretion in the intestinal spheres was performed as described previously (139, 140).

Isc measurement

Primary human ΔF508/ΔF508 CFTR bronchial epithelial cells grown on Costar Transwell permeable supports (Cambridge, MA; filter diameter 12 mm) were mounted in an Ussing chamber maintained at 37°C. Epithelial cells were pretreated with DMSO, PP2 (2 µM, 24 h), VX-809 (2 µM, 24 h), or PP2 + VX-809. A 2-mV pulse was applied every 1 min throughout the experiment to check the integrity of the epithelia. Cells were treated with 50 µM amiloride at the beginning of the experiment. After Isc had stabilized, cells were treated with forskolin (Fsk) (10 µM) on the apical side. CFTRinh-172 (20-50 μM) was added to the apical side at the end of each experiment to verify current dependence on CFTR.

PP2 RNA-Seq

RNA sequencing was performed on primary human bronchial epithelial cells homozygous for ∆F508-CFTR mutation (sample ID KKCFFT004I, Charles River, Wilmington, MA). Briefly, fully differentiated bronchial epithelial cells maintained on air-liquid interface were treated with DMSO or PP2 (2 µM, 24 h). Total RNA was isolated using mirVana™ miRNA Isolation Kit (Carlsbad, California) and used for RNA sequencing.

4.4. Results

4.4.1. Identifying candidate anti-CF small molecules through integrated gene expression profiling and systems biology

We used a published CF patient gene expression data set (GSE15568) (129), which was based on rectal epithelia samples of CF patients with the ΔF508-CFTR mutation. Differential analysis of gene expression was performed by the R package ‘limma’ with a P-value threshold of 0.05 and a fold change threshold of 1.5 (95). Compared with healthy controls, 1330 genes were differentially expressed in ∆F508-CF patients. Among these genes, 555 were downregulated, and 775 were upregulated. We used this differentially expressed gene (DEG) set to identify small molecules from the NIH’s LINCS library that could potentially reverse the CF gene expression profile. We used the LINCS cloud web tool and gene set enrichment analysis (GSEA) for this purpose (Figure 4-1). Following this signature reversal step, 184 compounds were identified as significantly reversing the CF disease signature.

Figure 4-1: Schematic representation of workflow to identify CF correctors.

Differentially expressed genes from CF patients (∆F508-CFTR homozygous) were used for connectivity mapping to identify CF candidate therapeutics. These candidate therapeutics were further prioritized by incorporating additional layers of information from systems biology of CF. Prioritized CF candidate therapeutics were experimentally validated using intestinal organoids and bronchial epithelial cells from CF patients.

To characterize each of the 184 compounds computationally, specifically to determine their putative functional relatedness to the CFTR protein (WT or ∆F508) interactome (141) and to compiled CF-relevant pathways (see Methods), we performed a singular enrichment analysis (SEA) by using the DEGs of each of these compounds in the A549 cell line available from the LINCS database.

Eight compounds that were not enriched in any of the CF-relevant pathways were removed from the candidate list of compounds. To further narrow the candidates potentially involved in CF, we selected 10 pathways that were highly related with CF pathogenesis and ranked all the remaining compounds based on their average enrichment p-values in these terms. To diversify our selection, we also selected a few compounds with low ranking scores. Finally, we randomly selected 18 small molecules from the top 1/3 ranked, 5 from the middle, and 1 from the bottom ranked compounds. These compounds also represented different chemical classes (such as flavonoids, src family kinase inhibitors, MAPK inhibitors, mTOR inhibitor, and PI3K inhibitor).
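The ranking and tiered random selection described above can be sketched in Python as below. This is an illustration only: splitting the ranked list into three equal-sized tiers, the fixed random seed, and the placeholder compound names and p-values are all assumptions, since the text does not specify how the tiers were delimited.

```python
import random


def select_candidates(mean_p_by_compound, n_top=18, n_middle=5, n_bottom=1, seed=0):
    """Pick compounds at random from the top, middle and bottom thirds.

    mean_p_by_compound: dict mapping compound -> average enrichment p-value
    across the selected CF-related pathways (smaller means better ranked).
    """
    ranked = sorted(mean_p_by_compound, key=mean_p_by_compound.get)
    third = len(ranked) // 3
    top, middle, bottom = ranked[:third], ranked[third:2 * third], ranked[2 * third:]
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return (rng.sample(top, n_top)
            + rng.sample(middle, n_middle)
            + rng.sample(bottom, n_bottom))


# 184 candidates minus the 8 removed compounds leaves 176 to rank.
# Names and p-values here are hypothetical placeholders.
toy_mean_p = {f"compound_{i:03d}": (i + 1) / 1000.0 for i in range(176)}
toy_selection = select_candidates(toy_mean_p)
```

Sampling mostly from the top tier while keeping a few lower-ranked compounds matches the stated aim of diversifying the selection instead of testing only the best-ranked candidates.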

4.4.2. Screening of candidate compounds in mouse ΔF508/ΔF508-cftr enteroids for mutant CFTR functional rescue

To screen for the effect of the 24 candidates selected from the integrated computational and CF systems biology analysis on ΔF508-CFTR functional rescue, we resorted to physiologically relevant intestinal stem cell-derived spheroids (enteroids) isolated from ΔF508/ΔF508-cftr mice. Intestinal spheroids are validated models for studying CFTR-dependent fluid secretion (139, 142). Of the 24 compounds screened, two (PP2 and LY-294002) demonstrated increased forskolin-induced swelling (FIS) in the enteroids compared to the DMSO-treated control (Figure 4-2). Of these, the small molecule PP2 showed the maximal rescue of ΔF508-CFTR function, and its effects were noted to be higher than those of the approved and investigational CF correctors (VX-809 and VX-661; 12% and 14%, respectively). We also observed a dose-dependent (0.1 μM to 10 μM) rescue of mutant CFTR in the presence of PP2 in the enteroids, and this effect was mitigated in the presence of CFTRinh-172 (Figure 4-3). A potential synergism was noted between PP2 and the CF correctors VX-809 and VX-661 at 1 μM PP2 (Figure 4-3B).

Figure 4-2: FIS in ΔF508/ΔF508-cftr intestinal organoids.

Representative images of intestinal spheres demonstrating fluid secretion in response to the test compounds (Panel A). Bar graph represents quantitation of fluid secretion in the intestinal organoids (Panel B). In this preliminary screening assay, all compounds were tested at a dose of 10 µM, while the known correctors (VX-809 and VX-661) were tested at a dose of 2 µM. The highlighted compounds (PP2 and LY294002) in the bar graph were found to show forskolin-induced swelling significantly better than that seen with the treatment of VX-809 or VX-661.

Figure 4-3: PP2 Dose response experiments in mice.

Panel A. Representative images of ΔF508/ΔF508-cftr enterospheres depict secretion under various treatment conditions: (i) PP2 (0, 0.1, 1 and 10 µM, 24 h), (ii) PP2 (1 µM, 24 h) + CFTRinh-172 (20 µM, 30 min), (iii) VX-809 (2 µM, 24 h), and (iv) PP2 (1 µM, 24 h) + VX-809 (2 µM, 24 h). Panel B. Line graph represents quantitation of fluid secretion in enterospheres under various treatment conditions as described previously (Panel A).

4.4.3. Screening of PP2 and its non-src-kinase-inhibitor analog PP3 using enteroids derived from CF patients (ΔF508/ΔF508-CFTR)

The corrector potential of PP2 could be recapitulated in CF (ΔF508/ΔF508-CFTR) patient-derived duodenal organoids. PP2 induced significant FIS at a dose of 2 µM, and the enteroid swelling induced by PP2 was comparable to that induced by VX-809. To examine the dependence of PP2-mediated FIS on src kinase inhibition, we included PP3, an inactive PP2 analog, in our CFTR assay; the analog failed to mediate FIS (Figure 4-4A).

Figure 4-4: Preclinical validation of PP2.

Panel A shows representative images of intestinal spheres demonstrating forskolin-induced swelling (FIS) in enteroids from a normal subject and those from cystic fibrosis patients in response to DMSO, PP3, VX-809, and PP2. The bar graph represents quantitation of fluid secretion in the intestinal organoids. Panel B. Western blot data depict bands B (immature or endoplasmic reticulum form) and C (mature or membrane form) of CFTR immunoprecipitated from HEK 293 cells that overexpressed FLAG ΔF508-CFTR with and without treatment with PP2 (0.5 and 2 µM, 24 h), VX-809 (2 µM, 24 h) alone, and PP2 + VX-809 (2 µM each, 24 h). Panel C. Representative CFTR-mediated short-circuit currents (Isc) in human bronchial epithelial cells carrying ΔF508/ΔF508-CFTR in response to VX-809, PP2, and VX-809+PP2 treatment. Bar graphs represent data quantification of maximal increase in Isc/cm2 from n = 5 (DMSO-treated) and n = 6 (PP2-treated) independent traces. Panel D. Predicted binding pose of PP2 to the ATP binding site.

4.4.4. PP2 corrected ΔF508-CFTR mutant protein

Since we observed that PP2 functionally rescued ΔF508-CFTR, we sought to investigate whether PP2 rescues ΔF508-CFTR trafficking. Processing of immature band B of ΔF508-CFTR to mature band C would suggest improved trafficking or correction of the mutant ΔF508-CFTR protein. Treatment of ΔF508-CFTR expressing HEK 293 cells with PP2 increased the formation of mature band C, an effect like that of VX-809. Additionally, a strong synergistic effect was observed upon simultaneous treatment of cells with PP2 and VX-809 (Figure 4-4B). This observation qualified PP2 as a candidate CF corrector. We validated the rescue potential of PP2 by measuring CFTR-mediated short circuit currents in primary human bronchial epithelial cells with ΔF508/ΔF508-CFTR mutation and observed a functional restoration of the mutant protein by ~4-fold and synergistic functional rescue in the presence of PP2 and VX-809 combination (Figure 4-4C).

4.4.5. Computational modeling of ∆F508-CFTR and PP2

To predict the PP2 mode of action (MoA), we docked PP2 and VX-809 to ∆F508-CFTR and compared binding poses and affinities. The binding affinities predicted by AutoDock4 at the NBD1:ICL4 interface were -7.92 kcal/mol for PP2 and -10.00 kcal/mol for VX-809. At the ATP binding site, the predicted binding affinities for PP2 and VX-809 were -8.47 and -11.14 kcal/mol, respectively. Because PP2 was predicted to bind more favorably at the ATP binding site than at the NBD1:ICL4 interface, these results suggest that PP2 might bind to the ATP binding site. The predicted binding poses of VRT-325 and PP2 are illustrated in Figure 4-4D. Further, the heterobicyclic aromatic rings of PP2 were positioned at the same place in the binding pocket, supporting our hypothesis that PP2 binds to the ATP site. To further assess the binding affinity of PP2 for the ATP binding site, we docked PP2 with four kinases known to bind it (Fyn, Hck, Lck, and Src). Interestingly, the predicted binding affinities between PP2 and the four kinases were weaker than that between PP2 and the ATP binding site of CFTR. For instance, although the Lck protein has been co-crystallized with PP2 (1QPE), its predicted binding affinity was ~1.8 kcal/mol weaker than that of the ATP binding site of CFTR. This result strongly suggests that PP2 binds to the ATP binding site of CFTR.
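The site comparison in this section is simple arithmetic on the reported docking scores. The sketch below is only an illustration of that comparison, not part of the docking workflow itself: it takes the four predicted affinities quoted above, treats more negative values as stronger predicted binding, and reports the preferred site and the score gap for each compound. The dictionary layout and the function name `site_preference` are hypothetical.

```python
# Predicted AutoDock4 binding affinities (kcal/mol) as quoted in the text.
AFFINITY = {
    "PP2": {"NBD1:ICL4": -7.92, "ATP site": -8.47},
    "VX-809": {"NBD1:ICL4": -10.00, "ATP site": -11.14},
}


def site_preference(compound):
    """Return (preferred site, score gap in kcal/mol) for one compound.

    More negative scores mean stronger predicted binding, so the
    preferred site is the one with the minimum score. The gap is the
    preferred-site score minus the other site's score (negative).
    """
    scores = AFFINITY[compound]
    best = min(scores, key=scores.get)
    worst = max(scores, key=scores.get)
    return best, round(scores[best] - scores[worst], 2)


for name in AFFINITY:
    print(name, site_preference(name))
```

With these inputs, both compounds score more favorably at the ATP binding site. Docking scores are estimates, so gaps of this size are best read as supporting rather than conclusive evidence.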

4.4.6. Characterization of PP2 as candidate therapeutic for CF

To gain insight into the mechanism of ∆F508-CFTR correction by PP2 in the context of CF pathophysiology, we analyzed the transcriptional profile of PP2 (from both LINCS and CF RNASeq data) along with CF DEGs.

We first analyzed the top upregulated and downregulated genes from PP2-treated cells (LINCS L1000 data) and DEGs from CFRE. Specifically, using our compiled CF- and CFTR-relevant biological processes and pathways (143) and protein interactions (WT-CFTR and ∆F508-CFTR) (141), we performed functional enrichment analysis of these gene sets. PP2 upregulated genes including TSPAN13, PRCP, RHOBTB3, TPD52L1, NRIP1, XBP1, SPDEF, MUC5B, HOXA5, and AGR2, which were enriched in CF-related processes such as goblet cell differentiation and SPDEF-induced genes. PP2 downregulated genes such as PSMB1, NPLOC4, DYNLT1, DYNC1LI1, PSMB7, GNAI3, TUBB6, PPP2CB, TUBA4B, TUBB4B, and ADCY1, which were enriched in CF-related processes including regulation of degradation of WT and ΔF508 CFTR and regulation of CFTR activity (normal and CF) (Figure 4-5). Genes involved in CFTR proteostasis were enriched among the upregulated DEGs in CFRE and among the top genes downregulated by PP2. These results suggest that PP2 might correct CFTR by modulating degradation of mutant ∆F508-CFTR.

Figure 4-5: CF pathway enrichment network for genes that are reciprocally connected in CF and PP2 treatment.

Rectangular orange nodes represent CF-related pathways; octagonal nodes represent differentially expressed genes either in CFRE or PP2-treated cells. Green edges represent downregulation; red edges represent upregulation.

Second, to gain context-specific insight into PP2 function, we performed RNA sequencing on ΔF508/ΔF508 CFTR and normal primary human bronchial epithelial cells treated with DMSO or PP2 (2 µM, 48 h). A total of 119 genes were differentially expressed in PP2-treated bronchial epithelial cells compared with vehicle control (Figure 4-6A). PP2 DEGs showed the highest positive connectivity with src-kinase inhibitors and the highest negative connectivity with tubulin inhibitors. Notably, PP2 partially reversed abnormal gene expression caused by the F508del CFTR mutation in bronchial epithelial cells (Figure 4-6 B, C; Table 3-1). These genes included matrix metalloproteases such as MMP-1, MMP-9, MMP-10, MMP-12, and MMP-13, which were downregulated by PP2. Among them, MMP-9 and MMP-12 had increased activity in CF patients (144, 145), and increased MMP-9 activity was negatively correlated with lung function (146). PP2 also downregulated genes involved in the CF-related pathway of macrophage activation, including INHBA, IL1A, and FN1. This is important because leukocyte inflammation in the airway is associated with increased CF severity in patients (147). On the other hand, PP2 upregulated and restored expression of CFI, C1R, and C1S, which are involved in classical complement activation. This is potentially therapeutic since complement activation helps contain bacterial infection, which is an important contributing factor for CF pathogenesis in the lung (148, 149). Taken together, these results suggest that PP2 could potentially protect the airway from inflammation and proteolytic damage in CF patients in addition to correcting the fundamental defect of mutant CFTR in CF (Figure 4-7).

Figure 4-6: PP2 reversed the expression profiles of genes dysregulated in F508del-CFTR bronchial epithelial cells.

Expression of 117 differentially expressed genes in bronchial epithelial cells with WT CFTR (green), F508del-CFTR (light blue), and F508del-CFTR treated with 2 µM PP2 (dark blue) is shown. Each row represents a gene, and each column represents a sample. Colors were mapped based on Reads Per Kilobase of transcript per Million mapped reads (RPKM) values, where yellow corresponds to high expression and blue corresponds to low expression.

Figure 4-7: Functional enrichment network of differentially expressed genes following treatment with PP2 (2 µM; 48 h) in human CFBE (ΔF508/ΔF508-CFTR).

4.5. Discussion

In this study, we described a CF-specific, systems biology-guided, unbiased computational compound screening to identify and prioritize novel small molecules that could potentially rescue ∆F508-CFTR function. We validated our computationally predicted and prioritized small-molecule candidates through extensive functional validation, using enteroids generated from the ileum of ΔF508-CFTR homozygous mice and from rectal biopsies of CF patients (ΔF508-CFTR homozygous), together with Isc analysis of human bronchial epithelial cells harvested from homozygous ΔF508-CFTR transplant patients. Recent studies have shown the utility of using CF patient-derived rectal organoids for drug screening (150).

Based on our findings, we reported a novel corrector (PP2, a src-kinase inhibitor) of the ΔF508-CFTR defect. Connectivity mapping of two other gene expression data sets from CF (human CF bronchial epithelia (151) and CF pigs (152)) with LINCS signatures also showed PP2 among the top hits. Our proof-of-principle studies also demonstrated src-kinase inhibitors as a new class of compounds that can be used for rescuing ΔF508-CFTR.

While we observed synergy with VX-809, the potential side effects of PP2 in combination with VX-809 (or VX-661) are difficult to predict. Further, we noted that PP2, like most other src-kinase inhibitors, might be mechanistically associated with drug-induced side effects. Thus, thorough toxicity tests in animal models followed by clinical studies of PP2 and the combinations (PP2 and VX-809/VX-661) would have to be performed.

Although we tested 24 compounds from a total of 184 candidates, we believe that additional CF candidate therapeutics can be identified from the remaining untested compounds. For instance, among the remaining 160 compounds, we further prioritized 30 compounds that are highly similar to both PP2 and LY-294002 as CF candidate therapeutics.

Although results from our in vitro analysis and computational docking (Figure 4-4 B, D) suggest potential direct action of PP2 on mutant CFTR, additional in vitro and in vivo analyses are warranted. For instance, is the PP2-induced mutant CFTR correction dependent on canonical SFK (src family kinase) inhibition? SFKs have important roles in biological processes altered in CF such as apoptosis, inflammatory response, autophagy, and mucin production, although the exact relation between SFKs and these processes in the context of CF remains largely unknown (153). A recent study showed that CFTR deficiency leads to SFK self-activation and an intensified inflammatory response in mouse cholangiocytes exposed to endotoxin or LPS, and that inhibition of SFK with PP2 decreased inflammation (154). This is consistent with our results, wherein genes downregulated by PP2 in human CFBE showed an enrichment (P<0.05) for macrophage activation markers such as INHBA, IL1A, MMP12, MMP9, and FN1, suggesting that PP2 may potentially ameliorate SFK-triggered inflammatory cascades. Recurrent Pseudomonas aeruginosa (PsA) infection is a major contributor to CF pathology (155). Previous studies (156) have shown that Lyn, an SFK member, is critical for PsA internalization into lung cells and that blocking the activity of Lyn with PP2 prevented PsA internalization. It has also been shown that the NLRP3 inflammasome is mediated by SFK activity (157), that NLRP3 activation exacerbates the PsA-driven inflammatory response in CF (158), and that NLRP3 inflammasomes are potential targets to limit microbial colonization in CF (159). Additionally, it has been reported that PsA infection induces the expression of MMPs (MMP12 and MMP13), which further exacerbates chronic lung infection and inflammation (160, 161). In our study, we found that PP2 downregulated MMPs and NLRP3 in CFBE, suggesting that PP2 could also be therapeutic in CF through inhibition of PsA infection.

4.6. Conclusions

In addition to novel candidate drugs to target mutated CFTR, novel anti-inflammatory and anti-infective drugs that address secondary disease pathology in CF are needed to lessen morbidity, prolong survival, and improve quality of life (162, 163). Thus, although additional studies are required, our results suggest that SFK inhibition (e.g., by PP2) may represent a novel paradigm of multi-action therapeutics (corrector, anti-inflammatory, and anti-infective) in CF.

Interestingly, while this study was underway, a second study from the Strazzabosco lab reported (164) that the combination of correctors and PP2 in ΔF508 cholangiocytes significantly increased the amount of band C of ΔF508-CFTR.

4.7. Abbreviations

CF: Cystic fibrosis

CFTR: CF transmembrane conductance regulator

LINCS: Library of Integrated Network-based Cellular Signatures

FIS: Forskolin-induced-swelling

CFRE: CF rectal epithelia

Isc: short circuit currents

DEG: differentially expressed gene

GSEA: gene set enrichment analysis

SEA: singular enrichment analysis

PsA: Pseudomonas aeruginosa

PP2: 1-tert-Butyl-3-(4-chlorophenyl)-1H-pyrazolo[3,4-d]pyrimidin-4-amine

4.8. Declarations

Ethics approval and consent to participate

Human studies: Patient-derived organoid assay studies for research were approved by CCHMC IRB under 2011-2616.

Mice studies: All procedures in mice were performed in compliance with institutional guidelines and were approved by CCHMC’s IACUC.

Consent for publication

Not applicable.

Availability of data and material

All data generated or analyzed during this study are included in this published article and its supplementary files.

Competing interests

The authors declare that they have no competing interests.

Funding

Supported in part by the NIH grants NHLBI 1R21HL133539 and 1R21HL135368 (to AGJ) and by the Cincinnati Children’s Hospital and Medical Center, and R01GM123055 and NSF grant DMS1614777 (to DK).

4.9. Authors' contributions

YW, KA, APN, and AGJ conceived the project, designed experiments, analyzed data, and wrote the manuscript. YW and JC helped with the data analysis. FY helped with the experiments. WH and DK performed the docking studies. All the authors have read and approved the final manuscript.

4.10. Acknowledgement

The authors thank David L. Armbruster for editing the manuscript. The authors would like to thank Drs. Jeffrey Whitsett and Bruce Aronow for their support.

5. Chapter 5. Integrative in silico screening of candidate therapeutic discovery for Idiopathic pulmonary fibrosis

5.1. Introduction

(21). Currently, two anti-fibrotic drugs, pirfenidone and nintedanib, are approved for IPF based on their efficacy in slowing the decline in lung function caused by disease progression.

The advent of two FDA-approved therapies for idiopathic pulmonary fibrosis (IPF) has energized the field, but enthusiasm is tempered by the recognition that their side effect profiles are formidable and that their therapeutic effects are suppressive rather than inducing remission or reversion of pulmonary fibrosis (23, 24). Hence, the pursuit of safer and more efficacious therapies, or combinations of therapies, that arrest or reverse fibrosis continues. Recent technological advances, high-throughput approaches, and abundant genomic and patient data can be harnessed to identify drug candidates for testing in clinical trials. We integrate information from multiple public data sources and the complex relationships between drugs and IPF-related biological networks to identify candidate therapeutics for IPF in a systematic and generalizable manner, as outlined in Figure 5-1.

Figure 5-1: Workflow of integrated in silico drug screening for IPF.

5.2. Methods

5.2.1. Cohort selection

We used publicly available gene expression profiles from the Gene Expression Omnibus (GEO) (133) database as the basis for generating IPF gene expression signatures. Eight GEO datasets comparing primary healthy human lung tissues with primary IPF lung tissues were selected for this study (Table 5-1). Each dataset was analyzed separately.

Table 5-1: Summary of 8 datasets comparing IPF lung tissue with healthy controls.

Data set Name Description

GSE10667 23 IPF samples vs 16 controls

GSE24206 17 IPF samples vs 6 controls

GSE48149 13 IPF samples vs 9 controls

GSE52463 8 IPF samples vs 7 controls

GSE53845 40 IPF samples vs 8 controls

GSE92592 20 IPF samples vs 19 controls

GSE47460 131 IPF samples vs 12 controls

GSE32537 119 IPF samples vs 50 controls

GSE101286 7 IPF samples vs 3 controls

5.2.2. Differential analysis of IPF gene expression datasets

Differential analysis was performed in R using the package “limma” (95). Genes with a fold change greater than or equal to 1.5 and an adjusted p-value less than or equal to 0.05 were considered differentially expressed.
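As an illustration of the two cutoffs above, the following sketch filters a list of per-gene results. It is not the limma analysis itself. It assumes the results have already been exported as (gene, log2 fold change, adjusted p-value) tuples, and it assumes the fold-change cutoff is applied symmetrically to up- and down-regulated genes, which the text does not state explicitly. The function name `filter_degs` and the example genes are hypothetical.

```python
import math


def filter_degs(results, fc_cutoff=1.5, padj_cutoff=0.05):
    """Keep genes passing both the fold-change and adjusted p-value cutoffs.

    results: iterable of (gene, log2_fold_change, adjusted_p_value).
    A fold change of 1.5 corresponds to |log2FC| >= log2(1.5), about 0.585.
    Assumption: the cutoff applies to both up- and down-regulation.
    """
    lfc_cutoff = math.log2(fc_cutoff)
    return [
        (gene, lfc, padj)
        for gene, lfc, padj in results
        if abs(lfc) >= lfc_cutoff and padj <= padj_cutoff
    ]


# Made-up values for illustration only.
example = [
    ("GENE_A", 1.0, 0.01),   # passes both cutoffs
    ("GENE_B", -0.7, 0.05),  # passes both cutoffs (down-regulated)
    ("GENE_C", 0.5, 0.001),  # fold change too small
    ("GENE_D", 2.0, 0.2),    # adjusted p-value too large
]
print([gene for gene, _, _ in filter_degs(example)])
```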

5.2.3. Permutation analysis to estimate significance of drug-disease connectivity across datasets and cell lines

To correct for the multiple testing problem introduced by conducting connectivity analysis in multiple datasets, we used permutation analysis to estimate the significance of connectivity. First, we constructed a connectivity score matrix S of each CLUE compound by each IPF dataset in each cell line:

$$S := (s_{ij})_{m \times n}$$

where m is the total number of compounds and i denotes the ith compound in CLUE; n is the number of [dataset, cell line] combinations in the Cartesian product of IPF datasets and available cell lines from CLUE, and j denotes the jth combination. Next, positive and negative connectivity to IPF (C) were determined by thresholding the connectivity score at 90 and -90, respectively:

$$c_{ij} = \begin{cases} 1, & s_{ij} \ge 90 \\ 0, & -90 < s_{ij} < 90 \\ -1, & s_{ij} \le -90 \end{cases}$$

Overall connectivity of each compound to IPF (O) is summarized as the sum of individual connectivity across all datasets and all cell lines:

$$o_i = \sum_{j=1}^{n} c_{ij}$$

Permutation was performed by randomly shuffling rows of the connectivity matrix C, so that compound names were randomly assigned. Then, the permuted overall compound-to-IPF connectivity scores O' were calculated, and we recorded incidences where $o_i \le o_i'$, which indicates that the observed compound-to-IPF connectivity is no larger than random:

$$f_{ip} = \begin{cases} 1, & o_i \le o_i' \\ 0, & \text{otherwise} \end{cases}$$

where p indexes the permutation. We repeated the permutation test 100,000 times and estimated significance as the frequency of F over all permutations:

$$p_i = \frac{1}{100000} \sum_{p=1}^{100000} f_{ip}$$
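The procedure in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function names, the default of 10,000 permutations (chosen to keep the example fast), and the toy score matrix are all hypothetical. Because shuffling the rows of the thresholded matrix and then summing each row is equivalent to shuffling the row sums, the sketch permutes the overall scores directly.

```python
import random


def threshold_connectivity(scores, cutoff=90.0):
    """Map raw connectivity scores s_ij to c_ij in {-1, 0, 1}."""
    return [
        [1 if s >= cutoff else -1 if s <= -cutoff else 0 for s in row]
        for row in scores
    ]


def permutation_p_values(scores, n_perm=10_000, cutoff=90.0, seed=0):
    """Estimate p_i as the fraction of permutations with o_i <= o_i'.

    scores: list of rows, one row per compound, one column per
    [dataset, cell line] combination.
    Returns (observed overall scores o_i, estimated p-values p_i).
    """
    c = threshold_connectivity(scores, cutoff)
    observed = [sum(row) for row in c]           # o_i
    rng = random.Random(seed)
    counts = [0] * len(observed)
    for _ in range(n_perm):
        permuted = observed[:]                   # shuffling compound labels
        rng.shuffle(permuted)                    # o_i' for this permutation
        for i, (o, o_perm) in enumerate(zip(observed, permuted)):
            if o <= o_perm:                      # f_ip = 1
                counts[i] += 1
    return observed, [k / n_perm for k in counts]


# Toy matrix: 4 hypothetical compounds by 3 [dataset, cell line] columns.
toy_scores = [
    [95, 92, 91],
    [95, -95, 0],
    [-91, -99, 10],
    [50, 95, 20],
]
overall, p_values = permutation_p_values(toy_scores, n_perm=2000, seed=1)
print(overall, p_values)
```

In this toy matrix the compound with the highest overall score is matched or exceeded only when a permutation assigns it its own score, so its estimated p-value is close to 1/4. With thousands of compounds, as in CLUE, the smallest attainable p-value is correspondingly much lower.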


5.3. Results

5.3.1. Differential analysis of 8 IPF datasets

We collected 8 datasets comparing gene expression between lung tissue derived from IPF patients and that from healthy controls, as summarized in Table 5-1. Two of the datasets returned no differentially expressed genes after FDR correction and were thus excluded from further analysis. From each of the remaining datasets, a gene signature containing up to 150 of the most up-regulated and down-regulated genes was extracted and used in connectivity mapping.
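The signature extraction step can be sketched as follows. This is a hedged illustration only: it assumes that "up to 150" applies separately to the up-regulated and down-regulated lists and that genes are ranked by log2 fold change, neither of which is spelled out in the text. The function name `build_signature` and the example values are hypothetical.

```python
def build_signature(log2fc_by_gene, max_genes=150):
    """Return (up_genes, down_genes), each capped at max_genes.

    log2fc_by_gene: dict mapping a differentially expressed gene to its
    log2 fold change (IPF vs control). Up-regulated genes are sorted from
    the largest positive change, down-regulated genes from the most
    negative change.
    """
    up = sorted(
        (g for g, v in log2fc_by_gene.items() if v > 0),
        key=log2fc_by_gene.get,
        reverse=True,
    )
    down = sorted(
        (g for g, v in log2fc_by_gene.items() if v < 0),
        key=log2fc_by_gene.get,
    )
    return up[:max_genes], down[:max_genes]


# Made-up fold changes for illustration only.
degs = {"GENE_A": 2.0, "GENE_B": -1.0, "GENE_C": 0.8, "GENE_D": -3.0}
print(build_signature(degs))
print(build_signature(degs, max_genes=1))
```

A dataset with fewer than 150 differentially expressed genes in one direction simply contributes a shorter list, which matches the "up to" wording.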

5.3.2. Connectivity analysis and permutation tests

We used the ConnectivityMap platform for LINCS L1000 data (CLUE.io) to find drugs with gene expression profiles anticorrelated with those of IPF (“connected” to IPF). We applied a “greedy” approach to capture the highest number of compounds connected to IPF by selecting compounds with a connectivity score of at least 90 in any one of the cell lines. This approach returned more than 1,000 compounds that were connected with IPF at least once, as shown in Figure 5-2 and Figure 5-3.
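The "greedy" selection amounts to keeping any compound that reaches the cutoff in at least one [dataset, cell line] column. The sketch below illustrates this and also records how many columns reached the cutoff for each compound. It is an assumption-laden example rather than the actual query: the input is taken to be a plain dictionary of scores already downloaded from CLUE, and the function name `count_connections` and compound names are hypothetical.

```python
def count_connections(scores_by_compound, cutoff=90.0):
    """Return {compound: number of columns with score >= cutoff}.

    Only compounds that reach the cutoff at least once are kept,
    which mirrors the greedy selection described in the text.
    """
    counts = {}
    for name, scores in scores_by_compound.items():
        n_hits = sum(1 for s in scores if s >= cutoff)
        if n_hits >= 1:
            counts[name] = n_hits
    return counts


# Made-up scores for three hypothetical compounds.
toy = {
    "cpd_a": [95, 10, 91],
    "cpd_b": [50, -95, 0],
    "cpd_c": [90, 0, 0],
}
print(count_connections(toy))
```

Compounds with a count of one or two are retained by this rule even though their connection is not reproducible across columns, which is why a significance estimate is needed afterwards.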

A problem with this approach is that it introduces many compounds whose connection to IPF was observed only a few times out of a maximum of 48 incidences (6 datasets by 8 cell lines per dataset). This indicates that gene expression perturbation due to technical variation is present in the data and affects our CLUE analysis. Under the assumption that IPF pathogenesis-related gene expression patterns are commonly present in our selected 6 IPF datasets, we used permutation analysis to estimate the significance of compound-to-IPF connectivity and filter out false positives. A total of 82 compounds were significant at a p-value cutoff of 0.05. These compounds include ATPase inhibitors such as digoxin and ouabain; opioid receptor antagonists such as BNTX and JTC-801; bromodomain (BRD) inhibitors such as BI-2536 and TG-101348; and the survivin inhibitor YM-155.

Figure 5-2: Heatmap of compounds with positive IPF connectivity reported at least once.

Connectivity of CLUE compounds (rows) to 6 IPF datasets in 8 cell lines (columns) is shown. Red indicates positive connectivity of a compound to IPF, and green indicates negative connectivity.


Figure 5-3: Heatmap of 82 compounds that were significantly connected with IPF.

Connectivity of significant CLUE compounds (rows) to 6 IPF datasets in 8 cell lines (columns) is shown. Red indicates positive connectivity of a compound to IPF, and green indicates negative connectivity. Compound selection was based on an estimated p-value less than 0.05.

5.3.3. Prioritization based on drug targets dysregulated in IPF

Our prioritized compounds include multiple drug classes and target a total of 156 gene products. These include kinases regulating receptors for VEGF, FGF, platelet-derived growth factor (PDGF), opioid and hormone signaling pathways, ATP binding proteins, and histone deacetylases (HDAC). These pathways and gene targets have been shown to be involved or implicated in IPF. Thus, we further prioritized the IPF drug candidates by looking for drugs whose targets were uniformly differentially expressed in at least two IPF datasets. In total, 30 out of 82 drug candidates have at least one target that was dysregulated in IPF (Figure 5-4). A cluster of 7 compounds containing mostly corticosteroids was removed from further analysis because corticosteroid treatments in IPF are associated with substantial morbidity and are not recommended (165). A literature search on PubMed revealed that 17 out of 24 of these drug candidates were potentially therapeutic (Table 5-2) based on one of the following: 1) the drug inhibits pulmonary fibrosis in a mouse model; 2) drugs with the same MOA inhibit pulmonary fibrosis in a mouse model; or 3) drugs targeting the same gene inhibit pulmonary fibrosis in a mouse model.
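The target-based filter described above can be sketched as a lookup from drugs to their dysregulated targets. The example below is a minimal illustration, assuming that "uniformly differentially expressed in at least two IPF datasets" means the target is differentially expressed in two or more datasets and always in the same direction. That reading, the function names, and the per-dataset calls in the toy data are assumptions. The drug-target pairs are taken from Table 5-2, but the calls are invented and do not reflect results from this study.

```python
def consistent_targets(calls_by_gene, min_datasets=2):
    """Return {gene: direction} for consistently dysregulated genes.

    calls_by_gene: dict mapping gene -> list of per-dataset calls, where
    1 = up in IPF, -1 = down in IPF, 0 = not differentially expressed.
    Assumption: a gene qualifies when it is differentially expressed in
    at least min_datasets datasets and every non-zero call agrees.
    """
    consistent = {}
    for gene, calls in calls_by_gene.items():
        nonzero = [c for c in calls if c != 0]
        if len(nonzero) >= min_datasets and len(set(nonzero)) == 1:
            consistent[gene] = nonzero[0]
    return consistent


def prioritize(drug_targets, calls_by_gene, min_datasets=2):
    """Keep drugs with at least one consistently dysregulated target."""
    hits = consistent_targets(calls_by_gene, min_datasets)
    ranked = {}
    for drug, targets in drug_targets.items():
        matched = sorted(t for t in targets if t in hits)
        if matched:
            ranked[drug] = matched
    return ranked


# Drug-target pairs as listed in Table 5-2; the calls below are invented.
drug_targets = {
    "nintedanib": ["FGFR4", "KDR", "PDGFRA"],
    "YM-155": ["BIRC5"],
    "prostratin": ["PRKCE"],
}
toy_calls = {
    "KDR": [-1, -1, 0],     # consistent in two datasets
    "PDGFRA": [1, 0, 0],    # only one dataset
    "FGFR4": [0, 0, 0],     # never differentially expressed
    "BIRC5": [1, -1, 1],    # conflicting directions
    "PRKCE": [0, 0, 0],
}
print(prioritize(drug_targets, toy_calls))
```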

Figure 5-4: Network of 31 drug candidates whose target were dysregulated in IPF.

Network of candidate drugs and their targets in the context of IPF. Drug candidates are represented as turquoise nodes, and drug targets are shown as green (down-regulated in IPF), red (up-regulated in IPF) and grey (not differentially expressed in IPF) nodes.


Table 5-2: Literature annotation for 24 drug candidates based on Pubmed search.

Compound | Target | Note

BI-2536 | PLK1 | Brd4 inhibitor JQ1, administered in a therapeutic dosage, can inhibit the profibrotic effects of IPF LFs and attenuates bleomycin-induced lung fibrosis in mice. (166, 167)

TG-101348 | JAK1 | (note shared with BI-2536)

ouabain | ATP1A1 | Ouabain and digoxin inhibited transforming growth factor-β (TGF-β)-induced Rho activation, stress fiber formation, serum response factor activation, and the expression of smooth muscle α-actin, collagen-1, and fibronectin. (168)

digoxin | ATP1A1 | (note shared with ouabain)

bufalin | ATP1A1 |

cinobufagin | ATP1A1, ATP1A2 |

digitoxigenin | ATP1A1 |

digitoxin | ATP1A1 |

proscillaridin | ATP1A1 |

gossypol | MCL1 | Gossypol inhibits TGF-β-induced myofibroblast differentiation and TGF-β bioactivity; irradiated mice treated with gossypol had significantly reduced fibrosis outcomes. (169, 170)

obatoclax | MCL1 | (note shared with gossypol)

malonoben | EGFR, PDGFRA | Treatment with PDGF RTKIs markedly attenuated the development of pulmonary fibrosis in excellent correlation with clinical, histological, and computed tomography results. (171)

BIBU-1361 | EGFR | An EGFR inhibitor, gefitinib treatment attenuated fibrotic lung remodeling due to the inhibition of lung fibroblast proliferation. (172)

nintedanib | FGFR4, KDR, PDGFRA | FDA approved IPF drug

sunitinib | RET, KDR, PDGFRA | Sunitinib is efficacious in inhibiting established pulmonary fibrosis in the bleomycin-induced mouse model. (173, 174)

BX-912 | CDK2, KDR, PDK1 | Targeting Hypoxia-Inducible Factor-1α/Pyruvate Dehydrogenase Kinase 1 Axis by Dichloroacetate Suppresses Bleomycin-induced Pulmonary Fibrosis. (175)

tegaserod | HTR2A, HTR2, HTR4 | HTR2A/B antagonist improved lung function and histology and decreased collagen content in mouse bleomycin-induced pulmonary fibrosis. (176)

YM-155 | BIRC5 | YM155 inhibition of survivin enhances susceptibility of a subset of IPF fibroblasts to apoptosis.

NSC-663284 | CDC25A, CDC25B, CDC25C |

penfluridol | CACNA1G |

prostratin | PRKCE |

QL-XII-47 | BMX |

terfenadine | CHRM2, CHRM3 |

VU-0418946-1 | HIF1A |

5.4. Discussion

In this study, we systematically mined information on IPF-induced gene expression profiles and prioritized drugs that could reverse IPF-induced biological perturbations in a generalizable manner. We also introduced permutation analysis into connectivity mapping to control multiple testing problems. To our knowledge, this is the first report on large-scale and systematic in silico drug screening for IPF.

From more than 20,000 small molecules in the LINCS database, we prioritized 24 high-potential compounds, 17 of which were previously shown to be involved or implicated in IPF. As almost all the candidates we have discovered in this study are either FDA-approved or currently in advanced clinical trials for a number of diseases, rapid translation of these compounds is feasible. Further, application of our approach earlier in the IPF drug discovery pipeline may help to avert late-stage clinical trial failures. Both of these can trim years off the drug discovery process for IPF.


Summary and future directions

During this thesis study, I have developed a pipeline for single-cell RNA seq, a generalizable workflow for connectivity-based drug screening and a web database for single-cell data.

Using these tools and other bioinformatics approaches, I have identified (1) detailed cell types and subtypes during mouse lung development; (2) expression-based subgroups in IPF with varying disease severity, and robust gene signatures that could help with diagnosis and prediction of severe IPF; and (3) potential therapeutic drugs for CF and repurposing drug candidates for IPF.

This work could be improved in the future in three primary aspects: (1) gaining additional insights into lung development by interrogating lineage progression between cell types and subtypes; (2) experimentally validating high-potential drug candidates identified through the computational screening; and (3) evaluating the performance of the general connectivity-based drug screening workflow using data from additional diseases.


References

1. in Functional Ultrastructure: An Atlas of Tissue Biology and Pathology. (Springer, Vienna, 2005), pp. 224–225.

2. M. Herriges, E. E. Morrisey, Lung development: orchestrating the generation and regeneration of a complex organ. Development 141, 502-513 (2014).

3. A. Ochoa-Espinosa, M. Affolter, Branching morphogenesis: from cells to organs and back. Cold Spring Harb Perspect Biol 4, (2012).

4. D. M. Silva, C. Nardiello, A. Pozarska, R. E. Morty, Recent advances in the mechanisms of lung alveolarization and the pathogenesis of bronchopulmonary dysplasia. Am J Physiol Lung Cell Mol Physiol 309, L1239-1272 (2015).

5. A. M. Goss et al., Wnt2/2b and beta-catenin signaling are necessary and sufficient to specify lung progenitors in the foregut. Dev Cell 17, 290-298 (2009).

6. J. C. Schittny, Development of the lung. Cell and tissue research 367, 427-444 (2017).

7. R. J. Metzger, O. D. Klein, G. R. Martin, M. A. Krasnow, The branching programme of mouse lung development. Nature 453, 745-750 (2008).

8. H. Ohuchi et al., FGF10 acts as a major ligand for FGF receptor 2 IIIb in mouse multi-organ development. Biochemical and biophysical research communications 277, 643-649 (2000).

9. M. Weaver, N. R. Dunn, B. L. Hogan, Bmp4 and Fgf10 play opposing roles during lung bud morphogenesis. Development 127, 2695-2704 (2000).

10. E. E. Morrisey et al., Molecular determinants of lung development. Ann Am Thorac Soc 10, S12-16 (2013).

11. D. E. Surate Solaligue, J. A. Rodriguez-Castillo, K. Ahlbrecht, R. E. Morty, Recent advances in our understanding of the mechanisms of late lung development and bronchopulmonary dysplasia. Am J Physiol Lung Cell Mol Physiol 313, L1101-L1153 (2017).

12. P. Lindahl et al., Alveogenesis failure in PDGF-A-deficient mice is coupled to lack of distal spreading of alveolar smooth muscle cell progenitors during lung development. Development 124, 3943-3953 (1997).

13. N. A. James S. Hagood, Systems biology of lung development and regeneration: current knowledge and recommendations for future research. Wiley Interdisciplinary Reviews: Systems Biology and Medicine 5, 125-133 (2013).

14. A. T. Kho et al., Transcriptomic analysis of human lung development. American journal of respiratory and critical care medicine 181, 54-63 (2010).

15. V. Besnard et al., Maternal synchronization of gestational length and lung maturation. PloS one 6, e26682 (2011).


16. Y. Xu, J. A. Whitsett, in Pediatric Biomedical Informatics - Computer Applications in Pediatric Research. (2012), chap. 17, pp. 309-334.

17. B. Treutlein et al., Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 509, 371-375 (2014).

18. Y. Wang et al., Pulmonary alveolar type I cell population consists of two distinct subtypes that differ in cell fate. Proceedings of the National Academy of Sciences of the United States of America 115, 2407-2412 (2018).

19. R. Jain et al., Plasticity of Hopx(+) type I alveolar cells to regenerate type II cells in the lung. Nat Commun 6, 6727 (2015).

20. M. E. Ardini-Poleske et al., LungMAP: The Molecular Atlas of Lung Development Program. Am J Physiol Lung Cell Mol Physiol 313, L733-L740 (2017).

21. A. Kekevian, M. E. Gershwin, C. Chang, Diagnosis and classification of idiopathic pulmonary fibrosis. Autoimmun Rev 13, 508-512 (2014).

22. G. Raghu et al., An official ATS/ERS/JRS/ALAT statement: idiopathic pulmonary fibrosis: evidence-based guidelines for diagnosis and management. American journal of respiratory and critical care medicine 183, 788-824 (2011).

23. L. Richeldi et al., Efficacy and safety of nintedanib in idiopathic pulmonary fibrosis. N Engl J Med 370, 2071-2082 (2014).

24. T. E. King, Jr. et al., A phase 3 trial of pirfenidone in patients with idiopathic pulmonary fibrosis. N Engl J Med 370, 2083-2092 (2014).

25. T. E. King, Jr., A. Pardo, M. Selman, Idiopathic pulmonary fibrosis. Lancet 378, 1949-1961 (2011).

26. D. L. Clarke, L. A. Murray, B. Crestani, M. A. Sleeman, Is personalised medicine the key to heterogeneity in idiopathic pulmonary fibrosis? Pharmacology & therapeutics 169, 35-46 (2017).

27. L. Richeldi, H. R. Collard, M. G. Jones, Idiopathic pulmonary fibrosis. Lancet 389, 1941-1952 (2017).

28. R. P. Naikawadi et al., Telomere dysfunction in alveolar epithelial cells causes lung remodeling and fibrosis. JCI Insight 1, e86704 (2016).

29. Y. Xu et al., Single-cell RNA sequencing identifies diverse roles of epithelial cells in idiopathic pulmonary fibrosis. JCI Insight 1, e90558 (2016).

30. B. Ley, H. R. Collard, T. E. King, Jr., Clinical course and prediction of survival in idiopathic pulmonary fibrosis. American journal of respiratory and critical care medicine 183, 431-440 (2011).

31. M. Selman et al., Accelerated variant of idiopathic pulmonary fibrosis: clinical behavior and gene expression pattern. PloS one 2, e482 (2007).

32. D. S. Kim, Acute exacerbations in patients with idiopathic pulmonary fibrosis. Respir Res 14, 86 (2013).


33. N. Hambly, C. Shimbori, M. Kolb, Molecular classification of idiopathic pulmonary fibrosis: personalized medicine, genetics and biomarkers. Respirology (Carlton, Vic.) 20, 1010-1022 (2015).

34. L. J. Vuga et al., C-X-C motif chemokine 13 (CXCL13) is a prognostic biomarker of idiopathic pulmonary fibrosis. American journal of respiratory and critical care medicine 189, 966-974 (2014).

35. L. A. Murray et al., Targeting interleukin-13 with tralokinumab attenuates lung fibrosis and epithelial damage in a humanized SCID idiopathic pulmonary fibrosis model. Am J Respir Cell Mol Biol 50, 985-994 (2014).

36. D. M. Habiel, C. Hogaboam, Heterogeneity in fibroblast proliferation and survival in idiopathic pulmonary fibrosis. Frontiers in pharmacology 5, 2 (2014).

37. D. M. Habiel, C. M. Hogaboam, Heterogeneity of Fibroblasts and Myofibroblasts in Pulmonary Fibrosis. Current pathobiology reports 5, 101-110 (2017).

38. G. Trujillo et al., TLR9 differentiates rapidly from slowly progressing forms of idiopathic pulmonary fibrosis. Sci Transl Med 2, 57ra82 (2010).

39. C. Daccord, T. M. Maher, Recent advances in understanding idiopathic pulmonary fibrosis. F1000Research 5, (2016).

40. K. Konishi et al., Gene expression profiles of acute exacerbations of idiopathic pulmonary fibrosis. American journal of respiratory and critical care medicine 180, 167-175 (2009).

41. I. V. Yang et al., Expression of cilium-associated genes defines novel molecular subtypes of idiopathic pulmonary fibrosis. Thorax 68, 1114-1121 (2013).

42. D. J. DePianto et al., Heterogeneous gene expression signatures correspond to distinct lung pathologies and biomarkers of disease severity in idiopathic pulmonary fibrosis. Thorax 70, 48-56 (2015).

43. K. I. Kaitin, Deconstructing the drug development process: the new face of innovation. Clin Pharmacol Ther 87, 356-361 (2010).

44. J. Avorn, The $2.6 billion pill--methodologic and policy considerations. N Engl J Med 372, 1877-1879 (2015).

45. A. Denis, L. Mergaert, C. Fostier, I. Cleemput, S. Simoens, A comparative study of European rare disease and orphan drug markets. Health Policy 97, 173-179 (2010).

46. R. Valdez, L. Ouyang, J. Bolen, Public Health and Rare Diseases: Oxymoron No More. Prev Chronic Dis 13, E05 (2016).

47. R. Margolis et al., The National Institutes of Health's Big Data to Knowledge (BD2K) initiative: capitalizing on biomedical big data. J Am Med Inform Assoc 21, 957-958 (2014).

48. T. Barrett et al., NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res 41, D991-995 (2013).

49. D. Szklarczyk et al., The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res 45, D362-D368 (2017).


50. R. A. Hodos, B. A. Kidd, K. Shameer, B. P. Readhead, J. T. Dudley, In silico methods for drug repurposing and pharmacology. Wiley interdisciplinary reviews. Systems biology and medicine 8, 186-210 (2016).

51. J. Bajorath, Molecular Similarity Concepts for Informatics Applications. Methods Mol Biol 1526, 231-245 (2017).

52. A. K. Chavali et al., Metabolic network analysis predicts efficacy of FDA-approved drugs targeting the causative agent of a neglected tropical disease. BMC systems biology 6, 27 (2012).

53. V. Martinez, C. Navarro, C. Cano, W. Fajardo, A. Blanco, DrugNet: network-based drug-disease prioritization by integrating heterogeneous data. Artif Intell Med 63, 41-49 (2015).

54. L. Yang, P. Agarwal, Systematic drug repositioning based on clinical side-effects. PloS one 6, e28025 (2011).

55. H. Ye, Q. Liu, J. Wei, Construction of drug network based on side effects and its application for drug repositioning. PLoS One 9, e87864 (2014).

56. A. P. Chiang, A. J. Butte, Systematic evaluation of drug-disease relationships to identify leads for novel drug uses. Clin Pharmacol Ther 86, 507-510 (2009).

57. J. Lamb et al., The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313, 1929-1935 (2006).

58. J. Lamb et al., A mechanism of cyclin D1 action encoded in the patterns of gene expression in human cancer. Cell 114, 323-334 (2003).

59. K. B. Gerald, Nonparametric statistical methods. Nurse Anesth 2, 93-95 (1991).

60. A. Subramanian et al., A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell 171, 1437-1452 e1417 (2017).

61. C. Liu et al., Compound signature detection on LINCS L1000 big data. Mol Biosyst 11, 714-722 (2015).

62. A. K. Perl, S. E. Wert, A. Nagy, C. G. Lobe, J. A. Whitsett, Early restriction of peripheral and proximal cell lineages during formation of the lung. Proceedings of the National Academy of Sciences of the United States of America 99, 10482-10487 (2002).

63. J. A. Whitsett, S. E. Wert, T. E. Weaver, Alveolar surfactant homeostasis and the pathogenesis of pulmonary disease. Annual review of medicine 61, 105-119 (2010).

64. T. J. Desai, D. G. Brownfield, M. A. Krasnow, Alveolar progenitor and stem cells in lung development, renewal and cancer. Nature 507, 190-194 (2014).

65. E. E. Morrisey, B. L. Hogan, Preparing for the first breath: genetic and cellular mechanisms in lung development. Dev Cell 18, 8-23 (2010).

66. D. Warburton et al., Lung organogenesis. Current topics in developmental biology 90, 73-158 (2010).

67. B. L. Hogan et al., Repair and regeneration of the respiratory system: complexity, plasticity, and mechanisms of lung stem cell function. Cell Stem Cell 15, 123-138 (2014).

68. S. E. McGowan, D. M. McCoy, Fibroblast growth factor signaling in myofibroblasts differs from lipofibroblasts during alveolar septation in mice. Am J Physiol Lung Cell Mol Physiol 309, L463-474 (2015).
69. Y. Du, M. Guo, J. A. Whitsett, Y. Xu, 'LungGENS': a web-based tool for mapping single-cell gene expression in the developing lung. Thorax 70, 1092-1094 (2015).
70. J. M. Liebler et al., Combinations of differentiation markers distinguish subpopulations of alveolar epithelial cells in adult lung. Am J Physiol Lung Cell Mol Physiol 310, L114-120 (2016).
71. N. L. Bray, H. Pimentel, P. Melsted, L. Pachter, Near-optimal probabilistic RNA-seq quantification. Nature biotechnology 34, 525-527 (2016).
72. C. Fontanillo, R. Nogales-Cadenas, A. Pascual-Montano, J. De las Rivas, Functional analysis beyond enrichment: non-redundant reciprocal linkage of genes and biological terms. PLoS One 6, e24289 (2011).
73. K. Dahlin et al., Identification of genes differentially expressed in rat alveolar type I cells. Am J Respir Cell Mol Biol 31, 309-316 (2004).
74. N. Demling et al., Promotion of cell adherence and spreading: a novel function of RAGE, the highly selective differentiation marker of human alveolar epithelial type I cells. Cell and tissue research 323, 475-488 (2006).
75. J. M. Shannon, L. D. Nielsen, S. A. Gebb, S. H. Randell, Mesenchyme specifies epithelial differentiation in reciprocal recombinants of embryonic lung and trachea. Dev Dyn 212, 482-494 (1998).
76. K. Spilsbury et al., Isolation of a novel macrophage-specific gene by differential cDNA analysis. Blood 85, 1620-1629 (1995).
77. N. A. Begum et al., Human MD-1 homologue is a BCG-regulated gene product in monocytes: its identification by differential display. Biochemical and biophysical research communications 256, 325-329 (1999).
78. R. Mittal et al., Fcgamma receptor I alpha chain (CD64) expression in macrophages is critical for the onset of meningitis by Escherichia coli K1. PLoS pathogens 6, e1001203 (2010).
79. E. El Agha, S. Bellusci, Walking along the Fibroblast Growth Factor 10 Route: A Key Pathway to Understand the Control and Regulation of Epithelial and Mesenchymal Cell-Lineage Formation during Lung Development and Repair after Injury. Scientifica (Cairo) 2014, 538379 (2014).
80. Y. Zhang et al., A Gata6-Wnt pathway required for epithelial stem cell development and airway regeneration. Nature genetics 40, 862-870 (2008).
81. Y. Yin et al., An FGF-WNT gene regulatory network controls lung mesenchyme development. Developmental biology 319, 426-436 (2008).
82. M. C. Eblaghie, M. Reedy, T. Oliver, Y. Mishina, B. L. Hogan, Evidence that autocrine signaling through Bmpr1a regulates the proliferation, survival and morphogenetic behavior of distal lung epithelial cells. Developmental biology 291, 67-82 (2006).

83. Z. Wang, W. Shu, M. M. Lu, E. E. Morrisey, Wnt7b activates canonical signaling in epithelial and vascular smooth muscle cells through interactions with Fzd1, Fzd10, and LRP5. Mol Cell Biol 25, 5022-5030 (2005).
84. D. M. Ornitz, Y. Yin, Signaling Networks Regulating Development of the Lower Respiratory Tract. Cold Spring Harbor Perspectives in Biology 4, (2012).
85. J. Handl, J. Knowles, D. B. Kell, Computational cluster validation in post-genomic data analysis. Bioinformatics 21, 3201-3212 (2005).
86. B. Ley, H. R. Collard, Risk prediction in idiopathic pulmonary fibrosis. American journal of respiratory and critical care medicine 185, 6-7 (2012).
87. A. Betensley, R. Sharif, D. Karamichos, A Systematic Review of the Role of Dysfunctional Wound Healing in the Pathogenesis and Treatment of Idiopathic Pulmonary Fibrosis. J Clin Med 6, (2016).
88. F. J. Martinez et al., The diagnosis of idiopathic pulmonary fibrosis: current and future approaches. Lancet Respir Med 5, 61-71 (2017).
89. M. Selman et al., Gene expression profiles distinguish idiopathic pulmonary fibrosis from hypersensitivity pneumonitis. American journal of respiratory and critical care medicine 173, 188-198 (2006).
90. S. Y. Kim et al., Classification of usual interstitial pneumonia in patients with interstitial lung disease: assessment of a machine learning approach using high-dimensional transcriptional data. Lancet Respir Med 3, 473-482 (2015).
91. Y. Bauer et al., A novel genomic signature with translational significance for human idiopathic pulmonary fibrosis. Am J Respir Cell Mol Biol 52, 217-231 (2015).
92. E. B. Meltzer et al., Bayesian probit regression model for the diagnosis of pulmonary fibrosis: proof-of-principle. BMC Med Genomics 4, 70 (2011).
93. X. Peng et al., Plexin C1 deficiency permits synaptotagmin 7-mediated macrophage migration and enhances mammalian lung fibrosis. FASEB J 30, 4056-4070 (2016).
94. F. Pedregosa et al., Scikit-learn: Machine Learning in Python. J Mach Learn Res 12, 2825-2830 (2011).
95. M. E. Ritchie et al., limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic acids research 43, (2015).
96. J. Chen, E. E. Bardes, B. J. Aronow, A. G. Jegga, ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic acids research 37, W305-311 (2009).
97. S. Ayme, Orphanet, an information site on rare diseases. Soins 672, 46-47 (2003).
98. J. Piñero et al., DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic acids research 45, D833-D839 (2017).
99. E. Balestro et al., Immune Inflammation and Disease Progression in Idiopathic Pulmonary Fibrosis. PLoS One 11, e0154516 (2016).

100. R. Breitling, P. Armengaud, A. Amtmann, P. Herzyk, Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett 573, 83-92 (2004).
101. The UniProt Consortium, UniProt: a hub for protein information. Nucleic acids research 43, D204-212 (2015).
102. H. B. Schiller et al., Time- and compartment-resolved proteome profiling of the extracellular niche in lung injury and repair. Mol Syst Biol 11, 819 (2015).
103. J. A. Lowe, P. Jones, D. M. Wilson, Network biology as a new approach to drug discovery. Curr Opin Drug Discov Devel 13, 524-526 (2010).
104. V. J. Craig, L. Zhang, J. S. Hagood, C. A. Owen, Matrix metalloproteinases as therapeutic targets for idiopathic pulmonary fibrosis. Am J Respir Cell Mol Biol 53, 585-600 (2015).
105. X. Wang et al., Three-dimensional reconstruction of protein networks provides insight into human genetic disease. Nature biotechnology 30, 159-166 (2011).
106. A. Azuma et al., Role of E-selectin in bleomycin induced lung fibrosis in mice. Thorax 55, 147-152 (2000).
107. M. Myllärniemi et al., Upregulation of activin-B and follistatin in pulmonary fibrosis - a translational study using human biopsies and a specific inhibitor in mouse fibrosis models. BMC Pulm Med 14, 170 (2014).
108. C. E. Shannon, A Mathematical Theory of Communication. Bell System Technical Journal 27, 379-423 (1948).
109. E. Krafft et al., Transforming growth factor beta 1 activation, storage, and signaling pathways in idiopathic pulmonary fibrosis in dogs. J Vet Intern Med 28, 1666-1675 (2014).
110. K. Oda et al., Profibrotic role of WNT10A via TGF-β signaling in idiopathic pulmonary fibrosis. Respir Res 17, 39 (2016).
111. L. Wang, Y. Wang, T. Yang, Y. Guo, T. Sun, Angiotensin-Converting Enzyme 2 Attenuates Bleomycin-Induced Lung Fibrosis in Mice. Cell Physiol Biochem 36, 697-711 (2015).
112. N. M. Patel et al., Pulmonary arteriole gene expression signature in idiopathic pulmonary fibrosis. Eur Respir J 41, 1324-1330 (2013).
113. J. F. Dekkers, C. K. van der Ent, J. M. Beekman, Novel opportunities for CFTR-targeting drug development using organoids. Rare Dis 1, e27112 (2013).
114. D. M. Cholon et al., Potentiator ivacaftor abrogates pharmacological correction of DeltaF508 CFTR in cystic fibrosis. Science translational medicine 6, 246ra296 (2014).
115. G. Veit et al., Some gating potentiators, including VX-770, diminish DeltaF508-CFTR functional expression. Science translational medicine 6, 246ra297 (2014).
116. P. W. Phuan et al., Potentiators of Defective DeltaF508-CFTR Gating that Do Not Interfere with Corrector Action. Mol Pharmacol 88, 791-799 (2015).
117. J. P. Clancy, CFTR potentiators: not an open and shut case. Science translational medicine 6, 246fs227 (2014).

118. M. P. Boyle et al., A CFTR corrector (lumacaftor) and a CFTR potentiator (ivacaftor) for treatment of patients with cystic fibrosis who have a phe508del CFTR mutation: a phase 2 randomised controlled trial. Lancet Respir Med 2, 527-538 (2014).
119. J. T. Dudley et al., Computational repositioning of the anticonvulsant topiramate for inflammatory bowel disease. Science translational medicine 3, 96ra76 (2011).
120. M. Sirota et al., Discovery and preclinical validation of drug indications using compendia of public gene expression data. Sci Transl Med 3, 96ra77 (2011).
121. F. Iorio et al., Discovery of drug mode of action and drug repositioning from transcriptional responses. Proc Natl Acad Sci U S A 107, 14621-14626 (2010).
122. F. Iorio, A. Isacchi, D. di Bernardo, N. Brunetti-Pierri, Identification of small molecules enhancing autophagic function from drug network analysis. Autophagy 6, 1204-1205 (2010).
123. X. A. Qu, R. C. Gudivada, A. G. Jegga, E. K. Neumann, B. J. Aronow, Inferring novel disease indications for known drugs by semantically linking drug action and disease mechanism relationships. BMC Bioinformatics 10, (2009).
124. X. A. Qu, J. M. Freudenberg, P. Sanseau, D. K. Rajpal, Integrative clinical transcriptomics analyses for new therapeutic intervention strategies: a psoriasis case study. Drug Discov Today 19, 1364-1371 (2014).
125. S. Ramachandran, S. R. Osterhaus, P. H. Karp, M. J. Welsh, P. B. McCray, Jr., A genomic signature approach to rescue DeltaF508-cystic fibrosis transmembrane conductance regulator biosynthesis and function. Am J Respir Cell Mol Biol 51, 354-362 (2014).
126. J. Cheng, L. Yang, V. Kumar, P. Agarwal, Systematic evaluation of connectivity map for disease indications. Genome Med 6, (2014).
127. G. Hu, P. Agarwal, Human disease-drug network based on genomic expression profiles. PLoS One 4, e6536 (2009).
128. X. A. Qu, D. K. Rajpal, Applications of Connectivity Map in drug discovery and development. Drug Discov Today 17, 1289-1298 (2012).
129. F. Stanke et al., The CF-modifying gene EHF promotes p.Phe508del-CFTR residual function by altering protein glycosylation and trafficking in epithelial cells. European journal of human genetics : EJHG 22, 660-666 (2014).
130. Broad-Institute. (2014).
131. C. Li et al., Lysophosphatidic acid inhibits cholera toxin-induced secretory diarrhea through CFTR-dependent protein interactions. The Journal of experimental medicine 202, 975-986 (2005).
132. C. Li et al., Spatiotemporal coupling of cAMP transporter to CFTR chloride channel function in the gut epithelia. Cell 131, 940-951 (2007).
133. T. Barrett et al., NCBI GEO: mining tens of millions of expression profiles--database and tools update. Nucleic Acids Res 35, D760-765 (2007).

134. J. Ooms, The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects. ArXiv e-prints. 2014.
135. J. Dalton, O. Kalid, M. Schushan, N. Ben-Tal, J. Villa-Freixa, New model of cystic fibrosis transmembrane conductance regulator proposes active channel-like conformation. J Chem Inf Model 52, 1842-1853 (2012).
136. A. Sali, T. L. Blundell, Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234, 779-815 (1993).
137. S. Kim et al., PubChem Substance and Compound databases. Nucleic Acids Res 44, D1202-1213 (2016).
138. G. M. Morris et al., AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility. J Comput Chem 30, 2785-2791 (2009).
139. C. Moon et al., Compartmentalized accumulation of cAMP near complexes of multidrug resistance protein 4 (MRP4) and cystic fibrosis transmembrane conductance regulator (CFTR) contributes to drug-induced diarrhea. J Biol Chem 290, 11246-11257 (2015).
140. J. F. Dekkers et al., A functional CFTR assay using primary cystic fibrosis intestinal organoids. Nat Med 19, 939-945 (2013).
141. S. Pankow et al., ΔF508 CFTR interactome remodelling promotes rescue of cystic fibrosis. Nature 528, 510-516 (2015).
142. J. F. Dekkers et al., A functional CFTR assay using primary cystic fibrosis intestinal organoids. Nat Med 19, 939-945 (2013).
143. W. K. O'Neal et al., Gene expression in transformed lymphocytes reveals variation in endomembrane and HLA pathways modifying cystic fibrosis pulmonary phenotypes. American journal of human genetics 96, 318-328 (2015).
144. A. Gaggar et al., Matrix metalloprotease-9 dysregulation in lower airway secretions of cystic fibrosis patients. Am J Physiol Lung Cell Mol Physiol 293, L96-L104 (2007).
145. F. Ratjen, C. M. Hartog, K. Paul, J. Wermelt, J. Braun, Matrix metalloproteases in BAL fluid of patients with cystic fibrosis and their modulation by treatment with dornase alpha. Thorax 57, 930-934 (2002).
146. S. D. Sagel, R. K. Kapsner, I. Osberg, Induced sputum matrix metalloproteinase-9 correlates with lung function and airway inflammation in children with cystic fibrosis. Pediatr Pulmonol 39, 224-232 (2005).
147. A. Gaggar et al., The role of matrix metalloproteinases in cystic fibrosis lung disease. Eur Respir J 38, 721-727 (2011).
148. J. R. Dunkelberger, W. C. Song, Complement and its role in innate and adaptive immune responses. Cell Res 20, 34-50 (2010).
149. D. P. Nichols, J. F. Chmiel, Inflammation and its genesis in cystic fibrosis. Pediatr Pulmonol 50 Suppl 40, S39-56 (2015).
150. J. F. Dekkers et al., Characterizing responses to CFTR-modulating drugs using rectal organoids derived from subjects with cystic fibrosis. Science translational medicine 8, 344ra384 (2016).

151. V. Ogilvie et al., Differential global gene expression in cystic fibrosis nasal and bronchial epithelium. Genomics 98, 327-336 (2011).
152. D. A. Stoltz et al., Cystic fibrosis pigs develop lung disease and exhibit defective bacterial eradication at birth. Science translational medicine 2, 29ra31 (2010).
153. M. M. Massip Copiz, T. A. Santa Coloma, c-Src and its role in cystic fibrosis. Eur J Cell Biol 95, 401-413 (2016).
154. R. Fiorotto et al., The cystic fibrosis transmembrane conductance regulator controls biliary epithelial inflammation and permeability by regulating Src tyrosine kinase activity. Hepatology 64, 2118-2134 (2016).
155. A. Y. Bhagirath et al., Cystic fibrosis lung environment and Pseudomonas aeruginosa infection. BMC Pulm Med 16, 174 (2016).
156. S. Kannan et al., Src kinase Lyn is crucial for Pseudomonas aeruginosa internalization into lung cells. Eur J Immunol 36, 1739-1752 (2006).
157. W. Guo et al., CD24 activates the NLRP3 inflammasome through c-Src kinase activity in a model of the lining epithelium of inflamed periodontal tissues. Immun Inflamm Dis 2, 239-253 (2014).
158. A. Rimessi et al., Mitochondrial Ca2+-dependent NLRP3 activation exacerbates the Pseudomonas aeruginosa-driven inflammatory response in cystic fibrosis. Nat Commun 6, 6201 (2015).
159. R. G. Iannitti et al., IL-1 receptor antagonist ameliorates inflammasome-dependent inflammation in murine and human cystic fibrosis. Nat Commun 7, 10791 (2016).
160. J. W. Park et al., Type III Secretion System of Pseudomonas aeruginosa Affects Matrix Metalloproteinase 12 (MMP-12) and MMP-13 Expression via Nuclear Factor κB Signaling in Human Carcinoma Epithelial Cells and a Pneumonia Mouse Model. J Infect Dis 214, 962-969 (2016).
161. J. W. Park et al., Pathophysiological changes induced by Pseudomonas aeruginosa infection are involved in MMP-12 and MMP-13 upregulation in human carcinoma epithelial cells and a pneumonia mouse model. Infect Immun 83, 4791-4799 (2015).
162. J. F. Chmiel, M. W. Konstan, J. S. Elborn, Antibiotic and anti-inflammatory therapies for cystic fibrosis. Cold Spring Harb Perspect Med 3, a009779 (2013).
163. D. P. Nichols, M. W. Konstan, J. F. Chmiel, Anti-inflammatory therapies for cystic fibrosis-related lung disease. Clin Rev Allergy Immunol 35, 135-153 (2008).
164. R. Fiorotto et al., Src kinase inhibition reduces inflammatory and cytoskeletal changes in ΔF508 human cholangiocytes and improves CFTR correctors efficacy. Hepatology, (2017).
165. K. R. Flaherty et al., Steroids in idiopathic pulmonary fibrosis: a prospective assessment of adverse reactions, response to therapy, and survival. Am J Med 110, 278-282 (2001).
166. X. Tang et al., Assessment of Brd4 inhibition in idiopathic pulmonary fibrosis lung fibroblasts and in vivo models of lung fibrosis. Am J Pathol 183, 470-479 (2013).

167. M. S. Stratton, S. M. Haldar, T. A. McKinsey, BRD4 inhibition for the treatment of pathological organ fibrosis. F1000Res 6, (2017).
168. J. La et al., Regulation of myofibroblast differentiation by cardiac glycosides. Am J Physiol Lung Cell Mol Physiol 310, L815-823 (2016).
169. J. L. Judge et al., The Lactate Dehydrogenase Inhibitor Gossypol Inhibits Radiation-Induced Pulmonary Fibrosis. Radiat Res 188, 35-43 (2017).
170. R. M. Kottmann et al., Pharmacologic inhibition of lactate production prevents myofibroblast differentiation. Am J Physiol Lung Cell Mol Physiol 309, L1305-1312 (2015).
171. A. Abdollahi et al., Inhibition of platelet-derived growth factor signaling attenuates pulmonary fibrosis. J Exp Med 201, 925-935 (2005).
172. K. Miyake et al., Epidermal growth factor receptor-tyrosine kinase inhibitor (gefitinib) augments pneumonitis, but attenuates lung fibrosis in response to radiation injury in rats. J Med Invest 59, 174-185 (2012).
173. X. Huang et al., Sunitinib, a Small-Molecule Kinase Inhibitor, Attenuates Bleomycin-Induced Pulmonary Fibrosis in Mice. Tohoku J Exp Med 239, 251-261 (2016).
174. S. L. Ashley et al., Targeting Inhibitor of Apoptosis Proteins Protects from Bleomycin-Induced Lung Fibrosis. Am J Respir Cell Mol Biol 54, 482-492 (2016).
175. J. Goodwin et al., Targeting Hypoxia-Inducible Factor-1alpha/Pyruvate Dehydrogenase Kinase 1 Axis by Dichloroacetate Suppresses Bleomycin-induced Pulmonary Fibrosis. Am J Respir Cell Mol Biol 58, 216-231 (2018).
176. M. Konigshoff et al., Increased expression of 5-hydroxytryptamine2A/B receptors in idiopathic pulmonary fibrosis: a rationale for therapeutic intervention. Thorax 65, 949-955 (2010).
