The Development of a Statistical Model to Study How the Deletion of PD-1 Promotes Anti-Tumor Immunity


Citation Diallo, Alos Burgess. 2021. The Development of a Statistical Model to Study How the Deletion of PD-1 Promotes Anti-Tumor Immunity. Master's thesis, Harvard University Division of Continuing Education.

Citable link https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37367694


Alos Diallo

A Thesis in the Field of Bioinformatics

for the Degree of Master of Liberal Arts in Extension Studies

Harvard University

March 2021


Copyright 2021 [Alos Diallo]

Abstract

T-cells are an essential component of the immune system, but they do not act alone; they are one part of the body’s broader immune response. PD-1 and its ligands PD-L1 and PD-L2 play an important role in the regulation of T-cells, which are incredibly important to the treatment of cancer. Tumors have been able to hijack the PD-1 inhibitory pathway to evade the body’s immune response. PD-1 pathway blockade, therefore, can serve as an important approach for cancer immunotherapy. However, we do not fully understand the mechanism by which PD-1 regulates anti-tumor immunity. With the datasets derived from experiments by the Sharpe Lab, we hope to answer two important questions. First, what does PD-1 regulate in a cell intrinsic manner compared to bystander effects on other cells in the tumor micro-environment? Second, which expression changes predict response as opposed to resistance to tumor clearance when PD-1 is present or deleted? We hypothesize that a cell intrinsic loss of PD-1 is necessary for improved T-cell fitness and effector functions. This project aims to help answer these questions in three ways. First, the development of an RNA-sequencing pipeline allows the researchers in the lab to analyze the datasets that are generated. Second, conducting pathway analyses provides a more complete picture of the tumor micro-environment by indicating which pathways and cellular processes are enriched among the genes affected by the deletion of PD-1. Third, the development of a statistical model predicts which gene expression changes are associated with response as opposed to resistance.

The results of the statistical model indicate which genes are more closely related to PD-1 when it is deleted versus when it is present. This will help us better understand response as opposed to resistance to PD-1 cancer immunotherapy, and its effects on tumor growth. This model therefore provides a valuable tool that allows researchers to probe the gene expression landscape around PD-1.

Dedication

This work is dedicated to my parents, Abou Diallo and Sue Burgess. My parents have been a driving force and a positive influence on me and my success. They met in Senegal, where my mother worked as an American Peace Corps volunteer and my father worked as an instructor. Both have had to continuously make sacrifices and work hard for the good of our family. This oftentimes meant one of them had to work at night. As a child, I watched them sacrifice selflessly for my sister and me. My father, more specifically, was unable to finish his own education so that he could build a better life for us. I saw him give up his dream of finishing college in order to work and provide for our family. I have never once heard him complain or ask for anything in return. Selflessness is a quintessential quality in my father; it is a part of who he is as a man. He was present to provide encouragement and advice whenever I considered giving up. From him, I learned that duty, honor and respect are part of a code that one should live by. From my mother, I learned to be curious and inquisitive. She enjoys going through records at churches, libraries, and city offices so she can research our family genealogy. She is always trying to find difficult-to-obtain ingredients for recipes she would like to replicate because she thinks we would enjoy them. Her inquisitive nature even led her to travel abroad during a time when women were neither expected nor encouraged to do so, traveling the US, then to Europe, and finally to Africa. I have become a better scientist by taking after my mother’s curious nature and I have gleaned what it means to be a man by observing my father. I am and will be forever grateful and appreciative for their love, advice, influence and support of me.

ARMA virumque cano, Troiae qui primus ab oris
Italiam, fato profugus, Laviniaque venit
litora, multum ille et terris iactatus et alto
vi superum saevae memorem Iunonis ob iram
– Aeneid, Vergil

Acknowledgments

I would like to thank my advisor, Professor Arlene Sharpe, whose support and guidance made this project possible. The Sharpe Lab Post-Doctoral Fellow, Kristen Pauken, has provided not only scientific advice on immunology but also logistic assistance, for which I am thankful. Kelly Burke, who is also a Post-Doctoral Fellow in the Sharpe lab, has provided thoughtful advice and edits during this process, helping me to get the paper to its finished state. I would like to acknowledge Vikram Juneja, as he generated the data that I used for the thesis, without which the project would not be possible. Sarah Hillman, who is the administrative assistant to the Chair, Arlene Sharpe, has helped me with setting up invaluable meetings and provided general logistics support, for which I am thankful. I would also like to thank Steven Denkin, my research advisor, who helped me through the entire thesis process. He provided encouragement and thoughtful feedback in the development of both my thesis and my research proposal. I want to thank the Thesis Coordinator, Gail Dourian, for helping me with the administrative thesis process. I feel it is important to also thank the two people who taught me what I know about statistical modeling, Andrey Sivachenko and Victor Farutin. I will always be grateful for what you taught me and for your encouragement. I would especially like to thank my parents, Sue Burgess and Abou Diallo, for their encouragement and support. Through their love I have garnered the strength to succeed.

Finally, I would like to thank my wife, Jiaoyuan Elisabeth Diallo, for helping me with the edits to my thesis. She helped me with all of the proofreading for this work. In addition, she encouraged me and supported me when I needed it most.

Table of Contents

Dedication

Acknowledgments

List of Tables

List of Figures

Chapter I. Introduction

Cancer Immunotherapy

Machine learning in medicine

Goals

Research Problem

Chapter II. Materials and Methods

RNA Sequencing analysis

Pathway Analysis

The Statistical Model

Chapter III. Results

RNA Sequencing analysis

Pathway Analysis

The Statistical model

Chapter IV. Discussion

Limitations

Future work

Chapter V. Conclusion

Chapter VI. Appendix

Chapter VII. References

List of Tables

Table 1. CD8 Δ PD-1 dataset generated by Vikram Juneja

Table 2. Tamox Δ PD-1 dataset generated by Vikram Juneja

Table 3. Genes provided by the Sharpe lab used for quality control (gold standard)

Table 4. Tamox Δ PD-1: Results of differential expression using list of genes provided by Sharpe lab

Table 5. CD8 Δ PD-1: Results of differential expression using list of genes provided by Sharpe lab

Table 6. DESeq result for indicator genes

Table 7. Top differentially expressed genes for CD8 Δ PD-1 and Tamox Δ PD-1

Table 8. Results for the Broad GSEA tool

Table 9. Results of analysis using Ingenuity Pathway Analysis

Supplementary Table 1. Summary statistics for Tamox Δ PD-1

Supplementary Table 2. Summary Statistics for CD8 Δ PD-1

Supplementary Table 3. Top 100 genes from Tamox Δ PD-1 compared for the 3 methods

Supplementary Table 4. Top molecular networks from IPA

List of Figures

Figure 1. Schematic of RNA-seq Analysis pipelines

Figure 2. Boxplots and PCA plots of the counts for each experiment

Figure 3. Artificial neuron

Figure 4. The neural network architecture

Figure 5. Histograms of expression values for both Tamox Δ PD-1 and CD8 Δ PD-1

Figure 6. t-SNE plot for CD8 Δ PD-1

Figure 7. QC failures from FastQC report for AC1449 from the CD8 Δ PD-1 dataset

Figure 8. FastQC results from sample TILs 9 PD-1 from Tamox Δ PD-1

Figure 9. Overlap of top 100 Genes for both datasets

Figure 10. Principal Component analysis of Tamox Δ PD-1

Figure 11. Principal Component analysis of Tamox Δ PD-1

Figure 12. Principal Component analysis of CD8 Δ PD-1 Dataset

Figure 13. Heatmap of the top differentially expressed genes

Figure 14. Pie charts with overviews of top functional hits for the Molecular Function database in PANTHER

Figure 15. Results from PANTHER

Figure 16. Sample of the results from Enrichr for CD8 Δ PD-1

Figure 17. Sample of the results from Enrichr for Tamox Δ PD-1

Figure 18. Training and Test error for neural network

Figure 19. Boxplots displaying accuracy and error for statistical modeling methods on the Validation set

Figure 20. Model predictions for Oxidative and GSEA results

Supplementary Figure 1. Histograms of Log transformed Expression values

Supplementary Figure 2. Boxplots of statistical model performance metrics

Chapter I.

Introduction

The body’s immune response is made up of a complex system of cells which respond to threats to the body. The adaptive immune system is comprised of humoral and cell-mediated responses. In the adaptive immune response, antigen presenting cells, such as classical dendritic cells or activated monocytes, ingest an antigen through endocytosis and present the antigen to T-cells as part of a Major Histocompatibility Complex (MHC-I or MHC-II) molecule (Murphy, Travers, Walport, & Janeway, 2012). T-cells, which include helper T-cells and cytotoxic T-cells, are part of the adaptive immune response (Murphy et al., 2012).

There are two main classes of T lymphocytes, each expressing one of two cell-surface proteins, CD8 or CD4. Cytotoxic T-cells express CD8 whereas helper T-cells express CD4 (Murphy et al., 2012). These cells are generated in the thymus. CD8+ T-cells are a specific type of T lymphocyte which acts against cells infected with intracellular microorganisms such as viruses (Murphy et al., 2012). CD8+ T-cells identify infected cells by recognizing virus-derived antigens which are presented on the cell’s surface (Murphy et al., 2012). These antigens are recognized by T-cell receptors on CD8+ T-cells (Murphy et al., 2012). Once the infected cell is recognized, it is killed by the T-cell (Murphy et al., 2012). Cytotoxic T-cells can also recognize a specific antigen expressed by a cancer cell and can kill tumor cells expressing that tumor antigen (Murphy et al., 2012).


T-cells require two signals to go from being naïve T-cells to being activated: (1) antigen recognition through the T-cell receptor, and (2) co-stimulatory signaling. T-cells have both costimulatory and coinhibitory checkpoint pathways, with costimulatory pathways promoting the activation of naïve T-cells and coinhibitory pathways limiting T-cell activation (Murphy et al., 2012). Multiple inhibitory receptors on T-cells have been described, including CTLA-4, LAG-3, and PD-1, to name a few (Freeman & Sharpe, 2012). CD8+ T-cell responses have to be controlled; otherwise, they could lead to abnormal killing of cells and the production of pro-inflammatory cytokines (Sharpe & Pauken, 2017).

PD-1 plays an important role in regulating CD8+ T-cell responses. PD-1 is an inhibitory receptor expressed by T-cells during activation (Sharpe & Pauken, 2017). PD-1 inhibitory signals are important for maintaining peripheral tolerance and also serve as a mechanism by which tumors can evade host tumor-specific T-cell responses (Gong, Chehrazi-Raffle, Reddi, & Salgia, 2018). Upon activation, PD-1 is expressed on all conventional CD4+ and CD8+ T-cells (Sharpe, 2017). PD-1 regulates T-cell differentiation and effector responses. During chronic antigen stimulation, such as during chronic infection or cancer, PD-1 is highly expressed on T-cells and these cells are dysfunctional due to inhibition by PD-1 (Sharpe, 2017). This has led to the development of therapies targeting PD-1 to block these inhibitory signals and thereby stimulate T-cell responses (Sharpe, 2017).

PD-1, however, does not act alone in regulating T-cell responses. PD-1, together with its two ligands PD-L1 (CD274, B7-H1) and PD-L2 (CD273, B7-DC), serves as an important check on T-cell activation (Sharpe & Pauken, 2017). There are many mechanisms used by tumors to block immune responses, including, but not limited to, the use of PD-1. PD-1’s ligand PD-L1 plays an important role in protecting immune responses in tissues (Sharpe & Pauken, 2017). PD-L1 can be found on a whole range of cells, from T-cells and B-cells to macrophages and dendritic cells, as well as non-hematopoietic cells including epithelial and endothelial cells (Sharpe & Pauken, 2017). Higher levels of inflammation have been shown to increase expression of PD-L1 (Sharpe & Pauken, 2017). PD-1’s other ligand, PD-L2, is predominantly expressed on dendritic cells, macrophages and B-cells (Sharpe & Pauken, 2017). Both ligands have been shown to be expressed on tumor cells, with PD-L1 being more common and more frequently associated with inflammatory responses (Sharpe & Pauken, 2017). Both are regulated heavily by cytokines. Providing researchers with a new tool to understand how PD-1 regulates T-cell responses could in turn improve our ability to treat cancer.

Cancer Immunotherapy

Newly diagnosed cancer patients continue to make up a significant part of the patient population in the U.S., with roughly 1.7 million new cases in 2018 alone (Siegel, Miller, & Jemal, 2019). Cancer is also one of the leading causes of death worldwide, with roughly 600,000 deaths in the U.S. alone, despite a recent decline attributed to new types of therapies (Siegel et al., 2019). Targeted cancer therapy is something patients have only been able to benefit from in the last 10-15 years. Until then, cancer treatment was restricted to three avenues: surgery, radiation, and/or chemotherapy (Dan et al., 2012) (Institute, 2015). These general approaches lacked any genetic or immune component.


Some of the most effective modern treatments for cancer include genetically targeted therapies, hormone therapy, stem cells, precision medicine, and immunotherapy (Urruticoechea et al., 2010). Each of these has its benefits and drawbacks. Immunotherapy involves using the immune system to target tumors through the regulation of genes such as PD-1 (Gabriel, 2007). There are many clinical trials involving inhibitors of checkpoints like PD-1. Checkpoint inhibitors are not always a viable therapy for treating patients, depending on the tumor type or other medical conditions. In 2011, the FDA approved ipilimumab, an antibody that blocks CTLA-4 and the first checkpoint inhibitor approved for cancer immunotherapy. PD-1 cancer immunotherapy is now approved for over 20 types of cancer including melanoma, lung, and bladder cancer (Ribas, 2012) (FDA, 2015). Cancer immunotherapy can potentially have significant advantages over traditional treatments. For example, immunotherapy is less invasive than surgery, leading to faster recovery times, and can have fewer off target effects than chemotherapy (Urruticoechea et al., 2010). However, there are side effects, as patients may exhibit flu-like symptoms such as fatigue and muscle and/or joint pain (NCI, 2019). In addition, there can be more severe side effects which mimic autoimmunity, such as pneumonitis, colitis, and myocarditis (NCI, 2019). Immunotherapy is still in its infancy and there are many aspects of how the immune system attacks cancer which we have yet to understand. Some of these open questions relate to how inhibitory regulators like PD-1 and CTLA-4 help manage tumor clearance. For example, it appears that tumor clearance is more successful when both PD-1 and CTLA-4 are blocked (Sharpe, 2017). We need a fuller picture of the molecular pathways which are triggered by inhibitory checkpoint molecules (Sharpe, 2017). It would also be helpful to understand the cell-type specific roles or functions of checkpoint molecules (Sharpe, 2017).

Machine learning in medicine

Functional studies, such as those involving knockout experiments, can be extremely costly and time-consuming. This project focuses on ways to inform such analyses using machine learning to better understand the mechanisms behind PD-1 checkpoint inhibition. Because PD-1 inhibition serves as a mechanism for cancer immunotherapy (Gong et al., 2018), I review the use of neural networks in the analysis of immune responses. Below I review how PD-1 is currently studied in terms of conditional knockouts, and how statistical models like neural networks are employed to gain mechanistic understanding and help provide viable targets for downstream analysis.

Machine learning has made significant inroads into the study of medicine (Deo, 2015). In particular, machine learning approaches applied to Pathology and Radiology have been very successful, because the problems have been predominantly confined to image classification (Khosravi, Kazemi, Imielinski, Elemento, & Hajirasouliha, 2017). This is perhaps because the algorithms employed in machine learning are particularly adept at image classification. Much of this work was made possible by the TensorFlow software package developed by Google, which allows researchers to easily construct artificial neural networks. Most of this analysis follows a similar pipeline. First, images are obtained by a pathologist or radiologist and categorized by disease (Khosravi et al., 2017). Second, a training set of images is manually labeled as being benign or malignant (Khosravi et al., 2017). Third, this curated list of images is fed into the neural network in order to train it to distinguish between the different classifications (Khosravi et al., 2017). In many cases, the classification results have been shown to have a higher accuracy rate compared to those of a clinician alone (Esteva et al., 2017). Another important resource is ImageNet, a large collection of labeled images used to train neural networks (Krizhevsky, 2012). ImageNet makes the process of training a neural network easier as it does not require the user to already have the large collections of images normally needed to train a model (Krizhevsky, 2012). This has made it easier to use neural networks in medicine and research (Russakovsky et al., 2015).

Machine learning algorithms are particularly effective at solving problems when the amount of data is extremely large (in the range of gigabytes or more) (Hastie, Tibshirani, & Friedman, 2009). There are many different types of machine learning algorithms, but for larger datasets few have been shown to perform as well as neural networks (Ian Goodfellow, 2016). Neural networks were designed and modeled after a biological neural network and are comprised of nodes (neurons) and edges (synapses). In a neural network, information is transferred between nodes through the edge connections. New connections are formed, and old ones are removed, based on the specific needs of the network. Just as there is an action potential in a synapse which controls the electrical signal between neurons, neural networks have activation functions and weights which control how information is passed between neurons (Ian Goodfellow, 2016). This structure can be very simple, where there is a single neuron, or it can be very complex, with more than a million neurons. The limits for the network architecture are often set by the computer system upon which the network runs. The logic for how the network behaves scales with its complexity. This work, however, is not limited to cancer images.
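To make the role of weights and activation functions concrete, the following is a minimal sketch of a single artificial neuron and a small layer built from it. It is an illustration only and not the network developed in this thesis; the sigmoid activation, the function names, and all numeric values are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

def layer_output(inputs, weight_rows, biases):
    """A layer is several neurons that all read the same inputs."""
    return [neuron_output(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical inputs, weights, and biases, chosen only for illustration.
inputs = [0.5, -1.0, 2.0]
hidden = layer_output(inputs, [[0.4, 0.3, -0.2], [-0.6, 0.1, 0.5]], [0.0, 0.1])
prediction = neuron_output(hidden, [1.2, -0.7], 0.05)
print(hidden, prediction)
```

Training a network amounts to adjusting the weights and biases so that the final output matches known labels; that step is omitted here.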


A few seminal studies have successfully employed machine learning in medicine. In 2016, a model developed to automatically detect diabetic retinopathy using roughly 11,000 retinal images achieved 96% accuracy in detection of disease (Gulshan et al., 2016). In 2017, a model to screen for breast cancer using 230,000 pathology images was developed (Yun Liu, 2017). Later that year, the development of a phone application allowed patients to take pictures of areas of their skin in order to determine whether the imaged area shows a melanoma (Esteva et al., 2017). This phone application provided patients with an easy-to-use interface where they can get advice on whether or not they should see a physician. Yet another advancement was the development of a tool for automated image segmentation of healthy cells (Sadanandan, Ranefall, Le Guyader, & Wählby, 2017). The development of methods such as automated image segmentation is important as they help broaden what can be done using neural networks in medicine and in biology.

In addition to the use of neural networks for image analysis, strides have also been made in their use for the analysis of gene expression data. Machine learning in general is not new to gene expression. Classical statistical tools such as DESeq are widely used in genomics and employ many of the methods from machine learning (Love, Huber, & Anders, 2014). Neural networks, which are a different type of technique from DESeq, were not widely used in genomics until more recently. One example of this is the use of neural networks to reduce the amount of data that a researcher has to focus on (Lin, Jain, Kim, & Bar-Joseph, 2017). Traditionally, this has been done using principal component analysis (PCA) or other dimensionality reduction techniques such as t-SNE and UMAP (Hastie et al., 2009). With the advent of ever larger datasets, there has been a need for more powerful statistical tools than what is available today. This is even more important because genetic testing has become a staple in modern cancer treatment (Burke, 2002). Gene expression data can be derived from both clinical and non-clinical samples. This is in large part due to the availability of tissue samples and more routine genetic testing of cancer biopsies. To help with the processing of clinical samples, many hospital departments have data analysis support on site.
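As an illustration of the classical dimensionality reduction mentioned above, the sketch below finds the first principal component of a small expression matrix by power iteration on the covariance matrix. It is a teaching sketch under stated assumptions, not the PCA implementation used for the figures in this thesis: the function names and the sample values are hypothetical, and in practice an established statistics library would be used.

```python
import math

def center(rows):
    """Subtract each column's mean so every gene has mean zero across samples."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered data (genes x genes)."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Power iteration: repeated multiplication converges to the top eigenvector.
    Assumes the starting vector is not orthogonal to that eigenvector."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(rows, component):
    """Score of each sample along the given component."""
    return [sum(a * b for a, b in zip(row, component)) for row in rows]

# Hypothetical expression values: 4 samples (rows) x 3 genes (columns).
samples = [[5.1, 2.0, 8.3], [4.9, 2.2, 8.1], [9.8, 6.1, 3.0], [10.2, 5.9, 3.2]]
centered = center(samples)
pc1 = first_component(covariance(centered))
scores = project(centered, pc1)
print(pc1, scores)
```

In this toy matrix the first two samples and the last two samples fall on opposite sides of the first component, which is the kind of group separation a PCA plot is used to reveal.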

For most published applications involving gene expression analysis, the use of neural networks has been limited to basic science and not directly tied to patient care. The two major areas where neural networks have been heavily used in gene expression studies are 1) image analysis and 2) single cell analysis (Zheng & Wang, 2019) (Miao et al., 2019) (Ching et al., 2018) (Moen et al., 2019). Both types of experiments are uniquely positioned to benefit from the use of neural networks as they contain large amounts of data (millions of data points). Bulk RNA-seq experiments, by contrast, are limited in the number of experimental conditions being analyzed. Neural networks have been employed in the analysis of single cell RNA sequencing (SCRNA-seq) because, while a bulk RNA-seq experiment might be comprised of a handful of replicates, a SCRNA-seq experiment could be comprised of tens of millions of cells, each cell being an experiment. These differences bring their own inherent challenges. One challenge facing SCRNA-seq (and single cell methods in general) has to do with batch effects (Korsunsky et al., 2019). Many different cells from different tissues are often compared. These cells can be normalized differently as the amount of RNA extracted from each cell is usually inconsistent. This leads to the introduction of bias which can be difficult to detect and to correct for. Recently, neural networks, among other methods such as Harmony, have been shown to be successful in addressing these biases (Korsunsky et al., 2019) (Way & Greene, 2018). One other example is in an area which intersects both immunology and genetics, where neural networks are beginning to be applied successfully to the prediction of peptides using SCRNA-seq data (Graham et al., 2018).
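The effect of inconsistent RNA yield per cell can be seen in a small example. The sketch below applies simple counts-per-million scaling, which is one common way to remove differences in sequencing depth; it is not Harmony or a neural network method, and it does not address batch effects between tissues. The function name and the counts are hypothetical.

```python
def counts_per_million(counts):
    """Scale one cell's raw counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}

# Hypothetical cells with the same relative expression but a tenfold
# difference in the amount of RNA captured.
cell_a = {"Pdcd1": 10, "Cd8a": 40, "Actb": 950}
cell_b = {"Pdcd1": 100, "Cd8a": 400, "Actb": 9500}

cpm_a = counts_per_million(cell_a)
cpm_b = counts_per_million(cell_b)
print(cpm_a["Pdcd1"], cpm_b["Pdcd1"])
```

Before scaling, the raw counts would suggest a tenfold expression difference between the two cells; after scaling, the two cells agree.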

Goals

Within the field of medicine, machine learning has been broadly applied to image classification for Pathology and Radiology, with limited focus on precision medicine. To provide a sense of scale, more than 400 papers featuring the use of machine learning for the analysis of medical images were published in 2016 and 2017 alone (Yi, Walia, & Babyn, 2018). We hope to expand on what has been done with machine learning, but in the area of modeling gene expression in the tumor micro-environment. We are also proposing an application which does not involve image analysis. The insights into PD-1 function gained from the proposed analyses could lead to better therapies for patients suffering from cancer. Although PD-1 inhibition serves as a mechanism for cancer immunotherapy, we do not fully understand the mechanism by which PD-1 regulates anti-tumor immunity (Gong et al., 2018). With the dataset provided by Dr. Arlene Sharpe’s lab, we hope to answer two questions.

1) What does PD-1 regulate in a cell intrinsic manner compared to bystander effects on other cells in the tumor micro-environment?

2) Which gene expression changes predict response as opposed to resistance to tumor clearance when PD-1 is present or deleted?

This project contributes to that work in three ways. First, the development of an RNA-sequencing pipeline allows the lab to analyze the results of the datasets which address these questions (Figure 1). Second, conducting pathway analyses on the datasets provides a broader picture of the gene expression landscape. This should provide a better picture of the tumor micro-environment. Third, the development of a statistical model predicts which gene expression changes are associated with response as opposed to resistance. This aims to expand on what is known by providing more detail surrounding PD-1 inhibition in relation to tumor suppression.

The results from the RNA-seq analysis in Pipeline A (Figure 1) provide the input for different pathway analysis tools. This helps narrow down the list of pathways to target. These networks can then be manipulated and modeled using neural networks, providing insight and substantive downstream targets for mechanistic analyses. It is the hope that these targets will help save time and money and provide a way to measure how likely potential targets are to be effective. This level of certainty is missing in current disease models of PD-1 inhibition. Ideally, this work would lead to improved outcomes for patients involved in these therapies. The results of both of these aims will be used in the construction of a statistical model, namely a neural network, that can be used to make predictions based on this data.


Figure 1. Schematic of RNA-seq Analysis pipelines

Pipeline A is a workflow which uses a different program for each step in the process. The STAR aligner aligns reads to the genome. SAMtools sorts the reads that were mapped. HTSeq-count is used to count reads. DESeq determines which genes are differentially expressed. Pipeline B uses Salmon both to map transcripts and to count them. It then uses DESeq to determine differential expression. Pipeline C uses CLC Genomics Workbench to process the reads and edgeR to determine differential expression.

Research Problem

We hypothesize that a cell intrinsic loss of PD-1 is necessary for improved T-cell fitness and effector functions. To study this, mice were subcutaneously implanted with MC38 adenocarcinoma tumors. The mice were divided into three groups: wild type mice (our control), CD8-specific deletion mice (CD8 Δ PD-1) where PD-1 is permanently deleted on all CD8 T-cells, and tamoxifen-inducible deletion mice (Tamox Δ PD-1) where PD-1 is deleted on half of all cells. RNA sequencing (RNA-seq) was then performed on samples from these mice in order to generate the datasets used in this analysis. With the results of the RNA-seq analysis, a general pathway analysis is carried out.

Pathway analysis is a type of bioinformatics analysis where sets of genes are associated with certain gene ontologies (GO). There are many tools which can be used to perform this task, most of which contain GO databases or pathway maps, such as KEGG or DAVID. GO is a controlled vocabulary that is used to describe the biology of a gene product (The Gene Ontology Consortium, 2018). GO is broken up into three categories of vocabulary (ontologies): molecular function, biological process, and cellular component. Molecular function encompasses all of the molecular level activities performed by a gene product, such as catalysis (The Gene Ontology Consortium, 2018). Biological processes are larger in scope and are performed by multiple molecular components; an example of this would be biosynthesis (The Gene Ontology Consortium, 2018). Cellular component, the third ontology, describes locations in or on a cell, such as centrioles, as opposed to processes (The Gene Ontology Consortium, 2018). Pathway analysis tools also tend to contain pathway maps, which are network diagrams that show molecular interactions involving genes of interest (Kanehisa & Goto, 2000). In addition, they can also contain curated information on pathways and GO which has been compiled by research staff (this tends to exist in commercial software).

Performing pathway analysis ties in relevant information that could help in providing a more complete picture of what is happening in the cell’s micro-environment. The results of the RNA-seq and pathway analysis were used in the development of a statistical model in the form of a neural network. This was developed to give the Sharpe lab a tool to use in their quest to understand how PD-1 regulates the response to cancerous tumors. In essence, the lab would have a pretrained model that could then be extended without the need to analyze the initial data.
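The association between a gene list and a GO term described above is often scored with an over-representation test. The sketch below computes a one-sided hypergeometric p-value for the overlap between a set of differentially expressed genes and the genes annotated to one term. This is a generic illustration of one common approach and is not claimed to be the exact statistic used by any particular tool in this thesis; the function name and all counts are hypothetical.

```python
from math import comb

def enrichment_p_value(population, annotated, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `population` genes, of which `annotated` belong to the GO term."""
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(comb(annotated, k) * comb(population - annotated, selected - k)
               for k in range(overlap, upper + 1))
    return tail / total

# Hypothetical numbers: 20,000 background genes, 200 annotated to the term,
# 100 differentially expressed genes, 8 of which carry the annotation.
p_value = enrichment_p_value(20000, 200, 100, 8)
print(p_value)
```

By chance alone roughly one of the 100 selected genes would be expected to carry the annotation, so an overlap of 8 yields a small p-value. When many terms are tested, the p-values would also need a multiple-testing correction, which is omitted here.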


Chapter II.

Materials and Methods

The first dataset characterized the CD8-specific deletion where PD-1 was permanently deleted on all CD8 T-cells for the lifetime of the mouse (CD8 Δ PD-1). In this dataset, the "Cre negative" (Cre-) samples are considered wild type controls, and the "Cre positive" (Cre+) samples are the mice lacking PD-1 (PD-1 deleted on T-cells) (see Table 1). All of the mice are PD-1 f/f (contain two copies of the loxP-flanked PD-1 allele), and the presence or absence of Cre is what dictates deletion. CD8+ T-cells were sorted from tumors and tumor-draining lymph nodes (tumor-dLN) from mice at day 10 post-tumor cell implantation. There were 4 mice each for the Cre- and Cre+ tumors. Additionally, there were 5 mice from which samples were taken from draining lymph nodes (dLN); however, these samples were excluded from the analysis. Based on how sorting was completed (bulk CD8+ T-cells, not sorted on antigen-specific or even antigen-experienced cells), the bulk of these cells should be naive regardless of which mouse they come from.


Table 1. CD8 Δ PD-1 dataset generated by Vikram Juneja

| Sample | Mouse # | KP added meta data: Cre status | KP added: Primary vs secondary challenge | RNAquant | Barcode | ul to combine |
|---|---|---|---|---|---|---|
| 1 | AC2228 | - | Primary | 242 | A2 | 8.3 |
| 2 | AC1449 | + | Primary | 248 | A3 | 8.1 |
| 3 | AC4858 | - | Primary | 208 | A4 | 9.6 |
| 4 | AC4861 | + | Primary | 335 | A5 | 6.0 |
| 5 | AC4868 | + | Primary | 133 | A6 | 15.0 |
| 6 | AC4862 | + | Primary | 262 | A7 | 7.6 |
| 7 | AC9394 | - | Secondary | 177 | A8 | 11.3 |
| 8 | AC9392 | - | Secondary | 77 | A9 | 20.0 |
| 9 | AC9391 | + | Secondary | 174 | B2 | 11.5 |
| 10 | AC9393 | + | Secondary | 164 | B3 | 12.2 |
| 11 | AC3944 | - | Primary | 149 | B4 | 13.4 |
| 12 | AC7204 | - | Primary | 117 | B5 | 17.1 |
| 13 | AC3947 | - | Primary | 85 | B6 | 20.0 |
| 14 | AC3942 | - | Primary | 117 | B7 | 17.1 |
| 15 | AC3946 | + | Primary | 217 | B8 | 9.2 |
| 16 | AC7199 | + | Primary | 161 | B9 | 12.4 |
| 17 | AD1496 | - | Secondary | 117 | C2 | 17.1 |
| 18 | AD1494 | + | Secondary | 134 | C3 | 14.9 |
| 19 | AD3331 | + | Primary | 241 | C4 | 8.3 |
| 20 | AD3330,52 | - | Primary | 65 | C5 | 20.0 |
| 21 | AD3333 | + | Primary | 173 | C6 | 11.6 |
| 22 | AD3351 | - | Primary | 59 | C7 | 20.0 |
| 23 | AD3354 | + | Primary | 56 | C8 | 20.0 |
| 24 | AD3353 | - | Primary | | C9 | 20.0 |
| 25 | PD1KO1 | PD1 KO | Primary | 103 | D2 | 19.4 |
| 26 | PD1KO2 | PD1 KO | Primary | 117 | D3 | 17.1 |
| 27 | AC5393 | ? | | 215 | D4 | 9.3 |
| 28 | AC3868 dLN | ? | | 469 | D5 | 4.3 |
| 29 | AC3944 dLN | - | Primary | 274 | D6 | 7.3 |
| 30 | AC7204 dLN | - | Primary | 601 | D7 | 3.3 |
| 31 | AC3947 dLN | - | Primary | 342 | D8 | 5.8 |
| 32 | AC7199 dLN | + | Primary | 377 | D9 | 5.3 |

These mice had PD-1 deleted on CD8 cells. The samples were taken from mice that were either PD-1 negative (Cre+) or PD-1 positive (Cre-). dLN data were not used in this analysis.

The second dataset characterizes the tamoxifen-inducible deletion, where PD-1 was deleted on half of all cells in the mouse, starting at day 7 and ending on day 11 (Tamox Δ PD-1) (see Table 2). The cells from Tamox Δ PD-1 are tumor-infiltrating lymphocytes (TILs), which are immune cells that have migrated from the blood into the tumor to attack the cancer. Since this is the model where we only have 50% deletion of PD-1, we sorted PD-1 positive and PD-1 negative CD8+ T-cells from the tumors separately and performed separate RNA-seq on these samples. Again, the Cre negative mice were considered wild-type controls, and the Cre positive mice had PD-1 deleted (all mice contain two copies of the loxP-flanked PD-1 gene, so the presence or absence of Cre dictates deletion). In the Cre negative mice, both PD-1 positive and PD-1 negative T-cells have the potential to express PD-1. CD8 Δ PD-1 has higher PD-1 expression than Tamox Δ PD-1. The samples labeled "Cre-" and "Cre+" correspond to the tumor-infiltrating CD8+ T-cell population. They are further designated as "PD-1-" or "PD-1+" in Table 2 as part of the sample name. There are 7 Cre- and 6 Cre+ tumors. Six dLN samples were sequenced and come from a combination of both Cre+ and Cre- mice (data not included in the analysis).


Table 2. Tamox Δ PD-1 dataset generated by Vikram Juneja.

| # | Sample | KP added based on file names: Cre status | # of cells | Barcode | ul to combine |
|---|---|---|---|---|---|
| 1 | TILs 1 PD1+ | + | 4452 | A1 | 20 |
| 2 | TILs 2 PD1+ | - | 6530 | A2 | 15.3 |
| 3 | TILs 3 PD1+ | + | 6542 | A3 | 15.3 |
| 4 | TILs 4 PD1+ | + | 4879 | A4 | 20 |
| 5 | TILs 5 PD1+ | - | 16897 | A5 | 5.9 |
| 6 | TILs 6 PD1+ | - | 4340 | A6 | 20 |
| 7 | TILs 7 PD1+ | + | 3378 | A7 | 20 |
| 8 | TILs 8 PD1+ | - | 52646 | A8 | 1.9 |
| 9 | TILs 9 PD1+ | + | 5928 | B1 | 16.9 |
| 10 | TILs 10 PD1+ | - | 14629 | B2 | 6.8 |
| 11 | TILs 11 PD1+ | - | 36408 | B3 | 2.7 |
| 12 | TILs 12 PD1+ | - | 8365 | B4 | 12.0 |
| 13 | TILs 13 PD1+ | + | 8697 | B5 | 11.5 |
| 14 | TILs 14 PD1+ | + | 14265 | B6 | 7.0 |
| 15 | TILs 15 PD1+ | - | 4931 | B7 | 20 |
| 16 | TILs 1 PD1- | + | 24424 | B8 | 4.1 |
| 17 | TILs 2 PD1- | - | 2475 | C1 | 20 |
| 18 | TILs 3 PD1- | + | 3079 | C2 | 20 |
| 19 | TILs 4 PD1- | + | 3973 | C3 | 20 |
| 20 | TILs 5 PD1- | - | 1132 | C4 | 20 |
| 21 | TILs 6 PD1- | - | 1342 | C5 | 20 |
| 22 | TILs 7 PD1- | + | 2979 | C6 | 20 |
| 23 | TILs 8 PD1- | - | 3342 | C7 | 20 |
| 24 | TILs 9 PD1- | + | 2909 | C8 | 20 |
| 25 | TILs 10 PD1- | - | 1923 | D1 | 20 |
| 26 | TILs 11 PD1- | - | 1496 | D2 | 20 |
| 27 | TILs 13 PD1- | + | 6095 | D3 | 16.4 |
| 28 | TILs 14 PD1- | + | 8004 | D4 | 12.5 |
| 29 | dLN 3 | + | 50,000 | D5 | 4 |
| 30 | dLN 4 | + | 50,000 | D6 | 4 |
| 31 | dLN 5 | - | 50,000 | D7 | 4 |
| 32 | dLN 8 | - | 50,000 | D8 | 4 |

The samples were taken from mice that were either PD-1 negative or PD-1 positive. These mice had PD-1 deleted using a tamoxifen-inducible deletion allowing for the deletion of the gene at day 7. + indicates Cre+ PD-1- (negative) and – indicates Cre- PD-1+ (positive).

Preliminary analyses were conducted on the two datasets to determine their makeup and whether anything in the datasets stood out that might inhibit further analysis. These analyses used the results from running the datasets through the CLC Genomics Workbench (pipeline C in Figure 1). Normally, preliminary analyses are conducted after running FastQC and/or Qualimap, which are tools for performing quality control on RNA sequencing reads (Andrews) (Okonechnikov, Conesa, & García-Alcalde, 2016). In the case of this experiment, preliminary analyses were conducted beforehand because we needed to determine whether these datasets could be used to answer our research questions.

Summary statistics were compiled using the summary function in R. Boxplots were used to examine the distributions of the two datasets (Figure 2). The raw counts were log normalized to aid in visual inspection of the boxplots, because RNA-seq count data from mice and other organisms have been shown to follow a negative binomial distribution (Anders, Reyes, & Huber, 2012) (Soneson, Love, & Robinson, 2016). This causes the data to follow a skewed distribution, making it difficult to use methods such as a t-test, linear model, or ANOVA. Of note, PCA was performed on the raw counts to get a better picture of how variance is distributed in the data, which helps to inform the downstream analyses. In an ideal scenario, the majority of variance in a dataset would be restricted to the first few principal components. This is what we observe with our datasets (Figure 2). It is also important to determine the frequency of counts for the two datasets, which provides insight into the data’s structure. Are there outliers? Are there strange patterns in the data (for example, are the data multimodal)? These questions can be addressed by looking at a histogram. In this experiment, histograms depict the frequency of genes with X number of reads. Next, it was important to determine whether there were clusters based on counts or on Cre+/Cre- status. Detecting clusters would reveal any obvious trends in the data without having to model the data. Clusters were assessed visually using t-SNE, as it is suited to high-dimensional data. t-SNE does a better job than PCA of capturing differences at both small and large distances, because unlike PCA it does not just assess variance but uses Gaussians to help determine local distance (Maaten, 2014). t-SNE is also unbiased in the sense that, unlike methods such as k-means, it does not require us to know how many clusters to look for (Maaten, 2014). After these preliminary analyses were completed, we felt the results were adequate to proceed with the project.
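The effect of log normalizing skewed counts before drawing a boxplot can be illustrated with a minimal sketch. It is written in Python rather than the R used for this project, the counts are invented for illustration, and the base-2 logarithm with a pseudocount of 1 is an assumption; the exact transform used for Figure 2 is not specified here.

```python
import math
import statistics

def log_normalize(counts, pseudocount=1.0):
    """Return log2(count + pseudocount) for each raw count (assumed transform)."""
    return [math.log2(c + pseudocount) for c in counts]

def five_number_summary(values):
    """Minimum, quartiles, and maximum: the values a boxplot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}

# Hypothetical raw counts for one sample: zeros plus a long right tail.
raw_counts = [0, 0, 1, 3, 7, 15, 31, 63, 1023]
log_counts = log_normalize(raw_counts)

raw_summary = five_number_summary(raw_counts)
log_summary = five_number_summary(log_counts)
print(raw_summary)
print(log_summary)
```

On the raw scale the maximum sits far above the upper quartile, so the box is squashed against zero; after the transform the quartiles and the maximum fall on a comparable scale, which is what makes the boxplots readable.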

Figure 2. Boxplots and PCA plots of the counts for each experiment

A: Boxplot of log normalized expression values for CD8 Δ PD-1. B: Boxplot of log normalized expression values for Tamox Δ PD-1. C: Principal component variance plot for CD8 Δ PD-1. D: Principal component variance plot for Tamox Δ PD-1.


RNA Sequencing analysis

RNA sequencing, including library preparation, was performed according to Kadoki et al. (2017). The resulting read files were checked for quality control (QC) using the FastQC program (Wingett & Andrews, 2018). This analysis was meant to uncover issues in the sequence files that could point to problems with the library prep or the sequencing itself (e.g., contamination). Normally, this would be performed before the preliminary analysis; however, given the scope of this project, it was important to perform some preliminary data analysis first. This ensured that the datasets could support the questions posed in this thesis project. Once this was completed, the read files were processed through the following protocol. Reads were first mapped to the reference genome, then those reads were counted, and finally the counts were fit to a mathematical model to determine which genes were differentially expressed. The goal was to compare different pipelines in order to provide the lab with a standard operating procedure. A schematic of the pipelines that are compared can be seen in Figure 1. In each case, data are mapped to the mouse genome. Pipeline A uses the STAR aligner, HTSeq, SAMtools, and DESeq; Pipeline B uses Salmon and DESeq; and Pipeline C uses the Qiagen CLC Genomics Workbench. The lab has previously used Qiagen’s CLC Genomics Workbench (Qiagen, 2019) to perform their RNA-seq analysis, and therefore this was used as the benchmark pipeline. CLC Genomics Workbench is an easy-to-use, standalone MS Windows program that handles the entire workflow (Qiagen, 2019). The department has since lost support for CLC Genomics Workbench, and so alternatives are needed.


First, in each case, reads were mapped to the current version of the mouse genome (mm10, Ensembl release 67) (Hunt et al., 2018). The dominant ways to map reads are hash-based methods, suffix arrays, and Burrows-Wheeler transformations (Lindner & Friedel, 2012). Hash-based methods work by using a key-value lookup table, similar to a phonebook. In this system, k-mers of a given size are chosen and used to build a lookup table that maps each k-mer to the positions where it occurs (Lindner & Friedel, 2012). This method tends to have speeds of 0.1-1.0 second/query, which is usually too slow for modern RNA-seq datasets (Lindner & Friedel, 2012). A suffix array is built by breaking a string up into its suffixes; a table of those suffixes is constructed and then sorted for efficiency (Lindner & Friedel, 2012). Both Salmon and the STAR aligner employ the suffix array method. Finally, a Burrows-Wheeler transformation can map reads by using a compressed form of suffix arrays (Lindner & Friedel, 2012).
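The difference between a hash-based lookup and a suffix array can be sketched on a toy sequence. This is a minimal illustration of the two data structures only; it is not how STAR or Salmon are implemented, and the reference string, read, and function names are invented for the example.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Hash-based index: map every k-mer to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def build_suffix_array(reference):
    """Suffix array: start positions of all suffixes, sorted lexicographically."""
    return sorted(range(len(reference)), key=lambda pos: reference[pos:])

def find_with_suffix_array(reference, suffix_array, query):
    """Binary search the sorted suffixes for those that start with the query."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # locate the first suffix that is >= query
        mid = (lo + hi) // 2
        if reference[suffix_array[mid]:] < query:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(suffix_array) and reference[suffix_array[lo]:].startswith(query):
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)

reference = "GATTACAGATTACA"  # toy sequence, not real genome data
read = "TTAC"

kmer_index = build_kmer_index(reference, k=4)
hash_hits = kmer_index.get(read, [])
suffix_array = build_suffix_array(reference)
suffix_hits = find_with_suffix_array(reference, suffix_array, read)
print(hash_hits, suffix_hits)
```

The hash index answers only queries of exactly length k, whereas the sorted suffixes can be searched for a query of any length, which is one reason suffix-based structures are attractive for read mapping.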

When aligning or mapping the reads to the reference genome (mm10), each pipeline uses a separate program. The STAR aligner is a fast and accurate tool for aligning reads to the reference genome (Dobin et al., 2013). STAR is widely used and was designed for aligning RNA sequencing data. Salmon maps reads to a transcriptome (Patro, Duggal, Love, Irizarry, & Kingsford, 2017). This is performed using the transcriptomics pipeline provided by the software program. The benefit of Salmon is its speed; it is the fastest of the programs. Salmon does require reads to quasi-map to a transcriptome; because of this, a file containing transcripts must be provided (Patro et al., 2017). The implication is that it might not be possible to find novel transcripts, making this tool not particularly suitable for the lab.


Second, the mapped reads are sorted and counted to determine the expression level for each gene. The different pipelines employ separate methods for achieving this. In the case of Pipeline A (STAR aligner), the resulting files were sorted (to aid in counting) using SAMtools (Li et al., 2009). Then, using HTSeq-Count, mapped reads are counted using one of three overlap modes: union, intersection strict, and intersection nonempty. For union, if a read overlaps a given gene at any position, it is assigned to that gene (Anders, Pyl, & Huber, 2015). If a read overlaps more than one gene, it is considered ambiguous and discarded (Anders et al., 2015). Intersection strict is more stringent: a read is assigned to a gene only if every aligned position of the read falls within that gene (Anders et al., 2015). A read that is mapped across a gap, or that is not fully mapped to a gene, is therefore not counted (Anders et al., 2015). Intersection nonempty is similar to intersection strict, except that positions overlapping no gene are ignored, so a read that overlaps more than one gene is still counted when only one of those genes covers all of its annotated positions (Anders et al., 2015).
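The three overlap modes can be made concrete with a small sketch. It follows the set-based definitions in the HTSeq documentation (union of the per-position gene sets, intersection of all sets, intersection of the non-empty sets), but it is an illustrative reimplementation rather than HTSeq code; the function name and the toy reads are assumptions.

```python
def assign_read(position_features, mode):
    """Assign one read, given the set of genes found at each aligned position.

    Returns a gene name, "no_feature", or "ambiguous".
    """
    if mode == "union":
        genes = set().union(*position_features)
    elif mode == "intersection-strict":
        genes = set.intersection(*position_features)
    elif mode == "intersection-nonempty":
        nonempty = [s for s in position_features if s]
        genes = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")
    if not genes:
        return "no_feature"
    if len(genes) > 1:
        return "ambiguous"
    return next(iter(genes))

# Toy reads, each a list of gene sets (one set per aligned position).
inside = [{"A"}, {"A"}, {"A"}]      # read lies entirely within gene A
overhang = [{"A"}, {"A"}, set()]    # read runs past the end of gene A
nested = [{"A"}, {"A", "B"}]        # gene B overlaps part of gene A

modes = ["union", "intersection-strict", "intersection-nonempty"]
results = {
    name: [assign_read(read, mode) for mode in modes]
    for name, read in [("inside", inside), ("overhang", overhang), ("nested", nested)]
}
print(results)
```

The overhanging read is where the modes diverge most visibly: it is counted under union and intersection nonempty but dropped under intersection strict.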

Pipeline B (Salmon) works in a very different manner. Instead of adding up the reads for each gene, it uses a statistical model. The algorithm employs a quasi-mapping technique (Patro et al., 2017). It is made up of two operations: the first is an “online phase”, which estimates the initial expression levels and parameters for the model. The second is an “offline phase”, which refines the estimates calculated during the online phase (Patro et al., 2017). This results in a probabilistic model of the sequencing experiment, which can be used to estimate the number of reads belonging to each transcript. Pipeline C (CLC Genomics Workbench) is able to count reads using its main RNA sequencing tool and therefore does not require a separate program.


Third, the count data are processed to determine which genes are differentially expressed, the last step in an RNA-seq analysis. This is usually handled by DESeq (Love et al., 2014), edgeR (M. D. Robinson, McCarthy, & Smyth, 2010), or Cufflinks (Trapnell et al., 2012). For this analysis, DESeq and edgeR are used: pipeline A (STAR aligner) and pipeline B (Salmon) use DESeq, while CLC Genomics Workbench uses edgeR. Both DESeq and edgeR fit the counts with a generalized linear model (GLM) for count data. The starting point for such models is the Poisson distribution (Jiang & Wong, 2009); the negative binomial distribution used by both tools extends the Poisson with a dispersion term. The Poisson distribution is a good starting model because it closely resembles the types of counts in RNA-seq data, which is due to “the law of small numbers” (Love, 2013) (Härdle & Vogt, 2015). The law states that if (1) the probability of an event, p(x), is small (<.01%) while the number of observations is large (N>10,000), and (2) the product of those two values is small (a number between 0 and 10), then (3) the probability distribution for the sample follows a Poisson distribution (Charlier, Bortkiewicz, & Greenwood, 1947). In our case, the probability that any single read comes from a given gene is low, while the number of reads is large.

Once the count data were processed by either DESeq or edgeR, a list of differentially expressed genes was used for pathway analysis. Of note, Cufflinks was not utilized for this analysis, as it requires normalization techniques which have been shown to be less effective at controlling variance across replicates than DESeq, edgeR, and trimmed mean of M-values (TMM) normalization (Dillies et al., 2012) (Mark D. Robinson & Oshlack, 2010). With Cufflinks, we would have had to transform the data into a log normal distribution to determine which genes are differentially expressed (Trapnell et al., 2012). This can lead to quantification issues due to the large number of genes that are not expressed. Fitting data to a normal distribution does not work well with a dataset that has zero inflation (Soneson et al., 2016).
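The law of small numbers invoked for the Poisson model can be checked numerically. The sketch below compares the exact binomial probabilities with their Poisson approximation for values chosen to satisfy the stated conditions; the specific numbers (20,000 reads, a per-read probability of 0.005%) are illustrative assumptions, not values from the datasets.

```python
import math

def binomial_pmf(k, n, p):
    """Exact probability of k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson probability of k events with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n_reads = 20_000          # large number of observations (N > 10,000)
p_gene = 0.00005          # small event probability (< 0.01%)
lam = n_reads * p_gene    # product is small (about 1)

# Largest pointwise gap between the two distributions for k = 0..10.
max_gap = max(
    abs(binomial_pmf(k, n_reads, p_gene) - poisson_pmf(k, lam))
    for k in range(11)
)
print(f"lambda = {lam:.3f}, largest difference = {max_gap:.2e}")
```

With these inputs the two distributions agree to several decimal places, which is the sense in which counts of rare events over many trials behave like Poisson counts.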

The different pipelines are compared to determine which genes are differentially expressed for a given condition, Cre+ versus Cre- in our case. This was done using DESeq for Pipelines A and B, while Pipeline C used edgeR. This protocol was performed in order to determine which pipeline would be a better fit for the lab and, more specifically, for these two datasets. The work to run these pipelines was performed using the O2 computing cluster at Harvard Medical School (DANUSER, 2010). Code and data for each pipeline can be found on GitHub: https://github.com/alosdiallo/PD1_Model_Project_code, as well as on the Harvard Dataverse: https://doi.org/10.7910/DVN/5KVIYV.

Once the datasets were processed through the three pipelines, benchmarking was performed on all pipelines in two ways. First, the top 100 genes from each pipeline were compared to determine how much overlap exists. Then, gene expression was compared between the methods using a set of genes with known expected differential expression patterns when PD-1 is deleted. These two analyses provide an objective measure for comparison.
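One simple way to quantify the overlap between the top-ranked genes of two pipelines is to intersect the two lists and report a Jaccard index. This is a sketch of that idea only; the choice of the Jaccard index is an assumption, and the gene rankings below are arbitrary placeholders rather than results from this project.

```python
def top_gene_overlap(ranked_a, ranked_b, top_n=100):
    """Return the shared top-N genes and the Jaccard index of the two top-N sets."""
    top_a, top_b = set(ranked_a[:top_n]), set(ranked_b[:top_n])
    shared = top_a & top_b
    union = top_a | top_b
    jaccard = len(shared) / len(union) if union else 0.0
    return sorted(shared), jaccard

# Placeholder rankings for two hypothetical pipelines (not real output).
pipeline_a = ["Pdcd1", "Gzmb", "Ifng", "Tox", "Tcf7"]
pipeline_b = ["Gzmb", "Pdcd1", "Havcr2", "Tox", "Lag3"]

shared_genes, jaccard = top_gene_overlap(pipeline_a, pipeline_b, top_n=5)
print(shared_genes, round(jaccard, 3))
```

A Jaccard index of 1 would mean the two pipelines agree on the entire top-N list, and 0 would mean no gene is shared.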

The top 100 genes are also compared visually using a heatmap which employs hierarchical clustering to arrange the rows and columns based on the z-score. The hierarchical clustering algorithm allows for the row positions to remain the same even if the data is clustered again. This is because the hierarchical clustering algorithm is not stochastic. This allows these findings to be consistent and reproducible.


Pathway Analysis

Pathway analysis was carried out on the results from pipeline A to try to understand how gene expression changes affect the network of interactions. The results of pathway analyses provide a clearer picture of the PD-1 tumor micro-environment that encompasses associated pathways, biological processes, and regulation. This was also carried out to provide a higher fidelity statistical model, along with network maps to incorporate into the model. Four pathway analysis tools were used for this portion of the project: Gene Set Enrichment Analysis (GSEA) version 4.0.2 (Subramanian et al., 2005), Protein Analysis Through Evolutionary Relationships (Panther) (Mi, Muruganujan, Casagrande, & Thomas, 2013) (Clark et al., 2003), Enrichr (Kuleshov et al., 2016) (Chen et al., 2013), and Ingenuity Pathway Analysis (IPA) by Qiagen (Yu, Gu, & Yi, 2016). These were chosen to try to obtain the greatest amount of information using the least number of tools. There are hundreds of tools which could have been used, but the tools chosen had to have data for mice, be current, have a good breadth of data (database wise), and be likely to exist in the next 5 years. While all of the tools ask similar types of questions, they do not all contain the same data subsets, and they test for significance differently (Khatri, Sirota, & Butte, 2012). Each tool uses its own statistical test when determining significance. GSEA uses a modified Kolmogorov-Smirnov test, Panther uses the Mann-Whitney U test, Enrichr uses a combination of Fisher’s exact test and Benjamini-Hochberg correction, whereas IPA uses only a Fisher’s exact test (Huang, Sherman, & Lempicki, 2008).
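To show what an over-representation test of this kind computes, the sketch below implements a one-sided Fisher’s exact test (the upper tail of the hypergeometric distribution) followed by a Benjamini-Hochberg adjustment. It illustrates the general technique under invented numbers and is not the implementation used by Enrichr or IPA; the function and parameter names are assumptions.

```python
from math import comb

def fisher_enrichment_pvalue(hits, list_size, set_size, background):
    """One-sided Fisher's exact test for over-representation.

    hits: genes found in both the input list and the gene set
    list_size: number of genes in the input list
    set_size: number of genes in the gene set
    background: total number of genes considered
    """
    total = comb(background, list_size)
    upper = min(list_size, set_size)
    tail = sum(
        comb(set_size, i) * comb(background - set_size, list_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / total

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Invented example: 3 of 5 list genes fall in a 5-gene set, out of 20 genes.
p_value = fisher_enrichment_pvalue(hits=3, list_size=5, set_size=5, background=20)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(round(p_value, 4), [round(a, 4) for a in adjusted])
```

The adjustment step matters because each gene set is tested separately, so raw p-values across hundreds of sets would otherwise overstate significance.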

GSEA is a tool used to determine which sets of genes for a specific biological function, location, and/or regulation are statistically significant (Subramanian et al., 2005). When using GSEA, the Molecular Signatures Database (MSigDB) was utilized, focusing on the Hallmark, C2, C5, and C7 databases (Liberzon et al., 2011). These databases were chosen because of their breadth of content. GSEA provides a list of genes, along with a way of determining significance for a given biological term, also known as a GO term. Total gene counts for all conditions are input into the tool. The naming convention for GSEA is slightly different from what is used by most gene ontology tools. The GSEA tool begins gene set names with the database name or with “GS”, which is taken from the NCBI Gene Expression Omnibus (GEO) series accession number (Subramanian et al., 2005). The MSigDB is a collection of predefined gene sets which share pathways, functions, chromosome location, cellular location, or other properties (Liberzon et al., 2011). The Hallmark database was computationally generated and contains gene sets which represent specific biological processes or states (Liberzon et al., 2015). It is designed as the starting point for any use of MSigDB (Liberzon et al., 2011). The C2 database was derived from a curation of online pathway databases and biomedical literature (Liberzon et al., 2011). The database is made up of two sets: the Chemical and Genetic Perturbation (CGP) dataset and the Canonical Pathways (CP) dataset (Liberzon et al., 2011). The C5 database is based on Gene Ontology (GO) terms, and therefore all of the sets have a corresponding GO term associated with them (Powell, 2014). These terms are broken up into the categories molecular function (MF), cellular component (CC), and biological process (BP) (Powell, 2014). Finally, the C7 database is immunology focused (Liberzon et al., 2015). The C7 database was initially derived from microarray datasets which were deposited to the Gene Expression Omnibus (Liberzon et al., 2011) (Barrett et al., 2012). In conducting the pathway analyses for this project, phenotype data was retrieved from the Mouse Genome Database (MGD) housed at the Jackson Laboratory (Davisson MT, 1999).

The Panther web tool provides access to databases containing Gene Ontology information for given organisms, much like GSEA (Mi et al., 2013). It was chosen for its up-to-date datasets and its ease of use. Panther takes a list of genes as input; in this case, the list consisted of differentially expressed genes that were also considered significant (p-value<0.05). These genes were fed into the Panther tool for analysis, which provides a list of terms as output. Enrichr was chosen because it provides an interface with hundreds of pathway analysis tools (Chen et al., 2013) (Kuleshov et al., 2016). This tool uses a Fisher’s exact test that is corrected using the false discovery rate (FDR) (Khatri et al., 2012). Like Panther, Enrichr takes a gene list (the same list was used in this case). Enrichr has the ability to pull in data from external sources such as KEGG and Reactome, as well as the ability to export the results as text or an image (Kanehisa & Goto, 2000) (Jassal et al., 2020).

Finally, IPA was used; it is a commercially available software program with a repository of manually curated datasets and visualization tools. The main benefit of using IPA is the ability to generate networks which include the genes in your list along with those in the database. IPA takes as input a list of differentially expressed genes. The networks in IPA provide insight into how your genes of interest interact with pathways, such as gene regulatory networks. In addition to networks, IPA contains GO databases and information on already published interactions. For this work, the IPA core analysis, downstream regulator search, and upstream regulator search tools were used.


The Statistical Model

Most machine learning methods are employed to address one of two types of problems: regression and classification. Regression is any task where the output is a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ (Ian Goodfellow, 2016). Typically, regression is used to uncover trends in data and make predictions about future events. Linear regression is a special case of regression where you take a vector of inputs $X^T = (X_1, X_2, \ldots, X_p)$, where $X \in \mathbb{R}^p$, and predict an output $Y$ generated by a model:

$$\hat{Y} = \hat{B}_0 + \sum_{j=1}^{p} X_j \hat{B}_j \tag{1}$$

(Hastie et al., 2009). A classification is a type of model that discriminates between multiple categories or classes. Classification problems are specific tasks where the computer is asked to specify which of $k$ categories an input is from (Ian Goodfellow, 2016). This is accomplished by producing a function $f: \mathbb{R}^n \rightarrow \{1, \ldots, k\}$ (Ian Goodfellow, 2016). When $y = f(x)$, the model will assign an input described by a vector $x$ to a class $y$ (Ian Goodfellow, 2016). An example of a classification problem is the interpretation of handwritten digits. The classification model must process an image of the written digit and correctly classify it as a number from 0-9. Another example of a classification model is one used to determine whether a cat or a dog is shown in a given image.
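The two kinds of function can be contrasted in a few lines of code. The sketch below evaluates the linear prediction of equation (1) and then builds a toy classifier that maps an input vector to one of k class labels by choosing the highest linear score; the scoring rule, coefficients, and inputs are assumptions made for illustration and are not the model used in this project.

```python
def predict_linear(x, intercept, coefficients):
    """Equation (1): intercept plus the sum of input times coefficient."""
    return intercept + sum(x_j * b_j for x_j, b_j in zip(x, coefficients))

def classify(x, class_models):
    """Map x to a label in {1, ..., k} by picking the highest linear score."""
    scores = [predict_linear(x, b0, b) for b0, b in class_models]
    return scores.index(max(scores)) + 1

x = [2.0, -1.0]

# Regression: a real-valued output.
y_hat = predict_linear(x, intercept=0.5, coefficients=[1.5, 2.0])

# Classification: one (intercept, coefficients) pair per class, k = 3.
class_models = [(0.0, [1.0, 0.0]), (0.0, [0.0, 1.0]), (1.0, [-1.0, -1.0])]
label = classify(x, class_models)
print(y_hat, label)
```

The regression output can be any real number, while the classifier can only return one of the k labels, which is the distinction drawn between the two function signatures.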

Classification models can be well suited to RNA-seq analysis and gene expression studies because of our interest in classifying sets of genes. Due to the amount of data and the need for flexibility in the number of predicted outcomes, a neural network was chosen as the type of classification model to use. One benefit of using neural networks is that they have been shown to produce more accurate results when the amount of data is large (millions of comparisons) (Ian Goodfellow, 2016). They are also flexible in that they do not rely on binary outcomes, which are usually used in other more traditional machine learning algorithms (Ian Goodfellow, 2016). A neural network acts like an artificial brain that learns how to discriminate between classes, much like the way a person learns. As data are processed through the model, connections between nodes in the network change to better predict the desired outcome. For many neural networks, we assume that the correct outcome is known so that the network can be properly trained. A neural network model is constructed by taking the output from DESeq and the results from the different pathway analysis tools. This model is constructed to (1) be able to make predictions for future data developed by the lab, and (2) provide a more reliable source for downstream targets. In this analysis, we want to understand how the knock-out (ko) of PD-1 affects the gene expression pattern when compared with wild type (wt) mice. This allows us to include other sets of genes which may or may not be of interest, including genes, and genes that are not expressed (very low or close to zero counts).

This model is tuned using a “training, test, validation set approach” (Hastie et al., 2009). A portion of each dataset (Tamox Δ PD-1 and CD8 Δ PD-1) is partitioned into the training set, test set, and validation set. With the training set, the model is trained to determine which parameters work best. This model is then tested on the test set. Finally, the validation set is used to determine the performance of the best model. The training set is usually tuned using either cross validation or bootstrapping. Cross validation is a method used to tune a statistical model where data are split up into different sections, training on some sections and testing on the remainder. A common type of cross validation is k-fold cross validation, where a dataset is divided into k equal parts or folds. A model is trained on k - 1 folds and then tested on the remaining fold. This is repeated until each fold has been used for testing. On the other hand, the bootstrap method relies on the generation of pseudo datasets. Bootstrapping works in the following way: for a given dataset $X$ of size $l$, $K$ random samples are drawn from that dataset with replacement (each numbering $l$ in size). This means that the probability of seeing any particular observation is the same every time an observation is added to the pseudo dataset. Each such sample constitutes a bootstrap sample population. For example, if $X$ is made up of 100 observations, the new dataset created will also be made up of 100 observations, where each data point could be a repeated observation from the real dataset. This is then repeated many times, generating a lot of data that can be used to tune the model.
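Both resampling schemes are short enough to sketch directly. The code below uses the usual k-fold arrangement, in which each fold is held out once while the model is fit on the others, and draws bootstrap samples of the same size as the original data with replacement. The function names, the fold assignment by striding, and the fixed random seed are illustrative assumptions.

```python
import random

def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices), holding out each fold once."""
    indices = list(range(n_items))
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

def bootstrap_samples(data, n_samples, rng):
    """Draw n_samples pseudo datasets, each len(data) long, with replacement."""
    return [[rng.choice(data) for _ in data] for _ in range(n_samples)]

rng = random.Random(0)           # fixed seed so the sketch is repeatable
data = list(range(100))          # stand-in for 100 observations

splits = list(k_fold_splits(len(data), k=5))
pseudo_datasets = bootstrap_samples(data, n_samples=3, rng=rng)

for train, test in splits:
    print(len(train), len(test))
print([len(sample) for sample in pseudo_datasets])
```

Because sampling is with replacement, a bootstrap sample of 100 observations will typically repeat some observations and omit others, which is what makes each pseudo dataset different.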

RNA-sequencing data for each of the mice was processed using pipeline A (STAR aligner) (see Figure 1). The results were randomly assigned to one of 4 groups: a training set, a test set, a validation set, and finally the rest of the data. The training data was generated by taking the genes which were differentially expressed (adjusted p-value <0.05), genes which were shown to regulate PD-1 (found by pathway analysis), and genes from literature searches. We also included genes which are known to not be involved in PD-1 expression, cell cycle genes, and genes which were lowly expressed. This list comprises roughly 1,500 genes. This approach limits the possibility of overfitting.

These data are fed into a deep convolutional neural network (CNN) (Ian Goodfellow, 2016). CNNs are neural networks with a grid-like topology that employ a mathematical operation called convolution. This is a special type of linear operation on two functions ($f(x)$ and $g(x)$) that produces a third function showing how the two functions are related, for example $(f * g)(x)$ (Ian Goodfellow, 2016). A simplified case takes an input $x(t)$, whereas a two-dimensional convolution takes the form of:

$$S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(m, n) K(i - m, j - n) \tag{2}$$

(Ian Goodfellow, 2016). Data are randomly sampled from the training and test sets (like a bootstrap sample) (Efron & Tibshirani, 1993). These results are then tested using cross validation, which was used to tune the model. To carry this out, the sampled data are assigned to pseudo training and test sets. The process is repeated to generate 500 pseudo datasets for the model to learn from, which helps to produce a more accurate picture of the error rate. These pseudo datasets are considered bootstrap samples (Efron & Tibshirani, 1993). The cross entropy error is used as a proxy of model performance, given by the formula:

$$-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk} \tag{3}$$

(Hastie et al., 2009). In this equation, $\hat{p}_{mk}$ is the probability for a given class, $m$ is a given node or case node in the network, and $k$ runs over the $K$ distinct states of the network (Hastie et al., 2009).

Neural networks can use a process called gradient descent to help tune the model internally (Hastie et al., 2009). This process requires setting what is termed the “learning rate”. This parameter tells the program how small of a change in the model to make. In this analysis, a learning rate of 0.0001 was determined by using cross validation on the training set (Stefan Fritsch 2019) (Chollet, 2015). All networks are made of nodes (transformation function) and edges (bias, weight) (Ian Goodfellow, 2016). Nodes and edges can take on values or act like “gates”, letting information through or not. These “gates” are a combination of the weight and the transformation function. A weight is essentially a coefficient estimate that helps calculate the regression in a particular node to help tell the arrow where to point. In part, it helps determine how strong the connection with another neuron should be. An example of this can be seen in Figure 3, where the arrow points out of the neuron to a new space. The transformation function is often the logistic function; refer to Figure 3 (Ian Goodfellow, 2016). From this figure, the transformation function can be seen, where $b_o$ denotes the intercept and $\omega$ and $\chi$ are vectors that contain the weights and input values. Once the optimal model has been obtained, the model is then tested on the validation set, which is made up of a subset of genes not previously seen by the model.
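Equations (2) and (3) can be written out directly as a minimal sketch. The convolution below computes the "full" output, so every index pair for which the sum in equation (2) has at least one term is included; that output convention, the tiny matrices, and the function names are assumptions for illustration, and deep learning libraries implement this operation differently and far more efficiently.

```python
import math

def convolve2d(image, kernel):
    """Equation (2): S(i, j) = sum over m, n of I(m, n) * K(i - m, j - n)."""
    rows, cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = [[0.0] * (cols + k_cols - 1) for _ in range(rows + k_rows - 1)]
    for m in range(rows):
        for n in range(cols):
            for a in range(k_rows):      # a plays the role of i - m
                for b in range(k_cols):  # b plays the role of j - n
                    out[m + a][n + b] += image[m][n] * kernel[a][b]
    return out

def cross_entropy(probabilities):
    """Equation (3): minus the sum of p * log(p), skipping zero probabilities."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

image = [[1, 2], [3, 4]]    # toy input grid
kernel = [[0, 1], [1, 0]]   # toy kernel

feature_map = convolve2d(image, kernel)
uncertain = cross_entropy([0.5, 0.5])
certain = cross_entropy([1.0, 0.0])
print(feature_map, uncertain, certain)
```

The cross entropy is largest when the class probabilities are evenly split and falls to zero when one class has probability 1, which is why lower values indicate a more decisive model.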

Figure 3. Artificial neuron

An example of an artificial neuron for a convolutional neural network, generated for this project in Microsoft PowerPoint with ideas from Goodfellow et al. (Ian Goodfellow, 2016).
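The computation performed by a single artificial neuron with a logistic transformation function can be sketched as follows. The weights, inputs, and intercept are invented values, and the function names are assumptions; the sketch shows only the weighted sum followed by the logistic function.

```python
import math

def logistic(z):
    """Logistic transformation function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, intercept):
    """Weighted sum of the inputs plus the intercept, passed through the logistic."""
    z = intercept + sum(w * x for w, x in zip(weights, inputs))
    return logistic(z)

# Invented values: two inputs, two weights, and an intercept.
activation = neuron_output(inputs=[1.0, 2.0], weights=[2.0, 1.0], intercept=-1.0)
neutral = neuron_output(inputs=[0.0, 0.0], weights=[2.0, 1.0], intercept=0.0)
print(round(activation, 4), neutral)
```

A weighted sum of zero gives an output of exactly 0.5, and larger positive sums push the output toward 1, which is the "gate" behavior attributed to the weight and transformation function.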

In summary, information is fed into the network which can be a pixel, a piece of information, or in this case, a gene expression value. If the value is larger than some threshold (0.001, a value commonly used for this type of analysis), then a connection between nodes is made (Stefan Fritsch 2019) (Chollet, 2015). Through the process of forming these connections, the program “learns” how to discriminate between classes (Cre+ and Cre-), the same way it learns to tell the difference between the image of a cat or a dog when given images of many cats and dogs. The program associates specific features with dogs and other features with cats. All of this is managed by an algorithm. In this analysis, the algorithm used to train the neural network is resilient backpropagation with weight backtracking (rprop+) (Martin Riedmiller, 1993). Backpropagation uses the chain rule to calculate the derivatives (gradients) of the error value with respect to the weights of the last layer (Ian Goodfellow, 2016). These gradients are then used to calculate the gradients for the weights of the second to last layer (Ian Goodfellow, 2016). This process is repeated until the gradients for all layers have been calculated (Ian Goodfellow, 2016). The gradient, scaled by a step size, is then subtracted from the weight to reduce the overall error (Ian Goodfellow, 2016). Rprop+ is a type of backpropagation method that is used to update the weights on the connections between nodes in the network; it relies only on the sign of each gradient and adapts a separate step size for every weight. The Rprop+ algorithm has been reported to be faster and more accurate for text such as sequencing data (Martin Riedmiller, 1993). Our neural network comprises 10 layers of 20 nodes each, determined and tuned using cross validation (Figure 4). All of the code used to build the statistical model is written in R (Team, 2014). Visualization of the data is handled by the ggplot2 R package (Wickham, 2016). The neuralnet, tensorflow, and Keras R packages are used to construct the network (Stefan Fritsch 2019) (Mart et al., 2016) (Chollet, 2015).
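The rprop+ update rule can be illustrated with a small, self-contained sketch. It is not the neuralnet implementation: the function names, the default constants (1.2 and 0.5 for growing and shrinking the step), and the toy error surface are assumptions chosen for illustration.

```python
def sign(value):
    """Return -1, 0 or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def rprop_plus(gradient, weights, iterations=100, eta_plus=1.2, eta_minus=0.5,
               step_init=0.1, step_min=1e-6, step_max=50.0):
    """Minimise an error function with resilient backpropagation and weight backtracking.

    gradient: function returning the list of partial derivatives at the given weights.
    Only the sign of each derivative is used; every weight keeps its own step size.
    """
    weights = list(weights)
    n = len(weights)
    steps = [step_init] * n
    prev_grads = [0.0] * n
    prev_updates = [0.0] * n
    for _ in range(iterations):
        grads = gradient(weights)
        for i in range(n):
            product = prev_grads[i] * grads[i]
            if product > 0:
                # Same direction as last time: grow the step and keep going.
                steps[i] = min(steps[i] * eta_plus, step_max)
                prev_updates[i] = -sign(grads[i]) * steps[i]
                weights[i] += prev_updates[i]
                prev_grads[i] = grads[i]
            elif product < 0:
                # Sign flipped: the minimum was overshot, so shrink the step
                # and undo the previous update (weight backtracking).
                steps[i] = max(steps[i] * eta_minus, step_min)
                weights[i] -= prev_updates[i]
                prev_grads[i] = 0.0
            else:
                # First iteration or the iteration right after a backtrack.
                prev_updates[i] = -sign(grads[i]) * steps[i]
                weights[i] += prev_updates[i]
                prev_grads[i] = grads[i]
    return weights


# Toy error surface (hypothetical): E(w) = (w0 - 3)^2 + (w1 + 1)^2.
def toy_gradient(w):
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


fitted = rprop_plus(toy_gradient, [0.0, 0.0])
```

Because the magnitude of the gradient is ignored, the method is insensitive to the scale of the error surface, which is one reason it is often preferred over plain gradient descent with a fixed learning rate.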


Figure 4. The neural network architecture

This is the network schematic of the neural network used in this model generated using the neuralnet package. The inputs are the different columns in the datasets and the output is whether the gene is more important to Positive (Cre+) or Negative (Cre-).

The neural network model was then tested against several other traditional machine learning methods to get a better sense of its performance. This was done using the validation dataset. This comparison is normally conducted for an analysis like this, to ensure that our model performs at least as well as common approaches. The particular methods chosen for the comparison represent a broad range of widely used methods: Logistic Regression (LR), K-Nearest Neighbors (KNN), Naïve Bayes (NB), Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA). Logistic regression is very closely related to a linear model whereas the remaining methods are all Bayesian. LR was developed to be able to model the posterior probabilities of the $k$ classes by linear functions in $x$. In this case, the outcome for $Y$ is modeled as

$$Y = \begin{cases} 1 & \text{if Cre+} \\ 0 & \text{if Cre-} \end{cases}$$

resulting in the general form of LR shown in 4.1.

$$p(x) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \tag{4.1}$$

$$\log\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 X \tag{4.2}$$

(Hastie et al., 2009). Applying the logit transformation to 4.1 results in the log-odds form shown in 4.2. The log-odds or logit form provides the log of the odds that an observation is a member of class $k$. This was done using the R function glm in the stats package (Team, 2014).
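The relationship between the probability form and the log-odds form of logistic regression can be checked numerically. The sketch below is illustrative only; the coefficient values and function names are hypothetical and are not taken from the fitted glm model.

```python
import math


def logistic_probability(beta0, beta1, x):
    """Probability of the positive class for a single predictor x."""
    z = beta0 + beta1 * x
    return math.exp(z) / (1.0 + math.exp(z))


def log_odds(p):
    """Logit transformation: log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients and predictor value.
beta0, beta1 = -1.0, 0.8
x = 2.5
p = logistic_probability(beta0, beta1, x)
# Assign the class with the larger posterior probability.
predicted_class = "Cre+" if p >= 0.5 else "Cre-"
```

Taking the log-odds of `p` recovers the linear predictor `beta0 + beta1 * x`, which is why the coefficients of a logistic regression are interpreted on the log-odds scale.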

The remaining methods are Bayesian methods which calculate the probability of an event based on prior knowledge. The general case can be modeled using the following equations:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \tag{5.1}$$

$$P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{i=1}^{K} \pi_i f_i(x)} \tag{5.2}$$

(Hastie et al., 2009). In 5.1, we see the general form of Bayes theorem; however, the more useful definition is the form in 5.2. Here $P(Y = k \mid X = x)$ is the posterior probability that an observation with $X = x$ comes from the $k$th class, $f_k(x)$ is the probability density function for observations from the $k$th class, and $\pi_k$ is the prior probability that an observation comes from the $k$th class. $Y$ is the response variable and can take on the values of Cre+ and Cre-. $f_k(x)$ is large when there is a high probability that an observation in the $k$th class has $X \approx x$, whereas $f_k(x)$ is small when that probability is low. Essentially, a class whose observations typically resemble $x$ contributes a large value of $f_k(x)$ and is favored in 5.2, while a class under which $x$ is unlikely contributes a small value of $f_k(x)$.
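A small numerical sketch of Bayes theorem for two classes follows. It assumes univariate Gaussian class densities with hypothetical means, standard deviations, and priors; none of these values come from the thesis data.

```python
import math


def gaussian_density(x, mean, sd):
    """Density of a univariate normal distribution evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def posterior_probabilities(x, priors, densities):
    """Posterior for each class: prior times density, divided by the sum over all classes."""
    weighted = {k: priors[k] * densities[k](x) for k in priors}
    total = sum(weighted.values())
    return {k: value / total for k, value in weighted.items()}


# Hypothetical priors and class densities on a log-expression scale.
priors = {"Cre+": 0.5, "Cre-": 0.5}
densities = {
    "Cre+": lambda x: gaussian_density(x, 6.0, 1.0),
    "Cre-": lambda x: gaussian_density(x, 4.0, 1.0),
}
posterior = posterior_probabilities(5.5, priors, densities)
```

The observation at 5.5 lies closer to the Cre+ mean, so the Cre+ density is larger there and the posterior favors Cre+.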

KNN approximates the Bayes classifier by estimating the conditional class probabilities directly from the data. Given a value of $K$ and a test observation $x_0$, the algorithm picks the $K$ points closest to $x_0$, represented by $N_0$. The algorithm then estimates the conditional probability for class $j$ as the fraction of points in $N_0$ whose outcome is $j$. This is given by the formula:

$$P(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j) \tag{6}$$

(Hastie et al., 2009). For this analysis, the values of $K$ are 10, 20, 150, and 300. These were determined using cross validation, and the method was coded using the R function knn in the package class (Ripley, 2002a).
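The KNN estimate of the class probabilities can be written out directly. The sketch below uses squared Euclidean distance and hypothetical toy data; it is meant to illustrate the formula, not to reproduce the knn function from the class package.

```python
def knn_class_probabilities(train_points, train_labels, x0, k):
    """Estimate P(Y = j | X = x0) as the fraction of the k nearest neighbours with label j."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(point, x0)), label)
        for point, label in zip(train_points, train_labels)
    )
    neighbours = [label for _, label in distances[:k]]
    return {label: neighbours.count(label) / k for label in set(train_labels)}


# Hypothetical two-feature training data.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
labels = ["Cre-", "Cre-", "Cre-", "Cre+", "Cre+", "Cre+"]
probabilities = knn_class_probabilities(points, labels, (1.0, 1.0), 3)
```

Small values of k follow the local structure of the data closely, while large values average over more neighbours and give smoother estimates, which is why several values are usually compared by cross validation.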

Unlike KNN, Naïve Bayes does not require a neighborhood size to be specified. NB is particularly applicable when the feature space is large, because in those cases, density estimation is impractical. NB makes a simplifying assumption: for a given class $G = j$, the features $X_k$ are independent (Hastie et al., 2009). Usually, the features are not actually independent. However, assuming that they are independent makes the class densities easier to derive. In this case, NB takes the form:

$$f_j(X) = \prod_{k=1}^{p} f_{jk}(X_k) \tag{7.1}$$

(Hastie et al., 2009). This has some other advantages as well. For example, the class marginal densities $f_{jk}$ can be estimated independently using one dimensional kernel density estimates. This means that if you have many $\beta$'s you can parallelize the computation of them, which saves a lot of time. You can also use a univariate Gaussian when representing the margins, which is also efficient. Applying the same logit transformation as LR, you can obtain NB in the form of:

$$\alpha_l + \sum_{k=1}^{p} g_{lk}(X_k) \tag{7.2}$$

(Hastie et al., 2009). The benefit of this form is that you end up with an estimate of the log odds, which is easier to manage and interpret.
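The additive structure of the Naïve Bayes log-odds can be illustrated with univariate Gaussian margins. This is a sketch under that assumption; the function names and parameter values are hypothetical.

```python
import math


def gaussian_log_density(x, mean, sd):
    """Log density of a univariate normal distribution evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * sd ** 2) - (x - mean) ** 2 / (2.0 * sd ** 2)


def naive_bayes_log_density(features, means, sds):
    """Independence assumption: the joint log density is the sum of the marginal log densities."""
    return sum(gaussian_log_density(x, m, s) for x, m, s in zip(features, means, sds))


def naive_bayes_log_odds(features, prior_a, params_a, prior_b, params_b):
    """Log odds of class a versus class b: a constant term plus one term per feature."""
    return (math.log(prior_a / prior_b)
            + naive_bayes_log_density(features, *params_a)
            - naive_bayes_log_density(features, *params_b))


# Hypothetical per-class (means, standard deviations) for two features.
cre_pos = ([6.0, 6.0], [1.0, 1.0])
cre_neg = ([4.0, 4.0], [1.0, 1.0])
log_odds_value = naive_bayes_log_odds([6.0, 6.0], 0.5, cre_pos, 0.5, cre_neg)
```

A positive value favors the first class. Because each feature contributes its own term to the sum, the terms can be computed independently of one another.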

LDA and QDA are very similar methods but perform better under different situations. LDA works best when the decision boundary is linear, because LDA models each class with a Gaussian density and assumes that the different classes have a common covariance matrix, $\Sigma_k = \Sigma \;\forall k$ (Hastie et al., 2009). In this case, we have two classes: Cre+ denoted by $k$ and Cre- denoted by $l$. We can use the log-odds to show how the classes compare:

$$\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \log\frac{\pi_k}{\pi_l} - \frac{1}{2}(\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l) + x^T \Sigma^{-1} (\mu_k - \mu_l) \tag{8.1}$$

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k \tag{8.2}$$

(Hastie et al., 2009). This results in an equation which is linear in $x$. More importantly, this indicates that the decision boundary between $k$ (Cre+) and $l$ (Cre-), defined by $\Pr(G = k \mid X = x) = \Pr(G = l \mid X = x)$, is linear in $x$, in $p$ dimensions. QDA takes a form (9) very similar to (8.2) except that the decision boundary is not expected to be linear. If we assume that the $\Sigma_k$ are not the same, then the cancellations that we get in 8.1 do not happen, leaving the quadratic components in $x$. From here we can derive QDA:

$$\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k \tag{9}$$

(Hastie et al., 2009). The decision boundary between the two classes $k$ (Cre+) and $l$ (Cre-) can be described by the quadratic equation $\{x : \delta_k(x) = \delta_l(x)\}$ (Hastie et al., 2009). This should allow for the addition of more classes in the future without a loss in productivity. LDA and QDA are run using the lda and qda functions from the MASS R package (Ripley, 2002b). These methods, taken in conjunction with our model, provide a broad benchmark for machine learning algorithms applied to this dataset.
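The difference between the linear and quadratic discriminant functions is easiest to see with a single predictor, where the covariance matrix reduces to a variance. The sketch below makes that simplification; the means, variances, and priors are hypothetical.

```python
import math


def lda_score(x, mean, shared_variance, prior):
    """Univariate linear discriminant: linear in x because the variance is shared."""
    return x * mean / shared_variance - mean ** 2 / (2.0 * shared_variance) + math.log(prior)


def qda_score(x, mean, variance, prior):
    """Univariate quadratic discriminant: each class keeps its own variance."""
    return -0.5 * math.log(variance) - (x - mean) ** 2 / (2.0 * variance) + math.log(prior)


def classify(x, scores):
    """Assign x to the class with the largest discriminant score."""
    return max(scores, key=lambda label: scores[label](x))


# Hypothetical class parameters on a log-expression scale.
lda_scores = {
    "Cre+": lambda x: lda_score(x, 6.0, 1.0, 0.5),
    "Cre-": lambda x: lda_score(x, 4.0, 1.0, 0.5),
}
qda_scores = {
    "Cre+": lambda x: qda_score(x, 6.0, 4.0, 0.5),
    "Cre-": lambda x: qda_score(x, 4.0, 0.25, 0.5),
}
```

With these values, LDA places a single boundary halfway between the two means. QDA assigns points near 4 to Cre- but assigns a distant point such as 0 to Cre+, because the Cre+ class has the larger variance, so the decision region is no longer split by a single point.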


Chapter III.

Results

Preliminary analyses were carried out on both the Tamox Δ PD-1 and CD8 Δ PD-1 datasets. This helps to determine what types of statistical tools can be used to answer the questions asked. These analyses also provide a clearer picture of the data as a whole. The average fastq file size (raw files from the sequencing facility) for CD8 Δ PD-1 is around 1.8GB whereas for Tamox Δ PD-1 it is roughly 500MB, which is approximately 3.6 times smaller. If all things were equal, the files would be expected to be similar in size. This difference is also evident in the median number of expressed reads, where CD8 Δ PD-1 has 11,289,353 reads and Tamox Δ PD-1 has 3,470,463 reads. A breakdown of the average number of reads along with summary statistics for each file can be found in Supplementary Table 1 (Tamox Δ PD-1) and Supplementary Table 2 (CD8 Δ PD-1). This matters because we want to be able to determine whether the inducible deletion of PD-1 (Tamox Δ PD-1) makes an actual difference in terms of gene expression when compared with PD-1 that is permanently deleted on CD8 cells. The differences in number of reads and file size may be due to technical artifacts, such as insufficient input RNA leading to too few reads in the inducible deletion dataset, rather than a difference in biology, which makes it hard to compare the two datasets.

Next, we visually examined the two subsets for Tamox Δ PD-1 and CD8 Δ PD-1. The purpose of this is to get a sense for how comparable the two experiments are. In one case, PD-1 is permanently deleted on CD8 cells (CD8 Δ PD-1) whereas in the other experiment, PD-1 is deleted from 7-11 days (Tamox Δ PD-1). In order to determine what effects this might have, it is important to know whether the data are similar enough to make a fair comparison. A few simple analyses were conducted to determine this. First, we looked at boxplots of normalized expression values for the two experiments (Figure 2). Figures 2A and 2B show boxplots for the Cre+/Cre- subsets of the two experiments. They indicate that the distributions are comparable for the different subsets (Figures 2A and 2B). To make the data easier to visualize, they were log-transformed (Figure 5 and Supplemental Figure 1). Without the log transformation, such as in the histograms found in Figure 5, the data look very skewed, which makes it difficult to discern any noticeable differences in the distributions. In both cases, the boxplots show that the Cre+ subset has a higher average expression (Figures 2A and 2B). In both the Tamox Δ PD-1 and CD8 Δ PD-1 datasets, the means of the log-transformed samples are between 2 and 6 reads (Figures 2A and 2B). In addition, the CD8 Δ PD-1 dataset has an overall higher average expression for both subsets, Cre+ (6 reads) and Cre- (5.4 reads) (Figures 2A and 2B). For Tamox Δ PD-1, however, Cre+ has a mean value of roughly 3.5 reads and Cre- has a mean of 3.8 reads, indicating that they are comparable (Figures 2A and 2B).


Figure 5. Histograms of expression values for both Tamox Δ PD-1 and CD8 Δ PD-1

A: Histogram of CD8 Δ PD-1 data using 500 bins. B: Histogram of Tamox Δ PD-1 data using 500 bins.

Then, principal component analysis was used to determine whether the variance in the dataset is spread homogeneously across different dimensions. If it is, the data would be difficult to model. In an ideal case, the first few (first 3) principal components would contain 60% or more of the total variance in the dataset. In fact, many classical methods, such as clustering and logistic regression, work best when the variance is concentrated in the first few principal components. If the first few principal components do not account for the majority of the variance, then it would likely point to the predictors being independent, making the data difficult, if not impossible, to model. Figures 2C and 2D indicate that a large amount (>60%) of the variance for both datasets is attributed to principal component one. This indicates that we will likely be able to fit a model which can describe the data. Concentrated variance in the first few principal components, as shown in Figures 2C and 2D, is what would be expected unless there was something wrong with the data. A visual inspection of the untransformed histograms in Figure 5 shows that the data are skewed towards the right (most values are on the left with larger values on the right) in both datasets (Tamox Δ PD-1 and CD8 Δ PD-1). These results are consistent with other gene expression datasets for mice and humans (Trapnell et al., 2012) (Park, Yoon, Kim, & Kim, 2019).
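The share of variance captured by the first principal component can be computed from the covariance matrix of the data. The sketch below is a dependency-free illustration that uses power iteration to find the largest eigenvalue; the function names and the toy data are hypothetical, and in practice a dedicated PCA routine would be used instead.

```python
def covariance_matrix(rows):
    """Sample covariance matrix of the columns of rows (a list of equal-length lists)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
            for a in range(p)]


def top_eigenvalue(matrix, iterations=500):
    """Largest eigenvalue of a symmetric matrix, approximated by power iteration."""
    p = len(matrix)
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    mv = [sum(matrix[a][b] * v[b] for b in range(p)) for a in range(p)]
    # Rayleigh quotient of the (unit-length) approximate eigenvector.
    return sum(v[a] * mv[a] for a in range(p))


def first_pc_variance_share(rows):
    """Fraction of the total variance explained by the first principal component."""
    cov = covariance_matrix(rows)
    total_variance = sum(cov[a][a] for a in range(len(cov)))
    return top_eigenvalue(cov) / total_variance


# Hypothetical data: three strongly correlated columns with a little deterministic noise.
toy_rows = [[i, 2 * i + (-1) ** i * 0.1, 0.5 * i + ((i % 3) - 1) * 0.1] for i in range(20)]
share = first_pc_variance_share(toy_rows)
```

A share above 0.6 corresponds to the situation described above, where most of the variance lies along a single direction.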

Figure 6. t-SNE plot for CD8 Δ PD-1

The plot shows the first two t-SNE dimensions for the CD8 Δ PD-1 experiment. Genes are colored based on read count. For example, genes with more than 10,000 reads are colored orange.

Next, we wanted to determine whether there were any obvious patterns in the data which may be non-linear. t-SNE was used for this because, although PCA is great at describing areas of large variance, t-SNE is an excellent tool for examining closely related points and/or non-linear patterns. The t-SNE results indicate that there are no obvious clusters which separate out based on gene expression values (see Figure 6). Figure 6 shows several protrusions from the main cluster body of reads, but nothing that separates completely from the main cluster. We also colored the data based on the gene expression level. The results, however, do not indicate any real separation or clustering by expression, so the lack of distinguishable clusters is evident even when the data are binned by count value. For our data, this implies that a simple statistical model, such as one based on a logistic function, would have a difficult time classifying the data.

RNA Sequencing analysis

The results of the FastQC program indicate that there are issues with the datasets (CD8 Δ PD-1 and Tamox Δ PD-1). The issues found for CD8 Δ PD-1 are with kmer content, overrepresented sequences, sequence duplication levels, per sequence GC content, and per base sequence content. These results were mirrored in the Tamox Δ PD-1 dataset. Many of these tests failed or threw warnings, as indicated in the results file provided by FastQC, which reports a result of pass/fail/warning for each test. An example of a test that failed can be seen in Figure 7. Figure 7A shows the per base sequence content; in this case, it shows that we have bias at a given base position, indicating a biased library or contamination. Ideally, we would expect all of the bases (represented by colored lines) to follow straight lines without peaks or fluctuations. According to Andrews et al., this is only a significant problem (where you would have to re-do the experiment) if the peak goes up to 80 (Andrews). The peaks in this dataset reach a height of 50. In Figure 7B, the per sequence GC content indicates that there is likely contamination in the library. This is because the two lines in the plot are separated from each other at the top. In an ideal example, the lines would be almost indistinguishable from one another with no visible separation. Figure 7C shows the sequence duplication levels, which measure how unique the sequences in the library are (Andrews). In these experiments, there are visible peaks (Figure 7C) whereas in an ideal case, these should be flat lines (Andrews). For this sample, the results show that roughly 12% of the library comes from duplicated sequences. The peaks (blue on top of red) at “>10%” in the plot indicate the percentage of sequence duplicates. Figures 7D-F again show sequence duplication, indicating which sequences are overrepresented. Figure 7D shows bias for sequences at specific locations (Andrews), in this case, from positions 33 to 43. All of these results point to issues with either the library prep, the sequencing, or contamination (Figure 7). The base quality, however, is high and it is therefore still prudent to continue with the analysis. This is because base quality is the single most important measure for determining how trustworthy the results from the sequencer are. Examination of the results indicates that all of the samples for both the CD8 Δ PD-1 and Tamox Δ PD-1 experiments have high quality reads. An example from Tamox Δ PD-1 can be seen in Figure 8, where the blue lines and bars are entirely in the green section of the plot. This is indicative of good quality reads. There are no yellow bars which protrude into the orange or red sections, which would indicate poor quality reads.


Figure 7. QC failures from FastQC report for AC1449 from the CD8 Δ PD-1 dataset

This is an example which is representative of the failures found during the QC analysis. These results come from sample AC1449. A: Per base sequence content. B: Per sequence GC content. C: Sequence duplication levels. D: Kmer content. E: Kmer content table results. F: Overrepresented sequences.


Figure 8. FastQC results from sample TILs 9 PD-1 from Tamox Δ PD-1

Representation of FastQC results for the experiment from sample TILs 9 PD-1. These files are QC files for the fastq results. The colors indicate the quality at a given position. Green indicates good, yellow is poor, and red is bad. Notice that none of the fastq files contained reads which were in the red section.

Once the read files were examined for quality using FastQC, three pipelines were compared using various benchmarks. This process provides a mechanism to develop and configure the optimal RNA-sequencing pipeline. The results of running each pipeline show that there is not a complete overlap between the methods (Figure 9). Experiment results analyzed with the different pipelines would not result in exactly the same findings, since each of the pipelines employs slightly different algorithms (see Materials and Methods), so this result is to be expected. Pipeline B had the lowest overlap with each of the other methods (see Figure 9), which makes sense as it is the only method that maps to the transcriptome as opposed to the genome. Pipeline A and Pipeline C show the most overlap when benchmarked, with 52 out of 100 genes being shared for the Tamox Δ PD-1 dataset and 82 out of 100 genes shared for the CD8 Δ PD-1 dataset (see Figure 9). This discrepancy between the two experiments could be caused by issues with the inducible dataset (Tamox Δ PD-1). Tamoxifen was administered over a 5-day period, which resulted in roughly 50% of the cells having the deletion of PD-1. This means that there are cells taken from PD-1 mice which still contain PD-1. To make the analysis more straightforward, the data from the samples that still contained PD-1 after the deletion were removed. Another form of benchmarking was done by taking a list of genes with known expression patterns for use as the gold standard (indicator genes) (see Table 3). This allows us to compare the three pipelines against the known results (Table 3). The results of these benchmark tests indicate that Pipeline A best matches the lab's expectations and what was previously known by the lab (refer to Tables 4 and 5).
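The overlap benchmark amounts to intersecting the top-ranked gene lists produced by two pipelines. The following sketch shows the idea; the function name and the short gene lists are hypothetical, and the analysis itself compared the top 100 genes.

```python
def top_n_overlap(ranked_a, ranked_b, n=100):
    """Count and list the genes shared by the top n entries of two ranked gene lists."""
    top_a = set(ranked_a[:n])
    top_b = set(ranked_b[:n])
    shared = top_a & top_b
    return len(shared), sorted(shared)


# Hypothetical rankings from two pipelines (most significant gene first).
pipeline_a_ranking = ["Tcf7", "Lag3", "Tigit", "Cd160"]
pipeline_c_ranking = ["Lag3", "Gzmb", "Tcf7", "Havcr2"]
shared_count, shared_genes = top_n_overlap(pipeline_a_ranking, pipeline_c_ranking, n=3)
```

Reporting the shared count out of n gives figures directly comparable to those shown in the Venn diagrams.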

Table 3. Genes provided by the Sharpe lab used for quality control (gold standard)

Gene      PD1-    PD1+
Cd160     DOWN    UP
Lag3      DOWN    UP
Tigit     DOWN    UP
Tcf7      UP      DOWN
Havcr2    DOWN    UP
Gzmb      UP      UP

Benchmark genes used to determine the performance of the different pipelines. These genes had known gene expression patterns when PD-1 is present versus when PD-1 is deleted.


Figure 9. Overlap of top 100 Genes for both datasets

Venn Diagrams of overlap for the different pipelines used. A: The overlap of the top 100 genes for the Tamox Δ PD-1 dataset. B: The overlap of the top 100 genes for the CD8 Δ PD-1 dataset.

With the output from DESeq, it is possible to show how different groups in the Tamox Δ PD-1 experiment compare by plotting this output using PCA. In Figure 10, it is hard to see a difference between the Cre+ and Cre- samples (mice). If the data are broken up into pairwise comparisons, as in Figures 11A, B and C, it becomes possible to gain a better understanding of how the samples compare with one another. This allows us to see all the different combinations of data we have for the Tamox Δ PD-1 dataset. We can then focus on samples which were Cre+ PD-1- vs Cre- PD-1+ (Figure 11A). Overall, Figure 11A and Figure 12B show good separation between PD-1+ and PD-1- mice for both the Tamox Δ PD-1 and CD8 Δ PD-1 datasets.


Figure 10. Principal Component analysis of Tamox Δ PD-1

Plot of the first two principal components with all groups remaining. These are data from the Tamox Δ PD-1 experiment. The blue are Cre+ whereas red are Cre-.


Figure 11. Principal Component analysis of Tamox Δ PD-1

A: is a plot of samples of Cre+PD1- vs Cre-PD1+. B: is a plot of samples of Cre+PD1- vs Cre+PD1+. C: is a plot of samples of Cre-PD1- vs Cre-PD1+. The blue are Cre+ whereas the red are Cre-.

Figure 12. Principal Component analysis of CD8 Δ PD-1 Dataset

A: Plot of the first two principal components with all groups remaining. B: The data after misclassified groups were removed. The blue are Cre+ and the red are Cre-.


Several potential issues could explain the clustering we see in the PCA plots. One reason why replicates (Figure 10) might be misclustered is the inability to monitor the deletion in some of the samples, specifically the samples where Tamoxifen (Tamox Δ PD-1) was added for 5 days with the intent to delete PD-1 in about 50% of the cells. This was done in order to compare PD-1 expressing T-cells and PD-1 deleted T-cells within the same tissue. This means that PD-1 could be present even though the samples were Cre+. For example, sample 1 “TILs 1 PD1+” has a Cre status of +. Cre+ should indicate that PD-1 was knocked out on CD8 cells; however, in this case, the sample was still PD-1 positive. Figure 11 allows us to see if there is a difference between experimental conditions (knockout vs wildtype) in the Cre+ samples. We are also able to examine the differences between the knockout and wildtype (PD-1+) samples. Once the groups were broken up, there are clear boundaries separating the Cre+ from Cre- samples. This makes it possible to see if similar experiments cluster together and if they are able to separate from other groups, such as Cre+ from Cre-. Figure 11A depicts Cre+PD1- vs. Cre-PD1+. In this panel, there is clear separation between the knockout (PD1-) and the wildtype (PD1+). The same is true for Figures 11B and 11C, where the different groups each have clear separation. Figure 12 shows the CD8 Δ PD-1 dataset. Figure 12A shows the PCA results where all the groups are included. It is possible to see that a few of the samples do not cluster with the correct group, such as mouse AD3354. Figure 12B shows the results of the PCA with the mis-clustered groups removed. Once this was done, the data became easily separable into different clusters. This was possible because of the number of replicates in this dataset (most analyses are limited to 3 replicates per condition) (Korpelainen, Tuimala, Somervuo, Huss, & Wong, 2015).

Tables 4 and 5 show the median expression for the list of benchmarked genes. Table 4 specifically shows the results from the Tamox Δ PD-1 experiment. These genes represent a set of known gene expression patterns for which the lab knew what to expect when PD-1 has been knocked out. Looking at the resulting count files (out.sorted.counts files from HTSeq Count), the average size of files from CD8 Δ PD-1 is 1.4MB, which is almost three times as large as the count files from Tamox Δ PD-1 (0.5MB). This will likely cause issues for the analysis due to the poor read depth of Tamox Δ PD-1. Poor read depth (sequencing depth) makes it difficult to properly gauge differential gene expression. Ideally, both experiments would have the same read depth, at a level around 20 million reads per sample (Korpelainen et al., 2015). This is not the case for CD8 Δ PD-1 (11,289,353 reads) and Tamox Δ PD-1 (3,470,463 reads). The overall trend in counts appears to be similar for the different methods (pipelines). The gene Tigit contains 0 counts for CLC; this is likely due to the way CLC counts are read. Table 5 shows results from CD8 Δ PD-1, which does have counts above zero for the CLC pipeline. The results for this experiment had higher read counts, as indicated by a larger average read file size. The results from Tables 4 and 5 indicate that STAR (pipeline A) more closely aligns with what the lab would expect given past experiments using CLC (pipeline C).


Table 4. Tamox Δ PD-1: Results of differential expression using list of genes provided by Sharpe lab

Median value
Pipeline  Condition  Cd160    Lag3    Tigit   Tcf7  Havcr2  Gzmb
CLC       PD1-       78       16      0       69    57      1028
CLC       PD1+       216      58      0       3     583     3742
STAR      PD1-       70       15      0       238   57      1025
STAR      PD1+       216      54      28      57    538     3711
Salmon    PD1-       76       15.374  0       238   10      1007
Salmon    PD1+       159.834  43.368  19.995  56    334     3566

The values are the average gene expression values for PD-1+ vs PD-1- for the Tamox Δ PD-1 dataset. The genes listed are based on a set of known gene expression patterns used to gauge method performance (Gold Standard).

Table 5. CD8 Δ PD-1: Results of differential expression using list of genes provided by Sharpe lab

Median value
Pipeline  Condition  Cd160    Lag3     Tigit   Tcf7  Havcr2  Gzmb
CLC       Cre-       178      40       3       1     395     2398
CLC       Cre+       336      157      17      2     1312    6701
STAR      Cre-       175      42       41      526   373     2358
STAR      Cre+       333      138      144     483   1298    6581
Salmon    Cre-       164.421  31.247   11.974  414   0       2208
Salmon    Cre+       275.517  110.832  45.458  370   57.751  6342.999

The values are the median gene expression values for PD-1+ vs PD-1- for CD8 Δ PD-1. These values were compiled from the three pipelines. The genes listed are based on a set of known gene expression patterns used to gauge method performance (Gold Standard).


Table 6 shows the results of DESeq for the benchmarked list of genes. DESeq provides information on the average counts per gene, the log2 fold change (log2FoldChange), the uncertainty associated with the fold change (lfcSE), the Wald statistic, the p-value, and the corrected p-value (adj-pvalue) (Love et al., 2014) (Ignatiadis, Klaus, Zaugg, & Huber, 2016). The Wald statistic is the ratio between the log2 fold change and its standard error, and the p-value is calculated from the Wald statistic (Love et al., 2014). These statistics will be fed into the neural network to aid in training the model, which can then be used on the full datasets. For the indicator genes (gold standard), only Tcf7 was still significant when looking at the adjusted p-value. Average counts for each of the genes were high, and therefore it is possible to make comparisons to aid in the benchmarking.
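The relationship between the fold change, its standard error, the Wald statistic, and the p-value can be sketched as follows. The sketch assumes a two-sided test against a standard normal distribution, as described for the Wald test by Love et al. (2014); the function name and the example numbers are hypothetical and do not come from Table 6.

```python
import math


def wald_test(log2_fold_change, lfc_se):
    """Wald statistic and two-sided p-value under a standard normal reference distribution."""
    statistic = log2_fold_change / lfc_se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(statistic) / math.sqrt(2.0))
    return statistic, p_value


# Hypothetical gene: log2 fold change of 1.2 with a standard error of 0.4.
statistic, p_value = wald_test(1.2, 0.4)
```

A large standard error shrinks the statistic towards zero, so a gene with a sizeable fold change can still be non-significant when its counts are noisy.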

Table 6. DESeq result for indicator genes

CD8 Δ PD-1 deletion
Gene Name  Avg counts in Cre+  Avg counts in Cre-  p-value  adj-pvalue
Cd160      353.92              159.25              0.24     0.96
Tcf7       430.5               543.92              0        0.04
Lag3       154.5               45.67               0.08     0.96
Tigit      162.92              45.92               0.36     0.96
HavcR2     1498.92             452.92              0.3      0.96
Gzmb       6795.08             2726.58             0.98     1

Tamox Δ PD-1 deletion
Gene Name  Avg counts in Cre+  Avg counts in Cre-  p-value  adj-pvalue
Cd160      194.14              106                 0.13     1
Tcf7       133.21              210.5               0.01     1
Lag3       58.71               32.64               0.14     1
Tigit      20.93               22.36               0.77     1
HavcR2     520.14              363.36              0.46     1
Gzmb       3381.07             1582.64             0        0.31

Table containing the results of running DESeq and extracting the output for the indicator genes used in the quality control assessment.

Table 7 shows the list of the top differentially expressed genes that are highly expressed in Cre+ and in Cre-. There are more genes highly expressed in Cre+ for CD8 Δ PD-1, which makes sense given that in this dataset PD-1 is deleted on all cells. These genes were then used in the pathway analysis to determine the gene expression landscape.

Table 7. Top differentially expressed genes for CD8 Δ PD-1 and Tamox Δ PD-1

A) CD8 Δ PD-1
Genes UP in Cre+ / Genes UP in Cre-
Cmah Zfp326 Ssh2 Hnrnpab Ppp6r2 Mme Gzmc Rcor3 Slamf7 Gramd1a Map11 Sfmbt2 Gpr55 Hivep1 Nlk mt-Co1 Ccr5 Gm37975 Ifitm1 Actn1 Kmt5b Fam120a Slc16a6 Gm9752 Ccr9 Lef1 Erbin mt-Nd6 Ccl5 AC148016.1 Sell Klf7 mt-Nd2 Papola Il12rb1 Cacna2d4 Gzma Zfp36l2 Txk Ahnak Gzmk Fbxo17 Klhl9 Ly6c2 Satb1 H2-K1 Arntl Sbsn Klf4 mt-Nd5 Ybx3 Trac Adgrg1 Tet1 Tcf7 Il7r Il18r1 Nfatc1 Syt11 Gm5525 Sptbn1 Ncf4 Thg1l Nkg7 Tox Ly6a2 Rpl29 Osbpl3 Rasd2 Pde4d Ppp2r2c Gm13653 Jaml Gm47586 Galm Cd3g Igkc Gm4945 Adrb1 Osgin1 Glp1r Nr4a2 Maf AC124578.3 Tmprss6 AC135016.1 1810053B23Rik Mrgpre 1700017B05Rik Gm16303 Gm5857 Gm12229 Gm45873 E330021D16Rik Prpf18 Rapgef4 Coch Gm6652 Gm48383 Itga4 Cmah

B) Tamox Δ PD-1
Genes UP in Cre+ / Genes UP in Cre-
Mxd1 Mpg Gm20324 Gm38236 Upk1a Zfp758 Tcrg-C4 Klra6 Tle1 Gorab Cybb Tgfbi Slirp Gm44091 Gm11346 Gstm5 Csf1r Atf3 Gm4430 AC133451.2 Gm42722 Prss36 Mesd Ppif Ncmap Gm47643 Trav21-dv12 Gm2238 Sparc Ift27 Zc3h12b Gm24924 Lipo3 Nucb2 Ms4a7 C1qa S1pr5 Auts2 Cxcl11 Gm25265 Galc Apoe Fam209 Tnfrsf26 1810062G17Rik Zfp456 Fcgr4 H2-Aa Gm48570 Gm8251 Cela1 Thumpd2 Sirpa Cd74 Cbx2 Klra13-ps Gm20681 Gm48998 H2-DMb1 H2-Ab1 H2-Eb1 Fyco1 Rpl35

The top differentially expressed genes for Cre+ and Cre- of the CD8 Δ PD-1 and Tamox Δ PD-1 datasets.


Figure 13 shows how the top differentially expressed genes differ for the two experiments. Z-score values are used to generate the heat map, which normalizes the counts to make it easier to compare the experiments. A z-score is a distance measure which indicates the number of standard deviations a data point is away from the mean. The rows and columns have also been clustered by hierarchical clustering (which is the default for the heatmap function in R). Figure 13A shows the results for Cre+ vs Cre- subsets for the CD8 Δ PD-1 dataset. We can see a clear difference between the two groups, indicating that it should be possible to model the differences between Cre+ and Cre-. Figure 13B shows a similar result for Tamox Δ PD-1. The data in the Tamox Δ PD-1 experiment appear to be more variable than the data in CD8 Δ PD-1, which is expected based on earlier findings. In this case, we also have information on instances where the experiment did not knock out PD-1, or when it was not expected to be knocked out, yet PD-1 was not present. The benchmark tests, including the comparison of gene expression against the list of known expression patterns, the percent overlap shown by the Venn diagrams, the heat maps, and the PCA results after running DESeq, were all used to determine that the results from pipeline A will be used for the remainder of this analysis.
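The z-score normalization used for the heat map can be illustrated for a single gene (one row of counts). The sketch below is a minimal example with hypothetical values; it uses the sample standard deviation.

```python
import statistics


def row_z_scores(counts):
    """Convert one gene's counts across samples to z-scores (mean 0, standard deviation 1)."""
    mean = statistics.mean(counts)
    sd = statistics.stdev(counts)
    if sd == 0:
        # A gene with identical counts in every sample carries no contrast.
        return [0.0 for _ in counts]
    return [(value - mean) / sd for value in counts]


# Hypothetical counts for one gene in three samples.
z_scores = row_z_scores([2.0, 4.0, 6.0])
```

After this transformation, a highly expressed gene and a lowly expressed gene are displayed on the same scale, so the heat map reflects relative differences between samples instead of absolute abundance.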


Figure 13. Heatmap of the top differentially expressed genes.

A: A heatmap taken from the list of the top differentially expressed genes at a cut-off of 0.05 for CD8 Δ PD-1, showing the comparison of Cre+ vs Cre-. B: A heatmap taken from the list of the top differentially expressed genes at a cut-off of 0.05 for Tamox Δ PD-1, showing results from Cre+PD1+, Cre-PD1-, Cre+PD1+, Cre-PD1+.

Pathway Analysis

Pathway analysis was performed on the two datasets using four different tools: GSEA, Panther, Enrichr, and IPA. The first pathway analysis tool was the Broad Institute’s GSEA program. It was chosen in part because the lab has used it and contributed to building it in the past, and because it includes immune-centric databases. It is also one of the most widely used and respected pathway analysis programs available today. The GSEA databases used were Hallmark, C2, C5, and C7 (Liberzon et al., 2011) (Liberzon et al., 2015). This tool was applied to both the Tamox Δ PD-1 and CD8 Δ PD-1 datasets using the Cre+ and Cre- subsets. The top GO terms from this analysis are displayed in Table 8, which is broken up by database and experiment, with the results of Tamox Δ PD-1 on the left and CD8 Δ PD-1 on the right. In the Tamox Δ PD-1 dataset, the Hallmark database produced more GS (Gene Symbol) terms enriched for Cre- (20) than for Cre+, which had only one enriched term (see Table 8). In contrast, the CD8 Δ PD-1 dataset produced similar numbers of GS terms enriched for Cre- and Cre+. Some terms are shared between the two experiments, including oxidative phosphorylation, heme metabolism, MYC targets, and androgen response (Table 8). Many of these GS terms and the genes associated with them could make viable targets for future analyses. For example, oxidative phosphorylation has been shown to be a target in cancer therapy (Ashton, McKenna, Kunz-Schughart, & Higgins, 2018). Many other GS terms could be associated with the growth of tumors. Angiogenesis, for example, is the process of forming new blood vessels and therefore could be involved in tumor formation. There are other GS terms which are involved in providing blood to cells and, given the environment around the CD8 T-cells, it might be reasonable to target these pathways and/or systems for further study with the model. The C2 database returned many GS terms involved in cancer as well as in the immune response. An example is YANG BREAST CANCER ESR1 BULK DN, which is from early primary breast cancer (Yang et al., 2006). Other results of interest are from datasets for thyroid cancer and liver cancer. It is possible that the genes which are overexpressed in the datasets could be useful in a statistical model which studies all cancers. The results of the analysis with the C5 database, however, were less specific to cancer and were instead dominated by development GS terms. There were GS terms involved in immune cell regulation, more specifically negative regulation, such as T-cell differentiation and lymphangiogenesis. Lymphangiogenesis is important because of its involvement in tumor growth. The C7 database, which was developed by the Haining lab, returned distinct results for the Tamox Δ PD-1 and CD8 Δ PD-1 datasets (see Table 8) (Godec et al., 2016). There were more GO terms enriched for Cre- in the Tamox Δ PD-1 dataset, whereas the CD8 Δ PD-1 dataset returned more results for Cre+. All of the results returned from this database are immune related.
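To make the idea of enrichment for Cre+ or Cre- more concrete, the following is a minimal Python sketch of a running-sum enrichment score of the kind GSEA reports. It is an illustration under stated assumptions, not the Broad Institute implementation: the gene names and ranking scores are invented, the weight parameter p is an assumed default, and the permutation-based significance test and score normalization are omitted.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum enrichment score for one gene set.

    ranked_genes: gene names ordered from most Cre+ associated to most Cre-.
    scores: ranking metric for each gene, in the same order.
    gene_set: genes annotated to the term being tested.
    p: weight on the ranking metric (p=0 gives the unweighted form).
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_norm = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    if hit_norm == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running, best = 0.0, 0.0
    for s, h in zip(scores, hits):
        # Step up at genes in the set, step down at genes outside it.
        running += abs(s) ** p / hit_norm if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the largest deviation from zero
    return best


# Hypothetical ranked list: positive scores lean Cre+, negative lean Cre-.
toy_genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
toy_scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top = enrichment_score(toy_genes, toy_scores, {"g1", "g2"})
es_bottom = enrichment_score(toy_genes, toy_scores, {"g5", "g6"})
```

A positive score means the gene set is concentrated at the top of the ranked list, and a negative score means it is concentrated at the bottom, which corresponds to a term being reported as enriched for one subset or the other.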


Table 8. Results for the Broad GSEA tool.

Data compiled from the top GO terms from the Hallmark, C2, C5, and C7 databases for both Tamox Δ PD-1 and CD8 Δ PD-1 datasets. This includes both Cre+ and Cre- subsets.


The second tool used, Panther, is a web-based tool for classifying genes based on family and subfamily. Families are groups of evolutionarily related proteins, whereas subfamilies are proteins with the same function. Molecular function, biological process, and pathway information are also available. We looked at whether there was a difference in molecular function for the differentially expressed genes in each of the two subsets, as well as for the different experiments (i.e., the knockout vs. the wild-type). Figure 14 shows pie charts with the results of the analysis for CD8 Δ PD-1 and Tamox Δ PD-1. The largest GO term for both experiments by far is “binding”. In the CD8 Δ PD-1 dataset, “binding” was larger in the Cre- subset, whereas in the Tamox Δ PD-1 dataset it was larger in the Cre+ subset (Figures 14 B and C). The number of genes available for Tamox Δ PD-1 was much lower, however, which could have played a role in this difference. The second largest GO term is “catalytic activity”. In both CD8 Δ PD-1 and Tamox Δ PD-1, catalytic activity was down in the Cre+ subset (Figures 14 A and C). The next highest enriched GO term is “molecular transducer activity”, which is higher in Cre+ for both experiments (Figures 14 A and C). For CD8 Δ PD-1, the same top 6 GO terms dominate both pie charts (Figures 14 A and B). This is in contrast to Tamox Δ PD-1, where the Cre- pie chart has 6 GO terms (Figure 14D) and the Cre+ pie chart has only 4 (Figure 14C). The GO terms missing from the Cre+ chart are “structural molecule activity” and “molecular function regulator”.


Figure 14. Pie charts with overviews of top functional hits for the Molecular Function database in Panther

A: CD8 Δ PD-1 results high in Cre+. B: CD8 Δ PD-1 results high in Cre-. C: Tamox Δ PD-1 results high in Cre+. D: Tamox Δ PD-1 results high in Cre-


Next, we looked at the top GO terms produced by Panther for the different subsets in a more detailed manner. Figure 15A shows the results of the Panther Overrepresentation Test with the GO Ontology database for mice on the Tamox Δ PD-1 dataset. All the results pertain to antigen processing except for one, “negative regulation of intracellular signal transduction”. This is defined as any process which stops, prevents, or reduces how frequently intracellular signal transduction takes place (Bult et al., 2018). This is interesting given that the PD-1 receptor inhibits signaling pathways downstream of TCR and CD28. Two of the other top GO terms are related to MHC Class II, which makes sense given that these are cancer datasets in which PD-1 has been deleted, likely having an effect on this. It also makes sense that antigen processing would come up given the environment where these cells were collected. In this system, we are looking at the immune response to a tumor and how the body responds once PD-1 has been deleted. The abundance of antigen processing GO terms was not seen in CD8 Δ PD-1, but many of the resulting GO terms do involve immune processes. Many more results were returned for the CD8 Δ PD-1 dataset.
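An overrepresentation test of this kind asks whether a GO term appears in a gene list more often than expected by chance. The sketch below shows one common formulation, a one-sided hypergeometric (Fisher-type) tail probability, in Python. It is meant only as an illustration: the test settings and multiple-testing correction actually used in Panther for this analysis are not stated here, and the function name and the example numbers are assumptions.

```python
from math import comb


def overrepresentation_p(n_background, n_annotated, n_list, n_overlap):
    """Probability of seeing at least n_overlap annotated genes by chance.

    n_background: genes in the reference set (e.g., all mouse genes tested).
    n_annotated: background genes annotated to the GO term.
    n_list: genes in the submitted list (e.g., genes high in Cre-).
    n_overlap: submitted genes that carry the GO term.
    """
    total = comb(n_background, n_list)
    upper = min(n_annotated, n_list)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 40 of 300 submitted genes carry a term that
# annotates 500 of 20,000 background genes.
p_value = overrepresentation_p(20000, 500, 300, 40)
```

A small value suggests the term is overrepresented in the list. In practice, the p-values for all tested terms would still need to be corrected for multiple testing.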

The number of GO terms found to be enriched for CD8 Δ PD-1 was not the same for Cre+ and Cre-. Twenty-three GO terms were shared between Cre+ and Cre-, whereas 26 were unique to Cre+ and 260 were unique to Cre-. This can be seen in the Venn diagram shown in Figure 15. Figure 15B shows the complete list of shared GO terms. The majority involve the regulation of cellular processes, such as regulation of cell death (GO:0010941) and immune response (GO:0002684). These play an important role in cancer and the immune system’s response to cancer cells.


Figure 15. Results from PANTHER

A: Genes high in Cre- for the Tamox Δ PD-1 experiment. There were not enough genes which were statistically significant in the Cre+ dataset. The Venn diagram shows the number of GO terms shared and unique for CD8 Δ PD-1. B: Shared GO terms between Cre+ and Cre- for CD8 Δ PD-1. C: GO terms that are high in Cre- for CD8 Δ PD-1. D: GO terms that are high in Cre+ for CD8 Δ PD-1.


In total, there are 49 GO terms which are significant in Cre+ and 283 GO terms which are significant in Cre-. Many of the top terms involve processes which affect or involve the immune system. For CD8 Δ PD-1, 26 enriched GO terms are unique to Cre+, whereas 260 GO terms are unique to Cre- (Figure 15C). For Cre+ (Figure 15D), many of the top GO terms involve immune responses such as T-cell activation and differentiation, as well as leukocyte activation and differentiation. The top GO terms for Cre+ are more focused around cellular processes including cell death, developmental process, and cell differentiation. These results make sense, as the knockout of PD-1 leads to overstimulation, T-cell death, and altered T-cell differentiation.
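The shared and unique counts behind a Venn diagram like this can be reproduced with simple set operations. The Python sketch below uses placeholder identifiers sized to match the counts reported above (23 shared, 26 unique to Cre+, 260 unique to Cre-); the identifiers are not real GO terms and the function name is illustrative.

```python
def venn_counts(terms_a, terms_b):
    """Count terms shared by, and unique to, two collections of GO terms."""
    a, b = set(terms_a), set(terms_b)
    return {
        "shared": len(a & b),
        "only_a": len(a - b),
        "only_b": len(b - a),
    }


# Placeholder term IDs sized to match the reported counts.
shared = {f"shared_{i}" for i in range(23)}
cre_pos_terms = shared | {f"pos_only_{i}" for i in range(26)}
cre_neg_terms = shared | {f"neg_only_{i}" for i in range(260)}

counts = venn_counts(cre_pos_terms, cre_neg_terms)
```

With these sizes, the totals per subset come out to 49 significant terms for Cre+ and 283 for Cre-, which agrees with the figures quoted in the text.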

The next tool used was Enrichr, which provides results from over 40 databases. Samples of these results can be seen in Figure 16 and Figure 17. The remainder of the results can be found on the Harvard Dataverse site associated with this thesis (see Materials and Methods) (Altman, 2007). Two widely used databases were chosen for display: KEGG and Reactome. Figure 16 shows the results for the CD8 Δ PD-1 dataset, with Cre+ results on the left and Cre- results on the right. The Reactome database for the Cre+ subset shows an overexpression of the CD3 protein complex for the term “phosphorylation of CD3 and TCR zeta chains”. CD3 has recently been used as a target for bispecific antibodies (CD3 bsAb) (Benonisson et al., 2019). In that experiment, the number of inflammatory macrophages in the tumor micro-environment increased (Benonisson et al., 2019). Additionally, this has been shown to increase the number of CD4+ and CD8+ T-cells, which contributes to direct killing of melanoma cells (in vitro) (Benonisson et al., 2019). “T-cell receptor signaling” was found as a result in the IPA, Enrichr, and Panther databases. One of the top terms found in Enrichr, “Costimulation by the CD28 family Homo sapiens”, is also worth noting, as PD-1 is known to attenuate CD28 costimulation (Kamphorst et al., 2017).

Figure 16. Sample of the results from Enrichr for CD8 Δ PD-1

Results taken from Enrichr for the CD8 Δ PD-1 dataset. The two databases shown are Reactome and KEGG. The bar plots represent the top pathways, with the length of the bar indicating significance.

Figure 17 shows the results from Enrichr for the Tamox Δ PD-1 dataset. The figure shows a sample of the data obtained from the Reactome and KEGG databases. Of interest are Lysosphingolipids (LPLs) and LPA receptors, which have been shown to play an important role in the immune system and are secreted by cancer cells (Rolin & Maghazachi, 2011). They have been shown to be involved in the regulation of the growth of tumor cells and can increase the invasiveness of tumor cells (Rolin & Maghazachi, 2011). The top terms for the KEGG database were “Base excision repair” and “Osteoclast differentiation”. These two processes are responsible for correcting DNA damage due to oxidation and for the degradation of the bone matrix, respectively (Krokan & Bjørås, 2013) (Kim & Kim, 2016). These results support the hypothesis that knocking out PD-1 will affect immune-related function. Enrichr provides a rich set of results from many popular databases and therefore complements the other programs used in this analysis.

Figure 17. Sample of the results from Enrichr for Tamox Δ PD-1

Results taken from Enrichr for the Tamox Δ PD-1 dataset. The two databases shown are Reactome and KEGG. The bar plots represent the top pathways, with the length of the bar indicating significance. The terms with grey bars are less significant than the terms in color.

The final pathway analysis program used in this project is the Ingenuity Pathway Analysis tool, IPA. Unlike the other programs, IPA is a commercial program that has been manually curated. Its results include networks such as those shown in Supplementary Table 4, and top pathways and regulators such as those shown in Table 9. Table 9 displays the top canonical pathways and upstream regulators for both the Tamox Δ PD-1 and CD8 Δ PD-1 datasets. The two experiments had different top upstream regulators for the Cre+ and Cre- subsets. For CD8 Δ PD-1, the top regulators for Cre+ are IL2, TBX21, CD3, CSF3, and IL21 (Table 9A). Many of the regulators have been shown to affect the immune response to tumors. For example, IL2 in conjunction with CTLA-4 has been shown to induce long-term cancer regression (Caudana et al., 2019). They have also been used as a therapy for melanoma and renal cell carcinoma when PD-1 has been inhibited (Buchbinder et al., 2019). In addition to IL2, CD3 is overrepresented in both IPA and Enrichr. This may provide stronger evidence for the relevance of CD3 for this dataset. The top upstream regulators for Cre- are dominated by interleukins (ILs), including IL4, IL7, and IL12 (Table 9A). Interleukins have been shown to be a promising tool for cancer immunotherapy (Anestakis et al., 2015) (Yoshimoto et al., 2009).


Table 9. Results of analysis using Ingenuity Pathway Analysis

A) CD8 Δ PD-1: Top Pathways and Regulators

Top Canonical Pathways, high in Cre+ (name, p-value)
    Th1 and Th2 Activation Pathway                                                 1.44E-08
    Th2 Pathway                                                                    6.45E-08
    T Cell Exhaustion Signaling Pathway                                            6.10E-06
    T Cell Receptor Signaling                                                      8.81E-06
    Th1 Pathway                                                                    1.62E-05

Top Canonical Pathways, high in Cre- (name, p-value)
    Th1 and Th2 Activation Pathway                                                 8.33E-05
    Systemic Lupus Erythematosus In B Cell Signaling Pathway                       1.52E-04
    Role of Osteoblasts, Osteoclasts and Chondrocytes in Rheumatoid Arthritis      3.17E-04
    Role of Macrophages, Fibroblasts and Endothelial Cells in Rheumatoid Arthritis 3.28E-04
    Chemokine Signaling                                                            3.49E-04

Top Upstream Regulators, high in Cre+ (name, p-value)
    IL2                                                                            1.69E-14
    TBX21                                                                          9.55E-12
    CD3                                                                            7.52E-09
    CSF3                                                                           1.79E-08
    IL21                                                                           1.87E-08

Top Upstream Regulators, high in Cre- (name, p-value)
    lipopolysaccharide                                                             4.69E-15
    IL12 (complex)                                                                 2.91E-13
    IL4                                                                            4.60E-13
    dexamethasone                                                                  4.97E-13
    IL7                                                                            1.77E-12

B) Tamox Δ PD-1: Top Pathways and Regulators

Top Canonical Pathways, high in Cre+ (name, p-value)
    Polyamine Regulation in Colon Cancer                                           9.80E-03
    VDR/RXR Activation                                                             3.44E-02
    Ceramide Signaling                                                             3.87E-02
    Sphingosine-1-phosphate Signaling                                              5.11E-02
    Natural Killer Cell Signaling                                                  5.20E-02

Top Canonical Pathways, high in Cre- (name, p-value)
    Antigen Presentation Pathway                                                   2.41E-09
    B Cell Development                                                             1.97E-07
    Graft-versus-Host Disease Signaling                                            6.43E-07
    Autoimmune Thyroid Disease Signaling                                           7.00E-07
    Nur77 Signaling in T Lymphocytes                                               1.49E-06

Top Upstream Regulators, high in Cre+ (name, p-value)
    T-cell alpha/beta receptor                                                     4.58E-04
    CPEB4                                                                          4.11E-03
    tridecanoic acid                                                               4.57E-03
    HLA-A                                                                          5.94E-03
    STAT5A                                                                         7.27E-03

Top Upstream Regulators, high in Cre- (name, p-value)
    HRG                                                                            8.75E-11
    CIITA                                                                          3.64E-09
    IL4                                                                            1.33E-06
    lipopolysaccharide                                                             1.81E-06
    PARP9                                                                          2.53E-06

Results from IPA for the Tamox Δ PD-1 and CD8 Δ PD-1 datasets, including Cre+ and Cre- subsets.

On the other hand, the top upstream regulators for Cre+ in the Tamox Δ PD-1 dataset are T-cell alpha/beta receptor, CPEB4, tridecanoic acid, HLA-A, and STAT5A (Table 9B). Many of these regulators have also been shown to be involved in cancer immunotherapy. For example, STAT5A has been shown to be a potential therapeutic target for prostate cancer (Mohanty et al., 2017). The top upstream regulators for Cre- are HRG, CIITA, IL4, lipopolysaccharide, and PARP9. Many of these regulators have been shown to be useful in the treatment of cancer as well. HRG, for instance, has been shown to inhibit tumor growth, correct cancer-associated vascular abnormalities, and regulate the immune system component of the tumor micro-environment (Francis P. Roche, 2014). CIITA has also been shown to be an effective tool in cancer immunotherapy: CIITA-driven MHC class II molecules expressed on the surface of a tumor cell lead to an immune response (Accolla, Ramia, Tedeschi, & Forlani, 2019).

Many of the top canonical pathways found for this analysis are also involved in the immune response (Table 9). For example, several of the top pathways found for CD8 Δ PD-1 Cre- are involved in Rheumatoid Arthritis, which is a type of inflammatory autoimmune disease. Another enriched pathway is the “T-Cell Exhaustion Signaling Pathway,” which is important because PD-1 is a mediator of T-cell exhaustion (Im et al., 2016). For Cre+ in CD8 Δ PD-1, most of the top pathways involve either the Th1 and Th2 pathways or T-cell signaling. For Tamox Δ PD-1, there are also many enriched pathways which involve an immune response, such as “Autoimmune Thyroid Disease Signaling” and “Nur77 Signaling in T Lymphocytes” (Ganesh, Bhattacharya, Gopisetty, & Prabhakar, 2011) (Rydzewska, Jaromin, Pasierowska, Stożek, & Bossowski, 2018). Other enriched pathways involve cancer directly, such as “Polyamine Regulation in Colon Cancer.” These pathways and networks could make useful targets for this model, as they provide a list of possible downstream targets.


The Statistical Model

The neural network model was trained using a training/test set approach with a 70/30 split (training/test) over 500 bootstrap sample datasets (Figure 18). The training results shown in Figure 18 indicate that the model achieves an error rate under 10%, with a mean error rate of about 3%. The error on the test set was very similar, also under 10%, with a mean error rate of roughly 3.4%. The resulting statistical model can now be applied to new datasets by using the model object and leveraging the work done here. New additions to the model would require training only the last layer of the model and therefore should not take a long time to train (Figure 4). The goal of the model is to determine which genes were more closely associated with PD-1 expression or with the knockout of PD-1. The output of the model is a list of genes in the given datasets with a p-value indicating how closely correlated each gene is with either Cre+ or Cre-. Unlike a differential expression analysis, this statistical model provides information on association, not just rarity. In order to determine how well this neural network model performs against other methods, each model was benchmarked using the validation set.
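The resampling scheme described above can be sketched in a few lines of Python. This is one plausible reading of the procedure, not the thesis code: each round draws a fresh 70/30 split, resamples the training portion with replacement, and records training and test error. A nearest-centroid classifier stands in for the neural network, and the two-feature data are synthetic; all names here are hypothetical.

```python
import random


def centroid(rows):
    """Mean vector of a list of equal-length feature rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit(X, y):
    """Nearest-centroid stand-in: one mean vector per class label."""
    return {label: centroid([x for x, t in zip(X, y) if t == label])
            for label in set(y)}


def error_rate(model, X, y):
    wrong = sum(min(model, key=lambda c: sq_dist(model[c], x)) != t
                for x, t in zip(X, y))
    return wrong / len(y)


def repeated_split_errors(X, y, n_rounds=500, train_frac=0.7, seed=0):
    """Train/test error over repeated 70/30 splits with bootstrapped training sets."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    cut = int(train_frac * len(idx))
    train_err, test_err = [], []
    for _ in range(n_rounds):
        rng.shuffle(idx)
        # Resample the training portion with replacement; hold out the rest.
        train = [rng.choice(idx[:cut]) for _ in range(cut)]
        test = idx[cut:]
        model = fit([X[i] for i in train], [y[i] for i in train])
        train_err.append(error_rate(model, [X[i] for i in train], [y[i] for i in train]))
        test_err.append(error_rate(model, [X[i] for i in test], [y[i] for i in test]))
    return train_err, test_err


def make_toy_data(n_per_class=50, seed=1):
    """Synthetic two-class data: label 0 = Cre-, label 1 = Cre+ (assumed coding)."""
    rng = random.Random(seed)
    X = [[rng.gauss(3.0 * label, 1.0), rng.gauss(3.0 * label, 1.0)]
         for label in (0, 1) for _ in range(n_per_class)]
    y = [label for label in (0, 1) for _ in range(n_per_class)]
    return X, y
```

Calling `repeated_split_errors(*make_toy_data())` returns two lists of 500 error rates, whose means and spreads correspond to the kind of summary plotted in Figure 18. Keeping the held-out 30% out of the resampling step avoids the same sample appearing in both the training and test sets.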


Figure 18. Training and Test error for neural network

The training and test error computed using data from CD8 Δ PD-1. Data from the training/test sets were sampled so that the ratio was a 70/30 split. The model was run 500 times for both the training and test sets.

In order to get a better sense of model performance, it is important to examine several metrics. A number of alternative models were compared against our neural network model (NN). These models include Logistic Regression (LR), K-Nearest Neighbors (KNN), Naïve Bayes (NB), Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA). Each of these models was chosen because they a) work with high dimensional data, b) do not require data to be Gaussian, and c) are all designed to solve classification problems. In terms of accuracy, NN performed the best, with an accuracy level of 98% on the validation set (see Figure 19A). This was followed by LR, which had a 90% accuracy rating (see Figure 19A). The neural network model also had the lowest error on the validation set, with an average error rate of around 4% and a maximum of 7% (see Figure 19B). Naïve Bayes performed the worst of the different models, with an average error rate of around 22% (see Figure 19B). This was very similar to the error for KNN; however, when K was smaller, the error rate was lower. For example, when K is 20, the error rate was roughly 9% (see Figure 19B). A more in-depth view of all the metrics tested can be seen in Supplementary Figure 2, where each model was compared based on error, accuracy, sensitivity, and specificity. This was achieved using the performance package in R (Nakagawa, Johnson, & Schielzeth, 2017). These results indicate that, given these models and the validation data, the neural network model performs better. LR was the fastest of the methods, often completing in under a second. NN, however, was the slowest of the methods used, often taking several minutes or more to complete.
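The four metrics used for this comparison all derive from the confusion matrix. The following Python sketch shows how they relate; it is an illustration only (the thesis computed them with an R package), and it assumes labels coded as 1 for Cre+ and 0 for Cre-, with Cre+ treated as the positive class.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, error, sensitivity, and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    return {
        "accuracy": accuracy,
        "error": 1.0 - accuracy,
        # Fraction of true Cre+ samples that were called Cre+.
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        # Fraction of true Cre- samples that were called Cre-.
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Hypothetical validation labels and predictions.
truth = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(truth, predicted)
```

Error is simply one minus accuracy, so reporting sensitivity and specificity alongside it shows whether a model's mistakes fall mainly on Cre+ or on Cre- samples.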

Figure 19. Boxplots displaying accuracy and error for statistical modeling methods on the Validation set

A: Accuracy metric using the validation set for different statistical models. B: Error rate metric using the validation set for different statistical models. Each box plot represents 100 bootstrap samples per model.

The model we developed can be used along with pathway analysis to gain more insight into genes which may not be differentially expressed. In Figure 20, we can see the result of overlaying the model’s predictions onto the genes enriched for a specific term. Figure 20A shows the GSEA result for Oxidative Phosphorylation using the CD8 Δ PD-1 dataset. We can see that Cre+ is positively correlated whereas Cre- is negatively correlated. This is not directly evident, however, when we look at the expression of the genes (left section of Figure 20B). Looking at the last two columns (right section of Figure 20B), most of the genes are predicted by the model to be highly correlated with Cre+. Notice that this is consistent with what can be seen in Figure 20A.

Figure 20. Model predictions for Oxidative Phosphorylation and GSEA results

A: The GSEA results for the enriched term Oxidative Phosphorylation using data from CD8 Δ PD-1. B: Heatmaps containing expression values and model predictions for the genes that are enriched for the term Oxidative Phosphorylation.


Chapter IV.

Discussion

One of the goals of this project is to provide a new way to analyze bulk RNA-seq data. This goal was accomplished with three aims. The goal of the first aim is to provide an unbiased comparison of popular RNA-seq pipelines. This provides the lab with a standard operating procedure for RNA-seq analysis. The goal of the second aim is to carry out pathway analyses of the RNA sequencing data, which provides a picture of the pathways and dynamics surrounding the knockout of PD-1 in mice. The goal of the third aim is to develop a statistical model based on a neural network. This model serves several purposes: first, to provide new insight into the relationship between the pathways enriched for PD-1 and how important each gene is to the expression of PD-1 or PD-1 deletion; second, to provide a metric to determine how important individual genes are to PD-1 expression; and finally, to provide a general tool which can be applied to other questions such as T-cell differentiation.

Once the goals were known, it was important to get a good sense of the makeup of the data. To accomplish this, a few preliminary analyses had to be conducted. The preliminary analyses allow us to answer a few important questions essential for the project. First, how does the data look overall for the two experiments, CD8 Δ PD-1 and Tamox Δ PD-1? Taking a closer look helps determine which types of statistical methods can be applied to the analysis of these datasets. This also provides guidance on how to approach the analysis. Summary statistics and plots like histograms and box plots are important for this because they help provide a clearer picture of the data. In these datasets, the results show that the data are skewed to the right, indicating that they are not normally distributed (that is, they do not follow a Gaussian distribution). When the data are transformed, the means for subsets Cre- and Cre+ are similar. Second, do the two datasets look similar in terms of their files and reads? This helps answer the question of how comparable the results for the two experiments are. Looking at the average file size and average number of reads per file in each experiment provides a clearer picture of this. The two experiments differ a great deal, making them difficult to compare. Third, are there any unusual effects which might impact the analysis? Is there anything wrong that may point to issues with the sequencing or library prep? Such issues might suggest that the results will be poor due to a lack of read depth or skewed by contamination. The FastQC software package has many tools designed to answer this question. The results here indicate that FastQC uncovered several issues with the datasets which impact the read count for both datasets. Fourth, is there an issue with the variance in the data, and does it look as if it will impact the results? The PCA indicates that this would not be an issue. The variance is concentrated in the first principal component. Fifth, do the data cluster well, and can we see clear separation for the different conditions? Several analyses indicate that we can: both the heatmaps and PCA plots show this. This was not evident in the t-SNE plot, however, indicating that it would take a more sophisticated model to classify the datasets. These analyses pave a way forward for the rest of the project.
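The first of these checks, whether the counts are right-skewed and how a transformation changes that, can be illustrated with a short Python sketch. The counts below are invented, and the choice of a log2 transform with a pseudocount of 1 is an assumption made for illustration; the transformation used in the thesis is not specified in this section.

```python
import math


def skewness(values):
    """Sample skewness (third standardized moment); > 0 means a right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def log2_transform(counts, pseudocount=1.0):
    """log2(count + pseudocount); the pseudocount keeps zero counts finite."""
    return [math.log2(c + pseudocount) for c in counts]


# Hypothetical read counts: most genes low, a few very highly expressed.
raw_counts = [0, 1, 2, 3, 5, 8, 20, 150, 1200]
skew_raw = skewness(raw_counts)
skew_log = skewness(log2_transform(raw_counts))
```

For count data like this, the skewness of the raw values is strongly positive and drops after the log transform, which is why summary statistics are usually compared on the transformed scale.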

The work on the development of an RNA-seq pipeline for the lab was based on several factors: ease of use, speed of the pipeline, consistency with what was found in the past, and how closely the expression matches that of a known set of genes. CLC Genomics Workbench is the easiest tool to use: it runs on Windows and features a graphical user interface. However, the program is restricted to specific computers and requires an expensive up-front cost and yearly licensing. Salmon and STAR aligner are both designed to run on a Linux or UNIX cluster. This was ideal given the availability of the O2 Cluster at Harvard Medical School, where the work was performed (see http://rc.hms.harvard.edu for more information). They are both open source and therefore do not require a commercial license to use. In terms of speed, Salmon is the fastest of the three pipelines, as it was designed for speed. However, after performing these analyses, the use of STAR aligner is recommended because of its configurability. I preferred to have the ability to control the aligner separately from the sorter and counter. While STAR aligner is not as fast as Salmon, it can be adjusted to perform just as well, especially given the resources available on O2. The tools in pipeline A (STAR aligner, SAMtools, and HTSeq-count) can all be parallelized, allowing for better performance, whereas CLC Genomics Workbench does not allow for this. Furthermore, support is available on campus for STAR aligner (pipeline A) and Salmon (pipeline B), as both are used extensively on campus. Code and/or tutorials for each of the pipelines can also be found on GitHub (see links above).

From a QC and development standpoint, the most valuable result of running each pipeline is the ability to compare results for genes whose expression patterns we already know from the list the lab originally provided, the gold standard. In this case, we can compare what we see against a set of known gene expression patterns. This provides a valuable metric with which we can compare pipelines. Looking over the top 100 genes for the three pipelines (Figure 9), in many cases the top genes are the same and the expression values are very close. In other cases, there are some sizeable differences. For example, for “Actb-201” (Supplementary Table 3), the expression level in Pipeline A (HTSeq-count) and Pipeline B (Salmon) is 10,861, whereas in Pipeline C (CLC) it is 18,177. This is a 40% difference in the counts. Then there are instances where only two of the three pipelines agree; examples of this can be seen in Supplementary Table 3. “Tmsb4x”, for example, has an expression level of 14,861 in both Pipeline C and Pipeline A, but 14,657 in Pipeline B. The issue is that Salmon has a difficult time assembling all the transcripts that make up the gene and therefore splits Tmsb4x into two distinct transcripts, which had to be programmatically combined. This is often the case when aligning to a transcriptome, making it difficult to compare the results between the three different pipelines. This is an issue which could be overcome, however, if speed was your main priority. In both of those examples, the genes in question were in the group of top 100 expressed genes. There are also instances where the counts are different in all three pipelines; for example, Rplp2 is 1,253 in Pipeline A, 1,547 in Pipeline B, and 1,570 in Pipeline C. The differences in counts between Pipeline A and Pipeline B can be attributed to two things: 1) issues going from transcript ID to gene name, and 2) differences between the mathematical model used by Salmon and the count mode in HTSeq-count. Pipeline A and Pipeline C have similar counting results, which is likely due to the two pipelines having more closely aligned counting algorithms.
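Two steps in this comparison lend themselves to a short worked example: collapsing transcript-level counts to one value per gene, and expressing the disagreement between two pipelines as a percentage. The Python sketch below is illustrative only. The transcript IDs and the way the Tmsb4x total is divided between two transcripts are hypothetical; only the gene-level totals (14,657; 10,861; 18,177) come from the text.

```python
from collections import defaultdict


def gene_level_counts(transcript_counts, transcript_to_gene):
    """Sum transcript-level counts into one total per gene."""
    totals = defaultdict(float)
    for transcript, count in transcript_counts.items():
        totals[transcript_to_gene[transcript]] += count
    return dict(totals)


def percent_difference(a, b):
    """Absolute difference relative to the larger of the two counts, in percent."""
    return abs(a - b) / max(a, b) * 100


# Hypothetical transcript-level output for one gene (the split is invented).
salmon_counts = {"Tmsb4x-201": 14000, "Tmsb4x-202": 657}
tx_to_gene = {"Tmsb4x-201": "Tmsb4x", "Tmsb4x-202": "Tmsb4x"}
combined = gene_level_counts(salmon_counts, tx_to_gene)

# Actb-201 counts reported above for Pipelines A/B versus Pipeline C.
actb_difference = percent_difference(10861, 18177)
```

Here the combined Tmsb4x count is 14,657, and the Actb-201 difference is about 40.2% when measured against the larger count, which matches the 40% quoted above. Measuring against the smaller count instead would give a larger figure, so the choice of denominator should be stated when reporting such differences.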

Looking over the heatmaps in Figure 13, there appears to be a good separation between the different subsets. The results seem to make sense given the state of PD-1. Aryl hydrocarbon receptor nuclear translocator 2 (Arnt2), for example, is up in Cre+ and down in Cre- in the CD8 Δ PD-1 dataset, which is different from the majority of the top genes (Safe, Lee, & Jin, 2013). “Arnt2” is a transcription factor that is involved in the adaptive stress response and has been found to be involved in carcinogenesis and cancer progression (Safe et al., 2013). It makes sense that “Arnt2” would be up in Cre+, as the mouse’s immune system is likely working on clearing tumors (Safe et al., 2013). It has been shown that blocking PD-1 leads to a strong anti-tumor response (Juneja et al., 2017). This supports the finding that knocking out PD-1 would lead to genes involved in the immune response being up-regulated compared with when PD-1 is present. Another example (see the heatmaps in Figure 13) is the G protein-coupled receptor 146 (Gpr146), which is part of a family of G protein-coupled receptors. “Gpr146” is down in Cre+ and up in Cre- in the CD8 Δ PD-1 dataset. “Gpr146” has found use as a target for drug discovery in cancer and plays a role in cancer cell proliferation, angiogenesis, and metastasis (Lappano & Maggiolini, 2011). It has also been shown to transactivate many different cell surface receptors involved in cancer (Lappano & Maggiolini, 2011). It would therefore make sense that “Gpr146” expression would be down when PD-1 is knocked out. In examining Figure 13, it is clear that not every gene follows the same trend. For example, “SATB1” is up in Cre+ and down in Cre-. This would be expected, as tumors from mice have been shown to have lower expression of “SATB1” when compared with wild-type mice (Abdul Hafid & Radhakrishnan, 2019). “FBXO17”, which is down in Cre+ and up in Cre-, has been shown to be linked to shorter lifespans for glioma (brain and spinal cord tumor) patients (Du et al., 2018). It therefore makes sense that “FBXO17” would have higher expression when PD-1 is present. These results, along with the results of the other top genes, support the RNA-seq experiments.


The results from conducting a pathway analysis on the dataset provide a more in-depth picture of the tumor micro-environment. One example is oxidative phosphorylation, which is overrepresented in both the GSEA and Enrichr results and was also corroborated by experiments done by the Sharpe lab. Vikram Juneja demonstrated that oxidative phosphorylation can help T-cells overcome PD-1 signaling (Juneja, 2017). It was also shown that oxidative phosphorylation can increase cytotoxicity by CD8+ cells (Juneja, 2017). Oxidative phosphorylation is also important because, during activation, T-cells can switch from oxidative phosphorylation to aerobic glycolysis to meet energy needs (Sharpe & Pauken, 2017). The expression of PD-1 ligands in response to inflammation is normal during infection (Sharpe & Pauken, 2017). Several interleukin cytokines were found to be important upstream regulators for both Cre+ and Cre- subsets in the analysis. IL-2, for example, was found to be overrepresented in the pathway analysis results for IPA and GSEA. It has been shown to play key roles in T-cell expansion and can also induce PD-1 expression (Francisco, Sage, & Sharpe, 2010). This is further supported by the overrepresentation of IL-2, cytokine-cytokine receptor interaction, IL-7, and IL-21 in the pathway analyses. Th1 and Th2 T lymphocytes also showed up frequently in the analyses for Cre+ and Cre- from the GSEA C7 database. In addition, Th1 and Th2 showed up as hits in Enrichr and IPA. These results point to the strong interplay between the different pathways and the network of genes in the immune system. The results from this analysis were used to aid in the construction of a more robust statistical model.

The model provides a tool that can be combined with the output from the other two aims to better determine which gene expression changes predict response, as opposed to resistance, to tumor clearance under PD-1 cancer immunotherapy. One way results from a pathway analysis could be incorporated into a statistical model is to take genes from the analysis and ask how much they correlate with a given class, where

$$\text{class}_k = \begin{cases} 0 & \text{if Cre}^- \\ 1 & \text{if Cre}^+ \end{cases}$$

This can also be used to show the impact that the loss of PD-1 may have on other pathways. This might be particularly valuable because it has been shown that PD-1 therapies in conjunction with the knockout of other pathways can be more effective than the loss of PD-1 alone (Francisco et al., 2010). In the short term, however, this tool provides another means of gaining insight from the results of pathway analyses, both in terms of exploring the importance of regulators of PD-1 and their impact on other aspects of the immune response. For example, what is the impact on TFR and TFH cells? Are the genes which regulate the production of MHC-II molecules important when PD-1 is deleted? The statistical model incorporating the pathway analysis could also shed more light on T-cell activation. For example, what is the connection between PD-1 expression and metabolic activity in T-cells? Effector T-cells switch from oxidative phosphorylation to aerobic glycolysis to meet their energy requirements (Sharpe & Pauken, 2017). Figure 20 shows the genes involved in oxidative phosphorylation when PD-1 is knocked out. What can the model tell us about aerobic glycolysis? This method (i.e., pipeline A, see Figure 1) will also provide a blueprint or standard operating procedure for future experiments. Researchers can take the code developed here and apply it to their own data. The development of this tool, and of machine learning tools in general, can play a valuable role in better understanding the tumor micro-environment as well as contributing to what is known about PD-1's impact on the immune response. This may allow us to develop better treatments for patients in the future.
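The idea of asking how strongly a gene correlates with a class can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses a Pearson correlation against a 0/1 label (0 for Cre-, 1 for Cre+), which the thesis does not name as the measure used, and the gene names and expression values are hypothetical toy data.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant vector: correlation is undefined, report 0
    return sxy / math.sqrt(sxx * syy)


def class_correlations(expression, labels):
    """Map each gene to its correlation with the 0/1 class labels."""
    return {gene: pearson(values, labels) for gene, values in expression.items()}


# Hypothetical toy data: three Cre- samples (0) followed by three Cre+ samples (1).
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "gene_up_in_cre_pos": [1.0, 2.0, 1.5, 8.0, 9.0, 7.5],
    "gene_up_in_cre_neg": [9.0, 8.5, 7.0, 1.0, 2.0, 1.5],
    "gene_flat": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
}

scores = class_correlations(expression, labels)
# List genes from the strongest to the weakest association with the class.
for gene, r in sorted(scores.items(), key=lambda kv: -abs(kv[1])):
    print(f"{gene}\t{r:+.3f}")
```

A positive score suggests higher expression in the class coded 1, a negative score suggests higher expression in the class coded 0, and a score near zero suggests no linear association with the class label.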


Limitations

There are several limitations to this analysis based on the tools used and data available. The most difficult challenge facing the use of statistical modeling for RNA-seq is the development of the training set. Unlike image classification, there are no ground truths, only approximations. If you are trying to classify images of cats and dogs, for example, it can be trivial to acquire true images of both. On the other hand, it can be very difficult to obtain a true picture of a genomic signature. As neural networks evolve, it might become possible to use a method that does not rely on predefined training sets, such as unsupervised learning.

Perhaps an even more important limitation lies with the data. In this analysis, we had the benefit of having many replicates, but it would have been even better to have granular temporal information to better map the progression of disease in a mouse. Another limitation is the development of a null set, that is, a set of genes which are not part of the gene expression patterns we seek to uncover. At first, one might use genes which are not expressed, but a lack of expression does not indicate that a gene is not involved. Correlation, in this case, is not a perfect indicator of causation. Another limitation is the alignment of the transcripts to the genome index. As genome builds evolve, they become more accurate, so this issue will likely become less significant over time. However, it will remain a limitation when adapting this model to lesser-known species. For example, gene annotations may not be good enough for other species and pathway maps might not be well developed.

This analysis is also limited to bulk RNA-seq. Had single-cell data been available, different approaches would likely have been required, which the model in its current state is not designed to handle. There are at least two reasons for this. First, each cell is essentially an experiment on its own, so the code would have to be drastically altered to handle the increased complexity of the problem. Second, the most commonly employed single-cell pipelines use a different matrix format, called a sparse matrix, with which the packages employed here (in their current form) are not compatible. A further limitation is the versatility of the model, in that all of the data came from the same experiment. In the future, it would be ideal to build a model where the training, test, and validation sets are derived from a collection of datasets. Doing this would introduce more variance into the training data, which should make the model more broadly applicable to new problems.
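To make the sparse matrix point concrete, the sketch below converts a small dense count matrix into a compressed sparse row (CSR) layout and back. This is only an illustration: CSR is one of several common sparse layouts and is chosen here for simplicity, the thesis does not state which layout a given single-cell pipeline uses, and the count values are hypothetical.

```python
def to_csr(dense):
    """Convert a dense row-major matrix (list of lists) into CSR arrays."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)   # keep only non-zero entries
                indices.append(col)  # remember the column of each kept entry
        indptr.append(len(data))     # mark where the next row starts in `data`
    return data, indices, indptr


def csr_row(data, indices, indptr, i, n_cols):
    """Rebuild dense row i from the CSR arrays."""
    row = [0] * n_cols
    for k in range(indptr[i], indptr[i + 1]):
        row[indices[k]] = data[k]
    return row


# Hypothetical counts matrix that is mostly zeros, as single-cell data often are.
counts = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [5, 0, 0, 1],
]

data, indices, indptr = to_csr(counts)
print("values:", data)
print("column indices:", indices)
print("row offsets:", indptr)

# Round trip: every dense row can be recovered from the three arrays.
restored = [csr_row(data, indices, indptr, i, 4) for i in range(len(counts))]
print("round trip ok:", restored == counts)
```

Only the non-zero entries are stored, which is why code written for dense matrices usually cannot consume this representation without an explicit conversion step.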

Future work

Currently, the model contains a binary outcome variable, but the intention is to change this to allow for the inclusion of different classes of genes, such as cell cycle genes. This would allow for more accurate modeling of cellular processes. This might require the addition of a Natural Language Processing (NLP) component to the model. It would also be useful to include a method such as semi-supervised clustering with neural networks (Ankita Shukla, 2018). This would allow for the development of better training sets for the model. The model code is written in R, but in the future it would be better to write a wrapper for it in Python to make it easier to incorporate into a web tool. It might also be interesting to expand the model to work with temporal data, so that information on how each mouse fared per day or time point can be incorporated. This would likely allow researchers to ask more sophisticated questions. In terms of the RNA-seq pipeline, in the long term it would be good to revisit Salmon as the tool matures. Its speed makes it well suited to larger datasets, especially if human data are being used, as there are fewer issues with transcript-to-gene matches compared to mouse data. A further goal is to develop a practical way to store and recall the model's results, which would require the addition of a database and a front end to the model. This would allow members of the lab not only to run their data against the model, but also to adjust model parameters and see previously generated results.
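One possible shape for the proposed results store is sketched below using SQLite from the Python standard library. Everything here is an assumption made for illustration: the table names, the column names, the parameter dictionary, and the gene scores are hypothetical and are not part of the existing model code.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small results database with two hypothetical tables."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "run_id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "dataset TEXT NOT NULL, "
        "params TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gene_scores ("
        "run_id INTEGER NOT NULL REFERENCES runs(run_id), "
        "gene TEXT NOT NULL, "
        "score REAL NOT NULL)"
    )
    return conn


def save_run(conn, dataset, params, gene_scores):
    """Store one model run: its parameters (as JSON) and per-gene scores."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO runs (dataset, params) VALUES (?, ?)",
            (dataset, json.dumps(params, sort_keys=True)),
        )
        run_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO gene_scores (run_id, gene, score) VALUES (?, ?, ?)",
            [(run_id, gene, score) for gene, score in gene_scores.items()],
        )
    return run_id


def load_run(conn, run_id):
    """Recall a previously stored run by its identifier."""
    row = conn.execute(
        "SELECT dataset, params FROM runs WHERE run_id = ?", (run_id,)
    ).fetchone()
    if row is None:
        raise KeyError(run_id)
    scores = dict(
        conn.execute(
            "SELECT gene, score FROM gene_scores WHERE run_id = ?", (run_id,)
        )
    )
    return {"dataset": row[0], "params": json.loads(row[1]), "scores": scores}


# Hypothetical usage: parameter names and scores are placeholders.
conn = open_store()
run_id = save_run(conn, "example_dataset", {"hidden_units": 5}, {"gene_a": 0.9, "gene_b": 0.1})
recalled = load_run(conn, run_id)
print(recalled["dataset"], recalled["params"], recalled["scores"])
```

Keeping parameters alongside the scores for each run would let a front end show earlier results next to the settings that produced them.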


Chapter V.

Conclusion

Bulk RNA-seq analysis of cancer data has largely been confined to looking at which genes are differentially expressed and clustering the results. While this can provide a rich picture of the tumor micro-environment, it often lacks the granularity needed to pick good targets for follow-up analysis (wet lab experiments). The development of this model aims to make it possible to ask ever more complicated questions of RNA-seq data. The implication is that, aside from knowing which genes are most important, we are able to determine how important every gene is to the different classes. This provides a nuanced view of gene expression, and an approach that has not been shown previously in bulk RNA-seq analysis of cancer data. With this tool (model), we can now analyze new datasets and better probe gene expression changes around PD-1.


Chapter VI.

Appendix

Supplementary Table 1. Summary statistics for Tamox Δ PD-1

| Sample | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| TILs01_Cre_pos_PD1_neg | 0.00 | 0.00 | 0.00 | 14.58 | 0.00 | 49368.00 |
| TILs03_Cre_pos_PD1_neg | 0.00 | 0.00 | 0.00 | 17.16 | 0.00 | 25596.00 |
| TILs04_Cre_pos_PD1_neg | 0.00 | 0.00 | 0.00 | 27.53 | 0.00 | 32545.00 |
| TILs07_Cre_pos_PD1_neg | 0.00 | 0.00 | 0.00 | 15.78 | 0.00 | 18588.00 |
| TILs09_Cre_pos_PD1_neg | 0.00 | 0.00 | 0.00 | 13.32 | 0.00 | 28305.00 |
| TILs13_Cre_pos_PD1_neg | 0.00 | 0.00 | 0.00 | 20.99 | 0.00 | 44038.00 |
| TILs14_Cre_pos_PD1_neg | 0.00 | 0.00 | 0.00 | 22.17 | 0.00 | 25133.00 |
| TILs02_Cre_neg_PD1_pos | 0.00 | 0.00 | 0.00 | 23.64 | 0.00 | 66635.00 |
| TILs05_Cre_neg_PD1_pos | 0.00 | 0.00 | 0.00 | 31.37 | 0.00 | 70285.00 |
| TILs06_Cre_neg_PD1_pos | 0.00 | 0.00 | 0.00 | 16.61 | 0.00 | 37850.00 |
| TILs08_Cre_neg_PD1_pos | 0.00 | 0.00 | 0.00 | 32.42 | 0.00 | 47546.00 |
| TILs10_Cre_neg_PD1_pos | 0.00 | 0.00 | 0.00 | 21.72 | 0.00 | 19739.00 |
| TILs12_Cre_neg_PD1_pos | 0.00 | 0.00 | 0.00 | 38.63 | 0.00 | 143267.00 |
| TILs15_Cre_neg_PD1_pos | 0.00 | 0.00 | 0.00 | 18.49 | 0.00 | 30774.00 |
| CreNegative | 0.00 | 0.00 | 0.00 | 18.79 | 0.00 | 31939.00 |
| CrePositive | 0.00 | 0.00 | 0.00 | 25.89 | 0.00 | 57508.25 |
| TILs11_Cre_neg_PD1_pos | 0.0 | 0.0 | 0.0 | 24.2 | 0.0 | 57790.0 |

Summary statistics for Tamox Δ PD-1, indicating the mean, median and quartile information based on expression values for the different samples (files).


Supplementary Table 2. Summary statistics for CD8 Δ PD-1

| Sample | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| AC3942_CreN | 0.0 | 128.3 | 472.6 | 3894.3 | 1597.9 | 235337.4 |
| AC3944_CreN | 0.00 | 70.75 | 379.00 | 3574.56 | 965.25 | 264373.00 |
| AC3947_CreN | 0 | 89 | 589 | 4405 | 1537 | 261321 |
| AC7204_CreN | 0.00 | 36.75 | 197.50 | 2508.85 | 541.50 | 198118.00 |
| AC9392_CreN | 0.0 | 102.8 | 429.5 | 4022.0 | 1026.8 | 275505.0 |
| AC9394_CreN | 0.0 | 25.0 | 120.5 | 1578.5 | 371.0 | 135394.0 |
| AD1496_CreN | 0.00 | 36.75 | 182.00 | 1906.20 | 470.25 | 174910.00 |
| AD3330_52_CreN | 0.00 | 61.75 | 369.00 | 3280.80 | 881.25 | 226157.00 |
| AD3353_CreN | 0.00 | 22.75 | 124.00 | 1474.05 | 414.75 | 116981.00 |
| AC1449_CreP | 0.00 | 25.25 | 126.00 | 1319.53 | 306.75 | 105454.00 |
| AC3946_CreP | 0.00 | 79.75 | 347.50 | 2643.76 | 1252.50 | 175468.00 |
| AC4861_CreP | 0.0 | 130.5 | 551.0 | 4042.6 | 1504.5 | 233236.0 |
| AC4862_CreP | 0 | 89 | 375 | 2785 | 1349 | 122183 |
| AC4868_CreP | 0 | 97 | 363 | 2736 | 1265 | 148213 |
| AC7199_CreP | 0.0 | 103.5 | 324.0 | 3010.5 | 1148.0 | 196993.0 |
| AC9391_CreP | 0.0 | 113.5 | 443.0 | 3922.2 | 1359.5 | 237922.0 |
| AC9393_CreP | 0.0 | 142.5 | 590.0 | 5256.6 | 2121.0 | 301455.0 |
| AD1494_CreP | 0 | 144 | 754 | 5474 | 2358 | 296853 |
| AD3333_CreP | 0.0 | 87.8 | 461.5 | 4261.9 | 1641.2 | 323331.0 |
| AD3351_CreN | 0.00 | 40.25 | 177.50 | 1799.95 | 504.50 | 136804.00 |

Summary statistics for CD8 Δ PD-1, indicating the mean, median and quartile information based on expression values for the different samples (files).
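The six values reported per sample in Supplementary Tables 1 and 2 (minimum, quartiles, median, mean, maximum) can be reproduced with a short function. The sketch below is an illustration only: it assumes linear interpolation between data points for the quartiles (the "inclusive" method of Python's `statistics.quantiles`), which may differ from the convention used to produce the tables, and the count values are hypothetical.

```python
import statistics


def six_number_summary(values):
    """Return min, quartiles, median, mean, and max for one sample."""
    # "inclusive" interpolates between data points and treats the data
    # as containing the population minimum and maximum.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min.": min(values),
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.fmean(values),
        "3rd Qu.": q3,
        "Max.": max(values),
    }


# Hypothetical zero-heavy counts for one sample.
example = six_number_summary([0, 0, 0, 0, 12, 40])
for name, value in example.items():
    print(f"{name:8s}: {value:.2f}")
```

When most genes have zero counts, the median and even the third quartile can be zero while the mean stays above zero, which is the pattern seen in Supplementary Table 1.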


Supplementary Table 3. Top 100 genes from Tamox Δ PD-1 compared for the 3 methods

| Genes | htseq-count | Salmon NumReads | clc_counts |
|---|---|---|---|
| Actb-201 | 10897 | 10861.135 | 18177 |
| Tmsb4x-203 | 14861 | 8301.679 | 14977 |
| Gm28661-201 | 0 | 7780.919 | 0 |
| Gm29216-201 | 0 | 7137.887 | 0 |
| Tmsb4x-201 | 14861 | 6355.321 | 14977 |
| Actb-209 | 10897 | 5875.889 | 18177 |
| Eef1a1-201 | 458 | 5774.539 | 6030 |
| H2-K1-215 | 6094 | 5329.121 | 6132 |
| Actg1-204 | 0 | 5150.851 | 5144 |
| Gm10925-201 | 0 | 4641.719 | 0 |
| B2m-201 | 4730 | 4522 | 4676 |
| mt-Atp6-201 | 0 | 3883.281 | 5234 |
| Tpt1-204 | 22 | 3873.329 | 4204 |
| Rpl41-203 | 135 | 3863.693 | 3207 |
| Shisa5-201 | 3407 | 3377 | 3493 |
| H2-D1-202 | 2900 | 3058.083 | 3094 |
| Thy1-203 | 3345 | 3033.256 | 3341 |
| Rpl13a-213 | 77 | 2907.744 | 5248 |
| Rpl32-201 | 807 | 2845.669 | 2885 |
| Rpsa-206 | 0 | 2813.73 | 2333 |
| Gapdh-206 | 0 | 2774.164 | 460 |
| Gm28437-201 | 0 | 2674.917 | 0 |
| Rplp1-201 | 803 | 2632.481 | 3196 |
| Btg1-201 | 1566 | 2552 | 2683 |
| Cd52-201 | 2645 | 2518 | 2646 |
| Rps12l1-201 | 0 | 2466.78 | 0 |
| Laptm5-207 | 2317 | 2319 | 2328 |
| Gm9843-201 | 3 | 2301.188 | 1778 |
| Nkg7-202 | 2489 | 2261.203 | 2543 |
| Rpl6-208 | 10 | 2189.797 | 700 |
| Eef2-201 | 1640 | 2184 | 2260 |
| mt-Nd4-201 | 1803 | 2183 | 2333 |
| Rps23-ps1-201 | 0 | 2107.477 | 1251 |
| mt-Nd1-201 | 2232 | 2077 | 2233 |
| Ccl5-201 | 2037 | 1944 | 2094 |
| Rac2-201 | 2003 | 1939.695 | 2082 |
| Rplp0-201 | 655 | 1933.281 | 2027 |
| Gm9844-201 | 4 | 1917.227 | 20 |
| Rpl37rt-202 | 0 | 1884.72 | 55 |
| Rpl18a-204 | 179 | 1867.041 | 1675 |
| Ly6e-215 | 2027 | 1866.655 | 1980 |
| Rps15a-ps7-201 | 0 | 1844.627 | 0 |
| Rps18-209 | 612 | 1830.468 | 2075 |
| Id2-201 | 2069 | 1803.855 | 1875 |
| mt-Cytb-201 | 1804 | 1801 | 1907 |
| Rps6-ps4-201 | 0 | 1787.109 | 0 |
| Gm14303-201 | 0 | 1774.968 | 0 |
| Gm15501-202 | 0 | 1765.094 | 0 |
| Ms4a4b-201 | 3390 | 1757.096 | 3538 |
| Ly6c2-201 | 847 | 1720.012 | 1833 |
| mt-Nd2-201 | 1569 | 1703 | 1698 |
| Rps10-208 | 0 | 1693.491 | 1871 |
| Rpl37a-201 | 972 | 1638.184 | 3097 |
| Gm4149-201 | 1 | 1635.816 | 206 |
| H2-Q6-202 | 481 | 1628.098 | 586 |
| Rps21-201 | 1120 | 1594.545 | 2059 |
| H2-Q7-204 | 901 | 1580.749 | 2099 |
| Rpl39-201 | 115 | 1571.266 | 2177 |
| Trbc2-201 | 1327 | 1555.505 | 1583 |
| Rplp2-203 | 640 | 1547.743 | 2702 |
| Rpl4-201 | 1501 | 1543 | 1570 |
| Cfl1-201 | 1179 | 1539.508 | 1744 |
| Gm7536-201 | 0 | 1532.058 | 0 |
| Cd8a-202 | 2369 | 1531.134 | 2421 |
| Il2rb-201 | 1521 | 1513 | 924 |
| Rpl31-ps8-201 | 4 | 1511.643 | 0 |
| Rps5-201 | 1584 | 1492.692 | 1594 |
| Rps15-202 | 1009 | 1490 | 1722 |
| Rps16-ps2-201 | 37 | 1479.903 | 0 |
| Ms4a4b-207 | 3390 | 1468.512 | 3583 |
| Gm12918-201 | 0 | 1463.928 | 0 |
| Rps24-201 | 802 | 1412.269 | 2382 |
| Rpl26-201 | 324 | 1408.814 | 1477 |
| Gm10269-201 | 0 | 1400.75 | 61 |
| Ccnd2-201 | 1579 | 1398.218 | 1190 |
| Dusp2-201 | 1408 | 1384 | 1442 |
| Cd3g-201 | 1370 | 1366 | 1385 |
| Rpl8-201 | 1327 | 1358.37 | 1700 |
| Cd8b1-201 | 1298 | 1328.135 | 1299 |
| Rps13-ps2-201 | 4 | 1323.623 | 0 |
| Lcp1-206 | 1279 | 1308 | 751 |
| Rps14-205 | 1466 | 1300.688 | 2730 |
| Uba52-208 | 16 | 1292.992 | 793 |
| Lgals1-202 | 1627 | 1283.743 | 1649 |
| Ly6a-201 | 1092 | 1275.836 | 1291 |
| Cotl1-202 | 1430 | 1274.359 | 1451 |
| Cd3e-201 | 1410 | 1272.691 | 1434 |
| Hspa8-202 | 62 | 1228.72 | 1248 |
| Rps7-205 | 65 | 1227.342 | 1852 |
| Rps3-201 | 1218 | 1219.626 | 1328 |
| Trac-202 | 1300 | 1212 | 52 |
| Msn-201 | 1179 | 1193 | 1224 |
| Rps14-201 | 1466 | 1191.945 | 2730 |
| Cxcr6-201 | 1188 | 1189 | 31 |
| Rps2-203 | 395 | 1180.065 | 477 |
| Pkm-201 | 664 | 1174 | 852 |
| Rpl27-ps3-201 | 26 | 1170.07 | 118 |
| Ppia-201 | 0 | 1165.574 | 732 |
| Icos-202 | 1166 | 1165 | 1258 |
| Tubb5-201 | 1196 | 1159 | 1253 |

Table containing the top 100 genes from the STAR-HTSeq count pipeline compared with the results from Salmon and CLC Genomics Workbench. These data come from Tamox Δ PD-1, but the trends are comparable for CD8 Δ PD-1.


Supplementary Figure 1. Histograms of Log transformed Expression values

A: Log-transformed expression values for Tamox Δ PD-1 samples. B: Log-transformed expression values for CD8 Δ PD-1 samples.
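A log transform of count data ahead of a histogram can be sketched as follows. This is only an illustration under stated assumptions: the base (2) and the pseudocount (1, added so that zero counts remain defined) are choices made for the example and are not stated in the caption, and the counts are hypothetical.

```python
import math
from collections import Counter


def log2_plus_one(counts):
    """Apply log2(count + 1) so that zero counts map to 0 instead of -inf."""
    return [math.log2(c + 1) for c in counts]


def histogram(values, bin_width=1.0):
    """Count values per bin; keys are the left edges of the non-empty bins."""
    bins = Counter(math.floor(v / bin_width) for v in values)
    return {b * bin_width: bins[b] for b in sorted(bins)}


# Hypothetical raw counts spanning several orders of magnitude.
raw_counts = [0, 0, 1, 3, 7, 1023]
transformed = log2_plus_one(raw_counts)
hist = histogram(transformed)

# Print a small text histogram, one row per non-empty bin.
for left_edge, n in hist.items():
    print(f"[{left_edge:4.1f}, {left_edge + 1.0:4.1f})  {'#' * n}")
```

The transform compresses the long right tail of the raw counts, so highly expressed genes no longer dominate the scale of the histogram.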


Supplementary Table 4. Top molecular networks from IPA

CD8 Δ PD-1 Networks

| Condition | Molecules in Network | Top Diseases and Functions |
|---|---|---|
| High in Cre+ | ↑CCL3L3, ↑CCR5, CD3, ↑CD38, ↑CD3G, ↑CTLA4, ERK1/2, ↑GZMK, ↑Ifi27l2a/Ifi27l2b, INF alpha/beta | Endocrine System Disorders, Gastrointestinal Disease, Immunological Disease |
| High in Cre+ | 26s Proteasome, Akt, Ap1, ↑ARNTL, BCR (complex), ↑CCL5, Cdk, ↑CDK6, Ck2, Creb | Digestive System Development and Function, Gastrointestinal Disease, Organ Morphology |
| High in Cre+ | ADCY, ↑ADGRG1, ↑ADRB1, AMPK, ↑ARNT2, chemokine, ↑CXCR6, cytokine, ERK, G protein | Cardiovascular Disease, Cardiovascular System Development and Function, Organismal Injury and Abnormalities |
| High in Cre+ | ANKRD44, AR, AZGP1, ↑C15orf39, C1QTNF2, CDC37L1, CREBBP, CTCFL, DEPDC1, ↑DMXL2 | Cellular Assembly and Organization, Cellular Function and Maintenance, Molecular Transport |
| High in Cre+ | 2-mercaptoethanol, ADAMTS5, ANTXR1, ↑BAIAP3, CMIP, ↑COCH, COL4A1, DNAJC3, ERBB2, FNBP1 | Cancer, Organismal Injury and Abnormalities, Reproductive System Disease |
| High in Cre+ | T-cell alpha/beta receptor, ↑Trac | |
| High in Cre+ | FOXL2, ↑MRGPRE, PAX6 | Developmental Disorder, Hereditary Disorder, Ophthalmic Disease |
| High in Cre- | Alpha catenin, ↓BRWD3, ↓CCL24, CDK4/6, Collagen type I (complex), collagen type I (family), Cyclin D, Cyclin E, cytokine receptor, Eotaxin | Cellular Development, Embryonic Development, Hematological System Development and Function |
| High in Cre- | 26s Proteasome, Actin, ADRB, ↓AHNAK, caspase, CG, ↓CHD4, Ck2, ↓FILIP1L, ↓FOS | Cancer, Hematological Disease, Organismal Injury and Abnormalities |
| High in Cre- | ↓ABLIM1, ↓ACTN1, Akt, Alpha Actinin, CD3, Cofilin, ↓ERBIN, estrogen receptor, F actin, ↓Foxp1 | Cellular Development, Cellular Growth and Proliferation, Hematological System Development and Function |
| High in Cre- | Alp, BCR (complex), ↓CCR7, ↓Cmah, ↓EMB, ↓GZMA, HDL, ↓HIVEP1, Hsp70, IFN Beta | Cancer, Cell-To-Cell Signaling and Interaction, Hematological System Development and Function |
| High in Cre- | ADORA1, ADORA2B, ↓BAZ2B, Cacna2d, ↓CACNA2D4, COPS3, COPS6, CXCL11, Eotaxin, ↓FBXO17 | Cell-mediated Immune Response, Cellular Movement, Hematological System Development and Function |
| High in Cre- | ADA, ADORA1, AR, ATP5MF, BDNF, beta-estradiol, ↓BZW2, ↓CEP350, CXCL11, EGR4 | Cancer, Organismal Injury and Abnormalities, Tissue Morphology |
| High in Cre- | Ap1, calpain, CaMKII, collagen, Collagen(s), Creb, Ecm, ERK, Fgf, Focal adhesion kinase | Cardiovascular System Development and Function, Connective Tissue Development and Function, Organismal Development |
| High in Cre- | ADORA, ADORA2B, ↓ARL5C, chemokine, CXCL11, cytokine, ↓DAPL1, EGLN, Erm, Gpcr | Cellular Movement, Hematological System Development and Function, Inflammatory Response |
| High in Cre- | ↓APBB1IP, ENAH, ↓EVL, FYB1, LCP2, PRPF40A, SKAP1, VASP, ZDHHC17 | Cellular Assembly and Organization, Cellular Function and Maintenance, Protein Synthesis |

Tamox Δ PD-1 Networks

| Condition | Molecules in Network | Top Diseases and Functions |
|---|---|---|
| High in Cre+ | ↑AUTS2, ↑CBX2, Ccl6, CST6, CSTF1, DHX33, DHX57, ERCC2, ESR1, ESR2 | Cell Cycle, Infectious Diseases, Nucleic Acid Metabolism |
| High in Cre+ | EGR2, ↑NCMAP, ZNF106 | Cellular Development, Cellular Growth and Proliferation, Nervous System Development and Function |
| High in Cre- | ↓APOE, ↓ATF3, ↓C1QA, ↓CD74, ↓Cxcl11, ↓CYBB, Cyclin A, ERK1/2, ↓FCGR3A/FCGR3B, ↓HLA-DMB | Cell-To-Cell Signaling and Interaction, Hematological System Development and Function, Immune Cell Trafficking |
| High in Cre- | ABCB10, ADGRL1, APP, BET1L, CD44, ↓CELA1, CTNNB1, EN1, ↓FYCO1, FZD1 | Cancer, Embryonic Development, Organismal Development |
| High in Cre- | Akt, Bombesin, CIB1, Collagen type I (complex), Creb, ↓CSF1R, cytokine, ERK, GNB1L, GSTM2 | Drug Metabolism, Protein Synthesis, Small Molecule Biochemistry |
| High in Cre- | KMT2D, ↓Rsl1 (includes others) | Developmental Disorder, Hereditary Disorder, Neurological Disease |
| High in Cre- | ↓Gm11346, TET2 | Nucleic Acid Metabolism, Small Molecule Biochemistry, Tissue Morphology |
| High in Cre- | levodopa, ↓Zfp758 | Cell-To-Cell Signaling and Interaction, Cellular Assembly and Organization, Developmental Disorder |
| High in Cre- | T-cell alpha/beta receptor, ↓Trav21-dv12 | |
| High in Cre- | CPNE7, Cyp2c23, HMGCL, ↓MS4A7, SLC35F6 | Developmental Disorder, Hereditary Disorder, Metabolic Disease |

The top molecular networks for the Tamox Δ PD-1 and CD8 Δ PD-1 datasets, including the diseases and functions found. The data are separated by dataset and by Cre+/Cre- condition. Within each network, an up arrow marks an up-regulated gene and a down arrow marks a down-regulated gene.

Supplementary Figure 2. Boxplots of statistical model performance metrics

Box plots of error, accuracy, sensitivity, and specificity for each of the statistical models used to compare performance against our neural network model. The data are from the validation set as well as bootstrap samples. Each boxplot represents 100 samples.
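The four metrics shown in the boxplots can be computed from a confusion matrix, and a distribution for each can be obtained by resampling. The sketch below is only an illustration under stated assumptions: it resamples fixed predictions with replacement, whereas the bootstrap in the thesis may instead have refit the models on each resample, and the labels and predictions are hypothetical.

```python
import random


def metrics(y_true, y_pred):
    """Accuracy, error, sensitivity, and specificity for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        # Sensitivity: fraction of true positives that were recovered.
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        # Specificity: fraction of true negatives that were recovered.
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


def bootstrap_metrics(y_true, y_pred, n_boot=100, seed=0):
    """Recompute the metrics on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    results = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        results.append(metrics([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    return results


# Hypothetical labels and predictions for eight validation samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

point = metrics(y_true, y_pred)
boot = bootstrap_metrics(y_true, y_pred, n_boot=100)
print(point)
print("bootstrap accuracy range:",
      min(b["accuracy"] for b in boot), max(b["accuracy"] for b in boot))
```

Each list of 100 resampled values for a metric is the kind of input a single boxplot would summarize.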


Chapter VII.

References

Abdul Hafid, S. R., & Radhakrishnan, A. K. (2019). Palm Tocotrienol-Adjuvanted Dendritic Cells Decrease Expression of the SATB1 Gene in Murine Breast Cancer Cells and Tissues. Vaccines, 7(4), 198. doi:10.3390/vaccines7040198

Accolla, R. S., Ramia, E., Tedeschi, A., & Forlani, G. (2019). CIITA-Driven MHC Class II Expressing Tumor Cells as Antigen Presenting Cell Performers: Toward the Construction of an Optimal Anti-tumor Vaccine. Frontiers in Immunology, 10(1806). doi:10.3389/fimmu.2019.01806

Altman, M., & King, G. (2007). A proposed standard for the scholarly citation of quantitative data. D-lib Magazine, 13.3/4. Retrieved from http://www.dlib.org/dlib/march07/altman/03altman.html

Anders, S., Pyl, P. T., & Huber, W. (2015). HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics, 31(2), 166-169. doi:10.1093/bioinformatics/btu638

Anders, S., Reyes, A., & Huber, W. (2012). Detecting differential usage of exons from RNA-seq data. Genome Research, 22(10), 2008-2017. doi:10.1101/gr.133744.111

Andrews, S. FastQC: A Quality Control tool for High Throughput Sequence Data. Retrieved from http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Anestakis, D., Petanidis, S., Kalyvas, S., Nday, C. M., Tsave, O., Kioseoglou, E., & Salifoglou, A. (2015). Mechanisms and applications of interleukins in cancer immunotherapy. International Journal of Molecular Sciences, 16(1), 1691-1710. doi:10.3390/ijms16011691

Ankita Shukla, G. S. C., Saket Anand. (2018). Semi-Supervised Clustering with Neural Networks. arXiv, 1806.01547. Retrieved from https://arxiv.org/abs/1806.01547

Ashton, T. M., McKenna, W. G., Kunz-Schughart, L. A., & Higgins, G. S. (2018). Oxidative Phosphorylation as an Emerging Target in Cancer Therapy. Clinical Cancer Research, 24(11), 2482-2490. doi:10.1158/1078-0432.Ccr-17-3070

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., . . . Soboleva, A. (2012). NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Research, 41(D1), D991-D995. doi:10.1093/nar/gks1193

Benonisson, H., Altıntaş, I., Sluijter, M., Verploegen, S., Labrijn, A. F., Schuurhuis, D. H., . . . van Hall, T. (2019). CD3-Bispecific Antibody Therapy Turns Solid Tumors into Inflammatory Sites but Does Not Install Protective Memory. Molecular Cancer Therapeutics, 18(2), 312-322. doi:10.1158/1535-7163.Mct-18-0679

Buchbinder, E. I., Dutcher, J. P., Daniels, G. A., Curti, B. D., Patel, S. P., Holtan, S. G., . . . McDermott, D. F. (2019). Therapy with high-dose Interleukin-2 (HD IL-2) in metastatic melanoma and renal cell carcinoma following PD1 or PDL1 inhibition. J Immunother Cancer, 7(1), 49. doi:10.1186/s40425-019-0522-3


Bult, C. J., Blake, J. A., Smith, C. L., Kadin, J. A., Richardson, J. E., & the Mouse Genome Database Group. (2018). Mouse Genome Database (MGD) 2019. Nucleic Acids Research, 47(D1), D801-D806. doi:10.1093/nar/gky1056

Burke, W. (2002). Genetic Testing. New England Journal of Medicine, 347(23), 1867-1875. doi:10.1056/NEJMoa012113

Caudana, P., Núñez, N. G., De La Rochere, P., Pinto, A., Denizeau, J., Alonso, R., . . . Piaggio, E. (2019). IL2/Anti-IL2 Complex Combined with CTLA-4, But Not PD-1, Blockade Rescues Antitumor NK Cell Function by Regulatory T-cell Modulation. Cancer Immunology Research, 7(3), 443-457. doi:10.1158/2326-6066.Cir-18-0697

Charlier, C. V. L., Bortkiewicz, L. v., & Greenwood, J. A. (1947). Elements of mathematical statistics, also L.v. Bortkiewicz, Table of Poisson's frequency function. Cambridge, Mass.

Chen, E. Y., Tan, C. M., Kou, Y., Duan, Q., Wang, Z., Meirelles, G. V., . . . Ma’ayan, A. (2013). Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics, 14(1), 128. doi:10.1186/1471-2105-14-128

Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., Kalinin, A. A., Do, B. T., Way, G. P., . . . Greene, C. S. (2018). Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface, 15(141). doi:10.1098/rsif.2017.0387

Chollet, F., et al. (2015). Keras. Retrieved from https://keras.io

Clark, A. G., Glanowski, S., Nielsen, R., Thomas, P. D., Kejariwal, A., Todd, M. A., . . . Cargill, M. (2003). Inferring Nonneutral Evolution from Human-Chimp-Mouse Orthologous Gene Trios. Science, 302(5652), 1960-1963. doi:10.1126/science.1088821

Dan, L., Anthony, F., Dennis, K., Stephen, H., Jameson, J., & Joseph, L. (2012). Harrison's Manual of Medicine, 18th Edition: McGraw-Hill Professional.

Danuser, G. (2010). Orchestra: A high performance biomedical supercomputing collaborative. MA: National Center for Research Resources.

Davisson MT, C. S., Eicher EM. (1999). The first spontaneous mutation in the mouse Herc2 gene, MGI Direct Data Submission to Mouse Genome Database (MGD). The Jackson Laboratory. Retrieved from http://www.informatics.jax.org

Deo, R. C. (2015). Machine Learning in Medicine. Circulation, 132(20), 1920-1930. doi:10.1161/CIRCULATIONAHA.115.001593

Dillies, M.-A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., . . . on behalf of The French StatOmique Consortium. (2012). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics, 14(6), 671-683. doi:10.1093/bib/bbs046

Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., . . . Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15-21. doi:10.1093/bioinformatics/bts635

Du, D., Yuan, J., Ma, W., Ning, J., Weinstein, J. N., Yuan, X., . . . Liu, Y. (2018). Clinical significance of FBXO17 gene expression in high-grade glioma. BMC Cancer, 18(1), 773. doi:10.1186/s12885-018-4680-3

Efron, B., & Tibshirani, R. (1993). An Introduction to the bootstrap. New York: Chapman & Hall.


Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., & Thrun, S. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542, 115. doi:10.1038/nature21056

FDA. (2015). FDA approves Yervoy to reduce the risk of melanoma returning after surgery. Silver Spring, MD: U.S. Food and Drug Administration. Retrieved from https://web.archive.org/web/20151029103812/http://www.fda.gov/NewsEvents/Newsroom/PressAnnouncements/ucm469944.htm

Francis P. Roche, E. K. O., Magnus Essand, Lena Claesson-Welsh. (2014). 630. Histidine-Rich Glycoprotein (HRG): A Novel Gene-Therapy Effector for the Treatment of Cancer. Molecular Therapy, 22, S243-S244. doi:10.1016/S1525-0016(16)35643-X

Francisco, L. M., Sage, P. T., & Sharpe, A. H. (2010). The PD-1 pathway in tolerance and autoimmunity. Immunol Rev, 236, 219-242. doi:10.1111/j.1600-065X.2010.00923.x

Freeman, G. J., & Sharpe, A. H. (2012). A new therapeutic strategy for malaria: targeting T cell exhaustion. Nature Immunology, 13(2), 113-115. doi:10.1038/ni.2211

Gabriel, J. A. (2007). The Biology of Cancer. Hoboken, UNITED KINGDOM: John Wiley & Sons, Incorporated.

Ganesh, B. B., Bhattacharya, P., Gopisetty, A., & Prabhakar, B. S. (2011). Role of cytokines in the pathogenesis and suppression of thyroid autoimmunity. Journal of Interferon & Cytokine Research: the official journal of the International Society for Interferon and Cytokine Research, 31(10), 721-731. doi:10.1089/jir.2011.0049

Godec, J., Tan, Y., Liberzon, A., Tamayo, P., Bhattacharya, S., Butte, A. J., . . . Haining, W. N. (2016). Compendium of Immune Signatures Identifies Conserved and Species-Specific Biology in Response to Inflammation. Immunity, 44(1), 194-206. doi:10.1016/j.immuni.2015.12.006

Gong, J., Chehrazi-Raffle, A., Reddi, S., & Salgia, R. (2018). Development of PD-1 and PD-L1 inhibitors as a form of cancer immunotherapy: a comprehensive review of registration trials and future considerations. J Immunother Cancer, 6(1), 8. doi:10.1186/s40425-018-0316-z

Graham, D. B., Luo, C., O’Connell, D. J., Lefkovith, A., Brown, E. M., Yassour, M., . . . Xavier, R. J. (2018). Antigen discovery and specification of immunodominance hierarchies for MHCII-restricted epitopes. Nature Medicine, 24(11), 1762-1772. doi:10.1038/s41591-018-0203-7

Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanaswamy, A., . . . Webster, D. R. (2016). Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA, 316(22), 2402-2410. doi:10.1001/jama.2016.17216

Härdle, W. K., & Vogt, A. B. (2015). Ladislaus von Bortkiewicz—Statistician, Economist and a European Intellectual. International Statistical Review, 83(1), 17-35. doi:10.1111/insr.12083

Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction (2nd ed.). New York, NY: Springer.


Huang, D. W., Sherman, B. T., & Lempicki, R. A. (2008). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protocols, 4(1), 44-57. doi:10.1038/nprot.2008.211

Hunt, S. E., McLaren, W., Gil, L., Thormann, A., Schuilenburg, H., Sheppard, D., . . . Cunningham, F. (2018). Ensembl variation resources. Database, 2018. doi:10.1093/database/bay119

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

Ignatiadis, N., Klaus, B., Zaugg, J. B., & Huber, W. (2016). Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nature Methods, 13(7), 577-580. doi:10.1038/nmeth.3885

Im, S. J., Hashimoto, M., Gerner, M. Y., Lee, J., Kissick, H. T., Burger, M. C., . . . Ahmed, R. (2016). Defining CD8+ T cells that provide the proliferative burst after PD-1 therapy. Nature, 537(7620), 417-421. doi:10.1038/nature19330

National Cancer Institute. (2015). Milestones in Cancer Research and Discovery. Retrieved from https://www.cancer.gov/research/progress/250-years-milestones

Jassal, B., Matthews, L., Viteri, G., Gong, C., Lorente, P., Fabregat, A., . . . D'Eustachio, P. (2020). The reactome pathway knowledgebase. Nucleic Acids Res, 48(D1), D498-D503. doi:10.1093/nar/gkz1031

Jiang, H., & Wong, W. H. (2009). Statistical inferences for isoform expression in RNA-Seq. Bioinformatics, 25(8), 1026-1032. doi:10.1093/bioinformatics/btp113

Juneja, V. R. (2017). The role of the PD-1 pathway in the tumor microenvironment. (Ph.D. in Medical Engineering and Medical Physics). Massachusetts Institute of Technology, Cambridge MA. Retrieved from http://hdl.handle.net/1721.1/108961 (1721.1/7582)

Juneja, V. R., McGuire, K. A., Manguso, R. T., LaFleur, M. W., Collins, N., Haining, W. N., . . . Sharpe, A. H. (2017). PD-L1 on tumor cells is sufficient for immune evasion in immunogenic tumors and inhibits CD8 T cell cytotoxicity. The Journal of Experimental Medicine, 214(4), 895-904. doi:10.1084/jem.20160801

Kadoki, M., Patil, A., Thaiss, C. C., Brooks, D. J., Pandey, S., Deep, D., . . . Chevrier, N. (2017). Organism-Level Analysis of Vaccination Reveals Networks of Protection across Tissues. Cell, 171(2), 398-413 e321. doi:10.1016/j.cell.2017.08.024

Kamphorst, A. O., Wieland, A., Nasti, T., Yang, S., Zhang, R., Barber, D. L., . . . Ahmed, R. (2017). Rescue of exhausted CD8 T cells by PD-1-targeted therapies is CD28-dependent. Science, 355(6332), 1423-1427. doi:10.1126/science.aaf0683

Kanehisa, M., & Goto, S. (2000). KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res, 28(1), 27-30. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/10592173

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges. PLOS Computational Biology, 8(2), e1002375. doi:10.1371/journal.pcbi.1002375

Khosravi, P., Kazemi, E., Imielinski, M., Elemento, O., & Hajirasouliha, I. (2017). Deep Convolutional Neural Networks Enable Discrimination of Heterogeneous Digital Pathology Images. EBioMedicine, 27, 317-328. doi:10.1016/j.ebiom.2017.12.026

Kim, J. H., & Kim, N. (2016). Signaling Pathways in Osteoclast Differentiation. Chonnam Med J, 52(1), 12-17. doi:10.4068/cmj.2016.52.1.12


Korpelainen, E., Tuimala, J., Somervuo, P., Huss, M., & Wong, G. (2015). RNA-seq data analysis: a practical approach. Boca Raton: CRC Press, Taylor & Francis Group.

Korsunsky, I., Millard, N., Fan, J., Slowikowski, K., Zhang, F., Wei, K., . . . Raychaudhuri, S. (2019). Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 16(12), 1289-1296. doi:10.1038/s41592-019-0619-0

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Paper presented at the NeurIPS.

Krokan, H. E., & Bjørås, M. (2013). Base excision repair. Cold Spring Harb Perspect Biol, 5(4), a012583. doi:10.1101/cshperspect.a012583

Kuleshov, M. V., Jones, M. R., Rouillard, A. D., Fernandez, N. F., Duan, Q., Wang, Z., . . . Ma'ayan, A. (2016). Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Research, 44(W1), W90-W97. doi:10.1093/nar/gkw377

Lappano, R., & Maggiolini, M. (2011). G protein-coupled receptors: novel targets for drug discovery in cancer. Nature Reviews Drug Discovery, 10(1), 47-60. doi:10.1038/nrd3320

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., . . . 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078-2079. doi:10.1093/bioinformatics/btp352

Liberzon, A., Birger, C., Thorvaldsdóttir, H., Ghandi, M., Mesirov, J. P., & Tamayo, P. (2015). The Molecular Signatures Database Hallmark Gene Set Collection. Cell Systems, 1(6), 417-425. doi:10.1016/j.cels.2015.12.004

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdóttir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics (Oxford, England), 27(12), 1739-1740. doi:10.1093/bioinformatics/btr260

Lin, C., Jain, S., Kim, H., & Bar-Joseph, Z. (2017). Using neural networks for reducing the dimensions of single-cell RNA-Seq data. Nucleic Acids Research, 45(17), e156. doi:10.1093/nar/gkx681

Lindner, R., & Friedel, C. C. (2012). A comprehensive evaluation of alignment algorithms in the context of RNA-seq. PLoS ONE, 7(12), e52403. doi:10.1371/journal.pone.0052403

Love, M. I. (2013). Statistical analysis of high-throughput sequencing count data. Retrieved from http://dx.doi.org/10.17169/refubium-9391

Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550. doi:10.1186/s13059-014-0550-8

Maaten, L. v. d. (2014). Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research, 15, 3221-3245. Retrieved from http://jmlr.org/papers/v15/vandermaaten14a.html

Abadi, M., Barham, P., Chen, J., Chen, Z., . . . Zheng, X. (2016). TensorFlow: a system for large-scale machine learning. Paper presented at the Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation, Savannah, GA, USA.


Riedmiller, M., & Braun, H. (1993). A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm. Paper presented at the IEEE International Conference on Neural Networks. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.21.1417
Mi, H., Muruganujan, A., Casagrande, J. T., & Thomas, P. D. (2013). Large-scale gene function analysis with the PANTHER classification system. Nature Protocols, 8, 1551. doi:10.1038/nprot.2013.092
Miao, Z., Gaynor, K. M., Wang, J., Liu, Z., Muellerklein, O., Norouzzadeh, M. S., . . . Getz, W. M. (2019). Insights and approaches using deep learning to classify wildlife. Scientific Reports, 9(1), 8137. doi:10.1038/s41598-019-44565-w
Moen, E., Bannon, D., Kudo, T., Graf, W., Covert, M., & Van Valen, D. (2019). Deep learning for cellular image analysis. Nature Methods, 16(12), 1233-1246. doi:10.1038/s41592-019-0403-1
Mohanty, S. K., Yagiz, K., Pradhan, D., Luthringer, D. J., Amin, M. B., Alkan, S., & Cinar, B. (2017). STAT3 and STAT5A are potential therapeutic targets in castration-resistant prostate cancer. Oncotarget, 8(49), 85997-86010. doi:10.18632/oncotarget.20844
Murphy, K., Travers, P., Walport, M., & Janeway, C. (2012). Janeway's immunobiology (8th ed.). New York: Garland Science.
Nakagawa, S., Johnson, P. C. D., & Schielzeth, H. (2017). The coefficient of determination R2 and intra-class correlation coefficient from generalized linear mixed-effects models revisited and expanded. Journal of The Royal Society Interface, 14(134), 20170213. doi:10.1098/rsif.2017.0213
NCI. (2019). Immunotherapy Side Effects. Retrieved from https://www.cancer.gov/about-cancer/treatment/types/immunotherapy/side-effects
Okonechnikov, K., Conesa, A., & García-Alcalde, F. (2016). Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics (Oxford, England), 32(2), 292-294. doi:10.1093/bioinformatics/btv566
Park, S.-J., Yoon, B.-H., Kim, S.-K., & Kim, S.-Y. (2019). GENT2: an updated gene expression database for normal and tumor tissues. BMC Medical Genomics, 12(5), 101. doi:10.1186/s12920-019-0514-7
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods, 14, 417. doi:10.1038/nmeth.4197
Powell, J. A. C. (2014). GO2MSIG, an automated GO based multi-species gene set generator for gene set enrichment analysis. BMC Bioinformatics, 15, 146-146. doi:10.1186/1471-2105-15-146
Qiagen. (2019). CLC Genomics Workbench 11: Qiagen. Retrieved from https://www.qiagenbioinformatics.com/
Ribas, A. (2012). Tumor Immunotherapy Directed at PD-1. New England Journal of Medicine, 366(26), 2517-2519. doi:10.1056/NEJMe1205943
Venables, W. N., & Ripley, B. D. (2002a). Modern Applied Statistics with S (4th ed.). Springer. Retrieved from http://www.stats.ox.ac.uk/pub/MASS4


Venables, W. N., & Ripley, B. D. (2002b). Modern Applied Statistics with S (4th ed.). New York: Springer.
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140. doi:10.1093/bioinformatics/btp616
Robinson, M. D., & Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3), R25. doi:10.1186/gb-2010-11-3-r25
Rolin, J., & Maghazachi, A. A. (2011). Effects of lysophospholipids on tumor microenvironment. Cancer Microenviron, 4(3), 393-403. doi:10.1007/s12307-011-0088-1
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., . . . Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3), 211-252. doi:10.1007/s11263-015-0816-y
Rydzewska, M., Jaromin, M., Pasierowska, I. E., Stożek, K., & Bossowski, A. (2018). Role of the T and B lymphocytes in pathogenesis of autoimmune thyroid diseases. Thyroid Research, 11(1), 2. doi:10.1186/s13044-018-0046-9
Sadanandan, S. K., Ranefall, P., Le Guyader, S., & Wählby, C. (2017). Automated Training of Deep Convolutional Neural Networks for Cell Segmentation. Scientific Reports, 7(1), 7860. doi:10.1038/s41598-017-07599-6
Safe, S., Lee, S.-O., & Jin, U.-H. (2013). Role of the aryl hydrocarbon receptor in carcinogenesis and potential as a drug target. Toxicological Sciences, 135(1), 1-16. doi:10.1093/toxsci/kft128
Sharpe, A. H. (2017). Introduction to checkpoint inhibitors and cancer immunotherapy. Immunological Reviews, 276(1), 5-8. doi:10.1111/imr.12531
Sharpe, A. H., & Pauken, K. E. (2017). The diverse functions of the PD1 inhibitory pathway. Nature Reviews Immunology, 18, 153. doi:10.1038/nri.2017.108
Siegel, R. L., Miller, K. D., & Jemal, A. (2019). Cancer statistics, 2019. CA: A Cancer Journal for Clinicians, 69(1), 7-34. doi:10.3322/caac.21551
Soneson, C., Love, M., & Robinson, M. (2016). Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences [version 2; peer review: 2 approved]. F1000Research, 4(1521). doi:10.12688/f1000research.7563.2
Fritsch, S., Guenther, F., Wright, M. N., Suling, M., & Mueller, S. M. (2019). neuralnet: Training of Neural Networks (Version 1.44.2): CRAN. Retrieved from https://CRAN.R-project.org/package=neuralnet
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., . . . Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43), 15545-15550. doi:10.1073/pnas.0506580102
R Core Team. (2014). R: A Language and Environment for Statistical Computing (Version 3.0.3). Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/
The Gene Ontology Consortium. (2018). The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Research, 47(D1), D330-D338. doi:10.1093/nar/gky1055


Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., . . . Pachter, L. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7(3), 562-578. Retrieved from http://dx.doi.org/10.1038/nprot.2012.016
Urruticoechea, A., Alemany, R., Balart, J., Villanueva, A., Vinals, F., & Capella, G. (2010). Recent Advances in Cancer Therapy: An Overview. Current Pharmaceutical Design, 16(1), 3-10. doi:10.2174/138161210789941847
Way, G. P., & Greene, C. S. (2018). Bayesian deep learning for single-cell analysis. Nature Methods, 15(12), 1009-1010. doi:10.1038/s41592-018-0230-9
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis: Springer-Verlag. Retrieved from https://ggplot2.tidyverse.org
Wingett, S. W., & Andrews, S. (2018). FastQ Screen: A tool for multi-genome mapping and quality control. F1000Research, 7, 1338-1338. doi:10.12688/f1000research.15931.2
Yang, F., Foekens, J. A., Yu, J., Sieuwerts, A. M., Timmermans, M., Klijn, J. G., . . . Jiang, Y. (2006). Laser microdissection and microarray analysis of breast tumors reveal ER-alpha related genes and pathways. Oncogene, 25(9), 1413-1419. doi:10.1038/sj.onc.1209165
Yi, X., Walia, E., & Babyn, P. (2018). Generative Adversarial Network in Medical Imaging: A Review. arXiv e-prints. Retrieved from https://ui.adsabs.harvard.edu/abs/2018arXiv180907294Y
Yoshimoto, T., Morishima, N., Okumura, M., Chiba, Y., Xu, M., & Mizuguchi, J. (2009). Interleukins and cancer immunotherapy. Immunotherapy, 1(5), 825-844. doi:10.2217/imt.09.46
Yu, J., Gu, X., & Yi, S. (2016). Ingenuity Pathway Analysis of Gene Expression Profiles in Distal Nerve Stump following Nerve Injury: Insights into Wallerian Degeneration. Frontiers in Cellular Neuroscience, 10(274). doi:10.3389/fncel.2016.00274
Liu, Y., Gadepalli, K., Norouzi, M., Dahl, G. E., Kohlberger, T., Boyko, A., Venugopalan, S., Timofeev, A., Nelson, P. Q., Corrado, G. S., Hipp, J. D., Peng, L., & Stumpe, M. C. (2017). Detecting Cancer Metastases on Gigapixel Pathology Images. CoRR, abs/1703.02442. Retrieved from http://arxiv.org/abs/1703.02442
Zheng, J., & Wang, K. (2019). Emerging deep learning methods for single-cell RNA-seq data analysis. Quantitative Biology, 7(4), 247-254. doi:10.1007/s40484-019-0189-2
