Genetics: Early Online, published on July 27, 2019 as 10.1534/genetics.119.302463

Predicting phenotypic diversity from molecular and genetic data

Tom Harel, Naama Peshes-Yaloz, Eran Bacharach, Irit Gat-Viks* School of Molecular Cell Biology and Biotechnology, Department of Cell Research and Immunology, The George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel

* Corresponding author: [email protected] (IGV)

ABSTRACT Despite the importance of complex phenotypes, an in-depth understanding of the combined molecular and genetic effects on a phenotype has yet to be achieved. Here we introduce InPhenotype, a novel computational approach for complex phenotype prediction, where gene-expression data and genotyping data are integrated to yield quantitative predictions of complex physiological traits. Unlike existing computational methods, InPhenotype makes it possible to model potential regulatory interactions between gene expression and genomic loci without compromising the continuous nature of the molecular data. We applied InPhenotype to synthetic data, exemplifying its utility for different data parameters, as well as its superiority compared to current methods in both prediction quality and the ability to detect regulatory interactions of genes and genomic loci. Finally, we show that InPhenotype can provide biological insights on both mouse and yeast datasets.

KEYWORDS complex traits; genetics; gene expression; computational modeling

Copyright 2019.

Understanding the mechanisms underlying complex diseases presents a substantial challenge. Among the most successful and widely used approaches are genome-wide association studies (GWAS), in which genotyping (or DNA-sequencing) information is utilized to systematically investigate the genetic basis of phenotypic diversity (Visscher et al. 2017). Such studies are focused on genetic information that is relevant to every tissue, condition and time point; however, genetic data are static and thus cannot capture dynamic, epigenetic or environmental factors. An alternative strategy is to use high-throughput molecular data, such as mRNA-sequencing, to uncover relationships between molecular and phenotypic diversity in a population of individuals (Asyali et al. 2006). This alternative is valuable in two ways: first, it is applicable to both qualitative traits (through classification methods) and quantitative traits (through regression methods), and secondly, the molecular data naturally encapsulate a variety of underlying epigenetic, environmental and developmental effects. However, since this strategy requires prior selection of a specific tissue and experimental condition, it is of limited utility in the case of in-vivo clinical outcomes that are commonly associated with pathological alterations in multiple tissues and organs.

Several studies have shown that by integrating gene expression with genotyping of single-nucleotide polymorphic (SNP) sites, it is possible to achieve a better quality of phenotype prediction than that obtained by analysis of only one data type at a time (e.g., Ruderfer et al. 2009). Most attempts to use a combined transcription-genotyping predictor have relied on regression models, such as LASSO and Ridge regression (Takagi et al. 2014), elastic net (CAMELOT) (Chen et al. 2009), and Bayesian mixed regression (Bhattacharjee and Sillanpää 2011). However, owing to the large number of expressed genes and SNP variables, it is not feasible to include all potential interactions within these regression-based models. To account for those interactions, alternative phenotype-prediction approaches have been based on decision tree models of either a single tree (Lee et al. 2006) or multiple trees (Chen and Zhang 2013), with two main caveats. First, the internal nodes of the trees store the split functions based on gene-expression and genotyping data, while each leaf node of a tree provides the most probable answer (Figure S1). Such models are prone to overfitting since they represent any type of interaction (gene-gene, gene-locus, and locus-locus interactions). Secondly, since these methods typically discretize the expression data within the split nodes, such approaches do not realize the full potential of the quantitative expression measurements.

Here, we model quantitative clinical outcomes using the framework of regression trees (Quinlan 1992; Criminisi 2011) in which discrete (qualitative) values (here, SNP genotyping) are used within the split nodes, whereas quantitative expression-level measurements of a given gene are used as regression-based predictors associated with each leaf node (see illustration in Figure 1, right). To account for multiple genes, the model consists of a large collection of trees where each tree represents the expression data of a single gene. To obtain generalization and robustness, we apply the 'random regression forest' framework (Criminisi 2011): for each single gene the model consists of an ensemble of randomly trained regression trees, where each of these trees is randomly different and essentially decorrelated from the other trees of the same gene (Figure 1, left). This methodology, referred to as 'InPhenotype', combines the advantage of regression-based methods (as it exploits the quantitative nature of gene-expression data) with the advantages of decision tree-based methods (by considering gene-locus interactions) while maintaining a reasonable complexity (by limiting gene-gene and locus-locus interactions), therefore tackling the main caveats of previous methods.

InPhenotype builds on a forest model that carries several algorithmic adjustments, making it possible to fit the special characteristics of gene expression and genotyping data. For instance, the forest prediction is not a simple average of all leaf-node predictions; instead, we use a weighting scheme that considers the generalization ability of the different leaf nodes. As another example, split nodes store different test functions, depending on the genetic landscape of the relevant organism. Furthermore, the particular construction of the model allows the extraction of gene-locus regulatory interactions that are believed to be jointly important in predicting the outcome. We applied synthetic data analysis to demonstrate the advantages of the various algorithmic adjustments and to further show the superiority of InPhenotype over existing methods in terms of its prediction accuracy and of its ability to reveal gene-SNP interactions that have a high impact on the outcome. As a proof of concept, we applied InPhenotype to real data of two biological systems: growth diversity following rapamycin treatment in yeast, and susceptibility to influenza infection in mice. Most notably, in both yeast and mouse, genes and SNPs were found to pair with high joint importance to phenotypic diversity, showing surprising modular organization: many of the identified SNPs acted jointly with multiple genes having a similar biological function.

Materials and Methods

Overview of the InPhenotype algorithm

We have developed InPhenotype, a methodology for integrating discrete and continuous data types to predict quantitative traits based on the random regression forest framework (Quinlan 1992; Criminisi 2011). InPhenotype's input includes discrete genetic data (genotyping) of multiple genomic loci, any continuous molecular data (such as gene-expression levels) of multiple molecular entities (such as genes and lipids), and a quantitative phenotype. Each of these data types is collected across a population of individuals. For simplicity, the following description of the InPhenotype method refers to the genotyping of SNPs and the expression of genes. In the following, we first describe the 'InPhenotype forest' model and then describe the three main phases of the algorithm: training, testing, and interpretability. During the training phase, the forest is constructed using a certain objective function. In the testing phase, the input genotyping and molecular data of an unseen individual are used to predict the outcome phenotype from the forest (based on a certain forest prediction model). The third phase is "interpretability": we interpret the model by asking relevant biological questions. In particular, we ask which gene-locus pairs make major joint contributions to the forest's performance. In the following, we describe our specific formulation of each component.

The InPhenotype forest model

To model a single gene, the InPhenotype algorithm performs a non-linear regression that builds on hierarchical partitioning organized in a regression tree model. The latter is a tree in which each 'split node' tests the incoming genotype of a certain genomic locus, and each 'terminal node' (leaf) is a predictor model in the form of a linear regression between the expression values of the gene and the quantitative phenotype. More specifically, each split node is a binary test function, and individuals arriving at the node are sent to either the right or the left child node. In each split node, only a single genomic locus is used for testing. In the case of homozygous individuals (such as inbred mouse strains and yeast), the two possible genotypes of a given locus (e.g., 'AA' and 'GG') indicate 'right' or 'left'. In the case of heterozygosity, each genomic locus specifies several possible 'genotypic partitions', such as 'AA' for the 'right' child and 'AG'-'GG' for the 'left' child, or alternatively, both 'AA' and 'AG' can indicate the 'left' child. For terminal nodes, InPhenotype utilizes a regression model in which the quantitative phenotype is the response variable and the expression of a certain gene is the predictor variable. Each tree therefore refers to a hierarchical piecewise regression model of a single gene, referred to as a 'gene tree' (Figure 1, right). The overall 'InPhenotype forest' model consists of a collection of trees for each gene (Figure 1, left).

The InPhenotype tree tackles the main challenges in existing phenotype-modeling methods. First, it exploits the quantitative nature of expression data within the leaf regression model, unlike the current regression trees that have been applied in a biological context, which typically discretize the molecular measurements into binary split-node decisions while modeling the leaves independently of the input data (using a Gaussian model; Figure S1; Chen et al. 2009; Chen and Zhang 2013). Secondly, it provides the ability to model gene-locus interactions (e.g., SNP j - gene A in Figure 1), unlike regularized regression-based methods (Takagi et al. 2014). Thirdly, previous random regression forests have typically utilized the entire set of variables (here, both genotyping and expression data) in the binary split functions, thereby accounting for any type of interaction (here, gene-gene, gene-locus, and locus-locus; Quinlan 1992; Chen and Zhang 2013). InPhenotype, in contrast, models each gene in a different collection of trees, thereby reducing the search space due to the absence of many gene-gene and locus-locus interactions. Such simplification of the model is particularly important in the case of relatively small datasets (as in the case of the real data in this study).
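The gene-tree structure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names (`Leaf`, `SplitNode`), the function `predict_gene_tree`, the locus name `snp_j` and all coefficient values are hypothetical and chosen only for illustration. The sketch assumes that each split node tests the genotype of a single locus and that each leaf holds a simple linear regression of the phenotype on the expression of one gene.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Union


@dataclass
class Leaf:
    """Terminal node: linear regression of phenotype on one gene's expression."""
    intercept: float
    slope: float

    def predict(self, expression: float) -> float:
        return self.intercept + self.slope * expression


@dataclass
class SplitNode:
    """Split node: binary test on the genotype of a single genomic locus."""
    locus: str
    left_genotypes: FrozenSet[str]  # genotypic partition sent to the left child
    left: Union["SplitNode", Leaf]
    right: Union["SplitNode", Leaf]


def predict_gene_tree(node: Union[SplitNode, Leaf],
                      genotypes: Dict[str, str],
                      expression: float) -> float:
    """Route an individual from the root to a leaf, then apply the leaf regression."""
    while isinstance(node, SplitNode):
        if genotypes[node.locus] in node.left_genotypes:
            node = node.left
        else:
            node = node.right
    return node.predict(expression)


# Hypothetical gene tree with one split: 'AG' and 'GG' go left, 'AA' goes right.
tree = SplitNode(
    locus="snp_j",
    left_genotypes=frozenset({"AG", "GG"}),
    left=Leaf(intercept=1.0, slope=2.0),
    right=Leaf(intercept=0.5, slope=-1.0),
)

left_prediction = predict_gene_tree(tree, {"snp_j": "AG"}, expression=2.0)
right_prediction = predict_gene_tree(tree, {"snp_j": "AA"}, expression=2.0)
```

In this toy tree the two leaves differ in both intercept and slope, so the same expression value yields different predicted phenotypes depending on the genotype at the tested locus, which is the piecewise behavior that a gene tree is meant to capture.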

The training phase

Reconstruction of a gene tree. The tree is trained by optimizing a binary test function in each split node. The binary function implies a separation of the individuals into the 'right' and 'left' child based on a certain locus and therefore determines a regression model in each of the child nodes. To optimize the binary test function for a given split node, we therefore evaluate the 'prediction gain' for each candidate genomic locus and then choose the locus with the best gain. The 'prediction gain' is formalized as the ratio between the regression quality after and before the split:

$$\text{Prediction gain} = \frac{R_{i,L}^2 + R_{i,R}^2}{R_i^2} \qquad \text{(Eq. 1)}$$

where $R_i^2$, $R_{i,L}^2$ and $R_{i,R}^2$ are the coefficients of determination of the regression at the split node i and at its left and right child nodes, respectively. The iterative splits are terminated in a given node when one of two criteria is reached: either the minimal number of individuals that are sent to each of the two child nodes is lower than a certain cutoff, or alternatively, the optimal prediction gain (across all genomic loci) is lower than a defined cutoff.

Reconstruction of the random regression forest. The random forest consists of k trees for each gene. The diversity between those trees is generated using the standard random forest methodology (Breiman 2001): each tree is trained using a subset of genomic loci (selected randomly with a probability of 1/3) and a bootstrap sample (random selection of individuals with repetitions). Individuals not used in the training of a given tree ('out-of-bag' individuals) provide valuable information about the generalization of trees and genes in the model. First, the generalization of a tree is defined as the inverse root-mean-squared error (RMSE), calculated separately for each leaf using its out-of-bag individuals and then averaged across all leaves. The RMSE of a leaf is calculated as $\sqrt{\sum_j (y_j - \hat{y}_j)^2}$, where j is an out-of-bag individual arriving at that leaf, $y_j$ is the measured phenotype of individual j and $\hat{y}_j$ is the expected phenotype of individual j given the regression model of the leaf. Secondly, the generalization of a given gene is defined as the average generalization of its collection of trees.

Reconstruction of the InPhenotype model. The training of the InPhenotype model consists of three stages. In stage 1, we utilize the input data types to construct a random regression forest, as detailed above (termed the 'real forest'). Stage 2 calculates a gene score for each gene in the real forest based on comparison with null data. To achieve this, the input data are permuted and then used to train an additional forest (the 'null forest') using the same procedure. Data are permuted by shuffling the annotation of individuals both for the expression measurements of each gene and for the phenotypic values. For each gene, a non-parametric unpaired Wilcoxon test is used to evaluate the difference between the distribution of its generalization scores (across all trees of the relevant gene) in the real forest and in the null forest. The −log P-value of this Wilcoxon test is referred to as the final 'gene score'. In stage 3, we improve the real forest through feature selection based on these gene scores. In particular, the InPhenotype forest consists of all trees associated with the top-scored genes; the remaining trees are excluded from the final InPhenotype model. The number of selected genes is determined by one of two approaches. The naive approach is to use a predetermined number of top-scoring genes that would be retained in the final InPhenotype model (based on the gene-score metric). Alternatively, it is possible to select the significant genes, relying on the assessment of gene significance in the "interpretability" phase. The reconstruction process is illustrated in Figure S2A.

The testing phase

Given the gene expression and genotyping data for a previously unseen individual, our goal is to predict the phenotype of this individual. To address this, we first calculate the prediction using every single tree and then calculate a weighted average of predictions over all trees. Prediction based on a single tree, denoted $PV_t$, is calculated by applying the binary genotyping test in the split node of the root, sending the individual to the right or the left child node based on the result of this test, and repeating the process until a leaf node is reached; $PV_t$ is calculated using the linear regression model of that leaf. The prediction based on the entire InPhenotype forest model, here denoted $PV_f$, is calculated by

$$PV_f = \frac{1}{n \cdot k} \sum_{t=1}^{n \cdot k} W_t \cdot PV_t \qquad \text{(Eq. 2)}$$

where n∙k is the total number of trees, and $W_t$ is the weight of tree t, defined as the generalization score of the tree.

The interpretability phase

The InPhenotype model provides valuable additional information about the biological system. We use the constructed InPhenotype model to predict regulatory interactions between genes and genomic loci that contribute to the outcome phenotype (referred to as 'gene-locus interactions'). The regulatory interactions are identified in three steps (Figure S2B). Step 1 identifies significant genes, and step 2 calculates SNP scores for each significant gene, thereby identifying the top-scored interacting locus for each gene. Together, these two steps provide the candidate gene-locus interactions. Step 3 evaluates the statistical significance of these interactions. We next provide details about these steps.

Step 1, "gene identification", calculates a permutation P-value for each gene. To this end, m InPhenotype forests are trained using m permuted datasets (here, m=10). The permutation P-value of a gene is the percentage of permutation-based forests in which the gene attained gene scores that are higher than its original score. We select the top-scored genes using two criteria: first, using a stringent gene-score cutoff (criterion I), and secondly, using a relaxed gene-score cutoff as long as the loci make a significant contribution
that is not due only to their gene-expression data (criterion II). To address the second criterion, we compare the gene scores resulting from two alternative InPhenotype forests: the standard InPhenotype forest, as described above (the 'full-data InPhenotype model'), and in addition, an InPhenotype forest whose training was conducted in the absence of genetic data (a 'null-genetics InPhenotype model'). Selected genes (of criterion II) are those with high rankings according to the full-data InPhenotype model and low rankings according to the null-genetics model. As exemplified in Figure S3AB, in both criteria, InPhenotype selects genes whose gene scores are significant (here, the relaxed gene-score cutoff is FDR = 0.01).

In step 2, "locus identification", we search for an interacting locus for each selected gene from step 1. To this end, for a given gene and a given locus, the 'gene-specific locus score' (in short, "locus score") is defined as the fraction of split nodes in which the binary decision is a function of the relevant locus, where fractions are summed across all trees of the relevant gene and for the flanking genomic loci. In this study, for instance, the flanking genomic region was set to 0.2 Mbp in yeast and 3 Mbp in mouse, in accordance with the length of the genomic interval of association in the corresponding datasets (Brodt et al. 2014; Abu-Toamih Atamni et al. 2017). Overall, each candidate gene-locus interaction consists of a gene (selected by criteria I or II) and its top-scored locus.

In step 3, "interaction assessment", the statistical significance of each candidate interaction is evaluated. Given that the gene is already significant (step 1), the significance of the entire interaction relies on the score of its top-scored locus. Using the abovementioned m InPhenotype forests that were trained using m permuted datasets, the permutation P-value of an interaction is the percentage of permutation-based forests in which the SNP attained SNP scores that are higher than its original score.

Of note, the term "interaction" refers here to a general joint effect of a gene and a locus on the phenotype, with two possible relations: (1) Joint effects without epistasis. For instance, when two child nodes share similar regression coefficients (differing only in their intercepts), there is no gene-locus epistasis. In this case, the advantage of InPhenotype is in modeling the locus effect in the context of specific gene(s), unlike existing linear models (without epistasis) in which the locus has a global effect. (2) Joint effects with epistasis. Such relations appear when two child nodes have distinct regression coefficients. In this case, the advantage of InPhenotype over alternative interaction models is in its ability to reduce complexity by focusing on a specific type of interaction (a gene-locus interaction). We further note that in both cases, InPhenotype's gene-locus relations are distinct from 'expression QTLs', in which the locus is causal to gene expression levels regardless of the phenotype.

Overall, the main parameters of InPhenotype are the number of trees, the stop criteria (a minimal prediction gain and a minimal number of individuals), and the gene-selection cutoff.

Synthetic data generation

We generated synthetic data collections, each comprising 25 datasets. Each dataset consisted of a full set of data introduced into the InPhenotype algorithm, including phenotypes, genotyping and expression data for a certain set of genes across a cohort of 200 individuals. In addition, input for each dataset included 50 genes and 50 SNPs. While most genes and loci did not affect the phenotype (termed 'control genes' and 'control SNPs'), one gene and one SNP each had a causal effect on it. To create each dataset, we first sampled an expression value for each entity and each individual using a normal distribution with constant mean and standard deviation values (for this purpose we used the observed values in the influenza-infection dataset). Next, we randomly sampled the genotype of each individual and each SNP from a Bernoulli distribution with a parameter p, whose value was sampled from a uniform distribution $U[0.25, 0.75]$ for each SNP. Each SNP j in each individual i was therefore set to one of two possible genotypic values: $s_{ij} \in \{0, 1\}$. Finally, we generated a quantitative phenotype that builds on a randomly selected causal SNP j and a randomly selected causal gene g:

$$P_i = \beta_m m_{ig} + \beta_s s_{ij} + \varepsilon_i \qquad \text{(Eq. 3)}$$

where $P_i$ is the phenotype of individual i, $\beta_m$ is the gene-effect size, $m_{ig}$ is the input-expression data of gene g in individual i, $\beta_s$ is the genetic-effect size, and $\varepsilon_i$ is a noise term distributed as $\varepsilon_i \sim N(0, \sigma)$. We generated a large number of such synthetic data collections, constructing each collection using a different combination of parameters, including the noise factor $\sigma$ = 0.2, 0.4, 1, 1.5, 3 and 5, the gene-effect size $\beta_m$ = 0.05, 0.1, 0.15, 0.2, 0.3 and 0.45, and the genetic-effect size $\beta_s$ = 0.5, 2, 4, 6, 10 and 15 (gene effects are stronger than genetic effects, in accordance with our observations in real data; Figure S3CD). Unless stated otherwise, the default parameters were $\beta_m$ = 0.45 and $\beta_s$ = 2 (representing linear relations in the absence of epistasis).

To further assess the accuracy of InPhenotype we generated six additional data collections, using the same process except for specific changes. (1) A data collection was generated using two causal SNPs instead of only one. The two SNPs exerted a hierarchical effect on the phenotype: the SNP at the root node was associated with the pre-defined genetic-effect size $\beta_s$, whereas at its child split node an effect size of $\frac{1}{2}\beta_s$ was used for the SNP. When applying InPhenotype in this case, we defined two gene-locus interactions for each gene based on its two top-scored loci. We used the same strategy to generate data collections with a larger number of causal SNPs (x SNPs), using an effect size of $\frac{1}{x}\beta_s$ for each SNP. (2) A data collection was generated in which the joint gene-locus effects involve epistasis. Specifically, instead of using Eq. 3, we used the following formulation: $P_i = \beta_m m_{ig} + \beta_s s_{ij} + \beta_e m_{ig} s_{ij} + \varepsilon_i$, where $\beta_e$ is the size of epistasis (here, $\beta_e = -2\beta_m$). (3) A data collection was generated in which the noise level varies between leaves of the same tree: the low-mean expression leaf
is assigned a certain noise level whereas the high-mean expression leaf is assigned a higher noise level, scaled by $\bar{P} \cdot \beta_s$, where $\bar{P}$ is the mean phenotypic value in our data collection. This synthetic data collection is called the 'varying noise levels' dataset. (4) A data collection in which several genes (rather than one gene) have an effect on the phenotype. To address this, we used the synthetic data collection from the original simulation (in which only one gene affects the phenotype and the remaining 49 control genes have no effect). In the 49 control genes we recalculate the expression $m_{ig}$ by solving Eq. 3 using fixed values of $P_i$, $\beta_m$, $\beta_s$, $s_{ij}$ and the noise term. (5) Given that the interacting loci tend to have better associations (Figure S3C), we reasoned that genetic association scores can be used to select the SNPs that are given as input to the InPhenotype method. In this data collection we simulated this situation by using a large number of synthetic SNPs (400,000 SNPs) and selecting 49 SNPs whose association scores were distributed similarly to the overall distribution of SNPs in the real murine dataset. These SNPs are used as the control SNPs. This synthetic data is referred to as the 'biological-relevant background associations' dataset. (6) A collection in which the allele frequencies are sampled from the genotyping of the real mouse dataset. This synthetic data is referred to as the 'biological-relevant allele frequencies' dataset. (7) Data collections with varying numbers of control genes, SNPs and individuals (up to 1000, 1000 and 4000, respectively), which are used to demonstrate the running time.

Performance analysis

We compared several alternative methods that also combine gene expression with genotyping to predict phenotypic data: the LASSO and Ridge regressions (Takagi et al. 2014); the mgRF method (Chen and Zhang 2013) with a fixed number of feature-selection steps (here, 8 steps); the elastic net-based CAMELOT algorithm (Chen et al. 2009); and InPhenotype using all layers of input data (a 'full' InPhenotype model) or, alternatively, using phenotype and expression data while omitting genetic data (a 'null-genetics' InPhenotype model). InPhenotype was applied with the following input parameters: the number of trees for each gene was k=100, using a minimal number of 7 individuals (stop criterion) and a prediction-gain cutoff of 0.05 (stop criterion). Our final prediction model consisted of trees that were constructed using one top-scored gene. The out-of-bag error of a leaf was calculated only for those leaves that were associated with at least 5 out-of-bag individuals. Exactly the same parameters were used for the null-genetics InPhenotype model. For the mgRF method we applied the pipeline as described in (Chen and Zhang 2013) with 8 iterations and 5000 trees (this forest size is equivalent to the total number of trees trained in the InPhenotype model: n∙k=50*100). For the random forest model we applied the randomForest R package, version 4.6, with 5000 trees. For the regularization methods (LASSO, Ridge and CAMELOT's elastic net) we used the glmnet R package, version 2.0.5, with the α-penalty = 0 for Ridge, 0.5 for elastic net and 1 for LASSO, and λ-regularization parameters with a minimal value of 0.01, which is suitable for cases where the number of samples is greater than the number of features.

To assess the prediction error of each compared method, we split the 200 individuals in each dataset into two equal-sized groups: the 'training' and the 'testing' individual groups. The algorithm, which was trained on the training group, was then used to predict the phenotype of each sample in the testing group. For each dataset, the prediction error of a given method is defined as the RMSE of prediction across all individuals of the testing group. To estimate the gene-locus interaction accuracy, we used the area under the relevant 'receiver operating characteristic' (ROC) curve (termed the 'interaction AUC' score). The true positive rate (TPR) and the false positive rate (FPR) were calculated by comparing predicted components to the gold-standard (simulated) causal components. In all cases, both TPR and FPR were calculated on the basis of the rankings of genes and SNPs in each compared method. In particular, for InPhenotype we used the SNP and gene scores. Since LASSO, Ridge and CAMELOT regressions do not test interactions directly, we used the coefficients of their genes (relying on the assumption that the resulting accuracy is an upper bound on the actual accuracy of interaction predictions). Finally, since the mgRF code does not provide importance scores, we could not test the interaction accuracy of this method.

To examine the utility of using InPhenotype's gain function, we compared Eq. 1 to a previously suggested splitting function (Quinlan 1992):

$$\Delta\text{error} = sd(P) - \frac{N_L \cdot sd(P_L) + N_R \cdot sd(P_R)}{N_L + N_R}$$

where $N_L$ and $N_R$ are the numbers of individuals reaching the left and right child leaves of a certain split node, and sd is the standard deviation of the phenotype for individuals reaching the split node ($P$), its left child leaf ($P_L$) and its right child leaf ($P_R$). In addition, we compare the prediction yielded by an InPhenotype tree for a certain individual i (denoted $PV_t(i)$) to an alternative formulation of tree prediction for individual i, denoted $PV_t^*(i)$ (Quinlan 1992), formalized as:

$$PV_t^*(i) = \frac{N_x \cdot PV_t(i) + K \cdot pred(root, i)}{N_x + K}$$

where pred(root, i) is the prediction of the root for a given individual i based on the regression model of the root, K is a constant (here, following (Quinlan 1992), K=15), leaf x is the leaf that was reached by individual i and is used to calculate its $PV_t(i)$, and $N_x$ is the total number of individuals that were used to train the model of leaf x.

Mouse data analysis

We applied InPhenotype to phenotyping and transcriptome data that were monitored 2 days after influenza infection in 36 female mice of different genetic backgrounds, at the age of 8 to 10 weeks (data from (Frishberg et al. 2019)). These data were collected across the CC mouse strains (Aylor et al. 2011;
Iraqi et al. 2012), a panel of recombinant inbred mouse strains derived by intercrossing of 8 founder strains. InPhenotype used the following input: First, the clinical symptoms of each individual, using the percentage of weight-loss readout on day 2 after infection; next, gene-expression data from the lung tissue at 48 h post-infection. To select the most informative genes, we took the 600 genes with the highest differential expression between their average expression (across individuals) at 48 h post-infection and the average expression of control (PBS-treated) individuals (data from (Frishberg et al. 2019)). Finally, a genotyping dataset for all CC strains was downloaded from the UNC genetics repository (http://csbio.unc.edu/CCstatus), comprising measurements from MegaMuga (a 77-K marker-genotyping array based on the Illumina Infinium platform).

To handle missing genotyping data and erroneous genotyping, we improved the measurements of each SNP based on the accumulated information (haplotype probabilities) derived from its flanking genomic regions. This was done using the following three steps (Figure S4): First, for a given genome interval, the haplotype probabilities of each of 8 founder lines in each CC line were calculated using the HAPPY package (Mott et al. 2000). Secondly, to reduce the dimensionality we applied principal component analysis on the resulting haplotype probabilities of the 8 founder lines. For each SNP we continued with the number of principal components that explains at least 70% of the probability variance. Thirdly, for each SNP we then applied a Support Vector Machine classifier on those strains that contained one of the two most abundant genotypes. Those genotypes were used as labels for the classifier learning process, while the selected principal components were used as features. The trained classifier obtained for each SNP was used to produce the desired genotyping categorization for all strains. In particular, we used the following rule: for a given SNP, if the proportion of correctly classified strains was lower than 95% we disregarded it; otherwise, we used the predicted labels as the binary genotyping values. To reduce runtime, among these SNPs we further selected 706 SNPs that obtained the top associations (P<0.05) with the phenotype. This feature selection technique was later justified by the observation that the significant gene-locus interactions tend to have lower (better) association test P-values in the mouse dataset (Figure S3C). Of note, only one of the 706 selected SNPs attained a significant association (FDR > 0.06 for all remaining SNPs; Figure S3C). Overall, out of 37,404 published SNPs, 706 were used as input for the InPhenotype algorithm.

We applied InPhenotype using the following input parameters: the number of trees for each gene was 100, the minimal number of individuals was 7, the prediction-gain cutoff was 5%, the FDR-corrected permutation P-value cutoff of genes was 0.01 (Figure S3A), and the FDR-corrected permutation P-value cutoff of loci for their interaction with genes was 0.06. The out-of-bag error of a leaf was calculated only for those leaves that were associated with at least 5 out-of-bag samples. Once gene-locus interactions were identified, the positions of the identified SNPs were refined by reconstruction of the InPhenotype model while using all available SNPs in the proximity (±300 kb) of the identified SNPs. Putative gene-SNP interactions based on the protein interaction network were obtained using the ResponseNet algorithm (Basha et al. 2013).

A P-value for each gene and each SNP in the mouse dataset was also calculated using each of the compared methods based on their regression coefficients (LASSO, Ridge and CAMELOT regressions) and importance scores (RF): the coefficients (or importance scores) resulting from the real-data-based regression were compared to the corresponding scores (of the same gene or SNP) resulting from 100 permuted datasets.

Yeast data analysis

We applied InPhenotype to published expression data from 79 Saccharomyces cerevisiae yeast segregants derived from the BY/RM cross, following their exposure to rapamycin for 50 min (Materials et al. 2011). Our chosen phenotype of interest here was cell growth in response to this rapamycin treatment. InPhenotype was applied to the log expression of the 600 genes with the highest variation across all yeast samples. We used genotyping of these yeast segregants (Brem et al. 2002), from which we selected those SNPs with the most significant association with the cell-growth phenotype. Of the 2956 published SNPs, to reduce runtime, the 600 top-associated SNPs were used as input for the InPhenotype method. Seventeen of these 600 selected SNPs were significantly associated (FDR < 0.06).

We applied InPhenotype using the following input parameters: the number of trees for each gene was 100, the minimal number of individuals was 14, the prediction-gain cutoff was 5%, the FDR-corrected permutation P-value cutoff of genes was 0.01 (Figure S3B), and the FDR-corrected permutation P-value cutoff of gene-locus interactions was 0.06. The out-of-bag error of a leaf was calculated only for those leaves that were associated with at least 5 out-of-bag samples. After identifying promising gene-locus interactions, we refined the positions of their candidate SNPs by reconstructing the model while using all SNPs in the proximity (±30 kb) of the identified locus. The ResponseNet algorithm (Basha et al. 2013) was applied to identify putative gene-SNP relations based on the protein-protein interaction network. In addition, P-values of genes and SNPs were also calculated using the alternative methods as described in the case of mouse data analysis.

Data availability

All data sets used in this work are fully presented in the paper. The code for the InPhenotype algorithm is available for download at https://github.com/tomharel86/InPhenotype.git. All supplementary materials have been uploaded to figshare.

RESULTS

Overview of InPhenotype

Our aim here was to integrate gene expression data and genotyping data in order to predict phenotypic diversity. One of the main challenges was to handle a very large number of candidate genes and genomic loci (SNPs), which may have joint effects on the phenotype. None of the current studies adequately addresses this challenge: multigenic linear models do not typically model biological interactions (Chen et al. 2009; Takagi et al. 2014; Rao and Knowles 2019), whereas other models are highly complex, representing any type of interactions (gene-gene, gene-locus and locus-locus interactions) (Lee et al. 2006; Chen and Zhang 2013). To reduce complexity, the latter methods convert the expression of genes into discrete levels and therefore do not exploit the continuous nature of molecular measurements. InPhenotype addresses the shortcomings of those methods by building a model with an explicit representation of gene-locus interactions without systematic modeling of gene-gene interactions and locus-locus interactions, therefore reducing the complexity of the model. Although InPhenotype does not directly account for any locus-locus and gene-gene interaction, such interactions may be indirectly identified through the effect of genomic loci on the expression of genes that may subsequently interact with other loci. Unlike existing methods (Lee et al. 2006; Chen and Zhang 2013) (Figure S1), InPhenotype exploits the continuous nature of the transcriptome (Figure 1, right). Furthermore, as different loci affect the phenotype jointly with different genes, the model is distinct from multigenic linear models, even in the absence of epistasis.

InPhenotype builds on a hierarchical partitioning that is organized in a regression tree model. The tree model is a tree in which each split node tests the incoming genotype of a certain SNP (note that different genetic architecture, such as inbred/outbred and haploids/diploids, would require different test functions), and each terminal node (leaf) is a predictor model in the form of a linear regression between the expression of a given gene and the quantitative phenotype. A tree, therefore, is a piecewise hierarchical regression model (Quinlan 1992) of a single gene, which we call a 'gene tree' (Methods). The InPhenotype model is a random regression forest model (Quinlan 1992; Criminisi 2011), where each gene is associated with a collection of slightly different trees (instead of a single tree). Differences between trees of the same gene occur through randomness, where each gene tree is trained by the use of a random sample of SNPs and a random sample of individuals. Once trained, the overall forest consists of the entire collection of trees from the various genes and is referred to as the 'InPhenotype forest' (Figure 1, left). Through a tailored tree-weighting scheme, the predictions of all gene trees in the forest are combined into a certain phenotype prediction value.

The InPhenotype model is further used to identify 'gene-locus interactions' that make substantial contributions to the phenotypic outcome. To identify these interactions, we first calculate a score for each gene based on the quality of its own gene trees (a 'gene score'), and then identify the interacting locus for each top-scored gene based on the representation of the locus within the trees of that gene (a gene-specific 'locus score'). Finally, the statistical significance of each candidate interaction is evaluated through permutation tests. We note that in the context of InPhenotype, the term 'gene-locus interaction' refers to their joint effect on the phenotype, which does not always involve epistasis. Gene-gene interactions are ignored, being modeled in different trees. Similarly, locus-locus interactions are modeled only in the context of a specific gene. For a detailed description of the InPhenotype forest, emphasizing differences from previous models, see Methods and Figure S2.

Performance analysis

We evaluated the overall generalization ability of InPhenotype using synthetic data in a population of individuals. To this end we generated a collection of 25 datasets for each combination of data parameters (gene-effect size, genetic-effect size, noise level), where each such dataset consisted of one gene-locus interaction that has a causal effect on the phenotype, either using a linear model or using deviations from a linear model (epistasis). We evaluated the quality of prediction using an error-prediction metric, which we assessed by using a testing set of individuals. We compared the prediction errors of InPhenotype to those of five existing transcription-genotyping prediction methods: LASSO and Ridge regularized regressions (Takagi et al. 2014), the CAMELOT regularized regression based on elastic-net (Chen et al. 2009), mgRF (which applies random forest in the presence of an iterative feature-selection scheme; Chen and Zhang 2013), and a standard random forest (Breiman 2001) (Methods). We observed that the prediction errors of InPhenotype were significantly lower than those of any alternative method across a wide variety of data parameters (linear model: Figure 2A and Figure S5A; epistasis: Figure S5B). As a baseline, we utilized permuted data and, as expected, found poor performance using all methods (Figure 2A). These results were robust to the InPhenotype parameters (Figure S6). When applying InPhenotype on gene-expression data alone (using an empty set of input SNPs), the prediction errors of InPhenotype were higher than those obtained by combined modeling of genotyping and gene-expression data (Figure S7A), highlighting the importance of data integration.

InPhenotype applies a weighting scheme in which different trees contribute differently to the output prediction (Methods). Theoretically, this may lead to poor generalization capability. However, we observed that a modified InPhenotype testing process in which the tree-weighting scheme is omitted (so that all trees contributed equally to the output forest prediction) attained substantially higher prediction errors (e.g., Figure S7B), thus alleviating the risk of overfitting due to the tree-weighting approach.

Similarly to existing analyses using random forests (Lunetta et al. 2004; Bureau et al. 2005; Huang et al. 2005; Qi
et al. 2006), InPhenotype allowed assignment of the regulatory interactions of genes and genomic loci to the outcome phenotype. To measure the accuracy of predicting such gene-locus interactions, we used the area under the relevant receiver-operating-characteristic (ROC) curve (an 'interaction AUC' score; Methods). Figure 2B depicts such accuracy in comparison to that of four alternative methods (mgRF was omitted because its implementation is limited, while the regression-based methods, LASSO, Ridge, and CAMELOT, do not predict interactions and thus their evaluations are limited to locus-prediction accuracy; see Methods). Whereas the accuracy of each of the four compared methods was dependent on data parameters, InPhenotype could identify the correct gene-locus interaction with high accuracy for a larger range of parameter values. As expected, in all cases the application of any method on permuted data resulted in poor performance.

Similar results were obtained when testing alternative scenarios, in particular, with different numbers of genes whose gene-locus interactions contribute to the phenotype (Figure S8A), when using leaf-specific level of noise (Figure S8B), and with a biologically-relevant distribution of allele frequency and locus-phenotype associations (Figure S8CD). As expected, InPhenotype’s performance was substantially reduced with increasing numbers of causal SNPs interacting with the same gene, emphasizing that InPhenotype is most effective in identifying interactions of each gene with a single locus, but is still quite effective in identifying interactions of each gene with a few loci (Figure S9). We further observed that InPhenotype’s binary split function and tree prediction outperform alternative formulations (Figure S8E) and that the running time is nearly linear with the number of genes, genomic loci, and individuals (Figure S8F).

Application of InPhenotype to the response of yeast to rapamycin and the murine response to Influenza virus

To test the InPhenotype algorithm we applied it to two data sets. The first consisted of transcriptome and growth rates in response to rapamycin treatment of 79 genotyped yeast strains (data from Materials et al. 2011). The second was the transcriptome in lung tissue during influenza infection of 18 recombinant inbred ‘Collaborative Cross’ (CC) mouse strains, using measurements of a clinical symptom (percentage of weight loss during infection) as a physiological phenotype (data with genotyping from Srivastava et al. 2017). Our goal was to apply InPhenotype to identify combinations of genes and SNPs that contribute significantly to a physiological response (Figure 3A, 4A). We used data permutation to generate the null distribution of gene scores and then assessed the significance of scores using this background distribution. This analysis identified 43 genes in yeast and 46 genes in mouse (FDR-corrected P-value < 0.01; Methods); whereas genes selected by criterion I (29 and 28 in mouse and yeast, respectively) could also be identified in the absence of genotyping data, genes in criterion II could not be revealed in the absence of genotyping data (Figure S3ABD). For 73 of these genes we found a genomic locus that attained a significant gene-locus interaction (FDR-corrected P-value < 0.06; Methods) (Tables S1, S2). For comparison, alternative methods (including standard genetic associations) obtained substantially fewer significant genes and SNPs (Figure S3E), suggesting improved statistical power of InPhenotype compared to alternative approaches. In the following we focus on the identified 73 gene-locus interactions.

We organized the results as a 'regulatory network', containing both genes and genomic loci nodes whose regulatory interactions are presented as gene-to-locus relationships (Figure 3A, 4A). Intriguingly, in both influenza and rapamycin responses we found a high-order organization of this network, where the connectivity of certain genomic loci was substantially greater than that expected for a permutation-based network (Figure 3B, 4B; see Methods). We refer here to a group of genes with a shared locus as a 'module' and denote the identified genomic loci, and their respective modules, as Y1 to Y11 in yeast and M1 to M10 in mouse (Tables S1, S2). In particular, out of the 11 genomic loci identified in the rapamycin response, 4 affected phenotypic diversity jointly with more than 1 gene (modules Y1 to Y4 of 20, 2, 2 and 2 genes, respectively; for example, module Y1 in Figure 3C); and similarly, out of the 10 genomic loci identified in the murine response to influenza, 7 interacted with modules larger than 1 gene (3 to 7 genes; for example, module M2 in Figure 4C). In contrast, in permuted datasets, connectivity with 2 genes occurred in only 3.8% (yeast) and 4% (mouse) of the interacting loci, and none of the modules exceeded the size of 2 genes (Figures 3B, 4B, respectively). The existence of such 'high-connectivity loci' could not be explained simply by the genetic association in standard GWAS analysis. Rather, association-score distributions of high-connectivity loci were similar to those of the background distribution (Figure S3C), providing evidence that high-order organization is not due to artifacts imposed during the data generation and analysis procedures.

Encouraged by these results, we also performed an in-depth functional analysis of the regulatory networks. For the gene-locus interactions acting on each physiological phenotype we first describe the functionality of genes, then consider promising candidates within the relevant genomic loci, and finally, present potential relations between the interacting genomic loci and genes.
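The permutation-based significance testing and the module definition used in this section can be illustrated with a minimal sketch. It is not the InPhenotype implementation: the function names, the assumption that a higher gene score is better, the add-one form of the empirical P-value, and the use of the Benjamini-Hochberg procedure for the FDR correction are all assumptions made for illustration, and the gene and locus labels in the toy example are hypothetical.

```python
from collections import defaultdict


def empirical_p_value(real_score, permuted_scores):
    """Share of permuted scores that are at least as large as the real score.

    Assumes higher scores are better; the +1 terms avoid an exact zero.
    """
    exceed = sum(1 for s in permuted_scores if s >= real_score)
    return (exceed + 1) / (len(permuted_scores) + 1)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest to enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def build_modules(interactions):
    """Group genes by their interacting locus: one module per locus."""
    modules = defaultdict(list)
    for gene, locus in interactions:
        modules[locus].append(gene)
    return dict(modules)


def fraction_multi_gene(modules, min_genes=2):
    """Fraction of loci whose module holds at least `min_genes` genes."""
    hits = sum(1 for genes in modules.values() if len(genes) >= min_genes)
    return hits / len(modules)


if __name__ == "__main__":
    # Hypothetical toy interactions (gene, locus); not data from the paper.
    toy = [("geneA", "locus1"), ("geneB", "locus1"), ("geneC", "locus2")]
    toy_modules = build_modules(toy)
    print(toy_modules)
    print(fraction_multi_gene(toy_modules))
```

Under these assumptions, the same `fraction_multi_gene` statistic could be computed on networks inferred from permuted data to judge whether the observed share of multi-gene modules is unusually high.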

Yeast response to rapamycin. We found that many of the identified genes have known roles in vegetative growth and rapamycin resistance, as expected (7 of 35 genes; Table S1 and Figure 3A). In particular, four genes (STE3, MUP1, PUF6, and ZPR1) have been reported to affect yeast resistance to rapamycin (Ansari et al. 2002; Parsons et al. 2004; Butcher et al. 2006; Houston et al. 2006; van Pel et al. 2013), and overexpression or deletion of 10 genes (ALB1, BAR1, NOC3, RGI1, UTP14, NOP1, ERB1, ESBP6, NOP2 and UTP23) have an effect on vegetative yeast growth (Yoshikawa et al. 2011; Lu et al. 2013; Sharma et al. 2013). More formally, using the extensive annotation of yeast genes in the Gene Ontology database (Consortium 2000; Carbon et al. 2017), we found enrichment of predicted genes in two main functional categories that relate to vegetative yeast growth: ribosome biogenesis (13 genes, p<10−7) and reproduction (8 genes, p<10−5). Intriguingly, we observed a good match between the membership of genes in modules and their functional annotation (Figure 3A): genes of the same module tended to have a common functional annotation. Specifically, whereas module Y1 was found to be enriched in ribosome biogenesis genes (12 out of 13 ribosome biogenesis genes; e.g., ALB1 and NOP2), modules Y2 to Y4 were enriched in reproduction (6 out of 8 reproduction-related genes; e.g., MFA2, MF(ALPHA)2). Suggested causal genes of relevant functionalities within the genomic loci of the identified modules include the ribosome biogenesis genes MRP1, MRPS28, and FCF1 in module Y1 (Woolford and Baserga 2013) and reproduction gene STE2 in module Y2 (Bardwell 2004; Figure 3C). Figure S10 (top) suggests signaling and regulatory pathways through which the identified loci interact with the expressed genes, based on the analysis of the known protein interaction network.

Influenza infection in mice. The identified genes were enriched in immune and immune-related pathways (for example, lymphocyte function and interferon (IFN)-receptor signaling, p<10−5 and p<10−11, respectively) and targets of transcription factors, including the key antiviral factors Irf7 and Stat1 (p<10−11 and p<10−11, respectively; Table S2). For instance, the antiviral-related genes include Atf3, a transcription factor that regulates the cellular antiviral autophagy (Labzin et al. 2015; Sood et al. 2017) and IFN expression in natural killer cells (Rosenberger et al. 2008); the E3 ubiquitin ligase Dtx3l, which targets viral 3C protease to control viral infection (Zhang et al. 2015); Oasl2, an oligoadenylate synthetase-like gene that was recently revealed as one of the controllers of antiviral innate immunity (Eskildsen et al. 2003; Choi et al. 2015; Zhu et al. 2015); and Rtp4, which encodes a receptor-transported protein that acts as an effector of the type I IFN antiviral response (Schoggins et al. 2011; Touzot et al. 2012; Davenport et al. 2015). Notably, several of the identified genes play a specific role in influenza infection, such as Irgm1, a member of the p47 immunity-related GTPase family, which is related to weight loss after influenza infection (Steed et al. 2017); Mlkl, whose encoded protein facilitates protection against influenza virus through cell-death pathways (Nogusa et al. 2016; Death et al. 2017); Tnfaip3, a TNFα-induced gene whose deficiency in myeloid cells protects against influenza infection (Maelfait et al. 2012); Themis2, a member of the thymocyte selection-associated family, which is associated with mouse weight loss after influenza infection (Peirce et al. 2010); and Rassf3, a member of the RAS superfamily that is associated with internalization of the influenza virus (Fujioka et al. 2011).

Similarly, the inferred genomic loci of the various modules consist of various candidate causal genes (Table S2, column 6). For instance, module M1 contains Pkm2, a pyruvate kinase that interacts with influenza RNA upon infection (Miyake et al. 2017), and the Rpl11 gene, which is involved in influenza-induced nuclear stress (Yan et al. 2017). In module M2, Capn6 encodes a cysteine protease that decreases influenza infection upon targeting (Blanc et al. 2016) and is required for viral replication (König et al. 2010); similarly Pak3 encodes a protein kinase that is associated with infection by the ssRNA West Nile virus (Li et al. 2013) (Figure 4C). In module M4, Il5ra encodes the receptor of the Il5 cytokine (Gorski et al. 2013; Gorski and Braciale 2013; Ravanetti et al. 2017); Rad18 encodes an E3 ubiquitin ligase that acts as a DNA damage repair protein and was reported to have an effect on viral infection (Lloyd et al. 2006); finally, the Oxtr receptor of oxytocin is also associated with viral infection (Liu and Conboy 2017). In module M5, the product of Rbx1 either influences influenza infection through ubiquitination (Gschweitl et al. 2016) or decreases IFN signaling through viral hijack (Davis and Patton 2017); Ep300 was reported to be exploited by several viruses (Dyer et al. 2008) to facilitate viral replication or mitigate host-defense abilities (Castillo et al. 2014). Finally, Mfhas1, which encodes a regulator of Toll-like receptor-dependent signaling (Zhong et al. 2017), is located in the genomic region of module M6. Figure S10 (bottom) exploits the protein interaction network to suggest a mechanism by which some of the identified loci interact with genes of their modules. Further experimentation will be needed to validate the various candidate genes and to characterize the mechanisms by which the identified loci interact with these genes.

DISCUSSION

InPhenotype can be applied to any input data in which continuous phenotypes could be predicted by continuous and discrete personal data, therefore opening new prospects for systematic analysis of comprehensive data resources, such as the 500-FG dataset (Schirmer et al. 2016; ter Horst et al. 2016) and the wellness study dataset (Price et al. 2017). The results obtained when InPhenotype is applied on synthetic data demonstrate its superior quality over existing methods (Figure 2). In particular, the results obtained with synthetic data demonstrate the ability of InPhenotype to facilitate the detection of regulatory interactions between genomic loci and expressed genes, implying that the model can be used to obtain concrete biological insights. The success of InPhenotype is attributable to (i) accounting for gene-locus interactions while ignoring global locus-locus interactions and gene-gene interactions, so that InPhenotype balances accurate identification of gene-locus interactions with a reasonable predictive power; and (ii) InPhenotype’s ability to model gene-locus interactions while
exploiting the continuous nature of molecular data, unlike existing methods (Lee et al. 2006; Chen et al. 2009; Chen and Zhang 2013; Takagi et al. 2014). Extra care must be taken when using InPhenotype-derived gene-locus interactions. First, the analysis is limited to correlations (rather than causality) between molecular components and phenotypic diversity. Secondly, the InPhenotype model does not rely on a classical genetic interaction model (Lynch and Walsh 1998) and therefore the term ‘interactions’ refers to the fact that the genomic loci could not have been revealed in the absence of expression data for the specific interacting gene (Figure S3C).

Our examination of InPhenotype's models in yeast and mouse data revealed that specific genomic loci show higher frequencies of interaction with gene expression (Figure 3A, 4A), suggesting that specific loci are more involved in regulatory interactions than other loci. The functional annotations that we observed in yeast are consistent with such modularity of gene-locus interactions: genes interacting with distinct loci were found to belong to different functional annotations.

The output InPhenotype model is a valuable resource for uncovering the regulatory mechanisms that underlie phenotypic diversity. For example, Glipr2 (module M2) is known to be responsive to IFN-gamma (Bonville et al. 2009), but has not yet been described in the context of influenza infection; future experiments on influenza-infected mutant strains can, therefore, now facilitate characterizations of such genes. Similarly, in studying genes within high-connectivity loci it will now be possible to characterize additional causal genes (e.g., Table S2, column 6).

The present work offers a basis on which future studies can explore the molecular and genetic effects underlying complex traits. First, given the generality of our framework, a variety of continuous omics data types such as metabolites, protein levels, and immune-system cell proportions could be used instead of gene expression as phenotype predictors. Secondly, future extensions into a more complex InPhenotype model (e.g., using simple or multiple regression incorporating several genes in each leaf, or using multigenic scores in each split node) may allow modeling of additional types of interactions, but such extensions should be interpreted with caution due to the complexity of the model. Furthermore, because the model is quite broad, other effects (such as population structure or covariates) could be incorporated into it. Similarly, it would be interesting to test the utility of InPhenotype as a classification algorithm (e.g., as in Landwehr et al. 2005). Thirdly, to improve InPhenotype’s prediction quality and focus on the most relevant genes and genomic loci, an advanced feature selection could be integrated as an introductory step or to facilitate iterative improvements. Finally, InPhenotype paves the way for the expansion of general phenotype prediction models, such as models using time-based data sets or spline functions.

ACKNOWLEDGMENTS

The authors wish to thank Shirley Smith for scientific editing. This work was supported by the Israel Science Foundation Grant 288/16 (TH, NP-Y and IG-V) and partially supported by European Research Council grant 637885, the Israeli Centers of Research Excellence (I-CORE) Center No 41/11 (TH), ISF 470/17 (EB), partial fellowships from the Edmond J. Safra Center for Bioinformatics at Tel Aviv University (TH). IG-V is a Faculty Fellow of the Edmond J. Safra Center for Bioinformatics at Tel Aviv University and an Alon Fellow.

LITERATURE CITED

Abu-Toamih Atamni H. J., Y. Ziner, R. Mott, L. Wolf, and F. A. Iraqi, 2017 Glucose tolerance female-specific QTL mapped in collaborative cross mice. Mamm. Genome. https://doi.org/10.1007/s00335-016-9667-2
Ansari H., G. Greco, and J. Luban, 2002 Cyclophilin A Peptidyl-Prolyl Isomerase Activity Promotes Zpr1 Nuclear Export. Mol. Cell. Biol. 22: 6993–7003. https://doi.org/10.1128/MCB.22.20.6993-7003.2002
Asyali M., D. Colak, O. Demirkaya, and M. Inan, 2006 Gene Expression Profile Classification: A Review. Curr. Bioinform. 1: 55–73. https://doi.org/10.2174/157489306775330615
Aylor D. L., W. Valdar, W. Foulds-mathes, R. J. Buus, R. a Verdugo, et al., 2011 Genetic analysis of complex traits in the emerging Collaborative Cross. Genome Res. 21: 1213–1222. https://doi.org/10.1101/gr.111310.110
Bardwell L., 2004 A walk-through of the yeast mating pheromone response pathway. Peptides.
Basha O., S. Tirman, A. Eluk, and E. Yeger-Lotem, 2013 ResponseNet2.0: Revealing signaling and regulatory pathways connecting your proteins and genes--now with human data. Nucleic Acids Res. https://doi.org/10.1093/nar/gkt532
Bhattacharjee M., and M. J. Sillanpää, 2011 A bayesian mixed regression based prediction of quantitative traits from molecular marker and gene expression data. PLoS One 6. https://doi.org/10.1371/journal.pone.0026959
Blanc F., L. Furio, D. Moisy, H. L. Yen, M. Chignard, et al., 2016 Targeting host calpain proteases decreases influenza A virus infection. Am J Physiol Lung Cell Mol Physiol 74: ajplung.00314.2015. https://doi.org/10.1152/ajplung.00314.2015
Bonville C. A., C. M. Percopo, K. D. Dyer, J. Gao, C. Prussin, et al., 2009 Interferon-gamma coordinates CCL3-mediated neutrophil recruitment in vivo. BMC Immunol. 10: 1–13. https://doi.org/10.1186/1471-2172-10-14
Breiman L. E. O., 2001 Random Forests. 5–32.
Brem R., G. Yvert, R. Clinton, and L. Kruglyak, 2002 Genetic dissection of transcriptional regulation in budding yeast. Science 296: 752–755.
https://doi.org/10.1126/science.1069516
Brodt A., M. Botzman, E. David, and I. Gat-Viks, 2014 Dissecting Dynamic Genetic Variation That Controls Temporal Gene Response in Yeast. PLoS Comput. Biol. https://doi.org/10.1371/journal.pcbi.1003984
Bureau A., J. Dupuis, K. Falls, K. L. Lunetta, B. Hayward, et al., 2005 Identifying SNPs predictive of phenotype using random forests. Genet. Epidemiol. 28: 171–182. https://doi.org/10.1002/gepi.20041
Butcher R. A., B. S. Bhullar, E. O. Perlstein, G. Marsischky, J. LaBaer, et al., 2006 Microarray-based method for monitoring yeast overexpression strains reveals small-molecule targets in TOR pathway. Nat. Chem. Biol. 2: 103–109. https://doi.org/10.1038/nchembio762
Carbon S., H. Dietze, S. E. Lewis, C. J. Mungall, M. C. Munoz-Torres, et al., 2017 Expansion of the gene ontology knowledgebase and resources: The gene ontology consortium. Nucleic Acids Res. 45: D331–D338. https://doi.org/10.1093/nar/gkw1108
Castillo A., L. Wang, C. Koriyama, Y. Eizuru, K. Jordan, et al., 2014 A systems biology analysis of the changes in gene expression via silencing of HPV-18 E1 expression in HeLa cells. Open Biol. 4: 130119-. https://doi.org/10.1098/rsob.130119
Chen B.-J., H. C. Causton, D. Mancenido, N. L. Goddard, E. O. Perlstein, et al., 2009 Harnessing gene expression to identify the genetic basis of drug resistance. Mol. Syst. Biol. 5: 310. https://doi.org/10.1038/msb.2009.69
Chen Z., and W. Zhang, 2013 Integrative analysis using module-guided random forests reveals correlated genetic factors related to mouse weight. PLoS Comput. Biol. 9: e1002956. https://doi.org/10.1371/journal.pcbi.1002956
Choi U. Y. ung, J. S. Kang, Y. S. ahng Hwang, and Y. J. Kim, 2015 Oligoadenylate synthase-like (OASL) proteins: dual functions and associations with diseases. Exp. Mol. Med. 47: e144. https://doi.org/10.1038/emm.2014.110
Consortium T. G. O., 2000 Gene ontology: Tool for the unification of biology. Nat. Genet. 25: 25–29. https://doi.org/10.1038/75556.Gene
Criminisi A., 2011 Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning. Found. Trends® Comput. Graph. Vis. 7: 81–227. https://doi.org/10.1561/0600000035
Davenport E. E., R. D. Antrobus, P. J. Lillie, S. Gilbert, and J. C. Knight, 2015 Transcriptomic profiling facilitates classification of response to influenza challenge. J. Mol. Med. 93: 105–114. https://doi.org/10.1007/s00109-014-1212-8
Davis K. A., and J. T. Patton, 2017 Shutdown of interferon signaling by a viral-hijacked E3 ubiquitin ligase. Microb. cell 4: 387–389. https://doi.org/10.15698/mic2017.11.600
Death R. C., R. J. Thapa, J. P. Ingram, K. B. Ragan, S. Nogusa, et al., 2017 HHS Public Access. 20: 674–681. https://doi.org/10.1016/j.chom.2016.09.014.DAI
Dyer M. D., T. M. Murali, and B. W. Sobral, 2008 The landscape of human proteins interacting with viruses and other pathogens. PLoS Pathog. 4. https://doi.org/10.1371/journal.ppat.0040032
Eskildsen S., J. Justesen, M. H. Schierup, and R. Hartmann, 2003 Characterization of the 2′-5′-oligoadenylate synthetase ubiquitin-like family. Nucleic Acids Res. 31: 3166–3173. https://doi.org/10.1093/nar/gkg427
Frishberg A., N. Peshes-Yaloz, O. Cohn, D. Rosentul, Y. Steuerman, et al., 2019 Cell composition analysis of bulk genomics using single-cell data. Nat. Methods. https://doi.org/10.1038/s41592-019-0355-5
Fujioka Y., M. Tsuda, T. Hattori, J. Sasaki, T. Sasaki, et al., 2011 The Ras-PI3K Signaling Pathway Is Involved in Clathrin-Independent Endocytosis and the Internalization of Influenza Viruses. PLoS One 6: 1–9. https://doi.org/10.1371/journal.pone.0016324
Gorski S. A., Y. S. Hahn, and T. J. Braciale, 2013 Group 2 Innate Lymphoid Cell Production of IL-5 Is Regulated by NKT Cells during Influenza Virus Infection. PLoS Pathog. 9. https://doi.org/10.1371/journal.ppat.1003615
Gorski S., and T. Braciale, 2013 IL-5 produced by natural helper cells suppresses neutrophil function during influenza infection. (P4239). J. Immunol. 190: 130.19 LP-130.19.
Gschweitl M., A. Ulbricht, C. A. Barnes, R. I. Enchev, I. Stoffel-Studer, et al., 2016 A SPOPL/cullin-3 ubiquitin ligase complex regulates endocytic trafficking by targeting EPS15 at endosomes. Elife 5: 1–26. https://doi.org/10.7554/eLife.13841
Horst R. ter, M. Jaeger, S. P. Smeekens, M. Oosting, M. A. Swertz, et al., 2016 Host and Environmental Factors Influencing Individual Human Cytokine Responses. Cell. https://doi.org/10.1016/j.cell.2016.10.018
Houston E., W. Y. Ho, A. Stray-pedersen, E. L. Ocheltree, D. Philip, et al., 2006 Proc. Natl. Acad. Sci. USA 103: 6659–6664.
Huang X., W. Pan, S. Grindle, X. Han, Y. Chen, et al., 2005 A comparative study of discriminating human heart failure etiology using gene expression profiles. BMC Bioinformatics 6. https://doi.org/10.1186/1471-2105-6-205
Iraqi F. A., M. Mahajne, Y. Salaymah, H. Sandovski, H. Tayem, et al., 2012 The genome architecture of the collaborative cross mouse genetic reference population. Genetics 190: 389–401. https://doi.org/10.1534/genetics.111.132639
König R., S. Stertz, Y. Zhou, A. Inoue, H. H. Hoffmann, et al., 2010 Human host factors required for influenza virus replication. Nature 463: 813–817.
Predicting phenotypic diversity 11 https://doi.org/10.1038/nature08699 Nogusa S., R. J. Thapa, C. P. Dillon, S. Liedmann, T. H. Oguin, et al., 2016 RIPK3 Activates Parallel Pathways of MLKL- Labzin L. I., S. V. Schmidt, S. L. Masters, M. Beyer, W. Krebs, et Driven Necroptosis and FADD-Mediated Apoptosis to al., 2015 ATF3 Is a Key Regulator of Macrophage IFN Protect against Influenza A Virus. Cell Host Microbe 20: Responses. J. Immunol. 195: 4446–4455. 13–24. https://doi.org/10.1016/j.chom.2016.05.011 https://doi.org/10.4049/jimmunol.1500204 Parsons A. B., R. L. Brost, H. Ding, Z. Li, C. Zhang, et al., 2004 Landwehr N., M. Hall, and E. Frank, 2005 Logistic model trees. Integration of chemical-genetic and genetic interaction Mach. Learn. https://doi.org/10.1007/s10994-005-0466- data links bioactive compounds to cellular target 3 pathways. Nat. Biotechnol. 22: 62–69. Lee S.-I., D. Pe’er, A. M. Dudley, G. M. Church, and D. Koller, https://doi.org/10.1038/nbt919 2006 Identifying regulatory mechanisms using individual Peirce M. J., M. Brook, N. Morrice, R. Snelgrove, S. Begum, et variation reveals key role for chromatin modification. al., 2010 Themis2/ICB1 is a signaling scaffold that Proc. Natl. Acad. Sci. U. S. A. 103: 14062–7. selectively regulates macrophage toll-like receptor https://doi.org/10.1073/pnas.0601852103 signaling and cytokine production. PLoS One 5. Li J., S. C. Ding, and H. Cho, 2013 A Short Hairpin RNA Screen of https://doi.org/10.1371/journal.pone.0011465 Interferon-Stimulated Genes Identifies. Am. Soc. Pel D. M. van, P. C. Stirling, S. W. Minaker, P. Sipahimalani, and Microbiol. 4: e00385-13. P. Hieter, 2013 Saccharomyces cerevisiae Genetics https://doi.org/10.1128/mBio.00385-13.Editor Predicts Candidate Therapeutic Genetic Interactions at Liu Y., and I. Conboy, 2017 Unexpected evolutionarily the Mammalian Replication Fork. G3&#58; conserved rapid effects of viral infection on oxytocin Genes|Genomes|Genetics 3: 273–282. receptor and TGF-β/pSmad3. 
Skelet. Muscle 7: 1–10. https://doi.org/10.1534/g3.112.004754 https://doi.org/10.1186/s13395-017-0125-y Price N. D., A. T. Magis, J. C. Earls, G. Glusman, R. Levy, et al., Lloyd A. G., S. Tateishi, P. D. Bieniasz, M. A. Muesing, M. 2017 A wellness study of 108 individuals using personal, Yamaizumi, et al., 2006 Effect of DNA repair protein dense, dynamic data clouds. Nat. Biotechnol. Rad18 on viral infection. PLoS Pathog. 2: 368–373. https://doi.org/10.1038/nbt.3870 https://doi.org/10.1371/journal.ppat.0020040 Qi Y., Z. Bar-Joseph, and J. Klein-Seetharaman, 2006 Evaluation Lu J., M. Sun, and K. Ye, 2013 Structural and functional analysis of different biological data and computational of Utp23, a yeast ribosome synthesis factor with classification methods for use in protein interaction degenerate PIN domain. RNA 19: 1815–24. prediction. Proteins 63: 490–500. https://doi.org/10.1261/rna.040808.113 https://doi.org/10.1002/prot.20865 Lunetta K. L., L. B. Hayward, J. Segal, and P. van Eerdewegh, Quinlan J. R., 1992 LEARNING WITH CONTINUOUS CLASSES, in 2004 Screening large-scale association study data: Proceedings AI’92, 5th Australian Conference on Artificial Exploiting interactions using random forests. BMC Intelligence.World Scientific,. Genet. 5. https://doi.org/10.1186/1471-2156-5-32 Rao A. S., and J. W. Knowles, 2019 Polygenic risk scores in Lynch M., and B. Walsh, 1998 Genetics and Analysis of coronary artery disease. Curr. Opin. Cardiol. 1. Quantitative Traits. Sinauer, Sunderland, Mass. 980. https://doi.org/10.1097/HCO.0000000000000629 Maelfait J., K. Roose, P. Bogaert, M. Sze, X. Saelens, et al., 2012 Ravanetti L., A. Dijkhuis, Y. S. Sabogal Pineros, S. M. Bal, B. S. A20 (Tnfaip3) deficiency in myeloid cells protects against Dierdorp, et al., 2017 An early innate response underlies influenza A virus infection. PLoS Pathog. 8. 
severe influenza-induced exacerbations of asthma in a https://doi.org/10.1371/journal.ppat.1002570 novel steroid-insensitive and anti-IL-5-responsive mouse Materials S. I., Y. Genome, and T. R. M. Average, 2011 model. Allergy Eur. J. Allergy Clin. Immunol. 72: 737– Supporting Information. Am. Stat. 1–9. 753. https://doi.org/10.1111/all.13057 https://doi.org/10.1073/pnas.1116442108/- Rosenberger C. M., A. E. Clark, P. M. Treuting, C. D. Johnson, /DCSupplemental.www.pnas.org/cgi/doi/10.1073/pnas. and A. Aderem, 2008 ATF3 regulates MCMV infection in 1116442108 mice by modulating IFN-gamma expression in natural Miyake Y., K. Ishii, and A. Honda, 2017 Influenza virus infection killer cells. Proc. Natl. Acad. Sci. U. S. A. 105: 2544–9. induces host pyruvate kinase M which interacts with https://doi.org/10.1073/pnas.0712182105 viral RNA-dependent RNA polymerase. Front. Microbiol. Ruderfer D. M., D. C. Roberts, S. L. Schreilber, E. O. Perlstein, 8: 1–7. https://doi.org/10.3389/fmicb.2017.00162 and L. Kruglyak, 2009 Using expression and genotype to Mott R., C. J. Talbot, M. G. Turri, a C. Collins, and J. Flint, 2000 predict drug response in yeast. PLoS One 4. A method for fine mapping quantitative trait loci in https://doi.org/10.1371/journal.pone.0006907 outbred animal stocks. Proc. Natl. Acad. Sci. U. S. A. 97: Schirmer M., S. P. Smeekens, H. Vlamakis, M. Jaeger, M. 12649–12654. https://doi.org/10.1073/pnas.230304397 Oosting, et al., 2016 Linking the Human Gut Microbiome

Predicting phenotypic diversity 12 to Inflammatory Cytokine Production Capacity. Cell. Zhong J., H. Wang, W. Chen, Z. Sun, J. Chen, et al., 2017 https://doi.org/10.1016/j.cell.2016.10.020 Ubiquitylation of MFHAS1 by the ubiquitin ligase praja2 promotes M1 macrophage polarization by activating JNK Schoggins J. W., S. J. Wilson, M. Panis, M. Y. Murphy, C. T. and p38 pathways. Cell Death Dis. 8: e2763-10. Jones, et al., 2011 A diverse range of gene products are https://doi.org/10.1038/cddis.2017.102 effectors of the type i interferon antiviral response. Nature 472: 481–485. Zhu J., A. Ghosh, and S. N. Sarkar, 2015 OASL - A new player in https://doi.org/10.1038/nature09907 controlling antiviral innate immunity. Curr. Opin. Virol. 12: 15–19. https://doi.org/10.1016/j.coviro.2015.01.010 Sharma S., J. Yang, P. Watzinger, P. Kötter, and K. D. Entian, 2013 Yeast Nop2 and Rcm1 methylate C2870 and C2278 of the 25S rRNA, respectively. Nucleic Acids Res. 41: 9062–9076. https://doi.org/10.1093/nar/gkt679 Sood V., K. B. Sharma, V. Gupta, D. Saha, P. Dhapola, et al., 2017 ATF3 negatively regulates cellular antiviral

signaling and autophagy in the absence of type I

interferons. Sci. Rep. 7: 1–17.

https://doi.org/10.1038/s41598-017-08584-9

Srivastava A., A. P. Morgan, M. L. Najarian, V. K. Sarsani, J. S. Sigmon, et al., 2017 Genomes of the mouse collaborative cross. Genetics 206: 537–556. https://doi.org/10.1534/genetics.116.198838

Steed A. L., G. P. Christophi, G. E. Kaiko, L. Sun, V. M. Goodwin,

et al., 2017 influenza through type I interferon. 502:

498–502. https://doi.org/10.1126/science.aam5336

Takagi Y., H. Matsuda, Y. Taniguchi, and H. Iwaisaki, 2014 Predicting the phenotypic values of physiological traits using SNP genotype and gene expression data in mice. PLoS One 9: 1–17. https://doi.org/10.1371/journal.pone.0115532 Touzot M., V. Soumelis, and T. Asselah, 2012 A dive into the complexity of type i interferon antiviral functions. J. Hepatol. 56: 726–728. https://doi.org/10.1016/j.jhep.2011.07.009

Visscher P. M., N. R. Wray, Q. Zhang, P. Sklar, M. I. McCarthy, et al., 2017 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet. 101: 5–22. https://doi.org/10.1016/j.ajhg.2017.06.005 Woolford J. L., and S. J. Baserga, 2013 Ribosome biogenesis in the yeast Saccharomyces cerevisiae. Genetics 195: 643– 681. https://doi.org/10.1534/genetics.113.153197

Yan Y., Y. Du, G. Wang, and K. Li, 2017 Non-structural protein 1 of H3N2 influenza A virus induces nucleolar stress via interaction with nucleolin. Sci. Rep. 7: 1–10. https://doi.org/10.1038/s41598-017-18087-2 Yoshikawa K., T. Tanaka, Y. Ida, C. Furusawa, T. Hirasawa, et al., 2011 Comprehensive phenotypic analysis of single-gene deletion and overexpression strains of Saccharomyces cerevisiae. Yeast 28: 349–361. https://doi.org/10.1002/yea.1843

Zhang Y., D. Mao, W. T. Roswit, X. Jin, A. C. Patel, et al., 2015 PARP9-DTX3L ubiquitin ligase targets host histone H2BJ and viral 3C protease to enhance interferon signaling and control viral infection. Nat. Immunol. 16: 1215– 1227. https://doi.org/10.1038/ni.3279

Figure 1 [Figure not reproduced. Recoverable labels: trees for genes A, B, C and D; plots of phenotype against expression of gene A; SNP genotype labels for SNPi and SNPj (GG, CC, TT).]

Figure 2 [Figure not reproduced. Panel A: prediction error plotted against genetic effect size, noise level, and gene effect size for LASSO, Ridge, mgRF, RF, CAMELOT, and InPhenotype. Panel B: interaction AUC plotted against genetic effect size, noise level, and gene effect size, for real data and permuted data.]

Figure 3 [Figure not reproduced. Panel A: network of genes and SNPs with nodes labeled Y1 to Y11; legend: gene, SNP, ribosome biogenesis gene, reproduction gene. Panel B: module fractions plotted against connectivity, for real data and permuted data. Panel C: locus score plotted against genomic position (Mbp); labeled genes include MRP1, MRPS28, and FCF1.]

Figure 4 [Figure not reproduced. Panel A: network of genes and SNPs with nodes labeled M1 to M10; legend: gene, SNP, antiviral-related genes. Panel B: module fractions plotted against connectivity, for real data and permuted data. Panel C: locus score plotted against position on chromosome X (Mbp); labeled genes include Capn6 and Pak3.]