Predicting Phenotypic Diversity from Molecular and Genetic Data

Genetics: Early Online, published on July 27, 2019 as 10.1534/genetics.119.302463. Copyright 2019.

Tom Harel, Naama Peshes-Yaloz, Eran Bacharach, Irit Gat-Viks*

School of Molecular Cell Biology and Biotechnology, Department of Cell Research and Immunology, The George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel

* Corresponding author: [email protected] (IGV)

ABSTRACT

Despite the importance of complex phenotypes, an in-depth understanding of the combined molecular and genetic effects on a phenotype has yet to be achieved. Here we introduce InPhenotype, a novel computational approach for complex phenotype prediction, where gene-expression data and genotyping data are integrated to yield quantitative predictions of complex physiological traits. Unlike existing computational methods, InPhenotype makes it possible to model potential regulatory interactions between gene expression and genomic loci without compromising the continuous nature of the molecular data. We applied InPhenotype to synthetic data, exemplifying its utility for different data parameters, as well as its superiority compared to current methods in both prediction quality and the ability to detect regulatory interactions of genes and genomic loci. Finally, we show that InPhenotype can provide biological insights on both mouse and yeast datasets.

KEYWORDS complex traits; genetics; gene expression; computational modeling

Understanding the mechanisms underlying complex diseases presents a substantial challenge. Among the most successful and widely used approaches are genome-wide association studies (GWAS), in which genotyping (or DNA-sequencing) information is utilized to systematically investigate the genetic basis of phenotypic diversity (Visscher et al. 2017). Such studies are focused on genetic information that is relevant to every tissue, condition and time point; however, genetic data are static and thus cannot capture dynamic, epigenetic or environmental factors. An alternative strategy is to use high-throughput molecular data, such as mRNA-sequencing, to uncover relationships between molecular and phenotypic diversity in a population of individuals (Asyali et al. 2006). This alternative is valuable in two ways: first, it is applicable to both qualitative traits (through classification methods) and quantitative traits (through regression methods), and secondly, the molecular data naturally encapsulate a variety of underlying epigenetic, environmental and developmental effects. However, since this strategy requires prior selection of a specific tissue and experimental condition, it is of limited utility in the case of in-vivo clinical outcomes that are commonly associated with pathological alterations in multiple tissues and organs.

Several studies have shown that by integrating gene expression with genotyping of single-nucleotide polymorphic (SNP) sites, it is possible to achieve a better quality of phenotype prediction than that obtained by analysis of only one data type at a time (e.g., Ruderfer et al. 2009). Most attempts to use a combined transcription-genotyping predictor have relied on regression models, such as LASSO and Ridge regression (Takagi et al. 2014), elastic net (CAMELOT) (Chen et al. 2009), and Bayesian mixed regression (Bhattacharjee and Sillanpää 2011). However, owing to the large number of expressed genes and SNP variables, it is not feasible to include all potential interactions within these regression-based models. To account for those interactions, alternative phenotype-prediction approaches have been based on decision tree models of either a single tree (Lee et al. 2006) or multiple trees (Chen and Zhang 2013), with two main caveats. First, the internal nodes of the trees store the split functions based on gene-expression and genotyping data, while each leaf node of a tree provides the most probable answer (Figure S1). Such models are prone to overfitting since they represent any type of interaction (gene-gene, gene-locus, and locus-locus interactions). Secondly, since these methods typically discretize the expression data within the split nodes, such approaches do not realize the full potential of the quantitative expression measurements.

Here, we model quantitative clinical outcomes using the framework of regression trees (Quinlan 1992; Criminisi 2011) in which discrete (qualitative) values (here, SNP genotyping) are used within the split nodes, whereas quantitative expression-level measurements of a given gene are used as regression-based predictors associated with each leaf node (see illustration in Figure 1, right). To account for multiple genes, the model consists of a large collection of trees where each tree represents the expression data of a single gene. To obtain generalization and robustness, we apply the 'random regression forest' framework (Criminisi 2011): for each single gene the model consists of an ensemble of randomly trained regression trees, where each of these trees is randomly different and essentially decorrelated from the other trees of the same gene (Figure 1, left). This methodology, referred to as 'InPhenotype', combines the advantage of regression-based methods (as it exploits the quantitative nature of gene-expression data) together with the advantages of decision tree-based methods (by considering gene-locus interactions) while maintaining a reasonable complexity (by limiting gene-gene and locus-locus interactions), therefore tackling the main caveats of previous methods.

InPhenotype builds on a forest model that carries several algorithmic adjustments, making it possible to fit the special characteristics of gene expression and genotyping data. For instance, the forest prediction is not a simple average of all leaf-node predictions; instead, we use a weighting scheme that considers the generalization ability of the different leaf nodes. As another example, split nodes store different test functions, depending on the genetic landscape of the relevant organism. Furthermore, the particular construction of the model allows the extraction of gene-locus regulatory interactions that are believed to be jointly important in predicting the outcome. We applied synthetic data analysis to demonstrate the advantages of the various algorithmic adjustments and to further show the superiority of InPhenotype over existing methods in terms of its prediction accuracy and of its ability to reveal gene-SNP interactions that have a high impact on the outcome. As a proof of concept, we applied InPhenotype to real data of two biological systems: growth diversity following rapamycin treatment in yeast, and susceptibility to influenza infection in mice. Most notably, in both yeast and mouse, genes and SNPs were found to pair with high joint importance to phenotypic diversity, showing surprising modular organization: many of the identified SNPs acted jointly with multiple genes having a similar biological function.

Materials and Methods

Overview of the InPhenotype algorithm

We have developed InPhenotype, a methodology for integrating discrete and continuous data types [...]

[...] genotyping and molecular data of an unseen individual are used to predict the outcome phenotype of the forest (based on a certain forest prediction model). The third phase is “interpretability”: we interpret the model by asking relevant biological questions. In particular, we ask which gene-locus pairs make major joint contributions to the forest's performance. In the following, we describe our specific formulation of each component.

The InPhenotype forest model

To model a single gene, the InPhenotype algorithm performs a non-linear regression that builds on hierarchical partitioning organized in a regression tree model. The latter is a tree in which each 'split node' tests the incoming genotype of a certain genomic locus, and each 'terminal node' (leaf) is a predictor model in the form of a linear regression between the expression values of the gene and the quantitative phenotype. More specifically, each split node is a binary test function, and individuals arriving at the node are sent to either the right or the left child node. In each split node, only a single genomic locus is used for testing. In the case of homozygous individuals (such as inbred mouse strains and yeast), the two possible genotypes of a given locus (e.g., 'AA' and 'GG') indicate ‘right’ or ‘left’. In the case of heterozygosity, each genomic locus specifies several possible 'genotypic partitions', such as 'AA' for the ‘right’ child and 'AG'-'GG' for the ‘left’ child, or alternatively, both 'AA' and 'AG' can indicate the ‘left’ child. For terminal nodes, InPhenotype utilizes a regression model in which the quantitative phenotype is the response variable and the expression of a certain gene is the predictor variable. Each tree therefore refers to a hierarchical piecewise regression model of a single gene, referred to as a 'gene tree' (Figure 1, right). The overall 'InPhenotype forest' model consists of a collection of trees for each gene (Figure 1, left).

The InPhenotype tree tackles the main challenges in existing phenotype-modeling methods. First, it exploits the quantitative nature of expression data within the leaf regression model, unlike the current regression trees that have been applied in biological context, which typically discretize the molecular measurements into binary split-node decisions while modeling the leaves independently of the input [...]
