Downloading All Package Dependencies Which Can Be Found in Our Attached Code

bioRxiv preprint doi: https://doi.org/10.1101/2021.05.11.443707; this version posted May 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Novel Interpretable Tissue-Specifc and Multi-Tissue Transcriptomic Clocks to Infer Aging Mechanisms Aayush Gupta, Mindren Lu, Jessica Sun [email protected], [email protected], [email protected] 1. Abstract Aging is characterized as a progressive decline in fitness that ultimately results in death. We set out to build both tissue-specifc and multi-tissue transcriptomic clocks to make global tissue age predictions in individuals from GTEx. Existing work in the feld primarily uses epigenetic clocks as predictors of age, but these models have known issues and are signifcantly less interpretable than their transcriptomic counterparts. Due to their transcriptomic nature, we can use these models to directly infer mechanisms of aging from their features. Linear regression remains the current standard analysis technique, but we improved upon its baseline performance with modern techniques, exploring both XGBoost and MLPs. We also experimented with using deconvolved cell data for predictions, which account for cellular composition and reduce signal distortion from rare cell types. Since it is known that the heterogeneity of cell types in particular tissues can lead to noise in these models, we proposed using deconvolution as a potential remedy for this problem. Our results found that MLPs are not well suited for the task due to a lack of training data, but the use of XGBoost is efective at improving the baseline performance of predictions of existing tissue-specifc clocks. These models allowed us to directly compute genes most important to age prediction in our models, and we showed that multiple genes found have been independently identifed elsewhere to show evidence of correlation with age. Given the small size of our datasets, we were unable to make conclusive determinations about multi-tissue predictors, but preliminary results suggest that the technique shows promise and is worthy of future investigation. Likewise, given our limited deconvolved cell data, we did not currently observe strong results, but we again note that this is an area in need of further investigation. By improving upon the performance of existing models, we demonstrated that a novel machine learning technique, XGBoost, can be an efective technique to further our understanding of aging mechanisms by extraction of the most relevant genes found in those models. This is signifcant because the genetic causes of aging are still not fully understood, and research in the feld of aging is lacking in comparison to other domains. As the problem of identifying tissues that age at diferent rates is of specifc interest, our tissue-specifc models potentially have other applications in this domain, including informing pathologies in tissues that are found to be aging faster, or analyzing how people with similar ages can have vastly diferent tissue ages. An extended technical presentation of this work can be found here, and a highly simplifed non-technical overview presentation can be found here. 2. Background Aging is characterized as a progressive decline in fitness that ultimately results in death. As people age, this deterioration increases the risk for many debilitating diseases, including cancer, cardiovascular disorders, and neurodegenerative conditions1; however, the specifc mechanisms that cause aging are still unknown. 2,3 Thus far, one of the best known biomarkers for aging is an epigenetic clock based on age related changes in DNA methylation 4. While epigenetic clocks have strong predictive power, they are less interpretable since we cannot directly infer the mechanisms of aging from their predictions 5,6. Therefore, we propose building tissue-specifc and multi-tissue transcriptomic clocks to make global tissue age predictions. The problem of identifying tissues that age at diferent rates is of specifc interest 7 8; an example application could inform pathologies in tissues that are found to be aging faster 9, allowing physicians to concentrate on the most vulnerable parts of the body. As a corollary, we can use the same bioRxiv preprint doi: https://doi.org/10.1101/2021.05.11.443707; this version posted May 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. approach to analyze deconvolved cell-specifc expression data to compare aging rates between diferent cell types in the same tissue. So far, classical methods such as Lasso or ElasticNet have been used on bulk expression data10, but we intend to improve on that by using a simple multi-layer perceptron (MLP) clock on deconvolved GTEx data. An MLP is a feed-forward artifcial neural network which uses multiple layers of perceptrons, i.e., individual classifers used in supervised learning. MLPs are particularly promising in their demonstrated strong predictive power, as they can encode non-linear relationships between the input and output data, such as the interaction of multiple genes impacting the age. We also used XGBoost, a boosted random forest classifer. This model, released in 2016, uses a collection of decision trees that divide on subsets of features, where bootstrap aggregation is used to aggregate the final outputs of each decision tree into a fnal answer. This model is popular, as it provides interpretability and feature importance, while outperforming many traditional models 13. Past methods have primarily analyzed only single tissue models 11, making it a poor predictor of diferent tissues with very diferent cell composition. Note that there is an analog between a predictor trained on non-deconvolved data across multiple tissues to predict human age, and a predictor trained on deconvolved data across multiple cells within a tissue to predict tissue age. Namely, because the data is in the same format (a matrix of RNA expression levels for tissues or cells), we can use the same model architecture trained on these two diferent modalities of data. Thus, we attempt to train on pairs of tissues and all the tissues, but we accept the tradeof that fewer patients have multiple tissues. We also used a feature importance map to identify genes that are most strongly correlated with predicting tissue age. In the future, these newly-identifed genes can be used to compare aging between diferent tissue types and determine how rapidly tissues age relative to each other, an important source of information for clinical diagnosis revisited signifcantly in the last 20 years 2,12. 3. Aims 3.1 Aim 1 Build transcriptomic clocks which can predict the age of a person given gene expression in various tissues with an MLP. Our goal is to improve existing prediction performance, and we will also try to use deeper models such as MLPs and XGBoost to make better predictions than a simple linear regression, as it has been shown to be efective 7,11. We will build both tissue-specifc and multi-tissue clocks for this purpose. 3.2 Aim 2 From our models, extract the most signifcant genes contributing to age prediction. We can then validate our fndings against known genes linked to aging to directly infer associated mechanisms. 3.3 Aim 3 Analogously, explore the building of transcriptomic clocks which predict the age of a tissue given gene expression data in its individual cell types. We plan to use newly deconvolved GTEx data, and aim to improve upon the performance of the existing model of a tissue specifc ElasticNet with bulk expression data. 4. Methods bioRxiv preprint doi: https://doi.org/10.1101/2021.05.11.443707; this version posted May 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. 4.1 Dataset Pre-Processing 4.1.1 GTEx Dataset Our data is primarily derived from the GTEx dataset (https://gtexportal.org/home/), a matrix of genotype-tissue expression values from 54 diferent tissues. Tissues in GTEx range from a few dozen samples to thousands of samples. Each tissue is labeled with specifc variables regarding the donor, including age, sex, ethnicity, RNA integrity number, cause of death, and time after death. Gene expression is shown in transcripts per million (TPM), calculated as where nt is the number of reads for a gene t, lt is the normalized transcript/gene length, and T is the set of all genes depending on whether the quantifcation is at the gene level. We directly performed inference on the expression values in GTEx, and used the chronological age of each person given as a ground truth for the tissue-specifc ages as well. Despite some serious pathology notes for particular individuals, the GTEx dataset samples tissues from a population normal relative to others of their age, so these ages can be used to approximate the ages of tissues 5,14. 4.1.2 Deconvolved Data In addition to this bulk data, Kai Kang from the Kellis Lab provided us with computationally deconvolved data, by separating tissue expression values by individual cell types. Kang’s work used implemented deconvolution with a novel complete deconvolution method called CDSeq. CDSeq is a reference-free method that can distinguish which reads come from which cell types without needing prior experiments. This difers from reference-based methods like CIBERSORT by eliminating bias towards how the cell type was identifed by previous researchers.15 CDSeq uses bulk RNA-seq data where the profle represents a weighted average of the expression profles of the constituent cell types.

Load more