PDF Download

ARTICLE https://doi.org/10.1038/s41467-020-16539-4 OPEN Machine learning uncovers cell identity regulator by histone code Bo Xia1,2,3,4,7, Dongyu Zhao1,2,3,4,7, Guangyu Wang1,2,3,4,7, Min Zhang2,3,4, Jie Lv1,2,3,4, Alin S. Tomoiaga5, ✉ ✉ Yanqiang Li 1,2,3,4, Xin Wang 1,2,3,4, Shu Meng2,3,4, John P. Cooke2,3,4, Qi Cao 6 , Lili Zhang2,3,4 & ✉ Kaifu Chen 1,2,3,4 Conversion between cell types, e.g., by induced expression of master transcription factors, 1234567890():,; holds great promise for cellular therapy. Our ability to manipulate cell identity is constrained by incomplete information on cell identity genes (CIGs) and their expression regulation. Here, we develop CEFCIG, an artificial intelligent framework to uncover CIGs and further define their master regulators. On the basis of machine learning, CEFCIG reveals unique histone codes for transcriptional regulation of reported CIGs, and utilizes these codes to predict CIGs and their master regulators with high accuracy. Applying CEFCIG to 1,005 epigenetic profiles, our analysis uncovers the landscape of regulation network for identity genes in individual cell or tissue types. Together, this work provides insights into cell identity regulation, and delivers a powerful technique to facilitate regenerative medicine. 1 Center for Bioinformatics and Computational Biology, Houston Methodist Research Institute, Houston, TX, USA. 2 Center for Cardiovascular Regeneration, Department of Cardiovascular Sciences, Houston Methodist Research Institute, Houston, TX, USA. 3 Department of Cardiothoracic Surgeries, Weill Cornell Medical College, Cornell University, New York, NY, USA. 4 Institute for Academic Medicine, Houston Methodist Research Institute, Houston, TX, USA. 5 Business Analytics, CIS & Law Department, The O’Malley School of Business Accounting, Manhattan College, Riverdale, NY, USA. 6 Department of Urology, Robert H. Lurie Comprehensive Cancer Center, Chicago, IL, USA. 7These authors contributed equally: Bo Xia, Dongyu Zhao, Guangyu Wang. ✉ email: [email protected]; [email protected]; [email protected] NATURE COMMUNICATIONS | (2020) 11:2696 | https://doi.org/10.1038/s41467-020-16539-4 | www.nature.com/naturecommunications 1 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-16539-4 cell type is determined by the combinatorial function of transcription factors, whose depletion impaired the differentia- Aits cell identity genes (CIGs), which jointly dictate phe- tion towards a specific cell type; (3) genes required for key notypes of the cell and play critical roles in differentiation, functions or phenotypes of a cell type; (4) genes that were widely development, and disease. CIGs form an intricate regulation used as markers for a cell type. Together, we curated 247 known network, in which a set of master transcription factors governs identity genes for ten well-defined cell types (Supplementary the expression program of CIGs1. Transitions between a few cell Data 1). These genes include 42–82 identity genes in each of the types have been achieved in vitro by induced expression of master four categories (Fig. 1a) and 18 to 36 identity genes for each of the transcription factors2 and hold great promise for cellular ther- ten cell types (Fig. 1b). As was reported before, chromatin apy3. However, constrained by limited ability to uncover the immunoprecipitation sequencing (ChIP-seq) signal of activating accurate catalog of CIGs, our understanding of cell identity is epigenetic marks, such as H3K4me3, H3K4me1, and H3K27ac, incomplete. Conventional methods define cell identity by several showed a unique broad pattern of enrichment at individual marker genes, a methodology that is widely recognized to be known CIGs (Supplementary Fig. 1A, B), e.g., SOX2 in H1 suboptimal. Without comprehensive information about cell human embryonic stem cell (hESC)2, GATA2 in endothelial cells identity fidelity, clinical application of progenitor-derived cells (ECs)15, LMO2 in CD34 + hematopoietic stem cells (HPCs)16, might be encumbered by phenotypic aberration or instability4–9. and PAX6 in mid-radical glial cells (MRGs)17 (Supplementary Thus, it is of pivotal importance to uncover the complete catalog Fig. 1C). These known CIGs have a high but widely distributed of CIGs, which will provide a blueprint to facilitate basic inves- Tau index of expression specificity, with their index values ran- tigation on cell identity, and to guide clinical application of ging from ~1 to 0.6 and ranked from top to median among all engineered cells. genes (Supplementary Fig. 1d). Scientists have been defining CIGs based on genome-wide We noted that different histone modifications differ in expression profile10, which is not yet an optimal method. Many distribution pattern on chromatin and thus require different genes expressed in a cell type are not related to cell identity. parameters to analyze their ChIP-Seq data. We therefore Alternatively, scientists also define CIGs based on the expression developed GridGO, a grid-based genetic algorithm to automati- specificity analysis10, which requires comparison between a query cally optimize the detection of histone modification features for cell type and most, if not all, other cell types. It also requires CIGs (Supplementary Fig. 2a). The algorithm aggressively distinguishing between cell-type-specific and biological- searches for optimal cutoff for each of three important condition-specific genes, e.g., heat shock genes. Therefore, it is parameters in ChIP-Seq analysis, including the height cutoff for not cost-effective, if not impossible, to collect data for all cell peak calling and the upstream or downstream distance cutoffs for types under all biological conditions for the expression specificity assigning a peak to a nearby Transcription Start Site (TSS). To analysis. Further, the relationship between CIGs and cell-type- investigate additional histone modification features for CIGs, we specific genes can be complicated and uncertain. Some identity further developed algorithms in GridGO to analyze six statistical genes may be highly cell-type-specific, whereas other identity features for each ChIP-Seq enrichment peak. These features genes may be expressed in multiple related cell types and have include peak width, height, skewness, kurtosis, total signal, and less expression specificity. As such, to define CIGs solely by their coverage in a given genomic region (Supplementary Fig. 2b). expression profile may be not optimal. It becomes apparent GridGO effectively improved P-values of differences in most of recently that CIGs are different from other expressed genes in the histone modification features between CIGs and the remaining epigenetic mechanism to regulate their transcription. For exam- genes (Supplementary Fig. 2c, d). We further compared the ples, super enhancers and a unique broad pattern of H3K4me3 receiver operating characteristic (ROC) curve for recapturing the modification were found to regulate CIGs, whereas it is typical curated CIGs using each feature before and after GridGO enhancer and sharp H3K4me3 modification that regulate other optimization. Sixteen signatures, each defined as one feature of expressed genes such as housekeeping genes11–14. These dis- one histone modification, display significant improvement coveries suggest strong potential to systematically uncover CIGs (Supplementary Fig. 2e-g). Intriguingly, H3K4me3, H3K4me1, on the basis of epigenetic signature analysis. and H3K27ac showed significant difference in each of their Here we develop Computational Epigenetic Framework for features between known CIGs and random control genes (Fig. 1c). Cell Identity Gene (CEFCIG), a computational framework to However, H3K27me3, one of the most studied repressive histone uncover CIGs and further define their master regulators. On the modifications, displayed little such difference (Fig. 1c). basis of machine learning, CEFCIG reveals unique histone codes By combining gene expression profile with histone modifica- for transcriptional regulation of reported CIGs and utilizes these tion signatures, we developed a logistic regression model, codes to predict CIGs and their master regulators with high CIGdiscover, for CIGs discovery (Supplementary Fig. 3a). A accuracy. Applying CEFCIG to 1005 epigenetic profiles, our histone modification may have distinct biological implications analysis uncovers the landscape of regulation network for identity when it appears on the promoter vs. on the gene body. Therefore, genes in individual cell or tissue types. Together, this work pro- for each histone modification feature, CIGdiscover calculated a vides insights into cell identity regulation and delivers a powerful value for the associated promoter and further a value for the technique to facilitate regenerative medicine. associated gene body. These together led to 49 signatures. Considering that some peak features occasionally show consider- able correlation with each other, CIGdiscover tried the backward Results feature elimination and forward feature construction methods to Prediction of CIGs by CIGdiscover on the basis of the unique select an optimal combination of features for prediction of CIGs. histone codes. To build a solid foundation for study on CIGs, we Intriguingly, the backward method did not improve the model, performed a thorough review of literature (see “Methods”)to whereas the forward method successfully removed 40 of the develop Cell Identity Gene Data Base (CIGDB), the first database 49 signatures to achieve an optimal prediction accuracy

PDF Download

Entrez Symbols Name Termid Termdesc 117553 Uba3,Ube1c

Supp Material.Pdf

Literature Mining Sustains and Enhances Knowledge Discovery from Omic Studies

Plasma Cells in Vitro Generation of Long-Lived Human

Investigation of Differentially Expressed Genes in Nasopharyngeal Carcinoma by Integrated Bioinformatics Analysis

Mouse Haus1 Conditional Knockout Project (CRISPR/Cas9)

Identification of Differential Expression of P53 Associated Rnas at 3 Different Treatment Timepoints, and Association with Chip- Seq Identified P53 Genes

Identification of Genes Associated with Castration‑Resistant Prostate Cancer by Gene Expression Profile Analysis

Supplemental Data.Pdf

PDF Output of CLIC (Clustering by Inferred Co-Expression)

Receptor Signaling Through Osteoclast-Associated Monocyte

A Deep Proteomics Perspective on CRM1-Mediated Nuclear Export and Nucleocytoplasmic Partitioning