Application Note

Integrated analysis of RNA-Seq and ChIP-Seq data using Strand NGS to understand the Regulation of Cardiogenesis

Mohammed Toufiq, Sumeet Deshmukh, Srikanthi Ramachandrula, Sunil C. Cherukuri
Strand Life Sciences Pvt Ltd
www.strand-ngs.com [email protected] ; [email protected]

Overview

1% of all human births in the West are affected by congenital heart disease, which constitutes a major burden on public health organizations. Numerous next-generation sequencing studies utilize multiple technologies to understand and answer complex biological questions. Here, using Strand NGS bioinformatics software version 3.0 and above, we provide an illustrative example of how to integrate data from different sources (RNA-Seq and ChIP-Seq) to identify differential expression profiles and infer transcription factor binding sites in a combined analysis that defines the regulation of Cardiogenesis. Transcription factors like NKX2-5 and MEIS1 have been shown to play a critical role in vertebrate heart development. Identifying the expression and targets of these factors, along with their regulatory interactions, will be a major step towards understanding the broader cardiac developmental processes.

As a part of this study, we re-analyzed the publicly available datasets of GSE44576. This RNA-Seq and ChIP-Seq data was generated using the Illumina platform to investigate the expression profiles and genome binding sites of transcription factors. Similar to the conclusions in the original publication [1], we could identify the mechanism of transcriptional regulation during cardiac differentiation by successive binding of the two homeodomain transcription factors NKX2-5 and MEIS1 on the Popdc2 enhancer. ChIP-Seq data analysis helped in finding the binding domains of NKX2-5 and MEIS1, while RNA-Seq data analysis aided in finding their impact on the expression of other genes.

In addition to these results, we also built an architecture of pathway and NLP networks. Pathway Analysis showed that NKX2-5 plays a pivotal role in cardiogenesis and that any change in its levels has multiplying effects on the heart development pathway. NLP was used to build a network of plausible gene interactions mined from an interaction DB and a proprietary database constructed by Strand using all the PubMed abstracts. Further, NLP showed that most proteins that interact with NKX2-5 are either enhancers or growth factors, which might explain the mechanism of its heightened impact on heart development. We intend to showcase Strand NGS as a go-to tool for analyzing and interpreting multi-omics data with intuitive workflows and user-friendly features.

Figure 1: RNA-Seq data and ChIP-Seq data analysis workflow used for studying the regulation of Cardiogenesis

Datasets

RNA-Seq and ChIP-Seq datasets from mouse (Mus musculus, mm9 build) were obtained from the NCBI GEO database [GSE44576] [2]. The RNA-Seq data is paired end and the ChIP-Seq data is single end, both generated using the Illumina platform (See Table 1 and Table 2). Wild Type (WT) refers to a gene that prevails among the individuals of the natural population, whereas Hypomorph is a condition in which the altered gene product possesses a reduced level of activity or lacks the molecular function.

Samples      Tissue
SRR748961    E11.5 heart, Wild Type 1
SRR748962    E11.5 heart, Wild Type 2
SRR748963    E11.5 heart, Wild Type 3
SRR748964    E11.5 heart, Hypomorph 1
SRR748965    E11.5 heart, Hypomorph 2
SRR748966    E11.5 heart, Hypomorph 3

Table 1: RNA-Seq experiment information

Samples      Tissue
SRR748967    E11.5 heart, Input
SRR748968    E11.5 heart, Nkx2-5 ChIP (S1)
SRR748969    E11.5 heart, Nkx2-5 ChIP (S4)

Table 2: ChIP-Seq experiment information

Data analysis methodology

All analyses reported in this study were performed using Strand NGS bioinformatics software version 3.0 and above (See Figure 1). Before proceeding with the alignment, we looked at some pre-alignment QC metrics to investigate the read quality. In all the samples, most of the reads had an average read quality of ~39. Alignment was performed using our in-house Strand NGS aligner, which follows the Burrows-Wheeler Transform (BWT) approach. The data was aligned against the Transcriptome and Genome (mm9) using the UCSC model. The alignment parameters allowed for 10% mismatches and 5% gaps in a read. Reads aligning to multiple locations were reported only once, and reads aligning to more than 5 locations were ignored. Since the base quality dipped towards the 3' end, low-quality bases (base quality less than or equal to 10) were trimmed from that end. To ensure that this trimming did not result in very short reads, the minimum read length was fixed at 25bp. In each sample, 79-84% of the total reads were aligned, of which 76-80% were uniquely aligned.

Post alignment, the reads were de-duplicated and quantified based on the methods suggested by Mortazavi et al. [3], and quantile normalization was applied. Genes were filtered based on their normalized signal intensity values (between the 20.0th and 100.0th percentile), retaining those present in at least 1 out of 2 conditions. A Mann-Whitney unpaired test was performed to find entities showing statistically significant differences, with a p-value cut-off ≤ 0.05. Genes showing a fold change ≥ 1.5 were retained and used for downstream analysis, including Gene Ontology (GO) and Pathway Analysis.

The raw reads corresponding to the ChIP-Seq data were aligned in Strand NGS against the genome (mm9) with a minimum of 90% identity, a maximum of 5% gaps, and 25bp as the minimum aligned read length. Post alignment, de-duplication of reads was performed and peaks were detected with the MACS [4] algorithm on each of the replicate samples (S1 and S4) against the input separately, using an average fragment size of 300 bases and a p-value cut-off of 1.0E-5. The resulting peak regions were annotated with genes present in a +/-5 kbp window, and the common peak-associated genes were identified using a 2-way Venn diagram. Post this, the MEIS1 motif was downloaded from the JASPAR database [5], imported into Strand NGS, and scanned against the whole genome in order to find the possible binding sites for MEIS1.

Results and Discussion

RNA-Seq Analysis

Post quantification and filtering, data analysis using Strand NGS revealed a total of 11,929 differentially-expressed genes with p-value ≤ 0.05, of which 6,038 showed a fold change difference of ≥ 1.5 between the Wild Type and Hypomorphic Hearts. Among these, 5,952 genes showed up-regulation and 86 genes showed down-regulation. Figure 2 displays a 2-D scatter plot showing the fold change difference. Genes including NKX2-5 and MEIS1 (highlighted in yellow) show ≥ 1.5 fold change expression differences and are statistically significant. Most genes show an up-regulated expression in the Wild Type Hearts. Figure 3 displays a profile plot of gene expression regulation.

Figure 2: 2-D scatter plot plotted on the fold change analysis list. Genes including NKX2-5 and MEIS1 (highlighted in yellow) show a ≥ 1.5 fold change level difference and are also statistically significant with p-value ≤ 0.05. Most genes show an up-regulated expression in the Wild Type Hearts.

Figure 3: Hypomorphic Hearts show lower gene expression levels for NKX2-5, MEIS1, and Popdc2 compared to Wild Type Hearts in the profile plot.

In addition, it was observed that Hypomorphic Hearts with lowered NKX2-5 expression showed lowered expression levels of Popdc2, and the corresponding transcripts also showed similar behaviour in the Gene View (See Figure 4). Principal Component Analysis (PCA) further confirmed the clear distinction and separation between the Wild Type and Hypomorphic Hearts (See Figure 5). In PCA, PC1 is the eigenvector that captures the primary variation in the dataset, PC2 captures the second most variation, and PC3 captures the third most. In our analysis, sample groups are separated by PC1, PC2, and PC3.

Figure 4: Gene View showing raw counts of Popdc2 transcripts having a similar pattern to NKX2-5 expression levels in Hypomorphic and Wild Type Hearts.

Figure 5: Clear distinction and data separation in PCA between the Hypomorphic and Wild Type Hearts.

Strand NGS offers support for GO analysis, a functional enrichment analysis of the significant genes based on biological process, molecular function, and cellular component. The affected gene list is found to be significantly enriched for genes primarily involved in heart development and other cardiac-related functions (See Figure 6).

Figure 6: A. Gene Ontology analysis of functionally affected genes. B. The affected gene list is found to be significantly enriched for genes involved in heart development and other cardiac-related functions.

ChIP-Seq Analysis

Peak detection is used to identify the interaction pattern of a protein with DNA, which involves gene activation by a set of protein transcription factors. As part of this analysis, peaks were detected with the MACS algorithm [4]. Peak detection on each of the replicate samples (S1 and S4) against the input separately yielded 14,149 peaks in total, and these peak regions were annotated to 8,178 genes (+/-5 kbp region size). This is as expected, as the Popdc2 enhancer has a prominent role in the binding of NKX2-5 and MEIS1 (See Figure 7).

Figure 8: A. MEIS1 motif downloaded from the JASPAR database. B. MEIS1 motif scanned against the whole genome in Strand NGS to identify the possible binding sites for MEIS1.
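The peak annotation and replicate comparison described for the ChIP-Seq data can be illustrated with a minimal Python sketch. It is not the Strand NGS implementation: the Interval type, the genes_near_peaks function, and all coordinates are hypothetical and invented for illustration, and the sketch assumes that the +/-5 kbp window is measured from the gene boundaries. The 2-way Venn diagram is reduced to plain set operations on the gene names associated with S1 and S4 peaks.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str


def genes_near_peaks(peaks, genes, window=5_000):
    """Return names of genes whose span, padded by `window` bp, overlaps any peak."""
    hits = set()
    for gene in genes:
        lo, hi = gene.start - window, gene.end + window
        for peak in peaks:
            same_chrom = peak.chrom == gene.chrom
            if same_chrom and peak.start <= hi and peak.end >= lo:
                hits.add(gene.name)
                break  # one overlapping peak is enough to associate the gene
    return hits


# Toy coordinates, invented for illustration only (not from GSE44576)
genes = [
    Interval("chr16", 20_000, 30_000, "GeneA"),
    Interval("chr16", 80_000, 90_000, "GeneB"),
    Interval("chr2", 10_000, 12_000, "GeneC"),
]
peaks_s1 = [
    Interval("chr16", 16_000, 16_300, "s1_peak1"),  # 3.7 kbp before GeneA
    Interval("chr16", 91_000, 91_400, "s1_peak2"),  # 1 kbp past GeneB
]
peaks_s4 = [
    Interval("chr16", 33_000, 33_500, "s4_peak1"),  # 3 kbp past GeneA
    Interval("chr2", 40_000, 40_200, "s4_peak2"),   # 28 kbp past GeneC, outside window
]

genes_s1 = genes_near_peaks(peaks_s1, genes)
genes_s4 = genes_near_peaks(peaks_s4, genes)

# The three regions of a 2-way Venn diagram
common = genes_s1 & genes_s4
only_s1 = genes_s1 - genes_s4
only_s4 = genes_s4 - genes_s1
print(sorted(common), sorted(only_s1), sorted(only_s4))
```

In this toy example only GeneA is associated with peaks from both replicates, so it alone would fall in the intersection of the Venn diagram. The nested loop is adequate for a handful of intervals; for genome-scale peak lists a sorted or interval-tree lookup would be the more practical choice.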
The MEIS1 motif was downloaded from the JASPAR database [5], imported into Strand NGS, and scanned against the whole genome in order to find the possible binding sites for MEIS1 (See Figure 8A and 8B).

Integrated analysis of RNA-Seq and ChIP-Seq

To explain the regulation of Cardiogenesis, we conducted an integrated analysis of the