Application Note

Integrated analysis of RNA-Seq and ChIP-Seq data using Strand NGS to understand the Regulation of Cardiogenesis

Mohammed Toufiq, Sumeet Deshmukh, Srikanthi Ramachandrula, Sunil C. Cherukuri
Strand Life Sciences Pvt Ltd

Overview

1% of all human births in the West are affected by congenital heart disease, which constitutes a major burden on public health organizations. Numerous next-generation sequencing studies utilize multiple technologies to understand and answer complex biological questions. Here, using Strand NGS software version 3.0 and above, we provide an illustrative example of how to integrate data from different sources to identify differential expression profiles and infer transcription factor binding sites (using RNA-Seq and ChIP-Seq data) in a combined analysis to define the regulation of Cardiogenesis. Transcription factors like NKX2-5 and MEIS1 have been shown to play a critical role in vertebrate heart development. Identifying the expression and targets of these factors, along with their regulatory interactions, will be a major step towards understanding the broader cardiac developmental processes. As a part of this study, we re-analyzed the publicly available dataset GSE44576. These RNA-Seq and ChIP-Seq data were generated using the Illumina platform to investigate the expression profiles and genome binding sites of transcription factors.

Similar to the conclusions in the original publication [1], we could identify the mechanism of transcriptional regulation during cardiac differentiation by successive binding of the two homeodomain transcription factors NKX2-5 and MEIS1 on the Popdc2 enhancer. ChIP-Seq data analysis helped in finding the binding domains of NKX2-5 and MEIS1, while RNA-Seq data analysis aided in finding their impact on the expression of other genes. In addition to these results, we also built an architecture of pathway and NLP networks. Pathway Analysis showed that NKX2-5 plays a pivotal role in cardiogenesis and that any change in its levels leads to multiplying effects on the heart development pathway. NLP was used to build a network of plausible interactions between genes, mined from an interaction database and a proprietary database constructed by Strand using all the PubMed abstracts. Further, NLP showed that most proteins that interact with NKX2-5 are either enhancers or growth factors, which might explain the mechanism of heightened impact on heart development. We intend to showcase Strand NGS as a go-to tool for analyzing and interpreting multi-omics data with intuitive workflows and user-friendly features.

Figure 1: RNA-Seq data and ChIP-Seq data analysis workflow used for studying the regulation of Cardiogenesis

Integrated analysis of RNA-Seq & ChIP-Seq data

Datasets

RNA-Seq and ChIP-Seq datasets from mouse (Mus musculus, mm9 build) were obtained from the NCBI GEO database (GSE44576) [2]. The RNA-Seq data is paired end and the ChIP-Seq data is single end, both generated using the Illumina platform (See Table 1 and Table 2). Wild Type (WT) refers to a gene that prevails among the individuals of the natural population, whereas Hypomorph is a condition in which the altered gene product possesses a reduced level of activity or lacks the molecular function.

Samples      Tissue
SRR748961    E11.5 heart, Wild Type 1
SRR748962    E11.5 heart, Wild Type 2
SRR748963    E11.5 heart, Wild Type 3
SRR748964    E11.5 heart, Hypomorph 1
SRR748965    E11.5 heart, Hypomorph 2
SRR748966    E11.5 heart, Hypomorph 3

Table 1: RNA-Seq experiment information

Samples      Tissue
SRR748967    E11.5 heart, Input
SRR748968    E11.5 heart, Nkx2-5 ChIP (S1)
SRR748969    E11.5 heart, Nkx2-5 ChIP (S4)

Table 2: ChIP-Seq experiment information

Data analysis methodology

All analyses reported in this study were performed using Strand NGS bioinformatics software version 3.0 and above (See Figure 1). Before proceeding with the alignment, we looked at some pre-alignment QC metrics to investigate the read quality. In all the samples, most of the reads had an average read quality of ~39. Alignment was performed using our in-house Strand NGS aligner, which follows the Burrows-Wheeler Transform (BWT) approach. The data was aligned against the Transcriptome and Genome (mm9) using the UCSC model. The alignment parameters allowed for 10% mismatches and 5% gaps in a read. Reads aligning to multiple locations were reported only once and reads aligning to more than 5 locations were ignored. Since the base quality dipped towards the 3' end, low-quality bases (quality less than or equal to 10) were trimmed from that end. To ensure that this trimming did not result in very short reads, the minimum read length was fixed at 25 bp. In each sample, 79-84% of the total reads were aligned, of which 76-80% were uniquely aligned.

Post alignment, the reads were de-duplicated and quantified based on the methods suggested by Mortazavi et al [3], and quantile normalization was applied. Genes were filtered on their normalized signal intensity values, retaining those between the 20.0th and 100.0th percentile in at least 1 out of 2 conditions. A Mann-Whitney unpaired test was performed to find entities showing statistically significant differences with a p-value cut-off ≤ 0.05. Genes showing a fold change ≥ 1.5 were retained and used for downstream analysis including Gene Ontology (GO) and Pathway Analysis.

The raw reads corresponding to the ChIP-Seq data were aligned in Strand NGS against the genome (mm9) with a minimum of 90% identity, a maximum of 5% gaps, and 25 bp as the minimum aligned read length.

Post alignment, de-duplication of reads was performed and peaks were detected with the MACS algorithm [4] on each of the replicate samples (S1 and S4) against the input separately, using an average fragment size of 300 bases and a p-value cut-off of 1.0E-5. The resulting peak regions were annotated with genes present in a +/- 5 kbp window, and the common peak-associated genes were identified using a 2-way Venn diagram. After this, the MEIS1 motif was downloaded from the JASPAR database [5], imported into Strand NGS and scanned against the whole genome in order to find the possible binding sites for MEIS1.

Results and Discussion

RNA-Seq Analysis

Post quantification and filtering, data analysis using Strand NGS revealed a total of 11,929 differentially expressed genes with p-value ≤ 0.05, of which 6,038 showed a fold change difference of ≥ 1.5 between the Wild Type and Hypomorphic Hearts. Among these, 5,952 genes showed up-regulation and 86 genes showed down-regulation. Figure 2 displays a 2-D scatter plot showing the fold change difference. Genes including NKX2-5 and MEIS1 (highlighted in yellow) show ≥ 1.5 fold change expression differences and are statistically significant. Most genes show an up-regulated expression in the Wild Type Hearts. Figure 3 displays a profile plot of gene expression regulation.

In addition, it was observed that Hypomorphic Hearts with lowered NKX2-5 expression showed lowered expression levels of Popdc2, and the corresponding transcripts also showed similar behaviour in the Gene View (See Figure 4). Principal Component Analysis (PCA) further confirmed the clear distinction and separation between the Wild Type and Hypomorphic Hearts (See Figure 5). In PCA, PC1 is the eigenvector that captures the primary variation in the dataset, PC2 captures the second most variation and PC3 captures the third most. In our analysis, sample groups are separated by PC1, PC2, and PC3.

Strand NGS offers support for GO analysis, a functional enrichment analysis of the significant genes based on biological process, molecular function, and cellular component. The affected gene list is found to be significantly enriched for genes primarily involved in heart development and other cardiac-related functions (See Figure 6).

Figure 2: 2-D scatter plot plotted on the fold change analysis list. Genes including NKX2-5 and MEIS1 (highlighted in yellow) show a ≥ 1.5 fold change difference and are also statistically significant with p-value ≤ 0.05. Most genes show an up-regulated expression in the Wild Type Hearts.

Figure 3: Hypomorphic Hearts show lower levels of NKX2-5, MEIS1, and Popdc2 compared to Wild Type Hearts in the profile plot.

Figure 4: Gene View showing raw counts of Popdc2 transcripts having a similar pattern to NKX2-5 expression levels in Hypomorphic and Wild Type Hearts.

Figure 5: Clear distinction and data separation in PCA between the Hypomorphic and Wild Type Hearts.
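The quantification and fold change steps used in the RNA-Seq analysis can be sketched in a few lines of Python. This is a minimal illustration and not the Strand NGS implementation: the RPKM formula follows Mortazavi et al, while the function names, the toy values, and the choice to compute fold change as a ratio of group means are assumptions made for this example.

```python
from statistics import mean


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def fold_change(wild_type, hypomorph):
    """Ratio of group means, reported as a value >= 1 plus a direction.

    Assumes both group means are positive (for example, after low-signal
    genes have been removed by a percentile filter).
    """
    wt, hypo = mean(wild_type), mean(hypomorph)
    if wt >= hypo:
        return wt / hypo, "up in Wild Type"
    return hypo / wt, "down in Wild Type"


def passes_cutoff(wild_type, hypomorph, min_fold=1.5):
    """True when the fold change meets the cut-off (1.5 in this study)."""
    fc, _ = fold_change(wild_type, hypomorph)
    return fc >= min_fold


if __name__ == "__main__":
    # Hypothetical normalized values for one gene: 3 WT and 3 Hypomorph samples.
    wt_values = [30, 33, 27]
    hypo_values = [10, 11, 9]
    print(fold_change(wt_values, hypo_values))
    print(passes_cutoff(wt_values, hypo_values))
```

The statistical test (Mann-Whitney) and quantile normalization are left out to keep the sketch short; in practice the fold change filter is applied only to genes that already pass the p-value cut-off.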



Figure 6: A. Gene Ontology analysis of functionally affected genes. B. The affected gene list is found to be significantly enriched for genes involved in heart development and other cardiac-related functions
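To make the idea of GO enrichment concrete, the sketch below computes a one-sided hypergeometric p-value for a single GO term. The source does not state which statistic Strand NGS uses, so the hypergeometric test is an assumption here (it is a common choice for this kind of analysis), and the function name and toy numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, selected, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    population: number of genes in the background
    annotated:  background genes that carry the GO term
    selected:   genes in the significant (affected) list
    overlap:    significant genes that carry the GO term
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers only: 1000 background genes, 50 annotated with the term,
    # 100 genes in the significant list, 15 of them annotated.
    print(hypergeom_enrichment_p(1000, 50, 100, 15))
```

A small p-value means the term appears in the gene list more often than expected by chance; with many GO terms tested, a multiple-testing correction would normally follow.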

ChIP-Seq Analysis

Peak detection is used to identify the interaction pattern of a protein with DNA, which involves gene activation by a set of protein transcription factors. As part of this analysis, peaks were detected with the MACS algorithm [4]. Peaks were called on each of the replicate samples (S1 and S4) against the input separately; in total, 14,149 peaks were detected, and these peak regions were annotated to 8,178 genes (+/- 5 kbp region size). This is as expected, as the Popdc2 enhancer has a prominent role in the binding of NKX2-5 and MEIS1 (See Figure 7). The MEIS1 motif was downloaded from the JASPAR database [5], imported into Strand NGS and scanned against the whole genome in order to find the possible binding sites for MEIS1 (See Figure 8A and 8B).

Figure 7: Popdc2 showing transcription factor binding occupancy sites for both NKX2-5 (found through peak detection) and MEIS1 (inferred through scan motif) in Genome Browser.

Figure 8: A. MEIS1 motif downloaded from the JASPAR database. B. MEIS1 motif scanned against the whole genome in Strand NGS to identify the possible binding sites for MEIS1.

Integrated analysis of RNA-Seq and ChIP-Seq

To explain the regulation of Cardiogenesis, we conducted an integrated analysis of the differentially expressed genes identified from RNA-Seq data and the transcription factor binding sites from ChIP-Seq data. Overlapping the genes that passed the fold change filter in RNA-Seq with the peak-associated genes in ChIP-Seq helped to infer the impact of binding on expression levels, and indicated that the expression of these genes is regulated by NKX2-5 binding (See Figure 9). To validate further, Strand NGS has support for curated pathways [6-7], which provide an interactive computing environment that promotes investigation and enables understanding of data within a biological context. Pathway analysis revealed that these genes are involved in heart development and are found to show lowered expression in Hypomorphic Hearts (See Figure 10). Further, NLP and Hierarchical Clustering showed that most genes interacting with NKX2-5 mimic its expression pattern, and that the interacting genes are either DNA-binding proteins or growth factors (See Figures 11 and 12).
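The peak-associated genes used in the integrated analysis come from annotating each peak with genes inside a +/- 5 kbp window. A minimal sketch of that annotation step follows. It is not the Strand NGS implementation: the `Gene` and `Peak` types, the function name, and all coordinates are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


class Peak(NamedTuple):
    chrom: str
    start: int
    end: int


def genes_near_peaks(peaks, genes, window=5_000):
    """Return names of genes whose span, padded by `window` bp on each
    side, overlaps at least one peak on the same chromosome."""
    hits = set()
    for gene in genes:
        lo, hi = gene.start - window, gene.end + window
        for peak in peaks:
            if peak.chrom == gene.chrom and peak.start <= hi and peak.end >= lo:
                hits.add(gene.name)
                break  # one overlapping peak is enough
    return hits


if __name__ == "__main__":
    # Hypothetical gene models and peaks (not real mm9 coordinates).
    genes = [
        Gene("GeneA", "chr1", 10_000, 12_000),
        Gene("GeneB", "chr1", 50_000, 52_000),
        Gene("GeneC", "chr2", 10_000, 12_000),
    ]
    peaks = [Peak("chr1", 16_500, 16_900), Peak("chr1", 30_000, 30_400)]
    print(genes_near_peaks(peaks, genes))
```

The nested loop is quadratic and only suitable for toy inputs; on genome-scale data a sorted sweep or an interval index would be the usual choice.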

Figure 11: NLP-derived regulatory network showing genes interacting with NKX2-5 involved in heart development, DNA binding etc.

Figure 9: Venn diagram displaying the overlap between the up-regulated genes from RNA-Seq and the peak-associated genes in ChIP-Seq. 703 genes have binding sites for NKX2-5 in both replicates and show higher expression in Wild Type than in Hypomorphic Hearts. This indicates that their expression is regulated by NKX2-5 binding.
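The overlap shown in the Venn diagram amounts to set intersections between gene lists. As a minimal sketch, assuming each list is available as a Python set of gene identifiers (the function name and the toy identifiers below are hypothetical):

```python
def venn_regions(upregulated, peaks_s1, peaks_s4):
    """Split genes into regions of interest for the integrated analysis.

    upregulated: genes higher in Wild Type than in Hypomorphic Hearts
    peaks_s1:    peak-associated genes from ChIP replicate S1
    peaks_s4:    peak-associated genes from ChIP replicate S4
    """
    bound_in_both = peaks_s1 & peaks_s4
    return {
        "up_and_bound": upregulated & bound_in_both,
        "up_only": upregulated - bound_in_both,
        "bound_only": bound_in_both - upregulated,
    }


if __name__ == "__main__":
    # Toy identifiers only; not genes from the study.
    up = {"g1", "g2", "g3", "g4"}
    s1 = {"g2", "g3", "g5", "g6"}
    s4 = {"g3", "g5", "g7"}
    for region, members in venn_regions(up, s1, s4).items():
        print(region, sorted(members))
```

Requiring a gene to be peak-associated in both replicates is the conservative choice described in the figure caption; the "up_and_bound" region corresponds to the candidate directly regulated genes.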

Figure 10: Integrated pathway analysis of a curated WikiPathway (Effect on Heart Development Pathway [6]) as shown in the pathway viewer, with matched entities that are the genes involved in heart development and are found to have lowered expression levels in Hypomorphic Hearts. Hence, NKX2-5 plays a pivotal role in Cardiogenesis and any change in its levels has multiplying effects on the heart development pathway.

Figure 12: Expression patterns of genes interacting with NKX2-5 in Hierarchical Clustering. The dendrogram constructed on the genes derived from the NLP network showed that most genes interacting with NKX2-5 mimic its expression pattern. Also, most genes interacting with NKX2-5 are either DNA-binding proteins or transcription factors, as inferred through the NLP-derived regulatory network.
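One simple way to quantify whether a gene "mimics" the NKX2-5 expression pattern, as seen in the hierarchical clustering of Figure 12, is the Pearson correlation between the two expression profiles across samples. The sketch below is only an illustration: the source does not state which similarity measure Strand NGS uses for clustering, and the function names and toy profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_by_similarity(reference, profiles):
    """Sort gene names by decreasing correlation with the reference."""
    scored = {name: pearson(reference, values) for name, values in profiles.items()}
    return sorted(scored, key=scored.get, reverse=True)


if __name__ == "__main__":
    # Toy profiles over 3 Wild Type and 3 Hypomorph samples (made-up values).
    reference = [9.0, 9.2, 8.8, 4.1, 4.0, 3.9]
    profiles = {
        "geneX": [7.0, 7.3, 6.9, 3.0, 3.2, 2.9],  # follows the reference
        "geneY": [2.0, 2.1, 1.9, 6.0, 6.2, 5.8],  # opposite trend
    }
    print(rank_by_similarity(reference, profiles))
```

A distance such as 1 minus the correlation is a common input to hierarchical clustering, so genes with high correlation to the reference end up in the same branch of the dendrogram.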


Conclusion

Strand NGS is a software suite that offers a host of powerful data analysis and interpretation tools enabling a thorough investigation of complex biological datasets. In this study, applying integrated solutions, we could identify the mechanism of transcriptional regulation during cardiac differentiation by successive binding of the two homeodomain transcription factors NKX2-5 and MEIS1 on the Popdc2 enhancer. The combined analysis of RNA-Seq and ChIP-Seq data using Strand NGS revealed the expression profiles and their binding domains, where ChIP-Seq helped in identifying the binding domains of NKX2-5 and inferring those of MEIS1, while RNA-Seq aided in finding their impact on the expression of other genes.

Integrating data from different sequencing technologies in Strand NGS revealed previously unknown information: NKX2-5 plays a pivotal role in cardiogenesis, and any change in its levels has multiplying effects on the heart development pathway. Further, construction of NLP networks and a Hierarchical Clustering dendrogram showed that most proteins that interact with NKX2-5 are DNA-binding proteins, enhancers or growth factors, which explains the mechanism of heightened impact on heart development.

References:

1. Dupays L, Shang C, Wilson R, Kotecha S, Wood S, Towers N, and Mohun T. Sequential Binding of MEIS1 and NKX2-5 on the Popdc2 Gene: A Mechanism for Spatiotemporal Regulation of Enhancers during Cardiogenesis. Cell Rep, 13(1):183-195, 2015.
2. NCBI GEO database: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44576
3. Mortazavi A, Williams BA, McCue K, Schaeffer L, and Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods, 5(7):621-628, 2008.
4. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, and Liu XS. Model-based analysis of ChIP-Seq (MACS). Genome Biol, 9(9):137, 2008.
5. Mathelier A, Zhao X, Zhang AW, Parcy F, Worsley-Hunt R, Arenillas DJ, Buchman S, Chen CY, Chou A, Ienasescu H, Lim J, Shyr C, Tan G, Zhou M, Lenhard B, Sandelin A, and Wasserman WW. JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Res, 42:142-147, 2014.
6. Kutmon M, Riutta A, Nunes N, Hanspers K, Willighagen EL, Bohler A, Mélius J, Waagmeester A, Sinha SR, Miller R, Coort SL, Cirillo E, Smeets B, Evelo CT, and Pico AR. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Res, 44:488-494, 2016.
7. Caspi R, Billington R, Ferrer L, Foerster H, Fulcher CA, Keseler IM, Kothari A, Krummenacker M, Latendresse M, Mueller LA, Ong Q, Paley S, Subhraveti P, Weaver DS, and Karp PD. The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res, 44(1):471-480, 2015.

Strand Life Sciences Pvt. Ltd.
5th Floor, Kirloskar Business Park
Bellary Road, Hebbal, Bangalore 560024
Phone: +91-80-4078-7263
Fax: +91-80-4078-7299
Website: www.strandls.com

www.strand-ngs.com