Pathway and Network Analysis of Transcriptomic and Genomic Data.Pdf
Total Page:16
File Type:pdf, Size:1020Kb
저작자표시-비영리-변경금지 2.0 대한민국 이용자는 아래의 조건을 따르는 경우에 한하여 자유롭게 l 이 저작물을 복제, 배포, 전송, 전시, 공연 및 방송할 수 있습니다. 다음과 같은 조건을 따라야 합니다: 저작자표시. 귀하는 원저작자를 표시하여야 합니다. 비영리. 귀하는 이 저작물을 영리 목적으로 이용할 수 없습니다. 변경금지. 귀하는 이 저작물을 개작, 변형 또는 가공할 수 없습니다. l 귀하는, 이 저작물의 재이용이나 배포의 경우, 이 저작물에 적용된 이용허락조건 을 명확하게 나타내어야 합니다. l 저작권자로부터 별도의 허가를 받으면 이러한 조건들은 적용되지 않습니다. 저작권법에 따른 이용자의 권리는 위의 내용에 의하여 영향을 받지 않습니다. 이것은 이용허락규약(Legal Code)을 이해하기 쉽게 요약한 것입니다. Disclaimer Doctoral Thesis Pathway and Network Analysis of Transcriptomic and Genomic Data Sora Yoon Department of Biological Sciences Graduate School of UNIST 2019 Pathway and Network Analysis of Transcriptomic and Genomic Data Sora Yoon Department of Biological Sciences Graduate School of UNIST Abstract The development of high-throughput technologies has enabled to produce omics data and it has facilitated the systemic analysis of biomolecules in cells. In addition, thanks to the vast amount of knowledge in molecular biology accumulated for decades, numerous biological pathways have been categorized as gene-sets. Using these omics data and pre-defined gene-sets, the pathway analysis identifies genes that are collectively altered on a gene-set level under a phenotype. It helps the biological interpretation of the phenotype, and find phenotype-related genes that are not detected by single gene- based approach. Besides, the high-throughput technologies have contributed to construct various biological networks such as the protein-protein interactions (PPIs), metabolic/cell signaling networks, gene-regulatory networks and gene co-expression networks. Using these networks, we can visualize the relationships among gene-set members and find the hub genes, or infer new biological regulatory modules. Overall, this thesis/dissertation describes three approaches to enhance the performance of pathway and/or network analysis of transcriptomic and genomic data. First, a simple but effective method that improves the gene-permuting gene-set enrichment analysis (GSEA) of RNA-sequencing data will be addressed, which is especially useful for small replicate data. By taking absolute statistic, it greatly reduced the false positive rate caused by inter-gene correlation within gene-sets, and improved the overall discriminatory ability in gene-permuting GSEA. Next, a powerful competitive gene-set analysis tool for GWAS summary data, named GSA-SNP2, will be introduced. The z-score method applied with adjusted gene score greatly improved sensitivity compared to existing competitive gene-set analysis methods while exhibiting decent false positive control. The performance was validated using both simulation and real data. In addition, GSA-SNP2 visualizes protein interaction networks within and across the significant pathways so that the user can prioritize the core subnetworks for further mechanistic study. Finally, a novel approach to predict condition-specific miRNA target network by biclustering a large collection of mRNA fold-change data for sequence-specific targets will be introduced. The bicluster targets exhibited on average 17.0% (median 19.4%) improved gain in certainty (sensitivity + specificity). The net gain was further increased up to 32.0% (median 33.2%) by filtering them using functional network information. The analysis of cancer-related biclusters revealed that PI3K/Akt signaling pathway is strongly enriched in targets of a few miRNAs in breast cancer and diffuse large B-cell lymphoma. Among them, five independent prognostic miRNAs were identified, and repressions of bicluster targets and pathway activity by mir-29 were experimentally validated. The BiMIR database provides a useful resource to search for miRNA regulation modules for 459 human miRNAs. i ii Table of Contents Abstract ............................................................................................................................................ i Table of Contents ............................................................................................................................ iii List of Figures ................................................................................................................................. vi List of Tables ................................................................................................................................ viii Chapter I: Introduction ................................................................................................................... 1 1.1 Omics data ................................................................................................................................ 1 1.1.1 Genomic data ......................................................................................................................... 1 1.1.1.1 Genome-wide association study ........................................................................................... 2 1.1.2 Transcriptomic data ................................................................................................................ 4 1.1.2.1 Microarray and RNA-sequencing ......................................................................................... 4 1.1.2.2 Issues in RNA-sequencing data analysis .............................................................................. 5 1.2 Pathway analysis ....................................................................................................................... 6 1.2.1 Pathway databases .................................................................................................................. 7 1.2.2 Pathway analysis methods ...................................................................................................... 7 1.2.2.1 Over-representation analysis ................................................................................................ 7 1.2.2.2 Functional class sorting ....................................................................................................... 7 1.2.2.2.1 Gene-set enrichment analysis ............................................................................................ 8 1.2.2.3 Pathway topology-based method .......................................................................................... 9 1.2.3 Competitive and self-contained gene-set analysis .................................................................... 9 1.3 Biological network .................................................................................................................. 12 1.4 Research overview .................................................................................................................. 13 Chapter II: Improving gene-set enrichment analysis of RNA-seq data with small replicates .... 15 2.1 Abstract................................................................................................................................... 15 2.2 Introduction............................................................................................................................. 15 2.3 Materials and Methods ............................................................................................................ 16 2.3.1 Absolute gene-permuting GSEA and filtering ....................................................................... 16 2.3.2 Simulation of the read count data with the inter-gene correlation........................................... 18 2.3.3 Biological relevance measure of a gene-set ........................................................................... 20 2.3.4 RNA-seq data handling and gene-set condition ..................................................................... 21 2.3.5 Gene-set size ........................................................................................................................ 21 iii 2.3.6 AbsfilterGSEA R package .................................................................................................... 21 2.4 Results .................................................................................................................................... 21 2.4.1 Comparison of gene-permuting GSEA methods for simulated read count data ...................... 21 2.4.2 Comparison of GSEA methods for RNA-seq data ................................................................. 25 2.4.3 Effects of the absolute filtering on false positive control and biological relevance ................. 26 2.5 Discussion ............................................................................................................................... 27 2.6 Supplementary information of Chapter II ................................................................................. 29 Chapter III: A powerful pathway enrichment and network analysis tool for GWAS summary data ................................................................................................................................................ 34 3.1 Abstract................................................................................................................................... 34 3.2 Introduction............................................................................................................................. 34 3.3 Materials and Methods ............................................................................................................ 36 3.3.1 Algorithm of GSA-SNP2 ...................................................................................................... 36 3.3.2 Competitive pathway analysis tools .....................................................................................