Algorithms to Integrate Omics Data for Personalized Medicine
ALGORITHMS TO INTEGRATE OMICS DATA FOR PERSONALIZED MEDICINE

by

MARZIEH AYATI

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Thesis Adviser: Mehmet Koyutürk

Department of Electrical Engineering and Computer Science
CASE WESTERN RESERVE UNIVERSITY

August, 2018

Algorithms to Integrate Omics Data for Personalized Medicine

Case Western Reserve University
Case School of Graduate Studies

We hereby approve the thesis¹ of MARZIEH AYATI for the degree of Doctor of Philosophy

Mehmet Koyutürk, Committee Chair, Adviser, Department of Electrical Engineering and Computer Science. Date: 03/27/2018
Mark R. Chance, Committee Member, Center of Proteomics. Date: 03/27/2018
Soumya Ray, Committee Member, Department of Electrical Engineering and Computer Science. Date: 03/27/2018
Vincenzo Liberatore, Committee Member, Department of Electrical Engineering and Computer Science. Date: 03/27/2018

¹ We certify that written approval has been obtained for any proprietary material contained therein.

To the greatest family who I owe my life to

Table of Contents

List of Tables
List of Figures
Acknowledgements
Abstract
Chapter 1. Introduction
Chapter 2. Preliminaries
  Complex Diseases
  Protein-Protein Interaction Network
  Genome-Wide Association Studies
  Phosphorylation
  Biweight midcorrelation
Chapter 3. Identification of Disease-Associated Protein Subnetworks
  Introduction and Background
  Methods
  Results and Discussion
  Conclusion
Chapter 4. Population Covering Locus Sets for Risk Assessment in Complex Diseases
  Introduction and Background
  Methods
  Results and Discussion
  Conclusion
Chapter 5. Application of Phosphorylation in Precision Medicine
  Introduction and Background
  Methods
  Results
  Conclusion
Chapter 6. Conclusion and Future Research
Complete References

List of Tables

3.1 Statistical significance of top two subnetworks.
The table shows the q-values of the top two subnetworks identified using each scoring scheme according to the permuted genotype and PPI for WTCCC-T2D.

3.2 Statistical significance (q-value) of top subnetworks. The subnetworks are identified using MoBaS according to the permuted genotype and PPI on two independent Psoriasis datasets.

3.3 The contingency between the individual genes and subnetworks identified on two independent Psoriasis datasets. The p-values of the contingency of each table according to the Chi-Square test are (a) 1.72E-175, (b) 3.39E-15, (c) 4.32E-28, and (d) 0.

4.1 Genotype models. g(c, s) denotes the genotype of locus c in sample s and m(i) is a genotype model.

4.2 Genome-Wide Association data used in the computational experiments.

4.3 The number of PoCos identified on each dataset, and the distribution of the genomic loci in each individual PoCo. The average and standard deviation are reported across different folds.

4.4 Shared molecular bases of T2D, BD, and CAD as revealed by NetPocos. For each disease, the ten most frequent genes that are involved in NetPocos selected by L1-regularized logistic regression in risk prediction are listed. Previously reported associations of these genes with the three diseases are indicated with a "Yes" or "No" in the respective column of each row.

5.1 Data-specific kinase prediction. The phosphosites listed in this table are reported to have more than one kinase in PhosphoSitePlus. CophosK+ identifies previously reported, but different, kinases as the top-ranked candidate based on each dataset.

List of Figures

1.1 Treatment paradigm. Without personalized medicine, some patients might benefit and some might not benefit from the treatment (a). With personalized medicine, the patient-specific analysis will be beneficial to all patients (b).

1.2 Omics Analysis.
Separate analysis of (a) genomics, (b) transcriptomics, (c) proteomics, and (d) interactomics might not be able to capture the whole picture of the mechanism of diseases. (e) Integration of omics data provides a comprehensive view to discover the mechanisms involved in complex diseases.

1.3 Organization of the thesis in the context of the central dogma of molecular biology. Different technologies enable generation of molecular data at different omics levels. Each chapter presents a novel algorithm to integrate disparate omics datasets to analyze complex diseases.

3.1 Illustration of existing and proposed scoring schemes. This figure shows the scoring schemes for quantifying the disease association of protein subnetworks: (a) Node-Based scoring, (b) Linear Combination of node scores and edge scores, (c) the proposed Modularity-Based (MoBaS) scoring scheme. For each method, the score of a subnetwork is computed as an aggregate of all quantities in the figure.

3.2 Statistical significance of subnetworks. Statistical significance of high-scoring subnetworks identified using Node-Based scoring (first column), Linear Combination of node scores and edge scores (second column), and Modularity-Based (MoBaS) scoring (third column). The 20 highest-scoring subnetworks identified using each scoring scheme are shown. The x-axis shows the rank of each subnetwork according to its score; the y-axis shows its score. The blue curve shows the scores of the subnetworks identified on the WTCCC-T2D dataset. For each i on the x-axis, the red (green) curve and error bar in the first (second) row show the distribution of the scores of the i highest-scoring subnetworks in 100 datasets obtained by permuting the genotypes of the samples (permuting the interactions in the PPI networks while preserving node degrees).

3.3 Two significant subnetworks. Two subnetworks that are found to be significantly associated with T2D.
The size of each node indicates the significance of the association of the corresponding protein with T2D (r_v). The diamond nodes are those previously reported to be associated with T2D in the literature¹. The intensity of purple coloring in the nodes indicates the number of computational disease gene prioritization methods² that identified the respective gene to be associated with T2D. The individual p-values of each gene in the subnetwork are shown in the table to the left of the subnetwork. The genes with insignificant p-values (p > 0.05) that are known to be related to T2D are highlighted in yellow. The genes with insignificant p-values that are not reported to be related to T2D are highlighted in orange. These genes are the candidates for further investigation.

3.4 The statistical significance of high-scoring subnetworks using MoBaS on the Psoriasis datasets. The 16 highest-scoring subnetworks identified using MoBaS are shown. The x-axis shows the rank of each subnetwork according to its score; the y-axis shows its score. The first row shows the result of MoBaS on the GAIN-PS dataset and the second row shows the result on the WTCCC-PS dataset. The blue curve shows the scores of the subnetworks identified on the dataset. For each i on the x-axis, the red (green) curve and error bar show the distribution of the scores of the i highest-scoring subnetworks in 100 datasets obtained by permuting the genotypes of the samples (permuting the interactions in the PPI networks while preserving node degrees).

3.5 Reproducibility of identified subnetworks using MoBaS in two independent datasets. The size of the circles represents the size of the identified subnetwork. The thickness of the edges represents the significance of overlap between the two subnetworks based on the hypergeometric distribution.

3.6 Robustness of MoBaS. The relation between the rank of a subnetwork in the original data and the rank of the subnetworks in incomplete data in 10 different runs.
Different colors represent different percentages of missing samples.

4.1 The workflow of the proposed method for risk assessment.

4.2 Model selection and computation of binary genotype profiles for each genomic locus. The genotypes of four loci on a hypothetical case-control dataset are shown on the left. The five possible binary genotype profiles for each locus are computed, as shown in the middle. Blue squares indicate the presence of the genotype of interest in the respective sample for each model (respectively, homozygous minor allele, heterozygous, homozygous major allele, presence of minor allele, presence of major allele). The resulting binary genotype profiles for each locus are shown on the right. Red squares indicate the existence of the genotype of interest according to the selected model. In this example, models m(4), m(1), m(5), and m(2) are respectively selected for the four loci.

4.3 Identification of NetPocos. Each v_i represents a protein (V) and each c_j represents a genomic locus (U). Blue edges represent the interactions between proteins (E), purple edges indicate that the respective locus is in the RoI of the coding gene for the respective protein, and red edges represent the eQTL links. Initially, P is empty and all loci are considered; the locus (c_5) that maximizes δ(·) is added to P. After this point, the search space is restricted to loci that are at most three hops away from c_5. We continue this procedure until the set of selected loci covers a sufficient fraction of the case samples. Cyan nodes and gold nodes show the selected loci and proteins, respectively.

4.4 Comparison of the risk assessment performance of NetPocos, individual locus based features, and polygenic score on seven different diseases. The x-axis shows the p-value threshold (α) used in filtering based feature selection and the y-axis shows the area under the ROC curve (AUC) for performance in risk assessment.
The curve shows the average AUC score and error bars show the standard deviation of AUC score across 5 folds in 5 different runs.

4.5 The best risk prediction performance achieved by each method and the size of the resulting model for all seven diseases.

4.6 Comparison of the risk assessment performance of NetPocos and network-free PoCos on T2D, BD and CAD.