NOVEL EXTENSIONS OF LABEL PROPAGATION FOR BIOMARKER DISCOVERY IN GENOMIC DATA

by

Matthew E. Stokes

B.S. Systems and Control Engineering, Case Western Reserve University, 2008

M.S. Intelligent Systems Program / Biomedical Informatics, University of Pittsburgh 2011

Submitted to the Graduate Faculty of

the Kenneth P. Dietrich School of Arts and Sciences in partial fulfillment

of the requirements for the degree of

Doctor of Philosophy

University of Pittsburgh

2014

UNIVERSITY OF PITTSBURGH

DIETRICH SCHOOL OF ARTS AND SCIENCES

This dissertation was presented

by

Matthew E. Stokes

on

July 17, 2014

and approved by

M. Michael Barmada, PhD, Department of Human Genetics

Gregory F. Cooper, MD, PhD, Department of Biomedical Informatics and the Intelligent

Systems Program

Milos Hauskrecht, PhD, Department of Computer Science and the Intelligent Systems

Program

Dissertation Advisor: Shyam Visweswaran, MD, PhD, Department of Biomedical

Informatics and the Intelligent Systems Program

NOVEL EXTENSIONS OF LABEL PROPAGATION FOR BIOMARKER DISCOVERY IN GENOMIC DATA

Matthew E. Stokes, M.S.

University of Pittsburgh, 2014

Copyright © by Matthew E. Stokes

2014

NOVEL EXTENSIONS OF LABEL PROPAGATION FOR BIOMARKER DISCOVERY

IN GENOMIC DATA

Matthew E. Stokes, PhD

University of Pittsburgh, 2014

One primary goal of analyzing genomic data is the identification of biomarkers which may be

causative of, correlated with, or otherwise biologically relevant to disease phenotypes. In this

work, I implement and extend a multivariate feature ranking algorithm called label propagation

(LP) for biomarker discovery in genome-wide single-nucleotide polymorphism (SNP) data. This graph-based algorithm utilizes an iterative propagation method to efficiently compute the strength of association between a SNP and a phenotype.

I developed three extensions to the LP algorithm, with the goal of tailoring it to genomic data. The first extension is a modification to the LP score which yields a variable-level score for each SNP, rather than a score for each SNP genotype. The second extension incorporates prior biological knowledge that is encoded as a prior value for each SNP. The third extension enables the combination of rankings produced by LP and another feature ranking algorithm.

The LP algorithm, its extensions, and two control algorithms (chi squared and sparse logistic regression) were applied to 11 genomic datasets, including a synthetic dataset, a semi-

synthetic dataset, and nine genome-wide association study (GWAS) datasets covering eight

diseases. The quality of each feature ranking algorithm was evaluated by using a subset of top-

ranked SNPs to construct a classifier, whose predictive power was evaluated in terms of the area

under the Receiver Operating Characteristic curve. Top-ranked SNPs were also evaluated for

prior evidence of being associated with disease, as reported in the literature.

The LP algorithm was found to be effective at identifying predictive and biologically meaningful SNPs. The single-score extension performed significantly better than the original

algorithm on the GWAS datasets. The prior knowledge extension did not improve on the feature

ranking results, and in some cases it reduced the predictive power of top-ranked variants. The

ranking combination method was effective for some pairs of algorithms, but not for others.

Overall, this work’s main results are the formulation and evaluation of several algorithmic

extensions of LP for use in the analysis of genomic data, as well as the identification of several

disease-associated SNPs.

TABLE OF CONTENTS

NOVEL EXTENSIONS OF LABEL PROPAGATION FOR BIOMARKER DISCOVERY

IN GENOMIC DATA ...... IV

GLOSSARY...... XV

1.0 INTRODUCTION ...... 1

1.1 DIMENSIONALITY REDUCTION ...... 2

1.2 OVERVIEW OF LABEL PROPAGATION ...... 2

1.3 HYPOTHESIS AND SPECIFIC AIMS...... 3

1.4 CONTRIBUTIONS ...... 4

1.5 OVERVIEW OF DISSERTATION ...... 5

2.0 BACKGROUND ...... 6

2.1 GENETIC BASIS OF COMPLEX DISEASES ...... 6

2.2 GENETIC VARIATIONS ...... 8

2.3 GENOME-WIDE ASSOCIATION STUDIES...... 10

2.4 ANALYSIS OF GENETIC VARIATION DATA ...... 11

2.4.1 Linkage Disequilibrium ...... 12

2.4.2 Population Stratification ...... 13

2.4.3 Computational Complexity ...... 13

2.4.4 Avoiding Bias ...... 14

2.5 DIMENSIONALITY REDUCTION ...... 15

2.5.1 Types of Dimensionality Reduction ...... 16

2.6 FEATURE SELECTION METHODS ...... 18

2.6.1 Filter Methods ...... 19

2.6.2 Wrapper Methods...... 21

2.6.3 Embedded Methods ...... 22

2.6.4 Combining Feature Rankings...... 23

3.0 ALGORITHMIC METHODS ...... 26

3.1 ALGORITHMIC DETAILS ...... 26

3.2 DETAILS OF LP ALGORITHM FOR SNP DATA ...... 28

3.3 ADDITIONAL LP ALGORITHMS ...... 31

3.4 ALGORITHMIC EXTENSIONS ...... 36

3.4.1 LP1 score ...... 36

3.4.2 Incorporation of prior knowledge ...... 39

3.4.2.1 Sources of Knowledge ...... 39

3.4.2.2 Use of Prior Knowledge in LP ...... 42

3.4.3 Combining feature rankings ...... 44

4.0 EXPERIMENTAL METHODS ...... 46

4.1 DATASETS ...... 46

4.1.1 Synthetic dataset ...... 47

4.1.2 GAW17 semi-synthetic dataset...... 48

4.1.3 Late-onset Alzheimer’s disease datasets ...... 49

4.1.4 WTCCC GWAS datasets ...... 51

4.1.5 Quality control for GWAS datasets ...... 51

4.1.6 Prior knowledge sources ...... 52

4.2 EVALUATION ...... 54

4.2.1 Precision-recall curves ...... 54

4.2.2 Evaluation of predictive performance ...... 55

4.2.3 Reproducibility of biomarkers ...... 56

4.2.4 Evidence of biological validity ...... 57

4.3 COMPARISON ALGORITHMS...... 58

4.3.1 Chi squared test ...... 58

4.3.2 Sigmoid Weighted ReliefF ...... 59

4.3.3 Sparse Logistic Regression ...... 60

5.0 EXPERIMENTAL RESULTS ...... 62

5.1 EVALUATION OF LP3 AND LP1 ...... 62

5.1.1 Synthetic Data ...... 63

5.1.2 GAW17 Data ...... 64

5.1.3 GWAS Datasets...... 66

5.1.3.1 Predictive Performance ...... 66

5.1.3.2 Biological Validity ...... 69

5.1.3.3 Computational Efficiency ...... 75

5.1.3.4 Feature Reproducibility on LOAD data ...... 76

5.2 EVALUATION OF LP1 WITH PRIOR KNOWLEDGE ...... 77

5.3 EVALUATION OF LP1 WITH RANKING COMBINATION ...... 83

6.0 CONCLUSIONS AND FUTURE WORK ...... 87

6.1 CONTRIBUTIONS AND FINDINGS ...... 87

6.1.1 LP1 Extension ...... 88

6.1.2 Prior Knowledge ...... 89

6.1.3 Combination Method ...... 91

6.2 FUTURE WORK ...... 92

APPENDIX A ...... 94

APPENDIX B ...... 99

BIBLIOGRAPHY ...... 105

LIST OF TABLES

Table 1 - Feature selection methods that have been applied to GWAS data...... 24

Table 2 - Null distribution of prior counts...... 43

Table 3 - Null prior count table (left, see Table 2) and prior count table with a prior belief of strength m (right)...... 44

Table 4 - Summary of datasets...... 47

Table 5 - Prediction AUCs on GAW17 data...... 65

Table 6 - Prediction AUCs for nine GWAS datasets...... 66

Table 7 - Wilcoxon signed-rank test p-value comparing AUCs of algorithms over 9 GWAS datasets for feature selection thresholds of 5 and 10...... 68

Table 8 - Wilcoxon signed-rank test p-value comparing AUCs of algorithms over 9 GWAS datasets for feature selection thresholds of 50 and 100...... 68

Table 9 - Prediction AUCs from selecting features on one LOAD dataset and predicting on the other...... 69

Table 10 - Summary of literature validation results...... 70

Table 11 - Top 25 SNPs for TGen using LP1 (α = 0.25)...... 71

Table 12 - Top 25 SNPs for ADRC using LP1 (α = 0.25)...... 71

Table 13 - Top 25 SNPs for BD using LP1 (α = 0.25)...... 72

Table 14 - Top 25 SNPs for CAD using LP1 (α = 0.25)...... 72

Table 15 - Top 25 SNPs for CD using LP1 (α = 0.25)...... 73

Table 16 - Top 25 SNPs for HT using LP1 (α = 0.25)...... 73

Table 17 - Top 25 SNPs for RA using LP1 (α = 0.25) ...... 74

Table 18 - Top 25 SNPs for T1D using LP1 (α = 0.25) ...... 74

Table 19 - Top 25 SNPs for T2D using LP1 (α = 0.25) ...... 75

Table 20 - Prior knowledge AUCs for nine GWAS datasets...... 81

Table 21 - Wilcoxon signed-rank test p-value comparing AUCs of algorithms over 9 GWAS datasets for feature selection thresholds of 5 and 10...... 82

Table 22 - Wilcoxon signed-rank test p-value comparing AUCs of algorithms over 9 GWAS datasets for feature selection thresholds of 50 and 100...... 83

Table 23 - Prediction AUCs for combination methods on GWAS data...... 84

Table 24 - Wilcoxon signed-rank test p-value comparing AUCs of algorithms over 9 GWAS datasets for feature selection thresholds of 5 and 10...... 85

Table 25 - Wilcoxon signed-rank test p-value comparing AUCs of algorithms over 9 GWAS datasets for feature selection thresholds of 50 and 100...... 86

Table 26 - Classification AUC results for TGen data...... 94

Table 27 - Classification AUC results for ADRC data...... 95

Table 28 - Classification AUC results for BD WTCCC data...... 95

Table 29 - Classification AUC results for CAD WTCCC data...... 96

Table 30 - Classification AUC results for CD WTCCC data...... 96

Table 31 - Classification AUC results for HT WTCCC data...... 97

Table 32 - Classification AUC results for RA WTCCC data...... 97

Table 33 - Classification AUC results for T1D WTCCC data...... 98

Table 34 - Classification AUC results for T2D WTCCC data...... 98

Table 35 - Top 25 SNPs as ranked by LP3 (α = 0.25) for TGen data...... 100

Table 36 - Top 25 SNPs as ranked by LP3 (α = 0.25) for ADRC data...... 100

Table 37 - Top 25 SNPs as ranked by SLR for TGen data...... 101

Table 38 - Top 25 SNPs as ranked by SLR for ADRC data...... 101

Table 39 - Top 25 SNPs as ranked by SWRF for TGen data...... 102

Table 40 - Top 25 SNPs as ranked by SWRF for ADRC data...... 102

Table 41 - Top 25 SNPs as ranked by chi squared for TGen data...... 103

Table 42 - Top 25 SNPs as ranked by chi squared for ADRC data...... 103

Table 43 - Top 25 SNPs as ranked by chi squared for CD data...... 104

Table 44 - Top 25 SNPs as ranked by chi squared for HT data...... 104

LIST OF FIGURES

Figure 1 - Two experimental designs incorporating feature selection and cross-fold validation. 15

Figure 2 - A small bipartite graph for a hypothetical dataset with five samples and two SNPs. .. 29

Figure 3 - LP1 pseudocode ...... 38

Figure 4 - Distribution of MAFs in the GAW17 data shows many rare variants...... 49

Figure 5 - Principal Component plots for select datasets...... 52

Figure 6 - Plot of GERP score versus MAF for each SNP in the TGen dataset, with distribution histograms...... 53

Figure 7 - Precision-recall curves and ROC plots for recovering 35 causal SNVs in synthetic data...... 63

Figure 8 - Precision-recall and ROC results for GAW17 data...... 65

Figure 9 - Reproducibility curves of top-ranked features with 95% confidence envelope...... 77

Figure 10 - Precision-recall and ROC curves for prior knowledge experiments on synthetic data...... 79

Figure 11 - Precision-recall and ROC curves for prior knowledge experiments on semi-synthetic data using gold standard prior knowledge...... 80

Figure 12 - Precision-recall and ROC curves for prior knowledge experiments on semi-synthetic data using MAF prior...... 81

Figure 13 - Combination method results on synthetic data...... 84

GLOSSARY

Attribute – also called a dimension, feature, or variable, it represents a measured quality of the data

Biomarker – any biochemical signal in the body which can be quantified

Dimensionality reduction – any method which reduces the number of variables in a dataset

Feature selection – any dimensionality reduction method which keeps original dimensions intact

Feature ranking – any feature selection method which orders variables from most to least important

Feature subset selection – any feature selection method which outputs a set of relevant features, while discarding the others

GERP – genomic evolutionary rate profiling, a measure of SNP conservation

kNN – k-nearest neighbor classifier

LP – label propagation

LP1 – single-score LP algorithm extension

LP3 – original allele-scoring LP algorithm

MAF – minor allele frequency

Sample – a single data point composed of many attributes

SLR – sparse logistic regression

SNV – single nucleotide variant

SNP – single nucleotide polymorphism; a SNP is a common SNV (>5% MAF)

SWRF – sigmoid weighted ReliefF

1.0 INTRODUCTION

In today’s genomic era, DNA sequencing technology has become so inexpensive that it is readily

available to everyday healthcare consumers. Genomic data has the potential to impact the way

that healthcare providers prevent, diagnose, treat, and monitor disease. In the future, personalized

medicine will enable physicians to generate a precise individualized assessment of risk of developing disease, and to pursue individualized therapy based on variations that are present in the individual's genome. In order to achieve this goal of precision medicine, it is necessary to find useful and predictive biomarkers in high-dimensional genomic data. These data require sophisticated algorithms to analyze, because there are millions of variants measured for only a few thousand individuals. Given the paucity of samples compared to the number of genomic loci that are measured, the challenge is to find the relatively few variants that are associated with disease. In this dissertation I investigate the application of a graphical algorithm called label propagation (LP) for feature selection in genomic variant data. I also develop several novel extensions to the algorithm, tailoring it specifically to genomic variant data. In particular, I apply and evaluate the algorithm and its extensions on genome-wide single nucleotide polymorphism (SNP) data.

1.1 DIMENSIONALITY REDUCTION

Dimensionality reduction (DR) techniques comprise a class of algorithms which are particularly

applicable to the biomedical domain. Many forms of biomedical data, including genomic

sequences, expression data, mass spectrometry data, and electronic health records, by their

nature consist of a large number of variables. DR techniques either combine features to construct

a smaller number of new features (feature construction), or select a subset of features that capture

the important patterns in the data (feature selection). In the context of genomic analysis, DR

techniques are often used to identify variants that are predictive of the disease or phenotype of

interest.

Many DR techniques have been applied to genomic data with the goal of biomarker discovery, that is, identifying disease-associated (and potentially causal) variants that can illuminate the genetic underpinnings of disease. Moreover, the selected variants can be used in developing predictive models, which utilize discriminative variants to predict a disease or phenotype from genomic data. No single DR technique is best across all domains, so it is imperative to find algorithms that are well-suited to the genomic domain.

1.2 OVERVIEW OF LABEL PROPAGATION

Label propagation (LP) is a multivariate, semi-supervised, graphical algorithm that has been applied for classification and ranking of features in a variety of domains. In the context of biomarker discovery, label propagation represents genomic data as a bipartite network where

genomic variants are represented by one set of nodes, and individuals with their disease status

(case or control) are denoted by a second set of nodes. Links are allowed only between a variant

node and a sample node, indicating which individuals exhibit which variants. LP labels the

sample nodes with case or control status and propagates this information according to network

topology. Ultimately, features are scored according to their association with the case or control group. This scoring is a multivariate optimization of a particular cost criterion which attempts to balance the strength of the initial labeling with the strength of network diffusion. In addition to evaluating LP on genomic data for ranking variants associated with case or control status, I have further extended it for application to genomic data.

1.3 HYPOTHESIS AND SPECIFIC AIMS

I hypothesize that a semi-supervised LP method can be improved for feature ranking (and selection) in genomic data, and that its application to high-dimensional SNP data will yield better results than currently used feature selection methods in terms of both predictive performance and biological function. To test this hypothesis, I propose the following specific aims:

Specific Aim 1. Extend the LP algorithm for SNP data to i) produce a probabilistic, single-SNP

score rather than the current SNP-state score, ii) incorporate knowledge about the genome as

priors, and iii) combine the single score LP method’s output with another feature ranking

method’s output.

Specific Aim 2. Evaluate the LP extensions in Aim 1 on synthetic, semi-synthetic, and real

GWAS datasets and compare their performance to that of chi squared (a univariate feature selection method), Relief, and Sparse Logistic Regression (multivariate feature selection methods).

The two main aims are to develop several extensions to the LP algorithm, and to evaluate these extensions on a variety of data.

1.4 CONTRIBUTIONS

The goal of this work was to develop an effective computational method for feature selection in high-dimensional genomic data. From a machine learning standpoint, the algorithmic extensions to LP provide improvements to an already effective, efficient, multivariate feature ranking method. The LP extensions that I have developed are applicable to many other types of high-dimensional data, and the combination method enables existing effective algorithms to be combined.

From a biomedical standpoint, the methods described in this work can be applied to genomic data to discover new variants that are associated with disease. Moreover, since LP is multivariate it can be applied to discover not only variants with main effects but also interacting genomic variants. In this dissertation, LP and its extensions were characterized and extensively evaluated using synthetic, semi-synthetic, and GWAS data.

1.5 OVERVIEW OF DISSERTATION

Chapter 2 provides relevant background to set the context for biomarker discovery in genomic data and surveys the related work in machine learning and genomic analysis. Chapter 3 describes the LP algorithm in detail, including previous LP variants and their applications in the literature.

The novel extensions to LP are also described in this chapter. Chapter 4 describes the experimental method applied, including the performance metrics used to evaluate each algorithm, and a description of each dataset analyzed. Chapter 5 provides the results of the feature ranking experiments, and Chapter 6 summarizes the discoveries and conclusions drawn from this work. Full tables of results from all experiments are given in Appendix A and

Appendix B.

2.0 BACKGROUND

This chapter describes relevant background in the biomedical and machine learning domains. I

present the overall problem of biomarker feature selection in high-dimensional genomic data,

and review some of the methods that have been applied successfully as described in the

literature. Section 2.1 describes how genetic variation can impact disease risk, and Section 2.2

describes several types of variation commonly analyzed. Section 2.3 explains methods of

discovering disease-associated SNPs. Section 2.4 covers some of the challenges encountered

when performing large-scale genomic analysis. Section 2.5 gives an overview of dimensionality

reduction methods, while Section 2.6 describes some particular DR methods in greater detail.

2.1 GENETIC BASIS OF COMPLEX DISEASES

Genetic inheritance has been understood for centuries as traits that are passed down from

generation to generation. Genetics influences readily apparent traits such as hair, skin, or eye

color, height, and several other physical attributes. Other genetic traits include susceptibility to

many diseases, some of which are well-understood, and others whose genetic components are only just being unraveled.

Mendelian diseases are the simplest class of genetic diseases, directly caused by a variation at a single locus or gene. Presence of the genetic variant indicates presence of the disease with high probability. Examples of Mendelian diseases are cystic fibrosis, which results

from mutation in the CFTR gene, and Tay-Sachs disease, which is caused by mutations in the

HEXA gene. The genetics of many Mendelian diseases are relatively well understood; however,

Mendelian diseases tend to be rare, recessive disorders that affect only a small portion of the

population [1].

Common diseases such as late-onset Alzheimer’s disease or coronary artery disease stand

in contrast to Mendelian diseases in that their prevalence is much higher in the population, and

that their genetic component is more complex and less well-characterized. The genetic basis of

common diseases does not lie in a single locus or gene, but is instead hypothesized to be the

result of many variants working together. Dozens of genetic variants have been shown to be

associated with many common diseases, including diabetes, bipolar disorder, and hypertension.

A number of genetic models have been proposed to explain the genetic basis of common

diseases. One prevalent hypothesis is the “common disease – common variant” hypothesis,

which states that many common variants in combination determine risk of common diseases.

This is supported by the observation that in several genome-wide association studies, results hold

true across distinct populations with varying allele frequencies. If rare variants were responsible,

varying allele frequencies would result in widely varying associations between different

populations [2, 3].

In contrast to this hypothesis is the “common disease – rare variant” hypothesis, which

states that rare variants, rather than common ones, underlie common diseases. In support of this hypothesis, allele rarity has been shown to be proportional to the likelihood of disease association [4]. Intuitively, this is a result of selective pressure acting against deleterious

alleles, and it has been shown that particularly deleterious alleles are indeed rare in the population.

It is likely that both common and rare variants are involved in common diseases, and that a combination of the two may be responsible for most common diseases.

Thus, feature selection methods that are developed for genomic data should be able to identify both rare and common variants.

2.2 GENETIC VARIATIONS

The human genomic sequence consists of approximately 3 billion nucleotide pairs. Because the vast majority of the genomic sequence is identical across all humans, it is sufficient to focus analysis only on the variant regions of the sequence when analyzing genomic data in the context of disease.

The most common sequence variation is the single nucleotide variant (SNV), which is a variation that occurs when a single nucleotide (A, T, C, or G) at a specific location in the genome

(called a locus) differs between individuals in a population. On a particular chromosome, a SNV can be one of two nucleotide pairs (A-T or C-G). These states are called alleles, the more common of which is called the major allele "A" (not to be confused with the specific nucleotide

A) and the less frequent one is called the minor allele "a". Because humans have two versions of each chromosome (one from the mother and one from the father), the genotype is defined as the combination of the SNV alleles on both chromosomes. As such, there are three possible genotype states for each SNV – the major homozygous (AA), the heterozygous (Aa), and the minor homozygous (aa).

SNVs can be broadly grouped into common SNVs and rare SNVs. When the minor allele frequency (MAF) is 5% or higher in the population, a SNV is considered to be a common SNV

[5], also called a single nucleotide polymorphism (SNP). Among the approximately 3 billion nucleotides in the , tens of millions are common SNPs. In contrast, when the

MAF is less than 5%, the SNV is considered to be a rare SNV. These uncommon SNVs account for very rare population-wide alleles, as well as loci of unexpected variation such as an individual’s unique random mutation in a particular locus.

There are a variety of technologies for measuring SNPs. Several companies such as

Illumina and Affymetrix offer low-cost SNP arrays which can genotype an individual at millions of SNPs at once. This is only a small subset of all SNPs in the genome, so SNP arrays often oversample non-synonymous SNPs (which alter the amino acid chain produced) and exonic

SNPs (which are within protein-coding regions) in order to find biologically interesting variants.

SNP arrays target loci of common variation because the genome is queried at particular sites, rather than being read as a string of nucleotides – to be cost-effective, coverage is not spent on loci where variation is unexpected or extremely rare.

Exome sequencing is a more recent technology used for measuring the genome. This method reads all nucleotide pairs in the exome, which is the protein coding portion of the genome, consisting of only about 1.5% of the full genome. This method can discover very rare, even unique mutations in the DNA sequence, because it does not just query loci of expected variation. Furthermore, variation within exons may be easier to understand than variation within intronic regions, as there are methods of analyzing the downstream effects in terms of amino acid and protein alterations.

Copy number variation is yet another form of genomic variation that can be measured, and involves

looking at repeated chunks of DNA throughout the genome. Large portions of DNA are

duplicated multiple times, but the exact number of copies can vary from person to person. Gene

functionality can be lost if there are fewer DNA copies than normal, and conversely, more copies

than normal can lead to hyper-activity of a gene.

2.3 GENOME-WIDE ASSOCIATION STUDIES

In a genome-wide association study (GWAS), high-throughput genotyping technologies are used to assay hundreds of thousands or even millions of SNPs across the genome in a cohort of cases

and controls. Since the advent of the GWAS, many common diseases, including late-onset

Alzheimer's disease, diabetes mellitus, and heart disease, have been studied with the goal of identifying the underlying common genetic variations. GWASs are based on the common disease-common variant hypothesis, which posits that many common variants underlie common diseases; each variant increases the risk of disease modestly, and an individual manifests disease when he or she has a sufficient number of common variants that cumulatively increase the risk of disease above a threshold level [6].

While GWASs have uncovered several thousand SNPs associated with a range of common diseases, these SNPs explain only a small proportion of the genetic variability. A possible reason for the moderate success of GWASs is the common disease-rare variant hypothesis, which posits that many rare variants underlie common diseases and each variant causes disease in relatively few individuals with high penetrance [7]. However, larger sample

sizes and new analytical methods designed specifically for rare variants will likely make GWASs useful for detecting rare variants as well [4].

GWAS data has a number of properties that make it particularly attractive to biomedical researchers. First, it is relatively cheap to obtain SNP data. The price of sequencing DNA has fallen dramatically over the past 10 years, to the point where a whole genome can be sequenced for just a few thousand dollars [8]. Eventually, SNP data will be replaced with full sequence data. Second, it is mostly invariant in time and space (ignoring events like tumor mutations).

Other types of data, such as gene expression, may change over the course of hours or even minutes, and can be different depending on the type and location of tissue sampled. A person’s

DNA sequence is fixed from the moment of conception, and is carried by every cell in their body. Finally, the data comes in standardized formats and is available to researchers for secondary analyses from large repositories such as dbGaP [9]. Moreover, knowledge about functional aspects of SNPs continues to grow rapidly, and is readily available to researchers from knowledge bases such as dbSNP [10]. Hence, genomic variation data is likely to be useful for many clinical applications, ranging from diagnostic aids to guiding treatment decisions.

2.4 ANALYSIS OF GENETIC VARIATION DATA

The analysis of genomic data often occurs in the service of two goals – (1) biomarker discovery, and (2) development of predictive models for assessing risk or predicting the development of disease. Given thousands of variables representing the genomic or other biomolecular state of a person, we would like to find relatively few variables that can explain the data. Generally, biomarkers are sought in the context of a disease or a phenotype, indicating correlation or

causation of the disease or phenotype of interest. Biomarkers have a wide array of applications in

the clinical domain, aiding in risk assessment, disease diagnosis, patient prognosis, and treatment

decisions. The task of GWAS SNP analysis raises some practical concerns that are specific to the

genomic domain. I discuss some of them here, ranging from genomic population structure to

computational issues.

2.4.1 Linkage Disequilibrium

The phenomenon of linkage disequilibrium (LD) appears in localized regions of DNA, and is a

measure of statistical correlation between genetic loci. Recombination is the process by which

genetic material is exchanged between homologous chromosomes, resulting in a new

combination of alleles. SNPs which are physically near one another are more likely to be

inherited together, resulting in two loci which carry highly correlated information (strong LD).

Just as redundant or correlated variables can complicate traditional machine learning tasks, LD

can confound GWAS analysis. For example, a SNP showing significant association with disease

may have no causal effect on the phenotype, the culprit instead being a nearby SNP in LD that was never measured. It is also possible that multiple SNPs show a strong signal, when in actuality all belong to a block of SNPs in strong LD. One method of dealing with the problem of

LD is by identifying tag SNPs, each of which represents a block of SNPs in strong LD. By analyzing a dataset consisting only of tag SNPs, redundant information is removed.
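For intuition, LD between a pair of SNPs is commonly summarized by the squared correlation r². A minimal sketch of this measure (assuming genotypes coded as minor allele counts 0/1/2; haplotype-based estimators are more precise, but this captures the redundancy at issue):

import numpy as np

def ld_r2(snp_a, snp_b):
    # Approximate LD as the squared Pearson correlation of the two
    # SNPs' minor-allele counts across samples.
    r = np.corrcoef(snp_a, snp_b)[0, 1]
    return r * r

Tag SNP selection then amounts to keeping one representative from each group of SNPs whose pairwise r² exceeds a chosen threshold.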

2.4.2 Population Stratification

Another potential complication in the analysis of GWAS data is population stratification, which

can cause spurious associations to appear in a dataset. A stratified dataset exhibits systematic

differences in allele frequencies among the population, caused by factors other than association

with disease (such as ancestry). Population stratification can be controlled for by filtering out the

offending allele differences. By performing a cluster analysis (such as PCA), stratified groups

can be identified, and their particular characteristics can be zeroed out in order to work with an

unstratified population [11].
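As a concrete sketch of such a check, the genotype matrix can be projected onto its top principal components; well-separated clusters in that projection suggest stratification. This illustrative example uses scikit-learn with stand-in data (names and sizes are assumptions, not drawn from the studies analyzed here):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(200, 1000)).astype(float)  # stand-in 0/1/2 genotypes

# Center and scale each SNP so all loci contribute comparably.
geno -= geno.mean(axis=0)
geno /= geno.std(axis=0) + 1e-12

# Samples that form clusters on the top components likely reflect
# ancestry groups rather than disease status.
pcs = PCA(n_components=2).fit_transform(geno)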

2.4.3 Computational Complexity

GWAS datasets are usually high-dimensional and can contain billions of measured genotype values

that can consume several gigabytes of memory. As such, the computational complexity of

analysis methods, both in terms of time and memory required, is a significant issue. Algorithms

which are of quadratic or combinatorial complexity in terms of the number of features are almost

always intractable for analyzing a GWAS on a single machine. The linear complexity of most

univariate methods is one reason why they have found widespread use in the genetics literature.

One way of addressing the issue of complexity is to utilize multiple processors in a

parallelized fashion. Some algorithms can be readily programmed so that they can take

advantage of multiple processors to reduce the runtime. However, other algorithms cannot be

easily parallelized. Another possibility for certain algorithms is to shift to sample-space analysis

rather than feature-space analysis. Certain matrix-based algorithms utilize pairwise transitions between N samples consisting of d features. With some matrix rearrangement, one can generate

an equivalent expression that uses an N × N transition matrix, rather than an intractably large d × d matrix. For a dataset with N = 1,000 and d = 500,000, a sample-space representation of the transition matrix requires only about 7.6 megabytes of RAM, while a feature-space representation would require over 1.8 terabytes! With problems of this size, the limitations of even a high-powered computing platform can preclude the use of certain methods.
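To make the rearrangement concrete, the sketch below (illustrative numpy, not any particular GWAS method) contrasts the two storage costs and forms the tractable sample-space matrix; the key fact is that X^T X and X X^T share their nonzero eigenvalues:

import numpy as np

N, d = 1000, 500_000
print(N * N * 8 / 2**20, "MiB for an N x N sample-space matrix")   # about 7.6 MiB
print(d * d * 8 / 2**40, "TiB for a d x d feature-space matrix")   # about 1.8 TiB

X = np.random.randn(N, 200)   # small d here so the demonstration actually runs
K = X @ X.T                   # N x N Gram matrix
# K shares its nonzero eigenvalues with the d x d matrix X.T @ X, so spectral
# computations can often be carried out in sample space at a fraction of the cost.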

2.4.4 Avoiding Bias

Feature selection in the genomic domain can either be an end in itself or it can be a first step in an analysis pipeline, like one that builds a predictive model to be tested on a validation set. It is important to consider how selected variants will be used when performing the selection process, so as not to introduce unwanted bias. When building a predictive model to describe data, the model is usually learned on a training set, and then applied to a held-out test set to evaluate the performance. If feature selection is performed as a first step before building the model, it must be performed on only the training dataset. Selecting features on the full dataset will produce downstream results that are based in part on the testing data, and will often yield an overly optimistic estimation of the model’s generalizability. Full-dataset selection is appropriate, however, if the selected variants are not being applied directly to the original samples, such as the case of mining biological knowledge without building a predictive model.


Figure 1 - Two experimental designs incorporating feature selection and cross-fold validation.

The highlighted cells in red are the ones being used in each step’s computation, and the boxed sections

represent processes that are repeated for each of the cross-folds. Bias is introduced in A, because features are

selected in part on the testing data. Design B avoids bias, and is the preferred design.
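A minimal sketch of the unbiased design (B), with a placeholder univariate scorer standing in for the feature ranking step (all names here are illustrative); the essential point is that the ranking sees only the training fold:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

def rank_features(X, y):
    # Placeholder scorer: absolute correlation of each feature with the label.
    scores = np.abs(np.corrcoef(X.T, y)[:-1, -1])
    return np.argsort(-scores)

rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 500)), rng.integers(0, 2, 100)

aucs = []
for train, test in StratifiedKFold(n_splits=5).split(X, y):
    top = rank_features(X[train], y[train])[:10]  # selection uses training data only
    clf = KNeighborsClassifier().fit(X[train][:, top], y[train])
    aucs.append(roc_auc_score(y[test], clf.predict_proba(X[test][:, top])[:, 1]))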

2.5 DIMENSIONALITY REDUCTION

Because of the high dimensionality of GWAS data, dimensionality reduction methods are of importance in the analyses of such data. Though there are many types of dimensionality reduction methods, they all have the common goal of reducing the number of variables in a dataset. Usually, the goal is to find a parsimonious representation of a high-dimensional dataset using fewer features. This can be achieved in one of two ways – by performing feature ranking and feature subset selection, keeping relevant variables intact, or by mapping the data to lower-dimensional space using combinations of variables as the new dimensions (feature construction).

Dimensionality reduction is important as a preprocessing step in the analysis of high-

dimensional data, because many classification and prediction algorithms often perform poorly

with variables that are numerous, noisy, meaningless, or redundant. By first reducing a dataset to

a smaller number of predictor variables, overfitting can be reduced and statistical performance

can be improved. Dimensionality reduction also has more specific domain applications, in that

the features remaining after reduction are good candidates to explore for knowledge about the

domain.

2.5.1 Types of Dimensionality Reduction

Dimensionality reduction, feature ranking, and feature subset selection are related methods that

seek to remove irrelevant dimensions from a dataset [12]. Dimensionality reduction is the most

general of the three concepts, and encompasses methods that can remove, combine, scale, or

otherwise transform variables in order to produce a low-dimensional representation of the original data. A dimension in the new representation may not correspond directly to any dimension in the original representation, instead being combinations of multiple variables. Such

dimensionality reduction methods may lead to new variables that are not readily interpretable

in the way that the variables in the original space are, and interpretability can be important in the biomedical domain.

Typically a new variable that is a function of several biomarkers has little meaning in terms of

biological knowledge; we would instead prefer to identify individual biomarkers of interest.

Feature selection methods, in contrast, preserve the original dimensions of the dataset.

This can be achieved by individually ranking the dimensions, or by directly identifying a subset

of the dimensions. In ranking methods, a score is assigned to each dimension, indicating a level of importance. These scores can be directly compared between features, yielding a ranked list of

features. This representation is useful for finding features which are individually important.

While feature ranking methods put features on a scale from most to least important, they do not

establish a significance threshold. The ranking allows for sequential consideration of potentially

important biomarkers in a logical order, but does not establish when features become irrelevant.

Feature subset selection returns a subset of variables that are important in some sense from the original set of variables. The subset can be of varying size, but is generally much smaller than the original set. Each selected feature corresponds directly to a dimension in the original data, making interpretation simple. However, feature selection methods often consider the selected subset as a whole, meaning that individual features in the subset may not be important on their own.

Among the methods of dimensionality reduction, feature ranking may be the most appropriate in the biomedical domain. In the context of a biomedical dataset, a feature ranking method will return a list of biomarkers sorted by their importance. These biomarkers are unmodified dimensions which correspond directly to observed variables in the dataset, meaning

they may be easily interpreted. Furthermore, the top-ranked biomarkers are each important on

their own, making them good candidates for being causal biomarkers, drug targets, or

another kind of individually meaningful variant.

Feature subset selection methods may also be suitable for GWAS analysis, in that the

selected features correspond directly to measured biomarkers. However, these biomarkers may

not be important on their own, making univariate analysis for drug targets or therapy difficult.

Dimensionality reduction methods which construct new features as some combination of

the measured variables are generally unsuitable for biomarker discovery in genomic analysis.

While these algorithms may perform well from a statistical or machine learning standpoint, the

lack of feature interpretability is a problem. It must be kept in mind that one of the primary goals of GWAS analysis is to find variants of interest, which can then be studied in terms of biological function, practical application, and clinical relevance. A variable which is a combination of many biomarkers may not give a clear indication of how those biomarkers function individually. A constructed variable could also be cumbersome for the task of prediction, because each individual biomarker must still be measured in practice.

2.6 FEATURE SELECTION METHODS

Feature selection methods perform either feature ranking or feature subset selection. In feature ranking, a weight or importance value is assigned to each feature and the method returns a list of features ranked according to the weight. In feature subset selection the method attempts to identify and return an optimal subset of features. A feature ranking method can be converted into a feature subset selection method by choosing a threshold and returning the features that rank above the threshold. Typically, it is not possible to convert a feature subset selection method

into a feature ranking method.
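For instance, once per-feature scores exist, the conversion from ranking to subset is just sorting and thresholding (a toy sketch; scores stands in for any per-feature relevance vector):

import numpy as np

scores = np.array([0.10, 0.92, 0.41, 0.73])  # per-feature relevance weights
ranking = np.argsort(-scores)                # feature indices, best first
subset = ranking[:2]                         # keep the top k = 2 features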

There are many types of algorithmic methods for feature selection. Supervised methods

leverage information about the target variable to find variables that are useful for prediction.

They are usually inductive algorithms which learn a model independent of the unlabeled test data.

Unsupervised methods do not use information about the target, and instead try to find hidden

cluster structure in the data. These methods try to find a parsimonious representation of the data,

regardless of target values. In between these two paradigms are semi-supervised algorithms.

These methods use limited target information and use the structure of the unlabeled data to find

relevant variables. In contrast to supervised inductive algorithms, semi-supervised algorithms are transductive because models are learned using samples that are not in the labeled training set

[13].

A range of feature subset selection and feature ranking methods has been developed, and a recent review of the methods is provided in [14]. There are three major families of feature selection methods, namely, filter methods, wrapper methods, and embedded methods. Filter methods

evaluate features directly, independent of how the features will be used subsequently. For

example, if features are to be used to develop a classification model, a filter method will select

features based on some data-dependent criterion and then pass them to an independent classifier.

Wrapper methods evaluate features in the context of how they will be used. In terms of classification, features are evaluated directly in terms of their ability to improve the performance of the classification model. Embedded methods perform feature selection during the classifier

construction, deriving feature subsets as a direct result of the classification itself. In addition to

these three main types of methods, there are also combination methods that aggregate several

feature rankings into a single ranking.

2.6.1 Filter Methods

Filter methods assess the relevance of features by considering only the intrinsic properties of the

data. Univariate filter methods compute the relevance of each feature independently of other

features. They are computationally fast and scale to high-dimensional data because the

complexity is linear in the number of features and interactions between features are ignored.

Typically, such methods compute a statistic or a score for each feature such as chi squared or

information gain. Multivariate filter methods model correlations and dependencies among the

features; they are computationally somewhat slower and may be less scalable to high-dimensional data. Examples of multivariate methods include ReliefF and Markov blanket feature

selection. Filter methods are particularly appropriate for GWAS analysis because they are

agnostic to the final task at hand, and the method used to accomplish it. Filter methods can be

directly compared by passing their results through the same downstream classifier.

The chi squared statistic, which is commonly used in SNP analysis, is a univariate filter method

[15]. This test measures whether outcome distributions are significantly different among SNP

states, indicating features that have an impact on disease. The chi squared statistic is very fast to

compute and has a simple statistical interpretation. However, it cannot detect higher-order effects

such as SNPs that interact to produce an effect on disease.
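As an illustration, the per-SNP statistic can be computed from a 2 x 3 contingency table of disease status by genotype; a minimal sketch using scipy, assuming the usual 0/1/2 genotype coding:

import numpy as np
from scipy.stats import chi2_contingency

def chi_squared_snp(snp, labels):
    # Tally a 2 x 3 table: rows are control/case, columns are genotype states.
    table = np.zeros((2, 3))
    for g, c in zip(snp, labels):
        table[c, g] += 1
    table = table[:, table.sum(axis=0) > 0]  # drop unobserved genotype columns
    stat, p, dof, expected = chi2_contingency(table)
    return stat, p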

Bayesian methods comprise an entire class of GWAS analysis techniques, representing

data in terms of a well-defined probability distribution. By utilizing prior information and

observational data, model probabilities may be estimated. In the simplest Naïve Bayes case, all

variables are assumed independent, and probabilities are assigned according to a maximum a

posteriori (MAP) estimate. Alternatively, instead of picking a single most likely model, models

may be averaged according to their likelihood. Other implementations, such as the BD, BDe,

BDeu, and K2 methods, score Bayesian models according to the data likelihood under a

particular parameterization, with each method accomplishing smoothing via prior pseudocounts

slightly differently. Still other score-based methods include minimum description length (MDL),

minimum message length (MML), Bayesian information criterion (BIC), or Akaike information

criterion (AIC), each of which seeks to balance model likelihood with the number of parameters

required to specify it. For these methods, feature selection can be performed by looking for

models which show a higher likelihood under an assumption of association, rather than

independence. Variables which are present in high-likelihood models and absent in low-likelihood models have the most explanatory power [16].

The Relief algorithm [17] is a multivariate filter method that has been applied to SNP data to rank SNPs. This method computes the relevance of a SNP by examining patterns in a local neighborhood of training samples. The algorithm examines whether, among reasonably similar samples, a change in SNP state is accompanied by a change in the disease state. The

Relief algorithm can detect multivariate interaction effects by means of the neighborhood locality measure, but does so at the cost of increased computation time. The Relief algorithm has been adapted in several ways for application to SNP data. The most recently described adaptations of Relief include Spatially Uniform ReliefF (SURF) [18, 19] and Sigmoid Weighted

ReliefF (SWRF) [20] that were developed specifically for application to high-dimensional SNP data.
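A minimal sketch of the core Relief weight update for binary labels (illustrative only; ReliefF, SURF, and SWRF refine the neighbor definition and weighting beyond this simplified version):

import numpy as np

def relief_weights(X, y, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(N)
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf  # exclude the sample itself
        hit = np.argmin(np.where(y == y[i], dist, np.inf))
        miss = np.argmin(np.where(y != y[i], dist, np.inf))
        # Features that differ on the nearest miss gain weight; features
        # that differ on the nearest hit lose weight.
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_iter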

2.6.2 Wrapper Methods

Wrapper methods contain a feature selection algorithm as well as a way to apply those features to a task (e.g., classification). Features are evaluated based on their contribution to the performance on the final task, and selection is performed for features which improve performance. In this way, the selection process is a “wrapper” around the classification model, iteratively searching the feature space for good classification results. As such, the selection and classification algorithms are closely tied, and cannot be separated. Comparison of different feature selection methods is not completely straightforward with wrapper methods, because the classifier is integrated into the selection process [14].

A genetic algorithm (GA) search is one such example of a wrapper method. In this method, subsets of SNPs are randomly selected and used for sample classification. The worst-performing subsets are removed, and the best-performing ones are randomly combined and mutated. Because the algorithm directly evaluates SNP subsets on their ability to maximize the final performance metric (classification accuracy), it is considered a wrapper method. GAs have been applied to SNP data to find multilocus effects in GWAS data [21].

2.6.3 Embedded Methods

Embedded methods, like wrapper methods, have an interaction between the feature selection and classification processes. While wrapper methods select features and evaluate these subsets in terms of classification performance, embedded methods derive feature subsets from the classification process itself. In a way, the two are even more closely linked.

A decision tree algorithm is one example of an embedded feature selection technique that has been used to find interacting SNPs in GWAS data. Classification and regression tree (CART) algorithms build tree classifiers where decision splits are SNPs and samples are classified at the leaf nodes [22]. The CART algorithm greedily chooses the next SNP to split on based on some criterion (e.g., Gini impurity). This leads to trees that have highly discriminative SNPs at the top, with predictions being refined further down the tree. Because the tree is built in an iterative greedy fashion, the feature selection process is intimately tied with the classification performance. The feature search process is embedded in the classification method.

The random forest (RF) method is an improvement to CART, using multiple decision trees to aggregate and smooth performance, reducing the variance in estimates. Many decision trees are built on random subsets of the data, and then feature importance overall is computed

from the fraction of trees that contain a particular feature. RF has been applied successfully for

finding biomarkers in biomedical datasets [23, 24].

Other embedded feature selection techniques include classification algorithms from

which a feature scoring can be derived. The support vector machine (SVM), for example, is a

classification algorithm that builds a maximum-margin decision boundary anchored to relatively few training points near the boundary. Several methods leveraging the properties of this classification for feature selection have been developed, including a recursive feature elimination

method (SVM-RFE) [25] as well as gradient-based leave-one-out gene selection (GLGS), which is based on the least squares SVM classifier [26].

Some feature selection methods have been specifically designed to find epistatic SNPs with complex interactions, and a full review is provided in [27]. Some of these methods, such as

SNP Harvester [28], filter strong main effects before searching for smaller interaction effects.

Others, like maximum entropy conditional probability modelling (MECPM), utilize a greedy search to build higher-order interactions [29]. Still others, like multifactor dimensionality reduction (MDR), model epistasis by searching over the space of low-order interactions [30].

2.6.4 Combining Feature Rankings

Each feature selection method has differing strengths and weaknesses, so there can be benefit in combining multiple methods. Methods that combine multiple feature selection techniques are called ranking aggregation methods. Ranking aggregation methods have been shown to improve the rankings obtained from several different feature selection methods [31]. Rankings may be aggregated over different algorithms applied to the same data, or even over the same algorithm applied to different datasets.

Two categories of ranking aggregation methods have been described: i) order-based

aggregation and ii) score-based aggregation. Ranking aggregation is typically based on order-

based aggregation, that is, only the order of the features in the ranking is taken into account. The

advantages of order-based aggregation are that it is naturally

calibrated and scale insensitive. A simple order-based aggregation method is the Borda method,

which takes two or more ranked lists of attributes, and averages the ranks of each attribute over

all the lists. Another method is the Condorcet method, which looks at pairwise feature rankings

on each input list. The top ranked variable is the one which outranks the other variables on a

majority of the input rankings [32].

In score-based aggregation, the scores (weights) of the features from different algorithms

are combined. Score-based aggregations have to deal with several challenges. One, weights

produced by each algorithm have to be rescaled to the same range (say, between 0 and 1) so that different absolute scales do not influence the aggregate result. Two, the same weight produced

by different algorithms might represent different feature relevance and the weights have to be

calibrated (say, by multiplying the scores of each algorithm by some factor). A score-based

Borda method exists as well, which averages the scores among all feature lists. In general, rank- based methods can be converted to score-based methods by normalizing the rank value, while the reverse can be achieved by ranking based on scores [33].

Table 1 - Feature selection methods that have been applied to GWAS data.

Method | Type | Variations | Software
Chi Squared test | Univariate | | http://pngu.mgh.harvard.edu/~purcell/plink/
Logistic Regression | Univariate | LR + Interaction Terms, Sparse LR | http://www.cns.atr.jp/~oyamashi/SLR_WEB/
Support Vector Machine (SVM) | Multivariate | Greedy Regularized Least Squares | http://svmlight.joachims.org/ and http://www.esat.kuleuven.be/sista/lssvmlab/
Naïve Bayes (NB) | Univariate | Model-Averaged Naïve Bayes, BIC, MDL, K2 | http://www.dbmi.pitt.edu/content/manb
Decision Tree | Univariate | |
Random Forest | Multivariate | | http://cran.r-project.org/web/packages/randomForest/
SNPHarvester | Multivariate | | http://bioinformatics.ust.hk/SNPHarvester.html
Maximum entropy conditional probability modelling (MECPM) | Multivariate | | http://www.cbil.ece.vt.edu/ResearchOngoingSNP.htm
Multifactor Dimensionality Reduction (MDR) | Multivariate | | http://www.multifactordimensionalityreduction.org/
ReliefF | Multivariate | Sigmoid-Weighted ReliefF, Tuned ReliefF | https://code.google.com/p/ensemble-of-filters/ and https://github.com/mattstokes42/MoRF
Label Propagation (LP) | Multivariate | Spectral clustering |

3.0 ALGORITHMIC METHODS

This chapter provides background about the label propagation (LP) algorithm and describes in detail the version of LP that I implemented for application to genomic data, as well as the extensions to LP that I developed. Section 3.1 gives an introduction to LP, with a broad overview of the algorithm and its applications. Section 3.2 includes the mathematical and algorithmic details of the specific applied version of LP. Other forms of the LP algorithm are described in section 3.3, giving a more complete perspective on the general family of propagation algorithms.

Section 3.4 details the novel algorithmic extensions to LP that I developed, including a single- score extension, a prior knowledge method, and a ranking combination method.

3.1 ALGORITHMIC DETAILS

One approach in machine learning involves representing a dataset as a graph where each node denotes a sample and the weight of an edge between a pair of nodes is a measure of similarity between the two corresponding samples. Many algorithms utilize this representation; however, a family of algorithms leverages additional information in the form of labels that are added to a small set of nodes in the graph. This setting gives rise to semi-supervised learning, wherein the additional information is used to label the unlabeled nodes while obeying the constraints

imposed by the topology of the graph. This family of methods is called label propagation (LP)

algorithms.

LP is a semi-supervised algorithm that can be used for classification and for multivariate

feature ranking (e.g., ranking SNPs in case/control genomic data). The data is represented as a

bipartite graph. A bipartite graph contains two sets of nodes (i.e., sample nodes that represent

individuals and feature nodes that represent SNP-states in GWAS data) and edges that link nodes

from one set to nodes in the other set. The sample nodes are labeled with case/control status, and

the LP algorithm diffuses the labels across graph edges to the feature nodes and back again, until

a stable solution is reached. The solution results in a final labeling of all nodes in the graph

which balances the diffusion of the labels with consistency with the original labeling. The

labeling of the feature nodes can be used to rank the features.

LP scales well for thousands of samples and features. It has complexity O(kNd), where N

is the number of samples, d is the number of features and k is the number of iterations required for convergence. Typically, k is much smaller than N or d, which makes LP a relatively efficient

method. Moreover, LP can handle missing data, as well as both continuous and discrete data.

Because of its wide applicability, fast running time, and multivariate nature, LP has been

applied to several bioinformatics problems. For example, LP has been used in breast cancer gene

expression data in order to find functional modules of co-expressed genes [34]. It has been

applied to gene function prediction, utilizing known gene functions and interactions to infer the

function of other genes [35]. It has been applied successfully in classifying patients with Alzheimer's disease using protein array data [36]. To my knowledge, LP has not been applied to SNP data outside of my previous work. Unique challenges in the SNP domain include a huge feature space

(on the order of hundreds of thousands), as well as the discrete, nominal nature of SNP states (in

contrast to the continuous nature of expression data).

3.2 DETAILS OF LP ALGORITHM FOR SNP DATA

In LP, SNP data are represented as a bipartite graph G = (V, U, E) which consists of two sets of nodes V and U, where nodes in V represent samples (individuals) and nodes in U represent features (SNP-states). Note that if a SNP has three states (major homozygote, heterozygote, and minor homozygote), then it is represented by three nodes in U. In addition to the two sets of nodes, the graph contains a set of edges E, where each edge links a node in V with a node in U. An edge e(v, u) that links node v with node u is associated with a link weight w(v, u) = 1. These edges connect sample nodes to feature nodes, representing the presence of SNP-state u in individual v. Initial labels y(v) and y(u) are applied to sample and feature nodes, and take values in {-1, 0, +1}, representing known training information about case/control status (+1 and -1, respectively) or a lack of information (0). Labels on sample nodes represent disease status, and labels on SNP allele nodes represent a level of association with disease status. An example graph initialization is shown in Figure 2.


Figure 2 - A small bipartite graph for a hypothetical dataset with five samples and two SNPs. The five samples are represented by the nodes at the left (V), and are labeled with case or control status (+1 and -1, respectively). Each SNP is represented by three nodes at the right (U), with one node for each SNP state. Edges represent actual observations in the dataset and connect samples to the SNP states that they exhibit.

Given the graph initialization, the propagation algorithm finds an optimal assignment of node labels f(v) and f(u), which minimizes the objective function

Q(f) = \sum_{(v,u) \in E} w(v,u) \left( \frac{f(v)}{\sqrt{d(v)}} - \frac{f(u)}{\sqrt{d(u)}} \right)^2 + \mu \left[ \sum_{v \in V} \left( f(v) - y(v) \right)^2 + \sum_{u \in U} \left( f(u) - y(u) \right)^2 \right]

where μ is a parameter controlling the relative effect of the two parts of the cost function. The first part of the equation is a smoothness constraint, ensuring that strongly connected nodes in V and U get similar labels. Here, d(v) and d(u) are the degrees of each node in V and U, such that d(v) = \sum_{(v,u) \in E} w(v,u) and d(u) = \sum_{(v,u) \in E} w(v,u). The second part of the equation is a fitting constraint. For labeled nodes, this ensures that node labels are consistent with the initial labels. For unlabeled nodes, this term constrains the overall cost. In the discrete-label case where f ∈ {-1, 0, +1}, the optimization of this cost function is NP-hard. By relaxing the labels so that f ∈ R, however, the optimization of this equation becomes straightforward, as derived in Zhou [37], and has the solution

f^{*} = (1 - \alpha)(I - \alpha S)^{-1} Y

Here, I is the identity matrix, Y is the vector of initial labels, and S is the normalized connectivity matrix

S = \begin{pmatrix} 0 & D_V^{-1/2} W D_U^{-1/2} \\ D_U^{-1/2} W^T D_V^{-1/2} & 0 \end{pmatrix}

where W is the |V| x |U| sized matrix of edge weights, and D_V and D_U are the |V| x |V| and |U| x |U| diagonal matrices containing node degrees, respectively. The parameter α (range [0, 1]) is analogous to the scaling parameter μ (range [0, +∞)) in the objective function, with \mu = \frac{1-\alpha}{\alpha}. While the solution may be computed directly by algebraic evaluation, it requires the inversion of a T x T matrix, where T is the total number of nodes in the network (T = |V| + |U|).

This requires between O(T^2) and O(T^3) time, depending on the inversion method used. Instead, an iterative procedure may be used which diffuses node labels from one node set to the other.

First, the normalized graph Laplacian is computed as B = D_V^{-1/2} W D_U^{-1/2}. This is a special encoding of the graph which represents node degrees and adjacency. It has an interpretation as a random walk transition matrix, allowing labels to travel across graph edges. The node labels on V and U are computed iteratively as

f^{t+1}(V) = (1 - \alpha)\, y(V) + \alpha B f^{t}(U) \quad \text{and} \quad f^{t+1}(U) = (1 - \alpha)\, y(U) + \alpha B^{T} f^{t}(V)

where α is a user-specified parameter in the range [0, 1] that controls the balance between the initial labeling y and the diffusion of the current labels f. This procedure ultimately converges to the same optimized node labels as the direct algebraic evaluation. The complexity of the direct algebraic evaluation is at least O((|V| + |U|)^2), while the complexity of the iterative procedure is O(k|V||U|), where k is the number of iterations required for convergence. The exact value of k depends on the properties of the graph as well as the convergence criteria, but in practice it is found to be an order of magnitude less than both |V| and |U|, even for large graphs (>100,000 nodes) and large α (>0.9).
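To make the iterative procedure concrete, the following NumPy sketch implements the bipartite diffusion described above (the function and variable names are my own, and it assumes every node has at least one incident edge so that the degree normalization is well defined):

import numpy as np

def label_propagation_bipartite(W, y_v, y_u, alpha=0.25, tol=1e-6, max_iter=1000):
    # W: (samples x features) non-negative edge-weight matrix
    # y_v, y_u: initial labels in {-1, 0, +1} for sample and feature nodes
    # Degree-normalize the connectivity: B = D_V^{-1/2} W D_U^{-1/2}
    d_v = W.sum(axis=1)
    d_u = W.sum(axis=0)
    B = W / np.sqrt(np.outer(d_v, d_u))
    f_v = y_v.astype(float)
    f_u = y_u.astype(float)
    for _ in range(max_iter):
        # Both updates use the labels from the previous iteration
        f_v_new = (1 - alpha) * y_v + alpha * (B @ f_u)
        f_u_new = (1 - alpha) * y_u + alpha * (B.T @ f_v)
        delta = max(np.abs(f_v_new - f_v).max(), np.abs(f_u_new - f_u).max())
        f_v, f_u = f_v_new, f_u_new
        if delta < tol:
            break
    return f_v, f_u   # feature nodes can be ranked by f_u

Sorting the returned f_u values yields the biomarker ranking discussed below.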

The final labels of the nodes indicate association with the case or control group. Nodes with scores near +1 are associated with the case group, nodes with scores near -1 are associated with the control group, and nodes with scores near 0 are uninformative. For sample nodes, this score can be viewed as a prediction of case/control status based on SNP information. For feature nodes, this score can be interpreted as a case/control association measure that can be used to identify case/control-associated biomarkers. The feature node scores may be ordered to obtain a ranking of biomarkers according to their association with case/control status.

3.3 ADDITIONAL LP ALGORITHMS

The LP algorithm has a number of implementations and interpretations. The iterative algorithm described in Section 3.2 can be expressed succinctly as f^{t+1} = \alpha L f^{t} + (1-\alpha) Y, where L is the graph Laplacian, f^t contains the node labels at iteration t, and Y contains the initial labels. A slightly different mathematical formulation can be expressed as f^{t+1} = A^{-1}(\mu W f^{t} + Y), with the weight matrix W representing edge weights (W_{ii} = 0) and the diagonal matrix A containing entries A_{ii} = I(i) + \mu D_{ii} + \mu \epsilon, where I(i) is an indicator that is 1 when node i is initially labeled, D_{ii} represents node degrees, and ε is a small constant that prevents degenerate, disconnected networks. These two formulations are very similar in their convex quadratic optimization framework, but seek to optimize slightly different cost functions. The first optimizes

Q(f) = \| f_l - Y_l \|^2 + \| f_u \|^2 + \frac{\alpha}{1-\alpha} \, (D^{-1/2} f)^T L (D^{-1/2} f)

while the second optimizes

Q(f) = \| f_l - Y_l \|^2 + \mu \epsilon \| f \|^2 + \mu f^T L f

with the subscripts l and u indicating the portions of the vector corresponding to labeled and unlabeled nodes, respectively. The main difference between the formulations is a somewhat stronger regularization in the former. While both methods seek to fit the given training labels on the labeled nodes, the former formulation more strongly drives the labels on unlabeled nodes to 0 in the absence of evidence.

The graph Laplacian L can be viewed as an operator on functions defined over the graph, and encodes the network geometry in terms of node connectivity and degree. The eigenvectors of L can be used for spectral decomposition of the graph, in that eigenvectors with the smallest eigenvalues correspond to the smoothest functions over the graph. It is possible to smooth any function on the manifold by projecting it onto the p eigenvectors (p < d) with the smallest eigenvalues. An algorithm similar to LP is derived by smoothing a graph using eigenvector projection, and then fitting labels on the projected graph [38].

The LP cost criterion in Section 3.2 implicitly utilizes the graph Laplacian in the smoothness constraint. For any set of labels y, the smoothness between labels can be measured as \frac{1}{2} \sum_{i,j} (y_i - y_j)^2 w_{ij} = \mathbf{y}^T L \mathbf{y} [39]. The smoothness constraint penalizes labelings with rapid changes in y between strongly connected nodes. It is balanced with the fitting constraint using Tikhonov regularization, which minimizes the squared error from the initial labeling [40]. Various versions of the label propagation algorithm are obtained by combining different smoothness measurements with different error regularizations. An excellent review of the different quadratic criteria that can be optimized using iterative label propagation is provided by Bengio et al. (2006).

Many applications of LP use a unipartite graph representation, where all nodes represent the same type of object (e.g., proteins) and edges represent the strength of association between objects (e.g., the degree of interaction between a pair of proteins). This is in contrast to the bipartite network representation that was described in Section 3.2, where nodes represent two different types of objects (e.g., individuals and SNP alleles) and edges represent which objects co-occur (which individuals exhibit which alleles). The unipartite graph has a different geometry than the bipartite graph, and is particularly useful for assigning nominal labels to objects, but is not directly applicable as a feature selection algorithm.

One variant of the LP algorithm involves clamping the training labels. In the implementation described in Section 3.2, each node’s score is updated in each iteration, regardless of whether the node was labeled or unlabeled to begin with. This allows for some flexibility in the initial labeling, and is useful in problems where classes cannot be linearly separated. In the clamped version of LP, all initially labeled nodes have their scores permanently fixed to their initial values. This version of the algorithm is useful in unipartite representations, especially when initial labels are sparse. Clamping the initial values can prevent weak diffusion, because the signal sources will not attenuate over time. In the bipartite network representation that I use for genomic data, however, the clamped version is unsuitable. All sample nodes in the graph begin with labels, and are only one edge away from unlabeled nodes. Fixing the labels on all the sample nodes would prevent diffusion through the graph, as any information passed back from the feature nodes to the sample nodes would be lost at the clamped nodes. Every diffusion path is in effect blocked by a clamped node.
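Relative to the earlier sketch, a minimal illustration of the clamped variant changes only the update step; labeled_mask below is a hypothetical boolean array marking the initially labeled nodes:

# Clamped variant: reset originally labeled nodes after each diffusion step
f_v_new = (1 - alpha) * y_v + alpha * (B @ f_u)
f_v_new[labeled_mask] = y_v[labeled_mask]   # training labels never change

In the bipartite setting, labeled_mask would cover every sample node, which is exactly why the clamped variant blocks diffusion here.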

This clamped version of the algorithm also has a physical interpretation as an electric network. The initial node labels can be considered as positive or negative voltage sources, and the edge weights represent electrical conductance between nodes. In this case, the final labeling on the unlabeled nodes is equivalent to the voltage that would be observed on those nodes in the real-world electric network, obeying physical constraints such as Ohm’s law and Kirchhoff’s laws. A similar interpretation can be made with a heat diffusion network, where labels represent heat sources or sinks, edges represent thermal conductance, and final labels represent temperature. These interpretations provide an intuitive understanding of the algorithm’s behavior as it attempts to achieve smooth gradients while being driven toward the initial labels.

The distance from labeled nodes to unlabeled nodes is an important property of the LP graph, which again differs widely between the unipartite and bipartite representations. In the bipartite network, unlabeled nodes are only a single edge away from labeled nodes. In the unipartite network, it can take multiple hops to find a labeled node. The average distance to a labeled node has a direct impact on the optimal amount of diffusion in the LP algorithm. Sparsely labeled graphs with many distant unlabeled nodes will require more diffusion, while densely labeled graphs with only nearby unlabeled nodes will require less diffusion. The bipartite implementation is clearly in the second category, requiring relatively little diffusion to spread the training labels to every corner of the network.

Network geometry has been examined as a means to choose appropriate diffusion parameters, with a number of methods proposed. One heuristic method is to build a minimum spanning tree over the graph using Kruskal’s algorithm, which iteratively adds the shortest possible edge that connects an unconnected component. The neighborhood size is then taken to be some fraction (1/3 is suggested) of the shortest edge that connects differently labeled components. This allows diffusion to take place mostly within the local class neighborhood [13].

The LP method can also be interpreted as a Markov random walk algorithm. If edge weights are cast as transition probabilities, it is possible to calculate the probability of arriving at any particular node given a starting point and random walk length. Using this information, it is straightforward to calculate the probability of arriving at a node with a particular label. The random walk length here is crucial to the final solution. As the random walk length increases, the steady-state probabilities are approached, but without a long enough walk, we might not find labeled examples. It is possible to remove the walk length variable entirely by stipulating that the walk continues until a labeled node is found. In this case, the LP algorithm becomes equivalent to the Markov random walk, in that the final LP label on an unlabeled node is proportional to the random walk probability of arriving at a node with a particular label. Once again, in the bipartite network representation, this interpretation is of limited value, because all unlabeled nodes are just one step away from labeled nodes.

The probabilistic interpretation can be extended to the bipartite LP algorithm. The final label on a sample node represents the probability of that individual belonging to the case or control class. The label on a feature node represents a random walk probability of reaching a node of a particular class. The walk length is directly related to the diffusion parameter α. For α = 0, we have a one-step walk that uses no diffusion and is a simple proportion of cases and controls. For α = 1, we have an infinite-step walk, where the starting node is irrelevant (given a connected graph).

3.4 ALGORITHMIC EXTENSIONS

Section 3.2 gave the details of the main LP algorithm that I have implemented for GWAS SNP analysis. This section provides details of several extensions to the main LP algorithm that I have developed, implemented, and evaluated.

3.4.1 LP1 score

The LP feature scoring method has a few drawbacks relating to the representation of SNP features in the bipartite graph. Each three-state SNP is represented by three binary variables, resulting in interdependent features. A score is given to each SNP feature (SNP allele), rather than to each SNP variable, making the interpretation of the final ranked list more complex. Ideally, each SNP should be given a single score to indicate its association with disease.

The feature scores themselves can also be improved. The scores given by the LP algorithm exist on an arbitrary (-1, +1) scale which has no meaning outside of the relative ranking for a particular experiment. Furthermore, the scores are not associated with a degree of confidence to provide a measure of uncertainty about the score. Preferably, associated SNPs supported by substantial evidence should rise to the top of the ranking over associated SNPs with little support.

In order to overcome these shortcomings, I leveraged the soft labels discovered by the LP algorithm. In addition to labeling the unlabeled feature nodes, the method produces a new labeling even for the originally labeled sample nodes. I view these soft labels as probabilistic class identities, and use them to perform inference.

Many scoring methods utilize a contingency count table that tabulates observations in a dataset. A contingency table for a particular SNP consists of three rows representing the SNP’s three genotypes (AA, Aa, and aa), and two columns representing the phenotypes (case or control). The table is filled in simply by tallying every individual’s genotype-phenotype combination, resulting in the total count of all observations in the dataset. Scoring methods which operate directly on these observed counts include the chi squared statistic, the Naïve Bayes model, and the Bayesian Information Criterion.

Instead of using the counts derived from direct observations, I use partial pseudocounts derived from the LP method to fill in the contingency table. By running LP before filling in the contingency table, I allow some information to diffuse around the network, softening the hard labels. The amount of diffusion is controlled by the parameter α. At α = 0, the algorithm relies solely on the initial labeling and allows no diffusion. This setting keeps the hard labeling, and produces a count table that is derived only from observations. At α = 1, the propagation process dominates, resulting in a diffuse, uniform labeling. This leads to an uninformative count table where every column has the same distribution of counts. An intermediate setting of α between 0 and 1 allows for some diffusion, while remaining sensitive to the initial labeling.

The soft labeling method results in a contingency table based on partial pseudocounts, to which I applied the chi squared test. The result is a likelihood that the phenotype distributions are different across the SNP states. This approach provides a single score for each SNP, as well as a readily interpretable probability value that measures a SNP’s ability to discriminate between cases and controls.

Figure 3 shows the pseudocode for the LP1 algorithm. First, the LP3 scores on each sample node (in the range -1 to +1) are converted into case/control pseudocounts (in the range 0 to 1). These pseudocounts represent the probability of a sample belonging to class 1 (cases) or class 0 (controls). The pseudocounts are then used to fill in the contingency table, counting samples only for the feature states (SNP alleles) that they exhibit, as dictated by the edge weight matrix w. Note that w has three times as many columns as SNPs, representing each SNP as a set of three grouped alleles. Finally, a simple chi squared statistic is computed using the LP3-derived pseudocount table.

//Turn LP3 sample labels (-1, +1) into case/control pseudocounts (0, 1)
for i = 0 to #Samples-1
    Vcount[i, 0] = 1 - (V[i]+1)/2
    Vcount[i, 1] = (V[i]+1)/2

//Now use pseudocounts to construct contingency table
Initialize all arrays (Obs, Row, Col, Exp, LP1) to 0 for all array indices

//Compute a score for each SNP
for j = 0 to #SNPs-1
    //Combine allele-specific LP3 scores
    for a = 0 to 2
        //Collect observed pseudocounts (Obs) in 2x3 contingency table
        for i = 0 to #Samples-1
            //Only collect counts for individuals who exhibit the SNP allele
            if w[i, 3j+a] > 0
                Obs[j, 0, a] += Vcount[i, 0]
                Obs[j, 1, a] += Vcount[i, 1]
        //Compute table column and row totals (Col and Row)
        Col[j, a] = Obs[j, 0, a] + Obs[j, 1, a]
        for d = 0, 1
            Row[j, d] += Obs[j, d, a]
    //Compute LP1 statistic over 2x3 contingency table
    for d = 0, 1
        for a = 0 to 2
            //Compute expected table values (Exp) under no association
            Exp[j, d, a] = Row[j, d] * Col[j, a] / #Samples
            //Chi squared statistic using observed pseudocounts
            LP1[j] += (Obs[j, d, a] - Exp[j, d, a])^2 / Exp[j, d, a]

Figure 3 - LP1 pseudocode
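For readers who prefer a vectorized form, the following NumPy sketch computes the same statistic as the pseudocode in Figure 3; the variable names are illustrative rather than taken from my implementation:

import numpy as np

def lp1_scores(f_v, W, n_snps):
    # f_v: LP3 sample scores in [-1, +1]; W: (samples x 3*n_snps) incidence matrix
    n_samples = len(f_v)
    p_case = (f_v + 1.0) / 2.0            # pseudocount toward class 1 (case)
    p_ctrl = 1.0 - p_case                 # pseudocount toward class 0 (control)
    W_pos = (W > 0).astype(float)
    obs = np.vstack([p_ctrl @ W_pos, p_case @ W_pos])  # 2 x (3*n_snps) pseudocounts
    scores = np.empty(n_snps)
    for j in range(n_snps):
        O = obs[:, 3 * j:3 * j + 3]       # 2x3 table for SNP j
        row, col = O.sum(axis=1), O.sum(axis=0)
        E = np.outer(row, col) / n_samples  # expected counts under no association
        with np.errstate(divide="ignore", invalid="ignore"):
            scores[j] = np.nansum((O - E) ** 2 / E)  # unobserved states contribute 0
    return scores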

Finally, because the method is based on pseudocounts and not just proportions, the method is sensitive to the amount of evidence presented. Two SNPs with identical proportions of cases and controls would get the same score in the original LP algorithm, but now the actual number of cases and controls is important: the SNP with more data has a greater chance of having a significant statistic. I call the single-score LP method “LP1”, and refer to the original three-score method described in Section 3.2 as “LP3”.

3.4.2 Incorporation of prior knowledge

Data-driven methods examine just one dataset, tuning parameters and determining a feature ranking based solely on the presented data. These may be complemented by knowledge-driven methods, which utilize external sources of knowledge to inform the learning process. The corpus of publicly available knowledge about the genome is large and growing, and contains many types of data which can be leveraged in performing feature ranking. I utilized multiple sources of prior knowledge in the LP algorithm.

3.4.2.1 Sources of Knowledge

There are several kinds of knowledge about the genome. We can glean information about genetic polymorphisms based on something as simple as their frequency in the population. It is also possible to leverage knowledge about the functions of gene products to predict the downstream functional effects of variations in genes. By using the genomes of multiple species, we can infer the importance of a specific locus in the genome.

Minor Allele Frequency

One simple type of knowledge about a SNP is based on population-wide allele frequencies. Using a database independent of the study population (e.g., the HapMap Project [41]), we can estimate the minor allele frequency (MAF) for each SNP. A SNP’s MAF has been shown to be inversely correlated with its being a damaging variant [4]. That is, variants with rare minor alleles are more likely to be damaging. Purifying selection against damaging SNP variants causes them to appear at lower rates in the population than neutral or beneficial variants. Thus, we can form a prior assessment of a SNP’s likelihood of disease association based on its MAF.

Substitution Effect

The deleteriousness of exonic SNPs can be predicted based on the structural changes in the amino acid sequence, and ultimately the functional changes in the corresponding protein, caused by the nucleotide replacement. Nucleotides are parsed in sets of three, called codons, each of which codes for one amino acid. Because there are 64 possible codons (4 possible nucleotides at each of 3 positions = 4^3 = 64), but only 20 amino acids (plus a stop codon), there is significant redundancy in the amino acid coding scheme. A nucleotide change which does not change the corresponding amino acid is called a synonymous substitution, and is unlikely to have an effect on the phenotype because the corresponding protein is not altered. Nucleotide substitutions which change the amino acid sequence are called non-synonymous mutations, and can have a wide range of effects on the phenotype. If the amino acid is changed to one that has relatively similar physical properties (such as charge, polarity, hydrophobicity, or volume), there may be little effect on the phenotype. On the other hand, a radical change in amino acid properties can have a significant effect on the phenotype; for example, an amino acid change can cause structural changes in the corresponding protein, rendering it non-functional.

Possibly the most damaging type of substitution is the nonsense mutation. Here, the nucleotide change results in an amino acid codon being changed to a stop codon. This signals the biochemical machinery of the cell to stop translation, resulting in a truncated protein. The further the stop codon is from the normal end of the protein, the more genetic information goes unused, and the more deleterious the change. Less commonly, a normal stop codon may be changed to an amino acid codon, resulting in a run-on protein that typically has diminished functionality.

Several online tools are available for computing the deleteriousness of a SNP allele based on the amino acid change it causes. Examples include SIFT (Sorting Intolerant From Tolerant [42]) and PolyPhen (Polymorphism Phenotyping [43]).

Conservation

Another method of predicting a SNP’s importance for biological function stems from the analysis of conservation across species. Orthologs are conserved genomic sequences in DNA that appear relatively unchanged from species to species. The fact that a sequence has been preserved through millions of years of evolution and multiple speciation events suggests that the sequence is functionally important and intolerant to change. Therefore, a SNP which occurs in a highly conserved region is more likely to be deleterious than a SNP in a non-conserved region. Online tools for computing SNP deleteriousness based on cross-species conservation include PhyloP (Phylogenetic p-values [44]) and Genomic Evolutionary Rate Profiling (GERP++ [45]).

The GERP score is a real-valued number which represents a quantity called rejected substitutions (RS). Using the genetic sequences of multiple species, the background neutral mutation rate is estimated, along with the actual number of mutations at any given locus. The RS value is computed as the number of mutations expected under a neutral mutation rate, minus the number of observed mutations. A positive score indicates fewer mutations than expected, i.e., a region that is conserved. A score of zero represents a neutrally evolving locus. While a negative score seems to imply faster mutation than expected, it is rather a result of variability in the neutral rate estimation, and should be interpreted simply as a lack of evolutionary constraint.

3.4.2.2 Use of Prior Knowledge in LP

I employed two types of knowledge in LP, namely SNP MAFs and GERP scores. The standard LP algorithm described in Section 3.2 is a spatially uniform method that treats all features identically. I developed two methods for incorporating prior knowledge into the LP algorithm. The first is through the use of edge weights, and the second is through the use of prior pseudocounts in the contingency table analysis.

The iterative LP diffusion formula propagates labels along all dimensions equally. By using prior knowledge, however, it is possible to weight the features according to their prior likelihood of being associated with a phenotype of interest. This allows for greater diffusion through likely associated nodes, allowing them to have a greater impact on the final scoring. This is achieved through the use of edge weights.

Previously, each edge w(v,u) in the bipartite graph was given a weight of 1. I instead used a relative weighting, where all edges connected to a SNP with a higher prior likelihood of association are given higher weights. That is, \forall v \in V, \; w(v,u) \propto \text{prior}(u). This allows more label propagation through nodes with higher prior likelihood, possibly giving them a greater impact on the final labeling. SNP nodes that are unlikely to be associated permit less propagation, having little impact on the overall scoring of other nodes.

As an alternative to changing the edge weights in the network, prior knowledge can also be incorporated as prior pseudocount observations. In the chi squared contingency table analysis, prior information is added as virtual counts. Simple Laplace smoothing could be performed by adding a count of 1 to each cell, or prior information could be incorporated as a larger number of skewed counts. I add these counts to the contingency table at the final LP1 scoring step.

The distribution of these pseudocounts across the 2x3 contingency table should reflect the strength of the prior belief. Under an assumption of no prior knowledge (or knowledge against association), we would like to reinforce the null distribution of equal disease distributions for each SNP state. Ideally, we should also respect the natural distribution of genotypes, which are generally in Hardy-Weinberg equilibrium in the absence of association. So, in order to reinforce the notion of no association, I distribute the prior counts (P) as follows, where d represents the prevalence of disease in the dataset and q represents the MAF of the SNP.

Table 2 - Null distribution of prior counts.

      AA               Aa                aa
D-    (1-q)²(1-d)P     2q(1-q)(1-d)P     q²(1-d)P
D+    (1-q)²dP         2q(1-q)dP         q²dP

For SNPs that do have prior evidence, the pseudocounts should skew accordingly. The column totals are fixed according to HWE, but the row totals are not. For both the MAF and the GERP score, we operate under the assumption that the minor allele is the one increasing the risk of disease. So, as prior belief in association increases, counts move from D+ to D- in the AA column, and from D- to D+ in the aa column.

For the GERP score, we assume that scores less than 0 indicate neutrally evolving sites that should have the null distribution of no association reinforced. We then normalize positive scores from the range [0, 6.4] (where 6.4 is the maximum GERP score in practice) to the range [0, 1], and use this as a factor m for moving pseudocounts away from the null distribution. The MAF scoring follows a similar method, normalizing MAFs from the range [0, 0.5] to the range [0, 1] and setting the factor m as 1 minus the normalized MAF. For both GERP and MAF scores, an m factor near 0 indicates a prior belief of no association, while a factor of 1 indicates the maximum possible prior belief in association. Using the m factor, I modify the null count table (Table 2) with prior pseudocounts a through f as follows.

Table 3 - Null prior count table (left, see Table 2) and prior count table with a prior belief of strength m (right).

      AA    Aa    aa               AA           Aa    aa
D-    a     b     c       →   D-   a + m²d      b     c(1-m²)
D+    d     e     f           D+   d(1-m²)      e     f + cm²

For an m factor of 0 (no association), we simply add the null count table to the actual distribution. As m increases toward 1, counts move up in the first column and down in the last column, indicating disease association with the minor allele. When m = 1 (the highest possible association factor), the AA column has 0 pseudocounts added to the D+ row and the rest added to the D- row, while the opposite is true for the aa column. The exponent on the m term in the contingency table significantly upweights SNPs with high priors (near m = 1), while having less effect on SNPs with low and moderate priors.
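The construction of Tables 2 and 3, together with the m-factor normalizations, can be written compactly as follows; this is a sketch using the quantities defined in the text (q, d, P, and m), not a transcription of my implementation:

import numpy as np

def prior_count_table(q, d, P, m):
    # Rows are (D-, D+); columns are (AA, Aa, aa)
    hwe = np.array([(1 - q) ** 2, 2 * q * (1 - q), q ** 2])  # HWE genotype frequencies
    null = np.vstack([hwe * (1 - d) * P, hwe * d * P])       # Table 2
    a, b, c = null[0]
    d_, e, f = null[1]
    # Table 3: shift mass toward minor-allele/disease association as m -> 1
    return np.array([[a + m ** 2 * d_, b, c * (1 - m ** 2)],
                     [d_ * (1 - m ** 2), e, f + c * m ** 2]])

def m_from_gerp(gerp, max_gerp=6.4):
    return max(0.0, min(gerp, max_gerp)) / max_gerp   # negative scores map to 0

def m_from_maf(maf):
    return 1.0 - maf / 0.5                            # rare minor alleles map near 1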

3.4.3 Combining feature rankings

Each feature selection algorithm has its own strengths and weaknesses. In previous work, I found that the univariate chi squared method tends to identify strong single-variable signals while missing more subtle multivariate effects. Multivariate ranking methods including LP, on the other hand, tend to rank the strong univariate signals somewhat lower, instead favoring variables with smaller independent effects. This suggests that combining the rankings from univariate and multivariate methods may produce a ranking that is superior to either method alone.

To this end, I combined feature scores from multiple algorithms. With the LP1 extension, the LP scores are given SNP by SNP on a probabilistic scale, as opposed to three scores per SNP on an arbitrary scale. These scores can be used for both order-based and score-based rank aggregation. I combined scores from chi squared (a univariate method) and from LP (a multivariate method), as well as from SLR and LP, using the Borda methods of combining ranks and scores.

Given two variable scorings, the Borda method generates a third variable scoring that is a combination of both. For the score-based Borda method, the variable scores are simply averaged arithmetically. This is straightforward for variable scoring methods that operate on the same scale, e.g., a probabilistic [0, 1] range. The rank-based Borda method is very similar, except that instead of directly averaging variable scores, the cardinal variable ranks are determined from the scores, and it is the ranks that are averaged. The rank-based version has the advantage of being insensitive to the absolute scale of the scoring methods used.

Because LP1 and the chi squared test use the same scale for their scoring metric (chi squared is in fact equivalent to LP1 for α = 0), the feature scores may be combined directly. The SLR method’s scores, in contrast, are just feature coefficients in the regression model, and do not map to a probabilistic scale. Because of this, I do not perform score-based Borda combination for LP1 and SLR. Instead, I use only the rank-based Borda method to combine LP1 and SLR. Because the SLR method is implicitly selective, most of the variables get identical scores of 0. When converting to a rank, all of these non-selected variants are assigned the maximum rank, which is just the number of variables in the dataset.

4.0 EXPERIMENTAL METHODS

This chapter describes the experimental methods used to evaluate the feature selection algorithms. Section 4.1 describes the eleven datasets used for analysis, Section 4.2 describes the performance metrics used to quantify each algorithm’s performance, and Section 4.3 details the comparison algorithms used.

4.1 DATASETS

For the experiments, I used a synthetic SNP dataset, a semi-synthetic SNP dataset, and nine GWAS SNP datasets. The synthetic dataset is low-dimensional, the GWAS datasets are high-dimensional with hundreds of thousands of SNPs, and the semi-synthetic dataset has a moderate number of features. All datasets have one binary target variable that denotes the case/control status of an individual, and many trinary SNP variables that indicate SNP genotypes. The datasets are summarized in Table 4, and more details about the datasets are provided in the following subsections.

Table 4 - Summary of datasets.

Dataset                             Cases    Controls    SNPs
Synthetic                           134      866         1,000
Semi-synthetic (GAW17)              1,100    5,870       24,487
Alzheimer's Disease (TGen)          861      550         234,665
Alzheimer's Disease (ADRC)          1,291    958         682,685
Bipolar Disorder (WTCCC)            1,868    2,938       394,290
Crohn's Disease (WTCCC)             1,748    2,938       393,861
Coronary Artery Disease (WTCCC)     1,926    2,938       394,265
Hypertension (WTCCC)                1,952    2,938       393,549
Rheumatoid Arthritis (WTCCC)        1,960    2,938       393,502
Type 1 Diabetes (WTCCC)             1,963    2,938       394,217
Type 2 Diabetes (WTCCC)             1,924    2,938       394,283

4.1.1 Synthetic dataset

The synthetic dataset contains 1,000 SNVs and a binary phenotype that is a function of 35 “causal” SNVs. Of the 35 causal SNVs, 10 were modeled as common SNVs with MAFs sampled uniformly from the range 0.05 to 0.50 and odds ratios in the range 1.05 to 1.50. The other 25 SNVs were modeled as rare SNVs with MAFs sampled uniformly from the range 0.0001 to 0.01 and odds ratios in the range 2 to 10. The remaining 965 SNVs (“noise” SNVs) ranged from common to rare, but have no effect on the phenotype. Phenotype status was assigned using an additive threshold model, with each causal SNV conferring an independent risk of disease. I created a set of 1,000 individuals, of whom 13.3% had a positive phenotype. The comparable number of samples and features makes this model fairly robust to variations across instantiations of the data, reducing the need for multiple runs to observe “average” statistical performance.

4.1.2 GAW17 semi-synthetic dataset

The GAW17 dataset is a semi-synthetic mini-exome dataset that was constructed for Genetic Analysis Workshop 17, held in 2010 in Boston, Massachusetts. The genomic data were obtained from 697 unrelated individuals whose exomes were sequenced in the 1000 Genomes Project, and consist of 24,487 autosomal SNVs that map to 3,205 genes [46, 47]. This is a mini-exome dataset because the 3,205 genes comprise a subset of all human genes.

The synthetic portion of the dataset consists of four quantitative risk factors that were simulated as normally distributed phenotypes. The genes associated with each of the risk factors were chosen from the cardiovascular disease (CVD) risk and inflammation pathways. Finally, a binary disease phenotype representing CVD was modeled as a function of the four quantitative risk factors. In the synthetic phenotype data, the values of three of the four risk factors (named Q1, Q2, and Q4) and the binary phenotype were provided for each individual. The values of the risk factor Q3 were not provided, to simulate a latent factor. Q1 was modeled as a function of age and 39 SNVs in nine genes, and included a genotype-smoking interaction. Q2 was modeled as a function of 72 SNVs in 13 genes and was not influenced by age, sex, or smoking status. Q4 was modeled as a function of age, sex, and smoking; while it had a genetic component, it was not influenced by any of the SNVs in this dataset. The latent factor Q3 was influenced by 51 SNVs in 15 genes. A total of 200 replicate datasets are provided; in each replicate, the individuals have the same genotypes, but the phenotype was re-simulated.

This dataset simulates a common disease as a function of both rare and common SNVs, based on the current thinking that both common and rare SNVs contribute to the genetic basis of common diseases. Over 75% of the SNVs in the GAW17 dataset have a MAF below 1%, and nearly 40% have a MAF below 0.1% (see Figure 4).

Figure 4 - Distribution of MAFs in the GAW17 data shows many rare variants.

For my experiments, I pooled the data in the first ten GAW17 replicates to create a dataset with 24,487 SNVs and 6,970 individuals. I used the binary Q2 risk factor as the phenotype of interest, because its underlying model uses only genetic variables and does not include any latent features.

4.1.3 Late-onset Alzheimer’s disease datasets

Two late-onset Alzheimer’s disease (LOAD) GWAS datasets were used in the experiments.

TGen Dataset

The TGen GWAS data come from the Translational Genomics Research Institute (TGen) located in Phoenix, Arizona. This dataset was originally collected and analyzed by Reiman et al. [48]. The genotype data were collected on 1,411 individuals, of which 861 had LOAD and 550 did not. Of the 1,411 individuals, 644 were APOE ε4 carriers (one or more copies of the ε4 allele) and 767 were non-carriers. Of the 1,411 individuals, 1,047 are brain donors in whom the status of LOAD or control was neuropathologically determined, and 364 are living individuals in whom the status was clinically determined. The average age of the brain donors at death was 73.5 years for LOAD and 75.8 years for controls. The average age of the living individuals is 78.9 years for LOAD and 81.7 years for controls. The target phenotype variable is the presence or absence of LOAD; in this dataset, 61% (861 of 1,411) had LOAD. In the original study, an Affymetrix chip was used to measure 502,627 SNPs for each individual. After quality control, 234,665 autosomal SNPs were retained for analysis.

ADRC Dataset

The ADRC LOAD dataset comes from the University of Pittsburgh Alzheimer’s Disease Research Center (ADRC) [15]. This dataset consists of 2,229 individuals, of which 1,291 were diagnosed with LOAD and 938 were healthy age-matched controls. All of the patients met National Institute of Neurological and Communicative Disorders and Stroke (NINCDS) and Alzheimer’s Disease and Related Disorders Association (ADRDA) criteria for probable or definite AD. In the original study, 1,016,423 SNPs were measured, and after quality controls were applied by the original investigators, 682,685 SNPs located on autosomal chromosomes were retained for analysis.

4.1.4 WTCCC GWAS datasets

The Wellcome Trust Case-Control Consortium (WTCCC) is a British study covering thousands of individuals and spanning multiple diseases. I used the WTCCC phase 1 datasets, which consist of bipolar disorder (BD), coronary artery disease (CAD), Crohn’s disease (CD), hypertension (HT), rheumatoid arthritis (RA), type 1 diabetes (T1D), and type 2 diabetes (T2D). Each disease dataset contains genotypes for approximately 2,000 cases. In addition, two sets of healthy controls are provided that together contain genotypes for approximately 3,000 individuals. After quality control filters were applied, each individual in each dataset retains approximately 400,000 SNPs.

4.1.5 Quality control for GWAS datasets

Each GWAS dataset was preprocessed by applying a number of quality control criteria. Both individuals and SNPs were filtered for missingness (<1% for SNPs, <5% for individuals) to eliminate poor-quality data points, and the remaining missing genotype values were imputed. SNPs were further filtered according to minor allele frequency (MAF > 0.01). Some SNPs passed the full-dataset MAF filter but appeared monomorphic in cases or in controls, so the MAF filter was applied to cases and controls separately. SNPs were also filtered according to their Hardy-Weinberg equilibrium p-value. Finally, the datasets were examined for population stratification. The PLINK software [49] was used to perform principal component analysis (PCA) for each dataset, and clustering was performed using the first two principal components. Though mixed groups appear in the PCA plots, there is no significant clustering of cases and controls separately, indicating that these datasets are suitable for analysis without requiring adjustment for population stratification (see Figure 5).

Figure 5 - Principal component plots for select datasets. While the datasets show some population structure, individuals are not stratified by case/control status.

4.1.6 Prior knowledge sources

SNP minor allele frequencies and GERP scores were obtained using the Genome Variation Server (GVS). This online portal can be used to submit batch queries to the dbSNP server, downloading the relevant information for roughly half a million variants in about 24 hours. Records for nearly all SNPs in each dataset (>99.5%) were retrieved successfully, with both a GERP score and a MAF for each mapped rsID. SNPs that did not have an associated MAF or GERP score in the database were assigned the mean score value for that dataset. As expected, GERP scores were centered near a null score of 0, indicating a neutral mutation rate. MAFs were generally distributed uniformly between 5% and 50%, but showed a deficit of rare SNPs with MAF < 5%. No correlation was found between the MAF and the GERP score in any dataset.

Figure 6 - Plot of GERP score versus MAF for each SNP in the TGen dataset, with distribution histograms. There is no correlation between the measures, as indicated by the trend line in red. All other datasets showed similar distributions of scores, with no correlation between MAF and GERP.

4.2 EVALUATION

Several methods are available for evaluating the performance of algorithms that are used for biomarker ranking and biomarker discovery. The evaluation methods that I used included precision-recall curves, evaluation of predictive performance using Receiver Operating Characteristic (ROC) curves, reproducibility of biomarkers across datasets, evidence of biological validity obtained from the literature for top-ranked biomarkers, and computational efficiency.

4.2.1 Precision-recall curves

Biomarker selection can be viewed as information retrieval with binary classification. In this context, precision (also called positive predictive value) is the fraction of selected biomarkers that are true (or causal) biomarkers, and recall (also called sensitivity) is the fraction of true biomarkers that are retrieved. High precision indicates that an algorithm selected substantially more true biomarkers than irrelevant biomarkers, while high recall indicates that an algorithm selected most of the true biomarkers.

In the context of ranked biomarkers, appropriate sets of selected biomarkers are naturally given by the top k ranked biomarkers. For each set of k biomarkers, precision and recall values can be plotted on the y-axis and x-axis respectively to give a precision-recall (PR) curve. In addition to the PR curve, it is also possible to view the ranking in terms of an ROC curve which utilizes the true positive rate (sensitivity) and false positive rate (fraction of negatives which are incorrect). By modifying the feature selection threshold value, it is possible to compute the TPR and FPR over all possible settings, yielding the ROC curve.

Since the computation of precision and recall requires knowledge of the true biomarkers in a dataset, this evaluation is performed only for experiments that use the synthetic and semi-synthetic datasets, where the true SNPs are known.

4.2.2 Evaluation of predictive performance

Meaningful features should be predictive of disease, and classifiers developed from highly predictive SNPs should have good performance in discriminating between cases and controls. I evaluated the predictive performance of the top-ranked SNPs for each feature ranking method and dataset by measuring the performance of a series of classification models that were developed using progressively larger numbers of top-ranked SNPs. Given a set of top-ranked SNPs obtained from a ranking method applied to a training dataset, I applied a kNN classification algorithm to a test dataset containing genotypes for the corresponding SNPs. I evaluated the performance of the classification algorithm using fivefold cross-validation. The dataset was randomly partitioned into five approximately equal sets such that each set had a similar proportion of individuals who developed the disease. I applied the ranking algorithm on four sets taken together as the training data, and evaluated the top-ranked SNPs’ predictions on the remaining test data. I repeated this process for each possible test set to obtain a prediction for each individual in the dataset. I used the predictions to compute the area under the Receiver Operating Characteristic curve (AUC), which is a widely used measure of classification performance. The LP algorithm was evaluated using α = 0.25, which was found to have the best classification performance among values tested between 0.0 and 0.9. This setting puts more emphasis on matching the case/control training labels while still utilizing some network diffusion, and is suitable for finding discriminative SNPs.

The kNN algorithm is a simple non-parametric classification algorithm that utilizes pairwise distances between a query sample and the training samples. For SNP data, the pairwise distance simply counts the number of SNPs which have different values between the query and reference individuals. The classification result for the query sample is then computed as the average target value among the k most similar training samples to the query. I utilized a setting of k = 10. The kNN algorithm is suitable for a small number of features, but suffers from the curse of dimensionality as irrelevant, noisy, or redundant variables are added to the dataset. Because of this, I perform feature selection as a first step, using the downstream kNN classification performance as a proxy for evaluating the feature selection itself.
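A sketch of this evaluation loop using scikit-learn is shown below; rank_features is a hypothetical callable standing in for whichever feature ranking method is under evaluation, and it is re-run inside each fold so that the test data never influence the ranking:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

def knn_cv_auc(X, y, rank_features, n_top=10, k=10, seed=0):
    preds = np.empty(len(y))
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for train, test in skf.split(X, y):
        top = rank_features(X[train], y[train])[:n_top]  # best-first indices
        # Hamming distance counts the genotypes that differ between two samples
        clf = KNeighborsClassifier(n_neighbors=k, metric="hamming")
        clf.fit(X[train][:, top], y[train])
        preds[test] = clf.predict_proba(X[test][:, top])[:, 1]
    return roc_auc_score(y, preds)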

4.2.3 Reproducibility of biomarkers

I evaluated the feature ranking methods for reproducibility across the two LOAD datasets. The two datasets were reduced so that they contained only the genotypes for the 64,984 SNPs that were common to both. After running the feature ranking methods separately on each of the reduced datasets, the ranked SNPs were evaluated for reproducibility as follows. Given the two ranked lists of SNPs obtained by applying a feature ranking method to the two reduced datasets, the ranked lists were examined for common SNPs among the top-ranked 10 SNPs, 50 SNPs, 100 SNPs, and so on. Reproducibility was calculated as the number of SNPs common to both lists divided by the total number of SNPs in a list, yielding a value in the range from 0 (no SNPs in common) to 1 (both lists contain exactly the same SNPs). This measure only checks for the presence or absence of a SNP in a list, and ignores the actual ranks within the list. Since only LOAD had two separate datasets, reproducibility was evaluated only across these two datasets.
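The overlap measure itself is one line; tgen_top and adrc_top below are hypothetical ranked SNP-ID lists from the two reduced datasets:

def reproducibility(ranked_a, ranked_b, k):
    # Fraction of SNPs shared by the two top-k lists (ignores rank order)
    return len(set(ranked_a[:k]) & set(ranked_b[:k])) / k

overlaps = {k: reproducibility(tgen_top, adrc_top, k) for k in (10, 50, 100)}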

56 4.2.4 Evidence of biological validity

For the GWAS datasets, I examined the top-ranked SNPs for biological significance and evidence of previously documented association with disease. I used several publicly available databases and resources, including SNPedia [50], GeneCards [51], and dbSNP [10], to identify evidence linking variants and diseases. In addition to SNPs directly named in the literature as having an association with disease, I also considered a wider range of plausible associations. For each SNP, I searched whether it was in strong linkage disequilibrium with disease-related SNPs, whether the SNP was in a disease-related gene, whether the associated gene was part of a strongly conserved, disease-related family, or whether the variant has been associated with a similar condition or a plausible pathway.

The actual protocol for identifying literature evidence required several steps. First, each SNP’s rsID was entered into Google, and the top 20 hits were examined for mention of the disease of interest or similar diseases/symptoms (e.g., neurological conditions for the Alzheimer’s disease datasets, bowel-related conditions for the Crohn’s disease dataset). If this search failed to produce evidence, another search was run using both the rsID and the disease name as search terms (e.g., “rs1234 bipolar disorder”), and the top 20 hits were examined. The rsID was entered into dbSNP to identify the chromosome and, if applicable, the associated gene of the locus. The rsID was also researched using SNPedia, which returns citations of literature mentioning the SNP. If no literature association was found at the SNP level, I searched for association at the gene level (for exonic SNPs). Similar Google searches were run as above, replacing the rsID with the gene name, and searching for literature citations and gene information in the GeneCards database.

Due to the time required for manual validation of SNPs, I evaluated all algorithms’ top-ranked SNPs for the TGen and ADRC LOAD datasets, but only the LP1 extension’s results on the rest of the GWAS datasets. The chi squared test was also evaluated on the CD and HT WTCCC datasets for baseline comparison. These datasets were chosen because the performance of the LP1 algorithm was at its highest on the CD dataset, and somewhat poorer on the HT dataset.

4.3 COMPARISON ALGORITHMS

Three algorithms were used as comparison methods for LP. The SWRF algorithm was applied only to the two LOAD datasets, while the chi squared test and SLR were applied to all datasets.

4.3.1 Chi squared test

The chi squared test is a commonly used univariate statistic that has a probabilistic interpretation. The test tabulates observations in a contingency table, which records co-occurrences of variable states and the target variable. The chi squared statistic computes the deviance from the null contingency table, in which all variable states have the same distribution of the target variable. This is done by determining the “expected” value for each cell in the contingency table, which is the row total multiplied by the column total, divided by the total number of samples. The chi squared statistic is then computed as

\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}

where r is the number of rows in the contingency table, c is the number of columns, O_{i,j} is the observed value in the ith row and jth column, and E_{i,j} is the expected value in the same cell. The resulting statistic can be compared to the chi squared distribution to determine the probability of association between the SNP’s alleles and the target variable.

4.3.2 Sigmoid Weighted ReliefF

The Relief algorithm was first described by Kira and Rendell [17] as a simple, fast, and effective approach to attribute weighting. The output of the Relief algorithm is a weight between -1 and 1 for each attribute, with more positive weights indicating more predictive attributes. The weight of an attribute is updated iteratively as follows. A sample is selected from the data, and the nearest neighboring sample that belongs to the same class (nearest hit) and the nearest neighboring sample that belongs to the opposite class (nearest miss) are identified. A change in attribute value accompanied by a change in class leads to upweighting of the attribute based on the intuition that the attribute change could be responsible for the class change. On the other hand, a change in attribute value accompanied by no change in class leads to downweighting of the attribute based on the observation that the attribute change had no effect on the class. This procedure of updating the weight of the attribute is performed for each sample in the dataset. The weight updates are then averaged so that the final weight is in the range [-1, 1]. The attribute weight estimated by Relief has a probabilistic interpretation. It is proportional to the difference between two conditional probabilities, namely, the probability of the attribute’s value being different conditioned on the given nearest miss and nearest hit respectively.

In contrast to most other feature ranking or feature selection methods that consider attributes univariately, Relief algorithms are able to capture attribute interactions because the global distance measure which defines sample proximity is a multivariate function. However, because the nearest neighbors are identified by a distance measure that incorporates all attributes, the presence of many irrelevant or noisy attributes (as in SNP data) can lead to suboptimal identification of nearest neighbors.

I used a variant of ReliefF that I developed, called sigmoid weighted ReliefF (SWRF). It utilizes a soft neighborhood inclusion threshold, and has been shown on synthetic data to have greater power than ReliefF [20] for selecting predictive variables.
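For reference, a sketch of the basic Relief update described above is given below; this is the original Kira and Rendell scheme, not the SWRF variant, and the nearest-neighbor search is a naive full scan:

import numpy as np

def relief_weights(X, y, rng=None):
    # X: (n, d) discrete attribute matrix; y: (n,) binary class labels
    rng = rng if rng is not None else np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for i in rng.permutation(n):
        diff = (X != X[i]).astype(float)   # attribute mismatches vs. sample i
        dist = diff.sum(axis=1)
        dist[i] = np.inf                   # exclude the sample itself
        hit = np.argmin(np.where(y == y[i], dist, np.inf))   # nearest hit
        miss = np.argmin(np.where(y != y[i], dist, np.inf))  # nearest miss
        w += diff[miss] - diff[hit]        # up-weight class-linked changes
    return w / n                           # final weights lie in [-1, 1]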

4.3.3 Sparse Logistic Regression

The logistic regression (LR) algorithm is an algebraic method that learns a function to map the input variables (X) to the target variable (Y). Coefficient-weighted variables (wX) are summed and passed through a logistic function to perform a mapping from inputs to an estimated target classification (\hat{Y}). The ideal variable coefficients minimize the residuals obtained when mapping inputs to the target, \| \hat{Y} - Y \|_p, where p is a regularization factor. In contrast to the linear regression model, the optimal solution to the logistic regression model cannot be expressed in closed form, so an iterative optimization technique such as Newton’s method must be used. The logistic regression model space is much larger than the linear regression model space, and can represent many more functions.

The cost criterion that the LR method optimizes can be altered to change its performance. By increasing the regularization of the cost equation, the LR algorithm will give lower coefficients to most variables. So-called sparse solutions are regularized such that the weights of many variables are pulled to 0, eliminating them from the model entirely. One sparse logistic regression (SLR) method utilizes automatic relevance determination to inform the iterative optimization procedure. The SLR algorithm’s current estimate of the variable weights is used to alter the prior distribution over which weights are estimated for the next iteration. Variables with low weights have their prior distribution shrink around 0, eventually eliminating many of the variables entirely. I used a MATLAB implementation of SLR which was developed for analyzing high-dimensional fMRI voxel data [52].
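The MATLAB implementation I used is based on automatic relevance determination; as a rough stand-in for readers without it, L1-regularized logistic regression (here via scikit-learn) produces qualitatively similar sparse coefficient vectors, though it is not the same method:

import numpy as np
from sklearn.linear_model import LogisticRegression

def sparse_lr_scores(X, y, C=0.1):
    # The L1 penalty drives many coefficients to exactly 0 (non-selected features)
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X, y)
    return np.abs(model.coef_.ravel())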

5.0 EXPERIMENTAL RESULTS

This chapter provides the experimental results of applying the algorithms described in Chapter 3.0, using the evaluation metrics and datasets described in Chapter 4.0. Section 5.1 gives the results of the original LP3 algorithm and the LP1 extension on the synthetic, semi-synthetic, and GWAS datasets. Section 5.2 provides results from the use of prior knowledge in the LP1 algorithm, and Section 5.3 provides results from the feature ranking combination method.

5.1 EVALUATION OF LP3 AND LP1

This section describes the evaluation of the two LP algorithms that I developed for application to genomic data. The LP3 algorithm represents the data as a bipartite graph in which each SNP is represented by three nodes, with each SNP state scored separately. The LP1 algorithm also represents the data as a bipartite graph; however, each SNP receives a single score representing association over all SNP states. My goal in developing LP1 was to adapt the LP algorithm to directly output a single rank for each SNP. The two algorithms are described in detail in Chapter 3.0.

The performance of the LP1 and LP3 algorithms was compared with three control algorithms: chi squared, SWRF, and SLR. Each algorithm was evaluated on the synthetic data, the semi-synthetic GAW17 data, the two LOAD GWAS datasets, and the WTCCC GWAS datasets covering seven diseases. Among the GWAS datasets, results for the SWRF algorithm are presented only for the LOAD datasets because SWRF had very long runtimes. Given that the WTCCC datasets have about double the number of SNPs and more than double the number of samples compared to the LOAD datasets, the O(N^2 d) SWRF algorithm was estimated to take 10 times as long to run, resulting in intolerable month-long computation times for a single fold on a single dataset. I replaced the SWRF algorithm with the SLR algorithm as a multivariate control algorithm, since SLR has recently been shown in the literature to have good performance on GWAS data and has a shorter running time than SWRF [52].

5.1.1 Synthetic Data

On the synthetic data, the algorithms were evaluated in their ability to identify the 35 causal

SNVs (10 common SNVs and 25 rare SNVs) using precision-recall and ROC curves, but not in terms of predictive performance. The LP algorithms were run with multiple settings of the α parameter.

Figure 7 - Precision-recall curves and ROC plots for recovering 35 causal SNVs in synthetic data.

The SLR method performed the best on this dataset, with very good precision and recall.

LP3 and SWRF have the next best performances, followed by LP1 and chi squared (see Figure

7). As the LP1 method’s α parameter increases beyond 0.5, its performance rapidly decreases.

Performances in the range of α = [0, 0.5] were roughly equivalent, with α = 0.25 performing

marginally better than other parameterizations. All of the algorithms easily recover the 10

common SNVs with smaller effect sizes, as evidenced by the dropoff point on the precision-

recall plot after all common variants have been found, at 10/35 (≈0.286) on the x-axis. None of the algorithms ranked the rare SNVs highly, and all show rapidly decreasing precision after identifying approximately 70% of the causal SNVs.

5.1.2 GAW17 Data

On the semi-synthetic GAW17 data, the algorithms were evaluated in their ability to identify the

72 causal SNVs in terms of precision-recall curves and predictive performance. Overall, all algorithms had poor precision-recall performance, as shown in Figure 8. The LP1 algorithm performed somewhat better than the other methods, but still had low precision. The LP3 algorithm did no better than random. Many SNP genotypes were not observed in the data, due to the rarity of the alleles. These disconnected nodes in the graph retain their initial scores of 0, and are all indistinguishable in rank.

64

Figure 8 - Precision-recall and ROC results for GAW17 data.

The poor precision-recall performance is due to the fact that the majority of the causal SNVs are very rare in this dataset, and some of them are private mutations, in that the minor homozygote occurs in only one individual. In addition, the phenotypic model used in this dataset is multivariate and complex.

Without the ability to recover the causal variants, the predictive performance is also very

poor, as is expected. None of the algorithms do better than random classification until at least 50

SNVs are used, and even then it is only a marginal improvement. The LP1 algorithm performs

slightly better than chi squared and SLR for 100, 500, and 1000 features, but the difference is not

significant. Classification AUCs plateau around 0.6 as the number of SNVs added to the

classifier is increased. Table 5 shows the kNN prediction AUCs for the GAW17 data.

Table 5 - Prediction AUCs on GAW17 data.

Method           Features used
                 1        2        5        10       50       100      500      1000
ChiSq            0.4995   0.5006   0.5416   0.5415   0.5628   0.5703   0.5867   0.5717
                 ±0.0595  ±0.0599  ±0.0601  ±0.0607  ±0.0603  ±0.0591  ±0.0650  ±0.0656
SLR              0.5030   0.5311   0.5401   0.5412   0.5598   0.5775   -        -
                 ±0.0597  ±0.0601  ±0.0601  ±0.0607  ±0.0607  ±0.0591
LP3 (α = 0.25)   0.4934   0.5128   0.5015   0.5184   0.5244   0.6051   0.6001   0.5524
                 ±0.0604  ±0.0605  ±0.0602  ±0.0605  ±0.0620  ±0.0622  ±0.0622  ±0.0607
LP1 (α = 0.25)   0.5134   0.5311   0.5311   0.5311   0.5650   0.5806   0.5963   0.6058
                 ±0.0605  ±0.0625  ±0.0625  ±0.0627  ±0.0607  ±0.0627  ±0.0631  ±0.0654

5.1.3 GWAS Datasets

The algorithms were evaluated on a total of nine GWAS datasets in terms of predictive performance, biological significance of the top-ranked SNPs (documented evidence of association with disease), and computational efficiency. In addition, LP1 was evaluated on the two LOAD datasets for feature reproducibility.

5.1.3.1 Predictive Performance

Each feature ranking algorithm was applied to each GWAS dataset, and the top-ranked SNPs were used to construct a k-nearest neighbor classifier. This was done using fivefold cross-validation, repeating the process of selecting features on four training folds and classifying the samples in the remaining test fold. Experiments were conducted in which classifiers were constructed with an increasing number of top-ranked SNPs, ranging from 1 to 1000. The AUCs from all experiments, along with confidence intervals, are given in Appendix A. A portion of these results is presented in Table 6, which shows the AUC results for classifiers that were constructed using a small to moderate number of top-ranked SNPs (5, 10, 50, and 100 SNPs).
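The sketch below illustrates this evaluation pipeline, with the chi squared filter standing in for an arbitrary feature ranking algorithm; the fold count, neighbor count, and feature set size shown are illustrative rather than the exact experimental configuration.

    import numpy as np
    from sklearn.feature_selection import chi2
    from sklearn.model_selection import StratifiedKFold
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import roc_auc_score

    def cv_auc(X, y, n_top=10, n_neighbors=5, n_folds=5):
        """X: samples x SNPs (non-negative genotype counts, as chi2 requires);
        y: binary case/control labels. Returns the mean test-fold AUC."""
        aucs = []
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
        for train, test in skf.split(X, y):
            scores, _ = chi2(X[train], y[train])   # rank on training data only
            top = np.argsort(-scores)[:n_top]      # top-ranked SNP columns
            knn = KNeighborsClassifier(n_neighbors=n_neighbors)
            knn.fit(X[train][:, top], y[train])
            prob = knn.predict_proba(X[test][:, top])[:, 1]
            aucs.append(roc_auc_score(y[test], prob))
        return float(np.mean(aucs))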

Table 6 - Prediction AUCs for nine GWAS datasets.

Features  Algorithm  TGen    ADRC    BD      CAD     CD      HT      RA      T1D     T2D
5         ChiSq      0.7220  0.7433  0.6056  0.7702  0.6225  0.5519  0.7013  0.7448  0.7780
          SLR        0.7291  0.7100  0.5975  0.6291  0.5859  0.5756  0.6260  0.7346  0.6751
          LP3        0.7088  0.7342  0.5478  0.6612  0.5414  0.4959  0.7013  0.6891  0.6903
          LP1        0.7230  0.7433  0.6270  0.7668  0.6288  0.5601  0.7028  0.7448  0.7787
10        ChiSq      0.7394  0.7184  0.5846  0.8201  0.6346  0.5530  0.7273  0.7293  0.7329
          SLR        0.7424  0.7354  0.6166  0.7840  0.6285  0.5885  0.6352  0.7202  0.7424
          LP3        0.7369  0.7315  0.5532  0.7109  0.5548  0.4904  0.7284  0.6900  0.6907
          LP1        0.7118  0.7058  0.6137  0.8372  0.6371  0.5711  0.7282  0.7329  0.7622
50        ChiSq      0.7060  0.6438  0.5372  0.5643  0.5796  0.5263  0.6107  0.6562  0.5707
          SLR        0.7264  0.6970  0.5874  0.6568  0.6027  0.5774  0.5852  0.6841  0.7258
          LP3        0.7519  0.7151  0.5219  0.5860  0.5314  0.5060  0.6398  0.6678  0.6071
          LP1        0.6694  0.6264  0.5348  0.6254  0.5791  0.5684  0.6377  0.6704  0.5804
100       ChiSq      0.6574  0.6034  0.5300  0.5368  0.5950  0.5362  0.5884  0.5905  0.5611
          SLR        -       0.6874  0.5742  0.5993  0.6188  0.5671  0.5719  0.6639  0.6671
          LP3        0.7286  0.7154  0.5190  0.5420  0.5146  0.4921  0.5946  0.6301  0.5773
          LP1        0.6473  0.5862  0.5328  0.5452  0.5872  0.5391  0.6009  0.6054  0.5614

A moderately predictive genetic signal is found in each dataset, with peak AUCs ranging

from 0.62 to 0.83. The LP1 algorithm is comparable with the chi squared and SLR algorithms,

outperforming either one in select cases. The LP1 algorithm is an improvement over the LP3

algorithm. LP3 does only slightly worse than LP1 on some datasets, but fails to improve on

random classification for the BD, CD, and HT datasets. The SLR method selects 500 or fewer

variants in each dataset, showing good classification performance over its operable region.

However, SLR is somewhat slower to increase AUC as the number of features increases,

sometimes requiring 2-5 features to move beyond random classification, while the other

algorithms tend to select predictive variants in the top 2.

To statistically compare the predictive performance of the different algorithms, I

computed the paired Wilcoxon signed-rank test over the GWAS AUCs. For each algorithm I

used two rows of AUCs from Table 6 (two different feature selection thresholds for all 9

datasets), yielding 18 AUC samples over which to compare algorithms. Two feature selection thresholds were used in order to increase sample size enough to have a meaningful p-value, even though the AUCs are not totally independent from one feature selection threshold to another. I

examined feature selection thresholds of 5 and 10 SNPs (Table 7), as well as 50 and 100 SNPs

(Table 8). These represent characteristic regions when using a few SNPs, and when using a

moderate number of SNPs. A large number of SNPs (i.e., 500 and 1000 SNPs) was not

examined, because all algorithms tested tend to have diminishing performance in this range.
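The comparison itself can be carried out with scipy, as in the sketch below; the AUC arrays here are random placeholders rather than the values in Table 6.

    import numpy as np
    from scipy.stats import wilcoxon

    # Placeholder AUC vectors; in the actual comparison these are the 18
    # paired values taken from two rows of Table 6 for each algorithm.
    rng = np.random.default_rng(1)
    auc_lp1 = 0.65 + 0.05 * rng.random(18)
    auc_chisq = 0.63 + 0.05 * rng.random(18)

    stat, p = wilcoxon(auc_lp1, auc_chisq)   # paired signed-rank test
    better = "LP1" if np.median(auc_lp1 - auc_chisq) > 0 else "ChiSq"
    print(f"{better} (p={p:.5f})")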

Below are all pairwise Wilcoxon comparisons, with the better-performing algorithm listed in each case along with the p-value of the comparison result.

Table 7 - Wilcoxon signed-rank test p-values comparing AUCs of algorithms over 9 GWAS datasets for feature selection thresholds of 5 and 10.

       ChiSq              SLR                LP3
SLR    ChiSq (p=0.00906)  -                  -
LP3    ChiSq (p=0.00124)  LP3 (p=0.06432)    -
LP1    LP1 (p=0.00596)    LP1 (p=0.00906)    LP1 (p=0.00096)

Table 8 - Wilcoxon signed-rank test p-values comparing AUCs of algorithms over 9 GWAS datasets for feature selection thresholds of 50 and 100.

       ChiSq              SLR                LP3
SLR    SLR (p=0.0035)     -                  -
LP3    ChiSq (p=0.1096)   SLR (p=0.00096)    -
LP1    LP1 (p=0.01108)    SLR (p=0.01782)    LP1 (p=0.00528)

For small feature set sizes, the LP1 algorithm outperforms all other algorithms by a

significant margin. The chi squared algorithm has the next-best performance, outperforming the

SLR and LP3 methods. The SLR algorithm has the worst performance when using only 5 or 10

features, but is not significantly different from LP3. When using 50 or 100 features, however, the

SLR method performs the best by a significant margin. The LP1 algorithm outperforms all other

methods besides SLR when using a larger feature set size. However, when correcting for

multiple testing, LP1 is not significantly different from the chi squared test or SLR in the 50 to

100 feature range.

In addition to performing feature selection and classification on each dataset individually,

I also performed cross-dataset experiments for the LOAD datasets. In these experiments, I

filtered both LOAD datasets to the same set of 64,984 SNPs, then selected features on one dataset and used the features to construct a classifier on the other dataset. The results are shown in Table 9.

Table 9 - Prediction AUCs from selecting features on one LOAD dataset and predicting on the other.

Columns give the number of SNPs used in the classifier.

TGen (features selected on ADRC; 64,984 SNPs in the chr 1-22 overlap)

Method   1        2        5        10       50       100      500      1000
Chi Sq   0.6086   0.6863   0.7099   0.6958   0.6563   0.6097   0.5593   0.5563
         ±0.0294  ±0.0280  ±0.0270  ±0.0253  ±0.0286  ±0.0296  ±0.0310  ±0.0308
SWRF     0.5952   0.6980   0.6994   0.7005   0.6756   0.6677   0.5635   0.5195
         ±0.0296  ±0.0274  ±0.0272  ±0.0274  ±0.0284  ±0.0284  ±0.0306  ±0.0310
SLR      0.6086   0.6863   0.7164   0.7289   0.6522   0.6084   -        -
         ±0.0294  ±0.0280  ±0.0269  ±0.0263  ±0.0292  ±0.0300
LP3      0.5023   0.6039   0.7023   0.7037   0.6888   0.6543   0.6114   0.5690
         ±0.0306  ±0.0300  ±0.0272  ±0.0274  ±0.0276  ±0.0286  ±0.0298  ±0.0306
LP1      0.6086   0.6863   0.7069   0.7058   0.6643   0.6497   0.5893   0.5443
         ±0.0294  ±0.0280  ±0.0271  ±0.0256  ±0.0276  ±0.0270  ±0.0299  ±0.0307

ADRC (features selected on TGen; 64,984 SNPs in the chr 1-22 overlap)

Method   1        2        5        10       50       100      500      1000
Chi Sq   0.6172   0.6385   0.7419   0.7362   0.6695   0.6479   0.5396   0.5259
         ±0.0231  ±0.0229  ±0.0204  ±0.0208  ±0.0225  ±0.0227  ±0.0239  ±0.0122
SWRF     0.5397   0.5345   0.5350   0.5401   0.5042   0.5257   0.5201   0.5053
         ±0.0239  ±0.0241  ±0.0241  ±0.0243  ±0.0243  ±0.0241  ±0.0241  ±0.0241
SLR      0.5397   0.7006   0.7003   0.7048   0.6048   0.5854   -        -
         ±0.0214  ±0.0214  ±0.0214  ±0.0216  ±0.0233  ±0.0237
LP3      0.5397   0.6021   0.7283   0.7366   0.6853   0.6598   0.5678   0.5306
         ±0.0239  ±0.0235  ±0.0210  ±0.0208  ±0.0220  ±0.0225  ±0.0239  ±0.0239
LP1      0.6172   0.6385   0.7422   0.7364   0.6702   0.6479   0.5377   0.5264
         ±0.0231  ±0.0229  ±0.0203  ±0.0206  ±0.0218  ±0.0227  ±0.0236  ±0.0241

The results on the cross-dataset experiments are very similar to the results when performing feature selection and classification on the same dataset. This indicates that the selected features have predictive power outside of each individual cohort, suggesting that the top-ranked features are indeed generalizable.

5.1.3.2 Biological Validity

The LP1 algorithm was applied to each GWAS dataset, and the top-ranked features were examined for biological validity. Due to the time required to manually validate each SNP, the

control algorithms’ results were generally only examined for the Alzheimer’s disease datasets.

The chi squared algorithm was also evaluated on the CD and HT datasets. Table 10 summarizes

the validation results, listing the number of SNPs having evidence of association in the literature

for each dataset.

Table 10 - Summary of literature validation results.

Dataset  Algorithm  SNPs validated (% precision)
TGen     Chi Sq     6 (24%)
         SWRF       5 (20%)
         SLR        7 (28%)
         LP3        14 (56%)
         LP1        11 (44%)
ADRC     Chi Sq     10 (40%)
         SWRF       2 (8%)
         SLR        5 (20%)
         LP3        10 (40%)
         LP1        12 (48%)
BD       LP1        17 (68%)
CAD      LP1        11 (44%)
CD       Chi Sq     20 (80%)
         LP1        19 (76%)
HT       Chi Sq     9 (36%)
         LP1        12 (48%)
RA       LP1        11 (44%)
T1D      LP1        7 (28%)
T2D      LP1        15 (60%)
Total    LP1        108 (48%)

The LP1 algorithm returned many biologically validated SNPs, overall finding evidence

for 48% of the 225 SNPs examined over the nine datasets. On the LOAD datasets, the LP1 and

LP3 algorithms outperformed the control methods. Applying a z-test to compare proportions,

LP3 significantly outperforms ChiSq, SWRF and SLR on the TGen dataset, and LP1 significantly outperforms SWRF and SLR on the ADRC dataset. On the CD and HT datasets, the

LP1 method is statistically equivalent to the chi squared algorithm, and in fact returns many of

the same top-ranked SNPs. Below are the top 25 SNPs as ranked by LP1 for each dataset (Table

11 through Table 19). The control algorithms’ results are given in Appendix B.

Table 11 - Top 25 SNPs for TGen using LP1 (α = 0.25).

Rank  rsID        Chr  Gene      Notes
1     rs429358    19   APOE      APOE risk allele determined by rs7412 and rs429358 [53]
2     rs4420638   19   APOC      In strong linkage disequilibrium with APOE SNPs [54]
3     rs7412      19   APOE      APOE risk allele determined by rs7412 and rs429358 [53]
4     rs10824310  10   PRKG1     Significant association with LOAD [55]
5     rs7079348   10   C10orf11  -
6     rs3732443   3    GXYLT2    -
7     rs582790    11   -         -
8     rs934745    18   MAPK4     -
9     rs7964760   12   NAV3      NAV3 mRNA levels elevated in AD brains [56]
10    rs16974268  15   SLCO3A1   -
11    rs10499687  7    VWC2      -
12    rs6717497   2    -         -
13    rs17330779  7    NRCAM     Associated with axonal degeneration in LOAD [57]
14    rs3007246   13   -         -
15    rs7077757   10   RBM20     Meta-analysis of multiple studies showed association [58]
16    rs12162084  16   -         Significant association with LOAD [59]
17    rs3905173   1    -         Associated in Bayesian analysis of TGen data [60]
18    rs17048190  2    -         -
19    rs2968848   7    -         -
20    rs12041702  1    -         -
21    rs9934599   16   IL34      -
22    rs950922    1    ALPL      -
23    rs10115381  9    -         -
24    rs1038891   11   LRRC4C    SNP associated with LOAD in genome-wide analysis [61]
25    rs9398855   6    THEMIS    Possible gene association via GAB2 pathway [62]

Table 12 - Top 25 SNPs for ADRC using LP1 (α = 0.25).

Rank  rsID        Chr  Gene      Notes
1     rs439401    19   APOE      In strong LD with rs7412 and rs429358 [63]
2     rs5157      19   APOC4     In strong LD with other APOC risk SNPs [64]
3     rs157582    19   TOMM40    Showed LOAD association in African-American cohort [65]
4     rs2075650   19   APOE4     Predictive of longevity of LOAD patients [66, 67]
5     rs445925    19   -         Located between APOE and APOC genes [68]
6     rs8106922   19   TOMM40    Meta-analysis finds significant association with LOAD [69]
7     rs11076978  16   -         -
8     rs405509    19   APOE      APOE promoter varies LOAD risk [70]
9     rs157580    19   TOMM40    Associated with LOAD in Chinese population [71]
10    rs3738269   1    IGFN1     -
11    rs832156    1    IGFN1     -
12    rs12507679  4    STAP1     -
13    rs26845     16   ECI1      -
14    rs13132585  4    STAP1     -
15    rs17428956  1    -         -
16    rs10994553  10   -         -
17    rs9909412   17   COX10     Gene differentially expressed in LOAD patients [72]
18    rs206081    13   BRCA2     -
19    rs4558873   4    SORCS2    Gene differentially expressed in LOAD patients [73]
20    rs151716    1    -         -
21    rs12140610  1    -         -
22    rs9487940   6    -         -
23    rs279877    9    DMRT3     CNV in related DMRT1 gene associated with LOAD [74]
24    rs537761    1    -         -
25    rs8082842   18   RAB31     Gene involved in potential treatment [75]

Table 13 - Top 25 SNPs for BD using LP1 (α = 0.25).

Rank  rsID        Chr  Gene       Notes
1     rs1909936   8    -          Significant in RF analysis of WTCCC [76]
2     rs7653441   3    FNDC3B     Significant association in independent cohort [77]
3     rs6577370   1    -          -
4     rs17116117  11   HTR3B      Gene mutations disrupt serotonin regulation [78]
5     rs12355606  10   CACNB2     Associated in Han Chinese population [79]
6     rs1442650   18   LINC00907  -
7     rs12938916  17   -          Discovered in nonparametric analysis of WTCCC data [80]
8     rs7260296   19   -          -
9     rs11059460  12   -          -
10    rs420259    16   PALB2      SNP associated with BD in Scandinavian cohort [81]
11    rs2837588   21   DSCAM      Gene associated with BD [82]
12    rs858719    11   ZBTB44     Significant in WTCCC study [83]
13    rs914715    11   ZBTB44     Significant in WTCCC study [83]
14    rs2953146   2    RNPEPL1    Significant in WTCCC study [83]
15    rs16857512  1    CACNA1E    Gene implicated in treatment efficacy [84]
16    rs514636    3    LAMP3      Discovered in secondary analysis of WTCCC [80]
17    rs2683780   3    -          -
18    rs6458307   6    -          BD association discovered in Finnish population [85]
19    rs6414500   3    LAMP3      Discovered in secondary analysis of WTCCC [80]
20    rs6414498   3    LAMP3      Discovered in secondary analysis of WTCCC [80]
21    rs12472797  2    -          -
22    rs12980129  19   -          Discovered as part of epistatic interactions in WTCCC data [86]
23    rs7152966   14   -          -
24    rs9318400   13   -          -
25    rs682970    10   CELF2      Gene associated with major depression [87]

Table 14 - Top 25 SNPs for CAD using LP1 (α = 0.25).

Rank  rsID        Chr  Gene           Notes
1     rs4799934   16   PALB2          -
2     rs7906587   10   PNLIPRP3       Gene associated with mean arterial pressure [88]
3     rs11671119  19   MEF2BNB-MEF2B  Gene associated with cardiac development [89]
4     rs17042882  3    PLCL2          Gene expression modified by cardiovascular disease risk reduction therapy [90]
5     rs159171    5    -              -
6     rs16955238  16   -              -
7     rs16891338  8    SAMD12-AS1     -
8     rs16908145  8    FLJ45872       -
9     rs6989092   8    -              -
10    rs16883114  8    -              -
11    rs7653441   3    FNDC3B         Locus associated with heart rate and rhythm disorders [91]
12    rs4970605   1    -              Interaction found in WTCCC study [92]
13    rs17022496  4    BMPR1B         -
14    rs12724674  1    -              -
15    rs1333049   9    -              Replicated in German, Japanese and Korean populations [93, 94]
16    rs9884478   4    NPFFR2         -
17    rs17146094  7    EIF4H          Discovered in exhaustive epistatic analysis of WTCCC data [95]
18    rs906766    3    MED12L         -
19    rs7002837   8    -              -
20    rs17672135  1    FMN2           SNP discovered in independent experiments [96, 97]
21    rs4846770   1    MIA3           Meta-analysis implicated gene [98]
22    rs326296    3    -              -
23    rs6490506   13   ZMYM2          -
24    rs523096    9    CDKN2B-AS1     WTCCC gene replicated in independent German cohort [93]
25    rs518394    9    CDKN2B-AS1     WTCCC gene replicated in independent German cohort [93]

Table 15 - Top 25 SNPs for CD using LP1 (α = 0.25).

Rank  rsID        Chr  Gene      Notes
1     rs17116117  11   HTR3B     -
2     rs10483456  14   RALGAPA1  Associated with CD in independent study of Chinese individuals [99]
3     rs11209026  1    IL23R     Association found in multiple studies [100, 101]
4     rs2076756   16   NOD2      Associated in independent study [102]
5     rs10210302  2    ATG16L1   Defects in gene cause susceptibility to IBD, sex-related risk for CD [103]
6     rs6752107   2    ATG16L1   Defects in gene cause susceptibility to IBD, sex-related risk for CD [103]
7     rs6431654   2    ATG16L1   Defects in gene cause susceptibility to IBD, sex-related risk for CD [103]
8     rs3828309   2    ATG16L1   Defects in gene cause susceptibility to IBD, sex-related risk for CD [103, 104]
9     rs17234657  5    -         WTCCC finding replicated in independent cohorts [105, 106]
10    rs2066843   16   NOD2      Associated with CD in independent study [102]
11    rs3792106   2    ATG16L1   Defects in gene cause susceptibility to IBD, sex-related risk for CD [103, 104]
12    rs11805303  1    IL23R     IL23 implicated in CD [107]
13    rs11957215  5    -         -
14    rs9292777   5    -         WTCCC SNP replicated in independent study [108]
15    rs17221417  16   NOD2      NOD2 implicated in CD [102]
16    rs10489629  1    IL23R     Replicated in multiple studies [102, 109, 110]
17    rs2201841   1    IL23R     SNP implicated in distinct populations [111]
18    rs4957295   5    -         -
19    rs10213846  5    -         -
20    rs6871834   5    -         -
21    rs4957297   5    -         Replicated independent of WTCCC [112]
22    rs4957300   5    -         -
23    rs16869934  5    -         Discovered in BIC analysis of WTCCC [113]
24    rs12119179  1    -         Associated with a disease with similar genetic profile [114]
25    rs11209033  1    -         Cited in patent for testing for autoimmune-associated polymorphisms [115]

Table 16 - Top 25 SNPs for HT using LP1 (α = 0.25).

Rank  rsID        Chr  Gene        Notes
1     rs4765066   12   -           -
2     rs488101    9    -           Associated with arterial plaque [116]
3     rs4867173   5    -           -
4     rs11782342  8    KCNB2       Discovered as part of epistatic interactions in WTCCC data [86]
5     rs11024327  11   OTOG        Found in combined analysis of WTCCC data [117]
6     rs16857512  1    CACNA1E     Calcium gate channels implicated in BP regulation [118]
7     rs2820037   1    -           SNP associated with BP regulation [119]
8     rs2790622   1    -           -
9     rs2820038   1    -           -
10    rs6574988   14   -           -
11    rs2820046   1    -           -
12    rs16945811  17   YWHAE       Gene implicated in HT [120]
13    rs9428826   1    -           -
14    rs2398162   15   NR2F2-AS1   Population-specific association has been replicated [121]
15    rs2820026   1    -           -
16    rs921535    15   -           -
17    rs17018584  4    CCSER1      SNP associated with heart disease in rats and humans [122]
18    rs10889923  1    NEGR1       -
19    rs1022684   20   SEC23B      -
20    rs2191003   4    -           -
21    rs41515647  1    ST6GALNAC5  Gene associated with heart disease [123]
22    rs300916    4    GAB1        Discovered in combined WTCCC + Australian cohort [117]
23    rs1935683   6    -           -
24    rs13119672  4    PPARGC1A    Gene associated with HT [124]
25    rs17201619  17   -           Discovered in combined WTCCC + Australian cohort [117]

Table 17 - Top 25 SNPs for RA using LP1 (α = 0.25)

Rank  rsID        Chr  Gene       Notes
1     rs4718582   7    -          -
2     rs12670243  7    -          -
3     rs9271850   6    -          SNP associated in Swedish study [125]
4     rs17104722  14   -          -
5     rs6679677   1    PHTF1      Approached significance in low-power, independent study [126]
6     rs1733717   10   MBL2       SNP associated with risk of RA [127]
7     rs1369036   1    -          -
8     rs3129768   6    -          -
9     rs2282859   6    FGFR1OP    -
10    rs1230666        -          -
11    rs9272346   6    HLA-DQA1   Gene implicated in Taiwanese population [128]
12    rs1711029   15   -          -
13    rs2943570   8    -          -
14    rs16874205  8    -          Replicated in Spanish population [129]
15    rs2488457   1    PTPN22     Replicated in Caucasian but not Korean population [130]
16    rs11776005  8    -          -
17    rs1028850   13   LINC00598  -
18    rs10834744  11   ART1       -
19    rs9272723   6    HLA-DQA1   Gene implicated in Taiwanese population [128]
20    rs1217396   1    RSBN1      Marginal association found in independent cohort [131]
21    rs1230649   1    PHTF1      -
22    rs1230658   1    MAGI3      Gene associated with RA [132]
23    rs1217200   1    MAGI3      Gene associated with RA [132]
24    rs1562694   14   -          -
25    rs2837960   21   -          Validated in independent study [133]

Table 18 - Top 25 SNPs for T1D using LP1 (α = 0.25)

Rank  rsID        Chr  Gene      Notes
1     rs9272346   6    HLA-DQA1  Associated in WTCCC data [83]
2     rs3129768   6    -         -
3     rs6679677   1    PHTF1     Associated with T1D, may be in strong LD with causal variant [134]
4     rs17116117  11   HTR3B     -
5     rs9989228   14   MIPOL1    -
6     rs17696736  12   NAA25     In LD with possible causal SNPs, validated in independent cohort [135]
7     rs11171739  12   -         In LD with possible causal SNPs, validated in independent cohort [135]
8     rs10483456  14   RALGAPA1  -
9     rs6894569   5    -         -
10    rs765534    11   -         -
11    rs1977      6    BTN3A2    Multilocus analysis implicated gene [136]
12    rs9358932   6    -         -
13    rs7745603   6    -         -
14    rs10494787  1    FLJ43585  -
15    rs9393713   6    BTN3A2    Multilocus analysis implicated gene [136]
16    rs2237236   6    BTN3A3    -
17    rs1873914   12   RAB5B     -
18    rs1343125   1    MAGI3     -
19    rs9393848   6    -         -
20    rs9468203   6    -         -
21    rs3734536   6    BTN3A2    -
22    rs7776351   6    -         -
23    rs4711165   6    ZKSCAN8   -
24    rs12708716  16   CLEC16A   Gene associated with T1D, SNP discovered in WTCCC [137]
25    rs17711344  6    -         -

Table 19 - Top 25 SNPs for T2D using LP1 (α = 0.25)

Rank  rsID        Chr  Gene          Notes
1     rs3777582   6    CLIC5         -
2     rs11042656  11   SBF2          -
3     rs1477523   7    AC009264.1    Linked to T2D through HDL regulation [138]
4     rs17116117  11   HTR3B         -
5     rs17117531  15   -             -
6     rs10492267  12   -             Found in Bayesian analysis of WTCCC data [139]
7     rs4506565   10   TCF7L2        Validated in multiple distinct populations [140-142]
8     rs10483456  14   RALGAPA1      Linked to T2D through HDL regulation [138]
9     rs7193144   16   FTO           Validated in Indian population [143, 144]
10    rs9405484   6    LOC102723944  -
11    rs9939609   16   FTO           Replicated in Norwegian population [145]
12    rs7917983   10   TCF7L2        Validated in Indian population [146, 147]
13    rs13373826  1    SLC44A5       SNP associated with T2D [148]
14    rs1025450   18   -             -
15    rs7901275   10   TCF7L2        Discovered in independent study [149]
16    rs9926289   16   FTO           Associated with obesity in Polish population [150]
17    rs9465871   6    CDKAL1        Validated in Chinese population [151]
18    rs9939973   16   FTO           Implicated in obesity [152]
19    rs9940128   16   FTO           Linked to T2D in Indian population [153]
20    rs9367532   6    -             -
21    rs1121980   16   FTO           Validated in Swedish population [154]
22    rs1957779   14   RHOJ          -
23    rs9930506   16   FTO           Implicated in obesity [152]
24    rs358806    3    -             -
25    rs903228    2    -             -

5.1.3.3 Computational Efficiency

The LP1 algorithm is no more computationally complex than the original LP3 algorithm, requiring only a single extra iteration of propagation to collect the soft labels in a contingency table, and is O(kNd). The SLR algorithm’s computational complexity is at least O(N²d), stemming from the matrix inversions necessary to compute the regression coefficients. The chi squared algorithm is simply O(Nd). All algorithms were benchmarked on one fold of the BD

WTCCC data, running on a 2.33 GHz processor with 8 GB of RAM available and excluding time required to read in the data files. The chi squared test ran the fastest, in about 11 minutes. The

SLR method was slowest, taking nearly 3 days to complete. The LP1 method ran fairly quickly for a multivariate algorithm, requiring 42 minutes. The SWRF method was not applied to the large-scale GWAS datasets, because the increased sample size would have increased the computation time to at least several weeks for a single fold.

5.1.3.4 Feature Reproducibility on LOAD data

For the ranking reproducibility experiments, I filtered the two LOAD datasets so that they contain only the features common to both. The 64,984 SNPs shared by the two datasets were ranked by each algorithm on each dataset independently, and the overlap of these rankings was compared.
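The reproducibility measure can be computed as in the following sketch, in which the two random rankings are stand-ins for the orderings produced on TGen and ADRC.

    import numpy as np

    def reproducibility_curve(rank_a, rank_b):
        """For each cutoff f, the fraction of the top-f features shared by
        two rankings (arrays of feature indices, best first)."""
        n = len(rank_a)
        overlap = np.empty(n)
        for f in range(1, n + 1):
            shared = len(set(rank_a[:f]) & set(rank_b[:f]))
            overlap[f - 1] = shared / f
        return overlap

    rng = np.random.default_rng(2)
    rank_a = rng.permutation(1000)   # stand-ins for the 64,984-SNP rankings
    rank_b = rng.permutation(1000)
    curve = reproducibility_curve(rank_a, rank_b)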

Figure 9 shows the reproducibility results on the LOAD datasets. Chi squared identifies the first few SNPs reproducibly; these are SNPs that are located in genes apolipoprotein-E

(APOE) and apolipoprotein-C (APOC) and are known to have large effect sizes. Beyond the first few SNPs, however, the reproducibility of chi squared drops rapidly to a level which is effectively random. The SWRF algorithm produces results that are no better than random for the genome-wide datasets. The implicitly selective SLR method is not shown on this graph because only two features overlap in the selected subset, yielding virtually no reproducibility.


Figure 9 - Reproducibility curves of top-ranked features with 95% confidence envelope.

The x-axis shows the fraction of top-ranked features being considered, and the y-axis shows the fraction

of features in common to rankings obtained from each of the two datasets independently (TGen and ADRC). For

this plot, the chi squared and SWRF methods are virtually indistinguishable from the random performance curve

along the diagonal.

LP1, in contrast to these two methods, shows good reproducibility for many of the top-

ranked SNPs, and does so even in the high-dimensional datasets. The algorithm has low

reproducibility for the first few SNPs but quickly surpasses chi squared and SWRF.

5.2 EVALUATION OF LP1 WITH PRIOR KNOWLEDGE

This section presents the results of the experiments on the second algorithmic extension, described in Section 3.4.2.2. Two methods of incorporating prior knowledge were initially proposed: an edge

weighting method and a prior pseudocount method. After numerous experiments with the edge

weighting method, it was ultimately found to be ineffective. Even drastic changes in the network

edge weights had little impact on the final LP1 score. SNP rankings remained virtually

unchanged, with only minuscule changes to SNP p-value scores. It appears the edge weights do not have enough impact on the labeling of the sample nodes, from which the LP1 score is directly computed. By changing the weight of all edges connecting to a particular SNP identically, the overall proportion of signal coming from cases or controls remains unchanged.

One solution might be to have differential weighting of cases versus controls for SNPs with a prior likelihood of association. This would require knowledge (or assumption) of the genetic model underlying the association.

The prior pseudocount method, in contrast, resulted in promising changes to the rankings.

Prior knowledge experiments were performed on the synthetic, semi-synthetic, and

GWAS datasets. For the GWAS datasets, I used GERP and MAF prior knowledge scores corresponding to the SNPs in the data. For the synthetic dataset, the variables have no biological meaning, so I instead used a “gold standard” prior. This prior gives the maximum prior value to the 35 true causal variants, and a null prior to all other variants. The semi-synthetic dataset contains real SNPs, but the phenotypic model does not take evolutionary conservation into account. Because of this, I used a gold standard prior that upweights the 72 true causal

SNVs, as well as the MAF prior. The use of gold standard priors is useful for testing the algorithmic validity of the method, because the prior values are known to correspond to meaningful variants. For the GWAS datasets, a lack of good performance could be the result of either an invalid method or an invalid prior. By ensuring valid priors, we can test the appropriateness of the pseudocount method.
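The sketch below illustrates the pseudocount idea on a single SNP, assuming the prior is folded into the 2x3 case/control-by-genotype contingency table as counts scaled by the prior equivalent sample size (PESS); the exact mapping is defined in Section 3.4.2.2, and the prior distribution shown here is illustrative.

    import numpy as np

    def add_prior(table, prior, pess):
        """table: 2x3 counts (rows: case/control; cols: AA, Aa, aa).
        prior: 2x3 prior distribution for this SNP (sums to 1).
        pess: prior equivalent sample size; pess=0 recovers the raw table."""
        return table + pess * prior

    counts = np.array([[120.0, 60.0, 20.0],    # cases
                       [130.0, 55.0, 15.0]])   # controls
    prior = np.array([[0.10, 0.15, 0.30],      # upweights the minor homozygote
                      [0.25, 0.15, 0.05]])     # in cases (additive-risk assumption)
    print(add_prior(counts, prior, pess=50))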

Figure 10 shows the results of the prior knowledge experiments on the synthetic dataset, for increasing prior equivalent sample size (PESS). When PESS=0, the LP1 algorithm does not

use any prior information and is equivalent to the original LP1 algorithm. As the PESS increases, the algorithm’s ability to recover the causal variants increases as well. The PESS of 1000 at first glance appears worse than the PESS of 100, but in fact has better performance in the low-precision tail region, recovering more of the rare variants as evidenced by the better ROC performance.

Figure 10 - Precision-recall and ROC curves for prior knowledge experiments on synthetic data.

Similar results are found on the semi-synthetic GAW17 dataset. The LP1 algorithm with

PESS = 0 does especially poorly on this dataset, showing difficulty in discovering the rare causal

SNPs. A small to moderate PESS of up to 1000 yields marginal improvement in the algorithm’s

performance. The sample size of the GAW17 dataset is larger than that of the synthetic dataset, and requires larger PESS values for improved performance.

79

Figure 11 - Precision-recall and ROC curves for prior knowledge experiments on semi-synthetic data

using gold standard prior knowledge.

The GAW17 data was also analyzed using the MAF prior, with the knowledge that many of the causal SNVs in the model are very rare. The MAF prior does not significantly improve

LP1’s performance on the GAW17 data, and in fact shows decrease performance for large PESS.

The MAF prior’s failure could be explained by the fact that not all SNVs in the causal model are the rarest variants, and also that rare SNVs which appear in only one individual (about 40% of the variants) are all given the exact same upweighting, making it difficult to distinguish among them.


Figure 12 - Precision-recall and ROC curves for prior knowledge experiments on semi-synthetic data using the MAF prior (PESS = 0, 100, 1000, and 7000).

The prior knowledge method was applied to the GWAS datasets using both GERP and

MAF priors in place of the gold standard priors used in the synthetic and semi-synthetic experiments. I tested PESS values of 50 and 500, and the classification AUC results are given in Table 20 for feature set sizes between 5 and 100. The full set of results with 95% confidence intervals is given in Appendix A.

Table 20 - Prior knowledge AUCs for nine GWAS datasets.

Features  Algorithm    TGen    ADRC    BD      CAD     CD      HT      RA      T1D     T2D
5         LP1          0.7230  0.7433  0.6270  0.7668  0.6288  0.5601  0.7013  0.7448  0.7787
          LP1+GERP50   0.7228  0.7433  0.6270  0.7668  0.6236  0.5625  0.7028  0.7445  0.7739
          LP1+GERP500  0.5112  0.5043  0.5108  0.7489  0.5085  0.4946  0.6005  0.7093  0.7320
          LP1+MAF50    0.7014  0.7345  0.6160  0.7502  0.6212  0.5522  0.6969  0.7445  0.7669
          LP1+MAF500   0.5102  0.5068  0.5111  0.7023  0.5065  0.5002  0.5985  0.6875  0.6255
10        LP1          0.7118  0.7058  0.6137  0.8372  0.6371  0.5711  0.7284  0.7329  0.7622
          LP1+GERP50   0.7145  0.7102  0.6131  0.8372  0.6358  0.5715  0.7282  0.7306  0.7551
          LP1+GERP500  0.5108  0.5138  0.5118  0.8063  0.5465  0.4742  0.6654  0.7040  0.7573
          LP1+MAF50    0.6983  0.7001  0.6034  0.8361  0.6316  0.5604  0.7118  0.7296  0.7485
          LP1+MAF500   0.5129  0.5264  0.5118  0.7356  0.5115  0.4841  0.6254  0.6755  0.6548
50        LP1          0.6694  0.6264  0.5348  0.6254  0.5791  0.5684  0.6398  0.6704  0.5804
          LP1+GERP50   0.6587  0.6133  0.5376  0.5950  0.5870  0.5430  0.6377  0.6624  0.5703
          LP1+GERP500  0.5145  0.5057  0.5037  0.5903  0.5046  0.4987  0.5914  0.6383  0.5715
          LP1+MAF50    0.6455  0.6023  0.5485  0.6055  0.5901  0.5497  0.6291  0.6544  0.5693
          LP1+MAF500   0.5123  0.5148  0.5024  0.5842  0.5048  0.4964  0.5724  0.6212  0.5685
100       LP1          0.6478  0.5862  0.5328  0.5452  0.5872  0.5391  0.5946  0.6054  0.5614
          LP1+GERP50   0.6426  0.5789  0.5278  0.5514  0.5805  0.5364  0.6009  0.5983  0.5640
          LP1+GERP500  0.5088  0.5102  0.5095  0.5307  0.5073  0.4905  0.5698  0.6077  0.5455
          LP1+MAF50    0.6326  0.5643  0.5153  0.5489  0.5640  0.5456  0.6015  0.5838  0.5542
          LP1+MAF500   0.5101  0.5089  0.5084  0.5267  0.5013  0.4921  0.5448  0.6014  0.5326

For PESS = 50, the downstream classification performance is diminished, but not significantly. The PESS of 500, however, is a severe detriment to the classification performance.

On some datasets, the prior masks the true genomic signal, leading to poor feature selection and essentially random classification. On others, the PESS of 500 results in a moderate drop in AUC but does not completely mask the genomic signal.

To statistically compare the performance of the prior method on GWAS data, I computed the paired Wilcoxon signed-rank test over the GWAS AUCs. Using each of the nine GWAS datasets, I used two rows of the AUCs in Table 20 (corresponding to two different feature selection thresholds), yielding 18 paired AUC samples over which to compare algorithms. Below are the pairwise Wilcoxon comparisons with the better-performing algorithm listed in each case along with the p-value of the comparison result.

Table 21 - Wilcoxon signed-rank test p-values comparing AUCs of algorithms over 9 GWAS datasets for feature selection thresholds of 5 and 10.

             LP1               LP1+GERP50              LP1+GERP500              LP1+MAF50
LP1+GERP50   LP1 (p = 0.2801)  -                       -                        -
LP1+GERP500  LP1 (p = 0.0002)  LP1+GERP50 (p = 0.0002) -                        -
LP1+MAF50    LP1 (p = 0.0002)  LP1+GERP50 (p = 0.003)  LP1+MAF50 (p = 0.0002)   -
LP1+MAF500   LP1 (p = 0.0002)  LP1+GERP50 (p = 0.0002) LP1+GERP500 (p = 0.0548) LP1+MAF50 (p = 0.0002)

Table 22 - Wilcoxon signed-rank test p-values comparing AUCs of algorithms over 9 GWAS datasets for feature selection thresholds of 50 and 100.

             LP1               LP1+GERP50              LP1+GERP500              LP1+MAF50
LP1+GERP50   LP1 (p = 0.0155)  -                       -                        -
LP1+GERP500  LP1 (p = 0.0002)  LP1+GERP50 (p = 0.0005) -                        -
LP1+MAF50    LP1 (p = 0.0278)  LP1+GERP50 (p = 0.2501) LP1+MAF50 (p = 0.001)    -
LP1+MAF500   LP1 (p = 0.0002)  LP1+GERP50 (p = 0.0002) LP1+GERP500 (p = 0.0069) LP1+MAF50 (p = 0.0003)

The prior knowledge method performs significantly worse than the original LP1

algorithm with no prior knowledge. A larger PESS value gives significantly worse performance

than the smaller PESS value. This indicates either that a useful prior is not being used, or that it

is being incorporated incorrectly. A discussion of possible causes and potential solutions can be

found in Section 6.1.2. In addition, the GERP score generally outperforms the MAF when using

the same PESS.

5.3 EVALUATION OF LP1 WITH RANKING COMBINATION

The ranking combination experiments combined the LP1 feature scores and ranks with the chi squared test’s scores and ranks, as well as the SLR method’s ranks. The combination method was

tested on the synthetic dataset and the GWAS datasets. The method was not tested on the

GAW17 semi-synthetic dataset, due to the overall poor performance of each algorithm on this dataset. Figure 13 shows the precision-recall and ROC results of the combination

methods on the synthetic data.

Figure 13 - Combination method results on synthetic data.

The original LP1 method’s performance is almost indistinguishable from that of the score-based

combination with the chi squared test. The rank-based LP1+ChiSq method, in contrast, performs quite poorly. The LP1+SLR rank-based method also performs very similarly to the LP1 method.

The results of the combination methods on the GWAS datasets are shown in Table 23 for

5 through 100 features. The LP1+ChiSq score-based method once again has fairly similar results to the LP1 method alone. The rank-based method does not change much for the first few feature set sizes (i.e., 1, 2, and 5), but leads to slightly diminished performance for larger feature sets.

Compared to LP1, the LP1+SLR rank-based method had some diminished performance for small feature set sizes (where SLR performs poorly), but improved for larger feature sets (where SLR has the best performance).

Table 23 - Prediction AUCs for combination methods on GWAS data.

Features  Algorithm          TGen    ADRC    BD      CAD     CD      HT      RA      T1D     T2D
5         LP1+ChiSq (score)  0.7230  0.7433  0.6150  0.7668  0.6259  0.5546  0.7013  0.7448  0.7783
          LP1+ChiSq (rank)   0.7230  0.7233  0.6053  0.7501  0.6284  0.5501  0.7013  0.7448  0.7546
          LP1+SLR (rank)     0.7214  0.7433  0.6053  0.7504  0.6118  0.5661  0.7007  0.7418  0.7603
10        LP1+ChiSq (score)  0.7234  0.7111  0.6011  0.8118  0.6349  0.5670  0.7280  0.7301  0.7514
          LP1+ChiSq (rank)   0.6873  0.7027  0.6022  0.8007  0.6224  0.5529  0.7225  0.7316  0.7465
          LP1+SLR (rank)     0.7336  0.7244  0.6022  0.8016  0.6371  0.5802  0.7318  0.7311  0.7575
50        LP1+ChiSq (score)  0.6905  0.6361  0.5392  0.5718  0.5618  0.5407  0.6218  0.6700  0.5794
          LP1+ChiSq (rank)   0.6635  0.6254  0.5217  0.5525  0.5652  0.5548  0.6010  0.6702  0.5771
          LP1+SLR (rank)     0.7251  0.6518  0.5656  0.6548  0.6012  0.5715  0.6623  0.6819  0.6853
100       LP1+ChiSq (score)  0.6458  0.5915  0.5324  0.5540  0.5873  0.5391  0.5881  0.6045  0.5610
          LP1+ChiSq (rank)   0.6217  0.5619  0.5246  0.5314  0.5676  0.5307  0.5891  0.5984  0.5596
          LP1+SLR (rank)     0.6378  0.6152  0.5612  0.5228  0.6033  0.5642  0.6583  0.6547  0.6608

To statistically compare the performance of the combination methods on GWAS data, I computed the paired Wilcoxon signed-rank test over the GWAS AUCs. Using each of the nine GWAS datasets, I used two rows of AUCs from Table 23 (two different feature selection thresholds), yielding 18 paired AUC samples over which to compare algorithms. Below are the pairwise Wilcoxon comparisons, with the better-performing algorithm listed in each case along with the p-value of the comparison result.

Table 24 - Wilcoxon signed-rank test p-values comparing AUCs of algorithms over 9 GWAS datasets for feature selection thresholds of 5 and 10.

                   LP1               LP1+ChiSq (score)             LP1+ChiSq (rank)
LP1+ChiSq (score)  LP1 (p = 0.0466)  -                             -
LP1+ChiSq (rank)   LP1 (p = 0.0006)  LP1+ChiSq (score) (p = 0.0021) -
LP1+SLR (rank)     LP1 (p = 0.3030)  LP1+SLR (rank) (p = 0.9282)    LP1+SLR (rank) (p = 0.0384)

Table 25 - Wilcoxon signed-rank test p-values comparing AUCs of algorithms over 9 GWAS datasets for feature selection thresholds of 50 and 100.

                   LP1                         LP1+ChiSq (score)              LP1+ChiSq (rank)
LP1+ChiSq (score)  LP1 (p = 0.0687)            -                              -
LP1+ChiSq (rank)   LP1 (p = 0.0002)            LP1+ChiSq (score) (p = 0.0032) -
LP1+SLR (rank)     LP1+SLR (rank) (p = 0.0037) LP1+SLR (rank) (p = 0.0002)    LP1+SLR (rank) (p = 0.0012)

The Wilcoxon tests show that the LP1+SLR rank-based method performed better for larger feature set sizes of 50 and 100, but does not outperform the LP1 method alone for smaller feature set sizes. For the LP1+ChiSq methods, the score-based method performed better than the rank-based method.

6.0 CONCLUSIONS AND FUTURE WORK

In this dissertation, I developed and applied several extensions to label propagation (LP), a multivariate feature selection algorithm, with the goal of performing feature selection for biomarker discovery in GWAS SNP data. I implemented the LP algorithm as well as several extensions to tailor the method to genomic data. I applied these methods to synthetic, semi-synthetic, and GWAS datasets and evaluated their performance in terms of precision-recall, predictive power, reproducibility, and biological validity. The LP1 extension was found to improve upon the original LP3 method under several conditions, namely, in its ability to identify variants with population-wide predictive power. The prior knowledge incorporation methods did not significantly improve performance over not using prior knowledge, and the ranking combination method had limited success. A summary of the findings is presented in the next section, followed by some directions for future work in the last section.

6.1 CONTRIBUTIONS AND FINDINGS

This section summarizes the main contributions and the findings of the research presented in this dissertation.

6.1.1 LP1 Extension

On the synthetic data, the LP1 extension performed worse than the original LP3 method. The

LP1 method was unable to identify rare variants, because the signal from just a few samples is lost in the course of the population-wide contingency table analysis. In contrast, the LP3 score for a SNP state is more or less independent of sample size, so even a variant with just a few occurrences in a population can receive a high rank. The synthetic experiments also provided support for using a value between 0 and 0.5 for the parameter α, limiting the amount of diffusion in the propagation graph.

All algorithms performed poorly on the GAW17 data, stymied by the rarity of the causal variants and the complex phenotypic model. These experiments indicate that LP as formulated in this dissertation may not be well-suited to the discovery of rare SNVs. However, with the increasing sample sizes of genomic datasets, it is possible that rare variants will occur in large enough numbers for LP to be useful.

Results of the LP1 algorithm on GWAS data showed that LP1 is an improvement over the original LP3 algorithm. By using the soft sample labels as pseudocounts, LP1 computes a population-wide score that is not specific to a particular SNP allele. This score ranks a SNP highly only if there is a significant amount of data to support the association, and it does not allow rare variants that lack population-scale predictive power to be ranked highly.

While the LP1 method is effective at finding common predictive variants, it showed reduced power in finding the rare causal variants in the synthetic dataset.

The LP1 algorithm had good performance in selecting SNPs that could be validated in the literature, finding evidence for nearly half of the top-ranked 25 SNPs across all GWAS datasets.

The other top-ranked SNPs that were not validated are good candidates for further study to uncover possible mechanisms of association with disease.

6.1.2 Prior Knowledge

The edge weighting method of incorporating prior knowledge was found to be ineffective. This could be a result of the LP1 score utilizing the sample node scores directly in the contingency table analysis. Because the sample nodes begin with labels, they are relatively inflexible when the diffusion parameter α is low. Referring back to the iterative update equations, it can be seen that the current label on a node is a linear combination of the initial label and the current labels from the other side of the network. With α less than 0.5, the initial labeling dominates, and in fact, cannot be overcome by contributions from the other side of the network. That is, a node with an initial label of +1 will ultimately have a positive score no matter what the network geometry and labeling, and similarly a node with an initial label of -1 will always finish with a negative score.

In the course of experimentation, I found a low α of 0.25 to be very effective. Under this parameterization, a labeled node will not change class from +1 to -1, despite being given an updated soft labeling. Unlabeled nodes still have flexibility to drift positive or negative, within a limited range around 0. It seems that the edge weight prior knowledge method simply does not have enough effect on the relabeling of the initially labeled sample nodes to impact the final LP1 score.
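The following sketch illustrates these update dynamics. It is an illustration of the convex-combination form of the update on a bipartite graph, not the dissertation's exact formulation (which is given in Chapter 3); a fuller treatment would also normalize the feature-to-sample direction of the adjacency matrix.

    import numpy as np

    def propagate(W, y_samples, y_features, alpha=0.25, n_iter=50):
        """W: row-normalized sample-by-feature adjacency;
        y_*: initial labels (+1 cases, -1 controls, 0 unlabeled)."""
        f_s, f_v = y_samples.astype(float), y_features.astype(float)
        for _ in range(n_iter):
            f_s = alpha * W @ f_v + (1 - alpha) * y_samples      # samples pull from features
            f_v = alpha * W.T @ f_s + (1 - alpha) * y_features   # features pull from samples
        return f_s, f_v

    rng = np.random.default_rng(3)
    A = rng.random((6, 4))
    W = A / A.sum(axis=1, keepdims=True)            # row-normalize
    y_s = np.array([1.0, 1.0, 1.0, -1.0, -1.0, 0.0])
    y_v = np.zeros(4)                               # feature nodes start unlabeled
    f_s, f_v = propagate(W, y_s, y_v)
    # With alpha < 0.5 and bounded propagated scores, the (1 - alpha) * y term
    # dominates, so an initially labeled sample node never changes sign.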

In contrast, the prior pseudocount method had a marked effect on the feature ranking. The utility of the method was shown on the synthetic and semi-synthetic datasets. The gold standard prior increased LP1’s precision and recall, especially when using a large PESS. The prior method

results on the GWAS datasets were less promising, having no significant impact on the

downstream classification accuracy for small values of PESS, and sometimes decreasing the

accuracy for large values of PESS.

The poor performance of the prior knowledge method could stem from a few sources. It

is possible that the MAF and GERP scores being used are simply not well-correlated with the predictiveness of common variants. The lack of correlation between the MAF and GERP scores themselves suggests that at least one of them is uncorrelated with the predictive variants: if both were strongly correlated with predictiveness, they would necessarily show some correlation with each other. Either one score is well-correlated with predictiveness and the other is not, or both are only weakly correlated with it.

It is also possible that an incorrect genetic model is being used. The prior pseudocount method assumes that the minor allele is the disease-causing allele, skewing the prior count table such that the disease prevalence is higher for the aa genotype, lower for the AA genotype, and unchanged for the Aa genotype. In effect, this assumes an additive model of disease association.

If this model is not true for a SNP, the counts could skew in the wrong direction, potentially diluting genomic signal in the data rather than enhancing it.

Finally, it is possible that the mapping from MAF or GERP score to the prior count table is being performed incorrectly. My experiments utilized an exponential mapping from MAF or

GERP to the actual prior knowledge factor. While this does put more emphasis on the extreme

MAF and GERP scores, it is possible that this weighting is not enough. An analysis by Gorlov et al. [4] suggests that it is indeed the rarest mutations which have the strongest associations with disease, but this may not hold true for SNPs with moderate to large MAF. It is possible that

GERP and MAF are useful as priors only for rare to very rare SNVs.

6.1.3 Combination Method

The score-based combination method was found to have relatively little impact when combining LP1 and the chi squared test. It is possible that the chi squared test dominates this combination, because the chi squared test tends to result in smaller p-values than the LP1 metric. As LP1’s α parameter increases, the resulting contingency table morphs from the original chi squared table with hard counts to a contingency table that represents diffusion using soft labels. These counts are by their nature less extreme, leading to less extreme distribution differences, and ultimately larger (less significant) p-values. The rank-based chi squared combination method had somewhat worse results than the score-based method.

The LP1+SLR rank-based method was generally worse than LP1 for small feature set sizes (where SLR performs worse), but better for larger feature set sizes (where SLR performs better). Applying ranks to the SLR-scored variables is somewhat different from doing so for the chi squared test, because most of the features under the SLR model get an identical score of 0. This means that all features not used in the SLR model get the maximum possible rank, while usually only about 500 features have true ranks. Averaging these ranks with LP1’s ranks therefore almost amounts to an implicit feature selection step. Most variables’ combination ranks will be the LP1 rank averaged with the maximum possible rank, putting them far down the ranking no matter what the LP1 rank is. Only features selected by SLR have a chance of being ranked highly. If

SLR does select meaningful features, however, the ranking can be fine-tuned by the addition of the LP1 ranking.
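A minimal sketch of this rank-averaging behavior is shown below; the function names are illustrative, and the SLR rank array is assumed to already assign the maximum rank to its unselected (zero-weight) features.

    import numpy as np

    def scores_to_ranks(scores, higher_is_better=True):
        """Convert a score vector to ranks (0 = best)."""
        order = np.argsort(-scores if higher_is_better else scores)
        ranks = np.empty_like(order)
        ranks[order] = np.arange(len(order))
        return ranks

    def combine_ranks(rank_lp1, rank_slr):
        """Average two rank vectors and return the combined ordering, best
        first. Features outside the SLR model carry the maximum rank, so
        averaging pushes them far down regardless of their LP1 rank."""
        avg = (rank_lp1 + rank_slr) / 2.0
        return np.argsort(avg)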

6.2 FUTURE WORK

The experimental work presented in this research explored the application of LP for biomarker discovery in genomic data. Several extensions and directions for future work are possible.

While the prior knowledge method did not significantly improve the performance of LP1, there are a number of ways to modify it that might lead to improved performance. Fitting the proper genetic model to each SNP could improve the method, eliminating the case where the prior counts dilute the signal in the data rather than enhance it. It would be possible to utilize the

LP3 scores directly to glean information about the genetic model of each SNP. In the LP3 model, each SNP allele is scored according to its association with the case or control group, so the risk and protective alleles can be identified simply by finding the alleles with the largest positive and negative scores, respectively. This prior method is not totally independent of the data, but it only utilizes the data to determine the direction of association and not the strength of the prior.

Another way to determine the proper genetic model would be to examine each model exhaustively, and select the best-fitting model. A fully genotypic model could be analyzed by computing a statistic such as the chi squared criterion or the BIC for the 2x3 contingency table.

Other models would be represented by a collapsed, 2x2 version of the contingency table. The recessive model, for instance, would be a result of combining the AA and Aa columns, while the dominant model would combine the Aa and aa columns.
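The sketch below illustrates this exhaustive model fitting, using scipy's chi squared test as the scoring criterion; comparing raw statistics across tables with different degrees of freedom is a simplification that a BIC-based score, as suggested above, would correct for.

    import numpy as np
    from scipy.stats import chi2_contingency

    def best_genetic_model(table):
        """table: 2x3 counts (rows: case/control; cols: AA, Aa, aa).
        Scores the full genotypic table and the 2x2 collapses for the
        dominant (Aa + aa) and recessive (AA + Aa) models."""
        models = {
            "genotypic": table,
            "dominant":  np.column_stack([table[:, 0],
                                          table[:, 1] + table[:, 2]]),
            "recessive": np.column_stack([table[:, 0] + table[:, 1],
                                          table[:, 2]]),
        }
        stats = {name: chi2_contingency(t)[0] for name, t in models.items()}
        return max(stats, key=stats.get), stats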

Yet another method of including prior knowledge could be to include pseudocount bias nodes in the network propagation. In this method the graph would be initialized with two extra sample nodes having labels +1 and -1 which are connected to every feature node in the graph.

These nodes’ labels are fixed, and do not update at any step of the iterative propagation process.

The prior knowledge is encapsulated as edge weights that connect the bias nodes to the feature nodes. For a SNP with no prior belief of association, the weights connecting to the bias nodes are 0, meaning they have no effect on the propagation equations for that SNP. On the other hand, SNPs with prior evidence of association have increased weights connecting to the bias nodes, indicating a larger pseudocount of cases or controls for a particular SNP allele. This allows the prior knowledge to actually have an effect on the propagation itself, rather than just being added as pseudocounts at the end (as described in Section 3.4.2.2).

The ranking combination method proved to be effective for the LP1+SLR rank-based method using a moderate number of features. This could possibly be a result of the implicit feature selection step that SLR undertakes. Instead of directly averaging ranks, it might also be effective to perform a two-step ranking by using an implicitly selective algorithm like SLR followed by ranking only those selected variants. This could potentially overcome SLR’s problem of low predictive performance for small feature set sizes.

Finally, the LP algorithm is a semi-supervised algorithm, but I did not fully explore how the algorithm might utilize unlabeled data. Unlabeled sample nodes would represent individuals for whom we have the genotype, but no phenotype. Additional unlabeled samples could add extra diffusion paths to the graph, reinforcing functional modules and affecting the network propagation. This might ultimately allow the LP algorithm to leverage disparate data sources where the phenotype is unlabeled or uncertain, potentially improving the algorithm’s ability to select meaningful variants.

APPENDIX A

AUC RESULTS FOR GWAS DATASETS

This Appendix contains the AUC results obtained from all classification experiments. Each table

(Table 26 through Table 34) gives results for one GWAS dataset and gives the dataset name, the

feature ranking algorithms that were used, and the number of features used in the kNN classifier.

ChiSq refers to the chi squared test, SLR is the sparse logistic regression method, LP3 is the original LP algorithm with α = 0.25, and LP1 is the single-score extension with α = 0.25.

LP1+GERP50 refers to the LP1 score using the GERP prior with PESS set to 50, while

LP1+GERP500 refers to the GERP prior with PESS set to 500. The experiments using the MAF prior are named in an analogous fashion. The last set of experiments is the combination methods, which combine LP1 with chi squared or SLR, using both rank and score-based methods.

Table 26 - Classification AUC results for TGen data.

TGen              1        2        5        10       50       100      500      1000
ChiSq             0.6992   0.6945   0.7220   0.7394   0.7060   0.6574   0.6013   0.5953
  95% CI          ±0.0139  ±0.0141  ±0.0136  ±0.0132  ±0.0140  ±0.0148  ±0.0155  ±0.0154
SLR               0.6783   0.6876   0.7291   0.7424   0.7264   -        -        -
  95% CI          ±0.0144  ±0.0137  ±0.0134  ±0.0131  ±0.0131  -        -        -
LP3               0.6733   0.6904   0.7088   0.7369   0.7519   0.7286   0.6138   0.5735
  95% CI          ±0.0144  ±0.0137  ±0.0139  ±0.0131  ±0.0135  ±0.0137  ±0.0131  ±0.0154
LP1               0.6992   0.6945   0.7230   0.7118   0.6694   0.6473   0.5958   0.5820
  95% CI          ±0.0139  ±0.0141  ±0.0136  ±0.0138  ±0.0145  ±0.0149  ±0.0154  ±0.0156
LP1+GERP50        0.6992   0.6945   0.7228   0.7145   0.6587   0.6426   0.5725   0.5620
  95% CI          ±0.0139  ±0.0141  ±0.0136  ±0.0138  ±0.0145  ±0.0149  ±0.0154  ±0.0156
LP1+GERP500       0.5165   0.5076   0.5112   0.5108   0.5145   0.5088   0.4923   0.4986
  95% CI          ±0.0164  ±0.0164  ±0.0166  ±0.0166  ±0.0166  ±0.0166  ±0.0166  ±0.0166
LP1+MAF50         0.6992   0.6945   0.7014   0.6983   0.6455   0.6326   0.5488   0.5313
  95% CI          ±0.0139  ±0.0141  ±0.0136  ±0.0138  ±0.0145  ±0.0149  ±0.0154  ±0.0156
LP1+MAF500        0.5025   0.5110   0.5102   0.5129   0.5123   0.5101   0.4988   0.4976
  95% CI          ±0.0164  ±0.0164  ±0.0166  ±0.0166  ±0.0166  ±0.0166  ±0.0166  ±0.0166
LP1+ChiSq_Score   0.6992   0.6945   0.7230   0.7234   0.6905   0.6458   0.5852   0.5871
  95% CI          ±0.0139  ±0.0141  ±0.0136  ±0.0138  ±0.0139  ±0.0149  ±0.0154  ±0.0156
LP1+ChiSq_Rank    0.6992   0.6945   0.7230   0.6873   0.6635   0.6217   0.5746   0.5321
  95% CI          ±0.0139  ±0.0141  ±0.0136  ±0.0140  ±0.0145  ±0.0149  ±0.0154  ±0.0156
LP1+SLR_Rank      0.6992   0.6945   0.7214   0.7336   0.7251   0.6378   0.5755   0.5713
  95% CI          ±0.0139  ±0.0141  ±0.0134  ±0.0131  ±0.0132  ±0.0149  ±0.0154  ±0.0156

Table 27 - Classification AUC results for ADRC data.

ADRC              1        2        5        10       50       100      500      1000
ChiSq             0.6834   0.7369   0.7433   0.7184   0.6438   0.6034   0.5445   0.5349
  95% CI          ±0.0112  ±0.0105  ±0.0104  ±0.0108  ±0.0116  ±0.0120  ±0.0122  ±0.0122
SLR               0.6834   0.6911   0.7100   0.7354   0.6970   0.6874   -        -
  95% CI          ±0.0112  ±0.0111  ±0.0109  ±0.0105  ±0.0111  ±0.0112  -        -
LP3               0.6325   0.6756   0.7342   0.7315   0.7151   0.7154   0.6096   0.5435
  95% CI          ±0.0117  ±0.0105  ±0.0110  ±0.0110  ±0.0107  ±0.0109  ±0.0104  ±0.0122
LP1               0.6834   0.7369   0.7433   0.7058   0.6264   0.5862   0.5383   0.5228
  95% CI          ±0.0112  ±0.0105  ±0.0104  ±0.0110  ±0.0118  ±0.0121  ±0.0123  ±0.0123
LP1+GERP50        0.6834   0.7369   0.7433   0.7102   0.6133   0.5789   0.5464   0.5173
  95% CI          ±0.0112  ±0.0105  ±0.0104  ±0.0110  ±0.0118  ±0.0121  ±0.0123  ±0.0123
LP1+GERP500       0.4975   0.5033   0.5043   0.5138   0.5057   0.5102   0.5013   0.5082
  95% CI          ±0.0164  ±0.0164  ±0.0166  ±0.0166  ±0.0166  ±0.0166  ±0.0166  ±0.0166
LP1+MAF50         0.6834   0.7369   0.7345   0.7001   0.6023   0.5643   0.5523   0.5069
  95% CI          ±0.0112  ±0.0105  ±0.0104  ±0.0110  ±0.0118  ±0.0121  ±0.0123  ±0.0123
LP1+MAF500        0.5025   0.5054   0.5068   0.5264   0.5148   0.5089   0.5166   0.5009
  95% CI          ±0.0164  ±0.0164  ±0.0166  ±0.0166  ±0.0166  ±0.0166  ±0.0166  ±0.0166
LP1+ChiSq_Score   0.6834   0.7369   0.7433   0.7111   0.6361   0.5915   0.5441   0.5101
  95% CI          ±0.0112  ±0.0105  ±0.0104  ±0.0109  ±0.0117  ±0.0121  ±0.0123  ±0.0129
LP1+ChiSq_Rank    0.6834   0.7369   0.7233   0.7027   0.6254   0.5619   0.5303   0.5216
  95% CI          ±0.0112  ±0.0105  ±0.0104  ±0.0111  ±0.0119  ±0.0122  ±0.0125  ±0.0126
LP1+SLR_Rank      0.6834   0.7369   0.7433   0.7244   0.6518   0.6152   0.5401   0.5233
  95% CI          ±0.0112  ±0.0105  ±0.0104  ±0.0110  ±0.0116  ±0.0119  ±0.0122  ±0.0123

Table 28 - Classification AUC results for BD WTCCC data.

BD                1        2        5        10       50       100      500      1000
ChiSq             0.5358   0.5724   0.6056   0.5846   0.5372   0.5300   0.5301   0.5220
  95% CI          ±0.0164  ±0.0166  ±0.0164  ±0.0166  ±0.0166  ±0.0168  ±0.0166  ±0.0166
SLR               0.5353   0.5461   0.5975   0.6166   0.5874   0.5742   -        -
  95% CI          ±0.0166  ±0.0164  ±0.0164  ±0.0164  ±0.0165  ±0.0165  -        -
LP3               0.5211   0.5478   0.5478   0.5532   0.5219   0.5190   0.5089   0.4911
  95% CI          ±0.0167  ±0.0169  ±0.0169  ±0.0169  ±0.0169  ±0.0169  ±0.0167  ±0.0167
LP1               0.5366   0.5653   0.6270   0.6137   0.5348   0.5328   0.5318   0.5365
  95% CI          ±0.0165  ±0.0163  ±0.0161  ±0.0163  ±0.0167  ±0.0167  ±0.0167  ±0.0167
LP1+GERP50        0.5349   0.5653   0.6270   0.6131   0.5376   0.5278   0.5331   0.5379
  95% CI          ±0.0165  ±0.0163  ±0.0161  ±0.0163  ±0.0167  ±0.0167  ±0.0167  ±0.0167
LP1+GERP500       0.5075   0.5067   0.5108   0.5118   0.5037   0.5095   0.4933   0.5042
  95% CI          ±0.0165  ±0.0165  ±0.0167  ±0.0167  ±0.0167  ±0.0167  ±0.0167  ±0.0167
LP1+MAF50         0.5349   0.5653   0.6160   0.6034   0.5485   0.5153   0.5355   0.5324
  95% CI          ±0.0165  ±0.0163  ±0.0161  ±0.0163  ±0.0167  ±0.0167  ±0.0167  ±0.0167
LP1+MAF500        0.5066   0.5061   0.5111   0.5118   0.5024   0.5084   0.4926   0.4958
  95% CI          ±0.0165  ±0.0165  ±0.0167  ±0.0167  ±0.0167  ±0.0167  ±0.0167  ±0.0167
LP1+ChiSq_Score   0.5366   0.5701   0.6150   0.6011   0.5392   0.5324   0.5366   0.5214
  95% CI          ±0.0165  ±0.0163  ±0.0162  ±0.0164  ±0.0167  ±0.0167  ±0.0167  ±0.0168
LP1+ChiSq_Rank    0.5262   0.5514   0.6053   0.6022   0.5217   0.5246   0.5291   0.5107
  95% CI          ±0.0166  ±0.0164  ±0.0164  ±0.0164  ±0.0168  ±0.0168  ±0.0168  ±0.0169
LP1+SLR_Rank      0.5302   0.5514   0.6053   0.6022   0.5656   0.5612   0.5291   0.5107
  95% CI          ±0.0166  ±0.0164  ±0.0164  ±0.0164  ±0.0168  ±0.0168  ±0.0168  ±0.0169

Table 29 - Classification AUC results for CAD WTCCC data.

CAD                1        2        5        10       50       100      500      1000
ChiSq              0.5991   0.6804   0.7702   0.8201   0.5643   0.5368   0.5295   0.5224
  95% CI           ±0.0159  ±0.0149  ±0.0129  ±0.0116  ±0.0165  ±0.0167  ±0.0165  ±0.0165
SLR                0.4949   0.5408   0.6291   0.7840   0.6568   0.5993   -        -
  95% CI           ±0.0163  ±0.0163  ±0.0157  ±0.0127  ±0.0149  ±0.0157  -        -
LP3                0.5503   0.5999   0.6612   0.7109   0.5860   0.5420   0.5218   0.5120
  95% CI           ±0.0171  ±0.0167  ±0.0163  ±0.0159  ±0.0165  ±0.0167  ±0.0167  ±0.0167
LP1                0.6804   0.6804   0.7668   0.8372   0.6254   0.5452   0.5224   0.5239
  95% CI           ±0.0149  ±0.0149  ±0.0129  ±0.0108  ±0.0161  ±0.0165  ±0.0167  ±0.0165
LP1+GERP50         0.6804   0.6804   0.7668   0.8372   0.5950   0.5514   0.5102   0.5114
  95% CI           ±0.0149  ±0.0149  ±0.0129  ±0.0108  ±0.0163  ±0.0165  ±0.0165  ±0.0165
LP1+GERP500        0.6021   0.6804   0.7489   0.8063   0.5903   0.5307   0.5152   0.5036
  95% CI           ±0.0157  ±0.0149  ±0.0135  ±0.0120  ±0.0163  ±0.0165  ±0.0165  ±0.0165
LP1+MAF50          0.6804   0.6804   0.7502   0.8361   0.6055   0.5489   0.5202   0.5164
  95% CI           ±0.0149  ±0.0149  ±0.0129  ±0.0108  ±0.0163  ±0.0165  ±0.0165  ±0.0165
LP1+MAF500         0.5568   0.6523   0.7023   0.7356   0.5842   0.5267   0.5173   0.4943
  95% CI           ±0.0157  ±0.0149  ±0.0135  ±0.0120  ±0.0163  ±0.0165  ±0.0165  ±0.0165
LP1+ChiSq_Score    0.6719   0.6804   0.7668   0.8118   0.5718   0.5515   0.5316   0.5249
  95% CI           ±0.0149  ±0.0149  ±0.0129  ±0.0120  ±0.0165  ±0.0165  ±0.0167  ±0.0165
LP1+ChiSq_Rank     0.6619   0.6744   0.7501   0.8007   0.5525   0.5489   0.5304   0.5275
  95% CI           ±0.0149  ±0.0149  ±0.0129  ±0.0120  ±0.0165  ±0.0165  ±0.0167  ±0.0165
LP1+SLR_Rank       0.6628   0.6689   0.7504   0.8016   0.6548   0.5540   0.5314   0.5228
  95% CI           ±0.0149  ±0.0149  ±0.0129  ±0.0120  ±0.0149  ±0.0165  ±0.0167  ±0.0165

Table 30 - Classification AUC results for CD WTCCC data.

CD                 1        2        5        10       50       100      500      1000
ChiSq              0.5509   0.5766   0.6225   0.6346   0.5796   0.5950   0.5656   0.5605
  95% CI           ±0.0171  ±0.0169  ±0.0165  ±0.0163  ±0.0171  ±0.0167  ±0.0171  ±0.0169
SLR                0.5509   0.5486   0.5859   0.6285   0.6027   0.6188   -        -
  95% CI           ±0.0171  ±0.0172  ±0.0167  ±0.0163  ±0.0165  ±0.0163  -        -
LP3                0.5506   0.5540   0.5414   0.5548   0.5314   0.5146   0.5073   0.5039
  95% CI           ±0.0171  ±0.0171  ±0.0172  ±0.0172  ±0.0171  ±0.0172  ±0.0171  ±0.0171
LP1                0.5509   0.5766   0.6288   0.6371   0.5791   0.5872   0.5664   0.5365
  95% CI           ±0.0171  ±0.0169  ±0.0165  ±0.0163  ±0.0169  ±0.0167  ±0.0169  ±0.0171
LP1+GERP50         0.5509   0.5822   0.6236   0.6358   0.5870   0.5805   0.5750   0.5488
  95% CI           ±0.0171  ±0.0169  ±0.0165  ±0.0163  ±0.0169  ±0.0169  ±0.0169  ±0.0171
LP1+GERP500        0.4986   0.4977   0.5085   0.5465   0.5046   0.5073   0.5530   0.5550
  95% CI           ±0.0167  ±0.0167  ±0.0171  ±0.0172  ±0.0171  ±0.0171  ±0.0171  ±0.0169
LP1+MAF50          0.5509   0.5822   0.6212   0.6316   0.5901   0.5640   0.5547   0.5364
  95% CI           ±0.0171  ±0.0169  ±0.0165  ±0.0163  ±0.0169  ±0.0169  ±0.0169  ±0.0171
LP1+MAF500         0.4986   0.4977   0.5065   0.5115   0.5048   0.5013   0.5530   0.5317
  95% CI           ±0.0167  ±0.0167  ±0.0171  ±0.0172  ±0.0171  ±0.0171  ±0.0171  ±0.0169
LP1+ChiSq_Score    0.5509   0.5766   0.6259   0.6349   0.5618   0.5873   0.5648   0.5422
  95% CI           ±0.0171  ±0.0169  ±0.0165  ±0.0163  ±0.0169  ±0.0167  ±0.0169  ±0.0171
LP1+ChiSq_Rank     0.5509   0.5766   0.6284   0.6224   0.5652   0.5676   0.5548   0.5434
  95% CI           ±0.0171  ±0.0169  ±0.0165  ±0.0163  ±0.0169  ±0.0169  ±0.0169  ±0.0171
LP1+SLR_Rank       0.5509   0.5512   0.6118   0.6371   0.6012   0.6033   0.5562   0.5431
  95% CI           ±0.0171  ±0.0171  ±0.0165  ±0.0163  ±0.0165  ±0.0165  ±0.0169  ±0.0171

Table 31 - Classification AUC results for HT WTCCC data.

HT                 1        2        5        10       50       100      500      1000
ChiSq              0.5344   0.5359   0.5519   0.5530   0.5263   0.5362   0.5355   0.5400
  95% CI           ±0.0165  ±0.0165  ±0.0167  ±0.0165  ±0.0167  ±0.0165  ±0.0165  ±0.0163
SLR                0.5354   0.5463   0.5756   0.5885   0.5774   0.5671   -        -
  95% CI           ±0.0165  ±0.0167  ±0.0165  ±0.0163  ±0.0163  ±0.0165  -        -
LP3                0.5003   0.5066   0.4959   0.4904   0.5060   0.4921   0.4934   0.5021
  95% CI           ±0.0163  ±0.0163  ±0.0165  ±0.0165  ±0.0165  ±0.0165  ±0.0165  ±0.0167
LP1                0.5322   0.5481   0.5601   0.5711   0.5684   0.5391   0.5280   0.5378
  95% CI           ±0.0165  ±0.0165  ±0.0167  ±0.0165  ±0.0165  ±0.0165  ±0.0165  ±0.0165
LP1+GERP50         0.5322   0.5518   0.5625   0.5715   0.5430   0.5364   0.5368   0.5354
  95% CI           ±0.0165  ±0.0165  ±0.0167  ±0.0165  ±0.0167  ±0.0167  ±0.0167  ±0.0167
LP1+GERP500        0.5036   0.5092   0.4946   0.4742   0.4987   0.4905   0.5019   0.5118
  95% CI           ±0.0163  ±0.0163  ±0.0165  ±0.0165  ±0.0165  ±0.0165  ±0.0165  ±0.0167
LP1+MAF50          0.5322   0.5518   0.5522   0.5604   0.5497   0.5456   0.5325   0.5276
  95% CI           ±0.0165  ±0.0165  ±0.0167  ±0.0165  ±0.0165  ±0.0165  ±0.0165  ±0.0167
LP1+MAF500         0.5036   0.5092   0.5002   0.4841   0.4964   0.4921   0.5009   0.5017
  95% CI           ±0.0163  ±0.0163  ±0.0165  ±0.0165  ±0.0165  ±0.0165  ±0.0165  ±0.0167
LP1+ChiSq_Score    0.5301   0.5368   0.5546   0.5670   0.5407   0.5391   0.5280   0.5378
  95% CI           ±0.0165  ±0.0165  ±0.0167  ±0.0165  ±0.0167  ±0.0167  ±0.0167  ±0.0167
LP1+ChiSq_Rank     0.5322   0.5481   0.5501   0.5529   0.5548   0.5307   0.5344   0.5299
  95% CI           ±0.0165  ±0.0165  ±0.0167  ±0.0165  ±0.0165  ±0.0165  ±0.0165  ±0.0165
LP1+SLR_Rank       0.5352   0.5462   0.5661   0.5802   0.5715   0.5642   0.5318   0.5312
  95% CI           ±0.0165  ±0.0165  ±0.0164  ±0.0163  ±0.0163  ±0.0165  ±0.0165  ±0.0165

Table 32 - Classification AUC results for RA WTCCC data.

RA                 1        2        5        10       50       100      500      1000
ChiSq              0.5891   0.6313   0.7013   0.7273   0.6107   0.5884   0.5467   0.5434
  95% CI           ±0.0169  ±0.0165  ±0.0153  ±0.0149  ±0.0163  ±0.0165  ±0.0167  ±0.0167
SLR                0.5891   0.6159   0.6902   0.7382   0.7055   0.6711   -        -
  95% CI           ±0.0169  ±0.0169  ±0.0155  ±0.0145  ±0.0151  ±0.0157  -        -
LP3                0.5849   0.6109   0.6260   0.6352   0.5852   0.5719   0.5247   0.5133
  95% CI           ±0.0169  ±0.0169  ±0.0169  ±0.0167  ±0.0169  ±0.0169  ±0.0169  ±0.0167
LP1                0.5891   0.6313   0.7013   0.7284   0.6398   0.5946   0.5565   0.5531
  95% CI           ±0.0169  ±0.0165  ±0.0153  ±0.0149  ±0.0163  ±0.0165  ±0.0167  ±0.0167
LP1+GERP50         0.5891   0.6313   0.7028   0.7282   0.6377   0.6009   0.5547   0.5410
  95% CI           ±0.0169  ±0.0165  ±0.0153  ±0.0149  ±0.0163  ±0.0165  ±0.0165  ±0.0165
LP1+GERP500        0.5891   0.5950   0.6005   0.6654   0.5914   0.5698   0.5496   0.5588
  95% CI           ±0.0169  ±0.0167  ±0.0169  ±0.0159  ±0.0167  ±0.0167  ±0.0165  ±0.0167
LP1+MAF50          0.5891   0.6313   0.6969   0.7118   0.6291   0.6015   0.5436   0.5398
  95% CI           ±0.0169  ±0.0165  ±0.0153  ±0.0149  ±0.0163  ±0.0165  ±0.0165  ±0.0165
LP1+MAF500         0.5502   0.5680   0.5985   0.6254   0.5724   0.5448   0.5305   0.5418
  95% CI           ±0.0169  ±0.0167  ±0.0169  ±0.0159  ±0.0167  ±0.0167  ±0.0165  ±0.0167
LP1+ChiSq_Score    0.5891   0.6313   0.7013   0.7280   0.6218   0.5881   0.5538   0.5502
  95% CI           ±0.0169  ±0.0165  ±0.0153  ±0.0149  ±0.0163  ±0.0165  ±0.0167  ±0.0167
LP1+ChiSq_Rank     0.5891   0.6313   0.7013   0.7225   0.6010   0.5891   0.5322   0.5294
  95% CI           ±0.0169  ±0.0165  ±0.0153  ±0.0149  ±0.0163  ±0.0165  ±0.0167  ±0.0167
LP1+SLR_Rank       0.5891   0.6313   0.7007   0.7318   0.6623   0.6583   0.5646   0.5514
  95% CI           ±0.0169  ±0.0165  ±0.0153  ±0.0149  ±0.0153  ±0.0152  ±0.0167  ±0.0167

Table 33 - Classification AUC results for T1D WTCCC data.

T1D                1        2        5        10       50       100      500      1000
ChiSq              0.6777   0.7047   0.7448   0.7293   0.6562   0.5905   0.5733   0.5480
  95% CI           ±0.0149  ±0.0143  ±0.0135  ±0.0139  ±0.0153  ±0.0161  ±0.0163  ±0.0163
SLR                0.5007   0.5490   0.7346   0.7202   0.6841   0.6639   -        -
  95% CI           ±0.0163  ±0.0165  ±0.0137  ±0.0131  ±0.0139  ±0.0147  -        -
LP3                0.6093   0.6617   0.6891   0.6900   0.6678   0.6301   0.5461   0.5384
  95% CI           ±0.0157  ±0.0155  ±0.0151  ±0.0151  ±0.0153  ±0.0159  ±0.0165  ±0.0165
LP1                0.6863   0.7047   0.7448   0.7329   0.6704   0.6054   0.5614   0.5602
  95% CI           ±0.0147  ±0.0143  ±0.0135  ±0.0137  ±0.0151  ±0.0159  ±0.0163  ±0.0163
LP1+GERP50         0.6863   0.7047   0.7445   0.7306   0.6624   0.5983   0.5609   0.5585
  95% CI           ±0.0147  ±0.0143  ±0.0135  ±0.0139  ±0.0153  ±0.0161  ±0.0163  ±0.0163
LP1+GERP500        0.6777   0.7047   0.7093   0.7040   0.6383   0.6077   0.5553   0.5499
  95% CI           ±0.0149  ±0.0143  ±0.0141  ±0.0143  ±0.0155  ±0.0161  ±0.0165  ±0.0165
LP1+MAF50          0.6863   0.7047   0.7445   0.7296   0.6544   0.5838   0.5567   0.5445
  95% CI           ±0.0147  ±0.0143  ±0.0135  ±0.0139  ±0.0153  ±0.0161  ±0.0163  ±0.0163
LP1+MAF500         0.6213   0.6534   0.6875   0.6755   0.6212   0.6014   0.5448   0.5237
  95% CI           ±0.0147  ±0.0139  ±0.0139  ±0.0141  ±0.0153  ±0.0161  ±0.0165  ±0.0165
LP1+ChiSq_Score    0.6779   0.7047   0.7448   0.7301   0.6700   0.6045   0.5718   0.5542
  95% CI           ±0.0149  ±0.0143  ±0.0135  ±0.0137  ±0.0151  ±0.0159  ±0.0163  ±0.0163
LP1+ChiSq_Rank     0.6863   0.7047   0.7448   0.7316   0.6702   0.5984   0.5673   0.5519
  95% CI           ±0.0147  ±0.0143  ±0.0135  ±0.0137  ±0.0151  ±0.0159  ±0.0163  ±0.0163
LP1+SLR_Rank       0.6434   0.6627   0.7418   0.7311   0.6819   0.6547   0.6012   0.5631
  95% CI           ±0.0147  ±0.0143  ±0.0135  ±0.0137  ±0.0151  ±0.0147  ±0.0161  ±0.0163

Table 34 - Classification AUC results for T2D WTCCC data.

T2D                1        2        5        10       50       100      500      1000
ChiSq              0.5599   0.6498   0.7780   0.7329   0.5707   0.5611   0.5358   0.5201
  95% CI           ±0.0167  ±0.0157  ±0.0129  ±0.0141  ±0.0165  ±0.0163  ±0.0167  ±0.0167
SLR                0.4916   0.6015   0.6751   0.7424   0.7258   0.6671   -        -
  95% CI           ±0.0165  ±0.0159  ±0.0151  ±0.0139  ±0.0143  ±0.0155  -        -
LP3                0.5521   0.6091   0.6903   0.6907   0.6071   0.5773   0.5293   0.5232
  95% CI           ±0.0167  ±0.0167  ±0.0161  ±0.0161  ±0.0165  ±0.0165  ±0.0165  ±0.0165
LP1                0.6862   0.6862   0.7787   0.7622   0.5804   0.5614   0.5432   0.5506
  95% CI           ±0.0149  ±0.0149  ±0.0129  ±0.0135  ±0.0165  ±0.0165  ±0.0167  ±0.0165
LP1+GERP50         0.6862   0.6862   0.7739   0.7551   0.5703   0.5640   0.5492   0.5544
  95% CI           ±0.0149  ±0.0149  ±0.0131  ±0.0135  ±0.0165  ±0.0165  ±0.0165  ±0.0165
LP1+GERP500        0.6009   0.6862   0.7320   0.7573   0.5715   0.5455   0.5160   0.5162
  95% CI           ±0.0159  ±0.0149  ±0.0139  ±0.0135  ±0.0165  ±0.0165  ±0.0167  ±0.0165
LP1+MAF50          0.6862   0.6862   0.7669   0.7485   0.5693   0.5542   0.5313   0.5315
  95% CI           ±0.0149  ±0.0149  ±0.0131  ±0.0135  ±0.0165  ±0.0167  ±0.0167  ±0.0167
LP1+MAF500         0.5595   0.5786   0.6255   0.6548   0.5685   0.5326   0.5061   0.5081
  95% CI           ±0.0163  ±0.0157  ±0.0153  ±0.0141  ±0.0165  ±0.0167  ±0.0167  ±0.0165
LP1+ChiSq_Score    0.6862   0.6582   0.7783   0.7514   0.5794   0.5610   0.5412   0.5386
  95% CI           ±0.0149  ±0.0153  ±0.0129  ±0.0135  ±0.0165  ±0.0165  ±0.0167  ±0.0165
LP1+ChiSq_Rank     0.5832   0.6245   0.7546   0.7465   0.5771   0.5596   0.5483   0.5372
  95% CI           ±0.0164  ±0.0159  ±0.0129  ±0.0139  ±0.0165  ±0.0165  ±0.0167  ±0.0165
LP1+SLR_Rank       0.6641   0.6714   0.7603   0.7575   0.6853   0.6608   0.6035   0.5616
  95% CI           ±0.0152  ±0.0151  ±0.0129  ±0.0136  ±0.0150  ±0.0165  ±0.0163  ±0.0165
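The intervals in Tables 26 through 34 are reported as symmetric AUC ± half-width values. As an illustration only (the exact estimator is the one specified in the evaluation methodology, not confirmed here), a common normal-approximation form for a 95% AUC interval is the Hanley-McNeil construction:

\[
\mathrm{AUC} \pm 1.96\,\widehat{SE}, \qquad
\widehat{SE} = \sqrt{\frac{A(1-A) + (n_{+}-1)(Q_1 - A^2) + (n_{-}-1)(Q_2 - A^2)}{n_{+}\,n_{-}}},
\]

where \(A\) is the estimated AUC, \(n_{+}\) and \(n_{-}\) are the numbers of cases and controls, \(Q_1 = A/(2-A)\), and \(Q_2 = 2A^2/(1+A)\). Under this form, interval width shrinks as the AUC moves away from 0.5 and as sample size grows, which is consistent with the pattern of half-widths seen in the tables above.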

APPENDIX B

BIOLOGICAL VALIDITY RESULTS FOR GWAS DATASETS

This appendix contains the results of the biological validation experiment for the control algorithms (ChiSq, SWRF, SLR, and LP3) on the two LOAD datasets (TGen and ADRC), along with the chi squared results for the CD and HT WTCCC datasets. Each table (Tables 35 to 44) gives the top 25 SNPs as ranked by the corresponding algorithm, with the associated gene, chromosome, and literature reference. The LP1 biological validation results may be found in Section 5.1.3.2. SNPs with supporting literature evidence are noted in the Comment column.

Table 35 - Top 25 SNPs as ranked by LP3 (α = 0.25) for TGen data.

Rank  rsID         Gene         Chr  Comment
1     rs7412       APOE         19   APOE risk allele determined by rs7412 and rs429358 [155]
2     rs4420638    APOC         19   In strong linkage disequilibrium with APOE SNPs [156]
3     rs429358     APOE         19   APOE risk allele determined by rs7412 and rs429358 [155]
4     rs10824310   PRKG1        10   Significant association with LOAD [55]
5     rs12162084   -            16   Significant association with LOAD [157]
6     rs17330779   NRCAM        7    Associated with axonal degeneration in LOAD [158]
7     rs7077757    RBM20        10   Meta-analysis of multiple studies showed association [159]
8     rs10115381   -            9    -
9     rs4356530    -            17   Association found in another analysis of TGen data [59]
10    rs6717497    -            2    -
11    rs2913719    -            5    Association in systematic meta-analysis of AD [69]
12    rs12476792   -            2    -
13    rs17169622   BMPER        7    -
14    rs1038891    LRRC4C       11   SNP associated with LOAD in genome-wide analysis [61]
15    rs10499687   VWC2         7    -
16    rs7335085    -            13   -
17    rs16974268   SLCO3A1      15   -
18    rs10996618   -            10   SNP selected in logistic regression analysis [62]
19    rs950922     ALPL         1    -
20    rs17151710   -            5    Found in meta-analysis of 3 studies [160]
21    rs9934599    IL34         16   -
22    rs473367     -            9    SNP may interact with APOE to affect LOAD [161]
23    rs4862146    -            4    Glycoprotein buildup affects nerve cells in the brain [162]
24    rs6013406    ZFP64        20   -
25    rs1712417    TMEM87A      15   -

Table 36 - Top 25 SNPs as ranked by LP3 (α = 0.25) for ADRC data.

Rank  rsID         Gene         Chr  Comment
1     rs439401     APOE         19   In strong LD with rs7412 and rs429358 [63]
2     rs5157       APOC4        19   In strong LD with other APOC risk SNPs [64]
3     rs2075650    TOMM40       19   Predictive of longevity of LOAD patients [163, 164]
4     rs445925     -            19   Showed LOAD association in African-American cohort [65]
5     rs157182     ZNF433       19   Other zinc finger proteins linked to LOAD [165]
6     rs283129     PIN1         5    PIN1 linked to neural apoptosis in LOAD [166, 167]
7     rs17428956   -            1    -
8     rs11076978   -            16   -
9     rs5749272    NDRG1        22   NDRG family linked to neuron development, LOAD [168, 169]
10    rs6754487    -            2    -
11    rs439401     -            19   Near APOE, associated with LOAD [170]
12    rs3738269    IGFN1        1    -
13    rs10106829   LOC157273    8    -
14    rs12520115   -            5    -
15    rs17018886   -            2    -
16    rs17821171   -            15   -
17    rs523079     -            3    -
18    rs2314221    -            2    -
19    rs13059988   -            3    -
20    rs356611     -            5    -
21    rs10976056   KDM4C        9    -
22    rs10489926   PRG5         1    Brain-specific protein linked to axonal health [171]
23    rs10489924   PRG5         1    Brain-specific protein linked to axonal health [171]
24    rs2712599    -            12   -
25    rs10459209   -            12   -

Table 37 - Top 25 SNPs as ranked by SLR for TGen data.

Rank  rsID         Gene         Chr  Comment
1     rs7412       APOE         19   APOE risk allele determined by rs7412 and rs429358 [155]
2     rs429358     APOE         19   APOE risk allele determined by rs7412 and rs429358 [155]
3     rs7662187    PDGFC        4    -
4     rs10778921   TMTC2        12   -
5     rs7335085    -            13   -
6     rs12162084   -            16   Association found in another analysis of TGen data [157]
7     rs16923249   -            9    -
8     rs6508182    DCC          18   Implicated in axonal development [172]
9     rs4902299    -            14   -
10    rs10510990   -            3    -
11    rs16916338   GABBR2       9    Gene involved in neurotransmitters [10]
12    rs10894424   NTM          11   Gene implicated in LOAD [173]
13    rs1728390    -            16   -
14    rs4351927    GPC5         13   Involved in neuronal development [174]
15    rs16907781   ZBTB10       8    -
16    rs10871528   -            18   -
17    rs7848622    -            9    -
18    rs11846241   EML5         14   -
19    rs17044664   -            3    -
20    rs6540253    -            16   -
21    rs10176594   -            2    -
22    rs7243005    -            18   -
23    rs10740667   -            10   -
24    rs10966006   -            9    -
25    rs6824979    MMRN1        4    -

Table 38 - Top 25 SNPs as ranked by SLR for ADRC data.

Rank  rsID         Gene         Chr  Comment
1     rs429358     APOE         19   APOE risk allele determined by rs7412 and rs429358 [155]
2     rs4420638    APOC         19   In strong linkage disequilibrium with APOE SNPs [156]
3     rs7412       APOE         19   APOE risk allele determined by rs7412 and rs429358 [53]
4     rs8083752    LOC643542    18   -
5     rs1978326    MAGI2        7    Gene associated with hippocampal volume reduction in AD [170]
6     rs6015314    APCDD1L-AS1  20   -
7     rs17767748   BTRC         10   -
8     rs11695991   NEU2         2    -
9     rs7210298    -            17   -
10    rs12190755   ZNF318       6    Gene expression level linked to AD [175]
11    rs7606208    SLC9A2       2    -
12    rs9932776    -            16   -
13    rs11680648   DIRC3        2    -
14    rs12100042   -            13   -
15    rs12257119   MYO3A        10   -
16    rs4147209    -            1    -
17    rs17099379   SYT16        14   -
18    rs7009155    -            -    -
19    rs801289     -            2    -
20    rs10862184   MYF5         12   -
21    rs16846388   SPATA16      3    -
22    rs9299784    KIAA1217     10   -
23    rs1759320    -            10   -
24    rs10507341   -            13   -
25    rs2276754    CCDC174      3    -

Table 39 - Top 25 SNPs as ranked by SWRF for TGen data.

Rank  rsID         Gene         Chr  Comment
1     rs7412       APOE         19   APOE risk allele determined by rs7412 and rs429358 [155]
2     rs250857     FSTL4        5    -
3     rs9328529    -            9    -
4     rs934745     MAPK4        18   -
5     rs11077058   RBFOX1       16   RBFOX1 linked to brain volume in older adults [176]
6     rs17124810   CBFA2T2      20   -
7     rs1251059    -            12   -
8     rs9908065    -            17   -
9     rs13213247   -            6    Significant association in meta-analysis of LOAD [177]
10    rs8112622    -            19   -
11    rs2779556    GABBR2       9    Gene involved in neurological pathways [178]
12    rs188429     RCL1         9    -
13    rs8108780    -            19   -
14    rs2796460    TLE1         9    -
15    rs16910463   -            9    -
16    rs16915130   GRM5         11   Gene is a coreceptor for LOAD-related protein [179]
17    rs16967491   -            15   -
18    rs200556     -            9    -
19    rs250855     FSTL4        5    -
20    rs4394475    -            9    -
21    rs10454604   -            13   -
22    rs8006542    FOXN3        14   -
23    rs865505     -            12   -
24    rs2712271    -            1    -
25    rs17141368   -            7    -

Table 40 - Top 25 SNPs as ranked by SWRF for ADRC data.

Rank  rsID         Gene         Chr  Comment
1     rs439401     -            19   Significant LOAD association [63]
2     rs445925     -            19   Located between APOE and APOC genes [68]
3     rs6434513    -            2    -
4     rs17245472   -            16   -
5     rs182662     RAB23        6    -
6     rs4494677    -            2    -
7     rs16906827   -            10   -
8     rs11108379   LTA4H        12   -
9     rs12683673   KDM4C        9    -
10    rs2712599    -            12   -
11    rs9297095    -            6    -
12    rs4270681    -            5    -
13    rs12592188   -            15   -
14    rs2442968    -            18   -
15    rs2442966    -            18   -
16    rs1834804    -            14   -
17    rs16963657   -            13   -
18    rs11820815   -            11   -
19    rs1892786    -            11   -
20    rs7004779    KCNK9        8    -
21    rs6669982    NFIA         1    -
22    rs9493552    -            6    -
23    rs11004700   -            10   -
24    rs6678065    NFIA         1    -
25    rs10095543   -            8    -

Table 41 - Top 25 SNPs as ranked by chi squared for TGen data.

Rank  rsID         Gene         Chr  Comment
1     rs4420638    APOC         19   In strong linkage disequilibrium with APOE SNPs [156]
2     rs7412       APOE         19   APOE risk allele determined by rs7412 and rs429358 [155]
3     rs934745     MAPK4        18   -
4     rs429358     APOE         19   APOE risk allele determined by rs7412 and rs429358 [155]
5     rs7079348    C10orf11     10   -
6     rs188429     RCL1         9    -
7     rs10824310   PRKG1        10   Significant association with LOAD [55]
8     rs6717497    -            2    -
9     rs16938663   STAU2        8    -
10    rs3732443    GXYLT2       3    -
11    rs6453333    -            5    -
12    rs12041702   -            1    -
13    rs17048190   -            2    -
14    rs2968848    -            7    -
15    rs16909497   -            10   -
16    rs10499687   VWC2         7    -
17    rs17169622   BMPER        7    -
18    rs41479848   MBIP         14   Involved in LOAD-associated MAPK pathway [180]
19    rs3007246    -            13   -
20    rs6429224    RGS7         1    Gene involved in brain signaling [181]
21    rs10845804   -            12   -
22    rs12109727   -            5    -
23    rs12476792   -            2    -
24    rs6455005    -            6    -
25    rs11804140   FBXO28       1    -

Table 42 - Top 25 SNPs as ranked by chi squared for ADRC data.

Rank  rsID         Gene         Chr  Comment
1     rs429358     APOE         19   APOE risk allele determined by rs7412 and rs429358 [155]
2     rs4420638    APOC         19   In strong linkage disequilibrium with APOE SNPs [156]
3     rs157582     TOMM40       19   Showed LOAD association in African-American cohort [65]
4     rs2075650    APOE4        19   Predictive of longevity of LOAD patients [163, 164]
5     rs7412       APOE         19   APOE risk allele determined by rs7412 and rs429358 [155]
6     rs405509     APOE         19   APOE promoter varies LOAD risk [70]
7     rs8106922    TOMM40       19   Meta-analysis finds significant association with LOAD [69]
8     rs26845      ECI1         16   -
9     rs12507679   STAP1        4    -
10    rs13132585   STAP1        4    -
11    rs157580     TOMM40       19   Associated with LOAD in Chinese population [71]
12    rs4496012    -            13   -
13    rs8082842    RAB31        18   Gene involved in potential treatment [75]
14    rs9487940    -            6    -
15    rs34276      ACACB        12   -
16    rs4865859    -            5    -
17    rs7985095    -            13   -
18    rs16976268   -            18   -
19    rs4796922    -            18   -
20    rs16841336   PYHIN1       1    -
21    rs832156     IGFN1        1    -
22    rs9438881    -            1    -
23    rs283129     PIN1         5    PIN1 linked to neural apoptosis in LOAD [166, 167]
24    rs4480661    -            13   -
25    rs11985315   TRAPPC9      8    -

Table 43 - Top 25 SNPs as ranked by chi squared for CD data.

Rank  rsID         Gene         Chr  Comment
1     rs2076756    NOD2         16   Associated in independent study [102]
2     rs10210302   ATG16L1      2    Defects in gene cause susceptibility to IBD, sex-related risk for CD [103]
3     rs6752107    ATG16L1      2    Defects in gene cause susceptibility to IBD, sex-related risk for CD [103]
4     rs6431654    ATG16L1      2    Defects in gene cause susceptibility to IBD, sex-related risk for CD [103]
5     rs3828309    ATG16L1      2    Defects in gene cause susceptibility to IBD, sex-related risk for CD [103, 104]
6     rs17234657   -            5    WTCCC finding replicated in independent cohorts [105, 106]
7     rs2066843    NOD2         16   Associated with CD in independent study [102]
8     rs3792106    ATG16L1      2    Defects in gene cause susceptibility to IBD, sex-related risk for CD [103, 104]
9     rs11805303   IL23R        1    IL23 implicated in CD [107]
10    rs11957215   -            5    -
11    rs9292777    -            5    WTCCC SNP replicated in independent study [108]
12    rs10489629   IL23R        1    Replicated in multiple studies [102, 109, 110]
13    rs17221417   NOD2         16   NOD2 implicated in CD [102]
14    rs2201841    IL23R        1    SNP implicated in distinct populations [111]
15    rs4957295    -            5    -
16    rs10213846   -            5    -
17    rs6871834    -            5    -
18    rs4957297    -            5    Replicated independent of WTCCC [112]
19    rs4957300    -            5    -
20    rs16869934   -            5    Discovered in BIC analysis of WTCCC [113]
21    rs12119179   -            1    Associated with a disease with similar genetic profile [114]
22    rs11209033   -            1    Cited in patent for testing for autoimmune-associated polymorphisms [115]
23    rs10512734   -            5    SNP validated in independent study [182]
24    rs7546245    -            1    In moderate LD with rs11805303 [107]
25    rs41396545   IL23R        1    Discovered in BIC analysis of WTCCC [113]

Table 44 - Top 25 SNPs as ranked by chi squared for HT data.

Rank  rsID         Gene         Chr  Comment
1     rs4765066    -            12   -
2     rs488101     -            9    Associated with arterial plaque [116]
3     rs4867173    -            5    -
4     rs11782342   KCNB2        8    Discovered as part of epistatic interactions in WTCCC data [86]
5     rs11024327   OTOG         11   Found in combined analysis of WTCCC data [117]
6     rs2820037    -            1    SNP associated with BP regulation [119]
7     rs2790622    -            1    -
8     rs2820038    -            1    -
9     rs6574988    -            14   -
10    rs2820046    -            1    -
11    rs16945811   YWHAE        17   Gene implicated in HT [120]
12    rs9428826    -            1    -
13    rs2398162    NR2F2-AS1    15   Population-specific association has been replicated [121]
14    rs2820026    -            1    -
15    rs921535     -            15   -
16    rs10889923   NEGR1        1    -
17    rs2191003    -            4    -
18    rs300916     GAB1         4    Discovered in combined WTCCC + Australian cohort [117]
19    rs1935683    -            6    -
20    rs13119672   PPARGC1A     4    Gene associated with HT [124]
21    rs11110912   MYBPC1       12   WTCCC SNP replicated in independent study [119]
22    rs2840584    SLAMF9       1    -
23    rs633568     -            11   -
24    rs1036392    -            2    -
25    rs973009     ACTN4        4    -

BIBLIOGRAPHY

1. Korosok MR, Wei WH, Farrell PM: The incidence of cystic fibrosis. Statistics in Medicine 1996, 15:449-462.
2. Shriner D, Adeyemo A, Gerry NP, Herbert A, Chen G, Doumatey A, Huang H, Zhou J, Christman MF, Rotimi CN: Transferability and fine-mapping of genome-wide associated loci for adult height across human populations. PLoS ONE 2009, 4:e8398.
3. Waters KM, Stram DO, Hassanein MT, Le Marchand L, Wilkens LR, Maskarinec G, Monroe KR, Kolonel LN, Altshuler D, Henderson BE: Consistent association of type 2 diabetes risk variants found in Europeans in diverse racial and ethnic groups. PLoS Genetics 2010, 6:e1001078.
4. Gorlov IP, Gorlova OY, Sunyaev SR, Spitz MR, Amos CI: Shifting paradigm of association studies: value of rare single-nucleotide polymorphisms. Am J Hum Genet 2008, 82:100-112.
5. Wray NR, Purcell SM, Visscher PM: Synthetic associations created by rare variants do not explain most GWAS results. PLoS Biol 2011, 9:e1000579.
6. Helfand BT, Fought AJ, Loeb S, Meeks JJ, Kan D, Catalona WJ: Genetic prostate cancer risk assessment: common variants in 9 genomic regions are associated with cumulative risk. J Urol 2010, 184:501-505.
7. Schork NJ, Murray SS, Frazer KA, Topol EJ: Common vs. rare allele hypotheses for complex diseases. Curr Opin Genet Dev 2009, 19:212-219.
8. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) [www.genome.gov/sequencingcosts]
9. Mailman MD, Feolo M, Jin Y, Kimura M, Tryka K, Bagoutdinov R, Hao L, Kiang A, Paschall J, Phan L, et al: The NCBI dbGaP database of genotypes and phenotypes. Nat Genet 2007, 39:1181-1186.
10. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001, 29:308-311.
11. Cardon LR, Palmer LJ: Population stratification and spurious allelic association. Lancet 2003, 361:598-604.
12. Masaeli M, Dy JG, Fung GM: From transformation-based dimensionality reduction to feature selection. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010: 751-758.
13. Zhu X, Ghahramani Z: Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University; 2002.

14. Saeys Y, Inza I, Larranaga P: A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23:2507-2517.
15. Kamboh MI, Demirci FY, Wang X, Minster RL, Carrasquillo MM, Pankratz VS, Younkin SG, Saykin AJ, Jun G, Baldwin C, et al: Genome-wide association study of Alzheimer's disease. Transl Psychiatry 2012, 2:e117.
16. Wei W, Visweswaran S, Cooper GF: The application of naive Bayes model averaging to predict Alzheimer's disease from genome-wide data. J Am Med Inform Assoc 2011, 18:370-375.
17. Kira K, Rendell L: A practical approach to feature selection. In ML92: Proceedings of the Ninth International Workshop on Machine Learning. Morgan Kaufmann Publishers Inc.; 1992: 249-256.
18. Greene CS, Penrod NM, Kiralis J, Moore JH: Spatially uniform ReliefF (SURF) for computationally-efficient filtering of gene-gene interactions. BioData Mining 2009, 2:5.
19. Greene C, Himmelstein D, Kiralis J, Moore J: The Informative Extremes: Using Both Nearest and Farthest Individuals Can Improve Relief Algorithms in the Domain of Human Genetics. In Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics. Volume 6023. Edited by Pizzuti C, Ritchie M, Giacobini M. Springer Berlin / Heidelberg; 2010: 182-193. [Lecture Notes in Computer Science]
20. Stokes M, Visweswaran S: Application of a spatially-weighted Relief algorithm for ranking genetic predictors of disease. BioData Mining 2012, 5:20.
21. Mooney M, Wilmot B, The Bipolar Genome Study, McWeeney S: The GA and the GWAS: Using Genetic Algorithms to Search for Multilocus Associations. IEEE/ACM Trans Comput Biol Bioinformatics 2012, 9:899-910.
22. Nolan VG, Sebastiani P, Baldwin C, Wyszynski DF, Farrer LA, Steinberg MH: Modeling genetic polymorphisms and sickle cell associated vasoocclusive events using classification and regression trees (CART). Annals of Epidemiology 2005, 15:644.
23. Winham S, Colby C, Freimuth R, Wang X, de Andrade M, Huebner M, Biernacka J: SNP interaction detection with Random Forests in high-dimensional genetic data. BMC Bioinformatics 2012, 13:1-13.
24. Goldstein BA, Hubbard AE, Cutler A, Barcellos LF: An application of Random Forests to a genome-wide association dataset: methodological considerations & new findings.
25. Duan KB, Rajapakse JC, Wang H, Azuaje F: Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Trans Nanobioscience 2005, 4:228-234.
26. Tang EK, Suganthan P, Yao X: Gene selection algorithms for microarray data based on least squares support vector machine. BMC Bioinformatics 2006, 7:95.
27. Chen L, Yu G, Langefeld CD, Miller DJ, Guy RT, Raghuram J, Yuan X, Herrington DM, Wang Y: Comparative analysis of methods for detecting interacting loci. BMC Genomics 2011, 12:344.
28. Yang C, He Z, Wan X, Yang Q, Xue H, Yu W: SNPHarvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies. Bioinformatics 2009, 25:504-511.

29. Miller DJ, Zhang Y, Yu G, Liu Y, Chen L, Langefeld CD, Herrington D, Wang Y: An algorithm for learning maximum entropy probability models of disease risk that efficiently searches and sparingly encodes multilocus genomic interactions. Bioinformatics 2009, 25:2478-2485.
30. Hahn LW, Ritchie MD, Moore JH: Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics 2003, 19:376-382.
31. Slavkov I, Zenko B, Dzeroski S: Evaluation Method for Feature Rankings and their Aggregations for Biomarker Discovery. 2010.
32. Prati RC: Combining feature ranking algorithms through rank aggregation.
33. Renda ME, Straccia U: Web metasearch: rank vs. score based rank aggregation methods. ACM; 2003: 841-846.
34. Hwang T, Sicotte H, Tian Z, Wu B, Kocher JP, Wigle DA, Kumar V, Kuang R: Robust and efficient identification of biomarkers by classifying features on graphs. Bioinformatics 2008, 24:2023-2029.
35. Mostafavi S, Ray D, Warde-Farley D, Grouios C, Morris Q: GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biology 2008, 9:S4.
36. Teramoto R: Prediction of Alzheimer's diagnosis using semi-supervised distance metric learning with label propagation. Comput Biol Chem 2008, 32:438-441.
37. Zhou D, Bousquet O, Lal T, Weston J, Scholkopf B: Learning with local and global consistency. In Advances in Neural Information Processing Systems 16. 2004.
38. Belkin M, Niyogi P: Using manifold structure for partially labeled classification. In Advances in Neural Information Processing Systems. 2002: 929-936.
39. Belkin M, Niyogi P: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 2003, 15:1373-1396.
40. Belkin M, Matveeva I, Niyogi P: Regularization and Semi-supervised Learning on Large Graphs. In Learning Theory. Volume 3120. Edited by Shawe-Taylor J, Singer Y. Springer Berlin Heidelberg; 2004: 624-638. [Lecture Notes in Computer Science]
41. The International HapMap Project. Nature 2003, 426:789-796.
42. Kumar P, Henikoff S, Ng PC: Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc 2009, 4:1073-1081.
43. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR: A method and server for predicting damaging missense mutations. Nat Methods 2010, 7:248-249.
44. Hubisz MJ, Pollard KS, Siepel A: PHAST and RPHAST: phylogenetic analysis with space/time models. Briefings in Bioinformatics 2011, 12:41-51.
45. Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S: Identifying a high fraction of the human genome to be under selective constraint using GERP++.
46. Ghosh S, Bickeböller H, Bailey J, Bailey-Wilson JE, Cantor R, Culverhouse R, Daw W, DeStefano AL, Engelman CD, Hinrichs A: Identifying rare variants from exome scans: the GAW17 experience. In BMC Proceedings. BioMed Central Ltd; 2011: S1.

47. Almasy L, Dyer TD, Peralta JM, Kent JW, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. In BMC Proceedings. BioMed Central Ltd; 2011: S2.
48. Reiman EM, Webster JA, Myers AJ, Hardy J, Dunckley T, Zismann VL, Joshipura KD, Pearson JV, Hu-Lince D, Huentelman MJ, et al: GAB2 Alleles Modify Alzheimer's Risk in APOE e4 Carriers. 2007, 54:713-720.
49. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, De Bakker PI, Daly MJ: PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 2007, 81:559-575.
50. Cariaso M, Lennon G: SNPedia: a wiki supporting personal genome annotation, interpretation and analysis. Nucleic Acids Res 2012, 40:D1308-1312.
51. Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D: GeneCards: integrating information about genes, proteins and diseases. Trends Genet 1997, 13:163.
52. Yamashita O, Sato MA, Yoshioka T, Tong F, Kamitani Y: Sparse estimation automatically selects voxels relevant for the decoding of fMRI activity patterns. Neuroimage 2008, 42:1414-1429.
53. Izaks GJ, Gansevoort RT, van der Knaap AM, Navis G, Dullaart RP, Slaets JP: The association of APOE genotype with cognitive function in persons aged 35 years or older. PLoS ONE 2011, 6:e27415.
54. Bertram L, Lange C, Mullin K, Parkinson M, Hsiao M, Hogan MF, Schjeide BM, Hooli B, Divito J, Ionita I, et al: Genome-wide association analysis reveals putative Alzheimer's disease susceptibility loci in addition to APOE. Am J Hum Genet 2008, 83:623-632.
55. Fallin M, Szymanski M, Wang R, Gherman A, Bassett S, Avramopoulos D: Fine mapping of the chromosome 10q11-q21 linkage region in Alzheimer's disease cases and controls. Neurogenetics 2010, 11:335-348.
56. Shioya M, Obayashi S, Tabunoki H, Arima K, Saito Y, Ishida T, Satoh J: Aberrant microRNA expression in the brains of neurodegenerative diseases: miR-29a decreased in Alzheimer disease brains targets neurone navigator 3. Neuropathol Appl Neurobiol 2010, 36:320-330.
57. Hu W, Chen-Plotkin A, Arnold S, Grossman M, Clark C, Shaw L, Pickering E, Kuhn M, Chen Y, McCluskey L, et al: Novel CSF biomarkers for Alzheimer's disease and mild cognitive impairment. Acta Neuropathol 2010, 119:669-678.
58. Shi H, Medway C, Bullock J, Brown K, Kalsheker N, Morgan K: Analysis of Genome-Wide Association Study (GWAS) data looking for replicating signals in Alzheimer's disease (AD). Int J Mol Epidemiol Genet 2010, 1:53-66.
59. Jiang X, Barmada MM, Becich MJ: Evaluating De Novo Locus-Disease Discoveries in GWAS Using the Signal-to-Noise Ratio. AMIA Annu Symp Proc 2011, 2011:617-624.
60. Wei W, Visweswaran S, Cooper G: The application of naive Bayes model averaging to predict Alzheimer's disease from genome-wide data. Journal of the American Medical Informatics Association 2011, 18:370-375.

61. Liu W, Ding J, Gibbs JR, Wang SJ, Hardy J, Singleton A: A simple and efficient algorithm for genome-wide homozygosity analysis in disease. Mol Syst Biol 2009, 5:304.
62. Briones N, Dinu V: Data mining of high density genomic variant data for prediction of Alzheimer's disease risk. BMC Med Genet 2012, 13:7.
63. Abraham R, Moskvina V, Sims R, Hollingworth P, Morgan A, Georgieva L, Dowzell K, Cichon S, Hillmer AM, O'Donovan MC, et al: A genome-wide association study for late-onset Alzheimer's disease using DNA pooling. BMC Med Genomics 2008, 1:44.
64. Cervantes S, Samaranch L, Vidal-Taboada JM, Lamet I, Bullido MJ, Frank-García A, Coria F, Lleó A, Clarimón J, Lorenzo E, et al: Genetic variation in APOE cluster region and Alzheimer's disease risk. Neurobiology of Aging 2011, 32:2107.e7-2107.e17.
65. Logue MW, Schu M, Vardarajan BN, et al: A comprehensive genetic association study of Alzheimer disease in African Americans. Archives of Neurology 2011, 68:1569-1579.
66. Shi H, Belbin O, Medway C, Brown K, Kalsheker N, Carrasquillo M, Proitsi P, Powell J, Lovestone S, Goate A, et al: Genetic variants influencing human aging from late-onset Alzheimer's disease (LOAD) genome-wide association studies (GWAS). Neurobiol Aging 2012, 33:1849.e5-18.
67. Deelen J, Beekman M, Uh HW, Helmer Q, Kuningas M, Christiansen L, Kremer D, van der Breggen R, Suchiman HE, Lakenberg N, van den Akker EB, et al: Genome-wide association study identifies a single major locus contributing to survival into old age; the APOE locus revisited. Aging Cell 2011, 10:686-698.
68. Jun G, Vardarajan BN, Buros J, Yu C, Hawk MV, Dombroski BA, Crane PK, Larson EB, Mayeux R, Haines JL, et al: Comprehensive Search for Alzheimer Disease Susceptibility Loci in the APOE Region. Arch Neurol 2012:1-10.
69. The AlzGene Database [http://www.alzgene.org]
70. Bizzarro A, Seripa D, Acciarri A, Matera MG, Pilotto A, Tiziano FD, Brache C, et al: The complex interaction between APOE promoter and AD: an Italian case study. European Journal of Human Genetics 2009, 17:7.
71. Ma XY, Yu JT, Wang W, Wang HF, Liu QY, Zhang W, Tan L: Association of TOMM40 polymorphisms with late-onset Alzheimer's disease in a Northern Han Chinese population. Neuromolecular Med 2013, 15:279-287.
72. Vitali M, Venturelli E, Galimberti D, Benerini Gatta L, Scarpini E, Finazzi D: Analysis of the genes coding for subunit 10 and 15 of cytochrome c oxidase in Alzheimer's disease. J Neural Transm 2009, 116:1635-1641.
73. Reitz C, Tosto G, Vardarajan B, Rogaeva E, Ghani M, Rogers RS, Conrad C, Haines JL, Pericak-Vance MA, Fallin MD, et al: Independent and epistatic effects of variants in VPS10-d receptors on Alzheimer disease risk and processing of the amyloid precursor protein (APP). Transl Psychiatry 2013, 3:e256.
74. Hooli BV, Kovacs-Vajna ZM, Mullin K, Blumenthal MA, Mattheisen M, Zhang C, Lange C, Mohapatra G, Bertram L, Tanzi RE: Rare autosomal copy number variations in early-onset familial Alzheimer's disease. Mol Psychiatry 2014, 19:676-681.

75. Polhner J: Diagnostic and therapeutic use of a rab family GTP-binding protein for neurodegenerative diseases. Patent, Evotec Neurosciences GmbH (Hamburg, DE); 2006.
76. Botta V: A walk into random forests: adaptation and application to Genome-Wide Association Studies. 2013.
77. Pirooznia M, Seifuddin F, Judy J, Goes FS, Potash JB, Zandi PP: Metamoodics: meta-analysis and bioinformatics resource for mood disorders. Mol Psychiatry 2014, 19:748-749.
78. Frank B, Niesler B, Nöthen MM, Neidt H, Propping P, Bondy B, Rietschel M, Maier W, Albus M, Rappold G: Investigation of the human serotonin receptor gene HTR3B in bipolar affective and schizophrenic patients. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics 2004, 131:1-5.
79. Lee M, Chen C, Lee C, Chen C, Chong M, Ouyang W, Chiu N, Chuo L, Chen C, Tan H: Genome-wide association study of bipolar I disorder in the Han Chinese population. Molecular Psychiatry 2010, 16:548-556.
80. Jiang Y, Zhang H: Propensity score-based nonparametric test revealing genetic variants underlying bipolar disorder. Genet Epidemiol 2011, 35:125-132.
81. Tesli M, Athanasiu L, Mattingsdal M, Kähler AK, Gustafsson O, Andreassen BK, Werge T, Hansen T, Mors O, Mellerup E: Association analysis of PALB2 and BRCA2 in bipolar disorder and schizophrenia in a Scandinavian case-control sample. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics 2010, 153:1276-1282.
82. Amano K, Yamada K, Iwayama Y, Detera-Wadleigh SD, Hattori E, Toyota T, Tokunaga K, Yoshikawa T, Yamakawa K: Association study between the Down syndrome cell adhesion molecule (DSCAM) gene and bipolar disorder. Psychiatr Genet 2008, 18:1-10.
83. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007, 447:661-678.
84. Drago A, Giegling I, Schafer M, Hartmann AM, Friedl M, Konte B, Moller HJ, De Ronchi D, Stassen HH, Serretti A, Rujescu D: AKAP13, CACNA1, GRIK4 and GRIA1 genetic variations may be associated with haloperidol efficacy during acute treatment. Eur Neuropsychopharmacol 2013, 23:887-894.
85. Ollila HM, Soronen P, Silander K, Palo OM, Kieseppa T, Kaunisto MA, Lonnqvist J, Peltonen L, Partonen T, Paunio T: Findings from bipolar disorder genome-wide association studies replicate in a Finnish bipolar family-cohort. Mol Psychiatry 2009, 14:351-353.
86. Goudey B, Rawlinson D, Wang Q, Shi F, Ferra H, Campbell RM, Stern L, Inouye MT, Ong CS, Kowalczyk A: GWIS--model-free, fast and exhaustive search for epistatic interactions in case-control GWAS. BMC Genomics 2013, 14 Suppl 3:S10.
87. Belzeaux R, Bergon A, Jeanjean V, Loriod B, Formisano-Treziny C, Verrier L, Loundou A, Baumstarck-Barrau K, Boyer L, Gall V, et al: Responder and nonresponder patients exhibit different peripheral transcriptional signatures during major depressive episode. Transl Psychiatry 2012, 2:e185.

88. de las Fuentes L, Yang W, Dávila-Román VG, Gu CC: Pathway-based genome-wide association analysis of coronary heart disease identifies biologically important gene sets. European Journal of Human Genetics 2012, 20:1168-1173.
89. Iida K, Hidaka K, Takeuchi M, Nakayama M, Yutani C, Mukai T, Morisaki T: Expression of MEF2 genes during human cardiac development. Tohoku J Exp Med 1999, 187:15-23.
90. Ellsworth DL, Croft DT, Weyandt J, Sturtz LA, Blackburn HL, Burke A, Haberkorn MJ, McDyer FA, Jellema GL, van Laar R: Intensive Cardiovascular Risk Reduction Induces Sustainable Changes in Expression of Genes and Pathways Important to Vascular Function. Circulation: Cardiovascular Genetics 2014:CIRCGENETICS.113.000121.
91. den Hoed M, Eijgelsheim M, Esko T, Brundel BJ, Peal DS, Evans DM, Nolte IM, Segrè AV, Holm H, Handsaker RE: Identification of heart rate-associated loci and their effects on cardiac conduction and rhythm disorders. Nature Genetics 2013, 45:621-631.
92. Brinza D, Schultz M, Tesler G, Bafna V: RAPID detection of gene-gene interactions in genome-wide association studies. Bioinformatics 2010, 26:2856-2862.
93. Samani NJ, Erdmann J, Hall AS, Hengstenberg C, Mangino M, Mayer B, Dixon RJ, Meitinger T, Braund P, Wichmann HE, et al: Genomewide association analysis of coronary artery disease. N Engl J Med 2007, 357:443-453.
94. Hinohara K, Nakajima T, Takahashi M, Hohda S, Sasaoka T, Nakahara K, Chida K, Sawabe M, Arimura T, Sato A, et al: Replication of the association between a chromosome 9p21 polymorphism and coronary artery disease in Japanese and Korean populations. J Hum Genet 2008, 53:357-359.
95. Lippert C, Listgarten J, Davidson RI, Baxter J, Poon H, Kadie CM, Heckerman D: An Exhaustive Epistatic SNP Association Analysis on Expanded Wellcome Trust Data. Sci Rep 2013, 3.
96. Baudhuin LM: Genetics of coronary artery disease: focus on genome-wide association studies. Am J Transl Res 2009, 1:221-234.
97. Yan Y, Hu Y, North KE, Franceschini N, Lin D: Evaluation of population impact of candidate polymorphisms for coronary heart disease in the Framingham Heart Study Offspring Cohort. BMC Proc 2009, 3 Suppl 7:S118.
98. Li X, Huang Y, Yin D, Wang D, Xu C, Wang F, Yang Q, Wang X, Li S, Chen S, et al: Meta-analysis identifies robust association between SNP rs17465637 in MIA3 on chromosome 1q41 and coronary artery disease. Atherosclerosis 2013, 231:136-140.
99. Li MX, Gui HS, Kwan JS, Sham PC: GATES: a rapid and powerful gene-based association test using extended Simes procedure. Am J Hum Genet 2011, 88:283-293.
100. Gallagher G, Brazaitis J, Yu R: The Crohn's Disease protective SNP rs11209026 mediates alternative splicing in human IL23R. The Journal of Immunology 2010, 184:51.14.
101. Lacher M, Schroepf S, Helmbrecht J, von Schweinitz D, Ballauff A, Koch I, Lohse P, Osterrieder S, Kappler R, Koletzko S: Association of the interleukin-23 receptor gene variant rs11209026 with Crohn's disease in German children. Acta Paediatr 2010, 99:727-733.
102. Glas J, Seiderer J, Wetzke M, Konrad A, Torok HP, Schmechel S, Tonenchi L, Grassl C, Dambacher J, Pfennig S, et al: rs1004819 is the main disease-associated IL23R variant in German Crohn's disease patients: combined analysis of IL23R, CARD15, and OCTN1/2 variants. PLoS ONE 2007, 2:e819.
103. Liu LY, Schaub MA, Sirota M, Butte AJ: Transmission distortion in Crohn's disease risk gene ATG16L1 leads to sex difference in disease association. Inflamm Bowel Dis 2012, 18:312-322.
104. Barrett JC, Hansoul S, Nicolae DL, Cho JH, Duerr RH, Rioux JD, Brant SR, Silverberg MS, Taylor KD, Barmada MM, et al: Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease. Nat Genet 2008, 40:955-962.
105. Weersma RK, Stokkers PC, Cleynen I, Wolfkamp SC, Henckaerts L, Schreiber S, Dijkstra G, Franke A, Nolte IM, Rutgeerts P, et al: Confirmation of multiple Crohn's disease susceptibility loci in a large Dutch-Belgian cohort. Am J Gastroenterol 2009, 104:630-638.
106. Parkes M, Barrett JC, Prescott NJ, Tremelling M, Anderson CA, Fisher SA, Roberts RG, Nimmo ER, Cummings FR, Soars D, et al: Sequence variants in the autophagy gene IRGM and multiple other replicating loci contribute to Crohn's disease susceptibility. Nat Genet 2007, 39:830-832.
107. Wang K, Zhang H, Kugathasan S, Annese V, Bradfield JP, Russell RK, Sleiman PM, Imielinski M, Glessner J, Hou C, et al: Diverse genome-wide association studies associate the IL12/IL23 pathway with Crohn Disease. Am J Hum Genet 2009, 84:399-405.
108. Fisher SA, Tremelling M, Anderson CA, Gwilliam R, Bumpstead S, Prescott NJ, Nimmo ER, Massey D, Berzuini C, Johnson C, et al: Genetic determinants of ulcerative colitis include the ECM1 locus and five loci implicated in Crohn's disease. Nat Genet 2008, 40:710-712.
109. Amre DK, Mack D, Israel D, Morgan K, Lambrette P, Law L, Grimard G, Deslandres C, Krupoves A, Bucionis V, et al: Association between genetic variants in the IL-23R gene and early-onset Crohn's disease: results from a case-control and family-based study among Canadian children. Am J Gastroenterol 2008, 103:615-620.
110. Kanaan Z, Ahmad S, Bilchuk N, Vahrenhold C, Pan J, Galandiuk S: Perianal Crohn's disease: predictive factors and genotype-phenotype correlations. Dig Surg 2012, 29:107-114.
111. Duerr RH, Taylor KD, Brant SR, Rioux JD, Silverberg MS, Daly MJ, Steinhart AH, Abraham C, Regueiro M, Griffiths A, et al: A genome-wide association study identifies IL23R as an inflammatory bowel disease gene. Science 2006, 314:1461-1463.
112. Hoffman GE, Logsdon BA, Mezey JG: PUMA: a unified framework for penalized multiple regression analysis of GWAS data. PLoS Computational Biology 2013, 9:e1003101.
113. Dolejsi E, Bodenstorfer B, Frommlet F: Analyzing genome-wide association studies with an FDR controlling modification of the Bayesian information criterion. arXiv preprint arXiv:1403.6623 2014.
114. Mizuki N, Meguro A, Ota M, Ohno S, Shiota T, Kawagoe T, Ito N, Kera J, Okada E, Yatsu K, et al: Genome-wide association studies identify IL23R-IL12RB2 and IL10 as Behcet's disease susceptibility loci. Nat Genet 2010, 42:703-706.
115. Schrodi SJ, Li Y: Genetic polymorphisms associated with autoinflammatory diseases, methods of detection and uses thereof. Google Patents; 2009.
116. Gardener H, Beecham A, Cabral D, Yanuck D, Slifer S, Wang L, Blanton SH, Sacco RL, Juo SH, Rundek T: Carotid plaque and candidate genes related to inflammation and endothelial function in Hispanics from northern Manhattan. Stroke 2011, 42:889-896.
117. Fowdar JY: Identification of Hypertension Genes Following a Genome-Wide Association Scan. Griffith University, School of Medical Science; 2012.
118. Adeyemo A, Gerry N, Chen G, Herbert A, Doumatey A, Huang H, Zhou J, Lashley K, Chen Y, Christman M, Rotimi C: A genome-wide association study of hypertension and blood pressure in African Americans. PLoS Genet 2009, 5:e1000564.
119. Kostis WJ, Cabrera J, Hooper WC, Whelton PK, Espeland MA, Cosgrove NM, Cheng JQ, Deng Y, De Staerck C, Pyle M, et al: Relationships between selected gene polymorphisms and blood pressure sensitivity to weight loss in elderly persons with hypertension. Hypertension 2013, 61:857-863.
120. Schiff M, Delahaye A, Andrieux J, Sanlaville D, Vincent-Delorme C, Aboura A, Benzacken B, Bouquillon S, Elmaleh-Berges M, Labalme A: Further delineation of the 17p13.3 microdeletion involving YWHAE but distal to PAFAH1B1: Four additional patients. European Journal of Medical Genetics 2010, 53:303-308.
121. Ehret GB, Morrison AC, O'Connor AA, Grove ML, Baird L, Schwander K, Weder A, Cooper RS, Rao DC, Hunt SC, et al: Replication of the Wellcome Trust genome-wide association study of essential hypertension: the Family Blood Pressure Program. Eur J Hum Genet 2008, 16:1507-1511.
122. Langley SR, Bottolo L, Kunes J, Zicha J, Zidek V, Hubner N, Cook SA, Pravenec M, Aitman TJ, Petretto E: Systems-level approaches reveal conservation of trans-regulated genes in the rat and genetic determinants of blood pressure in humans. Cardiovasc Res 2013, 97:653-665.
123. Inanloo Rahatloo K, Parsa AFZ, Huse K, Rasooli P, Davaran S, Platzer M, Kramer M, Fan J-B, Turk C, Amini S: Mutation in ST6GALNAC5 identified in family with coronary artery disease. Scientific Reports 2014, 4.
124. Rojek A, Cielecka-Prynda M, Przewlocka-Kosmala M, Laczmanski L, Mysiak A, Kosmala W: Impact of the PPARGC1A Gly482Ser polymorphism on left ventricular structural and functional abnormalities in patients with hypertension. J Hum Hypertens 2014.
125. Rantapaa-Dahlqvist SM: Single Nucleotide Polymorphisms within the HLA-DRB1 Gene in Relation to Antibodies Against Citrullinated Peptides in Individuals Prior to the Development of Rheumatoid Arthritis. 2012: 412.
126. El-Gabalawy HS, Robinson DB, Daha NA, Oen KG, Smolik I, Elias B, Hart D, Bernstein CN, Sun Y, Lu Y, et al: Non-HLA genes modulate the risk of rheumatoid arthritis associated with HLA-DRB1 in a susceptible North American Native population. Genes Immun 2011, 12:568-574.
127. Plant D, Symmons DP, Worthington J, Strachan D, Barton A: Genome-Wide Association Study of Rheumatoid Arthritis, Stratified by Smoking Status. In American College of Rheumatology/Association of Rheumatology Health Professionals Annual Scientific Meeting; Chicago, Illinois. 2011: 164.

128. Yen JH, Chen CJ, Tsai WC, Ou TT, Lin CH, Lin SC, Liu HW: HLA-DQA1 genotyping in patients with rheumatoid arthritis in Taiwan. Kaohsiung J Med Sci 2001, 17:183-189.
129. Julia A, Ballina J, Canete JD, Balsa A, Tornero-Molina J, Naranjo A, Alperi-Lopez M, Erra A, Pascual-Salcedo D, Barcelo P, et al: Genome-wide association study of rheumatoid arthritis in the Spanish population: KLF12 as a risk locus for rheumatoid arthritis susceptibility. Arthritis Rheum 2008, 58:2275-2286.
130. Thompson SD, Sudman M, Ramos PS, Marion MC, Ryan M, Tsoras M, Weiler T, Wagner M, Keddache M, Haas JP, et al: The susceptibility loci juvenile idiopathic arthritis shares with other autoimmune diseases extend to PTPN2, COG6, and ANGPT1. Arthritis Rheum 2010, 62:3265-3276.
131. Prahalad S, Hansen S, Whiting A, Guthery SL, Clifford B, McNally B, Zeft AS, Bohnsack JF, Jorde LB: Variants in TNFAIP3, STAT4, and C12orf30 loci associated with multiple autoimmune diseases are also associated with juvenile idiopathic arthritis. Arthritis Rheum 2009, 60:2124-2130.
132. Steer S, Abkevich V, Gutin A, Cordell HJ, Gendall KL, Merriman ME, Rodger RA, Rowley KA, Chapman P, Gow P, et al: Genomic DNA pooling for whole-genome association scans in complex disease: empirical demonstration of efficacy in rheumatoid arthritis. Genes Immun 2007, 8:57-68.
133. Barton A, Thomson W, Ke X, Eyre S, Hinks A, Bowes J, Plant D, Gibbons LJ, Wilson AG, Bax DE: Identification of novel RA susceptibility loci at chromosomes 10p15, 12q13 and 22q13. Nature Genetics 2008, 40:1156.
134. Smyth DJ, Cooper JD, Howson JM, Walker NM, Plagnol V, Stevens H, Clayton DG, Todd JA: PTPN22 Trp620 explains the association of chromosome 1p13 with type 1 diabetes and shows a statistical interaction with HLA class II genotypes. Diabetes 2008, 57:1730-1737.
135. Todd JA, Walker NM, Cooper JD, Smyth DJ, Downes K, Plagnol V, Bailey R, Nejentsev S, Field SF, Payne F, et al: Robust associations of four new chromosome regions from genome-wide analyses of type 1 diabetes. Nat Genet 2007, 39:857-864.
136. Viken MK, Blomhoff A, Olsson M, Akselsen HE, Pociot F, Nerup J, Kockum I, Cambon-Thomsen A, Thorsby E, Undlien DE, Lie BA: Reproducible association with type 1 diabetes in the extended class I region of the major histocompatibility complex. Genes Immun 2009, 10:323-333.
137. Butty V, Campbell C, Mathis D, Benoist C: Impact of diabetes susceptibility loci on progression from pre-diabetes to diabetes in at-risk individuals of the diabetes prevention trial-type 1 (DPT-1). Diabetes 2008, 57:2348-2359.
138. Salonen J: Novel genes and markers associated with high-density lipoprotein-cholesterol (HDL-C). Google Patents; 2006.
139. Abellan JJ, Abellan C, Gonzalez JR: A Bayesian shared component model for genetic association studies. 2010.
140. An P, Feitosa M, Ketkar S, Adelman A, Lin S, Borecki I, Province M: Epistatic interactions of CDKN2B-TCF7L2 for risk of type 2 diabetes and of CDKN2B-JAZF1 for triglyceride/high-density lipoprotein ratio longitudinal change: evidence from the Framingham Heart Study. BMC Proc 2009, 3 Suppl 7:S71.

141. Gupta V, Khadgawat R, Ng HK, Kumar S, Aggarwal A, Rao VR, Sachdeva MP: A validation study of type 2 diabetes-related variants of the TCF7L2, HHEX, KCNJ11, and ADIPOQ genes in one endogamous ethnic group of north India. Ann Hum Genet 2010, 74:361-368.
142. Groves CJ, Zeggini E, Minton J, Frayling TM, Weedon MN, Rayner NW, Hitman GA, Walker M, Wiltshire S, Hattersley AT, McCarthy MI: Association analysis of 6,736 U.K. subjects provides replication and confirms TCF7L2 as a type 2 diabetes susceptibility gene with a substantial effect on individual risk. Diabetes 2006, 55:2640-2644.
143. Rong R, Hanson RL, Ortiz D, Wiedrich C, Kobes S, Knowler WC, Bogardus C, Baier LJ: Association analysis of variation in/near FTO, CDKAL1, SLC30A8, HHEX, EXT2, IGF2BP2, LOC387761, and CDKN2B with type 2 diabetes and related quantitative traits in Pima Indians. Diabetes 2009, 58:478-488.
144. Yajnik CS, Janipalli CS, Bhaskar S, Kulkarni SR, Freathy RM, Prakash S, Mani KR, Weedon MN, Kale SD, Deshpande J, et al: FTO gene variants are strongly associated with type 2 diabetes in South Asian Indians. Diabetologia 2009, 52:247-252.
145. Hertel JK, Johansson S, Raeder H, Midthjell K, Lyssenko V, Groop L, Molven A, Njolstad PR: Genetic analysis of recently identified type 2 diabetes loci in 1,638 unselected patients with type 2 diabetes and 1,858 control participants from a Norwegian population-based cohort (the HUNT study). Diabetologia 2008, 51:971-977.
146. Chandak GR, Janipalli CS, Bhaskar S, Kulkarni SR, Mohankrishna P, Hattersley AT, Frayling TM, Yajnik CS: Common variants in the TCF7L2 gene are strongly associated with type 2 diabetes mellitus in the Indian population. Diabetologia 2007, 50:63-67.
147. Pang DX, Smith AJ, Humphries SE: Functional analysis of TCF7L2 genetic variants associated with type 2 diabetes. Nutr Metab Cardiovasc Dis 2013, 23:550-556.
148. Chan KHK: Assessing the genetic architecture of metabolic diseases using candidate gene and genome-wide approach. 2012.
149. Gul H, Acikel YC, Aydin Son Y: Discovering missing heritability and early risk prediction for type 2 diabetes; a new perspective for genome-wide association study analysis with the Nurses Health Study and the Health Professionals Follow-Up Study. Turkish Journal of Medical Sciences 2014.
150. Sitek A, Rosset I, Strapagiel D, Majewska M, Ostrowska-Nawarycz L, Żądzińska E: Association of FTO gene with obesity in Polish schoolchildren. Anthropological Review 2014, 77:33-44.
151. Wu Y, Li H, Loos RJ, Yu Z, Ye X, Chen L, Pan A, Hu FB, Lin X: Common variants in CDKAL1, CDKN2A/B, IGF2BP2, SLC30A8, and HHEX/IDE genes are associated with type 2 diabetes and impaired fasting glucose in a Chinese Han population. Diabetes 2008, 57:2834-2842.
152. Chang YC, Liu PH, Lee WJ, Chang TJ, Jiang YD, Li HY, Kuo SS, Lee KC, Chuang LM: Common variation in the fat mass and obesity-associated (FTO) gene confers risk of obesity and modulates BMI in the Chinese population. Diabetes 2008, 57:2245-2252.

153. Ramya K, Radha V, Ghosh S, Majumder PP, Mohan V: Genetic variations in the FTO gene are associated with type 2 diabetes and obesity in south Indians (CURES-79). Diabetes Technol Ther 2011, 13:33-42.
154. Renstrom F, Payne F, Nordstrom A, Brito EC, Rolandsson O, Hallmans G, Barroso I, Nordstrom P, Franks PW: Replication and extension of genome-wide association study results for obesity in 4923 adults from northern Sweden. Hum Mol Genet 2009, 18:1489-1496.
155. Izaks GJ, Gansevoort RT, van der Knaap AM, Navis G, Dullaart RP, Slaets JP: The association of APOE genotype with cognitive function in persons aged 35 years or older. PLoS ONE 2011, 6:e27415.
156. Bertram L, Lange C, Mullin K, Parkinson M, Hsiao M, Hogan MF, Schjeide BM, Hooli B, Divito J, Ionita I, et al: Genome-wide association analysis reveals putative Alzheimer's disease susceptibility loci in addition to APOE. Am J Hum Genet 2008, 83:623-632.
157. Jiang X, Barmada MM, Cooper GF, Becich MJ: A Bayesian Method for Evaluating and Discovering Disease Loci Associations. PLoS ONE 2011, 6:e22075.
158. Hu WT, Chen-Plotkin A, Arnold SE, Grossman M, Clark CM, Shaw LM, Pickering E, Kuhn M, Chen Y, McCluskey L, et al: Novel CSF biomarkers for Alzheimer's disease and mild cognitive impairment. Acta Neuropathologica 2010, 119:669-678.
159. Shi H, Medway C, Bullock J, Brown K, Kalsheker N, Morgan K: Analysis of Genome-Wide Association Study (GWAS) data looking for replicating signals in Alzheimer's disease (AD). Int J Mol Epidemiol Genet 2010, 1:53-66.
160. Shi H: Complementary Approaches to Analyse Genetic Data in Late Onset Alzheimer's Disease (LOAD). University of Nottingham; 2012.
161. Jiang X, Neapolitan RE, Barmada MM, Visweswaran S, Cooper GF: A fast algorithm for learning epistatic genomic relationships. AMIA Annu Symp Proc 2010, 2010:341-345.
162. Saarela J, von Schantz C, Peltonen L, Jalanko A: A novel aspartylglucosaminuria mutation affects translocation of aspartylglucosaminidase. Hum Mutat 2004, 24:350-351.
163. Shi H, Belbin O, Medway C, Brown K, Kalsheker N, Carrasquillo M, Proitsi P, Powell J, Lovestone S, Goate A, et al: Genetic variants influencing human aging from late-onset Alzheimer's disease (LOAD) genome-wide association studies (GWAS). Neurobiol Aging 2012, 33:1849.e5-18.
164. Deelen J, Beekman M, Uh HW, Helmer Q, Kuningas M, Christiansen L, Kremer D, van der Breggen R, Suchiman HE, Lakenberg N, et al: Genome-wide association study identifies a single major locus contributing to survival into old age; the APOE locus revisited. Aging Cell 2011, 10:686-698.
165. Li G, Jiang H, Chang M, Xie H, Hu L: HDAC6 α-tubulin deacetylase: A potential therapeutic target in neurodegenerative diseases. Journal of the Neurological Sciences 2011, 304:1-8.
166. Butterfield DA, Abdul HM, Opii W, Newman SF, Joshi G, Ansari MA, Sultana R: Pin1 in Alzheimer's disease. J Neurochem 2006, 98:1697-1706.
167. Driver JA, Lu KP: Pin1: a new genetic link between Alzheimer's disease, cancer and aging. Curr Aging Sci 2010, 3:158-165.

168. Okuda T, Higashi Y, Kokame K, Tanaka C, Kondoh H, Miyata T: Ndrg1-deficient mice exhibit a progressive demyelinating disorder of peripheral nerves. Mol Cell Biol 2004, 24:3949-3956.
169. Kalaydjieva L, Gresham D, Gooding R, Heather L, Baas F, de Jonge R, Blechschmidt K, Angelicheva D, Chandler D, Worsley P, et al: N-myc downstream-regulated gene 1 is mutated in hereditary motor and sensory neuropathy-Lom. Am J Hum Genet 2000, 67:47-58.
170. Potkin SG, Guffanti G, Lakatos A, Turner JA, Kruggel F, Fallon JH, Saykin AJ, Orro A, Lupoli S, Salvi E, et al: Hippocampal Atrophy as a Quantitative Trait in a Genome-Wide Association Study Identifying Novel Susceptibility Genes for Alzheimer's Disease. PLoS ONE 2009, 4:e6501.
171. Broggini T, Nitsch R, Savaskan NE: Plasticity-related gene 5 (PRG5) induces filopodia and neurite growth and impedes lysophosphatidic acid- and nogo-A-mediated axonal retraction. Mol Biol Cell 2010, 21:521-537.
172. Harter PN, Bunz B, Dietz K, Hoffmann K, Meyermann R, Mittelbronn M: Spatio-temporal deleted in colorectal cancer (DCC) and netrin-1 expression in human foetal brain development. Neuropathol Appl Neurobiol 2010, 36:623-635.
173. Pan Y, Wang KS, Aragam N: NTM and NR3C2 polymorphisms influencing intelligence: family-based association studies. Prog Neuropsychopharmacol Biol Psychiatry 2011, 35:154-160.
174. Baranzini SE, Wang J, Gibson RA, Galwey N, Naegelin Y, Barkhof F, Radue EW, Lindberg RL, Uitdehaag BM, Johnson MR, et al: Genome-wide association analysis of susceptibility and clinical phenotype in multiple sclerosis. Hum Mol Genet 2009, 18:767-778.
175. Bossers K, Wirz KT, Meerhoff GF, Essing AH, van Dongen JW, Houba P, Kruse CG, Verhaagen J, Swaab DF: Concerted changes in transcripts in the prefrontal cortex precede neuropathology in Alzheimer's disease. Brain 2010, 133:3699-3723.
176. Kohannim O, Hibar DP, Stein JL, Jahanshad N, Hua X, Rajagopalan P, Toga AW, Jack CR Jr., Weiner MW, de Zubicaray GI, et al: Discovery and Replication of Gene Influences on Brain Structure Using LASSO Regression. Front Neurosci 2012, 6:115.
177. Shi H, Medway C, Brown K, Kalsheker N, Morgan K: Using Fisher's method with PLINK 'LD clumped' output to compare SNP effects across Genome-wide Association Study (GWAS) datasets. Int J Mol Epidemiol Genet 2011, 2:30-35.
178. Williams C, Mehrian Shai R, Wu Y, Hsu YH, Sitzer T, Spann B, McCleary C, Mo Y, Miller CA: Transcriptome analysis of synaptoneurosomes identifies neuroplasticity genes overexpressed in incipient Alzheimer's disease. PLoS ONE 2009, 4:e4936.
179. Um JW, Kaufman AC, Kostylev M, Heiss JK, Stagi M, Takahashi H, Kerrisk ME, Vortmeyer A, Wisniewski T, Koleske AJ, et al: Metabotropic glutamate receptor 5 is a coreceptor for Alzheimer abeta oligomer bound to cellular prion protein. Neuron 2013, 79:887-902.
180. Munoz L, Ammit AJ: Targeting p38 MAPK pathway for the treatment of Alzheimer's disease. Neuropharmacology 2010, 58:561-568.
181. Zhang JH, Barr VA, Mo Y, Rojkova AM, Liu S, Simonds WF: Nuclear localization of G protein beta 5 and regulator of G protein signaling 7 in neurons and brain. J Biol Chem 2001, 276:10284-10289.

182. Libioulle C, Louis E, Hansoul S, Sandor C, Farnir F, Franchimont D, Vermeire S, Dewit O, de Vos M, Dixon A, et al: Novel Crohn disease locus identified by genome-wide association maps to a gene desert on 5p13.1 and modulates expression of PTGER4. PLoS Genet 2007, 3:e58.
