University of Pennsylvania ScholarlyCommons

Publicly Accessible Penn Dissertations

2017

Understanding Chronic Kidney Disease: Genetic And Epigenetic Approaches

Yi-An Ko, University of Pennsylvania

Follow this and additional works at: https://repository.upenn.edu/edissertations

Part of the Bioinformatics Commons, Genetics Commons, and the Systems Biology Commons

Recommended Citation Ko, Yi-An, "Understanding Chronic Kidney Disease: Genetic And Epigenetic Approaches" (2017). Publicly Accessible Penn Dissertations. 2404. https://repository.upenn.edu/edissertations/2404

This paper is posted at ScholarlyCommons. https://repository.upenn.edu/edissertations/2404 For more information, please contact [email protected].

Understanding Chronic Kidney Disease: Genetic And Epigenetic Approaches

Abstract

The work described in this dissertation aimed to better understand the genetic and epigenetic factors influencing chronic kidney disease (CKD) development. Genome-wide association studies (GWAS) have identified single nucleotide polymorphisms (SNPs) significantly associated with chronic kidney disease. However, these studies have not effectively identified target genes for the CKD variants. Most of the identified variants are localized to non-coding genomic regions, and how they associate with CKD development is not well understood. As GWAS explain only a small fraction of heritability, we hypothesized that epigenetic changes could explain part of this missing heritability.

To identify potential targets of the genetic variants, we performed expression quantitative trait loci (eQTL) analysis, using genotyping arrays and RNA sequencing from human kidney samples. To identify the target genes of CKD-associated SNPs, we integrated the GWAS-identified SNPs with the eQTL results using a Bayesian colocalization method, coloc. This resulted in a short list of target genes, including PGAP3 and CASP9, two genes that have been shown to present with kidney phenotypes in knockout mice. To examine the functional role of a newly identified gene from the integrative analysis, MANBA, we knocked down this gene in zebrafish. This resulted in pericardial edema, a phenotype seen with kidney developmental defects.

While epigenetic dysregulation has been suggested as a mechanism for the development of many diseases, little was known about the epigenome of normal and diseased human kidneys. We performed cytosine methylation analysis in microdissected human kidney tubules in both CKD samples and controls, and found differentially methylated regions (DMRs) in CKD. These DMRs were further validated in a second cohort. The DMRs were mostly localized outside of promoter areas and enriched in intronic enhancer regions, and we found that the DMRs contain consensus-binding motifs for key renal transcription factors (HNF, TCFAP, SIX2). Furthermore, we found that these DMRs correlated with transcript changes. A network analysis of these correlated DMR-transcript pairs revealed enrichment of a fibrosis network that highlighted the TGF-β pathway. In vitro validation experiments established a likely causal relationship between epigenetic changes and CKD development.

In summary, integration of eQTL analysis with GWAS-identified CKD variants has facilitated the identification of likely CKD target genes. We discovered cytosine methylation changes in human kidney tissue samples in the major fibrotic pathways. Overall, genetic variations as well as cytosine methylation levels have significant impact on gene expression regulation, possibly downstream organ function, and CKD development.

Degree Type Dissertation

Degree Name Doctor of Philosophy (PhD)

Graduate Group Cell & Molecular Biology

First Advisor Katalin Susztak

Keywords Chronic kidney disease, cytosine methylation, Epigenetics, eQTL, Genetics, integrative analysis

Subject Categories Bioinformatics | Genetics | Systems Biology

This dissertation is available at ScholarlyCommons: https://repository.upenn.edu/edissertations/2404

UNDERSTANDING CHRONIC KIDNEY DISEASE:

GENETIC AND EPIGENETIC APPROACHES

Yi-An Ko

A DISSERTATION

in

Cell and Molecular Biology

Presented to the Faculties of the University of Pennsylvania

In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

2017

______

Supervisor of Dissertation

Katalin Susztak, M.D. Ph.D., Professor of Medicine Department

______

Graduate Group Chairperson

Daniel S. Kessler, Ph.D., Associate Professor of Cell and Developmental Biology

Dissertation Committee:

Christopher D. Brown, Ph.D., Assistant Professor of Genetics

Russ P. Carstens, M.D., Associate Professor of Medicine

Michael A. Pack, M.D., Professor of Medicine

Daniel J. Rader, M.D., Seymour Gray Professor of Molecular Medicine

DEDICATION

To the contributors of my two alleles

Jung-Shiang Tu and Der-Shyang Ko

For their love and support

ACKNOWLEDGMENT

During the course of my Ph.D. studies at the University of Pennsylvania, I have had the pleasure of working with many inspiring and knowledgeable people. I am grateful to the Cell and Molecular Biology Program in the University of Pennsylvania Perelman School of Medicine for creating a very stimulating and inspiring environment. I thank my thesis committee members, Dr. Katalin Susztak, Dr. Christopher D. Brown, Dr. Russ P. Carstens, Dr. Michael A. Pack, and Dr. Daniel J. Rader, for all of their encouragement and hard work in directing my thesis. My development in science has been shaped by a number of people, but none more than Dr. Katalin Susztak. Working with Dr. Susztak has been a great journey of research, work, discovery, mistakes, laughter, celebration, and understanding. I have learned far more from her than only science, and look forward to what I hope will be a long and productive collaboration. I am fortunate to have met Dr. Susztak soon after arriving in the U.S., and I hope to make her proud. Thanks to the fourth and fifth floor Clinical Research Building members for being my GPS and friends in life and work, including members from Holtzman Lab, Epstein Lab, Tishkoff Lab, Zhou Lab, Brown Lab, Wu Lab, Murray Lab and Sundaram Lab. I especially want to thank Juni, Ivy, Cody, Ian, Arcy, Chuck, and Jay for being there since my first day on the floor, and for all the chats, help, encouragement and our Friday night movie club. I have extra-special affection for Elicia and Gleb for their day-to-day presence in my life, the sense and nonsense moments, and the scientific and non-scientific conversations. These will always be among my best memories as a graduate student. Thanks too to my musical friends, Pasquale, Toby, Neil, Diane, Gary, Eleanor, Sara, Ralph, Eva, Helga, Bob, Sue, and Joyce. You are my extended cultural family in the U.S., and we will continue the celebration of art and life together. Thanks to my biking team leader Rick -- we will not be defeated by the cold weather! Thanks to the volunteer coordinators at Action Wellness, Jay, Ron, Harley, and Jay-Bird; I am so happy to be one of you, and I will continue to be there. Finally, I would like to give special thanks to Dr. Paul Li and Dr. Akon Higuchi for directing my career since I was fresh out of undergrad; your influence on me has been much greater than you can imagine. Last but not least, I wish to thank Peter for his everlasting love, encouragement and support.

Yi-An Ko, Philadelphia, 2016.

ABSTRACT

UNDERSTANDING CHRONIC KIDNEY DISEASE: GENETIC AND EPIGENETIC APPROACHES

Yi-An Ko

Katalin Susztak

The work described in this dissertation aimed to better understand the genetic and epigenetic factors influencing chronic kidney disease (CKD) development. Genome-wide association studies (GWAS) have identified single nucleotide polymorphisms (SNPs) significantly associated with chronic kidney disease. However, these studies have not effectively identified target genes for the CKD variants. Most of the identified variants are localized to non-coding genomic regions, and how they are related to CKD development is not well understood. As GWAS explain only a small fraction of heritability, we hypothesized that epigenetic changes could explain part of this missing heritability.

To identify potential gene targets of the genetic variants, we performed expression quantitative trait loci (eQTL) analysis, using genotyping arrays and RNA sequencing from human kidney samples. To identify the target genes of CKD-associated SNPs, we integrated the GWAS-identified SNPs with the eQTL results using a Bayesian colocalization method, coloc. This resulted in a short list of target genes, including PGAP3 and CASP9, two genes that have been shown to present with kidney phenotypes in knockout mice. To examine the functional role of a newly identified gene from the integrative analysis, MANBA, we knocked down this gene in zebrafish. This resulted in pericardial edema, a phenotype seen with kidney developmental defects.

While epigenetic dysregulation has been suggested as a mechanism for the development of many diseases, little was known about the epigenome of normal and diseased human kidneys. We performed cytosine methylation analysis in microdissected human kidney tubules in both CKD samples and controls, and found differentially methylated regions (DMRs) in CKD. These DMRs were further validated in a second cohort. The DMRs were mostly localized outside of promoter areas and enriched in intronic enhancer regions, and we found that the DMRs contain consensus-binding motifs for key renal transcription factors (HNF, TCFAP, SIX2). Furthermore, we found that these DMRs correlated with transcript changes. A network analysis of these correlated DMR-transcript pairs revealed enrichment of a fibrosis network that highlighted the TGF-β pathway. In vitro validation experiments established a likely causal relationship between epigenetic changes and CKD development.

In summary, integration of eQTL analysis with GWAS-identified CKD variants has facilitated the identification of likely CKD target genes. We discovered cytosine methylation changes in human kidney tissue samples in the major fibrotic pathways. Overall, genetic variations as well as cytosine methylation levels have significant impact on gene expression regulation, possibly downstream organ function, and CKD development.

TABLE OF CONTENTS

ACKNOWLEDGMENT
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF APPENDIXES

CHAPTER 1: Introduction
1.1 Chronic kidney disease: approaches for understanding the disease
1.1.1. Overview of chronic kidney disease
1.1.2. Genetics and CKD
1.1.3. Epigenetics and CKD
1.2 Genome-wide association studies
1.2.1. The history of genetic mapping
1.2.2. GWAS methods
1.2.3. GWAS and chronic kidney diseases
1.3 Expression quantitative analysis (eQTL)
1.3.1. The development of eQTL analysis
1.3.2. eQTL analysis
1.3.3. Finding colocalization of GWAS and eQTL loci
1.4 Epigenetics
1.4.1. Epigenetics in chronic kidney disease
1.4.2. Epigenetics and CKD development in mouse models
1.4.3. Epigenetics and CKD development human studies
1.4.4. Epigenetics Overview
1.4.5. Methods to study the epigenome

CHAPTER 2: Genetic-variation-driven gene-expression changes highlight genes with important functions for kidney disease
2.1. Abstract
2.2. Introduction
2.3. Results
2.3.1 eQTL analysis of human kidney tissues
2.3.2 Validation and replication of kidney cis-eQTL signals
2.3.3 eGenes are enriched for metabolism-associated genes
2.3.4 Kidney eQTL highlights the genetics of disease traits
2.3.5 Defining the genetic network underlying CKD
2.3.6 Integrative analysis identifies MANBA as a potential gene for CKD
2.3.7 Analysis of MANBA gene expression
2.3.8 Knockdown of Manba results in kidney defect in zebrafish
2.4. Discussion and Conclusion
2.5. Materials and Methods

CHAPTER 3: Cytosine methylation changes in enhancer regions of core pro-fibrotic genes characterize kidney fibrosis development
3.1 Abstract
3.2 Introduction
3.3 Results
3.3.1 CKD kidneys show distinct cytosine methylation profile
3.3.2 Validation and external replication of the results
3.3.3 Differentially methylated loci are enriched in kidney-specific gene regulatory regions
3.3.4 Differentially methylated regions are functionally relevant and correlate with transcript levels
3.3.5 Methylation in differentially methylated regions affect gene expression in vitro
3.4 Discussion and Conclusion
3.5 Materials and Methods

CHAPTER 4: Conclusion
4.1. Summary of results and discussion
4.2. Future direction

APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 1-1 | Gene regulatory function of histone marks

Table 2-1 | SNPs identified in human kidney eQTL that are in tight LD with CKD GWAS SNPs

Table 2-2 | Significant genes identified through Coloc that are associated with both CKD GWAS- and human kidney-identified SNPs

Table 3-1 | Demographic, clinical and histological characteristics of the samples

LIST OF FIGURES

Figure 1-1 | Illustration of cis-eQTL.

Figure 1-2 | The eukaryotic transcription unit.

Figure 1-3 | Schematic diagram of the restriction enzyme based method to identify differentially methylated regions.

Figure 2-1 | PCA plot of kidney samples against 1000 genomes reference genome.

Figure 2-2 | Cis-eQTL analysis of human kidney genome.

Figure 2-3 | Positive correlation between sample number and identified eGenes.

Figure 2-4 | Positive correlation between sample number and replicated significant SNP-eGene pairs.

Figure 2-5 | Significant SNPs in the eQTL are enriched at regulatory regions.

Figure 2-6 | An example of significant SNPs in the eQTL that is enriched at regulatory regions.

Figure 2-7 | Network of eGenes.

Figure 2-8 | Density distributions of gene expression of eGenes and all genes in kidney samples.

Figure 2-9 | Disease trait GWAS Enrichment analysis.

Figure 2-10 | Integrative analysis of GWAS and eQTL SNPs.

Figure 2-11 | Integrative analysis of GWAS and eQTL SNPs.

Figure 2-12 | Genetic and epigenetic annotation of MANBA-locus SNPs.

Figure 2-13 | Locuszoom plots in MANBA region at chr4:103.4-103.8Mb.

Figure 2-14 | Genetic and epigenetic annotation of MANBA-locus SNPs.

Figure 2-15 | Gene expression analyses of MANBA.

Figure 2-16 | Kidney defect phenotype observed in zebrafish after knockdown MANBA.

Figure 2-17 | Kidney defect phenotype observed in zebrafish after knockdown MANBA.

Figure 3-1 | Statistically significant cytosine methylation differences in chronic kidney disease.

Figure 3-2 | External and internal validation of the observed changes.

Figure 3-3 | Chronic kidney disease differentially methylated regions are localized to kidney-specific enhancer regions.

Figure 3-4 | Chronic kidney disease differentially methylated regions are enriched for kidney-specific binding sites.

Figure 3-5 | Differentially methylated regions correlate with transcript changes.

Figure 3-6 | Regulation of transcripts by a DNA methyltransferase inhibitor in in vitro cultured human tubular epithelial cells.

Figure 3-7 | Gene body cytosine methylation changes drive gene expression differences.

LIST OF APPENDIXES

Appendix 1 | List of CKD associated SNPs. List of SNPs curated manually from all kidney-function-associated GWAS.

Appendix 2 | All significant eGenes

Appendix 3 | Differentially methylated loci and the corresponding differentially expressed transcripts

Appendix 4 | Infinium DKD dataset and HELP CKD dataset differentially methylated loci

Appendix 5 | Technical validation of methylation level using MassArray Epityper

Appendix 6 | RefSeq annotated genomic distribution of the hypo- and hypermethylated regions

Appendix 7 | List of differentially methylated regions with p-value, methylation level, genomic locus and nearest annotated transcript

CHAPTER 1: Introduction

1.1 Chronic kidney disease: approaches for understanding the disease

1.1.1. Overview of chronic kidney disease

The major function of the human kidney is to filter blood and eliminate metabolic waste through urine. The healthy human kidney filters 180 liters of blood each day. Most electrolytes and small molecules are reabsorbed from the primary filtrate while some of them are secreted in the urine. This process requires a large quantity of active molecule transport and associated energy expenditure in the kidney.

Kidney function is mostly measured by filtration capacity (glomerular filtration rate, or GFR), based on plasma clearance of endogenous creatinine. Chronic kidney disease (CKD) is a common condition, estimated to affect 13.6% of the US population [2] and 11 to 13% of the global population [3]. CKD is characterized by a 40% decline in estimated GFR (eGFR) or leakiness of the filter to plasma proteins, specifically albumin. In the U.S., more than 75% of persons with CKD have hypertension and diabetes [2].

CKD is associated with a 3- to 5-fold increase in risk of mortality and is an irreversible disease process that culminates in end-stage renal disease (ESRD) [4, 5]. Current treatments for ESRD are limited to dialysis and kidney transplantation. Understanding of the mechanisms of CKD development is relatively limited with respect to the roles of genetic and environmental factors (hypertension, diabetes). No new drugs have been registered for CKD since 2001.

1.1.2. Genetics and CKD

A small proportion of CKD cases have monogenic causes, including cystic kidney diseases and glomerular diseases. However, in the majority of cases, CKD is a complex-trait disease with both genetic and environmental factors, including diabetes and hypertension [6, 7].

One of the key questions in the study of CKD is whether there is evidence suggesting a causal role for genetics, specifically whether CKD is heritable. Heritability is the proportion of phenotypic variation in a disease or other condition that results from additive genetic effects; it is reflected, for example, in a positive or negative correlation of a given trait between parents and offspring. Heritability can be assessed through different methods including familial aggregation, adoption studies (to distinguish environmental factors from heritability), and twin studies.
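As a minimal illustration of how a parent-offspring correlation translates into a heritability estimate, narrow-sense heritability can be approximated by the slope of a regression of offspring trait values on midparent values. The sketch below uses invented numbers and a hypothetical helper name; it is not the estimation procedure used in the cohort studies cited in this chapter, which rely on pedigree-based variance component models.

```python
def midparent_offspring_heritability(father, mother, offspring):
    """Estimate narrow-sense heritability (h^2) as the least-squares slope
    of offspring trait values regressed on midparent values."""
    midparent = [(f + m) / 2 for f, m in zip(father, mother)]
    n = len(midparent)
    mean_x = sum(midparent) / n
    mean_y = sum(offspring) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(midparent, offspring))
    var_x = sum((x - mean_x) ** 2 for x in midparent)
    return cov_xy / var_x

# Invented eGFR-like trait values for five families (illustration only)
father = [90.0, 100.0, 80.0, 110.0, 95.0]
mother = [100.0, 90.0, 90.0, 100.0, 85.0]
offspring = [96.0, 97.0, 92.0, 101.0, 94.0]

h2 = midparent_offspring_heritability(father, mother, offspring)
print(f"estimated h^2 = {h2:.2f}")
```

A slope near 0 would indicate that offspring values are unrelated to parental values, while a slope near 1 would indicate that the trait variation is almost entirely additive genetic.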

Family aggregation studies in the general population have addressed the heritability of kidney function, and are also useful for assessing rare variants traveling through pedigrees [8-12].

In the Framingham Heart Study offspring cohort with more than 300 families [8], the multivariable-adjusted heritability estimates were 0.29 for creatinine and 0.33 for eGFR. The heritability of eGFR estimated in a Utah pedigree cohort with normal kidney function was 0.3 to 0.5 [13]. (Kidney traits are further discussed in Chapter 1.2.3.)

Studies have also revealed that the association between kidney disease and family history is stronger in African Americans [14, 15] than in Americans of European descent [14], suggesting potential differences in genetic variation between different populations. In short, kidney function is heritable and shows phenotype variations that are specific to different populations.

Familial linkage analyses have helped identify the causal genes for strongly heritable kidney diseases that follow simple Mendelian inheritance patterns. For example, the autosomal dominant polycystic kidney disease (ADPKD) gene on chromosome 16 [16] and the Alport syndrome gene on the X chromosome [17] were both discovered through genomic linkage analysis. Mutations in these genes have low frequency, but their effect size for disease risk is large.

Linkage analysis, a statistical method that associates the function of genes or variants with their chromosomal location, has also identified a few genomic loci associated with CKD. These include a chromosome 8 locus linked to urinary microalbumin (Fox et al., 2005), linkage of chromosomes 19 and 12 with urine albumin to creatinine ratio (UACR; see Chapter 1.2.3 for additional detail) (Freedman et al., 2003), linkage of the q arm of chromosome 7 with diabetic nephropathy [18], linkage of chromosomes 2q36 and 6q22-23 with IgA nephropathy [19, 20], and linkage of genes on chromosomes 22q, 5q and 7q with UACR [21-23].

Although linkage analysis was a successful approach to detect rare genetic variants with large effects on disease risk, diseases with different penetrance in affected individuals make disease classification difficult (e.g. not all offspring that carry the same genomic mutation will develop the same disease) and decrease study power. Pedigree analyses have generated results that map genetic loci to kidney diseases using multiple families. However, it is very unlikely that each family has the same sets of genetic variants, so the linkage results can conflict and the significance decreases. Linkage analysis often identifies rare variants, with large effect size, that are segregated within tested families. This provides useful insight for monogenic traits, but for complex traits, this approach has generated few results, and those results often cannot be reproduced in the general population.

Since complex traits often have multiple genetic factors, these genetic factors should be surveyed in general populations that carry the disease, rather than by linkage analysis on family pedigrees. To study genetic variants that are associated with specific diseases in the general population, the “common disease, common variant” (CD-CV) hypothesis was developed in the 1990s. The CD-CV hypothesis proposes that genetic variants in normal individuals contribute to overall risk of complex-trait diseases. It was first described in a few publications [24-26], then supported by comprehensive empirical data [27]. Later the CD-CV hypothesis was advanced through the completion of the human genomic map [28, 29], generation of public SNP databases [30], and the International HapMap project [31, 32].

Based on the CD-CV hypothesis, GWAS were devised to study the association between common variants and complex-trait phenotypes. GWAS with genotyping arrays have facilitated the identification of multiple loci associated with kidney function. Genotyping arrays are designed so that through statistical inference, they can recover most of the HapMap and 1000 Genomes cataloged variants. The array is an economical substitute for whole-genome sequencing [33].

More than 100 loci have been identified, and these variants show strong and reproducible association with CKD. The majority of the variants are localized to non-coding regions of the genome [34]. A more detailed account of the development of GWAS can be found in Chapter 1.2.

The detection of genetic variants that exert small effects or appear with very low frequency in complex-trait disease development requires very large study cohorts for sufficient statistical power. These variants may therefore be missed by GWAS with current sample sizes. To date, the identified genetic variants explain only a small fraction of the heritability of CKD. Single nucleotide polymorphisms (SNPs) associated with kidney function explain only approximately 3% of the variance in eGFR thus far [35-38].

Since the majority of the GWAS-identified variants are localized to the non-coding regions of the genome, it is unclear which genes are affected by the genotypic changes. In order to develop a more complete understanding of the genetic basis for CKD, CKD-associated SNPs need to be tied to their target genes. With the advance of microarrays and RNA sequencing, it is feasible to assess genome-wide gene-expression changes. A few studies have shown the heritability of gene-expression traits [39-42], and thus we can use gene expression as a quantitative trait for performing expression quantitative trait loci (eQTL) analysis, a method that will be discussed further in Chapter 1.3. By using eQTL analysis to link CKD-associated SNPs to potential target genes, these SNPs can be understood more accurately and studied for their relevant biological functions.
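The core idea of treating gene expression as a quantitative trait can be sketched as a simple linear regression of expression on genotype dosage (0, 1 or 2 copies of the alternate allele). The example below is only an illustration under stated assumptions: the data are invented, the function name is hypothetical, and the model is a plain additive one with no covariates, whereas real eQTL pipelines adjust for covariates and for multiple testing.

```python
import math

def eqtl_slope(dosages, expression):
    """Fit expression = a + b * dosage by ordinary least squares.
    Returns (b, t_statistic) for the additive genotype effect."""
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    slope = sxy / sxx
    intercept = mean_e - slope * mean_g
    # Residual sum of squares gives the standard error of the slope
    rss = sum((e - (intercept + slope * g)) ** 2 for g, e in zip(dosages, expression))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / se

# Invented data: nine samples, three per genotype class
dosages = [0, 0, 0, 1, 1, 1, 2, 2, 2]
expression = [5.1, 4.9, 5.0, 6.0, 6.2, 5.8, 7.1, 6.9, 7.0]

slope, t_stat = eqtl_slope(dosages, expression)
print(f"effect per allele = {slope:.2f}, t = {t_stat:.1f}")
```

A SNP whose slope differs significantly from zero for a nearby gene would be reported as a cis-eQTL, and the gene as a candidate target of that SNP.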

In Chapter 2, I focus on understanding the genetics of CKD, with an aim of identifying target genes in GWAS studies. I use eQTL analysis to identify target genes for the CKD GWAS hits. Methods used in this study have identified potential target genes for mechanistic and causal studies of kidney disease development.

1.1.3. Epigenetics and CKD

A plausible mechanism that could explain the missing heritability for phenotypes is epigenetic change [43]. In general, epigenetics studies chemical modifications of the genome that do not alter DNA sequences. Epigenetic traits are DNA-sequence-independent transgenerational changes, brought on by changes to the environment of the affected individual, that propagate environmental effects on phenotypic development. Transient metabolic changes, for example, can result in long-term phenotypic alterations by epigenetic reprogramming.

One popular hypothesis for the missing heritability is the “developmental origins of adult diseases” hypothesis, also referred to as the Brenner-Barker hypothesis or fetal programming. For example, an adverse nutritional environment during development has been shown in animal models to be associated with worse lifetime outcomes including blood pressure, metabolic and renal traits, even after the subject animals have been separated from the adverse conditions.

In addition to fetal programming, the effect of poor glycemic control can be detected even a quarter-century after cardiovascular and kidney disease development, a phenomenon called “hyperglycemic memory.” More detail is provided in Chapter 1.4.

To understand whether epigenetic changes potentially contribute to CKD development, I performed genome-wide cytosine methylation analysis of microdissected human kidney samples. I discuss this work and its results in Chapter 3.

1.2 Genome-wide association studies

1.2.1. The history of genetic mapping

Genetic mapping of inheritance dates to the late 19th century with Gregor Mendel’s publication of the laws underlying inheritance patterns he observed in Pisum sativum [44].

However, in the beginning of the 20th century, cases involving more than two traits were found not to follow Mendelian random segregation principles consistently. For example, while eye and body color in Drosophila melanogaster tend to pass down together, this is not always true. These inconsistencies motivated researchers to look into the genetic mechanisms underlying multiple-trait inheritance.

Around the same time, cell biology was progressing quickly along with increased microscope resolution. Frans Alfons Janssens discovered the crossing-over of genes during meiosis and termed the process “chiasma.” Thomas Hunt Morgan found that this process explained his own observations in D. melanogaster cross-breeding experiments. He adopted the evidence collected by Janssens and developed the theory of genetic linkage. Morgan stated that as a result of meiotic recombination, any markers that are correlated in segregation are located close to one another in the genome [45]:

If the materials that represent these factors are contained in the chromosomes, and if those factors that “couple” be near together in a linear series, then when the parental pairs (in the heterozygote) conjugate like regions will stand opposed. There is good evidence to support the view that during the strepsinema stage homologous chromosomes twist around each other, but when the chromosomes separate (split) the split is in a single plane, as maintained by Janssens. In consequence, the original materials will, for short distances, be more likely to fall on the same side of the split, while remoter regions will be as likely to fall on the same side as the last, as on the opposite side.

Morgan highlighted the idea that traits are linked, and his student Alfred Sturtevant took the analysis forward. Sturtevant realized that all D. melanogaster have the same chromosome length. This facilitated construction of a genetic map, because the frequency of specific phenotypic pairs among all offspring could be used to calculate the relative distance of the relevant genetic markers. He proposed that the distance between any two genetic markers on a D. melanogaster chromosome was directly related to how often the traits were separated by recombination: the farther apart two markers lie, the more easily they are separated. Sturtevant successfully tested the hypothesis and produced the first chromosome map, which included the sex-linked genes in the correct order and relative spacing [46]. The non-random assortment of genetic markers in the chromosomes was later termed linkage disequilibrium (LD) [47].
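The mapping logic can be sketched in a few lines: the recombination frequency between two markers is the fraction of offspring carrying a non-parental combination, and for closely linked markers 1% recombination corresponds to roughly one map unit (centimorgan). The cross counts and marker names below are invented for illustration and are not Sturtevant's data.

```python
def recombination_frequency(recombinant, total):
    """Fraction of offspring showing a recombinant (non-parental) trait combination."""
    return recombinant / total

def map_distance_cm(recombinant, total):
    """Approximate map distance in centimorgans (1% recombination ~ 1 cM).
    This linear approximation only holds for closely linked markers."""
    return 100.0 * recombination_frequency(recombinant, total)

# Invented two-point cross counts for three hypothetical markers A, B and C
total = 1000
distances = {
    ("A", "B"): map_distance_cm(50, total),
    ("B", "C"): map_distance_cm(120, total),
    ("A", "C"): map_distance_cm(165, total),
}

# The pair with the largest distance flanks the third marker.
outer = max(distances, key=distances.get)
middle = ({"A", "B", "C"} - set(outer)).pop()
print(f"marker order: {outer[0]} - {middle} - {outer[1]}")
```

In this made-up example the A-C distance is slightly smaller than the sum of the A-B and B-C distances, which is the pattern expected when double crossovers between the outer markers go undetected.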

Since the first genetic mapping of Drosophila traits, geneticists have mapped numerous other model organisms by intensive interbreeding of mutants with detectable phenotypes. However, this cannot be carried out in humans for obvious ethical reasons. There were some successes in identifying linkage, including the first autosomal linkage in humans, the Lutheran blood group antigen system [48]. Then, in 1980, Botstein suggested the first systematic method to map genes, using an existing method involving restriction fragment length polymorphisms (RFLPs). RFLP analysis detects DNA polymorphisms through restriction endonucleases. Restriction endonucleases are sequence-specific enzymes that cannot recognize and digest a DNA sequence when a polymorphism alters their recognition site. Applying this method to human genomic DNA would result in DNA fragments of different lengths, indicating the presence of polymorphisms [49, 50]. Combining these polymorphisms as genetic markers with pedigree information enabled linkage mapping of human genetic diseases, for example Huntington’s disease [51], adult polycystic kidney disease [16], cystic fibrosis [52-54], and retinoblastoma [55, 56].
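The RFLP principle described above can be illustrated with a toy in-silico digest. The sketch assumes the EcoRI recognition site GAATTC (cut after the first G); the two allele sequences and the helper name are invented, and real RFLP analysis reads fragment sizes from a gel rather than from known sequence.

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Return fragment lengths after cutting `sequence` at every occurrence of
    `site`. The cut falls `cut_offset` bases into the site (EcoRI: G^AATTC)."""
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(cut - start)
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(len(sequence) - start)
    return fragments

# Invented alleles: allele_b carries a single-base change (GAATTC -> GAGTTC)
# that abolishes the second recognition site.
allele_a = "ATCG" + "GAATTC" + "TTAGGCAT" + "GAATTC" + "CCGTA"
allele_b = "ATCG" + "GAATTC" + "TTAGGCAT" + "GAGTTC" + "CCGTA"

print("allele A fragments:", digest(allele_a))
print("allele B fragments:", digest(allele_b))
```

The two alleles yield different numbers and lengths of fragments, and it is this difference that serves as the genetic marker tracked through a pedigree.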

Since the 1980s, genetic mapping has identified mutations in thousands of genes that are associated with Mendelian diseases through relatively low-density panels of markers. However, common diseases exhibit more complex inheritance patterns in the general population, and traditional gene-mapping methods cannot identify genetic causes effectively for such common diseases as hypertension, cardiovascular diseases, obesity, and CKD.

1.2.2. GWAS methods

GWAS examines genetic variants across the genome to identify associations between variants and phenotypes. Study power is one of the main considerations in designing a GWAS. Study power is the probability of correctly rejecting the null hypothesis when it is false. A large sample size is necessary to provide enough study power, especially when the association between a single variant and the disease is weak.
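To make the dependence of power on sample size concrete, the following sketch approximates the power of a two-sided test comparing allele frequencies between cases and controls. It is a rough normal-approximation calculation under assumed values: the allele frequencies are invented, the function name is hypothetical, and the default significance level of 5e-8 is the conventional genome-wide threshold, used here as an assumption.

```python
from statistics import NormalDist

def allelic_test_power(p_case, p_control, n_cases, n_controls, alpha=5e-8):
    """Approximate power of a two-sided z-test comparing allele frequencies
    between cases and controls (normal approximation; each person carries
    two alleles)."""
    norm = NormalDist()
    se = (p_case * (1 - p_case) / (2 * n_cases)
          + p_control * (1 - p_control) / (2 * n_controls)) ** 0.5
    z_effect = abs(p_case - p_control) / se
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic exceeds the critical value in either tail
    return norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)

# Invented scenario: risk allele frequency 0.22 in cases versus 0.20 in controls
powers = {n: allelic_test_power(0.22, 0.20, n, n) for n in (1000, 5000, 20000)}
for n, power in powers.items():
    print(f"{n} cases and {n} controls: power = {power:.3f}")
```

With a small frequency difference such as this one, the approximation suggests that thousands of cases and controls are needed before the variant is likely to be detected, which is consistent with the need for large cohorts noted above.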

Association studies are carried out in two stages. In the first stage (the discovery phase), the association of variants and phenotypes is determined. Variants are identified if they show statistically significant allele frequency differences associated with disease phenotypes or quantitative traits. In the second stage (the replication phase), significantly associated markers from the discovery phase are evaluated for association in additional independent study samples that resemble the initial discovery sample. Replication serves to confirm association and to detect potential bias. The replication analyses reduce the incidence of technical mistakes and spurious results among the identified associations.
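A discovery-phase test for an allele frequency difference can be sketched as a Pearson chi-square test on a 2x2 table of allele counts. The counts below are invented and the function name is hypothetical; in practice, association is usually tested with regression models that include covariates, so this is only the simplest form of the idea.

```python
import math

def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 table of
    allele counts. Returns (chi2, p_value)."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    total = case_alt + case_ref + control_alt + control_ref
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + control_alt, case_ref + control_ref]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom the chi-square tail equals erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Invented allele counts: 2,000 case alleles and 2,000 control alleles
chi2, p = allelic_chi_square(case_alt=500, case_ref=1500,
                             control_alt=400, control_ref=1600)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

In the replication phase the same test would be repeated on an independent sample, and only associations that remain significant would be retained.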

To date, more than 88 million genomic variants have been cataloged in the 1000 Genomes Project [57]. Directly assaying all known common polymorphisms is inefficient and impractical, because many variants are highly correlated due to LD. Genotyping a reduced set of variants in one strongly linked block can be used to infer the remaining variants through genotype imputation (discussed below). It is therefore important to select variants that can achieve genome-wide coverage within a feasible budget. Generating genetic maps based on direct genomic distance is rarely used for marker selection (e.g. using evenly spaced markers across the autosomal chromosomes). Instead, cataloged common human sequence variations are more commonly used to calculate LD and select genetic markers for the genotyping arrays [58, 59].
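As an illustration of how LD between two variants can be quantified, the sketch below computes the squared correlation r2 from phased haplotypes. It is a minimal example under the assumption that both variants are biallelic and coded 0/1 on each haplotype; the function name and the toy haplotypes are hypothetical.

```python
def ld_r_squared(haplotypes):
    """Return r^2 between two biallelic variants.

    `haplotypes` is a list of (allele_at_variant_A, allele_at_variant_B)
    pairs, one pair per phased haplotype, with alleles coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    # D measures the excess of the 1-1 haplotype over independence.
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("at least one variant is monomorphic")
    return d * d / denom


if __name__ == "__main__":
    linked = [(0, 0), (0, 0), (1, 1), (1, 1)]
    unlinked = [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(ld_r_squared(linked), ld_r_squared(unlinked))
```

When r2 is close to 1, genotyping one of the two variants is nearly as informative as genotyping both, which is the rationale for selecting a reduced set of tag markers.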

In the 1990s, array-based genotyping platforms assayed fewer than 10,000 SNPs, but this has subsequently increased to between 300,000 and 5,000,000 SNPs per array in recent years. GWAS efficiency and statistical power can be improved through genotype phasing and imputation. Phasing is a method to estimate haplotype. When genomic variants are genotyped, we observe pairs of alleles, but the strand and haplotype (a series of genomic variants that are on the same chromosome) to which they belong are unknown; in other words, we cannot identify the parent from which the individual acquired the haplotype.

Once the haplotype is phased, imputation is performed. Imputation is the statistical inference of unobserved genotypes using a set of observed-genotype-phased haplotypes as a reference [31, 33, 57, 60, 61]. This technique allows us to estimate the genetic markers that are not directly genotyped. The HapMap, 1000 Genomes Project and Haplotype Reference Consortium have produced extensive LD maps, and there are several methods to impute genotypes based on these LD relationships [62-65]. When combining genotype data from different studies, often the genotype data are generated on different platforms. By imputing genotypes, meta-analysis can infer genotypes that are unknown in some of the studies, allowing GWAS to increase sample size and power.

Interpretation of the results of large association studies faces the statistical difficulty of type I errors, where incorrect rejection of the null hypothesis yields false positive results. To cope with false positives, a multiple-testing corrected p-value is typically used. One of the most frequently used approaches is the Bonferroni correction for multiple tests, where the cutoff p-value of 0.05 is divided by the approximately one million independent tests (the number of common variants tested in the genomes of individuals of European ancestry) to generate the genome-wide significance threshold [66, 67].
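The arithmetic behind this threshold is short enough to write out. The sketch below divides the nominal cutoff of 0.05 by one million tests, as described above, and filters a dictionary of p-values; the helper names and the SNP identifiers are hypothetical placeholders.

```python
def bonferroni_threshold(alpha=0.05, n_tests=1_000_000):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def genome_wide_hits(p_values, alpha=0.05, n_tests=1_000_000):
    """Keep only variants whose p-value is below the corrected threshold."""
    threshold = bonferroni_threshold(alpha, n_tests)
    return {snp: p for snp, p in p_values.items() if p < threshold}


if __name__ == "__main__":
    # Hypothetical variant identifiers and p-values, for illustration only.
    example = {"snp_a": 3e-9, "snp_b": 2e-6, "snp_c": 0.04}
    print(bonferroni_threshold())
    print(genome_wide_hits(example))
```

A variant with a p-value of 2e-6 would pass a nominal 0.05 cutoff many times over, yet it does not reach the corrected threshold of roughly 5e-8.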

Hundreds of traits have been examined by GWAS, and variants that have reached genome-wide significance can be accessed at the GWAS Catalog supported by the National Human Genome Research Institute (NHGRI) and the European Bioinformatics Institute (EMBL-EBI) [34].

The majority of GWAS-identified loci are non-coding variants, and this limits the ability of GWAS to identify the causal variants and to link these genetic variants to the disease mechanisms.

1.2.3. GWAS and chronic kidney diseases

The National Kidney Foundation defines chronic kidney disease according to any of the following criteria: pathologic kidney abnormalities, persistent proteinuria, other urine abnormalities (e.g., renal hematuria), imaging abnormalities, and eGFR <60 mL/min/1.73 m2 on two occasions separated by more than 90 days [68]. CKD is classified in five stages, where stage 1 is the mildest form and stage 5 is the most severe, with eGFR <15 mL/min/1.73 m2. When stage 5 CKD is treated by dialysis or kidney transplant, it is also categorized as ESRD or renal failure. The current treatment options for ESRD are limited to dialysis or kidney transplant.

Most of these measurable traits of kidney dysfunction can be used as a quantitative trait in GWAS studies, including eGFR, proteinuria, histologic morphology, and metabolites observed in urine. eGFR is estimated using equations that incorporate endogenous filtration markers, for example serum creatinine (SCr) or cystatin C (CysC). Case-control studies can also use eGFR to define CKD as patients with eGFR below 60 mL/min/1.73 m2. The CKDGen Consortium published a meta-analysis of multiple cohorts with the largest sample size to date for kidney function eGFR in 2016. In total, 175,000 individuals were included and 53 loci were identified (29 known and 24 novel loci; most of these variants are associated with eGFRcrea, one with eGFRcys and four with CKD) [69].

Another key marker for CKD is urinary albumin. Albumin is a 66 kDa protein that circulates in the blood stream and cannot pass through the glomerular filtration barrier. However, when the structural integrity of the glomerular filtration barrier is compromised or albumin reabsorption along the proximal tubules is impaired, larger proteins like albumin will leak into the urine.

To measure proteinuria, a frequently used method is to collect samples at only one time point and measure the albumin and creatinine concentrations. Creatinine level is used to normalize the albumin measurement because albumin levels will vary with urine concentration. This method yields the urine albumin-to-creatinine ratio (UACR); microalbuminuria is defined as UACR of 30-300 mg/g, and macroalbuminuria as UACR >300 mg/g.
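The normalization and the cutoffs above can be sketched as a small calculation. This is an illustrative example only, assuming that both urine concentrations are reported in mg/dL (an assumption about units, not a statement from the text); the function names and the label "normal" for values below 30 mg/g are likewise choices made for this sketch.

```python
def uacr_mg_per_g(albumin_mg_dl, creatinine_mg_dl):
    """Urine albumin-to-creatinine ratio in mg of albumin per g of creatinine.

    Assumes both concentrations come from the same spot sample in mg/dL.
    """
    if creatinine_mg_dl <= 0:
        raise ValueError("creatinine concentration must be positive")
    # mg albumin per mg creatinine, scaled by 1000 to give mg per g.
    return albumin_mg_dl * 1000 / creatinine_mg_dl


def classify_albuminuria(uacr):
    """Apply the 30 and 300 mg/g cutoffs described in the text."""
    if uacr < 30:
        return "normal"
    if uacr <= 300:
        return "microalbuminuria"
    return "macroalbuminuria"


if __name__ == "__main__":
    ratio = uacr_mg_per_g(albumin_mg_dl=10, creatinine_mg_dl=100)
    print(ratio, classify_albuminuria(ratio))
```

Because both measurements come from the same sample, the ratio stays comparable between dilute and concentrated urine.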

In a GWAS study examining UACR, a total of 67,000 samples was used for meta-analysis, with independent replication in more than 51,000 samples, which identified 1 locus with a missense variant in the cubilin gene (CUBN) [70, 71]. As mentioned earlier, CKD is defined by reduced eGFR or by microalbuminuria. Several attempts were made to validate eGFR-associated SNPs in UACR; however, none of the eGFR-associated SNPs were associated with UACR or albuminuria, suggesting different genetic components for the two traits [72, 73].

Several specific etiologies of CKD were tested in case-control studies, including diabetic kidney disease (DKD), ESRD, IgA nephropathy, steroid sensitive nephrotic syndrome (SSNS), and membranous nephropathy (MN).

It is estimated that approximately 45% of CKD patients have diabetic kidney disease (DKD), which is diabetes accompanied by CKD markers including increased UACR and decreased eGFR. The Diabetes Control and Complications Trial (DCCT)/Epidemiology of Diabetes Interventions and Complications (EDIC) performed GWAS in 1,304 participants and found 13 independent loci [74] in type I diabetes; however, there were no SNPs that reached genome-wide significance. Similarly, no variant reached genome-wide significance in a trans-ethnic meta-analysis of type II diabetes in the Family Investigation of Nephropathy and Diabetes (FIND) cohort in 2008 [75].

In 2015, 1 locus on chromosome 6 reached genome-wide significance in a study of more than 6,000 unrelated individuals in the FIND cohort for DKD, and this result was replicated in a separate study of more than 7,000 individuals [76]. A meta-analysis showed 21 significant loci, which replicated loci that were close to genome-wide significance in both Pezzolesi et al and the FIND study [77]. In addition, one significant locus was found with combined discovery and replication cohorts [78]. Other studies have discovered loci associated with type I or type II diabetes-associated ESRD [78-80]. However, there was no overlap between DKD- and CKD-identified loci.

The disparities among studies possibly arose from the use of different diagnostic parameters for DKD. Some studies used diabetes mellitus (DM) combined with eGFR to classify DKD, while others combined DM with albuminuria levels. Another possible reason is differences in environmental effects such as glycemic control, DM duration, and blood pressure control, as they influence disease development and perhaps the association results.

An important locus on chromosome 22 associated with non-DM ESRD has been discovered using admixture mapping in African American cohorts [81, 82]. This locus was later fine-mapped to three variants close to the APOL1 gene, which showed an evolutionary advantage of resistance to trypanosome-mediated sleeping sickness in West Africa [83, 84].

Finally, IgA nephropathy (IgAN) is characterized by the deposit of IgA immune complexes in the glomerular mesangium, which causes progressive kidney injury [85]. The genetic component of IgAN has recently been shown by many linkage studies as well as GWAS results.

Two major GWAS meta-analyses replicated 9 loci and identified 6 novel loci [86, 87].

A collection of CKD-associated SNPs is summarized in Appendix 1, which lists 110 independent loci of GWAS findings for different kidney disease-associated traits; all neighboring SNPs with LD ≥ 0.2 were removed.

1.3 Expression quantitative locus analysis (eQTL)

1.3.1. The development of eQTL analysis

As discussed in the preceding section, the majority (around 90%) of the variants identified in GWAS are in non-coding regions of the genome. The non-coding variants present special challenges for research, because predicting functional consequences of genetic variants in non-coding regions has been difficult. Identifying quantitative phenotypes that are associated with these variants (quantitative trait loci, QTL) can facilitate mechanistic studies of disease development. One example of such a phenotype is gene expression level, which gives rise to expression QTL (eQTL).

The systematic measurement of gene expression levels through microarray methods facilitated the first few major eQTL analyses, proceeding from the yeast [88], maize [89], mouse [89], to human genome [90], and the eQTL method has been applied to several diseases.

However, most eQTL analyses of human samples were performed in immortalized cell lines or circulating cells [91]. The transcriptome is tissue-type specific, thus surrogate cell types cannot represent organ-specific regulation of gene expression by variants. There is a large number of ubiquitously expressed genes that show shared regulation of gene expressions in the genome, as shown in GTEx consortium [92] and other international projects that have generated comprehensive catalogs of human eQTL maps.

Adopting transcription levels measured by RNA-sequencing in eQTL was advanced through the work of the GTEx consortium [92]. The application of RNA-sequencing enabled the quantification of transcript levels, identification of isoforms, and examination of allele-specific expression, as well as long non-coding RNAs and novel transcripts in a much higher resolution than the array-based method [91, 93].


1.3.2. eQTL analysis

eQTL mapping examines the genotype of a genetic locus that is significantly associated with the transcript or gene expression levels local (cis) or distal (trans) to the transcription start site of each gene. The cis-eQTL concept is illustrated in Figure 1-1, where the distance is conventionally specified as 1Mb from the transcription start site of each gene, as most regulatory control takes place in the vicinity of genes. As shown in the figure, an eQTL may contain multiple genetic variants that tag the actual causal variant or variants in close LD, which accounts for the genotype-dependent gene expression.

Figure 1-1 | Illustration of cis-eQTL. The genetic locus where the genotype of a variant is significantly associated with the gene expression levels with one or more genes in a defined distance. Gene expression levels are the quantitative phenotype and the genetic loci controlling the phenotypic changes are identified by comparing gene expression and genotypic data from the same set of samples and determining the associations between them.
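The association described above can be sketched for a single variant-gene pair as a simple linear regression of expression on allele dosage, restricted to variants within the 1Mb cis window. This is a minimal illustration, not the pipeline used in this work: it assumes the variant and the gene lie on the same chromosome, ignores covariates and significance testing, and uses hypothetical names throughout.

```python
CIS_WINDOW_BP = 1_000_000  # conventional 1Mb window around the TSS


def is_cis(variant_pos, tss_pos, window=CIS_WINDOW_BP):
    """True if the variant lies within the cis window of the gene's TSS.

    Assumes both positions are on the same chromosome.
    """
    return abs(variant_pos - tss_pos) <= window


def eqtl_slope(dosages, expression):
    """Least-squares slope of expression on allele dosage (0, 1 or 2).

    The slope is the estimated change in expression per additional
    copy of the counted allele.
    """
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    if sxx == 0:
        raise ValueError("variant is monomorphic in this sample")
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    return sxy / sxx


if __name__ == "__main__":
    dosages = [0, 1, 2, 0, 1, 2]
    expression = [1.0, 1.5, 2.0, 1.0, 1.5, 2.0]
    if is_cis(variant_pos=1_250_000, tss_pos=1_000_000):
        print(eqtl_slope(dosages, expression))
```

In practice the slope would be tested against zero and corrected for the number of variant-gene pairs examined, in the same spirit as the multiple-testing correction used for GWAS.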

Stringent genotype quality control (QC) is crucial to reduce the incidence of technical problems and spurious findings in the identified associations. After QC, genotypes are subject to phasing and imputation using large reference panels to infer the untyped genotypes. Using large reference panels also increases the accuracy of imputation for variants with lower frequencies [94-96]. Gene expression analysis using RNA-seq instead of microarray to quantify the transcript expression level provides a few advantages. RNA-seq provides transcript isoform expression as well as information for allele-specific expression.


eQTL analysis of peripheral blood samples showed that the number of genes with identified local eQTLs rose from around 5% in 100 samples to approaching 50% of all genes after meta-analysis using several thousand samples [97]. This establishes that increased sample size can increase resolution of variants explaining the transcript variance [97, 98].

Recent reports from GTEx with 44 tissue types showed that although increased sample size is positively related to more shared cis-eQTL, weaker associations still remain that are unique to the individual tissues and have higher effect size than shared eQTL [98].

1.3.3. Finding colocalization of GWAS and eQTL loci

The general model used to understand GWAS signals is that a causal variant is localized to a regulatory region in the disease-causing cell type. The variant alters TF binding and further affects the expression level of the genes.

One way to prioritize studying regions associated with potential causal variant is to combine statistical association of genetic variants with gene expression, and with complex traits, by colocalization of signals.

An important example of the integrative analysis from GWAS to eQTL to phenotype was demonstrated with the FTO locus. This locus was discovered to be associated with obesity in several GWAS [99, 100], with the reproducibility suggesting that it was the most significant signal for obesity. The obesity-associated FTO region was shown to have enhancer properties identified using histone mark modifications as well as enhancer assays [101].

The FTO region was also demonstrated to interact with the promoters of the IRX3 gene through circular chromosome conformation capture followed by sequencing (4C-seq) [102]. In short, the FTO region regulates the IRX3 gene through its enhancer activity. The rs1421085 T-to-C variant in the obesity-associated FTO region was shown to disrupt a conserved ARID5B repressor motif. Under the normal condition of the T allele at rs1421085, ARID5B binds to the sequence and suppresses the enhancer, keeping IRX3 and IRX5 expression low and favoring the browning of mesenchymal progenitors into beige adipocytes. When the ARID5B binding site is disrupted by the C allele at rs1421085, IRX3 and IRX5 expression increases and the pre-adipocyte differentiates into a lipid-storage white adipocyte, leading to decreased mitochondrial respiration and thermogenesis with increased adipocyte size [101].

Studies have shown that trait-associated GWAS SNPs are significantly more likely to be eQTLs than expected by chance [103]. To highlight the target genes of GWAS SNPs, we can co-localize the eQTL and GWAS signals. The colocalization of genetic variant signals implies that the gene with eQTL at the same locus as GWAS can influence the disease trait [104]. Several methods have been proposed, and each has a different hypothesis and approach for mapping the location and signals from the summary statistics in eQTL and GWAS.

Among many GWAS and eQTL integration studies, I reviewed two that were used in Chapter 2 as they provided superior understanding of colocalization. The Sherlock method [105] uses Bayesian statistics to match the association signals from GWAS and eQTL for individual genes, in order to distinguish whether the significant signals come from the same variant. A second method, Coloc [106], also based on Bayesian statistics, tests the hypotheses that the colocalization of GWAS and eQTL signals represents a causal or a coincidental relationship.
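To make the Bayesian colocalization idea concrete, the sketch below combines per-variant log approximate Bayes factors from two traits into posterior probabilities for five hypotheses: no association (H0), association with the first trait only (H1), with the second trait only (H2), with both traits through different variants (H3), and with both traits through a shared variant (H4). It is a simplified illustration in the spirit of Coloc, not the package itself; the default prior values and the Wakefield-style Bayes factor are assumptions made for this example, and all function names are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_diff_exp(a, b):
    """log(exp(a) - exp(b)); returns -inf when the difference underflows."""
    if b >= a:
        return float("-inf")
    return a + math.log1p(-math.exp(b - a))


def wakefield_log_abf(beta, se, prior_sd):
    """Log approximate Bayes factor for one variant (assumed normal prior)."""
    z = beta / se
    r = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    return 0.5 * (math.log(1 - r) + r * z * z)


def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities [H0, H1, H2, H3, H4] for one locus.

    labf1 and labf2 hold log Bayes factors for the same variants, in the
    same order, for trait 1 (e.g. GWAS) and trait 2 (e.g. eQTL).
    """
    s1 = log_sum_exp(labf1)
    s2 = log_sum_exp(labf2)
    s12 = log_sum_exp([a + b for a, b in zip(labf1, labf2)])
    log_h = [
        0.0,
        math.log(p1) + s1,
        math.log(p2) + s2,
        # All variant pairs minus the pairs that are the same variant.
        math.log(p1) + math.log(p2) + log_diff_exp(s1 + s2, s12),
        math.log(p12) + s12,
    ]
    total = log_sum_exp(log_h)
    return [math.exp(v - total) for v in log_h]


if __name__ == "__main__":
    shared = coloc_posteriors([10.0, 0.0, 0.0], [8.0, 0.0, 0.0])
    distinct = coloc_posteriors([10.0, 0.0, 0.0], [0.0, 8.0, 0.0])
    print("shared variant, H4:", shared[4])
    print("distinct variants, H3 vs H4:", distinct[3], distinct[4])
```

When both traits peak at the same variant, most of the posterior mass falls on H4; when the peaks sit on different variants, H3 outweighs H4, which is the distinction between a shared signal and a coincidental overlap.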

These methods explain the significance of using both eQTL and GWAS summary statistics to assist in identification of potential causal variant(s) at a specific locus. To further narrow down and identify the potential disease-causing variants, we need to further annotate the genome with regulatory elements in a cell-type specific manner, including histone tail modifications and cytosine methylation levels.

1.4 Epigenetics

1.4.1. Epigenetics in chronic kidney disease

As GWAS only explains 3% of CKD heritability, there is a need to understand the “missing” heritability. Increasing the sample size of GWAS cohorts and using denser genotyping platforms and different ethnic groups will be critical for this effort. Another potentially important mechanism for understanding missing heritability is transgenerational epigenetics.

There is a large variation in the number of nephrons in the human kidney, ranging from approximately 200,000 to 2 million. The number of nephrons at birth is inversely correlated with the risk of hypertension and CKD [107]. Low nephron count is correlated with low birth weight [108], which often results from intrauterine malnutrition [109, 110]. Low nephron number causes increased glomerular pressure on the remaining nephrons, and the increased workload on each glomerulus can induce microalbuminuria and glomerulosclerosis, resulting in nephron loss and renal injury [111].

Intrauterine malnutrition is associated with metabolic syndrome, hypertension, and renal disease development by fetal programming [111-114]. A direct correlation of low birth weight to microalbuminuria in type I diabetes has been established in the general population [115]. Rodent models of intrauterine growth retardation have indicated that lack of nutrient or oxygen availability results in the development of salt-sensitive hypertension and microalbuminuria [116]. Together, these results indicate that the epigenome could be an important mediator of long-lasting intrauterine environmental effects, particularly because the epigenome is plastic during development [117-119].

In addition, several environmental factors have been shown to have a long-lasting effect on CKD development [120]. Environmentally induced epigenetic modification could represent a plausible mediator explaining the impact of environmental factors on CKD development. Although differentiated cells have a stable epigenome, lately the cellular memory or programming effect of the environment has been described to extend beyond development as well (Fig. 1-2).

Epigenetics can explain missing heritability in cases of transgenerational changes. Transgenerational epigenetic effects have been shown in rodent models [121, 122] and also proposed in two human studies [123, 124]. Intrauterine exposure to maternal diabetes has been established as having an adverse effect on BMI and insulin secretory function in offspring [125]. The transgenerational effect was further studied in a Pima Indian (U.S.) cohort with or without intrauterine exposure to diabetes. The Pima Indian study showed that the risk can be explained by cytosine methylation differences that are enriched in type 2 diabetes, maturity onset diabetes of the young (MODY), and Notch pathways [126].

One key question is how does nutrient availability change the epigenome? Most chromatin-modifying enzymes require substrates or cofactors that are intermediates of cell metabolism [127-130]. For example, acetylation by histone acetyltransferases depends on the local subcellular acetyl-CoA concentration [131, 132]. In vitro evidence supports the conclusion that fluctuations of metabolite levels modulate activities of chromatin-modifying enzymes, thereby influencing chromatin dynamics [133-135].

Cells and tissues are sensitive to epigenetic shift while the epigenome is plastic and being established during development and differentiation. Change in the epigenome during development plays an important role in phenotype development later on, by modulating how cells respond to stimuli [136, 137]. For example, while podocytes without an essential component of H3K4 methyltransferase develop into functionally competent cells, it was shown that they have an abnormal response to environmental stressors (e.g. aging) [138]. Recent studies suggest that we can predict cellular aging based on the epigenetic signature [139-142]. This indicates that the epigenome could provide a traceable link between the genome and the environment. In short, the epigenome is emerging as a biochemical record of relevant life events.

1.4.2. Epigenetics and CKD development in mouse models

Recent research supports the role of epigenetics in CKD development. Bechtel et al. showed differences in the cytosine methylation profiles of fibroblasts isolated from control and CKD mouse kidneys [143]. Using a folic acid-induced kidney fibrosis model, the investigators showed that hypermethylation of a guanosine triphosphatase-activating protein, RASAL1, causes increased Ras activation in fibroblasts, resulting in fibroblast proliferation and fibrosis development. Genetic deletion or chemical inhibition of DNMT1 is able to reduce fibrosis development in the folic acid-induced kidney fibrosis model. Future studies using DNMT inhibitors in other progressive kidney disease models, as well as cell-type-specific Dnmt1 deletion in mice, will aid in elucidating how changes in cytosine methylation contribute to kidney fibrosis development.

The contribution of histone deacetylases (HDACs) has also been described in different mouse models of acute and chronic kidney injury [144-150]. The HDAC inhibitor Trichostatin A (TSA) ameliorates proliferative glomerulonephritis [151], long-term glomerulosclerosis [152, 153], and proteinuria in a nephrotoxic serum nephritis model [152]. Pretreating mice with valproic acid, a class I-selective HDAC inhibitor, was highly effective in reducing proteinuria as well as sclerosis in the doxorubicin-induced focal segmental glomerulosclerosis model [154]. Two separate studies have shown that HDAC inhibitors can also attenuate diabetes-induced renal hypertrophy and renal damage in rodents [155, 156].

Together, these chemical inhibitor-based studies strongly suggest that histone deacetylases play a major role in CKD development. The proposed mechanism is that HDAC inhibitors are involved in reactivating major developmental pathways and promoting tissue repair after injury. Unfortunately, HDAC inhibitors, similar to many inhibitors, have cell-type specificity issues. Therefore, follow-up studies using genetically engineered site-specific knockout animals will be essential to determine the role and contribution of histone acetylation in kidney fibrosis.

Cell-culture-based studies from the Natarajan and El-Osta laboratories have indicated that cells that are grown in a high-glucose medium develop rapid and persistent changes in their histone tail modification patterns [157-160]. In cultured aortic endothelial cells, transient hyperglycemia caused changes in H3K4me1 on the promoter of the p65 subunit of nuclear factor-κB, and subsequently was associated with increased p65 transcript levels. The expression of p65 was increased in endothelial cells even several days after they were returned to a normal glucose medium.

A similar mechanism has also been shown in diabetic mice. The sustained proinflammatory phenotype observed in vascular smooth muscle cells cultured from type 2 diabetic db/db mice was associated with reduced levels of the repressive mark H3K9me3 on the affected inflammatory gene promoters [161]. Treatment of glomerular mesangial cells with transforming growth factor (TGF)-β1 or high glucose could lead to changes in key active and repressive histone modifications at the promoters of inflammatory and fibrotic genes [162, 163].

These observations link episodes of hyperglycemia with a persistent increase of inflammation and fibrosis, which are important characteristics of diabetic kidney disease.

1.4.3. Epigenetics and CKD development in human studies

Large-scale human studies showing epigenetic differences in CKD have not been published, because the epigenome is cell-type specific and therefore human kidney tissue and kidney cell lines are necessary for experiments. There are a few human epigenome-wide association studies that used peripheral blood mononuclear cells or other surrogate cell types.

Sapienza et al [164] described differences in cytosine methylation profiles of diabetic patients with and without ESRD. Infinium HumanMethylation27 BeadChip arrays were used to determine cytosine methylation changes. Although there was significant patient heterogeneity, Sapienza et al identified close to 110 loci with statistically significant methylation differences in ESRD samples compared with those without ESRD. Most loci showed lower methylation levels in the ESRD cases, and many of the differentially methylated regions were in close proximity to genes that previously had been shown to express differently in patients with kidney diseases. Although this is an important first step to understand the epigenome of CKD, future research will be required to determine whether markers found in human urine correlate to epigenetic changes in renal epithelial cells, which can further be used to develop disease markers using non-invasive sample collecting method.

Another study, by Ciavatta et al [165], examined the potential contribution of the epigenome to antineutrophil cytoplasmic antibody (ANCA)-associated vasculitis. ANCA disease is associated with spontaneous development of autoantibodies against myeloperoxidase (MPO) and proteinase 3 (PR3) proteins. Ciavatta et al found that H3K27me3, a chromatin modification associated with silenced promoters, was decreased at the PR3 and MPO promoter areas in ANCA patients compared with healthy controls. H3K27me3 was also associated with decreased cytosine methylation of PR3 and MPO CGI promoter areas. The investigators proposed that normally higher RUNX3 levels were responsible for recruiting H3K27 methyltransferase, a subunit of enhancer of zeste homolog 2 (EZH2), to PR3 and MPO loci. Loss of RUNX3 expression is a plausible cause for the loss of H3K27 methylation and the expression of MPO and PR3 in ANCA patients. This was an elegant illustration that epigenetic modifications associated with gene silencing are perturbed at ANCA autoantigen-encoding genes, potentially contributing to autoimmunity development.

Future studies could determine whether similar mechanisms play a role in other autoimmune diseases as well. In Chapter 3, we discuss the regulation of gene expression in the pathogenesis of CKD through cytosine methylation. The alterations in both gene expression and cytosine methylation in the TGF-β pathway (TGFBR3, SMAD3, SMAD6) were observed in the CKD patient samples, and these genes were known to be the key regulators of kidney fibrosis development and have important roles in CKD development [166].

1.4.4. Epigenetics Overview

The epigenetic mark is characterized by chemically modified DNA or associated proteins.

Epigenetic marks include cytosine modification (mainly methylation) and histone tail modification, which are changes caused by nutrition and other environmental factors [167, 168]. The epigenetic mark is inherited during cell division to maintain cell identity, thereby mediating between the stably inherited genome and the changing environment. Characterization of the epigenetic mark has played a major role in defining cell-type-specific gene expression.

A classic transcription unit in a multicellular eukaryote contains both clusters of proximal promoter elements and several types of cis-acting regulatory sequences including insulators, promoters, enhancers, silencers, and locus-control regions [169]. Insulators border and separate transcriptional units. Promoters are usually located at the 5' end of the transcription start site and comprise the minimal sequences required for accurate transcription initiation. The promoter region can encompass a cluster of cytosine-guanine nucleotide pairs within the linear sequence, known as CpG islands (CGI) [169, 170].

For gene transcription to occur, not only promoters but also long- and short-range regulatory regions are needed. These cis-type gene-regulatory regions are often highly cell-type-specific. Simultaneous binding of transcription factors (TFs) to each other and to the long- and short-range regulatory regions results in genomic DNA loops that join distant regulatory DNA sequences together (Figure 1-2) [171, 172].

Enhancers are also of critical importance, and can be located from a few kilobases (short-range) to hundreds of kilobases (long-range) away from the regulated gene [169, 173, 174]. Enhancers can be found in the opposite DNA strand, downstream of the regulated gene, or at intronic regions. Enhancers usually show significant enrichment for the binding of cell-type-specific TFs [174]. TF binding, or the presence of multiple enhancers, is critical for regulating the optimal transcription strength following an external stimulus. However, sometimes the loss of even one enhancer can result in loss of expression level of a gene [175].


While the critical importance of enhancers has been increasingly appreciated, genome-wide identification of enhancers has been exceedingly difficult. Recent advances in epigenetics, sequencing and computational methods have helped to define these important gene regulatory regions and will be discussed in detail in this chapter.

Histone modification: The DNA in the cell nucleus is wrapped around small, positively charged proteins called histones, to form a larger organized structure called chromatin. When 147 base pairs of DNA are wrapped around histones, it forms a compact structure called a nucleosome [176, 177]. In addition to DNA, nucleosomes are comprised of a histone octamer consisting of two copies each of the core histones H2A, H2B, H3, and H4 [175].

Nucleosomes play important roles in compacting DNA into the nucleus. When the nucleosome is in condensed condition, called heterochromatin, it is inaccessible to transcription factors and transcriptionally inactive [178]. Modification of the amino acids in the histone tails appears to be central to the processes that regulate transcription, including methylation, acetylation, phosphorylation, and ubiquitination [179]. The histone code hypothesis proposes that chromatin-DNA interaction (i.e. the gene regulatory complex) is guided by histone tail modification [170]. There are molecular histone code “writers” that establish these modifications, and “readers”, the enzymes that use these histone tail modifications to assign gene regulatory function to DNA [180-183]. More than 60 histone modifications have been described [184, 185], although some are redundant and the overall set can be simplified into 10-15 different patterns, each with specific gene regulatory functions (Table 1-1).

Figure 1-2 | The eukaryotic transcription unit. The eukaryotic transcription complex is characterized by a combination of transcription factors binding to promoters and one or more looping enhancers. Enhancers can be upstream, downstream or intronic. Enhancers are cell-type-specific. Both enhancer looping and TF binding are essential for messenger RNA transcription. Different chromatin regions are characterized by different histone tail marks. CTCF, CCCTC-binding factor. TF, transcription factor.

Functional genomic elements are identified through sequence-association with histone modification, TF binding and DNaseI hypersensitivity [186]. Cell-type-specific gene regulation can be understood by defining the cellular histone code (the epigenome). The pioneering ENCODE and Roadmap Epigenome projects have resulted in the ongoing characterization of cell-type-specific annotations of gene regulatory regions through the use of cultured human cell lines and human tissues [187].

Transcription factor binding is a key characteristic of a regulatory region; gene regulatory regions are mostly open and nucleosome-free. Nucleosome-free DNA is sensitive to DNaseI digestion, allowing the identification of gene regulatory regions by DNaseI hypersensitivity methods [188-190]. Nucleosomes directly adjacent to DNaseI-hypersensitive sites are characterized by specific histone tail modifications.

For example, H3K4me3 in combination with H3K4me2, H3K4me1 and H3K27ac usually marks active promoter areas. On the other hand, H3K27me3 and H3K9me3 usually mark inactive promoters. H3K4me1, H3K27ac and H3K9ac marks together are associated with enhancer activity, and H3K36me3 and H4K20me1 mark transcribed regions both for coding and non-coding transcripts [170, 191]. The binding of the transcriptional repressor, CTCF, defines insulator areas that separate different transcriptional units.

Table 1-1 | Gene regulatory function of histone marks

| Chromatin Mark | Chromatin State |
| --- | --- |
| CTCF§ | Insulator/repetitive/CNV¶ |
| H3K27me3 | Poised promoter/repressed promoter |
| H3K36me3 | Transcription transition and elongation |
| H4K20me1 | Transcription transition |
| H3K4me1 | Enhancer |
| H3K4me3 | Promoter |
| H3K4me2 | Promoter/enhancer |
| H3K27ac | Active enhancer |
| H3K9ac | Active promoter |

§ CTCF, CCCTC-binding factor; ¶ CNV, copy number variation

Combinations of histone marks further tune gene regulatory regions within the genome.

In particular, H3K4me1 and H3K27ac together have been shown to mark the location of cis-regulatory active enhancers. H3K9me3 and H4K20me3 histone tail modifications are used to locate satellite, telomeric, and repeat regions. In embryonic stem cells, large genomic regions have histone modifications with characteristics opposite to those seen in differentiated cell types.

For example, many developmentally important transcription factor-encoding regions in embryonic stem cells contain both the active mark of H3K4me3 and the transcriptional repressor mark of

H3K27me3. These regions are called bivalent domains [192-194]. Promoters of key developmental transcription factors that control differentiation typically have bivalent domains

[195-197].
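The single-mark assignments in Table 1-1 and the mark combinations described above can be combined into a simple lookup. The following is a minimal Python sketch, not the annotation procedure used by ENCODE or Roadmap; the function name `annotate_region` and the choice to check combinations before single marks are assumptions made for illustration.

```python
# Single-mark chromatin states, transcribed from Table 1-1.
MARK_TO_STATE = {
    "CTCF": "Insulator/repetitive/CNV",
    "H3K27me3": "Poised promoter/repressed promoter",
    "H3K36me3": "Transcription transition and elongation",
    "H4K20me1": "Transcription transition",
    "H3K4me1": "Enhancer",
    "H3K4me3": "Promoter",
    "H3K4me2": "Promoter/enhancer",
    "H3K27ac": "Active enhancer",
    "H3K9ac": "Active promoter",
}

# Mark combinations named in the text; checked before single marks
# (this ordering is an assumption of the sketch).
COMBINATION_RULES = [
    ({"H3K4me3", "H3K27me3"}, "Bivalent domain"),
    ({"H3K4me1", "H3K27ac"}, "Active enhancer"),
]


def annotate_region(marks):
    """Return the chromatin state label(s) for the marks seen on one region."""
    marks = set(marks)
    for required, label in COMBINATION_RULES:
        if required <= marks:  # all marks of the combination are present
            return [label]
    # Fall back to the per-mark states; unknown marks are ignored.
    return sorted({MARK_TO_STATE[m] for m in marks if m in MARK_TO_STATE})


print(annotate_region(["H3K4me3", "H3K27me3"]))
print(annotate_region(["H3K36me3"]))
```

A real annotation would weigh signal strength and region size rather than mere presence or absence of a mark.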

During differentiation, cells will take on either the active or the repressive mark depending on which pathway they follow. Histone methylation is attributed to three families of enzymes that catalyze the addition of methyl groups donated from S-adenosylmethionine to histones. The SET domain-containing proteins [198] and DOT1-like proteins have been shown to methylate lysines (K) [199, 200], and members of the protein arginine (R) N-methyltransferase family [201] have been shown to methylate arginines. H3K27me3 is catalyzed by the EZH2 subunit of the Polycomb repressive complex 2 (PRC2) [202]. Histone acetylation is catalyzed by histone acetyltransferases (HATs) [203-205], and acetyl groups are removed by histone deacetylases

(HDACs) [206-208]. These enzymes can be targeted by HDAC inhibitors such as valproic acid and hydroxamic acid derivatives [209].


Cytosine methylation: Methylation of cytosines in the fifth position (5-methylcytosine, 5-mC) is an epigenetic modification. The majority of the genome has a low cytosine/guanine content, and these cytosines usually are methylated, including transposons and other repeated elements. The cytosine nucleotide followed by the guanine nucleotide is expressed as “CpG”.

Methylated cytosines in CpGs often spontaneously undergo deamination and become thymine (yielding TpG), resulting in underrepresentation of CpG [29]. Cytosine-rich regions are organized into CpG islands (CGIs) in the genome [210, 211]; these are short regions, about 300-3,000 bp, characterized by high cytosine content (> 60%) [212, 213].
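CpG underrepresentation can be quantified for any sequence window. The sketch below computes GC content together with the observed/expected CpG ratio, a statistic commonly used when scanning for CGIs; the ratio and the function name `cpg_stats` are not taken from the text above and are introduced here only as an illustration.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA sequence.

    The expected CpG count assumes C and G occur independently:
    expected = (#C * #G) / length.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides on the given strand
    gc_content = (c + g) / n
    expected = c * g / n
    obs_exp = cpg / expected if expected else 0.0
    return gc_content, obs_exp


# A CpG-dense toy sequence versus a sequence without any C or G.
print(cpg_stats("CGCGCGCG"))
print(cpg_stats("ATATATAT"))
```

In bulk genomic sequence the ratio is well below 1 because of the deamination described above, whereas CGIs retain a ratio close to what base composition alone would predict.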

CGIs are enriched in gene promoter regions. Cytosines in CGIs are usually unmethylated, and promoter methylation correlates negatively with transcript levels. There are two basic models for promoter hypermethylation-induced transcriptional silencing: DNA methylation can repress transcription directly by blocking transcriptional activators from binding to cognate DNA sequences, or methyl-binding proteins can recognize methylated DNA and recruit co-repressors to silence gene expression. Genome-wide studies have also indicated that methylation of regions adjacent to CGIs, known as CpG shore regions, plays a key role in diverse biological processes [214]. While the functional effect of gene body methylation has been debated, it has been shown that regulation of splicing through binding of CTCF (which is sensitive to cytosine methylation levels) results in pausing of RNA polymerase II [215-218]. Gene body methylation may also enhance transcription through inhibition of initiation of cryptic intragenic transcription [219-221].

Cytosine methylation levels are usually the highest in fully differentiated cells; however, cytosine methylation is erased after fertilization, with only a handful of imprinted regions remaining methylated in zygotes [222-224]. During cell-type-specific differentiation, cytosine methylation is established by DNA methyltransferases (DNMTs) [225-229]. Dnmt1 is responsible for maintaining the established DNA methylation levels during cell division. Homozygous Dnmt1 mutant embryos are stunted, delayed in development, and do not survive past midgestation [225].

Dnmt3a and Dnmt3b participate in de novo DNA methylation of unmethylated cytosines and


appear to methylate different genomic DNA regions in vivo. Dnmt3a actively methylates naked

DNA and open nucleosomal DNA [230], while Dnmt3b is specifically required for methylation of centromeric minor satellite repeats. It has been shown that Dnmt3a and Dnmt3b double-mutant mice have severe growth defects and do not survive over 4 weeks after birth. While Dnmt3a mutant mice develop to term, Dnmt3b knockout mice are not viable and show severe developmental defects at later stages of development [225, 226, 231-234]. Mutations of human DNMT3B are found in immunodeficiency, centromeric instability, and facial anomalies syndrome, a developmental defect characterized by hypomethylation of pericentromeric repeats [235, 236].

In addition to cytosine methylation, cytosines can be hydroxymethylated: the TET (Ten-Eleven-Translocation) enzymes oxidize 5-mC to 5-hydroxymethylcytosine (5-hmC), adding a hydroxyl group to the methyl group at the C5 position of cytosine. 5-hmC has been shown to regulate gene expression in coordination with transcription factors [237]. Disruption of hydroxymethylation can result in disordered cell function, for instance in myeloid cancers [238-240].

1.4.5. Methods to study the epigenome

Analyzing condition-relevant cell types is critical to understanding the role of epigenetics in disease development (e.g. analysis of renal epithelial cells to understand kidney disease).

Because the cell type is the key determinant of the epigenome, it is unlikely that much mechanistic understanding would be gained by analyzing cells that are not functionally relevant to disease development. Cytosine methylation is one of the most stable epigenetic marks; it can be studied even in archived samples because genomic DNA is stable [241-243].


DNA Methylation: There are three major approaches to studying cytosine methylation, based on restriction enzyme digestion, affinity enrichment, and sodium bisulfite conversion. Each method has its strengths and caveats. Currently, all of these methods are available for both locus-specific and genome-wide studies when coupled with whole-genome-covering microarrays or sequencing technologies.

The restriction enzyme digestion method is based on isoschizomer enzymes that recognize the same nucleotides but differentially digest DNA based on cytosine methylation status. The most frequently used isoschizomers for DNA methylation studies are HpaII and MspI, which recognize the same sequences but digest differently according to the methylation levels

[244]. MspI is methylation-insensitive, whereas HpaII is sensitive to methylation. The methylation-sensitive enzyme is not able to cleave at methylated cytosine residues, leaving methylated DNA intact. By comparing the fragments after restriction-enzyme treatments, site-specific methylation differences can be determined (Figure 1-3A). The digested DNA can be subjected either to polymerase chain reaction (PCR) for locus-specific study, or to microarray or sequencing for genome-wide screening.

SmaI and its neoschizomer XmaI are another favored enzyme pair. XmaI and SmaI both recognize the sequence CCCGGG, but they cut it at different positions. SmaI is sensitive to cytosine methylation and is blocked by CG methylation; at unmethylated sites it leaves blunt ends (5’-GGG). XmaI is a methylation-insensitive enzyme that also cleaves at sites with methylated CGs, leaving 5’-CCGG overhangs (Figure 1-3B). By comparing the two digests, or by using the two enzymes serially, differentially methylated regions can be detected [245-247].
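The logic of the HpaII/MspI comparison can be illustrated with an in-silico digest. In this minimal sketch both enzymes cut CCGG between the two cytosines (C^CGG), and the methylation-sensitive digest skips sites whose internal cytosine is methylated. The function name `digest`, the toy sequence, and the representation of methylation as a set of 0-based positions are assumptions of the example.

```python
import re


def digest(seq, methylated, sensitive):
    """Cut seq at every CCGG site (C^CGG) and return the fragments.

    methylated: set of 0-based positions of methylated cytosines.
    sensitive:  True mimics HpaII (blocked when the internal C is
                methylated); False mimics MspI (cuts regardless).
    """
    cuts = []
    for match in re.finditer("(?=CCGG)", seq):
        site = match.start()
        internal_c = site + 1  # second C of CCGG, the CpG cytosine
        if sensitive and internal_c in methylated:
            continue  # methylated site is protected from digestion
        cuts.append(site + 1)  # cut between the first and second C
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


seq = "AACCGGTTTTCCGGAA"
methylated = {3}  # internal C of the first CCGG site is methylated

msp1 = digest(seq, methylated, sensitive=False)
hpa2 = digest(seq, methylated, sensitive=True)
print([len(f) for f in msp1])  # both sites are cut
print([len(f) for f in hpa2])  # the methylated site stays intact
```

The difference in fragment sizes between the two digests localizes the methylated site, which is the readout used in the enzyme-based assays described above.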


The advantage of restriction enzyme digestion is its sensitivity to hypomethylated regions

(e.g., CGIs). The drawback of this method is that coverage and resolution are limited to regions that the restriction enzymes recognize. One of the most commonly used restriction enzyme methods is the HpaII tiny fragment enrichment by ligation-mediated PCR (HELP) assay, developed by Suzuki et al

[248, 249] and Thompson et al [250].

Affinity enrichment methods use antibodies against m5C, or methyl group binding protein domains (MBD) that target m5C, such as MeCP2 or MBC. Five proteins with mCpG-binding motifs have been identified in the MBD family: MBD1, MBD2, MBD3, MBD4, and MeCP2 [251].

To capture m5C with antibodies and proteins, the genomic DNA is randomly sheared into small fragments by sonication. The fragments are then subjected to immunoprecipitation with antibodies targeting methylated cytosines in DNA (MeDIP) [252, 253]. The advantage of the MeDIP method is true genome-wide coverage, because it is not limited by restriction enzyme sites. However,

Figure 1-3 | Schematic diagram of the restriction enzyme-based method to identify differentially methylated regions. (A) MspI and HpaII digestion. HpaII is a methylation-sensitive enzyme and MspI is a methylation-insensitive enzyme. Comparing the fragments produced by the two digestions reveals the differentially methylated regions. (B) Methylated DNA and reduction of genome complexity were achieved by serial digestion with SmaI (methylation sensitive) and XmaI (methylation insensitive) restriction enzymes. Both (A) and (B) methods are suitable for whole genome analysis using microarray or sequencing platforms.


the limited resolution of MeDIP is a key disadvantage. The binding of the antibody is determined by the cumulative methylation level of a region, and the resolution is limited by the degree of fragmentation from the sonication process.

Determination of cytosine methylation by sodium bisulfite conversion is an unbiased and sensitive method, and is currently the “gold standard” for determining cytosine methylation [254-257]. The method is based on the principle that unmethylated cytosines will be converted to uracil after bisulfite treatment, but methylated cytosines will not. Fragments can later be subjected either to cloning and sequencing or to mass spectrometry (e.g. the Sequenom EpiTYPER-type equipment, Sequenom, San Diego, CA) to interrogate locus-specific cytosine methylation.
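The bisulfite principle can be made concrete with a short simulation. The sketch below converts unmethylated cytosines to T (uracil is read as thymine after amplification and sequencing) and then recovers the methylation state by comparing the converted read with the reference. Function names and the toy sequence are assumptions; real pipelines must also handle both strands, incomplete conversion, and sequence variants.

```python
def bisulfite_convert(seq, methylated):
    """Simulate bisulfite treatment followed by amplification.

    Unmethylated C becomes T; methylated C (0-based positions in
    `methylated`) is protected and stays C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, read):
    """Return {position: is_methylated} for every reference cytosine."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = True   # protected from conversion, so methylated
        elif read_base == "T":
            calls[i] = False  # converted, so unmethylated
    return calls


reference = "ACGTCG"
read = bisulfite_convert(reference, methylated={1})
print(read)
print(call_methylation(reference, read))
```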

Genome-wide quantification of bisulfite-converted cytosines can be performed by either microarray- or sequencing-based platforms. The Infinium HumanMethylation450 BeadChip arrays

(bisulfite conversion-based assays, Illumina, San Diego, CA) gained popularity as an easy and cost-effective method for methylation analysis. These arrays appear to perform quite well for epigenome-wide association studies. One key disadvantage is that they only analyze the methylation of 480,000 preselected loci, representing less than 1% of the whole genome. Reduced representation bisulfite sequencing is another bisulfite-conversion-based method, which combines restriction enzyme-based size selection with next-generation sequencing. This method provides excellent coverage for CGI and promoter regions. Genome-wide bisulfite sequencing provides the highest-resolution method to examine cytosine methylation status. The main drawback of this method is the current cost associated with whole-genome sequencing and data analysis.


Chromatin Immunoprecipitation: Chromatin immunoprecipitation (ChIP), followed by either microarrays or next-generation sequencing (ChIP-chip/ChIP-Seq), is the most frequently used method to map transcription factor binding and histone modifications. Here, genomic DNA and proteins in either live cells or fresh tissues are cross-linked with formaldehyde, and the chromatin is fragmented into pieces of about 200 bp. The sheared DNA/protein complex is then subjected to immunoprecipitation with an antibody against the selected transcription factor or histone tail modification. Once the genomic DNA is reverse cross-linked and released from the protein complex, it can be analyzed using quantitative PCR, microarrays, or sequencing.

Although ChIP has revolutionized our understanding of the epigenome, the method is not without limitations. First, the method does not provide an epigenetic map with base pair resolution, because the resolution is limited by the size of the sheared chromatin. To circumvent this issue, an enzyme digestion-based method, the exo-ChIP, was developed [258]. Second, a key limitation of ChIP-seq is the specificity of the antibodies used for the studies.

To address this issue, several antibody validation databases were created to evaluate antibody quality [259-261].


Next-generation sequencing platforms and data analysis: For genome-wide bisulfite sequencing, high genome coverage (usually at least 30X) and long reads are needed.

Sometimes, 500-to-600 bp paired-end sequencing (e.g. the MiSeqV2 System, which generates paired 250 bp reads, or MiSeqV3, which generates paired 300 bp reads) is used [262]. For ChIP-seq, generally 100-bp single-end sequencing is used. ChIP-seq analysis begins by aligning each read to the reference genome and then determining the number of reads that aligned within each genomic region. A frequently used aligner is BWA. After reads are aligned, protein binding sites can be defined by the enrichment of sequence reads in the genome, which is termed peak calling. Several peak calling algorithms are available, including SPP [263], PeakSeq [264], and MACS [265], which were used by the ENCODE consortium [266]. Other available tools include SICER [267] and Genome-Wide Event Finding and Motif Discovery (GEM) [268]. Most of the methods, e.g. MACS and PeakSeq, target peaks that are around 150-300 bp, whereas SICER identifies diffuse histone marks that can span kilobases to megabases of the genome. A special feature of GEM is that it improves the resolution of peaks by combining peak calling and motif analysis.
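The counting step behind peak calling can be sketched in a few lines. The example below bins aligned read start positions into fixed windows and reports windows whose depth-scaled fold enrichment over a control sample exceeds a threshold. This is a deliberately simplified stand-in and not the algorithm of MACS, SPP, PeakSeq, SICER or GEM; the bin size, threshold, pseudocount, and function names are all assumptions.

```python
from collections import Counter


def bin_counts(read_starts, bin_size):
    """Count reads per fixed-width bin (bin index = position // bin_size)."""
    return Counter(pos // bin_size for pos in read_starts)


def call_enriched_bins(chip, control, bin_size=200, min_fold=4.0, pseudocount=1.0):
    """Return (start, end, fold) for bins enriched in ChIP over control."""
    chip_bins = bin_counts(chip, bin_size)
    ctrl_bins = bin_counts(control, bin_size)
    scale = len(chip) / len(control)  # scale control to the ChIP read depth
    peaks = []
    for b in sorted(chip_bins):
        fold = (chip_bins[b] + pseudocount) / (ctrl_bins.get(b, 0) * scale + pseudocount)
        if fold >= min_fold:
            peaks.append((b * bin_size, (b + 1) * bin_size, fold))
    return peaks


# Toy data: 20 ChIP reads pile up in the first 200 bp; control is uniform.
chip_reads = [10 * i for i in range(20)] + [1000, 2000]
control_reads = [500 * i for i in range(22)]
print(call_enriched_bins(chip_reads, control_reads))
```

Published peak callers replace the fixed threshold with a statistical model of the background and merge or refine adjacent windows.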

Because the accessibility of each genomic region is slightly different, an enrichment score is developed to show the enrichment fraction normalized to input DNA (that is, sonicated

DNA without antibody enrichment). Both the degree of enrichment and the size of the region that is marked by an individual histone mark are important, and several different modifications can be present on an individual region. To address this challenge, several algorithms have been developed using statistical methods that incorporate histone mark information to identify functional elements from high-throughput genomic data sets. Ernst and Kellis [269] developed a method based on a hidden Markov model, which can be used to generate gene regulatory region annotation based on a panel of histone ChIP-seq experiments performed on the same sample. The method takes into consideration the presence of multiple different histone marks and their reference sequence-based location. This method is very valuable for


comparing multiple different histone marks, either between different cell types, or within the same cell type but expressing different phenotypic characteristics, to highlight key similarities and differences in datasets.


CHAPTER 2: Genetic-variation-driven gene-expression changes highlight genes with important functions for kidney disease

2.1. Abstract

Chronic kidney disease (CKD) affects close to 10% of the US population and is associated with a 3- to 5-fold increase in risk of death. Genome-wide association studies

(GWASs) have identified sequence variants, localized to non-coding genomic regions, associated with kidney function. Despite these robust observations, the mechanism by which variants lead to

CKD remains a critical unanswered question. Expression quantitative trait loci (eQTL) analysis is a method to identify genetic variation associated with gene expression changes in specific tissue types. We hypothesized that an integrative analysis combining CKD GWAS and kidney eQTL results can identify candidate genes for mechanistic and causal studies.

We performed eQTL analysis by correlating genotype with RNAseq-based gene expression levels in 96 human kidney samples. Applying stringent statistical criteria, we identified

1,886 genes (eGenes) associated with sequence variants. Using direct overlap and Bayesian methods, we identified new potential target genes for CKD. With respect to one of the target genes, Lysosomal Beta A Mannosidase (MANBA), we observed that SNPs associated with

MANBA expression in the kidney showed statistically significant colocalization with SNPs identified in CKD GWAS studies. The results indicate that MANBA is a likely causative factor in

CKD. The expression of MANBA was significantly lower in kidneys of subjects with risk alleles.

Suppressing MANBA expression in zebrafish resulted in pericardial edema, a phenotype typically induced by kidney dysfunction.

Our analysis shows that gene-expression changes driven by genetic variation in the kidney can highlight potential new target genes for CKD development.


2.2. Introduction

The key function of the human kidney is to remove waste from the body. The kidney filters more than 180 liters of plasma every day and selectively reabsorbs or secretes specific electrolytes. The kidney excretes most of the endogenous and exogenous waste products that the body can modify into hydrophilic compounds. Ten percent of the body’s basal energy requirement is used by the kidney to perform its transport function [5, 270]. Chronic kidney disease (CKD) is characterized by at least a 40% decrease in the filtering function or abnormal leakiness to albumin. CKD is a common condition affecting close to 20 million people in the US and 10% of the population worldwide [6, 271]. Impaired kidney function is associated with a 3- to

5-fold increase in risk of death [272-274].

CKD is a gene-environment disease, and diabetes and hypertension are responsible for more than 75% of the CKD cases in the US [275, 276]. The genetic underpinning of CKD has been examined by genotyping and phenotyping large numbers of subjects with CKD and conducting GWAS studies. In most GWAS studies, creatinine-based estimates of kidney function

(eGFRcrea) have been used to define kidney disease. Our review of the literature identified 110 loci that exhibited significant genome-wide association with CKD status. The association between genetic variants and disease has been replicated by several studies, indicating the robustness of the relationship.

Despite the remarkable progress of the GWAS experiments, these results have not translated into improved understanding of the mechanisms of CKD development. One critical bottleneck is that more than 90% of the CKD-associated SNPs are localized to non-coding genomic regions [34]. It is not clear, though, which genes and which cell types are affected by the genetic variants. Data from other complex-trait studies indicate that disease-causing genetic variants are localized to regulatory regions in disease-relevant cell types, altering transcription factor binding and downstream quantitative expression of target genes [69].

Most GWAS studies have proposed the nearest gene as the causal gene for disease


development; however, mechanistic research indicates that the nearest gene is often not causal, and identification of GWAS targets remains difficult [277, 278]. The expression quantitative trait loci (eQTL) method has been developed to associate genetic variation with gene expression changes. The method requires genotype and tissue-specific gene-expression data from large numbers of healthy individuals, and carefully controlled computational methods, to identify genotype-driven gene expression changes. Several tissue types have been difficult to collect in large enough numbers to perform eQTL analysis. The largest eQTL datasets have therefore been generated from easily accessible cell types, for example blood [97]. The Genotype-Tissue Expression Project (GTEx) Consortium was established to circumvent tissue-procurement problems, and has been successful in creating a large number of eQTL datasets [92].

Unfortunately, the number of kidney samples in GTEx (n=26) is limited due to allocation and organ transplantation; hence GTEx has not published kidney-specific eQTL maps.

The aim of the current study was to generate a kidney-specific eQTL dataset and integrate this information with CKD-associated GWAS variants, in order to identify target genes for chronic kidney disease and thereby better understand kidney function and disease development.

2.3. Results

2.3.1 eQTL analysis of human kidney tissues


We obtained genotype and gene expression data for 99 normal human kidney cortical samples from individuals of European descent. Genotype data were obtained from the Genome-Wide Human SNP Array 6.0 and expression data by RNA sequencing. We excluded samples that did not pass quality control (QC), and to avoid detecting genetic variation caused by differences in population genetic structure, we further removed samples using principal component analysis (PCA) (Figure 2-1) [279-281]. There were 96 individual samples that passed QC and were used for downstream analysis (see Methods). These samples were phased and imputed from the 1000 Genomes Phase 3 haplotype CEU reference panel using SHAPEIT and IMPUTE2 [62, 282].

Figure 2-1 | PCA plot of kidney samples against the 1000 Genomes reference. PCA analysis of kidney samples against the 1000 Genomes AFR, AMR, EAS, EUR and SAS populations. Color code is provided as shown in the right of the panel. Subjects from TCGA study are shown in pink. One sample labeled with red arrow was removed after PCA.

Gene expression data were normalized using quantile normalization followed by inverse normal transformation. We then added sex and age as covariables for probabilistic estimation of expression residuals (PEER) before the final transformation of the residuals (see Methods).
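The first two normalization steps can be sketched as follows. The example applies quantile normalization across samples and then a rank-based inverse normal transformation to one gene's values. It uses only the Python standard library; the rank offset of 0.5, the absence of tie handling, and the function names are assumptions of this sketch, and the PEER covariate step is not shown.

```python
from statistics import NormalDist, mean


def quantile_normalize(matrix):
    """Quantile-normalize matrix[gene][sample] so all samples share one distribution."""
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the r-th smallest value across samples.
    rank_means = [mean(sc[r] for sc in sorted_cols) for r in range(n_genes)]
    out = [[0.0] * n_samples for _ in range(n_genes)]
    for s, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda g: col[g])
        for rank, g in enumerate(order):
            out[g][s] = rank_means[rank]
    return out


def inverse_normal_transform(values):
    """Map values to standard-normal quantiles by rank (ties not handled)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return out


expr = [[3.0, 4.0], [1.0, 8.0], [2.0, 6.0]]  # 3 genes x 2 samples (toy values)
print(quantile_normalize(expr))
print(inverse_normal_transform([10.0, 30.0, 20.0]))
```

After the transformation each gene's expression values follow a standard normal distribution across samples, which limits the influence of outliers on the downstream linear model.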

We determined the association between 5.1 million genetic variants and expression of

17,388 genes using Matrix eQTL (Figure 2-2) [283]. The analysis was limited to cis-eQTLs whose

SNPs and gene pairs were within 1 Mb distance. We followed the protocol described in GTEx to correct for multiple testing and establish significance [92]. Briefly, we obtained permutation-adjusted p-values for each gene using FastQTL [284], which yielded the most significant gene-SNP pairs. After deriving the empirical p-value from this analysis, we calculated the q-value using the Storey method [285], with a significance threshold of less than or equal to 0.05. This method identified 1,886 eGenes and 124,612 eQTL SNPs (eSNPs) (Appendix 2). In summary, we have generated human kidney eQTL data from 96 CEU samples.
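The q-value step can be illustrated with a simplified version of the Storey procedure. The sketch below estimates the proportion of true null hypotheses from the p-values above a single cutoff and converts per-gene p-values to q-values. The fixed cutoff `lam=0.5` is an assumption made for brevity (the published method smooths the estimate over a range of cutoffs), and the p-values are invented for the example.

```python
def storey_qvalues(pvalues, lam=0.5):
    """Simplified Storey q-values with a fixed-lambda estimate of pi0."""
    m = len(pvalues)
    # pi0: estimated fraction of true nulls, from the flat tail of p-values.
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        q[i] = running_min
    return q


gene_pvalues = [0.001, 0.01, 0.02, 0.6, 0.9, 0.03, 0.8, 0.04]  # toy values
qvalues = storey_qvalues(gene_pvalues)
n_egenes = sum(q <= 0.05 for q in qvalues)
print(n_egenes)
```

Genes with a q-value at or below 0.05 would be reported as eGenes under the threshold stated above.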

Figure 2-2 | Cis-eQTL analysis of human kidney genome. (A) Flowchart of eQTL analysis. (B) Global significance of eQTL SNPs across the genome. The y-axis shows –log10 p-value; the significance threshold used in our analysis is determined by the corrected permutation test p-value.

2.3.2 Validation and replication of kidney cis-eQTL signals


Figure 2-3 | Positive correlation between sample number and identified eGenes. The number of eGenes discovered is strongly correlated positively to the number of samples used for the analyses. Tissues from GTEx (v6) were used as comparison, with the selection criteria that each tissue type has to have more than 90 samples. The number of eGenes ranged from 1628 in liver to 9937 in thyroid. Kidney sample (n=96) is shown in red.

Figure 2-4 | Positive correlation between sample number and replicated significant SNP-eGene pairs. Significant eQTL SNP-gene pairs are overlapped between kidney and the 32 tissue types from GTEx (v6). We also observed a strong positive correlation between sample number and SNP-gene pair replication. The x-axis shows the sample number of each tissue type; the y-axis is the number of eSNP-eGene pairs shared between GTEx tissues and the kidney eQTL.

To evaluate the kidney eQTL signals, we first compared the number of eGenes identified

by our analysis to 32 other tissue samples published in GTEx (v6). The analysis was limited to


tissue samples where more than 90 samples were analyzed. The number of significant eGenes

(n=1,886) in the kidney eQTL followed the expected linear relation between sample size and number of detected eGenes observed by GTEx (Figure 2-3). We also surveyed the overlap of eSNP-eGene associations between the kidney eQTL and the 32 tissue types in GTEx (v6). Consistent with previous studies, larger sample size increased the power of association detection, and we observed higher eSNP-eGene pair overlap with tissues that had larger sample sizes (Figure 2-4).

Figure 2-5 | Significant SNPs in the eQTL are enriched at regulatory regions. (A) Density plot of eSNPs and their distance relative to TSS. (B) eSNPs overlap with 6 different histone marks in human kidney proximal tubule epithelial cells, compared with the 1000 null sets of variants matched for number, distance to TSS and frequency. The vertical line represents the 95% confidence interval. Histone marks with significance level of p-value < 1.0E-2 are highlighted with red asterisks. (C) eSNP enrichment in ENCODE TFBSs. Odds ratios of eSNPs are shown in y-axis and the vertical line represents 95% confidence interval. We marked the TFBSs reaching significance level p < 1.0E-2 with red asterisks. Control SNPs were the 1000 null sets of variants that are matched to eSNPs for number, distance to TSS and allele frequency.

Next, given that eSNPs influence gene expression, we analyzed whether eSNPs are enriched on regulatory regions. We plotted the significance of association between eSNPs and their distance to transcription start sites (TSSs). We found that the strength of association was greater around the TSSs (Figure 2-5A), indicating that eSNPs were enriched around transcription start sites, likely promoter regions.

To establish that the eSNPs were in fact enriched on regulatory regions, we used histone tail modification-based chromatin immunoprecipitation sequencing (ChIP-seq) to identify kidney-specific regulatory regions. We performed this analysis using different histone tail mark-specific antibodies and human kidney proximal tubule cells (HKC8) [166]. We overlapped eQTL SNPs with each histone mark, quantifying the enrichment ratio by comparing it with 1,000 randomly generated SNP sets that were matched for frequency and distance. To avoid enrichment bias, we removed variants in the same linkage disequilibrium (LD) block (see Methods). Kidney eQTL SNPs were statistically significantly enriched (~4-fold) on active promoters, as they showed highest enrichment with active promoter marks H3K4me2 and H3K4me3 (p-value < 0.01). There was no statistically significant enrichment for repressed promoter and gene-body-specific histone marks (Figure 2-5B). These studies indicate that kidney eQTL SNPs are enriched on promoter/enhancer regions in the kidney.
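The comparison against matched null sets can be sketched as an empirical enrichment test. The example counts how many SNP positions fall inside histone-mark peaks and compares the observed count with counts from null SNP sets. The construction of the matched null sets (by allele frequency and distance to TSS) is not shown; the function names, toy coordinates, and the add-one form of the empirical p-value are assumptions of this sketch.

```python
from bisect import bisect_right


def count_overlaps(positions, intervals):
    """Count positions inside sorted, non-overlapping half-open (start, end) intervals."""
    starts = [start for start, _ in intervals]
    hits = 0
    for pos in positions:
        k = bisect_right(starts, pos) - 1  # last interval starting at or before pos
        if k >= 0 and pos < intervals[k][1]:
            hits += 1
    return hits


def enrichment(observed, null_counts):
    """Return (fold enrichment over the null mean, one-sided empirical p-value)."""
    mean_null = sum(null_counts) / len(null_counts)
    fold = observed / mean_null if mean_null else float("inf")
    # Add-one correction so the p-value is never exactly zero.
    p = (1 + sum(c >= observed for c in null_counts)) / (1 + len(null_counts))
    return fold, p


peaks = [(100, 200), (500, 600)]                  # toy histone-mark peaks
esnp_positions = [150, 199, 200, 550, 50]         # toy eSNP coordinates
print(count_overlaps(esnp_positions, peaks))
print(enrichment(observed=40, null_counts=[10] * 1000))
```

With 1,000 null sets the smallest attainable empirical p-value is roughly 0.001, which is compatible with the p < 0.01 threshold reported above.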

Another feature of regulatory regions is that they are enriched for transcription factor binding sites. We next examined the overlap of eSNPs with the transcription factor binding ChIP-seq data from ENCODE [186, 286, 287]. We found that eSNPs were statistically significantly enriched for transcription factor binding sites. Specifically, we found enrichment for CBX3, HDAC1, KDM5b, NRF1, ZEB1, IRF1, and SIX5 binding compared to 1,000 randomly generated SNP sets that were matched for frequency and distance (Figure 2-5C). Many of these transcription factors have been shown to be important in kidney development or kidney disease [288-290]. An example of a kidney-specific eQTL is shown in Figure 2-6A-B, where the rs946213 variant influences the expression of RRP15. The SNP is located at a promoter and the variant disrupts a GATA5 binding site.


Figure 2-6 | An example of a significant eQTL SNP at a regulatory region. (A) In one example region at chr1:218.45-218.514 Mb, SNP rs946213 affects expression levels of RRP15, and the gene and the SNP are located in histone marks including H3K4me1, H3K4me3 and H3K27ac. SNP rs61823376 also disrupts a GATA5 transcription factor binding site. (B) RRP15 expression level is significantly lower in samples with rs946213 minor alleles (p-value = 7.81E-15).

In summary, kidney eQTLs were enriched on kidney-specific regulatory regions, indicating the robustness of the approach. Our results indicate enrichment of the eSNPs on kidney-specific active promoter regions and transcription factor binding areas.

2.3.3 eGenes are enriched for metabolism-associated genes


According to GTEx, large numbers of eQTLs are conserved between different organs

[92]. After overlapping kidney eGenes with 9 different tissue types in GTEx (v4) eGenes, we

found 847 eGenes unique to kidney eQTL, indicating that our analysis identified both shared and

unique eQTLs. To understand the functional role of the eGenes, and therefore the eQTL

regulatory mechanism, we performed Gene Ontology (GO) analysis using all 1,886 eGenes and

the Database for Annotation, Visualization and Integrated Discovery method (DAVID v6.7), which

includes eGene-enriched pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG)

Figure 2-7 | Network of eGenes. (A) The most significant eGene pathway is enriched in metabolic pathways and xenobiotic and drug metabolism. The KEGG pathways displayed here are those below the Benjamini-Hochberg corrected p-value of 5.0E-2. (B) IPA analysis of eGenes in the pathway related to metabolism of xenobiotics. Genes presented in the significant eGene list are in orange, and the rest of the genes in the network were supplied algorithmically. Genes enclosed in the box are kidney-specific eGenes.

[291]. As shown in Figure 2-7A, we found that the most significantly enriched pathway was

related to metabolism, specifically drug and xenobiotic metabolism (p-value = 2.30E-06), which is

a key function of the kidney.

To further understand the xenobiotic-metabolic eGene properties, we performed network

analyses using Ingenuity Pathway Analysis (IPA). IPA highlights the xenobiotic pathway (p-value

= 1.57E-23) in Figure 2-7B. In summary, many of the kidney eGenes show enrichment for drug,


xenobiotic metabolism, and pharmacogenomics, which is highly consistent with the function of the

kidney and could be related to the higher expression of these genes in the kidney (Figure 2-8).

2.3.4 Kidney eQTL highlights the genetics of disease traits

A critical goal of eQTL analysis is to provide a framework for biological interpretation of

Figure 2-8 | Density distributions of gene expression of eGenes and all genes in kidney samples. Gene expression distributions are shown for both eGenes and all genes. A shift toward higher expression of eGenes can be seen with the overlap of the translucent plots. All genes are shown in light red and eGenes are in light blue.

disease-related variants. To evaluate the relevance of the eQTLs identified in disease-mapping

studies, we compared kidney eQTL SNPs with disease-associated SNPs using the National Human

Genome Research Institute (NHGRI) catalog (www.ebi.ac.uk/gwas, accessed 9/30/2015) [292].

We analyzed statistical enrichment for GWAS SNPs in our kidney eQTL database (see Methods).

We found that kidney-specific eQTL SNPs are enriched for CKD-associated traits (Figure 2-9).

When compared to other traits (digestive, nervous, immune system diseases, cardiovascular,

body measurement, biological processes, and other diseases) we found a significantly greater

overlap between kidney eSNPs and polymorphisms that are associated with kidney function [35,

36, 293, 294]. These results indicate that genetic variants associated with kidney function and

CKD are enriched in our eQTL studies and likely drive gene expression changes in the kidney.

Figure 2-9 | Disease trait GWAS enrichment analysis. SNPs associated with different traits were taken from the National Human Genome Research Institute (NHGRI) database, with SNPs categorized based on disease ontology. Enrichment was calculated using Fisher's exact test; the significance is shown on the y-axis as -log10 p-value.

2.3.5 Defining the genetic network underlying CKD

A key objective of our study was to identify potential target genes for CKD-associated

GWAS variants. We manually curated a list of SNPs that both passed genome wide significance

Figure 2-10 | Integrative analysis of GWAS and eQTL SNPs. (A) We used eSNPs that are also present in CKD-associated GWAS as a means to identify target genes that affect kidney disease. (B) Boxplot of rs1719246 and SPATA5L1, rs6429746 is within the LD block with rs2467853, which is one of the CKD-associated GWAS-identified SNPs (D'= 0.95; r2 = 0.89).

and were associated with kidney disease traits. To acquire independent loci, we removed any

SNPs having r2 ≥ 0.2. In total, we analyzed 110 leading SNPs and 2,357 tagging SNPs with r2 ≥

0.8. We performed a direct overlap between CKD-associated GWAS SNPs and eQTL SNPs

(Figure 2-10).
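The r2 and D' cutoffs used above can be made concrete with a small calculation. The following Python sketch is illustrative only and is not the pipeline used in this study: it computes D, |D'| and r2 for a pair of biallelic SNPs from phased haplotype counts. The function names `ld_stats` and `classify_pair` are hypothetical and the haplotype counts are invented; only the two r2 cutoffs (0.2 and 0.8) come from the text.

```python
R2_INDEPENDENT_MAX = 0.2  # SNPs with r2 >= 0.2 were not treated as independent
R2_TAGGING_MIN = 0.8      # tagging SNPs were required to have r2 >= 0.8


def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, |D'|, r2) from phased haplotype counts.

    Haplotype 'Ab' carries allele A at SNP 1 and allele b at SNP 2.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n  # frequency of allele A at SNP 1
    p_B = (n_AB + n_aB) / n  # frequency of allele B at SNP 2
    d = p_AB - p_A * p_B
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


def classify_pair(r2):
    """Label a SNP pair using the two r2 cutoffs quoted in the text."""
    if r2 >= R2_TAGGING_MIN:
        return "tagging"
    if r2 >= R2_INDEPENDENT_MAX:
        return "linked"
    return "independent"


# Invented counts: allele A only ever occurs together with allele B.
d, d_prime, r2 = ld_stats(n_AB=30, n_Ab=0, n_aB=20, n_ab=50)
print(round(d_prime, 3), round(r2, 3), classify_pair(r2))
```

The invented example shows why both statistics are reported: |D'| can reach 1 while r2 stays well below the tagging cutoff when the two SNPs have different allele frequencies.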

Overlapping SNPs were identified between the two datasets. These SNPs passed genome-wide significance both in GWAS and eQTL studies. They mapped to seven loci (eight genes) and regulated the expression of CST9, SPATA5L1, LOC654433, ACCS, CFHR1,

ALMS1P, PGAP3 and TPRKB (Table 2-1).

Table 2-1 | SNPs identified in human kidney eQTL that are in tight LD with CKD GWAS SNPs

Trait | GWAS P-value | CKD GWAS SNP | CKD GWAS suggested target genes | eGene identified | eSNP best eQTL P-value | eSNP to target gene TSS
BUN | 3.30E-10 | rs11123170 | PAX8 | LOC654433 | 2.67E-07 | 9,927 bp upstream of TSS
CKD | 1.05E-42 | rs2467853 | GATM/SPATA5L1 | SPATA5L1 | 3.75E-08 | 41,152 bp upstream of TSS
eGFRcrea | 4.50E-14 | rs13538 | ALMS1/NAT8 | ALMS1P | 2.63E-06 | 25,029 bp upstream of TSS
 | | | | TPRKB | 8.24E-07 | 115,584 bp downstream of TSS
eGFRcrea | 9.00E-13 | rs11078903 | CDK12 | PGAP3 | 1.28E-10 | 339,337 bp downstream of TSS
eGFRcys | 2.20E-88 | rs13038305 | NAPB, CSTL1, CST11, CST8, CST9L, CST9, CST3, CST4, CST1, CST2, CST5 | CST9 | 2.68E-09 | 9,027 bp upstream of TSS
IgA-NP | 3.93E-09 | rs2074038 | ACCS | ACCS | 3.01E-14 | 510 bp downstream of TSS
IgA-NP | 2.96E-10 | rs6677604 | CFHR1,3 | CFHR1 | 6.93E-07 | 77,945 bp upstream of TSS

The direct overlap method has a few limitations. First, eQTL and GWAS variants were not genotyped on the same platform, and a large number of variants were missing in the eQTL genotypes. Second, the method is very stringent, as it requires the SNPs to reach genome-wide significance independently in both studies, even though GWAS and eQTL analyses are independent methods whose evidence could reinforce each other.

To address the latter issue, we also implemented a Bayesian method (coloc [295, 296]) to formally test the hypothesis of association of a given region with eQTL and GWAS signals. Due

to the nature of the analysis, we could import only one GWAS study rather than our curated list drawn from multiple cohorts. As input to coloc, we used the largest

GWAS of the kidney function measure eGFR from the CKDGen Consortium, which has the best statistical power, together with its published summary statistics and sample sizes [69]. The five hypotheses used in the coloc analyses were: H0: No association with either trait (trait 1 for eGFR GWAS and trait 2 for eQTL associations or vice versa); H1: Association with trait 1, not with trait 2; H2:

Association with trait 2, not with trait 1; H3: Association with trait 1 and trait 2, two independent

SNPs; H4: Association with trait 1 and trait 2, one shared SNP. As sufficient support for significant results, we set PP4 > 0.8 and PP4/PP3 > 5 (posterior probability from H3 and H4) as criteria

[297]. Here we used only SNPs that are associated with the eGFRcrea trait in the GWAS studies.
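The two acceptance criteria stated above can be written as a short filter. This Python sketch is only an illustration of the decision rule, not the code used in the study. The PP_H3 and PP_H4 values for PGAP3, MANBA and ALMS1P are those reported in Table 2-2; the entry `HYPOTHETICAL_GENE` and the function name `passes_coloc_criteria` are invented to show a locus that fails the rule.

```python
PP4_MIN = 0.8    # minimum posterior probability of H4 (one shared SNP)
RATIO_MIN = 5.0  # minimum PP4 / PP3 ratio


def passes_coloc_criteria(pp_h3, pp_h4, pp4_min=PP4_MIN, ratio_min=RATIO_MIN):
    """Return True when both PP4 > pp4_min and PP4 / PP3 > ratio_min hold."""
    if pp_h4 <= pp4_min:
        return False
    if pp_h3 == 0:
        return True  # ratio is unbounded when H3 has no posterior support
    return pp_h4 / pp_h3 > ratio_min


# gene -> (PP_H3, PP_H4)
posteriors = {
    "PGAP3": (0.012715756, 0.985344385),   # Table 2-2
    "MANBA": (0.063418013, 0.913297935),   # Table 2-2
    "ALMS1P": (0.002501869, 0.887791453),  # Table 2-2
    "HYPOTHETICAL_GENE": (0.30, 0.65),     # invented, fails PP4 > 0.8
}

selected = [gene for gene, (h3, h4) in posteriors.items()
            if passes_coloc_criteria(h3, h4)]
print(selected)
```

Requiring the ratio in addition to PP4 guards against loci where a shared variant and two distinct variants are both plausible.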

As expected, this method recovered genes whose SNPs passed genome-wide significance in both analyses, including Post-GPI Attachment To Proteins 3 (PGAP3), Spermatogenesis Associated 5

Like 1 (SPATA5L1), and Alstrom Syndrome 1 (ALMS1P). Furthermore, several additional potential target genes were identified, including PIGU, METTL10, MANBA, and others

(Table 2-2).

PGAP3 encodes a phospholipase for the glycolipid complex (glycosylphosphatidylinositol, or GPI) in the Golgi apparatus. The modification is crucial for linking GPI-anchored proteins to lipid rafts, which play an important role in protein sorting and trafficking. One of the major phenotypes of PGAP3 knockout mice is progressively enlarged renal glomeruli with deposition of immune complexes and matrix expansion upon aging [298].

Furthermore, humans with PGAP3 loss-of-function mutations present with neurological and renal phenotypes [299, 300]. These experimental observations nicely validate our computational methods.

Table 2-2 | Significant genes identified through coloc that are associated with both CKD GWAS- and human kidney-identified SNPs

No. | Gene | PP_H3 | PP_H4 | H4/H3
1 | PGAP3 | 0.012715756 | 0.985344385 | 77.49
2 | SPATA5L1 | 0.015059344 | 0.979123189 | 65.02
3 | METTL10 | 0.009878945 | 0.978529109 | 99.05
4 | PIGU | 0.040348343 | 0.955231251 | 23.67
5 | KNG1 | 0.013444728 | 0.955009245 | 71.03
6 | MANBA | 0.063418013 | 0.913297935 | 14.40
7 | ALMS1P | 0.002501869 | 0.887791453 | 354.85
8 | C9 | 0.004801323 | 0.884400778 | 184.20
9 | WHAMM | 0.004698951 | 0.882157986 | 187.74

2.3.6 Integrative analysis identifies MANBA as a potential gene for CKD

To prioritize the validation of target genes, we performed network analysis for all eGenes using IPA. IPA highlights the metabolic pathway with a score of 38 (the negative log p-value of the odds ratio). MANBA was the only gene predicted by coloc that is present in the pathway

(Figure 2-11).

The GWAS study revealed a significant association between SNPs and

Figure 2-11 | Integrative analysis of GWAS and eQTL SNPs. (A) We used eSNPs that are also present in CKD-associated GWAS as a means to identify target genes that affect kidney disease. (B) Boxplot of rs1719246 and SPATA5L1, rs6429746 is within the LD block with rs2467853, which is one of the CKD-associated GWAS-identified SNPs (D'= 0.95; r2 = 0.89).

eGFR. While the genetic variants were located in non-coding genomic regions, the proposed

likely causal gene for kidney function in the GWAS study was listed as NFKB1. In contrast,

our analysis indicates MANBA is a potential target gene for kidney-function associated GWAS

variants. We examined the association of gene expression based on the GWAS variants, and

found that while MANBA shows statistically significant changes based on the genotype,

expression of nearby genes, including NFKB1, was unaffected.

We found that the SNPs associated with kidney function also influenced the expression

of MANBA. While the top GWAS SNP was not present in our eQTL database, we observed that

Figure 2-12 | Genetic and epigenetic annotation of MANBA-locus SNPs. (A) The lower-left part of the square is the analysis of LD across the chromosome 4:103.4-103.8 Mb region using Haploview version 4.2 (http://www.broad.mit.edu/mpg/haploview). The strength of the LD is shown in the grey gradient as indicated in the figure. Pair-wise D' values were calculated and plotted using Haploview v4.1. Dashed lines indicate the annotated genes in the locus. The upper-right corner of the square shows chromosome conformation capture (Hi-C) interaction contact probabilities in a human lymphoblastoid cell line (GM12878). The plot shows the 200 kb topologically associating domain at 1 kb resolution. Locations of both CKD and eQTL SNPs are labeled. (B) The zoom-in locus of MANBA region at chr4:103.5-103.64 Mb. CKD GWAS SNP, MANBA-associated best eSNP, and the overlapped SNP between two datasets are labeled on the plot. (C) The genome browser display of the MANBA region. Tracks from top to bottom are: UCSC gene annotation; H3K4me1 and H3K27ac overlays of 9 cell lines from ENCODE; human kidney histone mark H3K4me1, H3K4me3, and H3K27ac from Epigenome RoadMap; and ENCODE chromatin states (ChromHMM) showing the regulatory properties of the sequence. (D) Sequence-predicted TF motifs are shown at the bottom for rs227361 (overlapped SNP) and rs170563 (eSNP of MANBA). SNP rs227361 disrupts the transcription factor IRF2 motif and rs170563 disrupts the FOXP2 motif.

Figure 2-13 | Locuszoom plots in MANBA region at chr4:103.4-103.8Mb. (A) CKDGen Phase III GWAS SNPs and the corresponding -log10 p-value. (B)-(D) are eQTL association of MANBA in kidney, whole blood, and transformed fibroblasts in GTEx (v6).

the eGFR GWAS SNPs and eQTL variants were in strong LD. Figure 2-12 shows the LD

structure of the human MANBA region. The top eGFR (rs228611) and eQTL (rs227361) SNPs in

the chr4: 103.5 Mb-103.6 Mb region were within the same LD block (r2=0.88, D’=0.95). Upon

examining the interaction by extended chromosome conformation capture (Hi-C) in the human

lymphoblastoid cell line (GM12878), we found that the eQTL and GWAS regions were likely

interacting with each other, which could explain our results.

Figure 2-14 | Genetic and epigenetic annotation of MANBA-locus SNPs. (A) The zoom-in locus of MANBA region at chr4:103.5-103.64 Mb. CKD GWAS SNP, MANBA-associated best eSNP, and the overlapped SNP between two datasets are labeled on the plot. (B) The genome browser display of the MANBA region. Tracks from top to bottom are: UCSC gene annotation; H3K4me1 and H3K27ac overlays of 9 cell lines from ENCODE; human kidney histone mark H3K4me1, H3K4me3, and H3K27ac from Epigenome RoadMap; and ENCODE chromatin states (ChromHMM) showing the regulatory properties of the sequence. (C) Sequence-predicted TF motifs are shown at the bottom for rs227361 (overlapped SNP) and rs170563 (eSNP of MANBA). SNP rs227361 disrupts the transcription factor IRF2 motif and rs170563 disrupts the FOXP2 motif.

Next, we examined whether the eQTL signal can be replicated in other organs. Upon

examining the GTEx database we found that transformed fibroblasts showed a similar association

between genotype and gene expression as we observed in the kidney (Figure 2-13).

Unfortunately, the GTEx Browser does not contain kidney eQTL data, which prevented us from

performing a direct kidney-to-kidney comparison. Among the available tissues, only the transformed fibroblast eQTL signal

generated by GTEx closely resembled the kidney eQTL signal.

To further identify the causal variant (region), we also analyzed histone ChIP-Seq-based regulatory annotation maps. Human-kidney-specific annotation maps indicated that the most enriched histone marks are H3K4me1 and H3K27ac (active enhancer), which overlapped with the eSNPs (Figure 2-14A). Examining the ENCODE database, we confirmed that several cell types also showed H3K27ac enrichment of this region. There was no H3K27ac enrichment around the top GWAS variant in human kidneys or other ENCODE cells, although fibroblasts showed enrichment for the latent enhancer mark H3K4me1. The eSNP rs170563 was enriched for the

H3K27ac histone mark and disrupted the FOXP2 transcription factor binding site, suggesting that rs170563 eSNP potentially has a regulatory function. Detailed examination of the MANBA locus is shown in Figure 2-14B-C.

In summary, the observed eQTL signal overlapped with active enhancer regions both in the kidney and other organs, while the GWAS signal showed no enrichment of the histone marks, suggesting that the causal variant is most likely in the eQTL-identified region and not in the peak GWAS region.

2.3.7 Analysis of MANBA gene expression

Next, we examined the gene expression of MANBA in different organs. Using the Protein Atlas [301] and Illumina Body Map (GSE30611), we found that the reproductive tract and the kidney both have intermediate expression of MANBA (not shown), and that MANBA is expressed in the lysosomes of kidney tubule cells (Figure 2-15A) [302]. Segment-specific expression data obtained from rat kidneys [303] indicate that MANBA is expressed in the long descending limb of the loop of Henle in the inner medulla, with the highest expression level in the inner medullary collecting duct segment.

Next, we examined the direction of regulation by the risk allele. We found that expression of MANBA was much lower (close to 50%) in healthy human kidney tissue samples obtained from subjects with risk alleles than in those from subjects with the reference allele (Figure 2-15B).

Figure 2-15 | Gene expression analyses of MANBA. (A) Immunohistochemistry staining in human kidney tissues (Protein Atlas). (B) Boxplot showing association between rs170563 and MANBA. (C) Gene correlation with eGFR, the leading indicator of kidney function, using microarray and clinical parameters collected in a different independent cohort of 95 human kidneys. (D) Manba expression is significantly lower in the mouse UUO model compared with sham (n=3 in each group). The statistical analysis results from DESeq are presented in the table under the plot.

We also analyzed MANBA expression in patients with chronic kidney disease, using Affymetrix microarray data from 95 human kidney samples [290]. We found that MANBA expression is lower in patients with CKD and kidney fibrosis

(Figure 2-15C). Similarly, expression of MANBA was lower in a mouse model of fibrosis induced by unilateral ureteral obstruction (UUO; unpublished data) (Figure 2-15D). In summary, MANBA expression was lower in risk allele samples and in fibrotic human and mouse kidneys.

2.3.8 Knockdown of Manba results in kidney defect in zebrafish

As healthy subjects with risk alleles had lower MANBA expression, we wanted to understand the functional significance of lower MANBA expression. We conducted genetic knockdown experiments in zebrafish (Danio rerio). We used two different types of morpholino oligonucleotides: translation-blocking and splice-blocking. We confirmed the MANBA knockdown efficiency by quantitative reverse transcription polymerase chain reaction (qRT-PCR) analysis (Figure 2-17F). Zebrafish with MANBA knockdown presented with significantly increased pericardial edema and lethality. Pericardial edema is a commonly observed kidney defect phenotype in zebrafish (Figure 2-16A), as the fish is unable to excrete salt and water upon kidney dysfunction. Careful quantification of the phenotype is shown in Figure 2-17A, B. To confirm that the pericardial edema was caused by a renal defect, we turned to Tg(enpep:GFP) zebrafish [304], which express GFP in pronephric ducts. GFP expression in the pronephros was significantly reduced, and indeed often absent, in the Manba knockdown morphants compared to controls (Figure 2-17C, D).

Figure 2-16 | Kidney defect phenotype observed in zebrafish after knockdown of MANBA. (A) Edema in zebrafish: pericardial edema after gene knockdown is shown in the right panel; the control-injected zebrafish has no edema at 200 µM concentration. (B) Edema rate of Nfkb1 and Manba knocked-down zebrafish; p-values were calculated for each concentration group using Fisher's exact test. We tested the suggested target gene from the GWAS study to confirm the eQTL analysis result.

Because heart developmental defects could also contribute to edema formation, we analyzed heart morphology on hematoxylin- and eosin-stained sections. In the histological analysis, we did not observe severe heart abnormalities in zebrafish with severe edema (Figure 2-17E), which indicates that reduced MANBA levels are likely associated with a kidney function defect in the zebrafish model. Knockdown of Nfkb1, which was suggested as the causal gene by GWAS studies, did not produce kidney-related phenotypes in zebrafish (Figure 2-16B). In summary, genetic knockdown of the eQTL target gene MANBA, but not the proposed GWAS target NFKB1, was associated with a kidney development defect in zebrafish.

Figure 2-17 | Kidney defect phenotype observed in zebrafish after knockdown of MANBA. (A) Edema in zebrafish: pericardial edema after gene knockdown is shown in the right panel; the control-injected zebrafish has no edema at 200 µM concentration. (B) Edema rate of Nfkb1 and Manba knocked-down zebrafish; p-values were calculated for each concentration group using Fisher's exact test. We tested the suggested target gene from the GWAS study to confirm the eQTL analysis result.

2.4. Discussion and Conclusion

Genetic variation has significant impact on gene expression regulation, downstream organ function, and phenotype development. Here we present the first comprehensive RNA sequencing-based gene expression analysis for genetically driven gene-expression changes in human kidney samples. While large-scale efforts are in progress to generate comprehensive eQTL analysis for the human body, kidney-specific maps have not been generated. Our project aims to fill that gap by generating kidney-specific maps.

Our results indicate that while a large percentage of genotype-driven gene-expression variation is conserved between different tissue types, kidney-specific maps have highlighted variation in gene levels that play a key role in kidney function, indicating the critical need for tissue-specific maps. In prior work, an algorithm has been developed to impute expression levels for unavailable tissue types on the basis of eQTL. However, the sample size available today does not carry enough power to detect tissue-specific eQTLs [305].

We further found that kidney eSNPs showed tissue specificity and are enriched in kidney-specific regulatory regions: promoters, enhancers and transcription factor binding sites.

We used the kidney eSNPs we discovered to interpret disease-related variants discovered in the

GWAS studies. We found four major loci where genome-wide significant eSNPs and eGFR-associated GWAS SNPs directly overlapped. We also employed Bayesian methods to confirm that eSNPs and GWAS-associated SNPs are present at the same locus and contribute to both phenotypes and gene expression levels. This analysis highlighted several genes that are likely responsible for the association observed between genotype and kidney disease-associated traits in large-scale GWAS. For example, Cystatin 9 (CST9) variants showed association with serum cystatin levels. Another example is PGAP3, the phospholipase that attaches GPI to proteins, which is another likely causal gene for CKD development.

We have further focused our analysis on MANBA as a likely causal gene for genetic

variants on chromosome 4 shown to be associated with CKD. Variants that are associated with

CKD traits also altered the expression of MANBA in human kidney samples. The risk genotype was associated with lower MANBA levels in human kidney samples. Interestingly, this association can be recapitulated in the fibroblast cell line in the GTEx dataset, potentially indicating that the signal could come from fibroblasts. A similar direction of change in gene expression was observed in patients with kidney disease and in mouse models of kidney fibrosis. We have used the zebrafish model to show that lower MANBA expression is causally related to kidney disease development. Morpholino knockdown of Manba resulted in pericardial edema formation, indicating a functional role of MANBA in kidney disease development. MANBA is a lysosomal enzyme that participates in the post-translational modification of proteins by hydrolysis of beta-D-mannose residues. Genetic deletion of Manba in mouse models was associated with neurological defects and tissue deposits such as those observed in the kidney.

Recent reports highlight the role of lysosomes and lysosomal dysfunction in CKD development [306]. Lysosomes are acidic organelles that play an important role in macromolecule degradation and recycling. Fabry’s disease, cystinosis and neuronal ceroid lipofuscinosis are primary genetic diseases caused by loss-of-function mutations in lysosomal enzymes [306-308]. While many lysosomal enzyme mutations present with primary central nervous system defects, these mutations are also associated with altered kidney function and ultimately lead to CKD. Renal tubules reabsorb more than 100 liters of fluid and large amounts of organic and inorganic matter to maintain homeostasis. Renal tubule cell lysosomes seem to play an important role in recycling and uptake of urinary substances [307]. Accumulation of toxic substances has been observed in cystinosis and Fabry’s disease, which likely results in tubule cell death and dysfunction, further fibroblast activation, and structural defects. Further studies will be needed to study the role of Manba in renal tubule cells.

While our present study mostly focused on functional annotation of CKD GWAS hits, the

kidney-specific eQTL maps will likely be essential to the understanding of other traits linked to the kidney, for example metabolite levels and hypertension. Our work nicely illustrates the critical role of the kidney in xenobiotic metabolism as one of the key pathways identified by the eQTL analysis. The genetic regulation of drug metabolism-associated genes is likely important to understand drug effectiveness, toxicity and metabolism. Future studies shall aim to define these connections.

There are several limitations of our work, most importantly the sample size. While 100 samples obtained from subjects with a single genetic background can identify genotype-driven gene-expression changes, they have limited power to identify smaller gene-expression changes.

Furthermore, the data were obtained from the TCGA database with limited clinical characterization, including kidney function and structural information. Another limitation of this study is that we used the public dataset from TCGA, in which the kidney cortex contains multiple cell types. We are currently implementing a new method, called multi-tissue meta-analysis (METASOFT), that was developed in GTEx [98] to identify tissue-specific eGenes. METASOFT borrows statistical power from an additional 44 tissues to identify shared and tissue-specific eGenes. For future studies, we will need to perform cell-type-specific analyses with single-cell resolution.

Despite these potential deficiencies, the dataset will be critically important for interpretation of current and future GWAS studies and traits that are likely related to kidney function, such as metabolite levels and blood pressure-associated genes, to name a few. While proximity-based analysis has been popular in the past, our first pass eQTL analysis does not seem to support previously proposed CKD target genes, but highlights the potential role of a set of different genes and steers the field to refocus our attention. This analysis identified MANBA as a gene that has not previously been implicated in kidney disease development.

2.5. Materials and Methods

Subjects

99 normal human kidney cortex samples including genotype (Affymetrix Genome-Wide

Human SNP Array 6.0) and RNA-seq expression profiles were obtained from The Cancer

Genome Atlas (TCGA) through the TCGA Data portal at https://tcga-data.nci.nih.gov/tcga/.

Genotype quality control and imputation

Quality control was performed as follows. Individuals with a call rate below 95% or with ambiguous gender were removed. Samples with an excessive or reduced heterozygosity rate were removed because of possible DNA sample contamination or inbreeding. To further examine sample contamination, we examined genome identity-by-descent (IBD), cryptic relationships, and sample duplication. IBD was computed pairwise between all samples using genome-wide genotype data. To avoid genotype calling errors, we excluded SNPs with missing rates > 0.05, minor allele frequency (MAF) < 0.05, or deviation from Hardy-Weinberg equilibrium (HWE p-value <

1.0E-3). We then performed PCA against 1000 Genomes samples of European ancestry and removed samples that deviated from the cluster. After QC, we had 96 individual samples that we used for downstream eQTL analyses.
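The SNP-level exclusion rules above can be sketched as a small filter. The following Python code is a simplified illustration under stated assumptions, not the QC pipeline used in the study: genotypes are coded as 0, 1 or 2 copies of one allele with `None` for a missing call, and HWE is assessed with a 1-degree-of-freedom chi-square test as a stand-in for the test applied by the QC software, which is not specified here. The function names and the toy genotype vectors are hypothetical; the three thresholds are those quoted in the text.

```python
from math import erfc, sqrt

MAX_MISSING = 0.05  # exclude SNPs with missing rate > 0.05
MIN_MAF = 0.05      # exclude SNPs with MAF < 0.05
MIN_HWE_P = 1.0e-3  # exclude SNPs with HWE p-value < 1.0E-3 (first round)


def hwe_chi2_p(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for departure from Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    q = (n_ab + 2 * n_bb) / (2 * n)
    p = 1 - q
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: nothing to test
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df


def snp_qc(genotypes):
    """Return per-SNP QC statistics and a keep/drop decision."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return {"missing": 1.0, "maf": 0.0, "hwe_p": 1.0, "keep": False}
    missing = 1 - len(called) / len(genotypes)
    counts = [called.count(k) for k in (0, 1, 2)]
    alt_freq = (counts[1] + 2 * counts[2]) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)
    hwe_p = hwe_chi2_p(*counts)
    keep = missing <= MAX_MISSING and maf >= MIN_MAF and hwe_p >= MIN_HWE_P
    return {"missing": missing, "maf": maf, "hwe_p": hwe_p, "keep": keep}


# Invented genotype vectors for 40 samples each.
balanced = [0] * 10 + [1] * 20 + [2] * 10            # passes all filters
all_het = [1] * 40                                   # strong HWE departure
rare = [0] * 39 + [1]                                # MAF = 0.0125
sparse = [0] * 10 + [1] * 20 + [2] * 7 + [None] * 3  # 7.5% missing
for name, g in [("balanced", balanced), ("all_het", all_het),
                ("rare", rare), ("sparse", sparse)]:
    print(name, snp_qc(g)["keep"])
```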

SNPs that passed quality control were further subject to haplotype phasing using

SHAPEIT2 [282], and genotype imputation using IMPUTE2 [62], with 1000 Genomes Project phase 3 haplotypes as a reference. The imputed genotypes were subjected to a second round of quality control, retaining SNPs with missing rate < 0.05, MAF > 0.05, and HWE p-value > 1.0E-6, using PLINK [309].

Gene expression data processing

We performed RNA-seq data QC and normalization following the standards of the GTEx project [92]. We excluded genes whose mean RNA-Seq by Expectation Maximization (RSEM) value was less than 1, and then transformed the RSEM values as follows: (1) quantile normalization across samples; (2) ranking of values across samples followed by mapping to the standard normal distribution (namely, inverse normal transformation); (3) probabilistic estimation of expression residuals (PEER) with available information from samples (gender and age), because these two

covariates could have an effect on gene expression; (4) transformation of the residuals of each gene to the standard normal distribution N(0,1).

cis-eQTL analysis, identification of eSNPs and eGenes, and network analyses

We defined the cis-eQTL window for each gene as 1 Mb from the transcription start site (TSS). The nominal p-values for the SNP-gene pairs in this window were generated using Matrix eQTL [283], which performed linear regression on the transformed residuals with the corresponding imputed genotypes. To address multiple comparisons of many SNPs in LD for a given gene, we applied the permutation procedure using FastQTL [284], which generated an empirical p-value for each gene. We then used the Storey approach to calculate q-values [285], and defined genes with a significant q-value (≤ 0.05) as eGenes. In addition, we generated a list of all significantly associated SNP-gene pairs by calculating the threshold of significant nominal p-value for each eGene based on the permutation method.
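The per-gene procedure described above can be sketched in a few lines. The Python code below is a toy illustration under stated assumptions and does not reproduce Matrix eQTL or FastQTL: it applies a rank-based inverse normal transformation (the rank offset of 0.5 is an assumption), fits a simple regression of expression on genotype dosage, and obtains a gene-level empirical p-value by direct permutation of the best absolute correlation across the cis SNPs. FastQTL additionally approximates the permutation null with a beta distribution, which is omitted here. All function names and data are invented; only the 1 Mb window comes from the text.

```python
import random
from statistics import NormalDist

CIS_WINDOW_BP = 1_000_000  # 1 Mb on either side of the TSS, as in the text


def in_cis_window(snp_pos, tss_pos, window=CIS_WINDOW_BP):
    """True if the SNP lies within the cis window around the gene's TSS."""
    return abs(snp_pos - tss_pos) <= window


def inverse_normal_transform(values):
    """Rank-based mapping to N(0, 1); ties are broken by input order."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    normal = NormalDist()
    out = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        out[idx] = normal.inv_cdf((rank - 0.5) / n)  # assumed rank offset
    return out


def slope_and_r(x, y):
    """Least-squares slope of y on x and the Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # monomorphic SNP or constant expression
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


def gene_level_permutation_p(snp_dosages, expression, n_perm=200, seed=0):
    """Empirical p-value for the best |r| across all cis SNPs of one gene."""
    def best_abs_r(expr):
        return max(abs(slope_and_r(d, expr)[1]) for d in snp_dosages)

    rng = random.Random(seed)
    observed = best_abs_r(expression)
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-expression link
        if best_abs_r(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data (invented): 30 samples, two cis SNPs, expression tracking SNP 1.
snp1 = [0] * 10 + [1] * 10 + [2] * 10
snp2 = [0, 1, 2] * 10
raw_expression = [g + 0.01 * i for i, g in enumerate(snp1)]
expression = inverse_normal_transform(raw_expression)

slope, r = slope_and_r(snp1, expression)
p_gene = gene_level_permutation_p([snp1, snp2], expression)
print(slope > 0, p_gene < 0.05)
```

Permuting expression labels once and re-testing every cis SNP preserves the LD among SNPs, which is why a single gene-level p-value accounts for the many correlated tests within the window.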

The eGene network analyses were generated using QIAGEN's Ingenuity Pathway

Analysis (IPA, QIAGEN Redwood City, www.qiagen.com/ingenuity).

Histone Mark and Transcription Factor Binding Site Enrichment Analysis

Each histone tail modification marks a different regulatory region of the genome. We used this information to discover which regulatory regions the eSNPs were localized to, and to interpret what their function could be. The enrichment analysis considered the odds ratio of significant

SNPs compared to 1,000 null sets of variants. The null sets of variants were matched to the eSNPs for number, distance to TSS, and allele frequency. To avoid enrichment bias, we pruned the eSNPs so that no pair of SNPs had an r2 greater than 0.2 using PLINK (--exclude range --indep-pairwise 50 5 0.2).

We acquired the histone marks from human kidney proximal tubule cells that we previously published[166], as well as from human kidney tissues at the Roadmap Epigenome

Project (GSM621634, GSM670025, GSM621648, GSM772811, GSM621651, GSM1112806,

GSM621638). Human kidney cell (HKC8) ChIP-seq data can be accessed at the Gene Expression

Omnibus (GSE49637), and transcription factor ChIP-seq data (161 factors) were obtained from the ENCODE project

TFBS clusters (V3)

(http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeRegTfbsClustered/).

GWAS SNPs enrichment in kidney cis-eQTLs

To determine whether our eSNPs were enriched for disease traits, we obtained eGFR-associated GWAS SNPs from the CKDGen Consortium (http://ckdgen.imbi.uni-freiburg.de) and several other studies (Table S6). This yielded 67 leading SNPs. We obtained other GWAS-identified SNPs from the National Human Genome Research Institute catalog (NHGRI)

(www.ebi.ac.uk/gwas; accessed date: 9/30/2015).

For the NHGRI GWAS SNPs, we randomly sampled a matched number of SNPs (n=67) from each trait: digestive system disease, nervous system disease, immune system diseases, cardiovascular disease, biological processes, other traits, and body measurement. We took SNPs from traits that have a number of SNPs comparable to kidney diseases (n=110). We expanded the list of GWAS SNPs by including tagging SNPs with r2 ≥ 0.8. We then removed SNPs in the same LD block before performing cis-eQTL analysis using Matrix eQTL. We set the cis-eQTL discovery p-value threshold to 1.0E-2 as a compromise between the number of SNPs overlapping in each group and the weak multiple-hypothesis effect (Table S1). The enrichment significance was computed using Fisher’s exact test, comparing each trait against all other traits according to whether or not their SNPs were cis-eQTLs. We ran the test for all groups mentioned above.
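The enrichment test described above operates on a 2x2 table (trait of interest versus all other traits, cis-eQTL versus not). The Python sketch below computes a one-sided Fisher's exact p-value directly from the hypergeometric distribution. It is illustrative only: whether the original analysis used a one-sided or two-sided test is not stated, so the one-sided form is an assumption, and the counts are invented rather than taken from the study.

```python
from math import comb, log10


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment in a 2x2 table.

                      cis-eQTL   not cis-eQTL
    trait of interest     a           b
    all other traits      c           d

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    k_max = min(row1, col1)
    tail = sum(comb(row1, k) * comb(row2, col1 - k)
               for k in range(a, k_max + 1))
    return tail / total


def odds_ratio(a, b, c, d):
    """Sample odds ratio; infinite when b * c is zero."""
    return float("inf") if b * c == 0 else (a * d) / (b * c)


# Invented counts: 20 of 67 kidney-trait SNPs are cis-eQTLs,
# versus 40 of 400 SNPs pooled from the other trait groups.
p = fisher_exact_greater(20, 47, 40, 360)
print(odds_ratio(20, 47, 40, 360), -log10(p))
```

The -log10 transform in the last line matches the scale used on the y-axis of the enrichment figure.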

Integrative statistical analysis of kidney eQTL and kidney function GWAS

To integrate the kidney eQTL and kidney function GWAS results, we applied the

Bayesian colocalization approach coloc [295, 296]. In short, coloc uses summary statistics to test for colocalization of GWAS and eQTL signals with the LD information. Coloc supports multiple hypotheses and can distinguish insufficient association from colocalization or distinct

associations. Coloc uses posterior probabilities to measure the likelihood of each hypothesis, with the prior beliefs of association and colocalization, and the data on genetic association with disease and gene expression.

The five hypotheses used in the analyses were: H0: No association with either trait; H1:

Association with trait 1, not with trait 2; H2: Association with trait 2, not with trait 1; H3: Association with trait 1 and trait 2, two independent SNPs; H4: Association with trait 1 and trait 2, one shared

SNP. As criteria for significant results, we required both PP4 > 0.8 and PP4/PP3 > 5 (where PP3 and PP4 are the posterior probabilities of H3 and H4) [297].
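To show how posterior probabilities for the five hypotheses arise from per-SNP evidence, the following Python sketch combines log approximate Bayes factors for the two traits in the manner of the published coloc method. It is a simplified re-implementation for illustration, not the coloc package itself. The per-SNP log Bayes factors are assumed to be pre-computed and are invented here, and the prior values p1 = 1E-4, p2 = 1E-4 and p12 = 1E-5 are assumed defaults, since the priors used in the study are not stated.

```python
from math import exp, expm1, log


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == float("-inf"):
        return m
    return m + log(sum(exp(x - m) for x in xs))


def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Return [PP0, PP1, PP2, PP3, PP4] from per-SNP log Bayes factors.

    labf1 and labf2 hold one value per SNP, in the same SNP order, for
    trait 1 (for example GWAS) and trait 2 (for example eQTL).
    """
    if len(labf1) != len(labf2) or len(labf1) < 2:
        raise ValueError("need the same two or more SNPs for both traits")
    l1 = logsumexp(labf1)
    l2 = logsumexp(labf2)
    l12 = logsumexp([a + b for a, b in zip(labf1, labf2)])
    # H3 sums over pairs of different SNPs: all pairs minus same-SNP pairs.
    gap = l12 - (l1 + l2)
    l_pairs = (l1 + l2) + log(-expm1(gap)) if gap < 0 else float("-inf")
    log_h = [
        0.0,                          # H0: no association
        log(p1) + l1,                 # H1: trait 1 only
        log(p2) + l2,                 # H2: trait 2 only
        log(p1) + log(p2) + l_pairs,  # H3: two independent SNPs
        log(p12) + l12,               # H4: one shared SNP
    ]
    norm = logsumexp(log_h)
    return [exp(v - norm) for v in log_h]


# Invented three-SNP region: both traits peak at the first SNP.
shared = coloc_posteriors([10, 0, 0], [10, 0, 0])
# Invented region where the traits peak at different SNPs.
distinct = coloc_posteriors([10, 0, 0], [0, 0, 10])
print([round(v, 3) for v in shared])
print([round(v, 3) for v in distinct])
```

In the first toy region nearly all posterior mass falls on H4, so the locus would satisfy both acceptance criteria; in the second, H3 receives more support than H4.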

Gene Expression correlation with eGFR

Correlations between target gene expression and eGFR in human kidney samples were examined using previously published microdissected-tubule microarray data from 95 human cortices [290]. Data were uploaded to ArrayExpress and can be accessed under accession code

E-MTAB-2502. Pearson’s correlation was applied to the normalized gene expression and the eGFR level in R (version 3.2.1).
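For readers who want the correlation statistic spelled out, here is a minimal Python sketch of Pearson's r together with the t statistic commonly used to test it. The analysis itself was run in R; this code is only an illustration, the function names are hypothetical, and the expression and eGFR values are invented.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def correlation_t_statistic(r, n):
    """t statistic with n - 2 degrees of freedom for testing r = 0."""
    return r * sqrt((n - 2) / (1 - r * r))


# Invented values: normalized expression and eGFR for six samples.
expression = [6.1, 6.8, 7.0, 7.4, 7.9, 8.3]
egfr = [22.0, 35.0, 48.0, 61.0, 70.0, 95.0]
r = pearson_r(expression, egfr)
print(round(r, 3), round(correlation_t_statistic(r, len(egfr)), 2))
```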

Zebrafish functional experiment

Zebrafish were maintained according to established Institutional Animal Care and Use

Committee protocols. Briefly, we injected zebrafish embryos with Manba splice-blocking morpholino (MO, GeneTools, Philomath, OR) at the one-cell stage at various doses. We compared the number of abnormal morphant embryos to control embryos, injected with a standard control MO designed by GeneTools, using the chi-square test. We documented the development of gross edema at 4 days post-fertilization in live embryos.

CHAPTER 3: Cytosine methylation changes in enhancer regions of core pro-fibrotic genes characterize kidney fibrosis development

3.1 Abstract

One in eleven people is affected by chronic kidney disease, a condition characterized by kidney fibrosis and progressive loss of kidney function. Epidemiological studies indicate that adverse intrauterine and postnatal environments have a long-lasting role in chronic kidney disease development. Epigenetic information represents a plausible carrier for mediating this programming effect. Here we demonstrate that genome-wide cytosine methylation patterns of healthy and chronic kidney disease tubule samples obtained from patients show significant differences. We identify differentially methylated regions and validate these in a large replication dataset. The differentially methylated regions are rarely observed on promoters, but mostly overlap with putative enhancer regions, and they are enriched in consensus binding sequences for important renal transcription factors. This indicates their importance in gene expression regulation. A core set of genes that are known to be related to kidney fibrosis, including genes encoding collagens, show cytosine methylation changes correlating with downstream transcript levels. Our report raises the possibility that epigenetic dysregulation plays a role in chronic kidney disease development via influencing core pro-fibrotic pathways and can aid the development of novel biomarkers and future therapeutics.

3.2 Introduction

CKD clinical retrospective data indicate that altered nutrient availability during development can have a long-lasting effect on the development of adult diseases, a phenomenon called programming. Hypertension and chronic kidney disease (CKD) present one of the highest sensitivities to intrauterine programming [310]. Epigenetic changes caused by altered intrauterine nutrient availability have been proposed as the mechanistic link for hypertension and CKD development [311]. Epigenetic modifications are inherited during cell division, thus solidifying “the memory or programming” effects of the environment [312].

The epigenome, which includes the covalent modifications of DNA and its associated proteins, and defines DNA accessibility to the transcriptional machinery, is the key determinant of outcome after transcription factor binding. At the root of epigenetic modification is the direct chemical modification of cytosines by methylation [313]. In different cancer types, hypermethylation of tumor suppressor gene promoters has been observed [314]. Increased promoter methylation can interfere with transcription factor binding, causing loss of tumor suppressor expression and thereby contributing to the malignant transformation [315, 316].

Agents that reduce cytosine methylation, for example azacytidine, are now in clinical use and associated with improvements in clinical outcomes, especially for patients with myelodysplastic syndrome [317]. In addition, mutations of different chromatin-modifying enzymes have been described in various cancer types, contributing to alterations in the cancer epigenome [318].

However, little was known about the epigenome of chronic human diseases as opposed to cancers. Most previous studies have been performed on cultured cells, animal models, or surrogate cell types (mostly circulating mononuclear cells) [319]. As the epigenome is cell-type-specific, little mechanistic information can be drawn from cultured cells and surrogate cell types [320]. To understand whether epigenetic changes occur and potentially contribute to CKD development in human samples, we performed genome-wide cytosine methylation profiling of tubule epithelial cells obtained from CKD and control kidneys. We found that core fibrosis-related genes show cytosine methylation changes in their gene regulatory regions. In vitro studies indicate that cytosine methylation differences play a role in regulating transcript expression.

Examining the CKD epigenome can be an important first step in understanding the role of epigenetics outside the cancer field [321].


3.3 Results

3.3.1 CKD kidneys show distinct cytosine methylation profile

For our study, human kidney samples were collected from healthy living transplant donors and surgical nephrectomies and categorized based on their clinical and pathological characteristics (Table 3-1). In the initial dataset, we combined hypertensive and diabetic CKD samples as cases, as the clinical, histological and gene expression profiles of these samples were very similar. In the replication dataset, only diabetic CKD (DKD) samples were used. In both datasets, the criteria for controls were an estimated glomerular filtration rate (eGFR) greater than 60 cc/minute/1.73 m2, absence of significant proteinuria, and less than 10% fibrosis on histology.

Samples with significant hematuria or other signs of glomerulonephritis (HIV, hepatitis, or lupus) were excluded from the analysis. In summary, 26 samples were used for the initial discovery phase, the phenotype analysis was significant for racial diversity, and it included subjects with and without diabetes both as cases and controls (Table 3-1).

To avoid cell-type heterogeneity, we microdissected each renal cortical sample and used the tubular epithelial cell portion for the initial analysis. Our and other labs have previously published that this fraction represents mainly the proximal tubule portion of the human kidney [322]. Genome-wide cytosine methylation analysis was performed on each sample using methylation-sensitive and -insensitive isoschizomer enzymes (HpaII and MspI) followed by (HpaII) fragment enrichment by ligation-mediated PCR (HELP) [248]. Samples were hybridized on NimbleGen whole genome-covering microarrays (1.3 million loci). Focusing on loci that showed more than 50% difference in their methylation ratio and a P-value <0.01, we identified 4,751 differentially methylated regions (DMRs) between control and diseased tubule samples (Figure 3-1A; complete list provided in Appendix 3). The volcano plot analysis (fold change of methylation plotted against the negative log2 of P-value) indicated that 70% of the DMRs showed lower methylation levels in CKD (Figure 3-1A). We found that cytosine methylation differences suffice for proper clustering and supervised classification of control and CKD kidney samples (Figure 3-1B). The computational annotation identified a total of 1,535 unique genes in the vicinity of the DMRs.

Table 3-1 | Demographic, clinical and histological characteristics of the samples

| Characteristics | Diseased | Healthy | P-value |
|---|---|---|---|
| n | 12 | 14 | |
| Age (years), mean ± SD | 68.0 ± 10.81 | 61.14 ± 11.2 | 0.11 |
| Ethnicity | | | |
| Asian, Pacific Islander | 0 | 1 | |
| White, non-Hispanic | 4 | 2 | |
| Black, non-Hispanic | 4 | 4 | |
| Hispanic | 1 | 3 | |
| Other and unknown | 3 | 4 | |
| Height (cm), mean ± SD | 165 ± 8.69 | 166.5 ± 8.63 | 0.6 |
| Weight (kg), mean ± SD | 78.0 ± 22.02 | 88.32 ± 15.93 | 0.2 |
| BMI (kg/m2), mean ± SD | 27.85 ± 6.41 | 31.25 ± 5.58 | 0.18 |
| Diabetes | 6 | 5 | |
| Hypertension | 11 | 12 | |
| Proteinuria (dipstick) | 3.0 ± 1.83 | 0.36 ± 0.81 | 1.80E-04 |
| Serum BUN (mg/dL), mean ± SD | 35.0 ± 14.7 | 17.71 ± 5.85 | 4.60E-04 |
| Serum creatinine (mg/dL), mean ± SD | 3.0 ± 1.61 | 1.08 ± 0.18 | 2.00E-03 |
| eGFR (ml/minute/1.73 m2), mean ± SD | 29.0 ± 13.68 | 70.94 ± 8.35 | 1.06E-09 |
| Histology | | | |
| Glomerulosclerosis (%) | 31.0 ± 31.35 | 3.31 ± 5.52 | 4.00E-03 |
| Mesangial matrix expansion | 1 ± 0.91 | 0.17 ± 0.39 | 0.03 |
| Tubular atrophy (%) | 34.0 ± 24.94 | 9.82 ± 15.76 | 6.00E-03 |
| Interstitial fibrosis (%) | 34.0 ± 25.15 | 5.68 ± 5.07 | 4.00E-04 |
| Vascular sclerosis, intima | 2.0 ± 0.78 | 0.9 ± 1.1 | 1.50E-03 |
| Vascular sclerosis, arterioles | 2.0 ± 0.78 | 0.29 ± 0.62 | 4.00E-04 |

Gene ontology annotation showed that genes around the DMRs are enriched for cell adhesion and development-related functions including collagen, fibronectin, TGF-β and Smad proteins (Figure 3-1C), and many of these genes are known to play a critical role in CKD development. In summary, microdissected kidney tubule cells showed distinct differences in their cytosine methylation patterns in CKD.


Figure 3-1 | Statistically significant cytosine methylation differences in chronic kidney disease. (A) Volcano plot analysis of cytosine methylation differences. The x-axis represents the relative cytosine methylation difference of control (CTL) versus CKD samples; the y-axis represents the negative log2 of the P-value of that locus. The mean P-value and mean difference of 1.3 million loci present on the chips are plotted on the graph. The green and red lines represent the statistical criteria used for further analysis (P-value and fold change, respectively). (B) Hierarchical cluster analysis of the differentially methylated regions. Each column represents changes from one individual kidney sample; blue indicates hypermethylation in CKD, while red represents hypomethylation in CKD. The chart below shows the clinical parameters of the samples: glomerular filtration rate, diabetes status (DM, diabetes mellitus), sex, and age (aged >65 years or <65 years). (C) Gene Ontology analysis of the 1,535 DMRs mapped to unique genes using DAVID gene ontology annotation groups (biological process level 1 annotation).

3.3.2 Validation and external replication of the results

Internal validation of the results was performed using site-specific primer-based amplification of bisulfite-converted genomic DNA and Sequenom MassArray Epityper quantification of modified cytosines [323]. This method is based on mass spectrometry that allows us to determine absolute methylation levels. We correlated these mass array based absolute methylation levels with the HpaII/MspI relative ratios (Appendix 4).
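The comparison of MassArray absolute methylation levels with HELP relative ratios can be illustrated with a simple Pearson correlation. This is a sketch only: the text does not state which correlation measure was used, so Pearson's r is an assumption, and the paired values below are hypothetical. Because a higher HpaII/MspI ratio reflects lower methylation, a negative correlation is the expected direction.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical paired measurements at five loci (illustration only):
# HELP log2(HpaII/MspI) ratio and MassArray methylated fraction.
help_log_ratio = [-2.0, -1.0, 0.0, 1.0, 2.0]
massarray_fraction = [0.9, 0.7, 0.5, 0.3, 0.1]
print(pearson_r(help_log_ratio, massarray_fraction))
```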


External validation was performed on 87 microdissected human kidney tubule epithelial samples: 21 from patients with DKD and 66 controls with hypertension (n = 22), diabetes mellitus (n = 22) or neither (n = 22) (SYH and KS, unpublished observation). Genome-wide methylation profiling of the validation set was performed using Illumina Infinium 450K methylation-sensitive bead arrays. This method uses site-specific probes for bisulfite-converted DNA, which is fundamentally different from the restriction enzyme-based method used in the HELP analysis.

From the 1,535 unique genes found around DMRs in the initial dataset, we examined 1,092, as these genes were also present on the Illumina Infinium (and Affymetrix for gene expression) arrays (Figure 3-2A). Significant methylation differences were detected for 1,061 genes, corresponding to 98% of the genes in the original dataset (Figure 3-2A). The complete list of DMRs in the original and the replication dataset can be found in Appendix 4.

Locus-specific validation was performed for six different genes, including COLIVA1. COLIVA1/A2 are critical basement membrane proteins synthesized by epithelial cells. Increased expression is known to be responsible for increasing the thickness of the basement membrane and is considered to be an early change in progressive kidney fibrosis [324]. The COLIVA1 and COLIVA2 transcripts are transcribed from a single promoter (Figure 3-2B). This locus showed significantly lower cytosine methylation in CKD samples (Figure 3-2C). We examined the absolute methylation level of COLIVA1/2 by MassArray Epityper analysis (Figure 3-2D) in control and CKD samples and confirmed the methylation differences between healthy and diseased tubule epithelial cells. Next, we examined COLIVA1/2 methylation in the validation dataset (Infinium arrays from 66 control and 21 DKD samples). Using this dataset, we also confirmed the predominant (2 to 12%) hypomethylation of this locus (Figure 3-2E). The methylation differences correlated with increased COLIVA1 transcript (Figure 3-2F) and protein levels (Figure 3-2G). Using the MassArray Epityper, we also validated the methylation status of additional loci (Appendix 5). In summary, the methylation differences appear to be highly consistent between the original and validation experiments using multiple different methods.


Figure 3-2 | External and internal validation of the observed changes. (A) Correlation of the DMRs identified in 26 samples using the HELP method and an external dataset containing 87 human kidney samples analyzed using Illumina Infinium 450K arrays. We found concordant regulation of 1,061 transcripts (98%) from the 1,092 mapped genes using this validation dataset. Of transcripts that showed both differential methylation and expression, 404 (97%) were confirmed. (B) An example of a DMR identified by the HELP assay and confirmed in the validation dataset. This DMR is localized in the intronic region of COLIVA1. (C) Methylation status and (D) gene expression of COLIVA1 in the original HELP dataset in control and CKD samples. (E) MassArray confirmation of methylation status of the COLIVA1 locus (blue is mean ± standard deviation of control samples, red is mean ± standard deviation of CKD samples). (F) Methylation status of the COLIVA1 locus in the validation dataset (Infinium 450K arrays). The data represent the mean differences in absolute methylation levels of individual cytosines at the COLIVA1 locus. (G) Immunohistochemistry of COLIVA1 expression in control (CTL), diabetic (DM), DKD and CKD kidneys.

3.3.3 Differentially methylated loci are enriched in kidney-specific gene regulatory regions

Cytosine methylation of promoters is critically important, because it can interfere with TF binding and thereby modulate transcription (You and Jones, 2012). The number of DMRs localized to RefSeq annotated promoters and 5' UTRs was significantly lower (about 50%) than the expected ratio (Figure 3-3A). On the other hand, more than half of the DMRs were in gene body-related regions. Only a few DMRs localized to exons (approximately 200); the majority of the differences we observed were in the intronic regions (Figure 3-3A). We also examined the RefSeq annotated genomic distribution of the hypo- and hypermethylated regions (Appendix 6).

The percentage of hypermethylated regions was similar in the different RefSeq-based annotation groups. We found that more loci showed increased methylation at the 3' UTR. In summary, the genomic regions that showed differences in their cytosine methylation pattern in CKD were not promoters, but intronic and transcription termination regions and 3' UTRs.
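The relative enrichment shown in Figure 3-3A compares the share of DMRs falling in each annotation class with the share of all array loci in that class. A minimal sketch of that ratio follows; the function name and all counts are hypothetical and chosen only to make the arithmetic easy to follow, and every category is assumed to have a non-zero count on the array.

```python
def enrichment_ratios(dmr_counts, array_counts):
    """Observed/expected ratio of DMRs per annotation category.

    observed = fraction of DMRs in the category
    expected = fraction of all array loci in the category
    A ratio below 1 indicates depletion; above 1 indicates enrichment.
    """
    dmr_total = sum(dmr_counts.values())
    array_total = sum(array_counts.values())
    ratios = {}
    for category, n_array in array_counts.items():
        observed = dmr_counts.get(category, 0) / dmr_total
        expected = n_array / array_total
        ratios[category] = observed / expected
    return ratios


# Hypothetical counts for illustration only (not values from this study).
array_counts = {"promoter": 200, "intron": 500, "other": 300}
dmr_counts = {"promoter": 10, "intron": 60, "other": 30}
for category, ratio in enrichment_ratios(dmr_counts, array_counts).items():
    print(f"{category}: {ratio:.2f}")
```

In this toy example promoters hold 20% of array loci but only 10% of DMRs, giving a ratio of 0.5, which corresponds to the kind of depletion described for promoters above.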

To further understand the functional significance of the DMRs, we generated genome-wide chromatin annotation maps using cultured human proximal tubular epithelial cells (HKC8). First, we performed chromatin immunoprecipitation followed by next-generation sequencing (ChIP-seq) for a panel of important histone modifications: H3K4me1, H3K4me2, H3K4me3, H3K27ac, H3K27me3, and H3K36me3. Next, we generated gene regulatory annotation maps from the panel of ChIP-seq data using the hidden Markov model-based ChromHMM chromatin segmentation program [325, 326]. Consistent with the RefSeq-based annotation, there were very few DMRs localized to ChromHMM-annotated kidney promoter regions (Figure 3-3B). The analysis indicated that 30% of the DMRs localized to enhancer regions, which was the most significant enrichment. Similar results were obtained when we generated adult kidney cortex ChromHMM maps from published ChIP-seq data (Figure 3-3B) [1].
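Once a chromatin segmentation is available, assigning each DMR to a chromatin state is an interval-overlap problem. The sketch below does not reimplement ChromHMM; it assumes the segmentation has already been exported as (chromosome, start, end, state) tuples with half-open coordinates, and it assigns each DMR the state with the largest overlap. The function name, the tie-breaking rule, and the coordinates are illustrative assumptions.

```python
from collections import Counter


def annotate_dmrs(dmrs, segments):
    """Count DMRs per chromatin state using the segment with the largest overlap.

    dmrs:     list of (chrom, start, end)
    segments: list of (chrom, start, end, state)
    Coordinates are treated as half-open intervals [start, end).
    DMRs that overlap no segment are counted as "unannotated".
    """
    counts = Counter()
    for chrom, start, end in dmrs:
        best_state, best_overlap = "unannotated", 0
        for s_chrom, s_start, s_end, state in segments:
            if s_chrom != chrom:
                continue
            overlap = min(end, s_end) - max(start, s_start)
            if overlap > best_overlap:
                best_state, best_overlap = state, overlap
        counts[best_state] += 1
    return counts


# Hypothetical segmentation and DMR coordinates (illustration only).
segments = [
    ("chr1", 0, 1000, "promoter"),
    ("chr1", 1000, 5000, "enhancer"),
    ("chr1", 5000, 9000, "repressed"),
]
dmrs = [("chr1", 900, 1500), ("chr1", 2000, 2400),
        ("chr1", 6000, 6500), ("chr2", 100, 300)]
print(annotate_dmrs(dmrs, segments))
```

The nested loop is quadratic and is meant for clarity; for genome-scale data a sorted sweep or an interval tree would be the more practical choice.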


Figure 3-3 | Chronic kidney disease differentially methylated regions are localized to kidney-specific enhancer regions. (A) RefSeq annotation of the DMRs. Relative enrichment ratio of the DMRs compared with the representation of the different elements on the methylation microarray. TSS, transcription start site; TTS, transcription termination site. (B) DMRs overlap with regulatory element (chromatin state) annotation maps of renal tubule epithelial cells (HKC8) and adult kidney cortex, indicating that most differentially methylated cytosines are localized to enhancer (yellow and orange) regions in kidney epithelial cells. The color code annotation of the chromatin state map is provided bottom right. (C) Chromatin annotation of the DMRs in five different ENCODE cell lines (H1, human embryonic stem cells; HepG2, hepatocytes; HUVEC, endothelial cells; K562, erythroid cells; NHLF, human lung fibroblasts).

Next, we compared CKD-specific DMRs with chromatin annotation maps of other, different cell types using the publicly available ENCODE database (Figure 3-3C). We found that CKD-specific DMRs localized mostly to repressed chromatin regions, while transcription and enhancer regions showed the second-highest enrichment. The results indicated that DMRs in CKD were enriched in kidney-specific gene regulatory regions, mainly (intronic) enhancers.

Gene regulatory regions are usually characterized by DNase I hypersensitivity (DHS) [327], because DNA is usually histone-free in gene regulatory regions so transcription factors can bind to these regions. Therefore, we overlapped the DMRs with human fetal kidney and human proximal tubule epithelial cell DHS-seq data (Gene Expression Omnibus (GEO) accession GSM530655). The statistical analysis confirmed that DMRs are enriched in DHS sites in both the fetal kidney epithelial dataset and the cultured tubule epithelial cell dataset. In addition, we examined whether DMRs that overlap with DHS sites show similarities, by identifying the top 10 consensus sequences using the MEME software package [328]. To search for TF binding motifs among the top 10 sequences, we mapped the sequences to the JASPAR, UniProbe and Transfac databases [329]. This analysis highlighted that the DMRs contain consensus-binding sequences for transcription factors that play important roles in proximal tubule development, including SIX2, HNF, and TCFAP. The list of computationally identified transcription factor consensus motifs is shown in Figure 3-4A.

Figure 3-4 | Chronic kidney disease differentially methylated regions are enriched for kidney-specific transcription factor binding sites. (A) The DMR and DHS sites contain consensus sequences. The transcription factor binding site motifs and their statistical enrichment from the de novo searched consensus sequences in DMR and DHS sites. (B) A specific example of an intronic DMR (within the EZR gene). The genomic location of the DMR is at the top, followed by the RefSeq representation of EZR; fetal kidney (FK)-specific DHS tracks (in blue); HKC8 cell-specific H3K4me1 and H3K4me3 tracks; HKC8 cell-specific ChromHMM annotation of the locus (yellow, enhancer; red, promoter; green, transcription-associated region; the full color coding key is shown bottom right) - the sequences contain consensus-binding sites for the key kidney transcription factor SIX2/3, with the SIX2/3 binding motif illustrated as a sequence logo plot below; and adult kidney (AK)-specific H3K4me1 (blue) and H3K4me3 (green) tracks [1].

Figure 3-4B illustrates our motif analysis. Here a DMR is located in the intronic region of the EZR (ezrin) gene. The DMR overlapped with adult kidney and renal tubular epithelial cell-specific H3K4me1 histone modification, but not with H3K4me3 enrichment. H3K4me1 is a specific histone tail modification for enhancer regions, while H3K4me3 is a marker of promoters. The ChromHMM-based gene regulatory region annotation confirmed that this region is an enhancer in renal epithelial cells (yellow region in the genome).

These results indicate that by multiple different approaches this DMR is located in a gene regulatory region, an enhancer (Figure 3-4B). In addition, this region contained a consensus-binding sequence for SIX2, further confirming that this is a gene regulatory region. In summary, our results indicate that CKD-specific DMRs are located in non-promoter gene regulatory regions, mainly enhancers, and contain consensus-binding motifs for renal-specific transcription factors.

3.3.4 Differentially methylated regions are functionally relevant and correlate with transcript levels

To study the functional significance of DMRs, we first examined whether they correlate with downstream transcript levels. Gene expression changes were analyzed using RNA samples extracted from the same microdissected tubule samples used in the methylation assay. Individual RNA samples were hybridized to Affymetrix U133 arrays and the data were normalized and analyzed using established pipelines [322]. From the 1,092 transcripts in close proximity to the DMR regions (Figure 3-5A), we identified 415 (approximately 40%) genes showing significantly differential expression in the CKD samples (Figure 3-5A).


As most DMRs were in non-promoter regions, most transcript changes correlated with intronic DMRs (Supplementary file 8). Gene ontology and network analyses highlighted differences in cell adhesion (collagens and laminins) and development-related pathways (Figure 3-5B, C). Specifically, we observed significant enrichment for differential expression and methylation in the TGF-β pathway, especially in TGFBR3, SMAD3, SMAD6 and other targets (Figure 3-5D). These genes are known to be critical in CKD development [330, 331]. In summary, cytosine methylation changes showed correlation with gene expression differences and identified concordant changes in the TGF-β pathway, a well-known regulator of kidney fibrosis development.

Figure 3-5 | Differentially methylated regions correlate with transcript changes. (A) The 4,751 DMRs mapped to 1,092 unique genes that were present in the Affymetrix arrays. There were 415 transcripts that showed differences both in their methylation status and their expression in CKD samples. The RefSeq-based locations of the DMRs are also shown. While most differentially methylated regions localize to gene body regions, they also show correlation with the expression of many of those genes. Not only are the 415 transcripts differentially expressed, they also show differences in their cytosine methylation profiles. (B) DAVID-based gene ontology annotation of the 415 transcripts. (C) Network chart of the genes that are both differentially expressed and methylated (pathways related to development are highlighted in orange). (D) Methylation and gene expression level of key molecules (RUNX3, RARB, SMAD6) identified by the network analysis in control (CTL) and CKD samples.

3.3.5 Methylation in differentially methylated regions affects gene expression in vitro

To further examine the relationship between cytosine methylation and transcript level changes, we analyzed gene expression and cytosine methylation patterns of tubule epithelial cells at both baseline and 9 days after treatment with a DNA methyltransferase inhibitor, decitabine (5-aza-2'-deoxycytidine). We used Affymetrix ST1.0 arrays to compare gene expression changes and the Infinium 450K arrays to examine cytosine methylation changes in control (n = 3) and decitabine-treated cells (n = 4).

Figure 3-6 | Regulation of transcripts by a DNA methyltransferase inhibitor in in vitro cultured human tubular epithelial cells. Gene ontology terms of transcripts showing differential expression in the decitabine-treated cells. (A) Illustration of regions that showed differential methylation of cultured HKC8 cells treated with 0.5 µM decitabine (5'DAC). CTL, control. (B) The interconnected network analysis highlighted the differential expression of cell adhesion and developmental pathways. These genes are also differentially expressed and methylated in the original CKD dataset. GO, gene ontology.

We tested whether the DMR and gene expression changes observed in vivo (CKD) correlate with those observed in vitro (decitabine treatment). Decitabine is a cytosine analogue; therefore, we hypothesize that after decitabine treatment, the cytosine methylation changes are the primary cause of transcript level changes. A limitation of the experiment is that decitabine induces demethylation of genomic loci that could be different from the CKD DMRs. Large numbers of loci showed concordant differential methylation and gene expression changes in CKD in vivo as well as after (0.5 µM) decitabine treatment in vitro, indicating that cytosine methylation changes in CKD might be the functional drivers of transcript level changes (Supplementary file 9).

Genes related to cell adhesion (for example, collagen molecules) showed differential methylation following decitabine treatment as well (Figure 3-6A, B). In addition, just as we observed before, we found that genes related to development and cell adhesion were also differentially expressed following decitabine treatment (Figure 3-6B, C).

SMAD3 appears to be one of the most important mediators of the pro-fibrotic effect in the TGF-β and angiotensin II pathways [331]. The SMAD3 locus contained DMRs in both the initial and validation datasets. SMAD3 expression levels were lower in both the original and confirmation datasets. Decitabine changed the methylation of this locus and subsequently also changed SMAD3 transcript levels.

To illustrate our findings, we examined RUNX1. While RUNX1 clearly plays an important role in leukemia development, it is also expressed in the developing and adult kidneys of both mice and humans [332]. RUNX1 was also shown to be differentially expressed in CKD tubules [322]. Both the original and replication datasets showed differential methylation of this locus (Figure 3-7A, C), and RUNX1 transcript levels were increased in both datasets (Figure 3-7B, D). RUNX1 DMRs clustered in ChromHMM-annotated enhancer regions (with H3K4me1 and DHSs) (Figure 3-7G). In vitro treatment with decitabine changed the cytosine methylation of this locus, and the changes overlapped with the enhancer DMRs (Figure 3-7E).

Subsequent to the DMR change of this locus, we also observed an increase in RUNX1 transcript levels both in vivo and in vitro (Figure 3-7F). As decitabine did not change the methylation of the RUNX1 promoter and affected only the methylation levels of the enhancer site, the result potentially indicates a causal relationship between enhancer-related DMRs and gene expression changes. The concordant changes in cytosine methylation and gene expression in CKD and in vitro (following DNA methyltransferase inhibitors) indicate that DMRs are potential drivers of critical CKD gene expression.

Figure 3-7 | Gene body cytosine methylation changes drive gene expression differences. RUNX1 methylation and gene expression were examined. (A,B) In the original discovery dataset, the gene body region of RUNX1 was hypomethylated (A) and the corresponding transcript level was increased (B) in CKD. CTL, control. (C,D) The differential methylation (C) and expression (D) of RUNX1 in the DKD replication dataset. (E,F) Transcript levels are increased (F) in vitro in cultured tubules after decreasing the methylation level of the locus following 0.5 µM decitabine (DAC) treatment (E). (G) Genomic representation of the RUNX1 locus showing DMRs in the DKD dataset and in the CKD dataset. Different tracks are shown for the RUNX1 locus, including RefSeq gene, DMRs in the DKD dataset, DMRs in the CKD cells, and histone ChIP-seq data for H3K4me1 and H3K4me3 for adult kidney cortex and DHS sites from fetal kidneys. In addition, ENCODE-based transcription factor binding sites are also shown.

3.4 Discussion and Conclusion

While epigenetic dysregulation has been suggested as a mechanism for the development of many diseases, little is known about the epigenome of normal and diseased human cells and organs. Here we describe cytosine methylation differences in tubule cells obtained from patients with CKD. We found that CKD DMRs have many special features. First, most loci showed consistent cytosine methylation differences in CKD. These changes were smaller than what has been described in the cancer literature previously. While the absolute differences were modest, the identified loci showed highly consistent changes even across different datasets and platforms. Unexpectedly, we found that most methylation differences were localized outside of promoter areas, with promoter regions markedly spared from cytosine methylation differences.

Our results indicate that the differentially methylated regions were located mainly at candidate enhancers. We found that the DMRs contain consensus-binding motifs for key renal transcription factors (HNF, TCFAP, SIX2). Furthermore, cytosine methylation levels correlated with baseline gene expression changes. These epigenetically distinct but morphologically similar cells also showed differences in their cytokine response. We illustrated our findings in a model hypothesizing that enhancer DMRs might modify TF binding and thereby modify downstream transcript levels. Based on our results, we propose that cytosine methylation changes are causally linked to transcript levels and phenotype development.

As hypertensive and diabetic tubule samples showed similarities both in cytosine methylation and gene expression changes, the observed changes are likely to be part of a common mechanism of progression. This may be expected, as phenotypically the tubulointerstitial fibrosis of DKD and hypertensive CKD is similar. In addition, we found that DMRs were enriched for genes related to kidney development, many of them no longer expressed in the adult kidney. The DMR regions also contained binding sites for key kidney developmental factors such as SIX2, HNF, and TCFAP.

One possible interpretation of our findings is that the epigenetic differences are established during development. This is the time when the cell-type-specific epigenome is established and these genes and transcription factors play functional roles. Therefore, they can possibly provide the mechanistic link between fetal programming and CKD development. The Brenner-Barker hypothesis, put forward many decades ago [333, 334], proposed that nutrient availability during development has a long-lasting programming role in hypertension and CKD development.

In addition, reactivation of the developmental pathways is also needed during organ injury repair [335]. We can also speculate that the altered developmental wiring of these pathways continues to play a role later on as alterations observed after repair. Indeed, control and CKD kidney epithelial cells showed not only cytosine methylation differences but also different responses to cytokine treatment.

A limitation of our results remains that our samples were collected in a single center. Furthermore, base pair resolution results will likely help to refine the more precise location of DMRs and the methylation differences in the future. While microdissection is an excellent separation method to generate a homogenous tubular epithelial cell population from the kidney, the potential risk for increased cell type heterogeneity in CKD remains. As isolated and cultured cells continued to show many of the epigenetic and transcriptional differences, it is more likely that the observed differences are not related to cell type heterogeneity. In summary, while it has long been speculated that epigenetic dysregulation might occur in non-cancerous diseases, including CKD, here we provide experimental evidence for cytosine methylation changes in human kidney tissue samples, opening the possibility that they play a role in CKD development.

3.5 Materials and Methods

Ethics statement

The clinical study used a cross-sectional design. Kidney samples were obtained from routine surgical nephrectomies. Samples were de-identified and the corresponding clinical information was collected by an individual who was not involved in the research protocol. The study was approved by the Institutional Review Boards of the Albert Einstein College of Medicine Montefiore Medical Center (IRB#2002-202) and the University of Pennsylvania. Histological analysis was performed by an expert pathologist (IRB#815796).

Tissue handling and microdissection

Tissue was placed into RNAlater and manually microdissected at 4 °C for glomerular and tubular compartments as described earlier. Dissected tissue was homogenized and RNA was prepared using RNeasy mini columns (Qiagen, Valencia, CA, USA) according to the manufacturer's instructions. RNA quality and quantity were determined using the Lab-on-Chip Total RNA PicoKit (Agilent BioAnalyzer, Santa Clara, CA, USA). Only samples without evidence of degradation were used. Genomic DNA was extracted by a phenol-chloroform protocol for HELP analysis, and the DNeasy kit was used for the Infinium platform.

DNA methylation analysis by HELP

The HELP assay was carried out as previously published [336]. Intact DNA of high molecular weight was corroborated by electrophoresis on 1% agarose gels in all cases. One microgram of genomic DNA was digested overnight with either HpaII or MspI (NEB, Ipswich, MA, USA). The digested DNA was used to set up an overnight ligation of the HpaII adapter using T4 DNA ligase. The adapter-ligated DNA was used to carry out the PCR amplification of the HpaII- and MspI-digested DNA as previously described [248]. Both amplified fractions were submitted to Roche NimbleGen Inc. (Madison, WI, USA) for labeling and hybridization onto a human hg18 high-density custom-designed oligonucleotide array (50-mers) containing 2.6 million loci. HpaII-amplifiable fragments are defined as genomic sequences contained between two flanking HpaII sites found within 200 to 2,000 bp of each other. All microarray hybridizations were subjected to extensive quality control using the following strategies. First, uniformity of hybridization was evaluated using a modified version of a previously published algorithm [337] adapted for the NimbleGen platform, and any hybridization with strong regional artifacts was discarded and repeated. The raw data can be accessed under GSE49557.

HELP data processing and analysis

Signal intensities at each HpaII-amplifiable fragment were calculated as a robust (25% trimmed) mean of their component probe-level signal intensities. Any fragments found within the level of background MspI signal intensity, measured as 2.5 mean absolute differences (MAD) above the median of random probe signals, were categorized as “failed”. The failed loci therefore represent the population of fragments that did not amplify by PCR, whatever the biological or experimental cause (for example, genomic deletions and other sequence errors). On the other hand, “methylated” loci were so designated when the level of HpaII signal intensity was similarly indistinguishable from background. PCR-amplifying fragments (those not flagged as either “methylated” or “failed”) were normalized using an intra-array quantile approach wherein HpaII/MspI ratios are aligned across density-dependent sliding windows of fragment size-sorted data. The log2(HpaII/MspI) ratio was used as a representative measure of methylation and analyzed as a continuous variable. For most loci, each fragment was categorized as either methylated, if the centered log HpaII/MspI ratio was less than zero, or hypomethylated, if the log ratio was greater than zero.
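
The fragment-level summary and classification described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the function names and toy intensities are hypothetical, the 25% trim is assumed to apply to each tail, "MAD" is computed as the mean absolute difference from the median, a single background cutoff is applied to both channels, and the quantile normalization and centering steps are omitted.

```python
import math
import statistics


def trimmed_mean(values, trim=0.25):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)  # number of values cut from each tail
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)


def background_cutoff(random_probe_signals, n_mad=2.5):
    """Median of random-probe signals plus n_mad mean absolute differences."""
    med = statistics.median(random_probe_signals)
    mad = sum(abs(x - med) for x in random_probe_signals) / len(random_probe_signals)
    return med + n_mad * mad


def classify_fragment(hpaii_probes, mspi_probes, cutoff):
    """Return (label, log2 ratio); the ratio is None for flagged fragments."""
    hpaii = trimmed_mean(hpaii_probes)
    mspi = trimmed_mean(mspi_probes)
    if mspi <= cutoff:
        return "failed", None       # MspI signal within background
    if hpaii <= cutoff:
        return "methylated", None   # HpaII signal within background
    return "amplified", math.log2(hpaii / mspi)
```

For fragments labeled "amplified", the returned log2 ratio corresponds to the uncentered log2(HpaII/MspI) value; the sign-based methylated/hypomethylated call would be applied only after normalization and centering.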

Statistical analysis of HELP data was performed using the statistical software R version 2.13.1 [337]. A two-sample t-test was used for each gene or locus to summarize methylation differences between the two clinical groups (cases and controls). Genes were ranked based on the magnitude of this test statistic, and a set of differentially methylated loci with P-value <0.01 and a fold change >0.5 was identified.
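
As an illustration of the per-locus comparison, the following Python sketch computes a two-sample t statistic and ranks loci by its magnitude. Several choices here are assumptions rather than details from the analysis: the pooled-variance (Student) form of the test, the reading of "fold change" as the absolute difference in mean log2 ratio between groups, and the locus names and values, which are hypothetical. P-values are not computed because they require the t distribution (available in R or SciPy).

```python
import math
import statistics


def two_sample_t(cases, controls):
    """Student's two-sample t statistic with pooled sample variance."""
    n1, n2 = len(cases), len(controls)
    m1, m2 = statistics.mean(cases), statistics.mean(controls)
    v1, v2 = statistics.variance(cases), statistics.variance(controls)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))


def rank_loci(log_ratios, min_diff=0.5):
    """Keep loci whose mean difference exceeds min_diff, ranked by |t|.

    log_ratios maps a locus name to (case_values, control_values).
    """
    rows = []
    for locus, (cases, controls) in log_ratios.items():
        diff = statistics.mean(cases) - statistics.mean(controls)
        if abs(diff) > min_diff:
            rows.append((locus, two_sample_t(cases, controls), diff))
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)
```

A locus with zero variance in both groups would cause a division by zero in this sketch; a production analysis would need to handle that case explicitly.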

Quantitative DNA methylation analysis by MassArray epityping

Validation of HELP microarray findings was carried out by matrix-assisted laser desorption/ionisation-time of flight (MALDI-TOF) mass spectrometry using EpiTyper by MassArray (Sequenom, San Diego, CA, USA) on bisulfite-converted DNA as previously described (Zhou, Opalinska et al. 2011). MassArray primers were designed to cover the flanking HpaII sites for a given HpaII-amplifiable fragment (HAF), as well as any other HpaII sites found up to 2,000 bp upstream of the downstream site and up to 2,000 bp downstream of the upstream site, in order to cover all possible alternative sites of digestion. HAFs are defined as fragments in which two HpaII sites are located 200 to 2,000 bp apart with at least some unique sequence between them; we selected those located at gene promoters and imprinted regions.
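
The HAF definition can be illustrated with a short Python sketch that scans a sequence for HpaII recognition sites (CCGG) and reports adjacent site pairs 200 to 2,000 bp apart. This is a simplified, hypothetical helper: it measures distance between the start positions of adjacent sites, and it does not apply the unique-sequence requirement or the promoter and imprinted-region selection described above.

```python
def find_hafs(sequence, min_len=200, max_len=2000, site="CCGG"):
    """Return (start, end) spans between adjacent HpaII sites min_len..max_len apart."""
    seq = sequence.upper()
    sites = []
    pos = seq.find(site)
    while pos != -1:
        sites.append(pos)
        pos = seq.find(site, pos + 1)
    hafs = []
    # Compare each site with the next one downstream.
    for left, right in zip(sites, sites[1:]):
        if min_len <= right - left <= max_len:
            hafs.append((left, right + len(site)))  # span includes both sites
    return hafs
```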

Gene expression analysis using Affymetrix arrays

Transcript levels were analyzed using Affymetrix U133A and 1.0ST arrays. Probes were prepared using an Affymetrix 3′ IVT kit. After hybridization and scanning, raw data files were imported into Genespring GX software (Agilent Technologies). Raw expression levels were normalized using the RMA16 summarization algorithm. Genespring GX software was then used for statistical analysis; data were filtered by expression to retain probes above the 20th percentile. We applied a Benjamini-Hochberg multiple testing correction and considered a corrected P-value <0.05 significant. Heatmaps of methylation data and gene expression data were generated by unsupervised hierarchical clustering based on squared Euclidean distances. Methylation data used in clustering had a P-value <0.00015 and a fold change ≥0.5. The raw data can be accessed through accession GSE48944.
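
For readers unfamiliar with the Benjamini-Hochberg correction, the standard step-up procedure can be sketched in a few lines of Python. This is a generic textbook implementation with a hypothetical function name; it is not the Genespring GX code, whose internal details are not described here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Probes would then be called significant when their adjusted value falls below the chosen threshold (0.05 in the analysis above).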

Gene ontology and transcription factor binding sites

The Database for Annotation, Visualization and Integrated Discovery (DAVID) bioinformatics package was used for gene ontology and pathway analysis. In addition, Ingenuity Pathway Analysis (IPA, Redwood City, CA, USA) was used to generate networks.

Sequences of DMRs (n = 4,751) were lifted over from hg18 to hg19 using UCSC Genome Browser Utilities. The regions were then intersected with fetal kidney or human kidney epithelial-specific DHS peaks (data from GEO GSM530655); a total of 364 overlapping regions were used. Motif weight matrices overrepresented in the overlapped sequences were identified using MEME version 4.8.0 [328] on the 364 regions with parameters -oc -nmotifs 10 -minw 8 -maxw 50.

Adult kidney ChIP-seq data were acquired from the Roadmap database (GEO accession numbers GSM670025, GSM621638). The overlap was set to be a minimum of 1 bp in length.
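
The 1-bp minimum-overlap criterion can be made concrete with a small Python sketch. It assumes BED-style coordinates (0-based start, exclusive end), and the function names and example intervals are hypothetical. The pairwise scan is adequate for illustration only; genome-scale intersections are normally done with dedicated interval tools.

```python
def overlap_bp(a, b):
    """Number of overlapping bases between two (chrom, start, end) intervals."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def regions_overlapping_peaks(regions, peaks, min_overlap=1):
    """Return the regions that share at least min_overlap bases with any peak."""
    return [
        region
        for region in regions
        if any(overlap_bp(region, peak) >= min_overlap for peak in peaks)
    ]
```

With half-open coordinates, two intervals that merely touch (one ends where the other starts) share zero bases and are therefore not counted as overlapping.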

Motif searching

We compared de novo motifs to motifs available as part of various databases, including Transfac, version 2011.1, Jaspar Core, and UniPROBE using TOMTOM software [329], version 4.8.1. TOMTOM parameters were set to their default values during motif comparisons. When partitioning the de novo motifs, assigning each to a single category, the order of match assignment preference was Transfac, Jaspar Core, UniPROBE, and then the novel motif category.
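
The preference order used to place each de novo motif in a single category can be expressed as a short Python sketch. The function names and the input structure (a mapping from motif name to the set of databases in which a match was reported) are assumptions for illustration; they do not reflect the TOMTOM output format.

```python
# Databases in order of match assignment preference.
PREFERENCE = ("Transfac", "Jaspar Core", "UniPROBE")


def assign_category(matched_databases):
    """Return the first preferred database with a match, else 'novel'."""
    for database in PREFERENCE:
        if database in matched_databases:
            return database
    return "novel"


def partition_motifs(matches_by_motif):
    """Map each motif name to exactly one category."""
    return {motif: assign_category(dbs) for motif, dbs in matches_by_motif.items()}
```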

Cell lines

HKC8 cells were kindly provided by Lorraine Racusen (Johns Hopkins University) and were cultured in DMEM/F12 medium supplemented with 2.5% fetal bovine serum, antibiotics, and insulin, transferrin and selenium. Cells were incubated with 0.5 µM decitabine on days 2, 4, 6, and 8 and harvested on day 9. RNA was isolated using a Qiagen RNeasy kit, labeled using an Ovation transcript labeling kit, and hybridized onto Affymetrix Human ST1.0 arrays.

Chromatin immunoprecipitation sequencing

HKC8 cells were harvested and crosslinked with 1% formaldehyde when they reached 80% confluency on culture plates. Chromatin was sheared using a Bioruptor and immunoprecipitated using antibodies against the H3K4me1 (Abcam ab8895, Cambridge, MA, USA), H3K4me2 (Abcam ab11946), H3K4me3 (Abcam ab8580), H3K36me3 (Abcam ab9050), H3K27ac (Abcam ab4729) and H3K27me3 (Millipore 07-499, Billerica, MA, USA) marks. ChIP was performed as described in the manual of the MAGnify™ Chromatin Immunoprecipitation System (Invitrogen, Grand Island, NY, USA). Quantitative real-time PCR was performed to ensure the quality of the ChIP product. The ChIP product was assessed for size, purity, and quantity using an Agilent 2100 Bioanalyzer (Agilent Technologies). Library preparation and sequencing were performed at the Einstein Epigenome Center. Sequence reads (100 bp) were generated on an Illumina HiSeq 2000 [338]. Reads were aligned to the reference genome (NCBI build 37, hg19) using Bowtie (v 0.12.7) [339]. Repetitively mapped and duplicate reads were excluded. The data can be accessed using accession GSE49637.

ChIP-seq data analysis

We used the MACS version 1.4.1 (model-based analysis of ChIP-Seq) peak-finding algorithm to identify regions of ChIP-Seq enrichment over background [340]. A false discovery rate threshold of 0.01 was used for all datasets. The resulting genomic coordinates in BED format were further used in ChromHMM v1.06 for chromatin annotation, with the following parameters for the two steps: -Xmx1600M -jar ChromHMM.jar BinarizeBed hg19 and -Xmx2000M -jar ChromHMM.jar LearnModel 10 hg19.

DNase I hypersensitive site analysis

Human kidney DHS sequencing data (GEO GSM530655) were analyzed with MACS (v.1.4.1). The resulting peaks were overlapped with the differentially methylated regions. The control random genomic loci were generated using Regulatory Sequence Analysis Tools [341]. To match the properties of the differentially methylated regions, we used the same number of fragments (4,751) and the same average fragment size (443 bp) as parameters for the random loci.
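
The idea of a size-matched random control set can be sketched in Python as follows. This is not the Regulatory Sequence Analysis Tools implementation: it simply draws fixed-length fragments at uniformly random positions, and the function name, the fixed 443-bp length (standing in for the average size), and the chromosome lengths passed in are assumptions made for illustration.

```python
import random


def random_loci(chrom_sizes, n_loci, size, seed=0):
    """Draw n_loci fragments of fixed size at uniform random genomic positions.

    chrom_sizes maps chromosome name to length; every chromosome must be
    longer than `size`.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    chroms = list(chrom_sizes)
    # Weight chromosomes by their number of valid start positions.
    weights = [chrom_sizes[c] - size + 1 for c in chroms]
    loci = []
    for _ in range(n_loci):
        chrom = rng.choices(chroms, weights=weights)[0]
        start = rng.randrange(chrom_sizes[chrom] - size + 1)
        loci.append((chrom, start, start + size))
    return loci
```

A call such as `random_loci(sizes, 4751, 443)` would produce a control set matching the number and average size of the DMRs, given a `sizes` dictionary of real chromosome lengths.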

Illumina Infinium 450K BeadChip arrays

Genomic DNA (200 ng) was purified using the DNeasy Blood and Tissue Kit (Qiagen, Hilden, Germany) following the manufacturer’s protocol. Purified DNA quality and concentration were assessed with a NanoDrop ND-1000 (Thermo Fisher Scientific, Waltham, MA, USA) and by Quant-iT™ PicoGreen® dsDNA Assay Kit (Thermo Fisher Scientific, Waltham, MA, USA) prior to bisulfite conversion. Purified genomic DNA was bisulfite converted using the EZ DNA Methylation Kit (Zymo Research, Orange, CA, USA) following the manufacturer’s protocol, and bisulfite-converted DNA quality and concentration were assessed. Following the Illumina 450K array protocol, the bisulfite-converted sample was whole-genome amplified, enzymatically digested, and hybridized to the array, and then single nucleotide extension was performed.

Chips were scanned using an Illumina HiScan on a two-color channel to detect Cy3-labeled probes on the green channel and Cy5-labeled probes on the red channel. Illumina GenomeStudio Software 2011.1 Methylation Module 1.8 was used to read the array output and conduct background normalization. The level of DNAm for 428,216 probes in our sample dataset was intersected with the expanded annotation for further analyses. All samples were run together to eliminate the batch effect according to the pipelines established by Illumina GenomeStudio. The full dataset can be accessed in GEO under GSE50874.

CHAPTER 4: Conclusion

4.1. Summary of results and discussion

The work described in this dissertation aimed to understand genetic and epigenetic factors related to CKD and CKD development. Our eQTL analysis opens the door to considerable future research in gene regulation in the human kidney. While a human-body eQTL map has been generated through GTEx, a kidney-specific eQTL map was not available through the public database because organ allocation for transplantation limits the availability of kidney tissue. In addition, published results showed that eQTL analysis based on imputed gene expression for unavailable tissue types such as the human kidney does not have enough power to detect tissue-specific eQTLs [305]. For these reasons, our kidney eQTL analysis using RNA sequencing-based gene expression in human kidney samples is an important resource for the research community.

The eQTL results showed that kidney eSNPs are localized to kidney-specific regulatory regions: promoters, enhancers and TF binding sites, which suggested regulatory functions for these eSNPs. We used the kidney eSNPs to interpret CKD-associated SNPs using both direct overlap and Bayesian colocalization methods. We identified a short list of genes that are likely responsible for the association observed between genotype and kidney disease traits in large-scale GWAS. Among the 9 identified genes, a few target genes have presented with kidney phenotypes in knockout transgenic mice, including Cystatin 9 (CST9) and PGAP3.

These examples indicate that the colocalization method can identify kidney-relevant target genes. We validated one of the target genes, MANBA, as a likely causal gene for genetic variants on chromosome 4 that are associated with CKD. We found that the risk allele is associated with lower MANBA gene expression level in human kidney samples.

From zebrafish experiments, we observed that lower MANBA expression is causally related to kidney disease development. Morpholino knockdown of Manba resulted in pericardial edema, indicating a functional role of MANBA in kidney disease development. MANBA is a lysosomal enzyme that participates in the post-translational modification of proteins by hydrolysis of the beta-D-mannose residues. Manba knockout mice exhibited neurological defects and presented with cellular deposits in the kidney.

Lysosomal dysfunction in CKD development has been observed through several loss-of-function mutations in lysosomal enzymes that cause genetic diseases with altered kidney function and ultimately lead to CKD. For example, accumulation of toxic substances has been observed in Fabry’s disease and cystinosis, which likely results in tubule cell death and dysfunction, further fibroblast activation, and structural defects. Further studies will be needed to define the role of Manba in kidney tubule cells.

As GWAS explain little of the heritability of CKD, another possible mediator could be epigenetics. Little was known about the epigenome of normal and diseased human kidneys when we performed our cytosine methylation experiments and analysis in 2012. We were the first researchers to describe genome-wide cytosine methylation differences in tubule cells obtained from patients with CKD.

We found that while DMRs associated with CKD do not have the scale of difference observed in cancer studies, the differences are very stable and consistent in patient samples across different sample cohorts and different assay platforms. To our surprise, we found that CKD DMRs are depleted from promoter regions and enriched in enhancer regions. These enhancer regions are enriched with consensus-binding motifs for key renal transcription factors (HNF, TCFAP, SIX2). We observed consistent methylation and gene expression correlation in different cohorts as well.

We illustrated our findings in a model hypothesizing that enhancer DMRs modify TF binding and thereby modify downstream transcript levels. Based on our results, we propose that cytosine methylation changes are causally linked to transcript levels and phenotype development.

4.2. Future direction

Our present study mostly focused on functional annotation of CKD GWAS variants, but kidney-specific eQTL maps would be useful to the understanding of other phenotypes that are associated with the kidney, including metabolite levels and hypertension. Our work illustrates the critical role of the kidney in xenobiotic metabolism as one of the key pathways identified by the eQTL analysis. The genetic regulation of drug metabolism-associated genes is likely important to understand drug effectiveness, toxicity and metabolism. Further experiments are needed to validate these connections.

One remaining question for eQTL studies is how to identify the causal variant among a cluster of variants that all show similarly significant association with the target gene. To perform experimental validation and to understand how many variants at each locus function together, high-throughput methods such as the massively parallel reporter assay (MPRA) [342], or a modified version that shows improved throughput and sensitivity [343], can be used for systematic examination. For in vivo analysis, expression of the identified regulatory regions in zebrafish can elucidate the cell-type specificity of the regions. This can be performed by attaching the regulatory regions to a green fluorescent protein construct, and then using a transposon system to integrate the sequence into the zebrafish genome.

There remain some limitations to our work, most importantly sample size. While 100 samples obtained from subjects with a single genetic background can identify genotype-driven gene-expression changes, they have limited power to identify smaller gene-expression changes. Furthermore, the genotyping and RNA-seq data were obtained from the TCGA database with limited clinical characterization of kidney function and structural information. Greater sample size as well as clear clinical parameters will increase the power to detect kidney-specific gene expression changes associated with variants.

For the cytosine methylation study, we also had the limitation that our sample size was small and the samples were collected in a single center. In addition, our analysis was performed using an array-based method; to refine the DMR locations, bisulfite sequencing could be applied to the samples, which would increase the resolution (location and methylation level).

With current genotyping results available for some of the human kidney samples, it is also possible to perform methylation QTL analysis. With the advance of experimental tools, site-specific manipulation using the CRISPR/Cas9 system will allow us to modify DNA methylation and differentiate the functional consequences [344].

For both studies, we encountered issues arising from cell type heterogeneity. Although the majority of the cell types in the cortex are proximal, distal and convoluted tubules, there are other cell types in the cortex, including mesangial cells, podocytes, and endothelial cells. Ideally, we would have performed microdissection to separate the tubule and glomerular compartments, but this was not feasible with the publicly available dataset that we used to perform the eQTL analysis. Future studies using single-cell resolution transcriptome analysis in the kidney will be essential to define the transcriptome signature of different cell types and the expression of cell-type-specific genes.

In summary, an integrative analysis performed using clinical parameters, genotype, gene expression and methylation datasets may advance the identification of causal genomic loci that regulate gene expression and cell phenotypes involved in the development of CKD.

APPENDIX

Appendix 1 | List of CKD associated SNPs. List of SNPs curated manually from all kidney- function-associated GWAS. Gene SNP Chr Pos (b37) Freq Effect Size 95%CI/S p- Referenc Associatio Symbols E value e n parameter SYPL2 rs1213606 1 110014170 0.3 -0.0045 0.0008 4.71E- Pattaro eGFRcrea 3 08 2015 SDCCAG8 rs2802729 1 243501763 0.43 -0.0046 0.0008 2.20E- Pattaro eGFRcrea 08 2015 SARS, CELSR2, rs1933182 1 109999588 0.3 -0.008 0.001 1.30E- Köttgen eGFRcrea PSRC1, 08 2010 MYBPHL, SORT1, PSMA5, SYPL2, ATXN7L2, CYB561D1, AMIGO1, GPR61, GNAI3, AMPD2, GSTM2, GSTM4, GSTM1, GNAT2 VAV3 rs1701960 1 108188858 0.19 1.17 1.11-1.24 6.80E- Kiryluk IgA-NP 2 09 2014 CFHR1,3 rs6677604 1 196686918 0.073 0.68 2.96E- Gharavi IgA-NP 10 2011 CACNA1S rs3850625 1 201016296 0.12 0.0083 0.0013 6.82E- Pattaro eGFRcrea 11 2015 MTX1-GBA rs2049805 1 153461604 0.17 0.0072 0.001 1.80E- Okada BUN 12 2012 ANXA9/LASS2 rs267734 1 150951477 0.2 0.01 0.002 1.20E- Köttgen eGFRcrea 12 2010 DNAJC16 rs1212407 1 15869899 0.3 -0.0097 0.0011 1.50E- Pattaro eGFRcrea 8 17 2012 SOX11 rs1686417 2 5907880 0.05 -0.011 0.003 4.50E- Köttgen CKD 0 08 2010 DDX1 rs6431731 2 15863002 0.06 0.0127 0.0023 4.30E- Pattaro eGFRcrea 08 2012 CDCA7 rs4972593 2 174462854 0.14 1.81 1.47-2.24 3.85E- Sandholm T1DK- 08 2013 FemaleOnly LRP2 rs4667594 2 170008506 0.47 0.0044 0.0008 3.52E- Pattaro eGFRcrea 08 2015 AFF3, TSC22D4, rs7583877 2 100460654 0.289 1.29 1.18-1.40 1.20E- Sandholm T1-ESRD NYAP1 08 2012 CPS1 rs7422339 2 211540507 0.32 1.11 1.07-1.15 7.54E- Pattaro CKD 09 2015 RND3 rs7560163 2 151637936 0.14 0.75 0.68–0.87 7.00E- Palmer T2-ESRD 09 2012 PAX8 rs1112317 2 113695411 0.35 0.0051 0.0008 3.30E- Okada BUN 0 10 2012 IGFBP5 rs2712184 2 217682779 0.42 0.0053 0.0008 1.33E- Pattaro eGFRcrea 10 2015 ALMS1/NAT8 rs13538 2 73868328 0.23 0.009 0.002 4.50E- Köttgen eGFRcrea 14 2010 GCKR rs1260326 2 27730940 0.41 0.009 0.001 3.00E- Köttgen eGFRcrea 14 2010 PLA2R1 rs4664308 2 160917497 0.43 0.43859649 0.38-0.51 8.60E- Stanescu IMN 1 29 2011 
WNT7A rs6795744 3 13906850 0.15 0.006 0.0011 3.33E- Pattaro eGFRcrea 08 2015 SKIL rs9682041 3 170091902 0.13 0.0068 0.0012 2.58E- Pattaro eGFRcrea 08 2015 ETV5 rs1051380 3 185822353 0.13 -0.0072 0.0012 1.03E- Pattaro eGFRcrea 1 09 2015 ST6GAL1 rs7634389 3 186738421 0.44 1.13 1.09-1.18 7.27E- Li 2015 IgA-NP 10

TFDP2 rs347685 3 141807137 0.28 0.009 0.001 3.00E- Köttgen eGFRcrea 11 2010 MECOM rs1685372 3 170633326 0.29 0.0064 0.0008 2.70E- Okada BUN 2 14 2012 LRIG1-KBTBD8 rs1306900 3 66881640 0.17 0.0093 0.001 1.40E- Okada BUN 0 19 2012 BCL6-LPP rs1093732 3 189196412 0.69 0.01 0.0009 8.80E- Okada BUN 9 30 2012 FAM47E, rs1314635 4 77412140 0.21 -0.0117 0.0013 6.60E- Okada201 eGFRcrea STBD1, 5 11 2 CCDC158, SHROOM3 NFKB1 rs228611 4 103561709 0.47 -0.0056 0.0008 3.58E- Pattaro eGFRcrea 12 2015 SHROOM3 rs1731972 4 77368847 0.44 -0.012 0.002 1.20E- Köttgen eGFRcrea 1 12 2009 RGS14 rs1265481 5 176726797 0.36 -0.0056 0.001 2.90E- Okada eGFRcrea 2 08 2012 DAB2 rs1195992 5 39397132 0.44 -0.009 0.001 1.40E- Köttgen eGFRcrea 8 17 2010 ZNF204 rs7759001 6 27341409 0.24 0.0051 0.0009 1.75E- Pattaro eGFRcrea 08 2015 HLA-DPB2 rs1883414 6 33086448 0.24 0.78 4.84E- Gharavi IgA-NP 09 2011 MHC region rs3828890 6 31440669 0.11 -0.0079 0.0013 1.20E- Okada eGFRcrea 09 2012 HLA region rs3115573 6 32218843 0.45 1.62 1.39-1.90 1.00E- Feehally IgA-NP 09 2010 HLA-A rs2523946 6 29941943 0.5 0.82644628 0.78-0.87 1.74E- Yu 2011 IgA-NP 1 11 RSPO3 rs1936800 6 127477757 0.5 0.0052 0.0008 1.20E- Okada BUN 11 2012 SLC22A2 rs2279463 6 160668389 0.12 -0.013 0.002 5.50E- Köttgen eGFRcrea 12 2010 TAP1/2/PSMB8/ rs9357155 6 32809848 0.2 0.71 2.11E- Gharavi IgA-NP 9 12 2011 HLA-DQA/B rs1794275 6 32671248 0.15 1.3 1.21-1.39 3.43E- Yu 2011 IgA-NP 13 VEGFA rs881858 6 43806609 0.28 0.011 0.002 9.40E- Köttgen eGFRcrea 14 2010 HLA-DQB1 rs9275424 6 32702799 NA NA NA 4.00E- Kiryluk IgA-NP 14 2012 HLA-DQA1 rs1129740 6 32609105 0.49 2.11 2.14E- Gbadegesi SSNS 19 n 2015 HLA-DRB1 rs660895 6 32577380 0.24 1.34 1.26-1.42 4.13E- Yu 2011 IgA-NP 20 HLA-DQB1 rs9275596 6 32681631 0.22 0.63 1.59E- Gharavi IgA-NP 26 2011 HLA-DR–HLA- rs7763262 6 32424882 0.31 0.70921985 0.67-0.75 1.80E- Kiryluk IgA-NP DQ 8 38 2014 HLA-DQA1 rs2187668 6 32605884 0.13 4.32 3.73-5.01 8.00E- Stanescu IMN 93 2011 KBTBD2 rs3750082 7 32919927 0.33 0.0045 0.0008 
3.22E- Pattaro eGFRcrea 08 2015 UNCX rs1027504 7 1240371 0.34 0.0069 0.0012 4.30E- Okada BUN 4 09 2012 TMEM60 rs6465825 7 77416439 0.39 -0.008 0.001 1.50E- Köttgen eGFRcrea 09 2010 RNF32 rs6459680 7 156,258,56 0.26 0.0055 0.0009 1.07E- Pattaro eGFRcrea 8 09 2015 UNCX rs1027711 7 1285195 0.35 -0.0058 0.0009 1.00E- Okada eGFRcrea 5 10 2012 PRKAG2 rs7805747 7 151.407.80 0.24 1.19 1.13-1.25 4.20E- Köttgen CKD 1 12 2010 STC1 rs1010941 8 23751151 0.42 -0.008 0.001 1.00E- Köttgen eGFRcrea 4 08 2010 KLF10/ODF1 rs2033562 8 103547739 0.48 0.88495575 0.85-0.92 1.41E- Li 2015 IgA-NP 2 09 DEFA rs1008656 8 6900336 0.33 1.16 1.11-1.22 1.00E- Kiryluk IgA-NP 8 09 2014 DEFA rs9314614 8 6697731 0.36 1.14 1.09-1.18 9.48E- Li 2015 IgA-NP 10 DEFA rs2738048 8 6822785 0.32 0.79 0.74-0.84 3.80E- Yu 2011 IgA-NP 14 90

DEFA rs1271664 8 6898998 0.257 0.81300813 0.77-0.85 1.13E- Li 2015 IgA-NP 1 18 CARD9 rs4077515 9 139266496 0.4 1.16 1.11-1.22 1.20E- Kiryluk IgA-NP 09 2014 PIP5K1B rs4744712 9 71434707 0.39 -0.008 0.001 8.30E- Köttgen eGFRcrea 14 2010 WDR37 rs1079472 10 1156165 0.08 -0.014 0.002 1.20E- Köttgen eGFRcrea 0 08 2010 CUBN rs1801239 10 16919052 0.098 0.0835 1.10E- Böger CUBN 5 11 2011 A1CF rs1099486 10 52645424 0.19 0.0077 0.0011 1.07E- Pattaro eGFRcrea 0 12 2015 ACCS rs2074038 11 44087989 0.29 1.14 1.09-1.19 3.93E- Li 2015 IgA-NP 09 KCNQ1 rs163160 11 2789955 0.18 -0.0065 0.0011 2.26E- Pattaro eGFRcrea 09 2015 AP5B1 rs4014195 11 65506822 0.36 -0.0055 0.0008 1.10E- Pattaro eGFRcrea 11 2015 MPPED2-DCDC5 rs1076787 11 30725254 0.69 0.0068 0.0008 4.50E- Okada BUN 3 16 2012 MPPED2 rs3925584 11 30760335 0.46 0.0075 0.0009 8.40E- Pattaro eGFRcrea 18 2012 INHBC rs1106766 12 57809456 0.22 0.0061 0.001 2.41E- Pattaro eGFRcrea 09 2015 C12orf51 rs2074356 12 111129784 0.23 0.0064 0.0011 1.80E- Okada BUN 09 2012 SLC6A13 rs1077402 12 349298 0.36 0.008 0.001 1.40E- Köttgen eGFRcrea 1 09 2010 CUX2, rs653178 12 112007756 0.5 0.003 0.001 3.50E- Köttgen eGFRcys FAM109A, 11 2010 SH2B3, ATXN2, BRAP, ACAD10, ALDH2 PTPRO rs7956634 12 15321194 0.19 0.0068 0.001 7.17E- Pattaro eGFRcrea 12 2015 TSPAN9 rs1049196 12 3368093 0.1 -0.0095 0.0013 5.18E- Pattaro eGFRcrea 7 14 2015 DACH1 rs626277 13 72347696 0.4 0.009 0.001 2.50E- Köttgen eGFRcrea 11 2010 INO80 rs2928148 15 41401550 0.48 -0.0051 0.0009 4.00E- Pattaro eGFRcrea 08 2012 no gene in <250 rs1243785 15 94141833 0.038 1.8 1.48-2.17 2.00E- Sandholm T1-ESRD kb distance 4 09 2012 WDR72 rs491567 15 53946593 0.22 0.009 0.002 2.70E- Köttgen eGFRcrea 13 2010 UBE2Q2 rs1394125 15 76158983 0.35 -0.009 0.001 3.30E- Köttgen eGFRcrea 17 2010 GATM/SPATA5L rs2467853 15 45698793 0.62 0.0126 0.0009 1.05E- Pattaro CKD 1 42 2015 DPEP1 rs164748 16 89708292 0.47 -0.0046 0.0008 1.95E- Pattaro eGFRcrea 08 2015 PDILT rs1186490 16 20400839 0.81 -0.0062 0.0011 3.60E- Okada 
eGFRcrea 9 10 2012 ITGAM-ITGAX rs1115061 16 31357760 0.36 1.18 1.12-1.24 1.30E- Kiryluk IgA-NP 2 11 2014 UMOD rs1291770 16 20367690 0.18 0.8 0.75-0.85 2.30E- Köttgen CKD 7 12 2009 ITGAM-ITGAX rs1157463 16 31368874 0.18 0.75757575 0.70-0.82 8.10E- Kiryluk IgA-NP 7 8 13 2014 SLC47A1 rs2453580 17 19438321 0.41 -0.0059 0.001 2.10E- Pattaro eGFRcrea 09 2012 BCAS3 rs1186844 17 56594003 0.75 0.0059 0.001 2.10E- Okada BUN 1 09 2012 MPDU1 rs4227 17 7491177 0.24 1.23 1.16-1.32 4.31E- Yu 2011 IgA-NP 10 TNFSF13 rs3803800 17 7462969 0.35 1.21 1.14-1.28 9.40E- Yu 2011 IgA-NP 11 CDK12 rs1107890 17 37631924 0.24 0.0096 0.0013 9.00E- Pattaro eGFRcrea 3 13 2012 BCAS3 rs9895661 17 59456589 0.19 -0.011 0.002 1.10E- Köttgen eGFRcrea 15 2010 NFATC1 rs8091180 18 77164243 0.44 0.006 0.001 1.28E- Pattaro eGFRcrea 09 2015 91

SLC14A2 rs7227483 18 41441128 0.8 0.0083 0.001 6.70E- Okada BUN 18 2012 SIPA1L3 rs1166649 19 38464262 0.18 -0.0058 0.0011 4.25E- Pattaro eGFRcrea 7 08 2015 SLC7A9 rs1246087 19 33356891 0.39 0.008 0.001 3.20E- Köttgen eGFRcrea 6 15 2010 MKKS, SLX4IP, rs6040055 20 10633313 0.39 -0.017 0 1.00E- Köttgen eGFRcrea JAG1 08 2009 GNAS rs6026584 20 56902468 0.32 0.0052 0.0009 8.80E- Okada BUN 09 2012 TP53INP2 rs6088580 20 33285053 0.47 -0.0049 0.0008 1.79E- Pattaro eGFRcrea 09 2015 BCAS1 rs1721670 20 52732362 0.21 0.0077 0.001 8.83E- Pattaro eGFRcrea 7 15 2015 NAPB, CSTL1, rs1303830 20 23610262 0.21 0.076 0.004 2.20E- Köttgen eGFRcys CST11, CST8, 5 88 2009 CST9L, CST9, CST3, CST4, CST1, CST2, CST5 APOL1 rs4821475 22 36669095 0.79 NA NA 3.51E- Genovese FSGS 08 2010 MYH9 rs2157256 22 36311616 0.91 NA NA 1.36E- Genovese FSGS 09 2010 APOL2 rs4821469 22 36220399 0.8 NA NA 9.31E- Genovese FSGS 10 2010 MYH9 rs1699667 22 36329925 0.56 NA NA 2.56E- Genovese FSGS 2 10 2010 MTMR3 rs12537 22 30423460 0.16 0.78 0.72-0.84 1.17E- Yu 2011 IgA-NP 11 APOL1 rs2239785 22 36265284 0.94 NA NA 4.59E- Genovese FSGS 13 2010

This table is restricted to single nucleotide polymorphisms (SNPs) that reached genome-wide significance (p<5E-08). Associated loci are reported from the study in which they were identified with the lowest p-value. The same locus is only reported from a subsequent study if the index SNP in the later study is in low LD with the previously reported index SNP (r2<0.2 in the corresponding 1000 Genomes population). If SNPs within an associated locus showed independent evidence for association by conditional analyses, all independent index SNPs within the locus are reported. Gene Symbols, official gene symbol; SNP, SNP ID; Chr, chromosome number; Pos (b37), chromosome position; Freq, for case-control studies, the minor allele frequency in controls is presented. If available, the combined frequency (discovery+replication) is used; otherwise, the frequency in the discovery group is reported; Effect Size, the effect estimate is reported with respect to the minor allele. Combined effect estimates (beta/OR, SE/95% CI, p-value) from discovery + replication were used when available; otherwise, effect estimates for the discovery sample are shown; 95%CI, 95% confidence interval; p-value, raw p-value; Reference, GWAS publication in which the association was reported (all papers are cited in Chapter 2); Association parameter, trait or disease for which the GWAS association was reported.

Please see supplementary digital files for appendix 2-6.

Appendix 2 | All significant eGenes

Appendix 3 | Differentially methylated loci and the corresponding differentially expressed transcripts

Appendix 4 | Infinium DKD dataset and HELP CKD dataset differentially methylated loci

Appendix 5 | Technical validation of methylation level using MassArray Epityper

Appendix 6 | RefSeq annotated genomic distribution of the hypo- and hypermethylated regions

Appendix 7 | List of differentially methylated regions with p-value, methylation level, genomic locus and nearest annotated transcript

BIBLIOGRAPHY

1. Aiden, A.P. et al. (2010) Wilms tumor chromatin profiles highlight stem cell properties and a renal developmental network. Cell Stem Cell 6 (6), 591-602. 2. US-Renal-Data-System (2015) 2015 annual data report: atlas of chronic kidney disease and end-stage renal disease in the United States. Bethesda: National Institutes of Health, National Institute of Diabetes and Digestive and Kidney Diseases. 3. Hill, N.R. et al. (2016) Global Prevalence of Chronic Kidney Disease - A Systematic Review and Meta-Analysis. PLoS One 11 (7), e0158765. 4. Go, A.S. et al. (2004) Chronic kidney disease and the risks of death, cardiovascular events, and hospitalization. N Engl J Med 351 (13), 1296-305. 5. Stein, J.H. and Fadem, S.Z. (1978) The renal circulation. JAMA 239 (13), 1308-12. 6. Eckardt, K.U. et al. (2013) Evolving importance of kidney disease: from subspecialty to global health burden. Lancet 382 (9887), 158-69. 7. Hildebrandt, F. (2010) Genetic kidney diseases. Lancet 375 (9722), 1287-95. 8. Fox, C.S. et al. (2004) Genomewide linkage analysis to serum creatinine, GFR, and creatinine clearance in a community-based population: the Framingham Heart Study. J Am Soc Nephrol 15 (9), 2457-61. 9. Bochud, M. et al. (2005) Heritability of renal function in hypertensive families of African descent in the Seychelles (Indian Ocean). Kidney Int 67 (1), 61-9. 10. Langefeld, C.D. et al. (2004) Heritability of GFR and albuminuria in Caucasians with type 2 diabetes mellitus. Am J Kidney Dis 43 (5), 796-800. 11. Lei, H.H. et al. (1998) Familial aggregation of renal disease in a population-based case- control study. J Am Soc Nephrol 9 (7), 1270-6. 12. Seaquist, E.R. et al. (1989) Familial clustering of diabetic kidney disease. Evidence for genetic susceptibility to diabetic nephropathy. N Engl J Med 320 (18), 1161-5. 13. Hunt, S.C. et al. (2004) Linkage of serum creatinine and glomerular filtration rate to chromosome 2 in Utah pedigrees. Am J Hypertens 17 (6), 511-5. 14. Spray, B.J. et al. 
(1995) Familial risk, age at onset, and cause of end-stage renal disease in white Americans. J Am Soc Nephrol 5 (10), 1806-10. 15. Freedman, B.I. et al. (1993) The familial risk of end-stage renal disease in African Americans. Am J Kidney Dis 21 (4), 387-93. 16. Reeders, S.T. et al. (1985) A highly polymorphic DNA marker linked to adult polycystic kidney disease on . Nature 317 (6037), 542-4.

17. Atkin, C.L. et al. (1988) Mapping of Alport syndrome to the long arm of the X chromosome. Am J Hum Genet 42 (2), 249-55. 18. Imperatore, G. et al. (1998) Sib-pair linkage analysis for susceptibility genes for microvascular complications among Pima Indians with type 2 diabetes. Pima Diabetes Genes Group. Diabetes 47 (5), 821-30. 19. Paterson, A.D. et al. (2007) Genome-wide linkage scan of a large family with IgA nephropathy localizes a novel susceptibility locus to chromosome 2q36. J Am Soc Nephrol 18 (8), 2408-15. 20. Gharavi, A.G. et al. (2000) IgA nephropathy, the most common cause of glomerulonephritis, is linked to 6q22-23. Nat Genet 26 (3), 354-7. 21. Fox, C.S. et al. (2005) Genome-wide linkage analysis to urinary microalbuminuria in a community-based sample: the Framingham Heart Study. Kidney Int 67 (1), 70-4. 22. Krolewski, A.S. et al. (2006) A genome-wide linkage scan for genes controlling variation in urinary albumin excretion in type II diabetes. Kidney Int 69 (1), 129-36. 23. Freedman, B.I. et al. (2003) A genome-wide scan for urinary albumin excretion in hypertensive families. Hypertension 42 (3), 291-6. 24. Collins, F.S. et al. (1997) Variations on a theme: cataloging human DNA sequence variation. Science 278 (5343), 1580-1. 25. Risch, N. and Merikangas, K. (1996) The future of genetic studies of complex human diseases. Science 273 (5281), 1516-7. 26. Lander, E.S. (1996) The new genomics: global views of biology. Science 274 (5287), 536-9. 27. Reich, D.E. and Lander, E.S. (2001) On the allelic spectrum of human disease. Trends Genet 17 (9), 502-10. 28. Venter, J.C. et al. (2001) The sequence of the human genome. Science 291 (5507), 1304-51. 29. Lander, E.S. et al. (2001) Initial sequencing and analysis of the human genome. Nature 409 (6822), 860-921. 30. Sherry, S.T. et al. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29 (1), 308-11. 31. International HapMap, C. et al. 
(2010) Integrating common and rare genetic variation in diverse human populations. Nature 467 (7311), 52-8. 32. International HapMap, C. (2003) The International HapMap Project. Nature 426 (6968), 789-96. 33. Spencer, C.C. et al. (2009) Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet 5 (5), e1000477. 34. Hindorff, L.A. et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 106 (23), 9362-7.

35. Kottgen, A. et al. (2009) Multiple loci associated with indices of renal function and chronic kidney disease. Nat Genet 41 (6), 712-7. 36. Kottgen, A. et al. (2010) New loci associated with kidney function and chronic kidney disease. Nat Genet 42 (5), 376-84. 37. O'Seaghdha, C.M. and Fox, C.S. (2012) Genome-wide association studies of chronic kidney disease: what have we learned? Nat Rev Nephrol 8 (2), 89-99. 38. Pattaro, C. et al. (2016) Genetic associations at 53 loci highlight cell types and biological pathways relevant for kidney function. Nat Commun 7, 10023. 39. Wright, F.A. et al. (2014) Heritability and genomics of gene expression in peripheral blood. Nat Genet 46 (5), 430-7. 40. Dixon, A.L. et al. (2007) A genome-wide association study of global gene expression. Nat Genet 39 (10), 1202-7. 41. Stranger, B.E. et al. (2007) Population genomics of human gene expression. Nat Genet 39 (10), 1217-24. 42. Grundberg, E. et al. (2012) Mapping cis- and trans-regulatory effects across multiple tissues in twins. Nat Genet 44 (10), 1084-9. 43. Petronis, A. (2010) Epigenetics as a unifying principle in the aetiology of complex traits and diseases. Nature 465 (7299), 721-7. 44. Mendel, G.J. (1866) Experiments Concerning Plant Hybrids (Versuche über Pflanzen-Hybriden). Proceedings of the Natural History Society of Brünn (Verhandlungen des naturforschenden Vereines in Brünn) IV, 3–47. 45. Morgan, T.H. (1911) Random Segregation Versus Coupling in Mendelian Inheritance. Science 34 (873), 384. 46. Sturtevant, A.H. (1913) A Third Group of Linked Genes in Drosophila Ampelophila. Science 37 (965), 990-2. 47. Robbins, R.B. (1918) Some Applications of Mathematics to Breeding Problems III. Genetics 3 (4), 375-89. 48. Mohr, J. (1951) A search for linkage between the Lutheran blood group and other hereditary characters. Acta Pathol Microbiol Scand 28 (2), 207-10. 49. Botstein, D. et al.
(1980) Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet 32 (3), 314-31. 50. Donis-Keller, H. et al. (1987) A genetic linkage map of the human genome. Cell 51 (2), 319-37. 51. Gusella, J.F. et al. (1983) A polymorphic DNA marker genetically linked to Huntington's disease. Nature 306 (5940), 234-8.

52. Scambler, P.J. et al. (1985) Linkage of COL1A2 collagen gene to cystic fibrosis, and its clinical implications. Lancet 2 (8466), 1241-2. 53. Wainwright, B.J. et al. (1985) Localization of cystic fibrosis locus to human chromosome 7cen-q22. Nature 318 (6044), 384-5. 54. Tsui, L.C. et al. (1985) Cystic fibrosis locus defined by a genetically linked polymorphic DNA marker. Science 230 (4729), 1054-7. 55. Lee, E.Y. and Lee, W.H. (1986) Molecular cloning of the human esterase D gene, a genetic marker of retinoblastoma. Proc Natl Acad Sci U S A 83 (17), 6337-41. 56. Friend, S.H. et al. (1987) Deletions of a DNA sequence in retinoblastomas and mesenchymal tumors: organization of the sequence and its encoded protein. Proc Natl Acad Sci U S A 84 (24), 9059-63. 57. 1000-Genomes-Project (2016). 58. Davey, J.W. et al. (2011) Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat Rev Genet 12 (7), 499-510. 59. International HapMap, C. (2005) A haplotype map of the human genome. Nature 437 (7063), 1299-320. 60. Servin, B. and Stephens, M. (2007) Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet 3 (7), e114. 61. McCarthy, S. et al. (2016) A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet 48 (10), 1279-83. 62. Howie, B.N. et al. (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 5 (6), e1000529. 63. Hinds, D.A. et al. (2005) Whole-genome patterns of common DNA variation in three human populations. Science 307 (5712), 1072-9. 64. de Bakker, P.I. et al. (2005) Efficiency and power in genetic association studies. Nat Genet 37 (11), 1217-23. 65. Browning, B.L. and Browning, S.R. (2009) A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet 84 (2), 210-23. 66. Pe'er, I. et al.
(2008) Estimation of the multiple testing burden for genomewide association studies of nearly all common variants. Genet Epidemiol 32 (4), 381-5. 67. Hoggart, C.J. et al. (2008) Genome-wide significance for dense SNP and resequencing data. Genet Epidemiol 32 (2), 179-85. 68. Inker, L.A. et al. (2014) KDOQI US commentary on the 2012 KDIGO clinical practice guideline for the evaluation and management of CKD. Am J Kidney Dis 63 (5), 713-35.

69. Pattaro, C. et al. (2016) Genetic associations at 53 loci highlight cell types and biological pathways relevant for kidney function. Nat Commun 7, 10023. 70. Teumer, A. et al. (2016) Genome-wide Association Studies Identify Genetic Loci Associated With Albuminuria in Diabetes. Diabetes 65 (3), 803-17. 71. Boger, C.A. et al. (2011) CUBN is a gene locus for albuminuria. J Am Soc Nephrol 22 (3), 555-70. 72. Ellis, J.W. et al. (2012) Validated SNPs for eGFR and their associations with albuminuria. Hum Mol Genet 21 (14), 3293-8. 73. Foster, M.C. et al. (2007) Cross-classification of microalbuminuria and reduced glomerular filtration rate: associations between cardiovascular disease risk factors and clinical outcomes. Arch Intern Med 167 (13), 1386-92. 74. Pezzolesi, M.G. et al. (2009) Genome-wide association scan for diabetic nephropathy susceptibility genes in type 1 diabetes. Diabetes 58 (6), 1403-10. 75. Schelling, J.R. et al. (2008) Genome-wide scan for estimated glomerular filtration rate in multi-ethnic diabetic populations: the Family Investigation of Nephropathy and Diabetes (FIND). Diabetes 57 (1), 235-43. 76. Iyengar, S.K. et al. (2015) Genome-Wide Association and Trans-ethnic Meta-Analysis for Advanced Diabetic Kidney Disease: Family Investigation of Nephropathy and Diabetes (FIND). PLoS Genet 11 (8), e1005352. 77. Mooyaart, A.L. et al. (2011) Genetic associations in diabetic nephropathy: a meta-analysis. Diabetologia 54 (3), 544-53. 78. Palmer, N.D. et al. (2012) A genome-wide association search for type 2 diabetes genes in African Americans. PLoS One 7 (1), e29202. 79. Sandholm, N. et al. (2013) Chromosome 2q31.1 associates with ESRD in women with type 1 diabetes. J Am Soc Nephrol 24 (10), 1537-43. 80. Sandholm, N. et al. (2012) New susceptibility loci associated with kidney disease in type 1 diabetes. PLoS Genet 8 (9), e1002921. 81. Bostrom, M.A. et al.
(2010) Candidate genes for non-diabetic ESRD in African Americans: a genome-wide association study using pooled DNA. Hum Genet 128 (2), 195-204. 82. Freedman, B.I. et al. (2010) The apolipoprotein L1 (APOL1) gene and nondiabetic nephropathy in African Americans. J Am Soc Nephrol 21 (9), 1422-6. 83. Genovese, G. et al. (2010) Association of trypanolytic ApoL1 variants with kidney disease in African Americans. Science 329 (5993), 841-5. 84. Tzur, S. et al. (2010) Missense mutations in the APOL1 gene are highly associated with end stage kidney disease risk previously attributed to the MYH9 gene. Hum Genet 128 (3), 345-50.

85. Wyatt, R.J. and Julian, B.A. (2013) IgA nephropathy. N Engl J Med 368 (25), 2402-14. 86. Li, M. et al. (2015) Identification of new susceptibility loci for IgA nephropathy in Han Chinese. Nat Commun 6, 7270. 87. Kiryluk, K. et al. (2014) Discovery of new risk loci for IgA nephropathy implicates genes involved in immunity against intestinal pathogens. Nat Genet 46 (11), 1187-96. 88. Brem, R.B. et al. (2002) Genetic dissection of transcriptional regulation in budding yeast. Science 296 (5568), 752-5. 89. Schadt, E.E. et al. (2003) Genetics of gene expression surveyed in maize, mouse and man. Nature 422 (6929), 297-302. 90. Morley, M. et al. (2004) Genetic analysis of genome-wide variation in human gene expression. Nature 430 (7001), 743-7. 91. Goring, H.H. et al. (2007) Discovery of expression QTLs using large-scale transcriptional profiling in human lymphocytes. Nat Genet 39 (10), 1208-16. 92. GTEx-Consortium (2015) Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348 (6235), 648-60. 93. Lappalainen, T. et al. (2013) Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501 (7468), 506-11. 94. Delaneau, O. et al. (2013) Improved whole-chromosome phasing for disease and population genetic studies. Nat Methods 10 (1), 5-6. 95. Browning, B.L. and Browning, S.R. (2016) Genotype Imputation with Millions of Reference Samples. Am J Hum Genet 98 (1), 116-26. 96. Das, S. et al. (2016) Next-generation genotype imputation service and methods. Nat Genet 48 (10), 1284-7. 97. Westra, H.J. et al. (2013) Systematic identification of trans eQTLs as putative drivers of known disease associations. Nat Genet 45 (10), 1238-43. 98. Aguet, F. et al. (2016) Local genetic effects on gene expression across 44 human tissues. bioRxiv. 99. Frayling, T.M. et al. (2007) A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. 
Science 316 (5826), 889-94. 100. Dina, C. et al. (2007) Variation in FTO contributes to childhood obesity and severe adult obesity. Nat Genet 39 (6), 724-6. 101. Claussnitzer, M. et al. (2015) FTO Obesity Variant Circuitry and Adipocyte Browning in Humans. N Engl J Med 373 (10), 895-907. 102. Smemo, S. et al. (2014) Obesity-associated variants within FTO form long-range functional connections with IRX3. Nature 507 (7492), 371-5.

103. Nicolae, D.L. et al. (2010) Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genet 6 (4), e1000888. 104. Schadt, E.E. et al. (2005) An integrative genomics approach to infer causal associations between gene expression and disease. Nat Genet 37 (7), 710-7. 105. He, X. et al. (2013) Sherlock: detecting gene-disease associations by matching patterns of expression QTL and GWAS. Am J Hum Genet 92 (5), 667-80. 106. Giambartolomei, C. et al. (2014) Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet 10 (5), e1004383. 107. Kanzaki, G. et al. (2015) Factors associated with a vicious cycle involving a low nephron number, hypertension and chronic kidney disease. Hypertens Res 38 (10), 633-41. 108. Manalich, R. et al. (2000) Relationship between weight at birth and the number and size of renal glomeruli in humans: a histomorphometric study. Kidney Int 58 (2), 770-3. 109. Jones, S.E. et al. (2006) The effect of intrauterine environment and low glomerular number on the histological changes in diabetic glomerulosclerosis. Diabetologia 49 (1), 191-9. 110. Jones, S.E. et al. (2001) Intra-uterine environment influences glomerular number and the acute renal adaptation to experimental diabetes. Diabetologia 44 (6), 721-8. 111. Hershkovitz, D. et al. (2007) Fetal programming of adult kidney disease: cellular and molecular mechanisms. Clin J Am Soc Nephrol 2 (2), 334-42. 112. Tobi, E.W. et al. (2014) DNA methylation signatures link prenatal famine exposure to growth and metabolism. Nat Commun 5, 5592. 113. Ravelli, A.C. et al. (2005) Cardiovascular disease in survivors of the Dutch famine. Nestle Nutr Workshop Ser Pediatr Program 55, 183-91; discussion 191-5. 114. Reik, W. et al. (2001) Epigenetic reprogramming in mammalian development. Science 293 (5532), 1089-93. 115. Yudkin, J.S. et al. 
(1997) Proteinuria and progressive renal disease: birth weight and microalbuminuria. Nephrol Dial Transplant 12 Suppl 2, 10-3. 116. Gillis, E.E. et al. (2015) The Dahl salt-sensitive rat is a spontaneous model of superimposed preeclampsia. Am J Physiol Regul Integr Comp Physiol 309 (1), R62-70. 117. Vom Saal, F.S. et al. (2012) The estrogenic endocrine disrupting chemical bisphenol A (BPA) and obesity. Mol Cell Endocrinol 354 (1-2), 74-84. 118. Gonzalez-Rodriguez, P. et al. (2016) Alterations in expression of imprinted genes from the H19/IGF2 loci in a multigenerational model of intrauterine growth restriction (IUGR). Am J Obstet Gynecol 214 (5), 625 e1-625 e11.

119. Nebendahl, C. et al. (2016) Early postnatal feed restriction reduces liver connective tissue levels and affects H3K9 acetylation state of regulated genes associated with protein metabolism in low birth weight pigs. J Nutr Biochem 29, 41-55. 120. Bruce, M.A. et al. (2009) Social environmental stressors, psychological factors, and kidney disease. J Investig Med 57 (4), 583-9. 121. Carone, B.R. et al. (2010) Paternally induced transgenerational environmental reprogramming of metabolic gene expression in mammals. Cell 143 (7), 1084-96. 122. Anway, M.D. et al. (2005) Epigenetic transgenerational actions of endocrine disruptors and male fertility. Science 308 (5727), 1466-9. 123. Lalande, M. (1996) Parental imprinting and human disease. Annu Rev Genet 30, 173-95. 124. Wei, Y. et al. (2015) Environmental epigenetic inheritance through gametes and implications for human reproduction. Hum Reprod Update 21 (2), 194-208. 125. Dabelea, D. and Crume, T. (2011) Maternal environment and the transgenerational cycle of obesity and diabetes. Diabetes 60 (7), 1849-55. 126. del Rosario, M.C. et al. (2014) Potential epigenetic dysregulation of genes associated with MODY and type 2 diabetes in humans exposed to a diabetic intrauterine environment: an analysis of genome-wide DNA methylation. Metabolism 63 (5), 654-60. 127. Gut, P. and Verdin, E. (2013) The nexus of chromatin regulation and intermediary metabolism. Nature 502 (7472), 489-98. 128. Lu, C. and Thompson, C.B. (2012) Metabolic regulation of epigenetics. Cell Metab 16 (1), 9-17. 129. Wellen, K.E. et al. (2009) ATP-citrate lyase links cellular metabolism to histone acetylation. Science 324 (5930), 1076-80. 130. Lee, J.V. et al. (2014) Akt-dependent metabolic reprogramming regulates tumor cell histone acetylation. Cell Metab 20 (2), 306-19. 131. Marmorstein, R. and Roth, S.Y. (2001) Histone acetyltransferases: function, structure, and catalysis. Curr Opin Genet Dev 11 (2), 155-61. 132. Roth, S.Y. et al.
(2001) Histone acetyltransferases. Annu Rev Biochem 70, 81-120. 133. Kaelin, W.G., Jr. and McKnight, S.L. (2013) Influence of metabolism on epigenetics and disease. Cell 153 (1), 56-69. 134. Strahl, B.D. and Allis, C.D. (2000) The language of covalent histone modifications. Nature 403 (6765), 41-5. 135. Li, B. et al. (2007) The role of chromatin during transcription. Cell 128 (4), 707-19.

136. Baroncelli, L. et al. (2016) Experience Affects Critical Period Plasticity in the Visual Cortex through an Epigenetic Regulation of Histone Post-Translational Modifications. J Neurosci 36 (12), 3430-40. 137. Feil, R. and Fraga, M.F. (2011) Epigenetics and the environment: emerging patterns and implications. Nat Rev Genet 13 (2), 97-109. 138. Lefevre, G.M. et al. (2010) Altering a histone H3K4 methylation pathway in glomerular podocytes promotes a chronic disease phenotype. PLoS Genet 6 (10), e1001142. 139. Hannum, G. et al. (2013) Genome-wide methylation profiles reveal quantitative views of human aging rates. Mol Cell 49 (2), 359-67. 140. Gronniger, E. et al. (2010) Aging and chronic sun exposure cause distinct epigenetic changes in human skin. PLoS Genet 6 (5), e1000971. 141. Garagnani, P. et al. (2012) Methylation of ELOVL2 gene as a new epigenetic marker of age. Aging Cell 11 (6), 1132-4. 142. Koch, C.M. and Wagner, W. (2013) Epigenetic biomarker to determine replicative senescence of cultured cells. Methods Mol Biol 1048, 309-21. 143. Bechtel, W. et al. (2010) Methylation determines fibroblast activation and fibrogenesis in the kidney. Nat Med 16 (5), 544-50. 144. Marumo, T. et al. (2008) Epigenetic regulation of BMP7 in the regenerative response to ischemia. J Am Soc Nephrol 19 (7), 1311-20. 145. Li, H.F. et al. (2010) ATF3-mediated epigenetic regulation protects against acute kidney injury. J Am Soc Nephrol 21 (6), 1003-13. 146. Pang, M. and Zhuang, S. (2010) Histone deacetylase: a potential therapeutic target for fibrotic disorders. J Pharmacol Exp Ther 335 (2), 266-72. 147. Zager, R.A. et al. (2011) Acute unilateral ischemic renal injury induces progressive renal inflammation, lipid accumulation, histone modification, and "end-stage" kidney disease. Am J Physiol Renal Physiol 301 (6), F1334-45. 148. Havasi, A. et al. (2013) Histone acetyl transferase (HAT) HBO1 and JADE1 in epithelial cell regeneration. Am J Pathol 182 (1), 152-62. 149. Marumo, T. 
et al. (2010) Histone deacetylase modulates the proinflammatory and -fibrotic changes in tubulointerstitial injury. Am J Physiol Renal Physiol 298 (1), F133-41. 150. Hsing, C.H. et al. (2012) alpha2-Adrenoceptor agonist dexmedetomidine protects septic acute kidney injury through increasing BMP-7 and inhibiting HDAC2 and HDAC5. Am J Physiol Renal Physiol 303 (10), F1443-53.

151. Yoshikawa, M. et al. (2007) Inhibition of histone deacetylase activity suppresses epithelial-to-mesenchymal transition induced by TGF-beta1 in human renal epithelial cells. J Am Soc Nephrol 18 (1), 58-65. 152. Imai, N. et al. (2007) Inhibition of histone deacetylase activates side population cells in kidney and partially reverses chronic renal injury. Stem Cells 25 (10), 2469-75. 153. Pippin, J.W. et al. (2009) Inducible rodent models of acquired podocyte diseases. Am J Physiol Renal Physiol 296 (2), F213-29. 154. Pang, M. et al. (2009) Inhibition of histone deacetylase activity attenuates renal fibroblast activation and interstitial fibrosis in obstructive nephropathy. Am J Physiol Renal Physiol 297 (4), F996-F1005. 155. Advani, A. et al. (2011) Long-term administration of the histone deacetylase inhibitor vorinostat attenuates renal injury in experimental diabetes through an endothelial nitric oxide synthase-dependent mechanism. Am J Pathol 178 (5), 2205-14. 156. Gilbert, R.E. et al. (2011) Histone deacetylase inhibition attenuates diabetes-associated kidney growth: potential role for epigenetic modification of the epidermal growth factor receptor. Kidney Int 79 (12), 1312-21. 157. Reddy, M.A. et al. (2015) Epigenetic mechanisms in diabetic complications and metabolic memory. Diabetologia 58 (3), 443-55. 158. Chen, Z. et al. (2016) Epigenomic profiling reveals an association between persistence of DNA methylation and metabolic memory in the DCCT/EDIC type 1 diabetes cohort. Proc Natl Acad Sci U S A 113 (21), E3002-11. 159. Pirola, L. et al. (2011) Genome-wide analysis distinguishes hyperglycemia regulated epigenetic signatures of primary vascular cells. Genome Res 21 (10), 1601-15. 160. Cooper, M.E. and El-Osta, A. (2010) Epigenetics: mechanisms and implications for diabetic complications. Circ Res 107 (12), 1403-13. 161. Villeneuve, L.M. et al.
(2008) Epigenetic histone H3 lysine 9 methylation in metabolic memory and inflammatory phenotype of vascular smooth muscle cells in diabetes. Proc Natl Acad Sci U S A 105 (26), 9047-52. 162. Sun, G. et al. (2010) Epigenetic histone methylation modulates fibrotic gene expression. J Am Soc Nephrol 21 (12), 2069-80. 163. Yuan, H. et al. (2013) Involvement of p300/CBP and epigenetic histone acetylation in TGF-beta1-mediated gene transcription in mesangial cells. Am J Physiol Renal Physiol 304 (5), F601-13.

164. Sapienza, C. et al. (2011) DNA methylation profiling identifies epigenetic differences between diabetes patients with ESRD and diabetes patients without nephropathy. Epigenetics 6 (1), 20-8. 165. Ciavatta, D.J. et al. (2010) Epigenetic basis for aberrant upregulation of autoantigen genes in humans with ANCA vasculitis. J Clin Invest 120 (9), 3209-19. 166. Ko, Y.A. et al. (2013) Cytosine methylation changes in enhancer regions of core pro-fibrotic genes characterize kidney fibrosis development. Genome Biol 14 (10), R108. 167. Choi, S.W. and Friso, S. (2010) Epigenetics: A New Bridge between Nutrition and Health. Adv Nutr 1 (1), 8-16. 168. Jirtle, R.L. and Skinner, M.K. (2007) Environmental epigenomics and disease susceptibility. Nat Rev Genet 8 (4), 253-62. 169. Maston, G.A. et al. (2006) Transcriptional regulatory elements in the human genome. Annu Rev Genomics Hum Genet 7, 29-59. 170. Jenuwein, T. and Allis, C.D. (2001) Translating the histone code. Science 293 (5532), 1074-80. 171. Su, X. et al. (2016) Metabolic control of methylation and acetylation. Curr Opin Chem Biol 30, 52-60. 172. Janke, R. et al. (2015) Metabolism and epigenetics. Annu Rev Cell Dev Biol 31, 473-96. 173. Parle-McDermott, A. and Ozaki, M. (2011) The impact of nutrition on differential methylated regions of the genome. Adv Nutr 2 (6), 463-71. 174. Maston, G.A. et al. (2012) Characterization of enhancer function from genome-wide analyses. Annu Rev Genomics Hum Genet 13, 29-57. 175. Thomas, J.O. and Kornberg, R.D. (1975) An octamer of histones in chromatin and free in solution. Proc Natl Acad Sci U S A 72 (7), 2626-30. 176. Segal, E. et al. (2006) A genomic code for nucleosome positioning. Nature 442 (7104), 772-8. 177. Luger, K. et al. (1997) Crystal structure of the nucleosome core particle at 2.8 A resolution. Nature 389 (6648), 251-60. 178. Hennig, W. (1999) Heterochromatin. Chromosoma 108 (1), 1-9. 179. Shilatifard, A.
(2006) Chromatin modifications by methylation and ubiquitination: implications in the regulation of gene expression. Annu Rev Biochem 75, 243-69. 180. Struhl, K. (1998) Histone acetylation and transcriptional regulatory mechanisms. Genes Dev 12 (5), 599-606. 181. Dillon, S.C. et al. (2005) The SET-domain protein superfamily: protein lysine methyltransferases. Genome Biol 6 (8), 227.

182. Dou, Y. et al. (2006) Regulation of MLL1 H3K4 methyltransferase activity by its core components. Nat Struct Mol Biol 13 (8), 713-9. 183. Hamamoto, R. et al. (2004) SMYD3 encodes a histone methyltransferase involved in the proliferation of cancer cells. Nat Cell Biol 6 (8), 731-40. 184. Tessarz, P. and Kouzarides, T. (2014) Histone core modifications regulating nucleosome structure and dynamics. Nat Rev Mol Cell Biol 15 (11), 703-8. 185. Kouzarides, T. (2007) Chromatin modifications and their function. Cell 128 (4), 693-705. 186. Wang, J. et al. (2012) Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res 22 (9), 1798-812. 187. Ernst, J. et al. (2011) Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473 (7345), 43-9. 188. Follows, G.A. et al. (2006) Identifying gene regulatory elements by genomic microarray mapping of DNaseI hypersensitive sites. Genome Res 16 (10), 1310-9. 189. Shibata, Y. and Crawford, G.E. (2009) Mapping regulatory elements by DNaseI hypersensitivity chip (DNase-Chip). Methods Mol Biol 556, 177-90. 190. Sabo, P.J. et al. (2004) Genome-wide identification of DNaseI hypersensitive sites using active chromatin sequence libraries. Proc Natl Acad Sci U S A 101 (13), 4537-42. 191. Won, K.J. et al. (2013) Comparative annotation of functional regions in the human genome using epigenomic data. Nucleic Acids Res 41 (8), 4423-32. 192. Mikkelsen, T.S. et al. (2007) Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 448 (7153), 553-60. 193. Bernstein, B.E. et al. (2006) A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell 125 (2), 315-26. 194. Azuara, V. et al. (2006) Chromatin signatures of pluripotent cell lines. Nat Cell Biol 8 (5), 532-8. 195. Weishaupt, H. et al. (2010) Epigenetic chromatin states uniquely define the developmental plasticity of murine hematopoietic stem cells. 
Blood 115 (2), 247-56. 196. Howard, L. et al. (2013) Profiling of transcriptional and epigenetic changes during directed endothelial differentiation of human embryonic stem cells identifies FOXA2 as a marker of early mesoderm commitment. Stem Cell Res Ther 4 (2), 36. 197. Cacci, E. et al. (2016) Histone methylation and microRNA-dependent regulation of epigenetic activities in neural progenitor self-renewal and differentiation. Curr Top Med Chem. 198. Rea, S. et al. (2000) Regulation of chromatin structure by site-specific histone H3 methyltransferases. Nature 406 (6796), 593-9.

199. Sawada, K. et al. (2004) Structure of the conserved core of the yeast Dot1p, a nucleosomal histone H3 lysine 79 methyltransferase. J Biol Chem 279 (41), 43296-306. 200. Min, J. et al. (2003) Structure of the catalytic domain of human DOT1L, a non-SET domain nucleosomal histone methyltransferase. Cell 112 (5), 711-23. 201. Cheng, X. et al. (2005) Structural and sequence motifs of protein (histone) methylation enzymes. Annu Rev Biophys Biomol Struct 34, 267-94. 202. Cao, R. et al. (2002) Role of histone H3 lysine 27 methylation in Polycomb-group silencing. Science 298 (5595), 1039-43. 203. Kuo, M.H. et al. (1998) Histone acetyltransferase activity of yeast Gcn5p is required for the activation of target genes in vivo. Genes Dev 12 (5), 627-39. 204. Wang, L. et al. (1998) Critical residues for histone acetylation by Gcn5, functioning in Ada and SAGA complexes, are also required for transcriptional function in vivo. Genes Dev 12 (5), 640-53. 205. Kuo, M.H. et al. (1996) Transcription-linked acetylation by Gcn5p of histones H3 and H4 at specific lysines. Nature 383 (6597), 269-72. 206. Emiliani, S. et al. (1998) Characterization of a human RPD3 ortholog, HDAC3. Proc Natl Acad Sci U S A 95 (6), 2795-800. 207. Yang, W.M. et al. (1997) Isolation and characterization of cDNAs corresponding to an additional member of the human histone deacetylase gene family. J Biol Chem 272 (44), 28001-7. 208. Taunton, J. et al. (1996) A mammalian histone deacetylase related to the yeast transcriptional regulator Rpd3p. Science 272 (5260), 408-11. 209. Pal, D. and Saha, S. (2012) Hydroxamic acid - A novel molecule for anticancer therapy. J Adv Pharm Technol Res 3 (2), 92-9. 210. Bird, A.P. (1986) CpG-rich islands and the function of DNA methylation. Nature 321 (6067), 209-13. 211. Bird, A. et al. (1985) A fraction of the mouse genome that is derived from islands of nonmethylated, CpG-rich DNA. Cell 40 (1), 91-9. 212. Lilischkis, R. et al. (2001) Methylation analysis of CpG islands.
Methods Mol Med 57, 271-83. 213. Ehrlich, M. et al. (1982) Amount and distribution of 5-methylcytosine in human DNA from different types of tissues of cells. Nucleic Acids Res 10 (8), 2709-21. 214. Irizarry, R.A. et al. (2009) The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nat Genet 41 (2), 178-86.

215. Schwartz, S. et al. (2009) Chromatin organization marks exon-intron structure. Nat Struct Mol Biol 16 (9), 990-5. 216. Chodavarapu, R.K. et al. (2010) Relationship between nucleosome positioning and DNA methylation. Nature 466 (7304), 388-92. 217. Laurent, L. et al. (2010) Dynamic changes in the human methylome during differentiation. Genome Res 20 (3), 320-31. 218. Shukla, S. et al. (2011) CTCF-promoted RNA polymerase II pausing links DNA methylation to splicing. Nature 479 (7371), 74-9. 219. Ball, M.P. et al. (2009) Targeted and genome-scale strategies reveal gene-body methylation signatures in human cells. Nat Biotechnol 27 (4), 361-8. 220. Rauch, T.A. et al. (2009) A human B cell methylome at 100-base pair resolution. Proc Natl Acad Sci U S A 106 (3), 671-8. 221. Flanagan, J.M. and Wild, L. (2007) An epigenetic role for noncoding RNAs and intragenic DNA methylation. Genome Biol 8 (6), 307. 222. Iqbal, K. et al. (2011) Reprogramming of the paternal genome upon fertilization involves genome-wide oxidation of 5-methylcytosine. Proc Natl Acad Sci U S A 108 (9), 3642-7. 223. Smith, Z.D. et al. (2012) A unique regulatory phase of DNA methylation in the early mammalian embryo. Nature 484 (7394), 339-44. 224. Mayer, W. et al. (2000) Demethylation of the zygotic paternal genome. Nature 403 (6769), 501-2. 225. Li, E. et al. (1992) Targeted mutation of the DNA methyltransferase gene results in embryonic lethality. Cell 69 (6), 915-26. 226. Okano, M. et al. (1999) DNA methyltransferases Dnmt3a and Dnmt3b are essential for de novo methylation and mammalian development. Cell 99 (3), 247-57. 227. Montano, C.M. et al. (2013) Measuring cell-type specific differential methylation in human brain tissue. Genome Biol 14 (8), R94. 228. Bloushtain-Qimron, N. et al. (2008) Cell type-specific DNA methylation patterns in the human breast. Proc Natl Acad Sci U S A 105 (37), 14076-81. 229. Novak, P. et al. 
(2012) Cell-type specific DNA methylation patterns define human breast cellular identity. PLoS One 7 (12), e52299. 230. Takeshima, H. et al. (2006) Distinct DNA methylation activity of Dnmt3a and Dnmt3b towards naked and nucleosomal DNA. J Biochem 139 (3), 503-15. 231. Tsumura, A. et al. (2006) Maintenance of self-renewal ability of mouse embryonic stem cells in the absence of DNA methyltransferases Dnmt1, Dnmt3a and Dnmt3b. Genes Cells 11 (7), 805-14.

232. Jackson, M. et al. (2004) Severe global DNA hypomethylation blocks differentiation and induces histone hyperacetylation in embryonic stem cells. Mol Cell Biol 24 (20), 8862-71. 233. Chen, T. et al. (2003) Establishment and maintenance of genomic methylation patterns in mouse embryonic stem cells by Dnmt3a and Dnmt3b. Mol Cell Biol 23 (16), 5594-605. 234. Bestor, T.H. (1992) Activation of mammalian DNA methyltransferase by cleavage of a Zn binding regulatory domain. EMBO J 11 (7), 2611-7. 235. Hansen, R.S. et al. (1999) The DNMT3B DNA methyltransferase gene is mutated in the ICF immunodeficiency syndrome. Proc Natl Acad Sci U S A 96 (25), 14412-7. 236. Ehrlich, M. et al. (2008) ICF, an immunodeficiency syndrome: DNA methyltransferase 3B involvement, chromosome anomalies, and gene dysregulation. Autoimmunity 41 (4), 253-71. 237. Yildirim, O. et al. (2011) Mbd3/NURD complex regulates expression of 5-hydroxymethylcytosine marked genes in embryonic stem cells. Cell 147 (7), 1498-510. 238. Moran-Crusio, K. et al. (2011) Tet2 loss leads to increased hematopoietic stem cell self-renewal and myeloid transformation. Cancer Cell 20 (1), 11-24. 239. Tahiliani, M. et al. (2009) Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in mammalian DNA by MLL partner TET1. Science 324 (5929), 930-5. 240. Ko, M. et al. (2010) Impaired hydroxylation of 5-methylcytosine in myeloid cancers with mutant TET2. Nature 468 (7325), 839-43. 241. Pikor, L.A. et al. (2011) DNA extraction from paraffin embedded material for genetic and epigenetic analyses. J Vis Exp (49). 242. Lin, J. et al. (2009) High-quality genomic DNA extraction from formalin-fixed and paraffin-embedded samples deparaffinized using mineral oil. Anal Biochem 395 (2), 265-7. 243. Li, Q. et al. (2014) A method to evaluate genome-wide methylation in archival formalin-fixed, paraffin-embedded ovarian epithelial cells. PLoS One 9 (8), e104481. 244. Waalwijk, C. and Flavell, R.A.
(1978) MspI, an isoschizomer of hpaII which cleaves both unmethylated and methylated hpaII sites. Nucleic Acids Res 5 (9), 3231-6. 245. Toyota, M. et al. (1999) Identification of differentially methylated sequences in colorectal cancer by methylated CpG island amplification. Cancer Res 59 (10), 2307-12. 246. Jelinek, J. et al. (2012) Conserved DNA methylation patterns in healthy blood cells and extensive changes in leukemia measured by a new quantitative technique. Epigenetics 7 (12), 1368-78. 247. Hou, P. et al. (2012) The BRAF(V600E) causes widespread alterations in gene methylation in the genome of melanoma cells. Cell Cycle 11 (2), 286-95. 248. Suzuki, M. and Greally, J.M. (2010) DNA methylation profiling using HpaII tiny fragment enrichment by ligation-mediated PCR (HELP). Methods 52 (3), 218-22.

249. Suzuki, M. et al. (2010) Optimized design and data analysis of tag-based cytosine methylation assays. Genome Biol 11 (4), R36.
250. Thompson, R.F. et al. (2008) An analytical pipeline for genomic representations used for cytosine methylation studies. Bioinformatics 24 (9), 1161-7.
251. Zou, X. et al. (2012) Recognition of methylated DNA through methyl-CpG binding domain proteins. Nucleic Acids Res 40 (6), 2747-58.
252. Weber, M. et al. (2007) Distribution, silencing potential and evolutionary impact of promoter DNA methylation in the human genome. Nat Genet 39 (4), 457-66.
253. Weber, M. et al. (2005) Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nat Genet 37 (8), 853-62.
254. Meissner, A. et al. (2008) Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature 454 (7205), 766-70.
255. Rollins, R.A. et al. (2006) Large-scale structure of genomic methylation patterns. Genome Res 16 (2), 157-63.
256. Eckhardt, F. et al. (2006) DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet 38 (12), 1378-85.
257. Frommer, M. et al. (1992) A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proc Natl Acad Sci U S A 89 (5), 1827-31.
258. Rhee, H.S. and Pugh, B.F. (2011) Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 147 (6), 1408-19.
259. Rothbart, S.B. et al. (2015) An Interactive Database for the Assessment of Histone Antibody Specificity. Mol Cell 59 (3), 502-11.
260. Helsby, M.A. et al. (2014) CiteAb: a searchable antibody database that ranks antibodies by the number of times they have been cited. BMC Cell Biol 15, 6.
261. Egelhofer, T.A. et al. (2011) An assessment of histone-modification antibody quality. Nat Struct Mol Biol 18 (1), 91-3.
262. Qu, W. et al. (2016) Assessing Cell-to-Cell DNA Methylation Variability on Individual Long Reads. Sci Rep 6, 21317.
263. Kharchenko, P.V. et al. (2008) Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol 26 (12), 1351-9.
264. Rozowsky, J. et al. (2009) PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol 27 (1), 66-75.
265. Zhang, Y. et al. (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol 9 (9), R137.


266. Landt, S.G. et al. (2012) ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res 22 (9), 1813-31.
267. Zang, C. et al. (2009) A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics 25 (15), 1952-8.
268. Guo, Y. et al. (2012) High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints. PLoS Comput Biol 8 (8), e1002638.
269. Ernst, J. and Kellis, M. (2010) Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat Biotechnol 28 (8), 817-25.
270. Hall, J.E. et al. (1977) Dissociation of renal blood flow and filtration rate autoregulation by renin depletion. Am J Physiol 232 (3), F215-21.
271. Levey, A.S. and Coresh, J. (2012) Chronic kidney disease. Lancet 379 (9811), 165-80.
272. Astor, B.C. et al. (2011) Lower estimated glomerular filtration rate and higher albuminuria are associated with mortality and end-stage renal disease. A collaborative meta-analysis of kidney disease population cohorts. Kidney Int 79 (12), 1331-40.
273. Rhee, C.M. and Kovesdy, C.P. (2015) Epidemiology: Spotlight on CKD deaths-increasing mortality worldwide. Nat Rev Nephrol 11 (4), 199-200.
274. Sud, M. et al. (2016) Progression to Stage 4 chronic kidney disease and death, acute kidney injury and hospitalization risk: a retrospective cohort study. Nephrol Dial Transplant 31 (7), 1122-30.
275. Rossignol, P. et al. (2015) The double challenge of resistant hypertension and chronic kidney disease. Lancet 386 (10003), 1588-98.
276. Thompson, S. et al. (2015) Cause of Death in Patients with Reduced Kidney Function. J Am Soc Nephrol 26 (10), 2504-11.
277. Edwards, S.L. et al. (2013) Beyond GWASs: illuminating the dark road from association to function. Am J Hum Genet 93 (5), 779-97.
278. Lee, I. et al. (2011) Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res 21 (7), 1109-21.
279. Zheng, X. et al. (2012) A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 28 (24), 3326-8.
280. 1000 Genomes Project Consortium et al. (2015) A global reference for human genetic variation. Nature 526 (7571), 68-74.
281. 1000 Genomes Project Consortium et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491 (7422), 56-65.
282. Delaneau, O. et al. (2012) A linear complexity phasing method for thousands of genomes. Nat Methods 9 (2), 179-81.

283. Shabalin, A.A. (2012) Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics 28 (10), 1353-8.
284. Ongen, H. et al. (2015) Fast and efficient QTL mapper for thousands of molecular phenotypes. bioRxiv.
285. Storey, J.D. (2002) A direct approach to false discovery rates. Journal of the Royal Statistical Society 64, 479-498.
286. Wang, J. et al. (2013) Factorbook.org: a Wiki-based database for transcription factor-binding data generated by the ENCODE consortium. Nucleic Acids Res 41 (Database issue), D171-6.
287. Gerstein, M.B. et al. (2012) Architecture of the human regulatory network derived from ENCODE data. Nature 489 (7414), 91-100.
288. Bonner, C. et al. (2015) Inhibition of the glucose transporter SGLT2 with dapagliflozin in pancreatic alpha cells triggers glucagon secretion. Nat Med 21 (5), 512-7.
289. Craciun, F.L. et al. (2015) RNA Sequencing Identifies Novel Translational Biomarkers of Kidney Fibrosis. J Am Soc Nephrol.
290. Kang, H.M. et al. (2015) Defective fatty acid oxidation in renal tubular epithelial cells has a key role in kidney fibrosis development. Nat Med 21 (1), 37-46.
291. Kanehisa, M. et al. (2016) KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res 44 (D1), D457-62.
292. Welter, D. et al. (2014) The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 42 (Database issue), D1001-6.
293. Pattaro, C. et al. (2012) Genome-wide association and functional follow-up reveals new loci for kidney function. PLoS Genet 8 (3), e1002584.
294. Chasman, D.I. et al. (2012) Integration of genome-wide association studies with biological knowledge identifies six novel genes related to kidney function. Hum Mol Genet 21 (24), 5329-43.
295. Wallace, C. et al. (2012) Statistical colocalization of monocyte gene expression and genetic risk variants for type 1 diabetes. Hum Mol Genet 21 (12), 2815-24.
296. Fortune, M.D. et al. (2015) Statistical colocalization of genetic risk variants for related autoimmune diseases in the context of common controls. Nat Genet 47 (7), 839-46.
297. Guo, H. et al. (2015) Integration of disease association and eQTL data using a Bayesian colocalisation approach highlights six candidate causal genes in immune-mediated diseases. Hum Mol Genet 24 (12), 3305-13.
298. Wang, Y. et al. (2013) Significance of glycosylphosphatidylinositol-anchored protein enrichment in lipid rafts for the control of autoimmunity. J Biol Chem 288 (35), 25490-9.
299. Howard, M.F. et al. (2014) Mutations in PGAP3 impair GPI-anchor maturation, causing a subtype of hyperphosphatasia with mental retardation. Am J Hum Genet 94 (2), 278-87.

300. Kovesdy, C.P. et al. (2010) Outcome predictability of serum alkaline phosphatase in men with pre-dialysis CKD. Nephrol Dial Transplant 25 (9), 3003-11.
301. Uhlen, M. et al. (2015) Proteomics. Tissue-based map of the human proteome. Science 347 (6220), 1260419.
302. Zhu, M. et al. (2006) Beta-mannosidosis mice: a model for the human lysosomal storage disease. Hum Mol Genet 15 (3), 493-500.
303. Lee, J.W. et al. (2015) Deep Sequencing in Microdissected Renal Tubules Identifies Nephron Segment-Specific Transcriptomes. J Am Soc Nephrol 26 (11), 2669-77.
304. Seiler, C. and Pack, M. (2011) Transgenic labeling of the zebrafish pronephric duct and tubules using a promoter from the enpep gene. Gene Expr Patterns 11 (1-2), 118-21.
305. Wang, J. et al. (2016) Imputing Gene Expression in Uncollected Tissues Within and Beyond GTEx. Am J Hum Genet 98 (4), 697-708.
306. Surendran, K. et al. (2014) Lysosome dysfunction in the pathogenesis of kidney diseases. Pediatr Nephrol 29 (12), 2253-61.
307. Birn, H. and Christensen, E.I. (2006) Renal albumin absorption in physiology and pathology. Kidney Int 69 (3), 440-9.
308. Alroy, J. et al. (2002) Renal pathology in Fabry disease. J Am Soc Nephrol 13 Suppl 2, S134-8.
309. Gavin Band, J.M.
310. Jensen, B.L. (2004) Reduced nephron number, renal development and 'programming' of adult hypertension. J Hypertens 22 (11), 2065-6.
311. Woroniecki, R. et al. (2011) Fetal environment, epigenetics, and pediatric renal disease. Pediatr Nephrol 26 (5), 705-11.
312. Pujadas, E. and Feinberg, A.P. (2012) Regulated noise in the epigenetic landscape of development and disease. Cell 148 (6), 1123-31.
313. Laird, P.W. (2010) Principles and challenges of genomewide DNA methylation analysis. Nat Rev Genet 11 (3), 191-203.
314. Weber, W. (2010) Cancer epigenetics. Prog Mol Biol Transl Sci 95, 299-349.
315. You, J.S. and Jones, P.A. (2012) Cancer genetics and epigenetics: two sides of the same coin? Cancer Cell 22 (1), 9-20.
316. Jones, P.A. and Laird, P.W. (1999) Cancer epigenetics comes of age. Nat Genet 21 (2), 163-7.
317. O'Dwyer, K. and Maslak, P. (2008) Azacitidine and the beginnings of therapeutic epigenetic modulation. Expert Opin Pharmacother 9 (11), 1981-6.


318. Ryan, R.J. and Bernstein, B.E. (2012) Molecular biology. Genetic events that shape the cancer epigenome. Science 336 (6088), 1513-4.
319. Villeneuve, L.M. and Natarajan, R. (2010) The role of epigenetics in the pathology of diabetic complications. Am J Physiol Renal Physiol 299 (1), F14-25.
320. Labrie, V. et al. (2012) Epigenetics of major psychosis: progress, problems and perspectives. Trends Genet 28 (9), 427-35.
321. Stenvinkel, P. et al. (2008) Emerging biomarkers for evaluating cardiovascular risk in the chronic kidney disease patient: how do new pieces fit into the uremic puzzle? Clin J Am Soc Nephrol 3 (2), 505-21.
322. Woroniecka, K.I. et al. (2011) Transcriptome analysis of human diabetic kidney disease. Diabetes 60 (9), 2354-69.
323. Alvarez, H. et al. (2011) Widespread hypomethylation occurs early and synergizes with gene amplification during esophageal carcinogenesis. PLoS Genet 7 (3), e1001356.
324. Djavani, M. et al. (1993) Alterations of collagen content in kidney of diabetic rabbits. Biochem Soc Trans 21 (Pt 3) (3), 275S.
325. Maher, B. (2012) ENCODE: The human encyclopaedia. Nature 489 (7414), 46-8.
326. Ernst, J. and Kellis, M. (2012) ChromHMM: automating chromatin-state discovery and characterization. Nat Methods 9 (3), 215-6.
327. Thurman, R.E. et al. (2012) The accessible chromatin landscape of the human genome. Nature 489 (7414), 75-82.
328. Bailey, T.L. (2002) Discovering novel sequence motifs with MEME. Curr Protoc Bioinformatics Chapter 2, Unit 2 4.
329. Tanaka, E. et al. (2011) Improved similarity scores for comparing motifs. Bioinformatics 27 (12), 1603-9.
330. Schiffer, M. et al. (2000) Smad proteins and transforming growth factor-beta signaling. Kidney Int Suppl 77, S45-52.
331. Inazaki, K. et al. (2004) Smad3 deficiency attenuates renal fibrosis, inflammation, and apoptosis after unilateral ureteral obstruction. Kidney Int 66 (2), 597-604.
332. Yu, J. et al. (2012) Identification of molecular compartments and genetic circuitry in the developing mammalian kidney. Development 139 (10), 1863-73.
333. Roseboom, T.J. et al. (1999) Blood pressure in adults after prenatal exposure to famine. J Hypertens 17 (3), 325-30.
334. Godfrey, K.M. et al. (1994) Maternal nutritional status in pregnancy and blood pressure in childhood. Br J Obstet Gynaecol 101 (5), 398-403.


335. Bielesz, B. et al. (2010) Epithelial Notch signaling regulates interstitial fibrosis development in the kidneys of mice and humans. J Clin Invest 120 (11), 4040-54.
336. Khulan, B. et al. (2006) Comparative isoschizomer profiling of cytosine methylation: the HELP assay. Genome Res 16 (8), 1046-55.
337. Thompson, R.F. et al. (2009) A pipeline for the quantitative analysis of CG dinucleotide methylation using mass spectrometry. Bioinformatics 25 (17), 2164-70.
338. Feng, J. et al. (2012) Identifying ChIP-seq enrichment using MACS. Nat Protoc 7 (9), 1728-40.
339. Langmead, B. et al. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3), R25.
340. Thomas-Chollier, M. et al. (2011) RSAT 2011: regulatory sequence analysis tools. Nucleic Acids Res 39 (Web Server issue), W86-91.
341. van Helden, J. (2003) Regulatory sequence analysis tools. Nucleic Acids Res 31 (13), 3593-6.
342. Melnikov, A. et al. (2012) Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay. Nat Biotechnol 30 (3), 271-7.
343. Tewhey, R. et al. (2016) Direct Identification of Hundreds of Expression-Modulating Variants using a Multiplexed Reporter Assay. Cell 165 (6), 1519-29.
344. Liu, X.S. et al. (2016) Editing DNA Methylation in the Mammalian Genome. Cell 167 (1), 233-247 e17.
