DOCTOR OF PHILOSOPHY

Next Generation Sequencing and Genome-Wide Association Studies to Identify Mitochondrial Genomic Features Associated with Diabetic Kidney Disease

Skelly, Ryan

Award date: 2020

Awarding institution: Queen's University Belfast

Link to publication

Terms of use All those accessing thesis content in Queen’s University Belfast Research Portal are subject to the following terms and conditions of use

• Copyright is subject to the Copyright, Designs and Patent Act 1988, or as modified by any successor legislation • Copyright and moral rights for thesis content are retained by the author and/or other copyright owners • A copy of a thesis may be downloaded for personal non-commercial research/study without the need for permission or charge • Distribution or reproduction of thesis content in any format is not permitted without the permission of the copyright holder • When citing this work, full bibliographic details should be supplied, including the author, title, awarding institution and date of thesis

Take down policy A thesis can be removed from the Research Portal if there has been a breach of copyright, or a similarly robust reason. If you believe this document breaches copyright, or there is sufficient cause to take down, please contact us, citing details. Email: [email protected]

Supplementary materials Where possible, we endeavour to provide supplementary materials to theses. This may include video, audio and other types of files. We endeavour to capture all content and upload as part of the Pure record for each thesis. Note, it may not be possible in all instances to convert analogue formats to usable digital formats for some supplementary materials. We exercise best efforts on our behalf and, in such instances, encourage the individual to consult the physical thesis for further information.

Download date: 07. Oct. 2021

Next Generation Sequencing and Genome-Wide Association Studies to Identify Mitochondrial Genomic Features Associated with Diabetic Kidney Disease

A thesis submitted to the School of Medicine, Dentistry and Biomedical Sciences of the Queen’s University of Belfast for the degree of:

Doctor of Philosophy (PhD) 30th September 2019

Ryan Skelly

B.Sc. (Hons) Human Biology (QUB, 2013) M.Sc. Bioinformatics and Computational Genomics (QUB, 2015)

1

Preface

The research described in this thesis was carried out within the Nephrology Research Group at the Belfast City Hospital and the Centre of Public Health, Queen’s University Belfast between September 2016 and September 2019. Illumina sequencing was performed by the Genomics Core Technology Unit, Queen’s University Belfast. Funding was provided by the Department for the Economy. For any Tables and Figures reproduced or adapted from other sources suitable permissions have been requested and relevant references provided in the bibliography.

2

Acknowledgements

I would like to thank my supervisors, Dr Amy Jayne McKnight and Professor Peter Maxwell, for their guidance and support throughout this project. Thank you for giving me the opportunity to carry out this research and your encouragement. I would also like to thank Jill Kilner and Christopher Wooster for providing support with lab work as well as Dr Laura Smyth and Dr Marisa Cañadas Garre for their help and guidance at various stages of the project. To everyone else within the NephRes group I offer my sincere thanks and gratitude and I am pleased to have been a part of this research team. I would also like to thank my family and friends for their support and understanding over the past three years.

3

Publications and Presentations from Work in this Thesis

Genetic and epigenetic analysis in affecting mitochondrial function are associated with chronic kidney disease in an older population

Cappa, R., Smyth, L., Cañadas-Garre, M., Polpo de Campos, C., Skelly, R., Cruise, S., Kee, F., McGuinness, B., Godson, C., Maxwell, A. & McKnight, A., 2018, (Submitted).

Research output: Contribution to conference › Abstract

Genetic susceptibility to chronic kidney disease – some more pieces for the heritability puzzle: Genetic predisposition to kidney disease

Garre, M. C., Anderson, K., Cappa, R., Skelly, R., Smyth, L., McKnight, A. J. & Maxwell, A. P., 31 May 2019, In : Frontiers in Genetics.

Research output: Contribution to journal › Review article

Mitochondria and chronic kidney disease: a molecular update

Skelly, R., Maxwell, A. & McKnight, A., 01 May 2019, In : SPG Biomed.

Research output: Contribution to journal › Article

Genetic risk factors in mitochondrial DNA associated with diabetic kidney disease – GWAS discovery and meta-analysis

Skelly, R., Smyth, L., Cole, J., Maxwell, A., GENIE Consortium, DNCRI Consortium & McKnight, A., 2019, (Accepted).

Research output: Contribution to conference › Poster

4

Abbreviations Used in this Thesis Abbreviation Long form 11βHSD1 11β-hydroxysteroid dehydrogenase 1 2-h PG one-off random plasma glucose concentration ACE angiotensin-converting enzyme ACE2 angiotensin-converting enzyme 2 ACR albumin to creatinine ratio AGEs advanced glycation end products AKI acute kidney injury AMPK 5' adenosine monophosphate-activated kinase

AT1 angiotensin II receptor type 1

AT2 angiotensin II receptor type 2 ATP Adenosine Triphosphate AusDiane Austrian Diabetic Nephropathy Study BCH Belfast City Hospital BCL binary base call format BRTx Belfast renal transplant BST-DNA bisulfite-treated DNA CACTI Coronary Artery Calcification in Type 1 Diabetes cAMP cyclic adenosine monophosphate CD95L CD95 ligand CD-CV common disease/common variant CKD chronic kidney disease CVD cardiovascular disease DCCT/EDIC Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and Complications DKD diabetic kidney disease D-loop displacement-loop DM diabetes mellitus DNA deoxyribonucleic acid DNMT DNA methyltransferase DNCRI Diabetic Nephropathy Collaborative Research Initiative DPP-4i dipeptidyl peptidase-4 inhibitors

5

ECM extra cellular matrix EDC Pittsburgh Epidemiology of Diabetes Complications Study eGFR estimated glomerular filtration rate EGFR epidermal growth factor receptor ESRD end-stage renal disease ET-1 endothelin-1 ETC electron transfer chain FFA free fatty acid FGF21 fibroblast growth factor 21 FIND Family Investigation of Nephropathy and Diabetes FinnDiane Finnish Diabetic Nephropathy Study FPG fasting plasma glucose GAD glutamic acid decarboxylase 65 GADPH glyceraldehyde-3-phosphate dehydrogenase GBM glomerular basement membrane GCTU Genomics Core Technology Unit GDM gestational diabetes mellitus GENEDIAB French and Belgian subjects from the Genetics of Diabetic Nephropathy GENIE Genetics of Nephropathy – an International Effort GFR glomerular filtration rate GLP1 glucagon-like peptide 1 GoKinD Genetics of Kidneys in Diabetes GoKinD Genetics of Kidneys in Diabetes US Study GSH reduced glutathione GWAS genome-wide association studies HapMap International Haplotype Map HbA1c haemoglobin A1c HGNC HUGO Nomenclature Committee HLA human leucocyte antigens HSP heavy strand promoter H-Strand heavy strand iDCs immature dendritic cells

6

IDF International Diabetes Federation IFG impaired fasting glucose IFNγ interferon-γ IGT impaired glucose tolerance IR insulin resistance JAK2 janus kinase-2 JOSLIN Joslin Kidney Study LADA latent autoimmune diabetes of adults LatDiane Latvian Diabetic Nephropathy Study LD linkage disequilibrium LitDiane Lithuanian Diabetic Nephropathy Study lncRNA long ncRNA LSP Light Strand Promoter L-Strand Light Strand MAF minor allele frequency MHC major histocompatibility complex miRNA micro RNA mmol/L millimoles per litre MODY maturity onset diabetes of the young mRNA messenger RNA mtDNA mitochondrial DNA mtSSB mitochondrial single-stranded DNA-binding protein NADH reduced nicotinamide adenine dinucleotide NCR non-coding region ncRNA non-coding RNAs nDNA nuclear DNA NEMGs nuclear encoded mitochondrial genes NephRes QUB Nephrology Research Group NF-κB nuclear factor kappa-light-chain-enhancer of activated B cells NGS next-generation sequencing NICOLA Northern Ireland Cohort for the Longitudinal study of Ageing NOD non-obese diabetic

7

NP natriuretic peptide

OH origin of replication for the H-strand OR odds ratio OXPHOS oxidative phosphorylation PCR polymerase chain reaction PIF1 petite integration frequency 1 piRNA piwi-interacting RNA PKC protein kinase C POLRMT mitochondrial RNA polymerase POLγ DNA polymerase γ POLγB DNA polymerase γ processivity factor PPARγ peroxisome proliferator-activated receptor gamma PTM post-translational modification QC quality control QUB Queen's University Belfast RAAS renin-angiotensin-aldosterone system RNA ribonucleic acid RNaseH1 ribonuclease H1 ROCK1 Rho-associated coiled coil-containing protein kinase 1 RomDiane Romanian Diabetic Nephropathy Study ROS reactive oxygen species rRNA Ribosomal RNA RRT renal replacement therapy SBS sequencing by synthesis SDM strand-displacement model SDRNT1BIO Scottish Diabetes Research Network Type 1 Bioresource SGLT2 sodium/glucose co-transporter 2 siRNA small interfering RNA SNP single nucleotide polymorphism snRNA small nuclear RNA SOD superoxide dismutase SOLiD Sequencing by Oligonucleotide Ligation and Detection

8

SPPARM selective peroxisome proliferator-activated receptor modulator SUMMIT Surrogate markers for Micro- and Macro-vascular har d endpoints for Innovative diabetes Tools SUV3 suppressor of Var1 3-Like Protein 1 T1DM type 1 diabetes mellitus T2DM type 2 diabetes mellitus TFAM mitochondrial transcription factor A TFBM mitochondrial transcription factor B TGF-β transforming growth factor beta TLRs toll-like receptors TNF-α tumour necrosis factor-α

Tregs activated regulatory T cells tRNA transfer RNA tRNAIle mitochondrial isoleucine tRNA gene TZD thiazolidinediones UAE urinary albumin excretion UCSC University of California, Santa Cruz genome browser UK United Kingdom UK-ROI All Ireland / Warren 3 GoKinD UK Collection VCF variant call format VEGF vascular endothelial growth factor WES whole exome sequencing WESDR Wisconsin Epidemiologic Study of Diabetic Retinopathy WGS whole genome sequencing WHO World Health Organisation

9

Abstract

Diabetic kidney disease (DKD) affects ~40% of persons with diabetes and is the leading cause of chronic kidney disease and end-stage renal disease globally. Mitochondrial dysfunction is implicated in the pathophysiology of DKD. Previous research reported SNPs in nuclear genes, which influence mitochondrial function, are significantly associated with DKD. Furthermore, these genetic and functional data prompted further investigation of single nucleotide polymorphisms (SNPs) affecting mitochondrial function for association with DKD.

Targeted genome wide association analyses focusing on mitochondrial DNA (mtDNA) and nuclear genes involved with mitochondrial function were performed using DNA samples from the All Ireland / Warren 3 Genetics of Kidneys in Diabetes UK Collection. SNPs in NEMGs found to be suggestive of association were followed up in a larger collection which included up to 19,406 individual and updated imputation to approximately 49 million SNPs. The SNP that showed most evidence for association with decreased glomerular filtration rate after adjusting for covariates was MitoG11915A (P=0.0003) which is a synonymous variant found in the mitochondrial gene coding for the NADH-ubiquinone oxidoreductase chain 4 protein. In nuclear genes involved with mitochondrial function there were eight SNPs in four genes associated with DKD related phenotypes.

Next generation sequencing offers the ability to investigate genetic and epigenetic features in more detail than genetic association studies and these have been successful used for research into various disease including DKD. In Chapter 3 of this thesis Illumina sequencing technology is used to investigate mtDNA, gene expression and methylation in DKD with the aim of establishing a sequencing and analyses workflow for targeted analysis of mtDNA.

The work in this thesis has utilised genome wide association studies and next generation sequencing to investigate mitochondrial genetic variants which may be used to identify predisposition to DKD in individuals with type 1 diabetes. While substantial progress has been made in DKD research there is still a long way to go and genetics and epigenetics represent only a small piece of the puzzle. The widespread use of precision medicine will require seamless integration of clinical data with full “omics” profiles for each patient with large data repositories which can identify and generate treatment options at an individual patient level.

10

List of Figures

Figure Title Page

1.1 Comparative Prevalence of Diabetes by IDF Region, 2017 and 2045 22 1.2 Pathophysiology of Type 1 Diabetes Mellitus 27 1.3 Factors Involved with Development of Type 2 Diabetes Mellitus 28 1.4 Current and Future Diabetes Treatments and Intervention Site 33 1.5 Anatomy of the Kidney 34 1.6 Internal Structure of an Individual Nephron 35 1.7 Leading Causes of Years of Life Lost in 2016 and Predictions for 2040 36 1.8 Pathogenesis of Chronic Kidney Disease 39 1.9 Structural Changes in the Glomeruli in Diabetic Kidney Disease 41 1.10 Pathophysiology of Diabetic Kidney Disease 44 1.11 Key Components of the Metabolic Pathway of Diabetic Kidney Disease 47 1.12 Basic Structure of a Mitochondrion 52 1.13 Mitochondrial DNA and of the Electron Transport Chain 54 1.14 Mitochondrial Dysfunction in CKD 62 1.15 Effects of Reactive Oxygen Species Generated from Mitochondria 64 1.16 Epigenetic Modifications and Locations 67 1.17 Risk Allele Frequency and Strength of Genetic Effect in Association Studies 74 1.18 Indirect and Direct Associations in GWAS 77 1.19 Overview of SNP Array Technology. 78 1.20 Positions of Mitochondrial SNPs on Each Array in Relation to Mitochondrial 82 Genes and mRNA From the UCSC Database 1.21 Falling Cost of Genome Sequencing 2001 - 2019 90 1.22 Overview of Library Preparation for NGS 92 1.23 Library Amplification by Emulsion PCR 93 1.24 Library Amplification by Bridge PCR 94

2.1 Phenotype Definitions for DNCRI Follow Up 110 2.2 Summary of Gene Selection Process 111 2.3 Regional Association Plot for rs58721031 123 2.4 Regional Association Plot for rs56122389 126 2.5 Regional Association Plot for rs72851722 129

11

2.6 Regional Association Plot for rs17727979 136 2.7 Summary of SNPs Suggestive of Association with Each Phenotype 142 Investigated in nDNA and mtDNA

3.1 High Throughput Methods Used for Detection of Functional Genetic and 163 Epigenetic Features 3.2 Four Main Steps in an Illumina NGS Workflow 165 3.3 Two Methods of Dual Indexed Sequencing used by Illumina 167 3.4 Data Aggregation Workflow for Multiple FASTQ Files. 175 3.5 mtDNA Variant Processor User Interface with Settings used for Successful 178 Analyses 3.6 Workflow summary for mtDNA Variant Processor 179 3.7 mtDNA Variant Analyser Results Viewer 178 3.8 RNA Express User Interface with Settings Used for Successful Analyses 182 3.9 Workflow Summary for RNA Express 183 3.10 Summary of Comparisons Performed using RNA Express 184 3.11 MethylSeq User Interface with Settings Used for Successful Analyses 186 3.12 Workflow summary for MethylSeq 187 3.13 DRAGEN Methylation Pipeline User Interface with Settings Used for 189 Successful Analyses 3.14 MethylKit User Interface with Settings Used for Successful Analyses 191 3.15 Summary of comparisons performed using MethylKit 192 3.16 Full workflow for sequencing and analysis of mtDNA, RNA and BST-DNA 216

4.1 Reduced TFAM Expression May Contribute to Inflammation and Fibrosis in 227 CKD and DKD

12

List of Tables

Table Title Page

1.1 Criteria for Diagnosing Diabetes and Related Conditions 25 1.2 Clinical Criteria for Diagnosis of Diabetic Kidney Disease 42 1.3 Comparison of Whole Genome SNPs Found on Commonly used Genotyping 80 Platforms 1.4 Comparison of mtDNA SNPs found on Commonly used Genotyping 81 Platforms. 1.5 Uses of Genome-Wide SNP Arrays in Human Genetic Discoveries 85 1.6 Clinical Applications of Molecular Genetic Discoveries 88 1.7 Comparison of Next-Generation Sequencing Instrument Specifications 96

2.1 Summary of Major Loci Identified Through GWAS of DKD and DKD-Related 103 Traits 2.2 Clinical Characteristics of UK-ROI Samples Used for Targeted GWAS 107 2.3 Phenotypes Used for Targeted Mitochondrial Association Studies 112 2.4 Summary of Quality Control 112 2.5 SNPs in NEMGs Associated with the Diabetic Kidney Disease Phenotype in 114 UK-ROI samples 2.6 Minor Allele Frequencies in Various Populations for SNPs Associated with 115 Diabetic Kidney Disease Phenotype 2.7 SNPs in NEMGs Significantly Associated with End-Stage Renal Disease in 117 UK-ROI Samples 2.8 Minor Allele Frequencies in Various Populations for SNPs Significantly 118 Associated with End-Stage Renal Disease 2.9 SNPs in NEMGs on 2 significantly associated with estimated 120 glomerular filtration rate in UK-ROI samples 2.10 SNPs in NEMGs on Chromosome 3 Significantly Associated with Estimated 121 Glomerular Filtration Rate in UK-ROI Samples 2.11 SNPs in NEMGs on Chromosome 4 Significantly Associated with Estimated 122 Glomerular Filtration Rate in UK-ROI Samples 2.12 SNPs in NEMGs on Chromosome 6 Significantly Associated with Estimated 124 Glomerular Filtration Rate in UK-ROI Samples 2.13 SNPs in NEMGs on Chromosome 12 Significantly Associated with Estimated 125 Glomerular Filtration Rate in UK-ROI Samples 2.14 SNPs in NEMGs on Chromosome 17 Significantly Associated with Estimated 127 - Glomerular Filtration Rate in UK-ROI Samples 128

13

2.15 SNPs in NEMGs on Chromosome 22 Significantly Associated with Estimated 131 Glomerular Filtration Rate in UK-ROI Samples 2.16 SNPs in NEMGs on , 16 and 20 Significantly Associated with 134 - Estimated Glomerular Filtration Rate in UK-ROI Samples 135 2.17 SNPs in mtDNA Significantly Associated with DKD, ESRD and eGFR in 138 - Genotyped Data from UK-ROI Samples 140 2.18 SNPs Suggestive of Significant Association in the UK-ROI data and Carried 143 - Forward to the DNCRI Lookup 145 2.19 SNPs in nDNA Suggestive of Significant Association in UK-ROI Analyses and 147 - DNCRI Lookup 149 2.20 MAFs for SNPs Significant in Targeted Association Analysis of UK-ROI Data 159 and DNCRI Lookup

3.1 Approximate Times for Each Step on Illumina Platforms Used for 166 Sequencing 3.2 Samples Used for Next-Generation Sequencing Project 169 - 170 3.3 Phenotype Groups to be Used for Data Analysis 171 3.4 Pre-Sequencing Quality Control and Quantification by GCTU 194 - 195 3.5 mtDNA Sequencing Performed by QUB GCTU 196 3.6 Overview of Run Metrics from mtDNA Sequencing Performed by GCTU 197 3.7 mtDNA Post-Sequencing QC for Each Read Including Index Reads 197 3.8 FASTQ Generation for mtDNA Sequencing by QUB GCTU 198 3.9 Post Alignment QC Summary for all 40 mtDNA Samples 198 3.10 RNA Sequencing Performed by QUB GCTU 199 3.11 Overview of Run Metrics from RNA Sequencing Performed by GCTU 200 3.12 RNA Post-Sequencing QC for Each Read Including Index Reads 200 3.13 FASTQ generation from RNA Seq performed by GCTU 201 3.14 Post Alignment QC Summary for 37 RNA Samples 202 3.15 BST-DNA sequencing run performed by QUB GCTU 203 3.16 Overview of Run Metrics from BST-DNA Sequencing Performed by GCTU 204 3.17 BST-DNA Post-Sequencing QC for Each Read Including Index Reads 204 3.18 FASTQ Generation from BST-DNA Sequencing Performed by GCTU 205 3.19 BST-DNA FastQC summary for all 40 samples 205 3.20 Summary of mtDNA Variant Processor Output from BaseSpace 206

14

3.21 Summary of Results Exported from mtDNA Variant Analyser 207 3.22 Summary of RNA Express Output from BaseSpace General Info 208 3.23 Summary of RNA Express Output from BaseSpace Run Specific Info 208 3.24 Summary of Differential Expression Counts from RNA Express Output 209 3.25 BST-DNA FASTQ Alignment performed with MethylSeq App in BaseSpace 210 3.26 BST-DNA FASTQ Alignment Performed with DRAGEN Methylation Pipeline 210 in BaseSpace 3.27 General Info from Methyl Kit Analyses with MethylSeq Output 211 3.28 Summary of Methyl Kit Analyses using MethylSeq Output 211 3.29 Differential Methylation Summary from MethylKit Runs using MethylSeq 212 Output 3.30 General Info from Methyl Kit Analyses with DRAGEN Methylation Pipeline 213 Output 3.31 Summary of Methyl Kit Analyses using DRAGEN Methylation Pipeline 213 Output 3.32 Summary of Differential Methylation from the MethylKit Analyses Using 213 the DRAGEN Methylation Pipeline Output 3.33 Summary of Totals for Full Workflow Including Time, Data Generated and 215 Cost 3.34 Number of Genes Differentially Expressed in More than One Comparison 218 3.35 Number of Differentially Methylated Bases in More than One Comparison 219 3.36 Differentially Methylated Bases from Individuals with DKD and ESRD 219 Compared to those with ESRD without Diabetes 3.37 Projections for Larger Study 223

4.1 Pros and Cons of Third Generation Sequencing Platforms 234

15

Table of Contents Preface 2 Acknowledgements 3 Publications and Presentations from Work in this Thesis 4 Abbreviations Used in this Thesis 5 Abstract 10 List of Figures 11 List of Tables 13 Chapter 1 Introduction and Project Overview 21

1.1 Introduction 21

1.2 Defining Diabetes 23

1.2.1 Type 1 Diabetes Mellitus 25

1.2.2 Type 2 Diabetes Mellitus 28

1.2.3 Rare Forms of Diabetes Mellitus 30

1.2.4 Diabetes Management and Treatment 31

1.3 Kidneys in Health and Disease 34

1.3.1 Global Burden of Chronic Kidney Disease 35

1.3.1.1 Pathophysiology of Chronic Kidney Disease 37

1.3.2 Diabetic Kidney Disease 41

1.3.2.1 Diagnosing Diabetic Kidney Disease 42

1.3.2.2 Pathophysiology of Diabetic Kidney Disease 43

1.3.2.2.1 Haemodynamic Pathway 45

1.3.2.2.2 Metabolic Pathway 45

1.3.2.2.3 Inflammation 48

1.3.2.3 Treatment of Diabetic Kidney Disease 48

1.3.2.4 Genetics of Diabetic Kidney Disease 50

1.4 Mitochondria and Kidney Disease 52

1.4.1 The Mitochondrial Genome 53

16

1.4.2 Regulation of Mitochondria by Nuclear Genes 56

1.4.3 Mitochondrial Energetics of the Kidney 57

1.4.4 Mitochondria and Chronic Kidney Disease 58

1.4.5 Mitochondria in Diabetic Kidney Disease 63

1.5 Epigenetics 65

1.5.1 Types of Epigenetic Modification 66

1.5.1.1 DNA Methylation 67

1.5.1.2 Non-Coding RNAs 68

1.5.1.3 Histone Modification 68

1.5.2 Epigenetics in nDNA and mtDNA 69

1.5.3 Epigenetics and Kidney Disease 70

1.6 Genome Wide Association Studies 72

1.6.1 Single Nucleotide Polymorphisms 72

1.6.2 Linkage Disequilibrium 74

1.6.3 Imputation 75

1.6.4 GWAS Methodology 77

1.6.4.1 SNP Arrays 78

1.6.4.2 Statistical Analysis and Significance Threshold 83

1.6.5 Uses for Genome-Wide Association Studies 83

1.6.5.1 GWAS in Research 83

1.6.5.2 From GWAS to Function 85

1.6.5.3 Clinical Applications of GWAS 68

1.7 Next Generation Sequencing 88

1.7.1 Sequencing Principles 90

1.7.2 Template Preparation Methods 91

1.7.3 Sequencing Methods and Platforms 94

1.7.4 Applications of Next Generation Sequencing 98

17

1.8 Project Rationale 99

1.8.1 Hypotheses 99

Chapter 2 Mitochondrial-Genome-Wide Association Studies 100

2.1 Introduction 100

2.1.1 GWAS in Diabetic Kidney Disease Research 100

2.1.2 Targeted GWAS Approach 105

2.2 Methods and Materials 106

2.2.1 Gene Selection 106

2.2.2 Study Populations 106

2.2.2.1 UK-ROI Collection 106

2.2.2.2 DNCRI Collection 107

2.2.3 Targeted Mitochondrial Association Analysis of UK-ROI Data 108

2.2.4 Follow Up in DNCRI Meta-Analysis Data 109

2.3 Results 110

2.3.1 Gene selection 110

2.3.2 Targeted Mitochondrial Association Analysis of UK-ROI Data 112

2.3.2.1 NEMGs Associated with DKD Phenotype in UK-ROI 112

2.3.2.2 NEMGs Associated with ESRD Phenotype in UK-ROI 116

2.3.2.3 NEMGs Associated with eGFR Phenotype in UK-ROI 119

2.3.2.3.1 Genes on Chromosome 2 Associated with eGFR 119

2.3.2.3.2 Genes on Chromosome 3 Associated with eGFR 121

2.3.2.3.3 Genes on Chromosome 4 Associated with eGFR 121

2.3.2.3.4 Genes on Chromosome 6 Associated with eGFR 122

2.3.2.3.5 Genes on Chromosome 12 Associated with eGFR 125

2.3.2.3.6 Genes on Chromosome 17 Associated with eGFR 126

2.3.2.3.7 Genes on Chromosome 22 Associated with eGFR 130

18

2.3.2.3.8 Genes on Chr. 15, 16 and 20 Associated with eGFR 132

2.3.2.4 Mitochondrial DNA SNP Associations in UK-ROI Data 136

2.3.2.5 Summary of UK-ROI Analyses Results 141

2.3.3 DNCRI Lookup of Suggestively Significant SNPs 145

2.4 Discussion 150

Chapter 3 Designing a Next Generation Sequencing Workflow for mtDNA 162 3.1 Introduction 162

3.1.1 Illumina Sequencing 164

3.2 Methods and Materials 168

3.2.1 Samples Collections and Comparison Groups 168

3.2.2 Samples Collections and Quality Control 171

3.2.3 Illumina Sequencing by QUB Genomics Core Technology Unit 172

3.2.3.1 Mitochondrial DNA Sequencing 173

3.2.3.2 RNA Sequencing 174

3.2.3.3 Methylation Sequencing 176

3.2.4 Data Analysis with BaseSpace 176

3.2.4.1 Mitochondrial DNA Sequencing Data Analysis 176

3.2.4.2 RNA Sequencing Data Analysis 180

3.2.4.3 Bisulfite-Treated DNA Sequencing Data Analysis 185

3.2.4.3.1 MethylSeq Alignment 185

3.2.4.3.2 DRAGEN Methylation Pipeline Alignment 187

3.2.4.3.3 MethylKit Analyses 190

3.3 Results 192

3.3.1 Sample Preparation and Quality Control 193

3.3.2 Illumina Sequencing by Genomics Core Technology Unit 195

3.3.2.1 Mitochondrial DNA Sequencing 195

19

3.3.2.2 RNA Sequencing 199

3.3.2.3 BST-DNA Sequencing 202

3.3.3 Data analysis with BaseSpace 205

3.3.3.1 Mitochondrial DNA Sequencing Analysis 205

3.3.3.2 RNA Sequencing Analysis 208

3.3.3.3 BST-DNA Sequencing Analysis 209

3.3.3.3.1 Analysis of MethylSeq Output 210

3.3.3.3.2 Analysis of DRAGEN Methylation Pipeline Output 212

3.3.4 Workflow Overview 214

3.4 Discussion 217

Chapter 4 Conclusions and Future Directions 226

4.1 Introduction 226

4.2 Summary of Findings 228

4.3 Strengths and Limitations of Research 229

4.4 Future Work 232

4.5 Concluding Remarks 235

Bibliography 237

Appendices Electronic Only

20

Chapter 1| Introduction and Project Overview

1.1 Introduction

Since 1980 the global prevalence of diabetes mellitus (DM) in adults has increased from approximately 108 million to around 425 million in 20161; if current trends continue, this is expected to reach 629 million by 2045 (Figure 1.1)2. A World Health Organisation (WHO) report described 1.5 million deaths worldwide in 2012 directly related to diabetes, with a further 2.2 million deaths attributed to other conditions exacerbated by diabetes such as cardiovascular disease (CVD) and chronic kidney disease (CKD)3. These predictions led to the development of a number of worldwide interventions such as the International Diabetes Federation’s Global Diabetes Plan 2011 – 20214 and the WHOs Global action plan for the prevention and control of non-communicable diseases 2013 – 20205. These initiatives attempt to address the rising rates of DM by increasing awareness, supporting research, improving treatment and reducing associated modifiable risk factors. The term DM encompasses a range of metabolic disorders involving abnormal blood glucose levels and should not be confused with diabetes insipidus which is a characterised by polyuria as a result of vasopressin deficiency, vasopressin resistance, or excessive water intake6. Although many landmark studies1,7 have been performed, most of these focus on type 2 diabetes mellitus (T2DM) which is characterised by insufficient insulin production as a consequence of β-cell dysfunction and insulin resistance (IR) in peripheral tissues resulting in sustained hyperglycaemia8. Type 1 diabetes mellitus (T1DM) accounts for an estimated 8.5% of DM diagnoses in the United Kingdom (UK)9 and 5- 10% of cases worldwide10. T1DM involves the destruction of pancreatic β-cells leading to an absolute insulin deficiency11. Two forms of T1DM have been identified: type 1A in which an autoimmune response initiates the destruction of β-cells12, and less common type 1B, known as idiopathic diabetes, where the pathogenesis is much less clear but does not appear to be autoimmune and often involves absolute insulin inhibition with frequent ketoacidosis10. T1DM is most often identified and diagnosed during childhood and for this reason it has historically been known as early-onset or juvenile-onset DM13,14. In recent years, it has become evident that autoimmune-mediated DM can develop in adulthood and T2DM can also develop in children adding a further layer of complexity to this condition15–17. Global estimates for the number of adults and children diagnosed with diabetes can be seen in Figure 1.1. In addition to the immediate burden of glycaemic control and daily insulin injections, DM may also lead to secondary complications and diabetes-related distress, which can have a significant impact on quality of life and also

21 places substantial pressure on the health service18. In 2017, the estimated global cost of DM was $727 billion USD1. The most recent estimates from the UK put the total cost of DM at £23.7 billion in 2010/2011 with £9.8 billion spent on direct costs of diagnosis, treatment, and disease management and a further £13.9 billion as a result of indirect factors such as loss of productivity, social costs and mortality19,20. If the incidence of diabetes continues to rise at the predicted rate, the associated costs to the health service and economy will also increase considerably. It has been estimated that if current trends continue the cost of treatment alone will reach £16.9 billion per year by 2035 and the total cost of DM to the UK could reach £39.8 billion21.

Figure 1.1| Comparative Prevalence of Diabetes by IDF Region, 2017 and 204522 If current trends continue by 2045 it is estimated that 629 million people aged 20-79 years, will have diabetes. Around 79% of those with diabetes live in low- and middle-income countries and the largest increases in prevalence will take place in regions where economies are moving from low income to middle income levels. Worldwide diabetes expenditure is estimated to be USD 727 billion, which translates to one for every eight dollars spent on healthcare. In 2017 there were 1,106,500 children and adolescents with type 1 diabetes. AFR, Africa; EUR, Europe; MENA, Middle East and North Africa; NAC, North America and Caribbean; SACA, South and Central America; SEA, South East Asia; WP, Western Pacific.

22

In many cases, poor disease management leads to prolonged durations of hyperglycaemia. This poor glycaemic control is associated with onset of secondary complications and those associated with DM include loss of vision due to diabetic retinopathy23, increased risk of CVD24,25, amputation of lower extremities26 and end-stage renal disease (ESRD) as a result of diabetic kidney disease (DKD)27,28. DKD is the underlying diagnosis in 12-55% of ESRD cases worldwide and in the UK in 2018 it was the leading incident cause of ESRD requiring renal replacement therapy29. Despite improved understanding of DKD and more effective treatments, which have improved the outlook for patients with DKD30, the increasing prevalence of DM worldwide is translating into a higher incidence of ESRD attributed to DKD in individuals with T1DM and T2DM31–35. In many cases the risk of developing secondary complications of DM, including DKD, can be significantly reduced by better control of modifiable risk factors such as hyperglycaemia, obesity, and hypertension. However, in reality it is often difficult to ensure patients are achieving these clinical targets36. Although the aforementioned risk factors are important in the development of DKD, these alone do not explain the fact that many patients with poorly controlled DM will not develop DKD (and vice-versa) and evidence for a genetic predisposition to the development of DKD has been reported as early as 198137. More recent genome-wide association studies (GWAS) have reinforced this idea of an underlying genetic basis for the development of DKD and ESRD in patients with both T1DM and T2DM38–45. Further investigation of this genetic predisposition to DKD through a combination of GWAS and studies of epigenetic features should shed more light on the complex genetic mechanisms that may be involved in development of DKD.

1.2 Defining diabetes

The WHO defines DM as a chronic, progressive disease caused by insufficient insulin production from the pancreas, or due to ineffectual use of insulin produced, resulting in elevated levels of glucose in the blood known as hyperglycaemia3. This hyperglycaemia often results in the ‘classic’ symptoms of DM which include polyuria (production of excessive volumes of urine), polydipsia (abnormal thirst), weight loss frequently combined with polyphagia (excessive eating) and blurred vision10. Clinical diagnosis of DM is typically based on the concentration of glucose in plasma, either with (symptomatic) or without (asymptomatic) the ‘classic’ symptoms of DM, and a summary of these diagnostic criteria can be found in Table 1.1. If symptoms of DM are present, diagnosis can be confirmed with a one- off random plasma glucose concentration (2-h PG) of ≥11.1 mmol/L or a fasting (at least 8 h) plasma glucose (FPG) concentration ≥7.0 mmol/L. An asymptomatic DM diagnosis requires

23 repeat measurements (on at least 2 separate occasions) of either a 2-h PG of ≥11.1 mmol/L or a FPG concentration ≥7.0 mmol/L, or a combination of 2-h PG and FPG results10,46. In certain situations, glycated haemoglobin may be employed to diagnose DM using an haemoglobin A1c (HbA1c) test, which is reflective of average plasma glucose concentration over 8-12 weeks47. Although much debate has surrounded the use of HbA1c in DM diagnoses, recent reports suggest that an HbA1c ≥ 48 mmol / mol may be indicative of DM, provided the test is carried out as described by the WHO47,4849. However, the use of HbA1c is not suitable for diagnosis of DM in children, pregnant or recently pregnant women, cases of suspected T1DM, those with short duration of DM symptoms, acutely ill patients, patients taking drugs which may increase glucose levels for less than 2 months, in cases of acute pancreatic damage, during renal failure or in patients with HIV infection50.

Historically DM has been classified into T1DM and T2DM, and although both types involve periods of prolonged hyperglycaemia due to on-going failure of β-cells, the pathogenesis of these two diseases is markedly different, with several mechanistic subtypes identified8. In healthy individuals, β-cells in the pancreatic islets of Langerhans secrete insulin in response to hyperglycaemia resulting from glucose secretion from the liver, as part of a negative feedback system. Insulin acts at various sites in the body to reduce blood glucose levels including increasing the uptake of glucose into a range of cells, particularly in skeletal muscle and to facilitate the conversion of glucose to glycogen. Insulin also increases protein and fatty acid synthesis and is therefore an essential hormone for development, growth and repair of various tissue51. In DM, the failure to produce insulin or IR leads to extended periods of hyperglycaemia that can be detrimental to various body systems and often leads to a range of microvascular and macrovascular complications. Despite these similarities, the pathological mechanisms, risk factors, environmental influences, and underlying genetics responsible for T1DM and T2DM differ considerably.

24

Table 1.1| Criteria for Diagnosing Diabetes and Related Conditions52,53 Diagnosis HbA1c* (%) FPG (mmol/L) 2-h PG (mmol/L Healthy <5.7 <6.1 <7.75 Pre-diabetes 6.0 - 6.4† - - IFG - 6.1 – 7.0 - IGT - <7.0 7.8 – 11.1 Symptomatic DM ≥6.5% Single ≥7.0 Single ≥11.1 Asymptomatic DM‡ Twice ≥6.5% Twice ≥7.0 Twice ≥11.1 2-h PG, 2hour plasma glucose; DM, diabetes mellitus; FPG, fasting plasma glucose; HbA1c, glycated haemoglobin; IFG, impaired fasting glucose; IGT, impaired glucose tolerance. *HbA1c should not be used for diagnosis of type 1 diabetes; †The American Diabetes Association extends this to HbA1c 5.7–6.4%; ‡Any one of the mentioned repeated measurements or an HbA1c ≥6.5% AND a single elevated plasma glucose (fasting ≥7 or random ≥11.1)

1.2.1 Type 1 Diabetes Mellitus T1DM was traditionally thought of as an immune-associated disorder of childhood and adolescence, however, it is now known that immune-mediated DM can also develop in adulthood, and age-at-onset should no longer be considered a limiting factor in diagnoses of T1DM16,54. In most cases of T1DM, progressive, immune-mediated destruction of insulin- producing β-cells in the pancreas leads to loss of blood glucose control and requires daily insulin replacement therapy to reduce the risk of acute complications, such as ketoacidosis and hypoglycaemia, and secondary conditions which include CVD, diabetic retinopathy and diabetic nephropathy12,55. Since the 1980s, much of the research into T1DM has involved the use of rodent models, most notably the inbred BioBreeding rat56 and non-obese diabetic (NOD) mouse57,58. These animal models show many similarities with the human form of this disease including several common genetic susceptibility loci and similar pathogenesis. The study of such models has greatly improved our understanding of the underlying genetics, environmental risk factors and mechanisms of action of this complex disease12. Several studies using these rodent models combined with observations in human T1DM have shown that targeted destruction of β-cells is mediated by disruption of immune regulation leading to proliferation of CD4+ and CD8+ T cells which are auto-reactive against islet antigens such as glutamic acid decarboxylase 65 (GAD), islet antigen-2, insulin and zinc transporter 8; production of autoantibodies from B lymphocytes; and activation of the innate immune system59–62. Marcovecchio and colleagues suggested that the destruction of β-cells results from presentation of islet autoantigens to T cells in the lymph node of the pancreas, thus

25 activating them and allowing them to invade the islets63. These invading T cells then secrete cytokines which may have a direct role in β-cell destruction, but also act to attract cytotoxic T cells, which magnify the damage to β-cells, without damaging other cells within the islets63. A summary of the mechanisms involved in T1DM pathogenesis can be seen in Figure 1.2.

The main auto-antibodies involved in T1DM are insulin antibody, GAD, a tyrosine phosphatase-like molecule called islets antigen-2 antibody and zinc transporter 8, which are not known to be pathogenic and may not appear in all cases of T1DM64. Around 70% of patients with T1DM are GAD positive, however some subgroups of this disease may only have antibodies against one of the other antigens, as is the case in idiopathic T1DM, although greater concentration and number of antibodies have been associated with earlier disease onset and many patients with T1DM demonstrate abnormal glucose tolerance a few years prior to diagnosis65,66. T1DM shares certain genetic characteristics with other autoimmune diseases, particularly the involvement of many susceptibility genes within the major histocompatibility complex (MHC). Over 60 genes have been associated with T1DM and of these the human leucocyte antigens (HLA) DR3 and DR4 are thought to confer the greatest contribution to genetic susceptibility67,68. It has also been suggested that the autoimmune regulator gene AIRE plays an important role in development of T1DM and approximately 20% of those with spontaneous mutations in this gene go on to develop T1DM along with other autoimmune diseases12. Although this may partially explain the autoimmune destruction of β-cells observed in T1DM, it has been suggested that T1DM may develop from a combination of functional defects in bone marrow, thymus and immune system11. The involvement of various innate immune cells as well as cytokines produced from B lymphocytes may indicate that tissue inflammation also has an important role in the development of T1DM, similar to that in T2DM and other conditions such as Alzheimer’s disease and atherosclerosis69–77. While there is currently no known cure for T1DM, in terms of disease management recent guidelines recommend patients attempt to achieve HbA1c levels within the normoglycaemic range using intensive insulin therapy from the time of diagnosis combined with carbohydrate counting, in order to reduce the impact on future health and attempt to prevent secondary complication later in life78.

26

Figure 1.2| Pathophysiology of Type 1 Diabetes Mellitus79 In healthy individuals’ immature dendritic cells (iDCs) activated regulatory T cells (Tregs), including central tolerance which prevents β-cell death. The pathophysiology of type 1 diabetes can be broken down into three steps. a| Modified β-cell antigens are released from islets of Langerhans and presented by major histocompatibility complex (MHC) class I molecules. These antigens are recognized by CD8+ T cells that damage cells expressing MHC class I peptides by releasing cytotoxic cytokines (such as interferon-γ (IFNγ)) or via the perforin–granzyme pathway. b| β-cell components are taken up by iDCs and transported to pancreatic lymph nodes, where antigens are presented to CD4+ T cells. CD4+ effector T cells undergo clonal expansion and express adhesion molecules and chemokine receptors. c| In the pancreas activated CD4+ T cells recruit and activate inflammatory cells, resulting in insulitis. Cytokines mediate islet β- cell destruction by inducing pro-apoptotic signalling and/or expression of CD95 by islet β- cells, which are destroyed by CD95 ligand (CD95L)-expressing effector T cells. Free radical generation may also be involved in the pathogenic destruction of islet β-cell (not shown). Regulatory T cells are also involved at various stages to mediate type 1 diabetes.

27

1.2.2 Type 2 Diabetes Mellitus Although the aetiology of T2DM is more clearly understood in comparison to T1DM, the pathological processes involved are not so clear-cut, but are thought to be influenced by a combination of β-cell failure, IR, underlying genetics, and environmental factors. Unlike T1DM in which absolute β-cell failure results in failure to secrete insulin, the development of T2DM occurs due to gradual failure of insulin secretion resulting in progressive hyperglycaemia. Several mechanisms have been proposed to explain the β-cell failure in T2DM including glucotoxicity, lipotoxicity, amyloidosis, gene miscoding, mitochondrial disease and disruption of ion-channel function (Figure 1.3)80. In contrast to T1DM, several therapeutic approaches have shown that recovery of β-cell function may be possible in T2DM, most notably is the observed resolution of DM following bariatric bypass surgery, which suggests that β-cells can be reactivated over short time periods81–83.

Figure 1.3| Factors Involved with Development of Type 2 Diabetes Mellitus84 The major defects in type 2 diabetes are insulin resistance in muscle and liver and impaired insulin secretion by pancreatic β-cells. β-cell resistance to glucagon-like peptide 1 (GLP1) leads to reduced β-cell function, whereas increased glucagon levels and enhanced sensitivity to glucagon contribute to excessive glucose production by the liver. Insulin resistance (IR) in muscle and liver is aggravated in accelerated lipolysis and increased plasma free fatty acid (FFA) levels, which contributes to β-cell failure. Hyperglycaemia is maintained by increased renal glucose reabsorption by the sodium/glucose co-transporter 2 (SGLT2) and a higher threshold for glucose spillage in the urine. Weight gain results from resistance to the appetite-suppressing effects of insulin, leptin, GLP1, amylin and peptide YY, as well as imbalances in neurotransmitters, which further exacerbates IR. Vascular insulin resistance and inflammation are also involved in development of type 2 diabetes. AMPK, AMP- activated protein kinase; DPP4, dipeptidyl peptidase 4; IκB, inhibitor of NF-κB; MAPK, mitogen-activated protein kinase; NF-κB, nuclear factor-κB; RA, receptor agonist; ROS, reactive oxygen species; TLR4, Toll-like receptor 4; TNF, tumour necrosis factor; TZDs, thiazolidinediones.

28

Another feature common to T1DM and T2DM is IR which has been demonstrated in case- control insulin infusion studies in which the diabetic subjects were seen to clear glucose at a vastly reduced rate to the controls80. Almost 20 years ago, several drugs known as thiazolidinediones (TZD) were trialled to investigate the role of IR in DM due to their agonistic activity on the IR-associated nuclear receptors peroxisome proliferator-activated receptor gamma (PPARγ)85. Of these drugs, troglitazone proved to be hepatotoxic during clinical trials and rosiglitazone was associated with increased chance of cardiac failure, and these have since been discontinued86. Although pioglitazone is known to reduce IR, the unwanted side- effects of other TZD drugs has led to many concerns regarding the safety and efficacy of modulating IR85. In recent years, the idea that IR is critical to the development of T2DM has dwindled since IR is also found in many non-diabetic states, particularly in obesity. Although obesity is a known risk factor for T2DM, in many instances obesity can persist for several years without DM development80,87. In modern society, urbanisation of developing countries, an abundance of energy-dense food and decreasing levels of physical activity have led to what has been termed the “obesity epidemic” and up to an estimated 90% of cases of T2DM can be attributed to obesity88 due to a combination of IR and insulin deficiency89,90. IR in obesity is thought to result from elevated plasma levels of free fatty acids, which are metabolised at the expense of glucose, therefore reducing glucose uptake and glycogen synthesis by skeletal muscle. This leads to a state of chronic hyperglycaemia, which further impairs sensitivity to insulin and contributes to impaired β-cell function. Increased dietary fat intake has also been shown to cause lipid deposits within and around other tissues and organs that are not usually involved in fat storage. This stimulates the production of toxic reactive lipid species from the mitochondria, causing oxidative damage and cellular dysfunction to the surrounding tissue. Additionally this has the effect of enhancing IR, decreasing glucose metabolism, reducing insulin secretion from β-cells, enhancing β-cell apoptosis and this combination of factors eventually leads to DM91–94. In order to address the rising prevalence of T2DM and obesity there is a clear need for community-wide public health initiatives to raise awareness and encourage the shift towards healthier lifestyles. However, in reality these approaches are often difficult to implement and it is challenging to ensure compliance with such interventions5.

The relatively recent use of GWAS has led to the discovery of at least 40 T2DM genetic susceptibility loci and has also highlighted ethnic genetic differences, which may partially explain the phenotypic differences in DM between ethnic groups, such as those observed in Asian populations compared with Europeans95–97. Although the majority of these

29 susceptibility genes are linked to insulin sensitivity and β-cell function, two genes were of particular interest were PPARγ (which encodes PPARγ and is related to IR) and FTO (a gene linked with body weight)80,98–103 Subjects with defects in the FTO gene are found to be an average of 3kg heavier compared to the rest of the population103 and the higher prevalence of FTO and PPARγ amongst diabetics may go some way to explain the relationship between IR, obesity and DM.

1.2.3 Rare Forms of Diabetes In addition to T1DM and T2DM there are also several rarer forms of DM or diabetes-like conditions. Of particular note amongst these is gestational DM (GDM) which can be defined as any amount of glucose intolerance that is first recognised during pregnancy10. This is a temporary condition in which blood glucose values are above a normoglycaemic level, but below those typically required for DM diagnosis, and blood glucose levels should be re-tested at least six weeks after pregnancy104. Although this condition is temporary and the majority of patients’ blood glucose will return to normoglycaemic levels after pregnancy, there is a known risk of developing T2DM in addition to an increased risk of complications during pregnancy to both mother and child105. Late in the second or early in the third trimester of pregnancy, maternal insulin secretion is expected to increase up to three-fold to compensate for increased IR associated with pregnancy, and failure to secrete sufficient insulin to overcome this can often result in hyperglycaemia106,107. However, due to the criteria used to classify GDM, women with previously undiagnosed T2DM, pre-clinical T1DM or rarer forms of DM will also be included, and these cases may be prone to a higher risk of complications during pregnancy108. As our understanding of the pathophysiology of DM has improved, it has become clear that some forms of DM do not fit the criteria for either T1DM or T2DM. These cases are often the result of abnormalities within a single gene (monogenic), show an autosomal dominant pattern of inheritance and represent approximately 2% of cases of DM in the UK9. Maturity onset diabetes of the young (MODY) is a subtype of T2DM, characterised by inadequate insulin production and release from β-cells, but is typically diagnosed at a much younger age than would be expected of T2DM109. A number of subtypes of MODY have been identified and the treatment must be directed according to the specific genetic mutation, for example MODY2 is a result of a mutation in a glucokinase gene110, whereas MODY3 affects a gene coding for hepatocyte nuclear factor and is known to respond to treatment with insulin stimulating drugs called sulphonylureas111. Another rare form of DM is latent autoimmune diabetes of adults (LADA) which is a slowly progressing form of T1DM diagnosed in adulthood and characterised by the presence of diabetes-associated

30 autoantibodies without the need for insulin therapy at the time of diagnosis112. Two metabolic states that involve hyperglycaemia and insulin secretion are impaired glucose tolerance and impaired fasting glucose. Although these alone are not considered clinical conditions, they represent an intermediate stage between normal glucose control and DM, and are associated with an increased risk of T2DM46.

1.2.4 Diabetes Management and Treatment For many individuals diagnosed with diabetes overall health outcomes can be improved through careful management of glycaemia as well as cardiovascular risk factors including hypertension and hypercholesterolaemia by adopting a healthy diet, appropriate levels of physical activity and good adherence to any medicines prescribed by a physician113–115. In addition to disease management it is also essential to periodically monitor secondary complications in diabetic patients through eye examinations (retinopathy), measurement of urine albumin, serum creatinine and glomerular filtration rate (GFR) (nephropathy), foot examination and assessment of cardiovascular complications22. For individuals with T1DM an abundant supply of high quality insulin is essential and should be available to all individuals affected with this disease regardless of geographical location or socioeconomic status116–118. The International Diabetes Federation (IDF) Access to Medicines and Supplies report highlights that no low income country had full government provision (at no or low cost) of essential insulins to children or adults in 2016 and the cost of blood glucose supplies often exceeds the cost of insulin especially in some of the poorest countries116. In an attempt to address this issue and reduce the global burden of diabetes the IDF have developed the Life For A Child programme with the aim of providing insulin to over 18,000 of the poorest children and adolescents with T1DM in over 41 countries119.

For treatment of T2DM, insulin can also be used however the most commonly prescribed medications for this are metformin, sulphonylureas, glucagon-like peptide-1 (GLP-1) analogues and dipeptidyl peptidase-4 inhibitors (DPP4i)22,120. Some commonly used diabetes medications along with potential new treatments can be seen in Figure 1.4120. Such treatments act to enhance the natural response of the body to ingested food and reduce glucose levels after eating. The first line of defence against T2DM is healthy lifestyle through diet, physical activity, not smoking and maintaining a healthy body weight22. If such lifestyle changes are insufficient, the most frequently prescribed medication for initial treatment of T2DM is metformin. Metformin a complex drug that acts at multiple sites through various molecular mechanisms. This drug can directly or indirectly target the liver to lower glucose

31 production, and the gut to increase glucose utilisation, increase GLP-1 and alter the microbiome121. At a molecular level, metformin acts to inhibit Complex I of the mitochondrial respiratory chain in the liver, supressing adenosine triphosphate (ATP) production and activating 5' adenosine monophosphate-activated protein kinase (AMPK), enhancing insulin sensitivity and lowering cyclic adenosine monophosphate (cAMP), thus reducing the expression of gluconeogenic enzymes121–123. Metformin also exhibits AMPK-independent effects on the liver that may inhibit fructose-1,6-bisphosphatase with AMP121. This proposed mechanism of action has received some criticism due to high extracellular concentrations being required to produce rapid effects and some studies did not report any changes in cellular adenosine diphosphate:ATP ratios following treatment with metformin124,125 and it has been suggested that metformin can exhibit antidiabetic activity independent of AMPK126,127. It has been suggested that metformin may have adverse effects on renal function in patients with T2DM and CKD and careful adjustment of dose and monitoring of response to treatment is advised in such cases128,129. Insulin secretagogues, such as sulphonylureas and meglitinides, can also be used in T2DM to stimulate pancreatic beta cells to release insulin130. Sulphonylureas have been widely used in clinical practice since the 1950s and are a classic first or second line therapy in T2DM131,132. Meglitinides act through a similar mechanism to sulphonylureas but have a different binding site allowing more rapid absorption and faster insulin secretion but require more frequent dosing133. Alpha-glucosidase inhibitors such as acarbose, miglitol and voglibose reduce postprandial triglycerides without inducing hypoglycaemia, as they do not stimulate insulin release, and do not significantly affect body weight130,134,135 Acarbose has been seen to reduce the risk of CVD and slow the development of diabetes in patients with impaired glucose tolerance136,137. The use of TZD have been discussed previously and these increase insulin sensitivity through action on muscle, adipose tissue and liver by binding to PPARs to increase glucose utilization and decrease glucose production. However, there are a number of side effects associated with such drugs and troglitazone has been discontinued due to risk of severe hepatocellular injury138 and current TZD should be used with caution130. DPP4i increase activity of incretin agents that increase insulin secretion and inhibit glucagon therefore improving islet function and glycaemic control in T2DM. These may be used as monotherapy or in combination with metformin, TZDs and insulin. DPP4i are well tolerated; with low risk of producing hypoglycaemia, and have minimal effect on patient’s weight130. Long-term treatment of T2DM is dependent on the reversal of β-cell failure and improving IR and new treatment strategies are currently in development120,139. Despite these advances, there remains the challenge of preventing and reversing advanced microvascular and macrovascular complications associated with

32 diabetes. Current progress in this area is limited, and early diagnosis along with targeted treatment strategies may prove to be essential in closing this gap.

Figure 1.4| Current and Future Diabetes Treatments and Intervention Site120 Major Intervention sites for glucose-lowering treatments including current treatments and possible future treatments. GLP1, glucagon-like peptide-1; DPP4, dipeptidyl-peptidase-4; SGLT2, sodium-glucose co-transporter 2 (also known as SLC5A2); SGLT1, sodium-glucose co- transporter 1 (also known as SLC5A1); FGF21, fibroblast growth factor 21; SPPARM, selective peroxisome proliferator-activated receptor modulator; 11βHSD1, 11β-hydroxysteroid dehydrogenase 1. *Not recommended for glucose-lowering in all countries.

33

1.3 Kidneys in Health and Disease

The kidneys are bean shaped organs each with an indentation on the medial aspect, known as the hilum, where vessels, nerves and the ureter enter the kidney. The main anatomical features of the kidney can be seen in Figure 1.5. The outer part of the kidney is known as the renal cortex, which surrounds the inner medulla. The renal medullae are divided into triangular sections known as renal pyramids, separated by renal columns, tapering into renal papilla, each opening into a renal calyx that leads to the ureter.

Figure 1.5| Anatomy of the Kidney140 The parenchyma of the kidney is composed of an outer region known as the renal cortex and an inner region called the renal medulla. Connective tissue forms renal columns which radiate from the cortex through the medulla and separate renal pyramids and renal papillae. Renal papillae are groups of collecting ducts which transport urine produced in the nephrons to the calyces. The major calyces joint together to form the renal pelvis. The renal columns dived the kidney into 6 – 8 lobes and support the vessels entering a exiting the cortex.

The main function of the kidney is in the maintenance of homeostasis through the production of urine and in order to carry out this role each kidney contains over a million microscopic units known as nephrons. Nephrons are highly specialised funnels from which urine is produced and the structure of an individual nephron can be seen in Figure 1.6. Each nephron is composed of two main structures, the renal corpuscle and the renal tubule. Most nephrons (around 85%) are cortical nephrons which can be found high in the cortex with only the lower portion of the loop of Henle extending into the medulla. The remaining nephrons are known as juxtamedullary nephrons as their renal corpuscles are located near the junction of the cortex and medullary layers. The loop of Henle of juxtamedullary nephrons extend much

34 further into the medulla then cortical nephrons and these are involved in the concentration of urine 141. Urine from the collecting ducts of the renal tubules enter the calyx associated with each renal pyramid and from there they drain into the renal pelvis and on to the ureter. The volume of urine excreted by the kidneys is proportional intake with waste substances being secreted into the tubules of the nephron while allowing useful substances to be reabsorbed into the blood, in addition to the removal of waste substances the kidneys are also involved with adjusting fluid and electrolyte balance141,142.

Figure 1.6| Internal Structure of an Individual Nephron143 The nephron is the functional unit of the kidney and is composed of the renal corpuscle and renal tubule. The renal corpuscle contains Bowman’s capsule which is surrounded by a network of capillaries known as the glomerulus and is the site of filtration. Fluid filtered through the glomerulus enters the renal tubule for selective reabsorption after which filtrate enters the collecting duct and is transport through the renal calyces to the ureter.

1.3.1 Global Burden of Chronic Kidney Disease CKD is a worldwide public health priority and is estimated to affect between 11 – 13% of the global population. CKD is associated with increased cardiovascular morbidity, premature mortality and a substantial healthcare cost144–146. In 2016, CKD was the 16th most common cause of death worldwide and by 2040 it is predicted to reach 5th place in the global mortality table as can be seen in Figure 1.7147. The increasing burden of CKD has been seen to affect low-income and middle-income countries at disproportionate levels and people with CKD in these countries are at higher risk of premature cardiovascular death or developing ESRD due

35 to poor management of risk factors and inadequate CKD management strategies148,149. Individuals in low- and middle- income countries also have poor access to renal replacement therapy, for example in middle and eastern Africa less than 3% of people requiring renal replacement therapy receive it150.

Figure 1.7| Leading Causes of Years of Life Lost in 2016 and Predictions for 2040147 Top 20 Level 3 causes of years of life lost (YLLs) from 2016 and estimates for 2040 by rank order. Colour-coding is based on Global Burden of Disease (GBD) Level 1 cause hierarchy: red=communicable, maternal, neonatal, and nutritional diseases; blue=non-communicable causes; green=injuries. Lines between time periods connect causes with solid lines representing increasing relative rank and dashed lines representing decreasing rank. From 2016 to 2040, three measures of change are shown: percentage change in total number of YLLs, percentage change in the all-age YLL rate, and percentage change in the age standardised YLL rate. Statistically significant changes are in highlighted in bold. COPD, chronic obstructive pulmonary disease; Neonatal preterm birth, neonatal disorders due to preterm birth complications.

CKD is typically defined as abnormalities in kidney structure or function persisting for more than 3 months and impacting on the health of the individual151. Indicators of kidney damage are used to classify this disease including anatomical abnormalities on ultrasound imaging, reduced renal function measured by low estimated GFR (eGFR) and the presence of proteinuria most commonly measured using urinary albumin to creatinine ratio (ACR)151. Traditionally, the terms microalbuminuria (ACR of 30 to 300 mg/g) and macroalbuminuria (ACR > 300 mg/g) were used to denote the severity of proteinuria amongst CKD patients.

36

More recent data have demonstrated that any increase in ACR above normal level is associated with increased risk of renal damage, irrespective of eGFR levels, and therefore it is recommended that the terms microalbuminuria and macroalbuminuria be replaced by clinically significant albuminuria35,152–156.

As the body ages the kidneys undergo age-related structural changes along with a reduction in functional capacity. In adults over 35 years, the kidneys gradually lose functional nephrons and decrease in size so that by 80 to 85 years old many individuals will have lost up to 30% of total kidney mass157. Although many older people with reduced kidney mass will continue to have normal kidney function there is a reduction in the “margin of safety” which in turn will impact on the ability of the kidneys to respond to stress placed upon remaining nephrons such as infection or reduced blood flow157. Many elderly people have a low GFR but without any specific kidney disease. It is therefore important to understand the underlying genetic causes of kidney disease and decreasing renal function in order to design the most effective therapies and reduce the associated healthcare burden158,159. Persistent reduced kidney function with or without proteinuria, measured by ACR, is indicative of CKD and is associated with damage to kidney tubules and glomeruli. Advanced CKD may lead to further complications necessitating renal replacement therapy (RRT) (dialysis and/or renal transplant).

1.3.1.1 Pathophysiology of Chronic Kidney Disease Regardless of the underlying cause of kidney damage, there are three main intermediaries which contribute to the progression of CKD160. Firstly, glomerular hyperfiltration and an increase in single nephron GFR result from a loss of nephron mass, as described by Brenner and colleagues161–163. Hypertension has long been established as an important marker of CKD development and the Multiple Risk Factor Intervention Trial demonstrated that blood pressure is an independent predicator of disease progression amongst CKD patients164. Finally, proteinuria which was once viewed as simply a marker for renal injury, is now generally acknowledged as a contributing pathogenic pathway in CKD progression irrespective of specific underlying glomerular pathology and will lead to increased glomerular damage165. Gradual decline in kidney function in CKD leads to dysfunction of several biological pathways including altered cellular metabolism, changes to nitrogen balance and protein metabolism, IR and increased production of mediators of inflammation and oxidative stress166–169. Increased generation of reactive oxygen species (ROS) has been observed at various stages of CKD and it has been suggested that this may be the result of mitochondrial dysfunction170.

37

Although the underlying cause of reduced renal function in CKD may vary between cases there are several pathways involved with CKD development including the accumulation of extra cellular matrix (ECM) proteins in the glomerulus, fibrosis, tubular atrophy and inflammation (Figure 1.8). The glomerular capillaries are surrounded by tubular epithelial cells called podocytes which are key components in the control of glomerular filtration171. Podocytes are highly specialised cells with foot processes regularly spaced around the glomerular basement membrane forming a multi-layered sieve known as the slit diaphragm172. The glomerular filtration barrier is a dynamic structure due to the ability of podocytes to efface and reform their foot processes, and effacement is thought to occur in response to podocyte injury stimulated by actin cytoskeleton reorganisation and accompanied by increased proteinuria. Although this process of podocyte effacement is thought to be reversible in the early stages, it has been suggested that prolonged stress on the glomerular filtration barrier will lead to a reduction in podocyte density and overall podocyte numbers. This may be due to podocyte detachment and effacement resulting from shear forces acting on the walls of the slit diaphragm experienced in glomerular hyperfiltration159,173.

38

Figure 1.8| Pathogenesis of Chronic Kidney Disease174 After the loss of half the total nephron mass, CKD progresses in a similar manner regardless of underlying causes. Activation of the renin-angiotensin-aldosterone system (RAAS) occurs due to initial hyperfiltration leading to proteinuria. Angiotensin II and protein uptake in the renal tubules leads to inflammation and fibrosis of the glomerulus and tubules. This leads to progressive decline in glomerular filtration and systemic complications.

Several mediators of fibrosis and inflammation have been implicated in CKD development, possibly due to their actions in the glomerulus and kidney tubules. One such mediator is transforming growth factor beta (TGF-β) which is secreted in response to renal inflammation has also been shown to promote renal fibrosis175. TGF-β is thought to play a key role in the development of glomerulosclerosis through its ability to stimulate podocyte apoptosis and detachment from the glomerular basement membrane176; induce mesangial expansion through hypertrophy and proliferation of mesangial cells and synthesise ECM177; initiation of endothelial to mesenchymal transition to produce glomerular myofibroblasts which are a

39 major source of ECM175,178. These actions of TGF-β contribute to the pathological processes of CKD through increased fibroblast proliferation and the associate production of ECM in renal tubule and fibroblasts which eventually results in loss of epithelial cells, tubular cell death and interstitial fibrosis.

Expression of epidermal growth factor receptor (EGFR) may be a key player both CKD and acute kidney injury (AKI), which is a known risk factor for CKD. EGFR activation by its exogenous ligands such as EGF has been shown to promote regeneration and recovery following AKI whereas in many animal models of CKD EGFR has been associated with the initiation and progression of renal fibrosis, which may be due to a defective repair mechanisms in response to persistent renal injury179. Activation of the NLRP3 inflammasome by Toll-like Receptors (TLRs) is also thought to play an important role in renal fibrosis and progression of CKD in response to damage associated molecular patterns released from injured tissue and may also partially explain the maladaptive response of the immune system observed in CKD180,181. Reduced renal function is associated with a number of other complications, most notably of which is an increased risk of CVD, with evidence to suggest that decreasing GFR is associated with increased CVD risk154,182–185. CKD has also been associated with other vascular complications including hypertension, cerebrovascular diseases and stroke186–188. The association between CKD and CVD may be multifactorial due to the existence of other concordant cardiovascular conditions such as hypertension and diabetes, which may in turn influence the course of treatment received186,189,190. Oxidative stress is a common feature of CKD and increased production of ROS due to mitochondrial dysfunction has been observed in diabetes, inflammation and aging191. Increased production of ROS due to mitochondrial dysfunction may also contribute to CVD and other comorbidities associated with CKD191–193. Release of ROS is also known to initiate sustained activation of EGFR through ROS-dependent phosphorylation of Src and also act as damage-associated molecular patterns activating the NLRP3 inflammasome when released from injured renal tissue180,194. Over the last decade or so several genes have been associated with CKD development and a number of these may also be directly involved in maintaining mitochondrial function and integrity195. Further investigation into the role mitochondria play in maintaining renal health and their contribution to glomerular dysfunction and CKD may unveil new disease mechanism as well as novel treatments targeted at restoring mitochondrial function.

40

1.3.2 Diabetic Kidney Disease In addition to the immediate impact on daily life, DM is associated with a number of microvascular complications such as DKD and diabetic retinopathy, or macrovascular complications including diabetic CVD23–25,27,28. The rising burden of CKD has been attributed to increasing rates of obesity and diabetes in low-income and middle-income countries and from 2005 to 2015, the prevalence of DKD increased by 39.5% globally196,197. In the western world DKD is the leading cause of ESRD3,198,, a major cause of mortality in T1DM and T2DM, and has also been associated with higher risk of CVD. Over the course of this disease the glomeruli undergo three major histological changes. Hyperglycaemia initially triggers accumulation of ECM that leads to mesangial expansion and thickening of the glomerular basement membrane (GBM). Intraglomerular hypertension resulting from ischaemic injury or dilation of the afferent artery then leads to glomerular sclerosis199. These morphological changes to the glomerulus can be seen in Figure 1.9. In individuals with overt proteinuria and reduced GFR systemic hypertension is common and increased systolic blood pressure in DKD may further exacerbate kidney damage leading to ESRD199,200.

Figure 1.9| Structural Changes in the Glomeruli in Diabetic Kidney Disease201 The healthy glomerulus includes afferent arterioles, capillary loops, endothelial cells, basement membrane, podocytes, parietal epithelial cells, and tubule epithelial cells and is impermeable to albumin. During diabetic kidney disease (DKD) the glomerulus undergoes arterial hyalinosis, mesangial expansion, collagen deposition, basement membrane thickening, podocyte loss and hypertrophy, albuminuria, tubular epithelial atrophy, accumulation of activated myofibroblasts and matrix, influx of inflammatory cells, and capillary rarefaction. The micrograph on the left shows a normal glomerular section and on the right is from a DKD sample (PAS stained, x400 magnification).

41

1.3.2.1 Diagnosing Diabetic Kidney Disease A typical DKD diagnosis is based on persistent clinically detectable proteinuria, or macroalbuminuria, combined with elevated blood pressure and reduced GFR, and is often preceded by a prolonged period of microalbuminuria (albumin excretion rate of 20 – 200 µg/min) of up to 20 years, sometimes referred to as incipient diabetic nephropathy202. The clinical criteria for diagnosis of DKD can be found in Table 1.2. Hyperglycaemia is known to be one of the main risk factors for development of DKD203, and studies have shown that intensive control of blood glucose levels (HbA1c of <6.5%) may be effective in halting decline in GFR and progression to ESRD35,204. Despite this evidence there is still much debate over the benefits of intensive glycaemic control in terms of DKD management due to a significantly increased risk of hypoglycaemia associated with reduced kidney function, particularly during the later stages of DKD205–207208,209. Earlier familial aggregation studies suggested a genetic basis for DKD and several candidate genes have emerged such as those regulating the renin- angiotensin-aldosterone system (RAAS), hypertension, dyslipidaemia, hyperglycaemia and immune response198,210,211

Table 1.2| Clinical Criteria for Diagnosis of Diabetic Kidney Disease212 Stage Description Characteristics GFR UAE Blood Time scale (mL/min/ (mg/ Pressure (years) 1.73 m2) day) 1 Type 1 – Glomerular Normal Present at Hyperfiltration hyperfiltration and ≥90 <30 Type 2 – diagnosis hypertrophy Normal hypertension 2 Type 1 – GBM thickening, Normal mesangial Silent 60-89 <30 Type 2 – <5 expansion and Normal sclerosis hypertension 3 Type 1 – Increased 30- Incipient Microalbuminuria 30-59 Type 2 – 5-15 300 normal hypertension 4 Overt Macroalbuminuria 15-29 >300 Hypertension 15-25 5 Uraemic/ESRD Diffuse global <15 (or Hypertension ↓ 25-30 glomerulosclerosis dialysis) ESRD, end-stage renal disease; GBM, glomerular basement membrane; GFR, glomerular filtration rate; UAE, urinary albumin excretion.

1.3.2.2 Pathophysiology of Diabetic Kidney Disease It has become increasingly evident that hyperglycaemia alone does not explain the high prevalence of DKD. It is thought that development of DKD is due to a combination of

42 haemodynamic factors, metabolic influences and inflammation. Prior to the development of DKD, the kidneys of diabetic patients may undergo metabolic changes such as hyperaminoacidemia and hyperglycaemia that alter kidney haemodynamics leading to inflammation and fibrosis213–215 (Figure 1.10). Glomerular hyperfiltration due to diabetes is also thought to contribute to DKD development, although the mechanisms involved are not yet fully understood216. It has been proposed that increased proximal tubular reabsorption of glucose in the proximal tubule by sodium–glucose cotransporter 2 leads to decreased distal delivery of solutes such as sodium chloride to the macula densa217,218. This may lead to a decrease in tubuloglomerular feedback which may dilate the afferent arteriole increasing glomerular perfusion. At the same time, increased production of angiotensin II at the efferent arteriole produces vasoconstriction therefore leading to an overall increase in intraglomerular pressure and glomerular hyperfiltration215,217.

43

Figure 1.10| Pathophysiology of Diabetic Kidney Disease174 The pathogenesis of diabetic kidney disease is like that of chronic kidney disease with chronic hyperglycaemia thought to be the main pathologic event causing damage to the kidney mediated by glomerular hyperfiltration, direct effects of hyperglycaemia, advanced glycosylation end products (AGE) and cytokine secretion. Dilation of the afferent arteriole is thought to be mediated by hyperglycaemia and high concentrations of insulin-like growth factor-1. Increased glomerular filtration rate results from enhanced sodium transport due to hyperfiltration of glucose which alters sodium-glucose transport in the proximal convoluted tubule. Hyperglycaemia and AGE induce mesangial matrix expansion, cellular expansion, apoptosis and increase basement membrane permeability. Elevated levels of cytokines such as VEGF, TFG-β and profibrotic proteins are also thought to cause damage to nephrons.

44

1.3.2.2.1 Haemodynamic Pathway Under normal circumstances, cleavage of angiotensinogen by renin produces angiotensin I, which is further cleaved by angiotensin-converting enzyme (ACE) into the physiologically active angiotensin II, which is involved in maintenance of blood pressure by vasoconstriction. The two major binding sites for angiotensin II in the kidneys are angiotensin II receptor type

1 (AT1) and angiotensin II receptor type 2 (AT2). Of these receptors, AT1 is known to be involved in vasoconstriction, aldosterone release, stimulation of tubular transport, inflammation, fibrogenesis and growth219. In DKD, the RAAS is a common therapeutic target and ACE inhibitors and angiotensin II receptor blockers have proven efficacy in reducing albuminuria and retarding loss of renal function, slowing the eventual progression to DKD and ESRD220–223. The other main haemodynamic pathway involved in DKD is the endothelin system, particularly endothelin-1 (ET-1) which also stimulates vasoconstriction of the efferent arteriole. Within the kidney, the physiological functions of ET-1 mimic those of RAAS and are involved in hypertension, endothelial dysfunction, inflammation and fibrosis224. Enhanced ET-1 expression also activates a signalling cascade stimulating growth and proliferation of mesangial cells and production of extracellular matrix, resulting in inflammation and fibrosis in the kidneys. ET-1 may also be capable of activating receptors which increase permeability of the glomerulus thus enhancing albuminuria and accelerating the progression of DKD225. The RAAS and endothelin systems are also known to interact with each other, resulting in systemic and intraglomerular hypertension, increased expression of TGF-β and connective tissue growth factor resulting in further accumulation of connective tissue and fibrosis of the kidney198.

1.3.2.2.2 Metabolic Pathway In 2001, Brownlee detailed the metabolic pathway of DKD in which hyperglycaemia disrupts normal glycolysis and stimulates the polyol pathway, the hexosamine pathway, production of advanced glycation end products (AGE) and activates protein kinase C (PKC)170. An overview of this can be seen in Figure 1.11. During hyperglycaemia, glyceraldehyde-3- phosphate dehydrogenase (GADPH) is inhibited by excess superoxide production from the electron-transport chain, which impedes glycolysis and results in accumulation of glucose, glucose-6-phosphate, and fructose-6-phosphate226,227. Excess glucose enters the polyol pathway and is converted to sorbitol by reduced nicotinamide adenine dinucleotide phosphate (NADPH)-dependent aldose reductase, sorbitol is then converted to fructose. This process results in decreased levels of NADPH that in turn inhibits the regeneration of reduced glutathione (GSH). Decreased levels of GSH may have a role in increasing oxidative stress and

45 cellular apoptosis170,228. During the oxidation of sorbitol to fructose NADH concentration increases which inhibits GADPH activity and increases the formation of methylglyoxal and diacylglycerol, which then enter the AGE and PKC pathways170. The fructose generated has also been implicated in increased proteinuria, reduced GFR and increased fibrosis within the kidneys228,229. Fructose-6-phosphate is produced during the third step of glycolysis and in hyperglycaemia this enters the hexosamine pathway. In this pathway fructose-6-phosphate is converted to glucosamine-6-phosphate which then acts as a substrate to promote transcription of the inflammatory cytokines; tumour necrosis factor-α (TNF-α) and TGF-β1170. Increased TGF-β1 levels are involved in increasing renal cell hypertrophy and mesangial matrix components, two classic characteristics of DKD230,231. AGE are produced due to the irreversible glycation of proteins that takes place in hyperglycaemia225. Three pathways are primarily responsible for production of AGE precursors: oxidation of glucose to glyoxal, degradation of Amadori products, and aberrant glycolysis which forces the conversion of glyceraldehyde-3-phosphate to methylglyoxal. AGE then damage both intracellular and extracellular proteins resulting in structural changes and increased expression of ECM proteins225,232. AGE are also capable of binding various pro-inflammatory receptors thus activating downstream release of cytokines such as IL-1, IL-6, and TNF-α; growth factors such a TGF-β1, vascular endothelial growth factor (VEGF), platelet-derived growth factor subunit B, connective tissue growth factor; and increased production of ROS233–235. The other main metabolic pathway thought to be involved in DKD, the PKC pathway, results from the conversion of glyceraldehyde-3-phosphate into dihydroxyacetone phosphate and then diacylglycerol, when then facilitates PKC activation236. During hyperglycaemia, prolonged diacylglycerol upregulation contributes to sustained PKC activation237. PKC plays a number of roles in DKD, including promoting glomerular hyperfiltration236, abnormal renal blood flow, altered capillary permeability and altered expression of nitric oxide synthase which has been associated with severe proteinuria, declining renal function, and hypertension238,239. PKC also mediates increase in VEGF which, as noted above, is linked to abnormal intrarenal blood flow, altered capillary permeability, the development of microalbuminuria and fibrosis due to glomerular basement membrane thickening and ECM accumulation236,240.

46

Figure 1.11| Key Components of the Metabolic Pathway of Diabetic Kidney Disease170 Under normoglycaemic conditions most of the body's glucose is used in glycolysis and TCA cycle. During chronic hyperglycaemic conditions, excess glucose may enter the following pathways: Polyol pathway whereby it is reduced to sorbitol which is then oxidised to form fructose; Hexosamine pathway in which glutamine fructose-6-P amidotransferase (GFAT) converts fructose 6-P to glucosamine 6-P. This is followed by the formation of UDP-GlcNAc which is the substrate for protein translational modifications. This pathway is known to be involved in insulin resistance and diabetes; The protein kinase C (PKC) activation pathway originates from either fructose-6-P or glyceraldehyde-3-P is converted to Glycerol-3-P which forms diacylglycerol (DAG) capable of activating several isoforms of PKC, involved in numerous signalling pathways; The advanced glycation end products (AGE) pathway in which AGE are produced via the formation of methylglyoxal from glyceraldehyde-3-P which then reacts with residues of proteins, forming advanced glycation end products. The activation of these pathways along with oxidative stress due to excess superoxide production is thought to have an important role in the development of diabetic kidney disease.

47

1.3.2.2.3 Inflammation In addition to the haemodynamic and metabolic components involved in DKD development, it has been suggested that chronic activation of the innate immune system and a low-grade inflammatory state in diabetic patients also contribute to the pathogenesis of this disease241,242. Nuclear factor kappa-light-chain-enhancer of activated B cells (NF-κB) is a transcription factor capable of regulating the expression of multiple genes involved in inflammation, immunity, apoptosis, and is found in a range of cell types in the human kidney243,244. Expression of NF-κB is increased during hyperglycaemic conditions245 and in DKD, NF-κB activation is correlated with proteinuria and interstitial cell infiltration which further stimulates release of NF-κB leading to a vicious cycle resulting in persistent proteinuria243,244,246. ROS produced in a hyperglycaemic environment activate Janus kinase-2 (JAK2) in the renal and vascular tissue247 which is associated with hypertrophy of mesangial cells241. In subjects with DKD an inverse correlation has been observed between JAK2 mRNA levels and estimated GFR248. A number of inflammatory cytokines such as TNF-α and Interleukins (IL-1, IL-6, and IL-18) have also been observed in DKD249,250. In animal models, upregulation of TNF-α and IL-6 has been associated with increased kidney weight and urine albumin excretion249. Additionally, IL-18 and TNF-α levels also correlated positively with levels of albuminuria in patients with DM251,252. At the cellular level, these cytokines are thought to have various roles in inflammation associated with DKD including increased kidney weight, albuminuria, increased vascular endothelial cell permeability, promotion of glomerular hypercellularity and GBM thickening, inducing apoptosis of endothelial cells, and are directly toxic to cells253–256.

The pathogenesis of DKD is related to a complex interaction between haemodynamic, metabolic and inflammatory pathways. The mediators of the kidney injury include several cytokines and growth factors that produce oxidative stress, abnormal glycosylation and lipid peroxidation. These events lead to increased inflammation and structural changes in the glomeruli, resulting in dysfunction of the glomerular basement membrane barrier and the onset of microalbuminuria198,257. Although hyperglycaemia and albuminuria are known to be crucial to the development of DM and DKD, this does not explain why many patients with uncontrolled DM do not go on to develop DKD and this has led to the idea that the pathogenesis of this disease may be influenced by genetic components258.

1.3.2.3 Treatment of Diabetic Kidney Disease The main aim when treating DKD is to prevent progression to macroalbuminuria, maintain renal function and decrease associated comorbidities. For the past 30 years, RAAS blockade,

48 in conjunction with glycaemic control, weight loss and blood pressure control have remained the mainstay of treatment for DKD. Recent attempts to develop novel treatments for DKD such as early renin RAAS blockade in patients with T1DM and T2DM, dual RAAS blockade, protein kinase C-beta inhibition, treatment with endothelin receptor antagonists and treatment with the anti-oxidant bardoxolone have been unsuccessful259. In more recent years, new components of the RAAS system have been identified and may be potential therapeutic targets for DKD treatment, for example angiotensin-converting enzyme 2 (ACE2) which abundant in the kidneys but is not responsive to conventional ACE inhibition260 has been seen to counteract the adverse effects of angiotensinogen II and is thought to provide renoprotection by reducing oxidative stress, inflammation, and lipotoxicity260. Animal models have also shown that increase activity of ACE2 in podocytes may slow the development of DKD261. Like RAAS, the natriuretic peptide (NP) system has been proposed as a possible target for DKD treatment. The enzyme, neprilysin is involved with the degradation of NPs, and several other vasoactive peptides such as bradykinin, substance P, angiotensin II and endothelin262. Inhibition of neprilysin leads to natriuresis, vasodilatation, reduced intraglomerular pressure and decreased proteinuria262. Sodium-glucose cotransporter-2 (SGLT2) inhibitors are glucose lowering agents known to reduce blood pressure, weight, improve arterial stiffness and may offer renal protection through tubuloglomerular feedback263–267. Studies in humans have shown that inhibition of SGLT2 results in lowering of plasma uric acid levels by 10-20%268–270. which may be associated with renal protection270. GLP-1 receptor agonists and DPP4i are incretin-based therapies that are glucose lowering agents and may have renoprotective effects. Endogenous GLP-1 increases pancreatic β-cell insulin secretion and inhibits glucagon secretion to lower glucose. DPP-4i rapidly breaks down GLP-1 at the brush border of proximal tubules and glomerular podocytes271. Clinical data on such therapies is limited but DPP4i treatment appears to slow development and progression of microalbuminuria in T2DM272. Therapies which target inflammatory and fibrotic pathways are also of interest for treating DKD for example those which block the formation of AGE and TGF-β such as the AGE inhibitor pyridoxamine273 and the Janus kinase inhibitor baricitinib274, which was reported to reduce albuminuria, urinary levels of IP-10 and plasma levels of tumour necrosis factor-receptor 2 after treatment for 6 months274. Further work is required to determine if these effects on translate into preservation of renal function over time. Inhibition or reduction of TGF-β may also have anti fibrotic effects in the kidney, although more studies are required to test the safety and efficacy of such treatments275. The limited success of DKD treatments is disappointing. More personalized diet-management

49 plans combined with targeted therapies specific to each patient, and stage of DKD, may be required to improve quality of life and prolong survival in those suffering from DKD.

1.3.2.4 Genetics of Diabetic Kidney Disease The idea that some of those persons with DM have an underlying genetic predisposition to develop DKD was first reported by Seaquist and colleagues in 1989, based on the observation that 83% of diabetic siblings of those with DKD had evidence of diabetic nephropathy compared to 17% of diabetic siblings of patients without DKD210. These observations have since been reproduced by others and further evidence of a genetic influence in DKD has emerged, such as the ethnic differences in susceptibility to DKD276. Advances in molecular genetics and genomic techniques such as genetic linkage analysis and GWAS have allowed the investigation of specific genetic susceptibility variants or genes involved in DKD277–279. Genetic linkage analyses are family based studies, which make use of linkage disequilibrium (LD) in disease causing variants at specific genetic loci280. LD occurs when the different alleles of a particular gene are found at a higher frequency than would be expected if the loci were inherited independently and in family-based linkage studies this focuses on the identification of a specific disease susceptibility marker which has been inherited from the same parent281. Genetic linkage studies for DKD have investigated functional polymorphisms in genes related to various systems involved in DKD such as RAAS, those involved in metabolism and homeostasis of blood glucose, IR pathways and lipid production280. These have also investigated cytogenetic locations and map loci associated with renal complications282,283. There are several limitations of genetic linkage studies, particularly the requirement for previous knowledge of the region under investigation, availability of samples from multiple generations of families, and the need for assumptions to be made regarding disease pathogenesis. Linkage studies have been used to mapping genetic diseases with a known mode of inheritance and clear associations between genotype and phenotype, such as in MODY284,285.

Over the last decade, GWAS have emerged as a useful research strategy to investigate the degree to which certain genetic biomarkers are associated with a disease state in unrelated patients compared with healthy controls. GWAS offer several advantages over traditional genetic linkage studies, most notably the use of single nucleotide polymorphisms (SNPs) as genetic biomarkers, which allows the whole genome to be surveyed without the need for biological assumptions regarding disease pathology, thus enabling the discovery of novel associations and genetic variations in areas of the genome which may not have been otherwise considered. Due to a significant amount of regional variation in genetics and the

50 requirements for a large sample size, the results of GWAS studies into DKD have often been difficult to reproduce. The ELMO1 gene codes for a protein which mediates Rac activation in cell migration286 and its expression has previously been associated with DKD287. Two SNPs in intron 16 of ELMO1 have been associated with DKD development in those with T2DM, rs11769038 (odds ratio [OR] 1.24; P= 1.7 × 10-3) and rs1882080 (OR 1.23; P= 3.2 × 10−3), both reached genome wide significance in a comprehensive analysis of the data collected in the Genetics of Kidneys in Diabetes (GoKinD) US study288, but these results were not replicated in the GoKinD UK study289. A large-scale meta-analysis investigating DKD known as the Genetics of Nephropathy – an International Effort (GENIE) consortium reported suggestive associations with T1DM for SNPs within the ERBB4 and AFF3 genes41. ERBB4 is an epidermal growth factor receptor which is co-expressed with collagen genes in the tubulointerstitial compartment of the kidney and the presence of rs7588550 (OR 0.66; P = 2.1×10−7) may have a renoprotective effect in DKD41. In the AFF3 gene the SNP rs7583877 did not achieve genome wide significance for association with DKD (OR 1.14; P= 0.002) but was found to be significantly associated with ESRD (OR 1.29; P = 1.2×10−8)41. Functional data has shown that this gene is a component of the TGFβ pathway and increased expression has been link to kidney fibrosis in cellular models41,290.

More recently the SORBS1 gene has emerged as having a renoprotective effect in DKD when the rs1326934 C allele is present (OR 0.83; P= 0.009), although this did not reach genome wide significance in the replication cohort of this meta-analysis42. In another meta- analysis of GWAS data from patients with T1DM a moderate association was reported between rs10811661 and DKD (OR 1.15; P= 0.011), and a stronger association with ESRD (OR 1.35; P= 0.00038)291. The association of this SNP, found in the CDKN2A/B locus, with DKD and ESRD in T1DM is particularly interesting as this is also thought to have a role in predisposition to T2DM, which may suggest a similar genetic basis of DKD in T1DM and T2DM98,291,292. The power of GWAS to identify true genomic associations relies heavily on sample size and repeating the same analysis with a larger sample size will often reveal significantly more genetic loci associated with the phenotype being investigated290. The work of very large scale international collaborations such as the Juvenile Diabetic Research Foundation’s Diabetic Nephropathy Collaborative Research Initiative(DNCRI), have allowed improved the accuracy and replicability of GWAS and have identified novel genetic loci involved in the development of DKD290,293,294. The development of DKD and ESRD cannot be explained by the influence of a single gene or SNP, but this heterogeneous disease may develop due to complex interactions between numerous biological pathways, genetic

51 influences and environmental risk factors. The familial aggregation of DKD would suggest a genetic basis for this disease, however this may also be an indicator that other non- genomic, yet inheritable, factors are involved such as the involvement of the mitochondrial genome and epigenetic influences.

1.4 Mitochondria and Kidney Disease

Mitochondria are double membraned organelles, the outer layer of which acts as a gateway into the organelle. An invaginated inner membrane forms the cristae which extend into the mitochondrial matrix increasing the surface area for electron transfer. Mitochondria are responsible for generation of adenosine triphosphate (ATP) to provide energy to eukaryotic cells via a series of oxidative phosphorylation reactions known as the electron transport chain (ETC) but also regulate cellular metabolism through via heme and steroid synthesis, generating ROS, establishing the membrane potential and controlling calcium and apoptotic signaling295296. It is thought that these cellular powerhouses originated through an endosymbiotic relationship with an α-proteobacterium297. Each mitochondrion contains multiple identical copies of mitochondrial DNA (mtDNA) which encodes the main proteins of the ETC as well as ribosomal RNAs (rRNA) and transfer RNAs (tRNA) required for translation of messenger RNA (mRNA)298. The basic structure of a mitochondrion can be seen in Figure 1.12.

Figure 1.12| Basic Structure of a Mitochondrion299 The mitochondrion is separated from the cytoplasm by the outer membrane. The inner membrane separates the intermembrane space from the central matrix. The inner membrane is differentiated into the inner boundary membrane and the cristae. The cristae extend into the matrix and are the main sites of mitochondrial energy conversion. Mitochondria are highly dynamic and undergo transient and rapid morphological adaptations which are crucial for many cellular processes such as cell cycle, immunity, apoptosis and mitochondrial quality control.

52

1.4.1 The Mitochondrial Genome Genes required for normal mitochondrial function are found on both mitochondrial DNA (mtDNA) and nuclear DNA (nDNA). The mitochondrial genome is a circular double stranded DNA molecule composed of 16,569 base pairs harbouring 37 genes, which encode 13 key proteins of the ETC along with two rRNAs and 22 tRNAs essential for their translation, the structure of this can be seen in Figure 1.13300. The remaining components of the ETC, along with polypeptides involved in maintaining and regulating expression of mtDNA, and mitochondrial dynamics are encoded by nuclear genes, synthesised on ribosomes in the cytoplasm and transported to the appropriate location within mitochondria301. The two strands of mtDNA are classified according to their guanine content which gives each strand a distinct buoyant density in caesium chloride gradients300. Most genetic information is found on the heavy (H) strand with genes coding for two rRNAs, 14 tRNAs and 12 polypeptides, whereas the light (L) strand contains only nine genes; eight coding for tRNAs and one polypeptide of the ETC300,302.

In contrast to nDNA, mtDNA lacks introns and non-coding intergenic regions, apart from a small regulatory region between the mitochondrial genes for phenylalanine and proline tRNAs known as the non-coding region (NCR). This 1.1 kb stretch of DNA frequently contains a third single strand of DNA 650 nucleotides long (7S DNA) which forms a displacement-loop, or D-loop303. Contained within the D-loop are the H- and L-strand promoters (HSP and LSP, respectively), as well as the origin of replication for the H-strand (OH). The prevailing hypothesis for many years was that the D-loop is an mtDNA replication intermediate formed by incomplete H-strand replication, known as the strand-displacement model (SDM)304–306.

This idea is based on the proximity of the 5’ end of the D-loop with OH, however, the high turn-over rate of 7S DNA strands would suggest a functional role beyond that of a replicative intermediate307,308. The existence of a third DNA strand permits a more open conformation of the mtDNA molecule, allowing proteins involved in replication or transcription to bind and regulate these activities309,310. Holt and colleagues311 propose an alternative function of the D-loop in regulating packaging and maintenance of mtDNA. This is thought to involve the action of a specialised ATPase, ATAD3p, which facilitates aggregation and separation of mtDNA and anchors these molecules to the inner mitochondrial membrane311,312. This may indicate that the D-loop is a functional element of mtDNA packing and maintenance, although this again does not explain the high turnover of 7S DNA and the apparent presence

313 of the ATAD3 N-terminal outside the mitochondrial matrix . According to the SDM, the OH within the NCR is essential for replication of mtDNA with initiation of H-strand replication

53 beginning here and continuing two thirds of the way around the molecule until origin of replication for the L-strand is exposed, thus initiating synthesis of the L-strand, after which H- and L-strand synthesis continues to produce two daughter mtDNA molecules314,315. Alternative models of mtDNA replication have also been proposed such as RNA incorporation in the D-loop during mtDNA replication and the strand-coupled bi-directional replication model316,317. More recently it has emerged that a diverse range of mechanisms are present across different metazoan taxa and more evidence is still required to fully understand the process of mtDNA replication318.

Figure 1.13| Mitochondrial DNA and Proteins of the Electron Transport Chain Each mitochondrion contains multiple copies of mitochondrial DNA (mtDNA). The mitochondrial genome (a) is a circular genome composed of 16,569 base pairs and encodes 37 genes. 14 of these genes code for proteins of the electron transport chain (b) as well as 2 ribosomal RNAs and 21 transfer RNAs required for translation of messenger RNA. The colour of the genes in (a) are matched to the colours of the ETC proteins in (b), Complex II is encoded entirely by nuclear genes, highlighting the importance of nuclear genes for proper mitochondrial function. mtDNA is often located closed to stalked particles that contain the proteins of the ETC which may contribute to a higher mutation rate compared with nuclear DNA, due to ROS generated as a respiratory by-product 300,302,319.

54

In humans, mitochondrial inheritance is uniparental, passing to the next generation through maternal gametes, with numerous mechanisms in place to eliminate paternal mitochondria and mtDNA from zygotes contributing to mtDNA homoplasmy in offspring. As each mitochondrion contains multiple copies of maternally inherited mtDNA and each cell contains many mitochondria it is possible for mutations to occur during replication which will affect some mtDNA molecules but not others. This phenomenon is known as heteroplasmy as there is a mixture of mutant and wild type mtDNA molecules within the same mitochondrion, cell or tissue. Mitochondrial heteroplasmy may contribute to disease development if mutant mtDNA molecules with a group of cells or tissues exceed a certain threshold resulting in metabolic dysfunction and disease320. A genetic bottleneck exists to act against the high copy number and high mutation rate of mtDNA, reducing opportunity for genetic selection by ensuring only a subset of mitochondrial genomes are transferred to the oocyte during ovum formation321–323. This genetic bottleneck along with uniparental inheritance of mtDNA acts to limit the amount of variation within individuals, selecting against deleterious mtDNA variants while still permitting variation between individuals thus conserving functional mtDNA throughout many generations. During an individual’s lifetime the actions of ROS and the resulting mtDNA mutations are thought to be directly involved in various disease mechanisms and the process of aging324. Mitochondrial dysfunction may be due to germline mutations in either mtDNA or nDNA, which directly affect proteins of the ETC or alter the expression of genes coding for ETC components. Mitochondrial dysfunction may also result from sources outside of the ETC, these include inherited mutations to genes not involved with oxidative phosphorylation (OXPHOS) and oxidative stress due to environmental aggravation325.

Although the exact mechanisms of replication in mtDNA are not yet fully understood, a number of nuclear encoded enzymes are essential to this process including: the catalytic subunit of mitochondrial DNA polymerase γ (POLγ) and its corresponding processivity factor (POLγB)326; Twinkle DNA helicase327; mitochondrial RNA polymerase (POLRMT)315; mitochondrial single-stranded DNA-binding protein (mtSSB)328; Ribonuclease H1 (RNaseH1)329; and DNA ligase III330. Dysfunction of these enzymes or mutations within the corresponding genes has been observed to impede the mitochondrial replication machinery. POLγ and Twinkle have been implicated in several diseases with known mitochondrial involvement331–333. The involvement of these nuclear proteins in mitochondrial replication highlights the importance of nDNA for regulation and maintenance of mtDNA. During transcription of nDNA each gene transcribes a corresponding mRNA, containing a unique set

55 of codons from a single cistron which codes for a single protein, this is known as monocistronic mRNA. In contrast to this, mitochondrial mRNA is polycistronic and a single mRNA is able to code multiple proteins334. The mitochondrial genetic apparatus also uses a reduced number of tRNAs for translation of codons by using a specialised tRNA with an unmodified uridine, with conformational flexibility, in the first anticodon position capable of pairing with any one of the four bases335. This results in the ‘four-way wobble rule’ or ‘super wobbling’ which allows 22 tRNA species to translate all mitochondrial genes, as opposed to the 32 tRNAs required for translation of nDNA336. In addition to this, translation of mitochondrial genes to proteins deviates from the universal genetic code and the same codon sequence may code for different amino acids in different species337. There is also a degree of uncertainty surrounding the termination of translation in human mitochondria with evidence suggesting multiple mechanisms of termination including a number of stop codons and the involvement of various proteins and release factors338,339.

1.4.2 Regulation of Mitochondria by Nuclear Genes Nuclear genes are directly involved in regulating mtDNA replication and maintaining mitochondrial homeostasis. Nuclear genes are also essential for many other mitochondrial processes with at least 2,526 autosomal genes coding for 22, 713 transcripts thought to be involved with mitochondria. Many nuclear encoded mitochondrial genes (NEMGs) code for proteins which are synthesised on ribosomes in the cytosol, migrate to the mitochondria, and are transported across the mitochondrial membrane based on the presence of a specific N-terminal presequence340. Other NEMGs do not code directly for mitochondrial components but regulate the expression of a range of genes which are essential for normal mitochondrial biogenesis and function.

The most obvious example of this relationship between nDNA and mitochondria can be seen in the succinate dehydrogenase enzyme complex which forms complex II of the ETC and is coded for entirely by nuclear genes341–343. RecQ-like helicase 4 (RECQL4) is a dynamic protein important for protein-protein interactions in nDNA replication but also localises to the mitochondrion where it interacts with POLγ in order to maintain mtDNA integrity344. Mutations in the RECQL4 gene have been shown to result in increased mtDNA copy number and mitochondrial dysfunction345. This protein may also act to regulate mitochondrial transport of p53 tumour suppressor and RECQL4 dysfunction may decrease p53 activity346. The polypeptide encoded by the DNA2 gene contains two independent functional domains: the N-terminal nuclease and the C-terminal helicase domain347. This protein is known to

56 function in the nucleus and is implicated in telomere maintenance and also limits accumulation of DNA double-strand breaks, prevents internuclear chromatin bridge formation and increases homologous recombination frequency348,349.

Data from immunofluorescence microscopy suggest that protein products of DNA2 are mainly found in mitochondria where they have been shown to interact with Polγ and Twinkle349,350. These interactions may be important in repairing oxidative lesions in mtDNA, mtDNA maintenance and protection against mtDNA related diseases350,351. Petite integration frequency 1 (PIF1) is a member of the superfamily 1 helicase family and is involved with 5'– 3' DNA unwinding352. While this protein also acts in both nuclei and mitochondria, its localisation is regulated by alternative splicing resulting in α and β isoforms. The PIF1 α isoform is directed to the nucleus where it is involved in regulating the cell cycle by restarting stalled replication forks352–354. In contrast, the PIF1 β isoform localizes to the mitochondria, with some nuclear activity353. Evidence suggests that PIF1 may associate with mtDNA and mitochondrial inner membranes355, has a role in reducing mtDNA double-strand breaks356, repairing damaged mtDNA357 and may regulate the activity of Twinkle358. A DExH-box helicase known as suppressor of Var1 3-Like Protein 1 (SUV3) is thought to be indirectly involved in mtDNA stability and copy number regulation359. SUV3 is an active ATPase which forms a complex with polynucleotide phosphorylase and mitochondrial polyadenylation polymerase in the lumen of mitochondria360. This complex is thought to be important in regulating mitochondrial RNA levels and may also interact with nDNA replication and repair factors in the nucleus361,362. This role as a mitochondrial RNA helicase and involvement in nuclear genome maintenance may indicate a mechanism for detecting and responding to nDNA damage by inducing mitochondrial RNA degradation347.

These interactions between mtDNA and nDNA highlight the symbiotic relationship between the mitochondrial organelle and the nucleus. Due to this relationship, dysfunction of genes in mtDNA or nDNA can have devastating consequences and mitochondrial dysfunction has been implicated in a vast number of diseases including CKD and DKD.

1.4.3 Mitochondrial Energetics of the Kidney Of the organs in the human body, the kidneys are second to the heart in terms of energy demands, mitochondrial content and oxygen consumption363,364. This is necessary to produce the energy required for removal of waste from blood, reabsorption of nutrients, regulation of electrolyte and fluid balance, maintenance of acid–base homeostasis, and regulation of blood pressure141. Mitochondria also provide the energy required by Na+–K+-

57

ATPase to create ion gradients across the cellular membrane and to facilitate active transport in the proximal tubule, the loop of Henle, the distal tubule and the collecting duct to allow ion reabsorption 365. Energy demands, and in turn mitochondrial content are much higher in the proximal tubules in comparison to the glomerulus as glomerular filtration is a passive process whereas the proximal tubules require large numbers of active transport mechanism in order to reabsorb 80% of the filtrate that passes through the glomerulus366. Because of these high energy requirements, mitochondrial dysfunction in the kidneys may severely affect renal health and has previously been implicated in CKD development.

1.4.4 Mitochondria and Chronic Kidney Disease Damage or mutations in mtDNA will most likely lead to primary defects in mitochondrial OXPHOS due to dysfunction of the ETC components which can result in a range of clinical symptoms involving several different organs broadly termed as mitochondrial diseases. Defects in NEMGs may also lead to OXPHOS defects, abnormal protein translation and loss of mtDNA copy number367. Mitochondrial diseases often come with renal complications, due to an abundance of mitochondria and the high energy demands of the kidney e.g. defects in

CoQ10 synthesis and the mtDNA 3243 A>G mutation are known to cause renal complications including focal segmental glomerulosclerosis368–375. Tubular defects are the most common renal manifestation of mitochondrial disorders from mutations in mtDNA or NEMGs. Mutations in the mitochondrial isoleucine tRNA gene (tRNAIle) have been associated with mitochondrial hypomagnesaemia in a familial study of white individuals376. Fanconi syndrome which is secondary to kidney tubular dysfunction has been linked with mutations in the gene EHHADH377.

Due to the maternal inheritance of mtDNA, mitochondrial disease due to mtDNA mutations are more easily characterised, mutations in NEMGs on the other hand are inherited according to classic Mendelian rules and therefore mutations in one or more of the 2,526 NEMGs may impair mitochondrial function leading to tubular and glomerular injury. Acquired damage to tubular mitochondria has also been shown to alter mitochondrial dynamics, mitophagy and biogenesis in AKI which is risk factor for later CKD368. Numerous NEMGs, previously reported to be involved with CKD along with their associated biological functions can be seen in Appendix 1. Further investigation of the NEMGs involved with CKD revealed many were also known to have important roles in maintaining mitochondrial function and renal health. Dysregulation of these genes due to acquired or inherited mutations may affect several

58 pathways which may initiate a vicious cycle aggravating renal damage and leading to CKD. Many NEMGs are essential for mitochondrial biogenesis and normal mitochondrial function, particularly PGC-1α which regulates the expression of TFAM, COX6C, COX7C, UQCRH, MCAD, SIRT3, and NRF1378. PGC-1α and its downstream targets are reported to be downregulated in peripheral blood mononuclear cells of CKD patients. Induction of PGC-1 by ROS may act as a protective adaption to reduce further ROS generation379. Other genes involved with mitochondrial biogenesis are upregulated in CKD in response to increased ROS production, these include NFE2L2 (regulated by PGC-1α), SOD2, Complex 1 components (NDUFS5, NDUFA6, NDUFA1 and NDUFB1), UQCRH, UQCRB, ATP5I, ATP5J, and ATP5O379–381. Upregulation of these genes may be an attempt to compensate for increased ROS generation resulting from OXPHOS dysfunction. The genes coding for subunit 6C and 7C of Complex IV (COX6C and COX7C) were also seen to be upregulated in CKD, however, complex IV activity was reduced due to chronic oxidative stress and oxidant injury resulting from inhibition of the OXPHOS system380.

Defects in these genes may result in ineffectual clearance of ROS which may further exacerbate mitochondrial damage and lead to more ROS being produced due to downregulation of AIFM1382 and upregulation of NOX4383,384. This vicious cycle of ROS production results in oxidative stress within the mitochondria leading to downregulation of ATG5 and BECN1385. It has been suggested that reduced expression of ATG5 and BECN1 may lead to abnormal or impaired autophagy which in turn leads to further ROS generation, deregulated mitochondrial fission and reduced mitochondrial function386–388. BNIP3 is an important mediator of mitophagy and upregulation of this gene has been associated with sarcopenia in CKD along with decreased mtDNA copy number and reduced mitochondrial function193. The intrinsic pathway of apoptosis has also been implicated in CKD. Downregulation of the anti-apoptotic protein BCL-xL, coded for by BCL2L1, combined with upregulation of pro-apoptotic proteins coded for by BAX and BAK1 leads to increased apoptosis and depolarisation of the outer mitochondrial membrane in proximal tubular cells. Upregulation of BAX and downregulation of BCL2L1 activates the intrinsic apoptotic pathway, permeabilising the inner mitochondrial membrane leading to cytochrome C release389–391.

Uncontrolled apoptosis may eventually lead to inflammation and fibrosis commonly observed in many forms of CKD. Experimental models of CKD have shown increased expression of several genes involved with inflammation and fibrosis. Expression of TGFβ has been seen to increase in response to apoptosis in primary tubular cells exposed to albumin stimulating ECM formation and the extrinsic apoptotic pathway392,393. Upregulation of PPARγ

59 is common in several forms of renal disease and affects renal parenchymal cells stimulating various pathways important in CKD development including fibrotic, inflammatory, immune, proliferative, reactive oxygen and mitochondrial injury pathways394. Upregulation of NLRP3 and its downstream targets CASP1 and proinflammatory cytokines IL-18 and IL-1β are upregulated in CKD in response to mitochondrial ROS generation. Activation of the NLRP3 inflammasome leads to increased ECM deposition, apoptosis and fibrosis of renal cells. Knockout of NLRP3 or Inhibition of ROS have been shown to reduce mitochondrial dysfunction, reduce apoptosis, decrease ECM deposition and protect against renal fibrosis395– 398.

Several NEMGs encode proteins that are critical to normal podocyte function. Progression of CKD is associated with podocyte dysfunction so it is plausible that NEMG variants involved with podocyte function could influence susceptibility to kidney disease. Cathepsin D, coded for by CTSD, is a lysosomal proteinase involved with lysosomal degradation, autophagic degradation and contributes to maintaining podocyte homeostasis. CTSD may be downregulated or inactivated in CKD and loss of this protein leads to impaired autophagy, resulting in the accumulation of toxic subunit c–positive lipofuscins and slit diaphragm proteins followed by apoptotic cell death399. Expression of ITCH is increased in mouse models of kidney disease400. This gene is regulated by the Src kinase Fyn and modulates various signalling pathways including TGFβ and EGF through ubiquitin and non-ubiquitin mediated mechanisms401–404. The interactions with Fyn, TGFβ and related signalling molecules may suggest a role of ITCH in regulating glomerular sclerosis and podocyte function400. Downregulation of NRP2 was observed in animal models of kidney disease and NRP2 knockout mice displayed progressive glomerular damage when exposed to a podocyte toxin400. This gene is expressed in the glomerulus and regulates vascular endothelial growth factor signalling and downregulation may contribute to podocyte damage in CKD405.

The diagram in Figure 1.14 provides a summary of these genes along with the biological process they are involved with. Dysfunction of one or more of these genes or pathways has been observed in CKD and the diagram provides a hypothetical framework to understand the complex role of mitochondrial dysfunction in CKD. Previous research has also shown support for this model with other authors suggesting mitochondrial dysfunction resulting from inherited or acquired genetic defects may lead to development of various forms of kidney disease such as AKI, DKD, glomerular diseases, tubular diseases and CKD368,406,407. According to this framework, mitochondrial dysfunction and OXPHOS defects increase ROS generation and reduce ATP production that leads to increased oxidative stress which may lead to

60 uncontrolled autophagy, mitophagy and further ROS production. Mitochondrial dysfunction, ROS generation and the resulting dysregulation of autophagic mechanisms may also lead to an upregulation of the intrinsic pathway of apoptosis that in turn leads to inflammation and fibrosis in the renal tubules, glomerulus and podocytes. This damage may then lead to further autophagy leading to another vicious cycle of apoptosis, fibrosis and inflammation which will further reduce podocyte function eventually leading to irreversible podocyte injury and progression to CKD.

61

Increased production of reactive oxygen species Increased oxidative stress

Abnormal mitochondrial biogenesis and function Abnormal autophagy Downregulation of PCG-1α, and mitophagy NRF1, TFAM, MCAD and AIFM1 Downregulation of Increased intrinsic NFE2L2, ATG5, CTSD and BECN1 Upregulation of apoptosis NDUFS5, NDUFA6, NDUFA1, Upregulation of BNIP3 NDUFB1, AT5I, AT5J, AT5O, Downregulation of BCL2L1 APOL1, NOX4 and UQCRB and AIFM1 Upregulation of BAX, BAK1 and HIF1

Reduced podocyte function and Increased inflammation and detachment fibrosis

Downregulation of Upregulation of NLRP3, CASP1, NRP2. KLF6 and CTSD PPARγ and TGFβ

Upregulation of ITCH

Irreversible podocyte damage

Chronic kidney disease Indicates damage or dysfunction of unknown origin

Figure 1.14| Mitochondrial Dysfunction in CKD The interrelated genes and processes provide a structure to better understand the role of mitochondrial dysfunction in CKD. This diagram outlines some of the genes and mechanisms involved which lead to an accumulation of mitochondrial damage within tubular cells which may lead to activation of intrinsic apoptosis mechanisms and eventual effacement and irreversible podocyte damage leading to CKD.

62

1.4.5 Mitochondria in Diabetic Kidney Disease Excess ROS produced due to over-activity within ETC have been hypothesised to have an important role in modulating the metabolic pathways involved in DKD development170. Superoxide is a form of ROS produced as a by-product of the ETC, primarily from respiratory complexes such as flavoproteins, iron-sulphur clusters and ubisemiquinone408. The main functions of mitochondrial ROS are highlighted in Figure 1.15. In healthy individuals, ROS are a key component of the redox signalling pathway, however over-production of ROS can cause oxidative damage in many disease states409. In cell models of hyperglycaemia, excess glucose is oxidised to produce a surplus of electron donors from the tricarboxylic acid cycle. To prevent overloading of the ETC these electrons are donated to free oxygen molecules from coenzyme Q therefore increasing the amount of superoxide produced, which leads to an aggregation of glycolytic intermediates, resulting in activation of the metabolic pathways of hyperglycaemic damage previously discussed in Section 1.2.3.2.2410. However, clinical trials with antioxidants for treatment of diabetic complications have failed to show significant benefit411 and much of the data supporting this theory was derived from animal and cell culture models of diabetes which may not accurately reflect the superoxide concentration in vivo410,412. The lack of evidence for excessive superoxide production in the diabetic kidney led to the development of an alternative theory for the development of DM complications known as mitochondrial hormesis412. Mitochondrial hormesis describes a beneficial role for increased production of superoxide from mitochondria based on recent studies investigating ROS production in response to in vitro models of hyperglycaemia413 and in response to exercise414. Reduced mitochondrial superoxide and reduced ETC activity as a consequence of prolonged hyperglycaemia may lead to reduced activity of AMPK and decreased phosphorylation of endothelial nitric oxide synthase resulting in downstream stimulation of inflammatory pathways and vascular complications410,412. Therapeutic approaches based on mitochondrial hormesis would therefore aim to increase mitochondrial superoxide production to "normal levels” and reduce the damaging effect of diseases such as DKD412.

63

Figure 1.15| Effects of Reactive Oxygen Species Generated from Mitochondria409 Reactive oxygen species (ROS) produced by mitochondria may cause oxidative damage to mitochondrial proteins, membranes and DNA, which impairs the ability of mitochondria to synthesize ATP and perform their metabolic functions that are central to the normal operation of most cells. Mitochondrial oxidative damage also increases release of intermembrane space proteins such as cytochrome c (cyt c) to the cytosol by mitochondrial outer membrane permeabilization (MOMP) and activates the cell's apoptotic machinery. Mitochondrial ROS production also induces the mitochondrial permeability transition pore (PTP), rendering the inner membrane permeable to small molecules during conditions such as ischaemia/reperfusion injury. Consequently, mitochondrial oxidative damage contributes to a wide range of pathologies, including diabetic kidney disease. In addition, mitochondrial ROS are involved in redox signalling, reversibly affecting the activity of a range of functions in the mitochondria, cytosol and nucleus.

The conflicting evidence over the role of ROS in diabetes and in the development of secondary complications highlights a need to investigate the mitochondrial genome to improve our understanding of the mechanisms involved in DKD development. Although the exact mechanisms are not yet clear, it has been suggested that an accumulation of mtDNA damage and severe dysfunction of mitochondria in certain patients with DM results in upregulation of inflammatory mediators such as MYD88 and NF-κB which may contribute to the development of DKD415. Cells with a higher energy consumption are known to harbour larger mtDNA copy numbers, however altered mtDNA levels have also been associated

64 with various diseases including DKD416,417. In vitro cell models using peripheral blood mononucleocytes and renal mesangial cells from patients with DKD as surrogates for kidney tissues suggest that hyperglycaemia may lead to increased mtDNA copy numbers in renal cells prior to mitochondrial dysfunction416. The close proximity of mtDNA to the site of ROS generation may contribute to this damage, which will result in further mitochondrial dysfunction, a decrease in mitochondrial protein synthesis and increasing heteroplasmy of mtDNA415,418. Under normal circumstances, ROS are converted to oxygen and hydrogen peroxide in a dismutation reaction which is catalysed by superoxide dismutase (SOD) and several SNPs in gene coding for manganese SOD have been associated with increased risk of DKD in patients with T1DM419. Selective degradation of dysfunctional mitochondria by autophagosomes through a process known as ‘mitophagy’420 has been suggested to have a role in the development of DKD421. A number of studies have shown a potential cytoprotective role of mitophagy within the glomeruli of the kidney where it has been shown to regulate accumulation of collagen and impairment of these mechanisms is known to contribute to fibrosis in DKD360,422. Mouse models have also been used to better understand the interaction between nDNA and mitochondria. A novel role of Rho-associated coiled coil- containing protein kinase 1 (ROCK1) in DKD was identified using three mouse models423. ROCK1 activation in a high-glucose environment was shown to increase mitochondrial fragmentation and is also involved in dynamin-related protein-1 signalling and mitochondrial remodelling423.

Although the exact mechanisms of DKD are not yet clear, there is a clear role of the mitochondria in this disease however, more evidence is required to investigate the true relationship between mitochondrial dysfunction and development of DKD. Due to the limited coding capacity in the mtDNA and the higher susceptibility to mutation there is a significant degree of interaction between the mtDNA and nDNA, however both are also susceptible to epigenetic modifications that can result in a range of complications and may be involved in DKD.

1.5 Epigenetics

Epigenetic modifications are those which affect gene expression without directly altering the DNA sequence. These changes can be inherited between generations or may result from environmental exposures acquired during a lifetime. The study of epigenetic modifications gained widespread attention through studies investigating epigenetic features in the children and grandchildren of those exposed to famine during pregnancy and early in life424,425. Known

65 epigenetic mechanisms include DNA methylation, histone modifications, chromatin remodelling and regulation by non-coding RNA. DNA methylation is the covalent addition of a methyl group to a residual cytosine adjacent to a guanine nucleotide (CpG dinucleotide) through the actions of a family of DNA methyltransferase enzymes426. The promoter regions of genes are rich in CpG dinucleotides and can be subject to both hypermethylation and hypomethylation which are associated with transcriptional silencing and activation of gene expression, respectively427,428. Histone modifications are posttranslational modifications to histone proteins which include methylation, acetylation, phosphorylation, ubiquitination and SUMOylation429. These modifications can affect the physical structure of the chromatin and stimulate recruitment of additional modifiers. Chromatin remodelling also alters the physical structure of the chromatin through the actions of ATP dependent enzymes such as SWI/SNF, ISWI, CHD and INO80 complexes430,431. Non-coding RNAs (ncRNA) are also an important epigenetic mechanism and these include micro RNA (miRNA), small interfering RNA (siRNA), small nuclear RNA (snRNA), piwi-interacting RNA (piRNA) and long ncRNA (lncRNA)432. These ncRNAs are involved with regulating DNA methylation, chromatin remodelling, and genomic imprinting and are influential in transgenerational epigenetic inheritance by acting as a guide to specific genomic locations and recruiting proteins to target sites432. These various forms of epigenetic modifications often act together resulting in complex heritable changes to gene expression without affecting the DNA sequence433.

1.5.1 Types of Epigenetic Modification Epigenetic modifications play an important role in many biological processes such as cell differentiation, imprinting, and inactivation of the X chromosome. These modifications typically occur in response to external stimuli such as pre-natal malnutrition, ultraviolet radiation, and cigarette smoke. The resulting changes may be temporary or persist for a long time and can be passed from one generation of cells to the next. Some epigenetic features directly affect the DNA molecule itself (e.g., DNA methylation), some alter chromatin structure or modify chromatin associated proteins (e.g., post-translational histone modification), and other involve modification by RNA molecules (e.g., gene silencing by noncoding RNAs and RNA methylation). Types of epigenetic modifications at their target site can be seen in Figure 1.16.

66

Figure 1.16| Epigenetic Modifications and Locations434 Epigenetic changes include chemical modifications to DNA itself (DNA methylation), the histones (around which DNA is wound) or non-coding RNA. A total of 8 histone units form chromatin, around which 146 base pairs of DNA is wound to form the nucleosome. a|DNA methylation occurs predominantly at the CpG islands. (Lower box). Some modifications can increase accessibility to DNA (Upper box). b|Histone complexes can have modifications at multiple positions on the tail. c|Histone- remodeling complex move histones regulating access to DNA.

1.5.1.1 DNA Methylation DNA methylation occurs when a methyl group is added to 5’-cytosines of CpG dinucleotides in areas known as CpG islands, typically found in promoter regions435. It is thought that hyperglycaemia can alter DNA methylation436 and may play a role in DKD development. The methylation site has been found to dictate the effect on gene expression with methylation at promoter regions tending to lead to gene repression, whereas in gene bodies it may be involved in regulation of transcription, elongation and alternative splicing437. DNA methylation patterns obtained from whole blood genomic DNA from patients with DM and DKD compared to those with DM without DKD found methylation in several genes,

67 including UNC13B, which has been associated with time to onset of DKD438. A large scale EWAS439 investigating DNA methylation in patients with T1DM and DKD found 23 genes to significantly differentially methylated between case and control groups, with six of these genes having a known role in kidney disease. In support of this, another study showed significant differences of methylation in 1,061 genes in kidney tubular epithelial cells of patients with DN compared to controls440. The difference between studies is likely due to the difference in tissue types from which samples were obtained and it is possible that DNA methylation profiles in diseases such as diabetes are more tissue specific than previously thought441,442.

1.5.1.2 Non-Coding RNAs Short single stranded non-coding RNAs known as miRNAs are also capable of altering expression of genes resulting degradation of mRNA or halting translation442,443. These miRNA are around 22 nucleotides in length and more than 2,500 mature miRNAs have been identified in humans which regulate approximately 60% of protein-coding genes444. Due to imperfect complement binding a single miRNA can target many genes and abnormal expression of miRNAs has been implicated in the pathogenesis of various diseases including DKD445. In DKD, miRNAs are thought to be involved in renal fibrosis due to their involvement in the TGF-β1 pathway as both upstream regulators and downstream effectors446,447. Of the miRNAs involved in DKD, miR-192 is one of the best studied and despite this there is conflicting evidence regarding its expression and interaction with TGF-β1448–450.

1.5.1.3 Histone Modification The other main epigenetic influence that may contribute to DKD is the post-translational modification (PTM) of histone proteins by methylation, acetylation, ubiquitination and phosphorylation. These PTMs typically occur in the amino-terminal tail of histone proteins and alter the chromatin structure therefore affecting gene expression and this has been shown to be involved in the expression of genes associated with fibrosis and inflammation in DKD447. Histone PTMs have been observed in diabetic patients, particularly in the insulin promoter transcription factor Pdx1, and this is thought to be involved in regulating insulin expression and secretion in response to changes in blood glucose451,452. Metabolic abnormalities in adipocytes of those with T2DM has been shown to be affected by histone PTM in the transcription factors C/EBPβ and PPARγ453,454, which may be involved in the development of T2DM associated with obesity455. Various studies have

68 investigated the role of PTMs of histone in the TGF-β signalling pathway and how this may influence fibrosis and ECM accumulation due to hyperglycaemia in DKD456–458.

1.5.2 Epigenetics in nDNA and mtDNA Epigenetic changes are not only confined to nDNA but the mitochondrial genome itself is also subject to epigenetic modification through similar mechanisms to nDNA. Historically, there was much debate surrounding the ability of mtDNA to undergo epigenetic changes459–465. However, in 2011 advances in methodology and sensitivity allowed Infantino and colleagues to demonstrate the presence of methylated bases in human mtDNA466. Also in 2011, Shock and co-workers showed that a DNA methyltransferase (DNMT)1 transcript variant was present inside mitochondria where it is capable of modifying transcription of the mitochondrial genome467. Also around this time, Chestnut and colleagues observed DNMT3a in mitochondria of mouse and human tissue providing evidence that DNA methylation may act of mtDNA by similar mechanisms as in nDNA468. While cytosine methylation most commonly takes place at the CpG dinucleotide sequence, in mtDNA methylated cytosines are mainly located in non-CpG moieties within the mitochondrial D-Loop, particularly in the promotor region of the heavy strand and in conserved sequence blocks of the which may indicate a role in regulating mtDNA replication or transcription469. In addition to cytosine methylation mtDNA also contains hydroxylmethylated cytosine at a higher density than in autosomes in both the D-Loop region and along the entire mitochondrial genome469,470. Initially mtDNA was not thought to contain histones, however more recent evidence has demonstrated that mtDNA is protein coated and contained within nucleoids471–473. Histone family members, particularly H2A and H2B, have also been observed in mitochondria although rather than directly binding DNA these were found to localise to the mitochondrial membrane474. In 2008 Bogenhagen and colleagues found 57 proteins associated with human mtDNA475. Amongst these are several proteins essential for mitochondrial biogenesis as well as mtDNA transcription and replication including mitochondrial transcription factor A (TFAM), mitochondrial transcription factor B (TFBM), DNA polymerase gamma, prohibitin, Twinkle helicase and mitochondrial single-stranded DNA binding protein473,476–479. Whole genome and transcriptome sequencing has also revealed the presence of numerous ncRNAs including siRNAs, miRNAs and lncRNAs which are involved with regulating essential signalling pathways in mitochondria by altering expression of nuclear-encoded mitochondrial proteins480–486. In addition to ncRNAs encoded by nDNA, Rackham and colleagues identified three lncRNAs transcribed by the mitochondrial genome from regions complementary to the genes coding for ND5, ND6 and Cyt b487. These mitochondrial lncRNAs are regulated by the

69 mitochondrial RNase P complex and their abundance varies significantly across different tissue types with a high abundance observed in cardiac tissue487,488. There are also thousands of non-coding small RNAs transcribed by mtDNA the majority of which are derived from sense transcripts of mitochondrial genes but these have also been mapped to the mitochondrial D- loop region489,490.

DNA methylation and miRNAs are known to affect the expression of both nDNA and mtDNA whereas, mtDNA does not contain histone proteins and therefore PTM of histones is limited to nDNA only. As the field of epigenetics continues to develop studies such as the Human Epigenome project and the Encyclopedia of DNA elements aim to greatly improve our understanding of the role of epigenetics in many diseases including diabetes and DKD491–494. Epigenetics may also play a role in metabolic memory resulting from epigenetic changes due to lifestyle and environmental factors, and these changes may also predispose future generations to metabolic abnormalities due to hereditary epigenetic changes495,496. Due to the possible reversibility of epigenetic changes, there is also potential to develop epigenetic drugs and miRNA inhibitors which may be used in combination with current DKD treatments497,498.

1.5.3 Epigenetics and Kidney Disease Epigenetic modifications resulting from an adverse in utero environment have been implicated in increased risk of several diseases such as cardiovascular disease, hypertension, T2DM, obesity and renal disease499–504. A systematic review by colleagues reported that individuals with low birth weight (<2.5 kg at birth) have an increased risk of CKD development in adulthood505. Epigenetic changes are also acquired throughout life often in response to adverse conditions such as the high-glucose environment experienced in DM506. The role of epigenetic modifications in kidney disease have previously been discussed in detail elsewhere277,278,507–509 and methylation of nDNA and histone modifications are known to influence CKD development, particularly in patients with DM510,511. Genes involved with renal development and renal fibrosis have also been found to be differentially methylated in those with CKD440,512. Environmental influences such as maternal smoking have been linked to CKD development later in life through a process known as foetal programming513 and this has been shown to alter genome-wide and gene-specific methylation patterns in offspring440,514– 518. Genes involved with kidney development and DKD have been reported to be differentially methylated in diabetic patients with ESRD compared to those with diabetes with no evidence of nephropathy519. Amongst these differentially methylated genes the top canonical

70 pathways included oxidative phosphorylation and mitochondrial dysfunction518. DNA hypermethylation was also observed in the promotor region of PGC-1α in diabetic patients and was inversely correlated with mitochondrial content99,436 . Interestingly, the influence of epigenetic modifications on mitochondrial function appears to go both ways with recent studies utilising cell lines devoid of mitochondria showed that depletion of mitochondrial DNA can lead to abnormal CpG methylation patterns and restoration of mtDNA in these cells partially reverse these changes520. This highlights the ability of mitochondria and nDNA to interact through both genetic and epigenetic mechanisms. Due to this complex interaction, mitochondrial dysfunction will affect both the nuclear genome and the epigenome and vice versa; the resulting bioenergetics failure has implicated in a number of energy deficiency diseases particularly in tissues which rely on high energy flux including brain, heart, muscle, renal, and endocrine systems521,522. Within the context of renal disease the ability of mitochondrial proteins to cause epigenetic modifications has been demonstrated through the ability of mitofusion 2 to abate histone acetylation in the promoter region of collagen IV through a reduction in ROS generation, in turn reducing collagen IV expression in streptozotocin-induced diabetic rats523–525.

In addition to environmental stress, altered metabolic states such as hyperglycaemia in diabetes or uraemia in CKD can also lead to epigenetic changes known as hyperglycaemic or uraemic “memory”, respectively526–528. Such modifications to mtDNA and nuclear genes involved with mitochondrial function may play an important role in the development of CKD. In a hyperglycaemic environment such as DKD several mitochondrial genes are differentially methylated including PMPCB, TSFM and AUH529. Hypermethylation of DNA has also been shown to impact miRNA let-7a downregulation of UHRF1, which interacts with Dnmt1530,531. The majority of research into CKD and DKD has focused on genetic and epigenetic features of nuclear DNA and although the field of mitochondrial epigenetics is still in its relative infancy there is evidence to suggest that epigenetic regulation of mtDNA is involved in maintaining mitochondrial function and dysregulation of these mechanisms may play an important role in disease pathogenesis in organs with high energy demands. Such epigenetic features may also be potential therapeutic targets as they are reversible and tissue specific, therefore, they may be lucrative potential targets for drugs targeting specific epigenetic changes. For example, epigenetic changes associated with DKD can be reversed by losartan treatment in mice models of diabetes278,530,532,533. It is evident that DKD is a complex disease that develops due to a combination of environmental risk factors, genetic predisposition and mitochondrial dysfunction. However, epigenetic regulation of gene expression may also be

71 involved in disease development442. In a similar way to GWAS, ‘epigenome-wide association studies' are commonly used to assess epigenetic modifications of the whole genome.

1.6 Genome Wide Association Studies

GWAS can detect associations between genetic variants and traits of interest. Nevertheless, association does not indicate causation and in order to better understand biological mechanisms of disease experiments in vitro and in vivo are needed to explore how identified genes and variants are implicated in disease pathogenesis. The ultimate goals are often stated as improving treatment or preventing disease development. GWAS have been used to assess the relative contribution of genes to disease risk, and this genetic information may be combined with existing clinical risk algorithms to help improve individual disease prediction allowing more personalised approaches to treatment534–541.

1.6.1 Single Nucleotide Polymorphisms Single nucleotide polymorphisms or SNPs are alleles which vary between people and populations. These changes in the DNA sequence occur approximately once in every 1,000 nucleotides and are therefore considered a type of common genetic variation542,543. SNPs act as markers for a region of the genome and while the majority of SNPs were believed to have minimal effect on biological processes some of these have functional consequences resulting in for example, changes to transcription factor binding affinity altering gene expression, altered mRNA transcript stability or amino acid sequence544,545. SNPs represent a considerable proportion of the DNA variation in human populations546. Typically GWAS will make use of biallelic SNPs and the frequency at which these alleles appear in a population in order to investigate the association between a SNP and a particular phenotype547.

The use of common variants in GWAS contrasts with the use of genetic markers in linkage analysis which are often used to investigate rare genetic disorders. The use of linkage analysis has not proved to be practical for the investigation of common diseases such as DKD and other diabetic complications due to the lack of large multigenerational families282,442,548–550. There are also different genetic mechanisms involved in many common diseases when compared to typical rare diseases551–555. This idea, along with the observation that a number of susceptibility variants for common diseases have high minor allele frequencies, form the basis of the common disease/common variant (CD-CV) hypothesis556,557. The reasoning behind this was that common disorders may result from genetic variation that is common within the population. Accordingly, SNPs associated with common diseases should cause only

72 small changes in gene expression, otherwise the frequency of the SNP would be reflective of the frequency of the phenotype in the population, and thus common variants would have less impact on phenotype compared to rare variants547,556,557. Additionally, this hypothesis suggests that total genetic risk due to common variants is distributed across multiple genetic loci with each single SNP contributing only a small proportion to the total genetic risk for a phenotype. While there is evidence to support the CD-CV hypothesis it should not be assumed that the entire genetic component of a common disease is entirely due to common alleles558,559 and several exceptions to this hypothesis have been reported in complex diseases560–566 and also in Mendelian disorders567–570. These observations seem to indicate that the genetic architecture of complex traits and Mendelian disorders may be more closely linked than that implied by the CD-CV hypothesis555,556,571–574. In contrast to this, the common disease/rare variant hypothesis suggests that multiple rare variations may be more likely to have a phenotypic effect and influence common disease susceptibility than common variants556,575,576. There is evidence to suggest that complex phenotypes and diseases are the result of a combination of both common and rare variants with different effect sizes572,577–581. These two models should not be seen as separate ideas and arguably a combined “polygenic model”582 is best suited to explain the variability in many cases including several forms of kidney disease583–585. Several studies investigating the genetic basis of T2DM have provided support for the CD-CV hypothesis586–589 due to the observation that low-frequency and rare variants contribute much less to T2D heritability than common variants. Despite these findings, the impact of rare genetic variation on T2DM risk still remains unclear due to difficulties in capturing the full extent of rare genetic variants in the genome590. It is also important to keep in mind that the DNA sequence alone does not dictate disease susceptibility and additional mechanisms may influence risk of a particular disorder including somatic mosaicism, small ncRNA or miRNA, epigenetic regulation, posttranslational modifications, variable X-inactivation, gene-gene and gene-environment interactions585. The relationship between the strength of a genetic effect and allele frequency of genetic variants is illustrated in Figure 1.17591.

73

Figure 1.17| Risk Allele Frequency and Strength of Genetic Effect in Association Studies591 Genome-wide association studies have historically investigated common variants and have identified many associations with modest to low effect size on risk (bottom right). Whole- exome and more recently whole-genome sequencing studies allow identification of low- frequency and rare variants within that cannot be genotyped or imputed. On average, effect sizes of less common and rare variants are on are larger than more common variants. For common variants effect sizes are generally small which necessitates very large sample sizes for discovery.

1.6.2 Linkage Disequilibrium The term LD is used to describe the degree to which an allele of one SNP is correlated with the allele of another SNP within a population resulting from historical evolutionary forces, particularly finite population size, mutation, recombination rate, and natural selection281,592. LD is closely related to chromosomal linkage, whereby two markers on a chromosome remain physically joined on a chromosome through several generations within a family. It would be expected that recombination events throughout generations would amplify this effect and assuming a population of a fixed size, undergoing random mating, repeated random recombination would break apart contiguous chromosome segments therefore leading to all alleles being inherited independently. Any linkage between genetic markers which deviates from the expected independent inheritance is therefore known as LD281,593. Population size, number of founding , the number of generations and amount of recombination are all known to affect the rate of LD decay594,595. Due to these factors, different sub- populations of a species display different patterns of LD with African-descent populations

74 showing the most ancestral diversity and thus have smaller regions of LD whereas European- descent and Asian-descent populations have a higher degree of LD as these were created by founder events (a sampling of chromosomes from the African population), which altered the number of founding chromosomes, population size, and the generational age of the population. These populations on average have larger regions of LD than African-descent groups596–598. Measures of LD are typically based on the difference between the observed frequency of co-occurrence for two alleles and the frequency that would be expected if the markers are independent599. The most frequently used measures of LD are D' and r2 and these

599,600 can be seen equations 1.1 and 1.2 . In these equations, π12 is the frequency of the ab haplotype, π1 is the frequency of the a allele, and π2 is the frequency of the b allele.

Equation 1.1

Equation 1.2

Equation 1.1| D' is a population genetics measure that is related to recombination events between markers and is scaled between 0 and 1. A D' value of 0 indicates complete linkage equilibrium, which implies frequent recombination between the two markers and statistical independence under principles of Hardy-Weinberg equilibrium. A D' of 1 indicates complete LD, indicating no recombination between the two markers within the population. Equation 1.2| For the purposes of genetic analysis, LD is generally reported in terms of r2, a statistical measure of correlation. High r2 values indicate that two SNPs on the same chromosome convey similar information, as one allele of the first SNP is often observed with one allele of the second SNP, so only one of the two SNPs needs to be genotyped to capture the allelic variation.

1.6.3 Imputation The existence of LD is essential to GWAS and allows associations to be made both directly and indirectly (Figure 1.18)601. A direct association refers to a directly genotyped SNP which is found to be statistically associated with a particular trait602,603. If the SNP found to influence the phenotype under investigation is not directly genotyped but is found to be in high LD with a genotyped tag SNP it is known as an indirect association603,604. From analysis of the

75

International Haplotype Map(HapMap) project data it has been demonstrated that >80% of SNPs commonly found in populations of European descent can be genotyped with between 500,000 to 1,000,000 SNPs covering the genome602. In genetic association studies LD and the presence of tag SNPs can be utilised to avoid genotyping redundant SNPs605. This is known as genotype imputation and involves predicting, or imputing, genotypes that are not known or are not directly sequenced in a sample of individuals. This involves sequencing a subset of SNPs and then using a reference panel of densely genotyped individuals to ‘fill in the gaps’ by predicting the genotype of the unobserved SNPs, therefore increasing the number of SNPs available for association analysis thus increasing study power and also permitting meta- analysis between multiple cohorts ensuring a common set of SNPs between all cohorts606.

For imputation, the genotypes of samples are split into their two component haplotypes and then a reference panel with a dense set of SNPs is used to fill in the missing genotypes in the target sample. Sections of a haplotype are identified in the reference populations which share SNP alleles with the haplotype being imputed that is indicative of descent from a common ancestor and therefore the same SNP alleles will be present throughout the haplotype. There are two major types of imputation, single or multiple. In single imputation a single estimate of each missing value is determined whereas multiple imputation produces multiple estimates that are all slightly different, therefore allowing estimates of standard error to be made607,608. In most cases, multiple imputation will provide less biased estimates with better coverage for a linear regression model and often also perform well for estimation of regression parameters for a linear mixed model with random intercept609,610. Although more complex methods exist these may only be required for specific circumstances such as irregularly spaced data610. Due to the use of imputation, the significant SNP may not necessarily be the SNP which was genotyped and therefore it cannot be assumed that this is a causal variant. It is also important to remember that association is not causation and follow- up studies are necessary to assess the significance of SNPs identified in an independent collection with in vivo and in vitro studies necessary to identify any impact on function611.

76

Figure 1.18| Indirect and Direct Associations in GWAS a| Direct associations are those in which a candidate SNP (red) is directly tested for association with a disease phenotype. For example, this is the strategy used when SNPs are chosen for analysis based on prior knowledge about their role, for example when missense SNPs that are likely to have an impact on the candidate gene (green rectangle). b| Indirect associations refer to SNPs which are genotyped (red) based on linkage disequilibrium (LD) patterns in order to gain as much information about the surrounding SNPs as possible. In this example, the SNP shown in blue is indirectly tested due to LD with the other three SNPs. A combination of these approaches is also possible601.

1.6.4 GWAS Methodology Typically, GWAS will utilise a case-control approach whereby two large groups of individuals are compared, one control group of healthy individuals and one case group of individuals with the disease or trait under investigation. Individuals are genotyped in order to determine which nucleotide base is present at each SNP locus. The genotyping method will depend on the exact needs of the study and in GWAS this is usually performed by array-based hybridisation making use of oligonucleotide probes, with a known sequence array, immobilised on a solid support which is exposed to the sample and the hybridisation pattern can then be used to infer the nucleotides present at each SNP locus547,612–614. The number of SNPs genotyped will vary depending on the technological platform used to perform this step but usually there are at least one million SNPs genotyped547, followed by imputation606. Each of the SNPs is then investigated to determine whether there is a statistically significant difference in allele or genotype frequency between the case and control groups615. The ratio of the odds of individuals in the case group having a specific allele and the odds of disease for individuals who do not have that allele is known as the odds ratio (OR) and if allele frequency in the cases is markedly higher than that in the control group the OR will be greater than one and vice versa615–617. The significance of level of the OR is determined through a chi- squared test and a P-value is calculated. The use of ORs and P-values are essential for reporting the effect size and in GWAS a significant P-value is indicative of association with a particular trait and the OR provides information about the size of the effect observed547,616.

77

In addition to the case-control method, which is used to investigate qualitative phenotypes, GWAS are also commonly used for analysis of quantitative phenotypes such as height or, as in the current study, glomerular filtration rate. As well as calculating associations it is important to account for any variables which may confound the results and lead to false positives such as sex and age as well as controlling for population stratification to account for genetic variation due to geographic and ethnic backgrounds618. GWAS data can be integrated with other biological data in order to generate more informative results.

1.6.4.1 SNP Arrays Arrays for SNP genotyping have been developed by various companies such as Motorola Life Sciences619, Agilent619, Affymetrix620 and Illumina621. Affymetrix and Illumina are the most widely used and they both rely on the binding of nucleotide bases to their complementary . These array platforms require hybridisation of fragmented single-strand DNA to nucleotide probe sequences (Figure 1.19)622. Each array may contain hundreds of thousands of unique probes that are designed to bind to target DNA sequences to produce signals that can be measured. The intensity of the signals reflects the amount of target DNA in the sample and the affinity between the sample DNA and the probes621,622. After capturing the raw intensity data computational algorithms are used to infer SNP genotypes605.

Figure 1.19| Overview of SNP Array Technology622 a| The Affymetrix assay utilises 25-mer probes for both alleles, and the location of the SNP locus varies from probe to probe. DNA binds to both probes when it is complementary to all 25 bases a stronger signal (bright yellow) is produced than when bound to a mismatched SNP site (dimmer yellow). b| Illumina bead technology utilises a 50-mer sequence complementary to the sequence adjacent to the SNP site. The single-base extension (T or G) complementary to the allele found in the DNA sample (A or C, respectively) can bind and produce a coloured signal (red or green, respectively). Both platforms utilise computational algorithms to quantify the raw signals and visualise these appropriately622.

78

Genetic variants identified using GWAS are typically common in the population with a minor allele frequency (MAF) greater than 1%592. SNP arrays have typically been seen as a relatively inexpensive way to survey such genetic variants throughout the genome and the number of SNPs which must be genotyped to provide sufficient coverage over the full genome varies between different platforms623. Table 1.3 and Table 1.4 show the coverage of several commonly used genotyping platforms for whole genome SNP genotyping and for mtDNA alone. The positions of mitochondrial SNPs on each array in relation to mitochondrial genes taken from the University of California, Santa Cruz genome browser (UCSC) database624 can be seen in Figure 1.20. As costs of DNA sequencing continue to fall, massively parallel methods may be replaced by single molecule approaches for many applications however the choice of sequencing platform will depend on the availability of resources, budget and also the depth and precision of sequencing required625,626.

79

Table 1.3|Comparison of Whole Genome SNPs Found on Commonly used Genotyping Platforms Dark grey shaded boxes indicate the total number of genome-wide SNPs present on each array. The pink shaded box shows the number of SNPs common between all arrays. Affymetrix Affymetrix Affymetrix Illumina Illumina Illumina Illumina Illumina Genome-Wide Axiom Human Mapping Genotyping Array Infinium Infinium OmniQuad HumanCoreExome- HumanCoreExome- Human SNP 6.0 UKBiobank Array 500K Array Set Omni5-4 Omni2.5-8 Array 24 Array 12 Array Array Affymetrix Human Mapping 500K 230,905 170,778 143,072 47,873 46,570 493,877 50,490 500,568 Array Set Affymetrix Axiom 481,496 275,425 139,068 106,814 105,900 96,840 845,487 UKBiobank Array Affymetrix Genome-Wide 449,211 337,809 279,695 88,414 86,074 934,968 Human SNP 6.0 Array Illumina HumanCoreExome- 314,848 284,597 274,814 528,021 542,585 12 Array Illumina HumanCoreExome- 324,032 293,765 282,096 550,601 24 Array Illumina 811,000 754,414 1,134,514 Omni1-Quad Array Illumina Infinium SNPs Common to ALL 2,319,402 2,372,784 Arrays = 7,851 Omni2.5-8 Illumina Infinium 4,327,108 Omni5-4

80

Table 1.4| Comparison of mtDNA SNPs found on Commonly used Genotyping Platforms Dark grey shaded boxes indicate the total number of mitochondrial SNPs present on each array. The percentages that mitochondrial SNPs make up of their respective arrays are shown in brackets in the dark grey shaded boxes. The pink shaded box shows the number of SNPs common between all arrays for mitochondrial SNPs. The Affymetrix 500K array has been excluded as it contains zero mitochondrial SNPs Affymetrix Affymetrix Illumina Illumina Illumina Illumina Illumina Genome- Axiom Genotyping Array Infinium Infinium OmniQuad HumanCoreExome- HumanCoreExome- Wide Human UKBiobank Omni5-4 Omni2.5-8 Array 24 Array 12 Array SNP 6.0 Array Array Affymetrix Axiom 94 68 16 115 111 83 354 (0.042) UKBiobank Array Affymetrix Genome-Wide 44 33 9 70 68 411 (0.045) Human SNP 6.0 Array Illumina HumanCoreExome- 177 153 17 345 364 (0.067) 12 Array Illumina HumanCoreExome- 179 155 18 367 (0.067) 24 Array Mitochondrial Illumina 19 14 25(0.002) SNPs Common to ALL Omni1-Quad Array (Excluding Affymetrix 500K) Illumina Infinium 174 180 (0.008) Arrays = 24 Omni2.5-8 Illumina Infinium 209 (0.005) Omni5-4

81

Figure 1.20| Positions of Mitochondrial SNPs on Each Array in Relation to Mitochondrial Genes and mRNA From the UCSC Database “Combined” track contains all SNPs from the seven individual array tracks. All known mitochondrial SNPs are also shown along with common SNPs and any SNPs which have previously been flagged as disease associated. 624

82

1.6.4.2 Statistical Analysis and Significance Threshold There are a number bioinformatics tools available for analysis of SNP data such as PLINK627, which is one of the most popular and powerful of these programs but other software is available such as GenABEL628, GWAF629, SNPAssoc630 and snpMatrix631 which are also based on R language. Software such as GWAS Analyzer,632 GWAS GUI633 and MAVEN634 provide platforms for intuitively viewing the results of GWAS. PLINK can be used for cleaning data, quality control and association analysis amongst other applications627. Other software is available for various stages of the GWAS process such as EIGENSTRAT635 or GRAF-pop636 for principal component analysis; and MACH 1.0637 software for imputation. One important step in GWAS is the correction for multiple testing in order to determine the true significance of a result. Bonferroni correction is commonly used for this, however this is considered to be overly conservative due to LD between SNPs, particularly when using imputed data638. A significance threshold of 5 x10-8 is commonly used to correct for the number of independent tests carried out at genome-wide level and is therefore suitable for many applications639. A less stringent threshold of 1 x10-5 may be used to indicate “suggestive” significance, although a large number of results may cross this threshold in a typical GWAS638. In order to draw accurate conclusions and identify true associations the gold standard in GWAS is replication of top-ranked SNPs in an independent population638,640.

1.6.5 Applications of Genome-Wide Association Studies The use of GWAS has allowed researchers and clinicians to identify specific genes that increase or decrease susceptibility to complex disorders. The genes identified through GWAS can provide insights into the pathways involved in complex disease development and also contribute to the development of novel, targeted treatments641.

1.6.5.1 GWAS in Research In the early days of genome-based research, genome-wide linkage studies were used to identify rare genetic variants involved with monogenic disorders or highly penetrant traits. These were also used to map the regulating loci of multifactorial traits and diseases, however, this approach had limited success when it was applied to common disorders which often have a more complex genetic architecture and less availability of biological samples from multiple generations of the same family601. In contrast to this, GWAS have a higher genetic resolution than linkage studies, can be more easily applied to a general population and provide better statistical power for identifying genetic associations. GWAS are best placed to identify common variants with small effects whereas linkage studies are best suited for the

83 identification of rare variants with large effects641,642. For the majority of traits studied using GWAS, it is apparent that SNPs in many genes contribute to the overall genetic risk, therefore the proportion of variance attributed to a solitary variant is very small and larger experimental samples have led to higher powered studies facilitating the discovery of novel loci associated with several diseases over the last 10 – 15 years592,643,644. Initially genetic variation was thought of as “one gene, one function, one trait“, however, the idea of pleiotropy has become widely accepted due to genome-based research methods645–647.

In addition to identifying specific loci associated with a trait, GWAS have allowed the quantification of “SNP heritability” and it has been estimated that one-third to two-thirds of genetic variation at causal variants can be captured by common genotyped and imputed SNPs based on LD644,648. Further work utilising whole-genome sequencing is necessary to identify the amount of variation due to causal variants with frequencies less than 1% which would not be captured in many GWAS592. As more GWAS data is generated, novel methods for data analysis have emerged including methods for modelling population structure and relatedness between individuals in a sample during association analyses649–655; methods to detect novel variants and gene loci using GWAS summary statistics656–658; methods for estimation and partitioning genetic (co)variance659,660; and methods of inferring causality661– 663. The sharing of genetic data in the public domain has also been an important step to ensure reported associations are accurate and reproducible. Initial concerns regarding the identification of individual participants from GWAS summary statistics were found to be unsubstantiated and the use of such summary statistics has allowed the discovery of novel associations656,664–666, estimation of SNP-based heritability667, quantification of pleiotropy across many phenotypes645,659,668, and creation of more accurate prediction scores, as well as follow-up with computational tools, functional assays, and model systems for the identification of candidate genes592. In order to increase the discovery power of individual GWAS these can also be combined in a meta-analysis but strict guidelines should be adhered to when combining GWAS results to ensure homogeneity between studies669.

The use of GWAS has developed the understanding of the underlying genetics involved with several disorders including bipolar disorder670, cancer671, Alzheimer’s disease672, cardiovascular disease673, epilepsy674, obesity675, T2DM676, CKD and DKD677. The discovery of true biological associations using GWAS is not as simple as it first appears as associations, between a specific variant at a genomic locus and a trait, are not directly informative regarding the target gene or the mechanism by which the variant is associated with an altered phenotype. As new types of data, new molecular technologies, and new analytical

84 methods emerge the relationship between sequence and phenotype is becoming clearer592. The various applications of SNP arrays in GWAS and other related analytical methods are summarised in Table 1.5. These tools have been successfully used for defining the relative role of genes and the environment in disease risk, assisting in risk prediction (enabling preventative and personalized medicine), and investigating natural selection and population difference592.

Table 1.5| Uses of Genome-Wide SNP Arrays in Human Genetic Discoveries592 Analysis Purpose Discoveries Estimation of genetic Detecting and quantifying Pleiotropy is ubiquitous correlationa pleiotropy Estimation of SNP Genetic architecture Large proportion of genetic variation heritabilitya captured by common snps Genome-wide Quantifying genome Large variation in LD in the genome assessment of LD architecture Genome-wide CNV Detecting trait-CNV Hundreds of associations with analysis associations diseases and disorders GWAS Detecting trait-snp 10,000 robust associations with associations diseases and disorders, quantitative traits,∼ and genomic traits Mendelian Testing causal Replication of known causal randomisation relationships relationships; empirical evidence of observational associations that are not causal Polygenic risk scoresa Detecting pleiotropy; Out-of-sample prediction works as validating GWAS expected; detection of novel trait discoveries associations Population Reconstructing human Genetic structure can mimic differences in allele population history; geographical structure; evidence of frequencies detecting selection natural selection Trait GWAS with - Fine-mapping; detecting Two-thirds of GWAS-associated loci omics GWAS* target genes; function implicate a gene that is not the nearest gene to the most associated SNP aThese analyses can be performed with GWAS summary statistics.

1.6.5.2 From GWAS to Function Several studies have investigated the link between statistical associations reported in GWAS and how this influences biological function. Increased expression of ELMO1 has been observed in the presence of high glucose which contributes to chronic glomerular injury due to upregulation of TGF-β1, collagen type 1, and fibronectin expression and dysregulation of renal ECM metabolism287,678. Variants in ELMO1 have been associated with DKD in multiple independent cohorts, including Chinese679, African-American680, Pima Indian681, and

85

Caucasian populations288 however these findings were not replicated in populations included in the Diabetic Nephropathy Collaborative Research Initiative (DNCRI) collection682. The GoKinD collection have reported associations between FRMD3 and DKD and these have also been replicated in several independent studies683–685. In the context of DKD, FRMD3 expression differs significantly between early stages of the disease and advanced DKD686. It has also been reported that there is a link between a SNP found in the promotor region of FRMD3 and the BMP-signalling pathway. It has also been suggested that AFF3 and SORBS1 play a functional role in DKD susceptibility based on reports from Sandholm and colleagues41 and Germain and co-workers42. Recently a very large study investigating DKD genetics revealed rs55703767, a common missense mutation in the collagen type IV alpha 3 chain (COL4A3) gene that encodes a major structural component of the GBM, associated with DKD294. Genetic variants encoding the proteins of the GBM are known to be responsible for several inherited nephropathies687.

1.6.5.3 Clinical Applications of GWAS Molecular genetic discoveries have several potential applications in the field of medicine as summarised in Table 1.6688. There are many challenges that must be overcome in order to translate the abundance of genomic data from research findings to medical practice. In the case of complex diseases, the interaction of various determinants, such as genetic, molecular and environmental factors contribute to the clinical phenotype. Genetic studies are most informative within families in order to gain insights into pathogenesis of disease, identify at- risk individuals and target specific pathways for prevention and treatment688. Genomic risk prediction based on GWAS findings aims to enable disease prevention by quantifying individual risk in the early stages and also to provide a more comprehensive diagnosis when established methods are inadequate689.

GWAS can also provide opportunities for drug discovery and drug repositioning690,691, and it has been estimated that drug targets supported by genetic evidence may improve the success rate in clinical development692. Many approved drugs have been directly supported by evidence derived from GWAS692. Drug repositioning, based on evidence from GWAS, has been successful for the treatment of autoimmune diseases by employing drugs to target the IL-23 pathway which is now widely accepted for treating psoriasis and psoriatic arthritis, ankylosing spondylitis and inflammatory bowel disease693–697. GWAS have also been used to identify over 100 risk loci associated with schizophrenia and there is evidence of considerable pleiotropy with other psychiatric disorders646,698,699, although the genetic composition of rare and common variants will likely differ substantially between such disorders700–708. In addition

86 to the discovery of risk alleles for schizophrenia, GWAS findings may provide the opportunity to identity multi-gene targets for drug repositioning, which may lead to new treatments for this disease709. In recent years the field of pharmacogenomics has emerged as pharmacogenetic studies have moved on from a candidate gene approach and adopted genome-wide research methods such as GWAS710. These pharmacogenomic GWAS have enabled the discovery of biomarkers for drug response as well as improving understanding of disease pathophysiology and molecular pharmacology when combined with functional studies. The use of GWAS, integrated with other omics technology and electronic medical records, offers opportunities for “precision medicine”. There are a number of bottlenecks which must be addressed before this becomes a reality including cost effective whole- genome sequencing, careful curation of data repositories and clearly established clinical genomic protocols to ensure minimal errors in diagnosis which could have detrimental consequences711. As of 2019, the NHS will offer whole-genome analysis for all seriously ill children with a suspected genetic disorder and adults with certain rare diseases or hard to treat cancer712.

87

Table 1.6| Clinical Applications of Molecular Genetic Discoveries688 Genetic testing • Best applied in familial setting and through trio or cascade screening • Best applied in single gene disorders with Mendelian patterns of inheritance • Preclinical diagnosis in familial setting enabling interventions to reduce the risk • Unambiguous assignment of causality is almost impossible in a single individual in most situations • Limited applications in complex phenotypes, whether as a single variant or as a genetic risk score Accurate diagnosis and disease • Distinction of the disease from phenocopy classification conditions, which enables implementation of proper treatment • Possible genetic-based categorization of the disease, which might enable rendering appropriate therapy Prognostication • With few exceptions, limited impact beyond the clinical means and often unreliable Response to therapy • Early applications in cancer therapy • Potential for individualized therapy and prevention of side effects, albeit not proven to be superior to conventional medical approaches Insight into fundamental • This is likely the primacy of the genetic data, mechanism(s) of disease enabling discovery of new mechanisms and targets for prevention and treatment

1.7 Next Generation Sequencing

For more than a decade now second and third generation sequencing methods have allowed unbiased examination of billions of RNA and DNA templates. The term ‘Next Generation Sequencing’ (NGS) is widely used to refer to second generation sequencing methods which process millions of sequence reads in parallel therefore allowing sequencing of whole genomes (or specific sections) at high resolution in a relatively short time and with high accuracy713. Such methods may also be referred to as massively parallel sequencing due to the more recent development of third generation sequencing or long-read sequencing approaches, which can produce genome assemblies of unprecedented quality and allow direct detection of epigenetic modifications on native DNA and permit whole-transcript sequencing without the need for assembly714.

During the mid-1970’s two methods emerged for direct sequencing of DNA: the Maxam- Gilbert sequencing method715 and the Sanger chain-termination method716. The Maxam and

88

Gilbert technique made use of radiolabelled DNA which was then treated with chemicals to cleave the DNA at specific bases; these fragments were then ran on a polyacrylamide gel to determine the length of cleaved fragments (and the position of specific nucleotides) and therefore infer the DNA sequence715. This method was very time consuming and relied on neurotoxic reagents and a breakthrough in the field of DNA sequencing came in 1977 with the development of Sanger’s ‘chain-termination’ technique716,717. This method utilises chemical analogues of deoxyribonucleotides (dNTPs) known as dideoxynucleotides (ddNTPs) which lack the 3’ hydroxyl group necessary for DNA chain extension, therefore they are unable to bond with the 5’ phosphate of the next dNTP. By mixing labelled ddNTPs into the DNA extension reaction at low concentrations DNA strands of each possible length are produced as ddNTPs are randomly incorporated during strand extension and halt the elongation reaction. Four parallel reactions are carried out each with a different ddNTP base and the results are run on four lanes of a polyacrylamide gel. Autoradiography can then be used to infer the nucleotide sequence of the original based on the radioactive band in the corresponding lane at that position on the gel716,717. Over the next 2 decades a number of improvements to the methods and the development of automated Sanger sequencing718 led to this becoming the most widely used DNA sequencing technology and the method of choice for sequencing the first human genome719.Following the completion of the project many new methods and techniques were developed for high-throughput genetic sequencing with the goal of bringing the cost of sequencing an entire human genome for less than $1000720,721. In 2014 this goal was reached by Illumina using the HiSeq X Ten system which can sequence a whole genome for $1000 including instrument depreciation, DNA extraction, library preparation, and estimated labour for a typical high-throughput genomics laboratory722. The dramatic drop in the cost of genome sequencing since the completion of the human genome project can be seen in Figure 1.21. These estimates have been met with some scepticism due to the high initial costs of sequencing machines, unrealistic labour cost projections and cost of bioinformatics analysis. Cost estimates for the Illumina HiSeq X Ten system were based on sequencing 18, 000 human genomes per year for four years to achieve the $1000722 and therefore actual costs will vary depending on the throughput of individual laboratories. As well as the direct costs associated with sequencing it is important to consider the impact of incidental findings on individuals and their families as well as the cost of downstream tests and treatments which may be required following genetic testing723,724. This

89 highlights the need to balance benefits and costs when employing high throughput sequencing approaches for routine clinical use.

Figure 1.21| Falling Cost of Genome Sequencing 2001 - 2019725 Data showing the cost of sequencing a human-sized genome from 2001 – 2019. Also shown is hypothetical data reflecting Moore's Law, which describes a long-term trend in the computer hardware industry involving the doubling of 'compute power' every two years. Since 2008, when second generation sequencing entered widespread use, there has been a dramatic out- pacing of Moore’s Law.

1.7.1 Sequencing Principles The precise methodology will vary between platforms and manufacturers but in general a typical NGS sequencing run involves sample preparation, sequencing and data generation. Firstly, a library is prepared using amplification or ligation with specified adapter sequences which permit library hybridisation to sequencing chips and give a universal priming site for sequencing primers726. The surface on which the library fragment is amplified will vary between platforms but during this stage DNA linkers hybridise the library adapters to produce clusters of DNA727–729. These clusters are each derived from a single library fragment and each cluster represents an individual sequencing reaction. Quantification of each cluster is permitted by light or fluorescence generated from repeated cycles of nucleotide incorporation730,731. A collection of DNA sequences generated at each cluster is produced at the end of the sequencing run and this can be analysed in various ways to generate useful results for downstream analyses.

90

The method of sequencing utilised for high throughput sequencing approaches varies between different platforms and manufacturers but in general there are four main principles which can be employed732. Cyclic array sequencing or sequencing by synthesis (SBS) involves physical separation of target sequences on an array to which enzymes are repeatedly added in cycles in order to detect base integration. As a base is added the sequence is revealed by using reversibly blocked nucleotide substrates or by adding only one type of nucleotide per cycle733. Another approach is known as sequencing by hybridisation which involves many probes fixed on an array to which the target DNA sample is added. The fixed probes may be DNA from a reference genome or a known sequence and the target DNA then binds to complementary probes on the array allowing the sequence of the target DNA to be deciphered without the need for library construction734. Microelectrophoretic methods are based on the dideoxy method utilised in Sanger sequencing but on a much smaller scale in order to retain the accuracy and read length of Sanger sequencing while allowing parallelisation in order to reduce time and costs735. Several real-time single-molecule sequencing approaches are also available in order to allow ultra-fast DNA sequencing in which the sequencing is detected as soon as the nucleotide is incorporated. Such methods include nanopore sequencing which involves analysis of each base in a sequence of nucleic acids as it bases through a membrane known as a nanopore. Another method for real-time sequencing involves detection of nucleotides as they are incorporated by fluorescence resonance energy transfer or with zero mode waveguides736,737.

1.7.2 Template Preparation Methods Prior to sequencing many NGS protocols require a library to be prepared from random fragments of DNA and then amplified by polymerase chain reaction (PCR). During library preparation the DNA sample is randomly fragmented followed by ligation of short double stranded pieces of synthetic DNA known as adaptors which allow the unknown DNA sequence to bind to a complementary counterpart738. The steps involved in library preparation can be seen in Figure 1.22.

91

Figure 1.22| Overview of Library Preparation for NGS739. DNA is first fragmented enzymatically or by sonication and adapters are ligated to the fragments using DNA ligase. These adapters allow the fragment to bind to a complementary counterpart. One end of the adaptor is 'sticky' whilst the other is 'blunt'. The blunt end is joined to the DNA using DNA ligase. In order to prevent base pairing between molecules and dimer formation, a phosphate is removed from the sticky end of the adapter to create a 5′-OH end and the DNA ligase is unable to form a bridge between the two termini.

In order to accurately detect the signal produced by the sequencing reaction the library fragments must be amplified which involves clustering fragments into PCR colonies containing multiple copies of each library fragment which then undergo PCR to produce large numbers of DNA clusters. Emulsion PCR involves the formation of microwells each containing a bead and a strand of DNA740,741. A PCR reaction takes place in each well in order to denature the library fragments and allow the reverse strand to attach to each bead which is then amplified. The original reverse strand then denatures and is released from the bead, it then re-attaches to the bead to give two separate stands which are then amplified so two DNA strands are attached to the bead. This process is repeated for 30 – 60 cycles to produce

92 clusters of DNA all with the same sequence740–742. This method has received some criticism due to the time taken and number of steps involved and also due to the fact that only around two thirds of the microwells will actually contain one bead, despite these criticisms this methods is extensively used in many NGS platforms743,744. The main steps involved in emulsion PCR are summarised in Figure 1.23.

Figure 1.23| Library Amplification by Emulsion PCR745. Fragmented DNA templates are ligated to adapters and captured in aqueous droplets called micelles along with a bead with complementary adapters, deoxynucleotides (dNTPs), primers and DNA polymerase. PCR is performed within the micelle, covering each bead with thousands of copies of the same DNA sequence.

Another commonly used method for library amplification is bridge PCR (Figure 1.24) which uses a flow cell coated with primers complementary to those on the DNA library fragment746,747. DNA is randomly attached to the surface of the flow cell and then exposed to reagents to stimulate polymerase-based extension. Nucleotides and enzymes are then added which allow the free end of the single strands of DNA to attach to the flow cell via the complementary primers results in bridge structures. Enzymes then interact with these bridges to denature them and leave two single stranded DNA fragments attached to the flow cell in close proximity. This process is repeated to produce clonal clusters of localised strands with identical sequences 746,747.

93

Figure 1.24| Library Amplification by Bridge PCR733. Fragmented DNA is ligated to adapter sequences and bound to a primer immobilized on a solid structure, such as a patterned flow cell. The free end of the DNA can interact with other nearby primers, forming a bridge structure. PCR is used to create a second strand from the immobilized primers, and unbound DNA is removed. Amplification products remain attached near the point of origin. Following the PCR, each cluster contains ~1,000 copies of a single member of the template library.

Bridge amplification and emulsion PCR are the most widely used methods to prepare templates for NGS reactions which, however other methods are also available such as the use of gridded rolling circle nanoballs which involves amplification of a population of single DNA molecules by rolling circle amplification which are then captured on a grid of spots748. This method is employed by the BGISEQ-500 platform developed by the Beijing Genomics Institute749.

1.7.3 Sequencing Methods and Platforms The first commercially available second generation sequencing platforms was developed by 454 Life Sciences (later purchased by Roche) and utilised a method of sequencing known as pyrosequencing750. This approach made use of the SBS principle in which a single stranded DNA primer hybridises to the end of the strand and this is sequentially exposed to the four different dNTPs. When the correct dNTP binds with the strand a pyrophosphate is released which is converted to ATP in the presence of ATP sulfurylase and adenosine in a reaction which produces light which can be detected and the relative intensity is proportional to the amount of base added750. The Ion torrent platform utilises SBS to synthesize a DNA strand, complementary to the target strand, one base at a time. Each time a nucleotide is added a hydrogen ion is released which is detected by a semiconductor chip and the change in signal

94 intensity will reflect the amount of nucleotide added751,752. Sequencing by Oligonucleotide Ligation and Detection (SOLiD) is a method used by Applied Biosystems753,754. This method makes use of beads with immobilised adapters conjugated to the target sequence which are then deposited onto a glass surface on which a primer is hybridised, and the beads are then exposed to a library of 8-mer probes. DNA ligase attaches these probes to complimentary target sequences and a phosphorothioate linkage between bases 5 and 6 allows the cleavage of a fluorescent dye from the fragment and this fluorescence can be measured. This is repeated many times with shorter primers used each time to ensure the whole target is sequenced. This approach is the most accurate of second generation platforms and is also relatively inexpensive, however due to short read length it is only suitable for certain applications754. Reversible terminator sequencing is an approach used by Illumina which uses modified nucleotides to allow reversible termination of primer extension unlike Sanger sequencing in which primer extension is terminated irreversibly by dideoxynucleotide755–757. This approach uses bridge PCR for amplification of library fragments and may use either 3′- O-blocked reversible terminators or 3′-unblocked reversible terminators755.

The main drawback of many of these NGS methods are the short read lengths and the time required for library preparation. Third generation sequencing methods have been developed in order to remove the need for clonal amplification, reduce errors due to PCR, simplify library preparation and produce longer read lengths with higher throughput sequencers714. One example of this would be the single-molecule real-time sequencing approach used by Pacific Biosciences (PacBio) which involves attachment of single polymerase molecules to a solid support onto which a primed template can bind allowing larger DNA molecules to be sequenced in real-time with less fragmentation required758. The Illumina TruSeq Synthetic Long-Reads platform utilises clonal amplification and barcoding of ~10kbp DNA molecules before sequencing with a short read instrument. These barcoded short reads are then used to produce synthetic long reads with high accuracy that can be used for phasing analyses and assembly without error correction759. This method relies on long-range amplification and as the reads are synthetically generated, the available read lengths are shorter than other approaches, and are prone to termination and potential biases in regions with high GC content or tandem repeats714. According to the Illumina website the TruSeq Synthetic Long- Read DNA library prep kit has been discontinued and a solution is not currently available760. Oxford Nanopore Technologies have also developed a single molecule approach utilising silicon-based nanopores and is capable of measuring small disruptions to an electrical current as DNA molecules pass through the pore761. Various high through put sequencing platforms

95 are available from different manufacturers and the choice of platform will depend on the desired application, available space and budget amongst other considerations. Some of the most popular manufacturers, common platforms and instrument specifications can be seen in Table 1.7.

Table 1.7| Comparison of Next-Generation Sequencing Instrument Specifications762 Manufacturer Instrument Maximum Maximum Read length Sequenc reads per output range ing run run (gigabases) time (hours) Illumina Iseq 100 4 M 0.144 36-bp SE 9 0.2 50-bp SE 9 0.3 75-bp SE 10 8 M 0.6 75-bp PE 13 1.2 150-bp PE 17.5 MiniSeq 25 B 1.65 – 1.875 75-bp SE 7 50 B 3.3 – 3.75 75-bp PE 13 6.6 – 7.5 150-bp PE 24 MiSeq 25 B 3.3 – 3.8 75-bp PE 21 50 B 13.2 – 15 300-bp PE 56 NextSeq 0.4 B 25 – 30 75-bp PE 11 550 0.8 B 50 – 60 75-bp PE 18 100 – 120 150-bp PE 26 HiSeq 3000 2.5 B 105 – 125 50-bp PE 24 – 84 5 B 325 – 375 75-bp PE 650 – 750 150-bp PE HiSeq 4000 5 B 210 – 250 50-bp SE 24 – 84 10 B 650 – 750 75-bp PE 1 300 – 1 500 150-bp PE HiSeqX 6 B 1,600 – 1,800 150-bp PE 72 NovaSeq 2.6–3.2 B 134 – 500 50-bp PE 13 – 6 6000 6.6-3.2 B 333 – 1 250 100-bp PE 19 – 36 16-20 B 1 600 – 3 000 150-bp PE 25 – 44 Ion Torrent Ion 510 2-3 M 0.3 – 0.5 200 bp 2.5 – 4 Chip 0.6 – 1.0 400 bp Ion 520 4-6 M 0.6 – 1.0 200 bp Chip 3-4 M 1.2 – 2.0 400 bp Ion 530 15-20 M 3 – 4 200 bp Chip 6 – 8 400 bp 9-20 M 1.5 – 4.5 600 bp Ion 540 60-80 M 10 – 15 200 bp Chip Ion 550 100-130 M 20 – 25 200 bp Chip

96

BGI MGISEQ- 300 M 60 50-bp SE/PE, 48 200 100-bp SE/PE BGISEQ- 1,300 M 520 50-bp SE/PE, 45 – 213 500 100-bp SE/PE MGISEQ- 1 800 M 1 080 50-bp SE/PE, 48 2000 100-bp SE/PE, 150- bp PE, 300- bp SE MGISEQ-T7 - 6 000 150 bp PE 24 PacBio Sequel 0.5 M 20 10–30 kb on 10 – 20 average Nanopore MinION Mk 0.5 M 50 Hundreds to Up to 72 1B thousands of GridION X5 2.5 M 250 kb Up to 72 PromethIO 375 M 15 000 Up to 64 N (Beta) B, billion; bp, base pairs; kb, kilobase pairs; M, million; PCR, polymerase chain reaction; PE, paired-end reads; SE, single-end reads. aHigh-Output Kit specifications provided (Mid-Output Kit also available for 150 base pair paired-end reads); bDiagnostic instrument version also available; cMiSeq Reagent Kit v3 specifications provided (v2, Micro and Nano Kits also available); dSingle HiSeqX instrument specifications provided. This instrument is only available as clusters of 5 or 10 instruments; eSpecifications provided for a dual flow cell; fSpecifications provided refer to a chip rather than an instrument: Ion Gene Studio S5 instrument compatible with Ion 510 and Ion 540 chips, Ion Gene Studio S5 Plus and Prime instruments compatible with Ion 510 and Ion 550 chips; gReported release date in 2019; hSpecifications provided for one SMRT Cell. Up to 12 SMRT Cells can be run simultaneously.

Sequencing methods based on PCR are unable to sequence modified DNA bases and methylated cytosine such as 5-methylcytosine and 5-hydroxymethylcytosine are both treated as cytosine by the enzymes involved in PCR therefore losing information regarding epigenetic modifications. Bisulfite sequencing allows this information to be retained by making use of the difference in reactivity of cytosine and 5-methylcytosine when treated with bisulfite. Bisulfite deaminates cytosine to form uracil (which is read as T when sequenced), whereas 5-methylcytosine is unreactive (read as C). By carrying out two sequencing runs in parallel, one with bisulfite treated DNA and one without, the amount of methylated cytosines can be determined from the difference between the outputs of the two runs763. Another important epigenetic modification is 5-hydroxymethylcytosine which reacts with bisulfite to form cytosine-5-methylsulfonate (which reads as C when sequenced). This means that bisulfite sequencing cannot be used as a true indicator of methylation alone. This issue can be overcome by using oxidative bisulfite sequencing which involves chemical oxidation of 5- hydroxymethylcytosine to 5-formylcytosine using potassium perruthenate prior to bisulfite treatment. During bisulfite treatment 5-formylcytosine is deformylated and deaminated to

97 form uracil. In this method three separate sequencing runs are necessary to distinguish cytosine, 5-methylcytosine and 5-hydroxymethylcytosine764.

High-throughput sequencing platforms can also be used to determine the sequence of RNA molecules using deep-sequencing technologies, this is commonly known as RNA-Seq765. Prior to sequencing the RNA is converted to a library of cDNA fragments with adaptors attached to one (single-end sequencing) or both ends (paired-end sequencing). Depending on the platform, amplification may or may not be required and each molecule is then sequenced to give short sequences typically 30–400 bp, depending on the technology used. Any high- throughput sequencing technology766 can be used for RNA-Seq, and protocols have been developed by Illumina767,768, Applied Biosystems SOLiD769, Roche 454 Life Science770,771 and Helicos Biosciences772. This approach can be used for analysis of gene expression, to identify alternative splicing, sequencing miRNAs, sequencing non-coding RNA, sequencing anti-sense RNA and single cell sequencing726.

1.7.4 Applications of Next Generation Sequencing The use of high throughput sequencing technology has allowed researchers to collect vast quantities of genomic sequencing data. This technology has an abundance of applications, such as: diagnosing and understanding complex diseases; whole-genome sequencing; analysis of epigenetic modifications; mitochondrial sequencing; transcriptome sequencing; and exome sequencing. As costs of sequencing continue to fall, the use of high throughput techniques is becoming more widespread and there are a number of important considerations which will come with this such as computational challenges in processing and storing the huge amount of data produced from sequencing experiments as well as ethical issues which arise when working with DNA such as ownership of the data, incidental findings and data security issues713,773. The use of this technology has had a major impact on many disciplines of genetics for example in evolutionary biology the use of NGS has drastically increased the amount of DNA sequence data available from extinct organisms therefore providing better insight into the evolutionary history of many organisms774,775 Another area which has benefited from the use of this technology has been in forensic science for short tandem repeat analysis when DNA samples may be limited or of poor quality, for identifying different individuals or species when a sample contains DNA from multiple sources, for microbiological analysis, ancestry and phenotype inference and also in Y chromosome and mitochondrial whole-genome studies776–778. In the context of kidney disease, the identification of causative genes has often been difficult due to the complex nature of this

98 disease which means that mutations often occur in more than one gene. The use of NGS may help to overcome some of these issues and allow investigation of genetic modifiers which may influence disease aetiology and progression779–782. Before NGS can become a routine clinical tool it is important to address issues such as integration of data obtained from NGS into a patients clinical records and combine this with other omics data and enable a precision medicine approach to become routine practice for treatment of many diseases783.

1.8 Project Rationale

There is a clear genetic component to DKD as evidenced by family-based association studies, the fact that DKD develops in only a subset of those with DM and the identification of several susceptibility loci with GWAS. However, the loci identified by GWAS often lack reproducibility which may indicate that there is more to DKD than the nuclear genetics alone, with evidence to suggest that mitochondria and epigenetics may also play a role in the development of this disease. There is a clear need to develop novel biomarkers for DKD with higher sensitivity and specificity in order to better understand the molecular mechanisms, to allow multiparametric risk assessment of kidney injury and develop effective interventional strategies in future.

1.8.1 Hypotheses This work in this thesis aims to investigate mtDNA and nuclear genes required for mitochondrial function in order to identify variants which may be associated with phenotypes related to DKD in individuals with T1DM. GWAS will be used to investigate SNPs which may be used to predict risk of DKD development and disease susceptibility. Mitochondrial genetic and epigenetic features of DKD will also be explored using NGS with the aim of establishing a workflow for sequencing and analysing NGS data derived from mtDNA. The workflow will also be evaluated, and the feasibility of a large-scale sequencing study will be assessed with the aim of applying this to future studies investigating mitochondrial genetic and epigenetic features of DKD.

99

Chapter 2| Mitochondrial-Genome-Wide Association Studies

2.1 Introduction

Significant technological advancements since the early 2000’s led to the development of genome-wide association studies (GWAS) to help determine genetic risk factors associated with common diseases having complex genetic aetiologies547,592. Identified risk alleles can prove useful in designing experiments to understand the biological mechanisms of disease. Some of the associated DNA variants may help stratify populations based on genetic risk of disease and potentially predict responses to drug treatment710,784. The use of this technology has led to the discovery of several candidate loci and genetic variants associated with diabetic kidney disease (DKD) in both type 1 diabetes mellitus (T1DM)41,42,45,785,786 and type 2 diabetes mellitus (T2DM)40,287,787,788. From the results of the aforementioned GWAS investigating T1DM and T2DM it is evident that multiple genes are involved in DKD susceptibility, however the variants identified to date have only modest effects on DKD risk and likely represent only a small proportion of the full range of loci which contribute to the heritability of DKD43.

2.1.1 GWAS in Diabetic Kidney Disease Research The first GWAS investigating DKD was published in 2003 using less than 100,000 gene-based SNPs and only 188 T2DM patients (94 cases with either proteinuria or ESRD and 94 normoalbuminuric controls) in the discovery phase and these were then tested in a replication cohort with 732 participants (466 additional cases and 266 additional controls)789. This study identified SLC12A3, a thiazide-sensitive Na–Cl co-transporter, as a potential candidate gene for DKD; however, this was not found to be statistically significant after adjustment for multiple comparisons. The same research group completed another GWAS which identified a common variant in the ELMO1 gene that was also found to be associated with DKD in the initial analysis but did not remain significant in meta-analysis in subsequent studies287,288,682.

A larger scale GWAS investigating DKD was performed in the USA Genetics of Kidneys in Diabetes (GoKinD) study cohort examining more than 2.4 million SNPs (~360,000 genotyped SNPs and 2.1 million imputed SNPs) in 820 case subjects (284 with proteinuria and 536 with ESRD) and 885 control subjects with T1DM785,786,790. The strongest association identified was found at the FRMD3 gene (P-value=5.0×10−7). Another three genomic regions near the CHN2/CPVL, CARS, and MYO16/IRS2 genes were also reported as associated with DKD. Other

100 studies have also reported associations with DKD-related phenotypes at FRMD3, CARS, and MYO16/IRS2 in T2DM among Caucasians791, African Americans685, and Asians792, which seems to indicate that these genes could be true susceptibility loci for DKD in T1DM and T2DM, however these did not remain significant in high powered meta-analysis682. As enthusiasm for GWAS continued to grow several large international consortia formed with the aim to dramatically increase sample sizes available for GWAS. The Genetics of Nephropathy: an International Effort (GENIE) consortium carried out the first meta-analysis GWAS of DKD in T1DM with 6,691 individuals from three collections in its discovery phase; the All Ireland / Warren 3 GoKinD UK Collection (UK-ROI), the Finnish Diabetic Nephropathy Study (FinnDiane), and the USA GoKinD study41. The replication stage consisted of 5,873 individuals from nine additional cohorts. This combined meta-analysis found two signals at the AFF3 and RGMA/MCTP2 genes to be significantly associated (P-value < 5 × 10− 8) with ESRD risk as well as a novel variant on chromosome 2q (rs4972593) between the SP3 and CDC7 genes which was identified as a gender-specific variant associated with ESRD in female patients with T1DM793. It should be noted that this larger study with updated quality control (QC) did not support the findings of the U.S. GoKinD GWAS and further investigation of the ELMO1 locus and regions previously associated with DKD were not found to reach genome wide significance682 and similar inconsistencies in GWAS findings have also been reported in other investigations of DKD293.

The GENIE consortium have also investigated quantitative DKD-related traits and found a strong association between variants in GLRA3 and albuminuria, with confirmation in a collection of 259 independent Finnish individuals with T1DM and meta-analysis794. More recently the GENIE consortium expanded their meta-GWAS to include a more comprehensive set of genetic variants (~ 37 million SNPs), a larger number of subjects (12,540 individuals), and analyses across a range of albuminuria- and eGFR-based sub-phenotypes, however no variant was found to reach genome-wide significance45. The Family Investigation of Nephropathy and Diabetes (FIND) carried out the only trans-ethnic GWAS of DKD, which included more than 13,000 unrelated T2DM individuals of European–American, African– American, Mexican–American, or Indian American ancestry788. From the ethnic specific analysis of Mexican–Americans in this study a significant SNP was found at rs12523822 on chromosome 6q (P-value = 5.7 × 10− 9) between the SCAF8 and CNKSR3 genes and this was also strongly associated with DKD in a trans-ethnic meta-analysis across all ethnic groups. An association at the APOL1/MYH9 locus has also been reported to be associated with DKD in African-American individuals with T2DM. APOL1 is a well-known susceptibility locus for ESRD

101 in those of African ancestry and therefore this association is likely due to the inclusion of unrecognized non-DKD among cases in this analysis795–797.

Another large-scale GWAS was performed by the Surrogate markers for Micro- and Macro- vascular hard endpoints for Innovative diabetes Tools (SUMMIT) consortium40. This aimed to identify genetic and non-genetic markers for DKD in T2DM through the investigation of eight DKD phenotypes and included more than 5,717 T2DM participants in its discovery phase and up to up to 26,827 subjects with T2DM in the replication group40. These analyses identified a novel locus near GABRR1 (P-value 4.5 × 10− 8) associated with microalbuminuria. Additionally, joint analysis of T1D and T2D subjects with eGFR based on serum creatinine found an association between this phenotype and variants near UMOD (P-value = 4.4 × 10− 12), which had been previously reported by Pattaro and colleagues in > 11,000 diabetic individuals included in the CKDGen Consortium798. The major findings from GWAS for DKD associations in individuals with T1DM and T2DM is summarised in Table 2.1. A major confounder in DKD GWAS comes from the classification of DKD. Almost all individuals with T1DM and DKD display classical histological characteristics of DKD on renal biopsy, whereas only 30–50% of individuals with T2DM and albuminuria have DKD on biopsy, while the remainder may have kidney disease associated with hypertension, obesity, ageing and atherosclerotic vascular disease799. These observations, along with the association at the APOL1/MYH9 in DKD and non-diabetic kidney disease from the FIND study highlight the differences in genetic architecture between T1DM and T2DM788,795–797.

102

Table 2.1| Summary of Major Loci Identified Through GWAS of DKD and DKD-Related Traits (P-value < 1.0 × 10− 5)623 Chr Gene SNP Position P-value Effect size DKD Diabetes Population Study design N samples N N Ref (OR/ phenotype Type samples’ markers beta) discovery cohort 2 AFF3 rs7583877 100,460,654 1.2 × 10− 8 1.29 ESRD vs. non- T1DM Caucasian Case–control 11,847 6,652 2.4 M 41 ESRD HS6ST1 rs13427836 129,027,961 6.3 × 10− 7 0.19 uACR T2DM Caucasian Quantitative 7,399 5,509 2.2 M 787 SSB rs1974990 170,646,916 1.4 × 10− 6 3.17 eGFR T2DM Caucasian Quantitative 14,828 13,158 37 M 40 /Asian ERBB4 rs7588550 213,168,768 2.1 × 10− 7 0.66 PR/ESRD vs. T1DM Caucasian Case–control 11,847 6,231 2.4 M 41 NA COL4A3 rs55703767 227 256 385 8.19×10-11 0.78 DKD (MA or T1DM Caucasian Case–control 17,024 - 49 M 687 ESRD) European 9.68×10-9 0.84 MA, 19,300 - Microalbumin uria and ESRD vs NA 5.30×10-9 0.76 CKD and DKD Quantitative 14,663 - 9.38×10-8 0.76 MA Case-Control 14,875 - 4 PTPN13 rs61277444 87,529,078 1.9 × 10− 6 1.41 ESRD vs. non- T1DM Caucasian Case–control 12,540 5,150 37 M 45 ESRD 6.0 × 10− 6 1.42 ESRD vs. no T1DM Caucasian Case–control 12,540 3,406 37 M 45 DKD 6 GABRR1 rs9942471 89,948,232 4.5 × 10− 8 1.25 MA vs. NA T2DM Caucasian Case–control 4,801 4,227 37 M 40 SCAF8/ rs955333 154,947,408 1.3 × 10− 8 0.73 PR/ESRD vs. T2DM Trans- Case–control 13,736 6,197 906,600 788 CNKSR3 NA ethnic rs12523822 154,954,420 5.7 × 10− 9 0.57 PR/ESRD vs. T2DM American Case–control 2,154 857 906,600 788 NA Indians 7 CHN2/ rs39059 29,255,470 5.0 × 10− 6 1.39 PR/ESRD vs. T1DM Caucasian Case–control 1,705 - 359,193; 785,7 CPVL NA 2.4 M 86

103

ELMO1 rs741301 36,917,995 8.0 × 10− 6 2.67 PR/ESRD vs. T2DM Japanese Case–control 920 188 81,315 287 NA CNTNAP2 rs1989248 148,141,082 6.0 × 10− 7 1.26 MA/PR/ESRD T1DM Caucasian Case–control 12,540 3,135 37 M 45 or eGFR < 45 vs. NA 1.8 × 10− 6 1.29 ESRD vs. no T1DM Caucasian Case–control 12,540 3,406 37 M 45 DKD PRKAG2 rs10224002 151,415,041 2.7 × 10− 8 2.01 eGFR T2DM Caucasian/ Quantitative 17,696 13,158 37 M 40 Asian 9 FRMD3 rs10868025 86,164,176 5.0 × 10− 7 1.45 PR/ESRD vs. T1DM Caucasian Case–control 1,705 - 359,193; 785,7 NA 2.4 M 86 10 NRG3 rs72809865 83,291,690 7.4 × 10− 6 1.17 MA/PR/ESRD T1DM Caucasian Case–control 12,540 5,150 37 M 45 vs. NA 10 SORBS1 rs1326934 97,284,081 5.7 × 10− 7 0.84 PR/ESRD vs. T1DM Caucasian Case–control 7,801 1,462 11.1 M 42 NA 11 CARS rs451041 3,060,725 3.1 × 10− 6 1.36 PR/ESRD vs. T1DM Caucasian Case–control 1,705 – 359,193; 785,7 NA 2.4 M 86 11 RAB38 rs649529 88,008,251 5.8 × 10− 7 -0.14 uACR T2DM Caucasian Quantitative 7,787 5,825 2.2 M 787 13 MYO16/ rs1411766/ 110,252,160 1.8 × 10− 6 1.41 PR/ESRD vs. T1DM Caucasian Case–control 1,705 – 359,193; 785,7 IRS2 rs17412858 /110,252,60 NA 2.4 M 86 8 15 RGMA/ rs12437854 94,141,833 2.0 × 10− 9 1.80 ESRD vs. non- T1DM Caucasian Case–control 11,847 6,652 2.4 M 41 MCTP2 ESRD 16 UMOD rs11864909 20,400,839 2.1 × 10− 12 2.22 eGFR T2DM Caucasian Quantitative 16,304 13,158 37 M 40 22 MYH9 rs5750250 36,708,483 7.7 × 10− 8 1.27 PR/ESRD vs. T2DM African Case–control 6,108 3,221 906,600 788 NA American 22 APOL1 rs136161 36,657,432 5.2 × 10− 7 1.36 PR/ESRD vs. T2DM African Case–control 6,108 3,221 906,600 788 NA American Chr, chromosome; NA, normoalbuminuria; MA, microalbuminuria; PR, proteinuria; ESRD, dialysis/transplant/eGFR < 15 ml/min/1.73 m2; uACR urinary albumin-to-creatinine ratio; eGFR, estimated glomerular filtration rate *Positions reported for all SNPs are relative to the human reference sequence genome (NCBI Build GRCh37/hg19)

104

2.1.2 Targeted GWAS Approach As outlined in Table 2.1, several studies show evidence for involvement of multiple genes in DKD susceptibility. Those which have been identified to date only show modest effects on DKD risk and together these only represent a small proportion of the heritability of DKD which suggests that more genetic variants associated with DKD are yet to be identified623. A typical GWAS offers a hypothesis-free method to analyse SNPs in a population to identify any variations or genes that may be associated with a phenotype or trait of interest. A GWAS approach is most applicable if the susceptibility loci have a major effect on the phenotype, however, for a heterogeneous disorder in which many genes contribute to the phenotype it may be more suitable to investigate a subset of genes in order to overcome issues with Bonferroni correction and cohort size800,801. Hypothesis-based targeted analyses of GWAS data have been used to investigate SNPs associated with several disorders such as targeting polymorphisms in key pulmonary pathways associated with acute respiratory distress syndrome802; examining genes with known roles in corneal development to investigate central corneal thickness801; studying known cancer susceptibility genes to investigate association of multiple cancers803; assessing pancreatic islet related genes in a mouse model of T2DM804; and exploring apoptosis related genes to investigate SNPs associated with mitochondrial diabetes805. All these studies have investigated common genetic risk factors in the nuclear-encoded genome; however, the mitochondrial DNA (mtDNA) has not been investigated to the same degree. Variation in mtDNA has been associated with non- Mendelian, non-maternally inherited, complex autoimmune disorders806; peripheral arterial disease and venous thromboembolism807; ulcerative colitis808,809; age-related macular disorder810; post-traumatic stress disorder811; multiple sclerosis812; colorectal cancer813; obesity814; and CKD815. A targeted GWAS of SNPs in nuclear encoded mitochondrial genes (NEMGs) and mtDNA performed by Swan and colleagues identified three independent signals in NEMGs associated with DKD in a White European population from the GENIE consortium529. Since this study, the GENIE data has been imputed again based on an updated human genome build and significantly more genes have been found to be associated with mitochondrial function, which has stimulated an interest in further investigating the association with mtDNA and NEMGs in T1DM.

105 2.2 Methods and Materials

2.2.1 Gene Selection Several online databases and literature resources were searched for nuclear genes encoding proteins associated with mitochondrial function (NEMGs) (Appendix 2). The databases queried were Mitoproteome816–819, MitoMiner820, MitoMap821, Ensembl822 and UniProt823. Genes extracted from individual sources were reviewed and duplicates were excluded. Gene names were then screened to ensure there was no duplication between the database searches and literature searches. Genes were annotated with their official HUGO Gene Nomenclature Committee (HGNC) gene symbol824 using Ensembl BioMart release 67 (May 2012) based on the February 2009 Homo sapiens high coverage assembly GRCh37 from the Genome Reference Consortium822. Any genes not found in the BioMart822 search were manually annotated according to their official HGNC gene symbol824. The list of genes was then checked again for duplicates based on HGNC symbols, known pseudonyms and gene positions. PLINK825 utilises variant call format (VCF) files to store information regarding the position and reference allele of each SNP in the genome and it is used to map the SNP to their chromosomal location. In order to ensure compatibility with VCF files in PLINK825 the gene start and end sites from assembly GRCh37 were obtained from BioMart822. Only genes found in autosomes were included in the analysis. Any genes on sex chromosomes, non- human genes, or bacterial artificial chromosomes were excluded from the final list of autosomal genes encoding proteins required for mitochondrial function.

2.2.2 Study Populations The GENIE consortium assembled multiple study populations used for these analyses to help identify genetic markers associated with DKD in persons with T1DM. The current study focuses on the role of mitochondrial genetics in DKD development, assessing both NEMGs and mtDNA variation for association with DKD.

2.2.2.1 UK-ROI Collection The initial analysis was carried out using data obtained from the All Ireland, Warren 3, Genetics Of Kidneys in Diabetes UK (UK-ROI) case-control collection826. This study population included white individuals diagnosed with T1DM before 31 years of age, with parents and grandparents born in the UK and Ireland and full details of these have been previously described826. This collection consisted of DKD cases which were defined as individuals with persistent proteinuria (>500 mg/24 h) developing more than 10 years after diabetes

106 diagnosis, hypertension (>135/85 mmHg and/or treatment with antihypertensive medication), and retinopathy. Samples were classified as ESRD cases if the individual required renal replacement therapy (RRT) defined as dialysis or transplantation. Controls were those with no evidence of DKD and this included participants with persistent normal urine albumin excretion rate (AER; 2 out of 3 urine albumin to creatinine ratio [ACR] measurements <20 µg of albumin/mg of creatinine) despite duration of T1D for at least 15 years, while not taking any antihypertensive medication, and having no history of treatment with angiotensin- converting enzyme (ACE) inhibitors. Genotyping of the UK-ROI data and Diabetic Nephropathy Collaborative Research Initiative (DNCRI) collections were performed as described by Salem and colleagues294. The ESRD phenotype was used to gain insight into the influence of mitochondrial genetics in individuals who go on to develop ESRD because of DKD. Normoalbuminuric individuals with T1DM were used as controls to investigate both the DKD and ESRD phenotypes. Glomerular filtration rate was also investigated as a quantitative phenotype as abnormal eGFR is a key characteristic of kidney disease and is used. Details of the clinical characteristics of the original UK-ROI collection are outlined in Table 2.2.

Table 2.2| Clinical Characteristics of UK-ROI Samples Used for Targeted GWAS Characteristic DKD Case T1DM Controls Number of individuals 712 751 Gender (% male) 58.7% 43% Age at recruitment (years)* 46 (± 11) 42 (± 13) Duration of diabetes (years)* 33 (± 10) 27 (± 9) eGFR* 39.75 (± 20.74) 73.11 (± 22.59) *Data are mean ± SD. *Where available

2.2.2.2 DNCRI Collection After initial analysis of the UK-ROI cohort, any SNPs determined to be suggestive of significance were followed-up in the full DNCRI data set. The full DNCRI collection included up to 19,406 patients with T1DM and of European origin from 17 cohorts previously described by Salem and colleagues294 and these included: the Austrian Diabetic Nephropathy Study (AusDiane); the Coronary Artery Calcification in Type 1 Diabetes (CACTI); the Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and Complications (DCCT/EDIC); Pittsburgh Epidemiology of Diabetes Complications Study (EDC); The Finnish Diabetic Nephropathy (FinnDiane) Study; French and Belgian subjects from the Genetics of Diabetic Nephropathy (GENEDIAB) and Genesis studies; Genetics of Kidneys in Diabetes US Study (GoKinD) from George Washington University (GWU-GoKinD); patients from the Joslin Kidney Study (JOSLIN); individuals with T1D from Italy; The Latvian Diabetic Nephropathy

107 Study (LatDiane); The Lithuanian Diabetic Nephropathy Study (LitDiane); The Romanian Diabetic Nephropathy Study (RomDiane); The Scottish Diabetes Research Network Type 1 Bioresource (SDRNT1BIO); individuals with T1D from Steno Diabetes Center; individuals with T1D from Uppsala, Sweden; UK GoKinD, Warren 3 and All Ireland (UK-ROI) study; and The Wisconsin Epidemiologic Study of Diabetic Retinopathy (WESDR).

2.2.3 Targeted Mitochondrial Association Analysis of UK-ROI Data The data for this primary analysis was obtained from the UK-ROI dataset in which DNA samples were genotyped using the Omni1-Quad array (Illumina), imputed, and association analysis conducted for DKD and ESRD phenotypes41. For the current research, data from SNPs in NEMGs found on chromosomes 1 – 22 and directly genotyped data with SNPs found in mtDNA (Hardy–Weinberg equilibrium (HWE) quality control (QC) was not performed for mtDNA) was extracted and used for the association analyses. After QC, there was a total of 532,811 SNPs in 2,526 NEMGs that code for 22,713 transcripts and 85 SNPs in mtDNA.

Three distinct phenotypes were investigated in this study, and these were DKD, ESRD and eGFR previously described in Section 2.2.1.1. The samples for this analysis came from the UK- ROI collection and details of clinical characteristics can be in Table 2.2 and the samples remaining following QC can be seen in Section 2.3.2. Full details of the PLINK825 commands used can be found in Appendix 3 For qualitative phenotypes the basic allelic association test in PLINK825 was used to compare allele frequencies between affected and unaffected individuals. Logistic regression analysis was also performed adjusted for age, gender, duration of diabetes and population structure based on principal component analysis (PCA) from the original UK-ROI collection. Since eGFR is a quantitative variable, the association analysis for this phenotype used asymptotic significance values. All analyses were also adjusted for genomic inflation. Linear regression adjusted for age, gender, duration of diabetes and principal components. The clumping procedure in PLINK825 was used to assess linkage disequilibrium (LD) in top-ranked results and identify index SNPs. SNPs were said to be a part of the same haplotype if they were found to have an r2 value between 0.4 and 1.0, any SNP with an r2 > 0.8 with the index SNP were said to be in high LD and were deemed suitable for use as a proxy SNP to represent that haplotype in place of the index SNP. SNPs from the association analyses with NEMGs were deemed to be suggestive of significance if P < 1.0 x 10-5. In mtDNA, any SNPs with P < 0.05 were defined as having minimal evidence for association, SNPs with P < 0.01 were said to be associated with the phenotype and if P < 5.88 x 10-4 there was said to be stronger evidence for association between SNP and phenotype.

108 LocusZoom827,828 was used to generate regional association plots to visualise LD in top ranked SNPs based on hg19/1000 Genomes March 2012 EUR.

2.2.4 Follow Up in DNCRI Meta-Analysis Data Following the initial investigation of DKD, ESRD and eGFR in the UK-ROI cohort, association results were used for look-up in the much larger DNCRI cohort to determine if the SNPs deemed to be significant in the more localised UK-ROI population remained significant in a larger, more diverse population. Only SNPs found to be suggestive of significant association before and after adjusting for covariates in the UK-ROI population were included in the look- up for the full DNCRI collection. It should be noted that the DNCRI collection includes all UK- ROI participants but with additional participants, new genotyping using the HumanCore BeadChip (Illumina, San Diego, CA, USA) and updated imputation to approximately 49 million SNPs using 1,000 Genomes Project (phase 3 version 5) as a reference294. Genotyping of mtDNA was not possible for the full DNCRI dataset as mtDNA SNPs were not present on the original genotyping array and were therefore not available for all data sets. For the look-up of the DNCRI data, two covariate models were included. In the minimal adjusted model adjustments were made for age, sex, diabetes duration, centre effect (if necessary) and principal components (PC) (typically 2 – 10 PCs). In the maximal adjusted model, adjustment was also made for HbA1c and BMI. In both covariate models, the number of PCs was adjusted at the study level. The number of PCs used ranged from 2-10 depending on how many PCs it takes to control for sample structure. Positions reported for the DNCRI lookup are derived from Genome Reference Consortium Human Build 37 (GRCh37) and studies were excluded from the final meta-analysis if study imputation INFO score < 0.3 or minor allele count for either cases or controls < 10. Ten phenotypes were investigated in the DNCRI analyses and these are summarised in Figure 2.1. Seven of these were based on albuminuria status. Controls were normoglycaemic individuals and the phenotypes tested were macroalbuminuria with ESRD vs control (DN); macroalbuminuria only vs. control (macro); ESRD vs control (esrd); ESRD vs control and macroalbuminuria (esrdvall); ESRD vs macroalbuminuria (esrdvmacro); abnormal albuminuria and ESRD vs control (allvctrl); and microalbuminuria vs control (micro). The other 3 phenotypes were based on eGFR and these were CKD vs no CKD (ckd); ESRD vs no CKD (ckdextremes); and no CKD and those with eGFR < 45 ml/min AND albuminuria or ESRD vs no CKD and normal AER (ckddn). For all but two there were 17 studies included as described in Section 2.2.1.2. The phenotype group investigating ESRD vs macroalbuminuria did not include Austria, but the remaining studies were in the order previously stated. For the investigation of microalbuminuria compared to

109 control samples, there were only 13 studies included in the following order: AusDiane, CACTI, EDC, DCCT/EDIC, FinnDiane, GENEDIAB, JOSLIN, LatDiane, LitDiane, RomDiane, SDRNT1BIO, SWEDEN AND WESDR. Two adjustment models were used for the DNCRI meta-analysis: the minimum adjusted model which included age, sex, diabetes duration, centre effect and principal components; and the maximum adjusted model which included age, sex, diabetes duration, centre effect, principal components, HbAIc and BMI.

Figure 2.1| Phenotype Definitions for DNCRI Follow Up294 Numbers indicate the total number of cases (darker grey) and controls (lighter grey) included in the meta-analyses for each phenotype. ESKD defined as eGFR<15 ml/min per 1.73 m2 or undergoing dialysis or having renal transplant.

2.3 Results

2.3.1 Gene Selection A total of 2,448 unique genes were returned from database searches with a further 180 genes identified from literature searches (Appendix 4), totalling 2,628 NEMGs. This final gene list contained 2,526 autosomal genes coding for 22,713 transcripts and the selection process is summarised in Figure 2.2.

110 Gene Databases searched Genes identified through (number of genes total) = no. of database searching unique genes (n = 2,448) Mitoproteome (1,705) = 26 MitoMiner (2,343) = 684 MitoMap (88) = 3 Ensembl (1,667) = 57 UniProt (1,122) = 8 Total Unique = 778

Genes appearing in 2 or more databases = 1,670

Additional genes identified Literature databases searched = no. of through literature unique genes identified searching PubMed = 175 (n = 180) Total genes found = 2,628 Then searched: Web of Science = 4 unique Embase = 0 unique Scopus = 1 unique

Genes annotated using biomart From databases (n = 2,404) Genes not found in Biomart – to be From literature (n = 144) checked manually Total (n = 2,548) From data bases (n = 44) From literature (n = 36) Total (n = 80)

Genes manually annotated (n = 68) Genes missing after annotation (n = 9) Duplicates removed after Biomart annotation (n = 3) Sex chromosome genes excluded (n = 90)

Total number of unique autosomal NEMGs

Figure 2.2| Summary of Gene Selection Process This was used to identify nuclear genes encoding proteins associated with mitochondrial function. Further details of the databases and search terms used can be seen in Appendix 2 and 4.

111 2.3.2 Targeted Mitochondrial Association Analysis of UK-ROI Data There were three phenotypes being investigated in the initial association analysis of the UK- ROI data and the number of cases and controls available after QC can be seen in Table 2.3. Results from these analyses can be seen in Sections 2.3.2.1 – 2.3.2.3 and these are summarised in Section 2.3.2.4. Any SNPs found to be significant in any of the analyses in the UK-ROI data set were included in the look-up of the DNCRI meta-analyses data.

Table 2.3| Phenotypes Used for Targeted Mitochondrial Association Studies Samples were obtained from UK-ROI cohort after QC procedures. Phenotype Without covariates With covariates (Cases/Controls) (Cases/Controls) DKD 1,463 (712 / 751) 1,451 (700 / 751) ESRD 955 (204 / 751) 955 (204 / 751) eGFR 1,043 1,043 mtDNA DKD 1,458 (710 / 748) 1,447 (699 / 748) mtDNA ESRD 952 (204 / 748) 952 (204 / 748) mtDNA eGFR 1,043 1,043 DKD, diabetic kidney disease; ESRD, end-stage renal disease; eGFR, estimated glomerular filtration rate; mtDNA, mitochondrial deoxyribonucleic acid

QC was performed using PLINK and the SNPs remaining following this can be seen in Table 2.4.

Table 2.4| Summary of Quality Control From imputed data for nuclear genes associated with mitochondrial function and genotyped data for mitochondrial DNA obtained from UK- ROI participants Total SNPs in SNPs in or Missing Discor MAF SNPs Left Samples data set within 5kb Genotype dance <0.01 Left of Mito > 0.1 with genes HWE nDNA 47,139,222 2,880,249 0 843 2,346, 532,811 1,463 Chr 1 -22 595 mtDNA 339,845* 225 4 NA 140 85 1,458 *Total SNPs genotyped. SNP: single nucleotide polymorphism; kb: kilobase; nDNA: nuclear deoxyribonucleic acid; MAF: minor allele frequency; mtDNA: mitochondrial deoxyribonucleic acid

2.3.2.1 NEMGs Associated with DKD Phenotype in UK-ROI Ten SNPs in two different genes were found to be suggestive of significant association with DKD (Table 2.5). Without adjusting for covariates only one SNP was found to be suggestive of significant association with DKD and this was rs12503462 (P = 3.68 x10-6; OR: 2.29; 95% CI: 1.60 - 3.28) located in APBB2 on chromosome 4 which was also suggestive of significance

112 after adjusting for genomic inflation (P = 5.92 x10-6) but not after adjusting for covariates (P = 9.25 × 10-5). After adjusting for covariates, nine SNPs in CAND2 on chromosome 3 were found to be suggestive of significant association with DKD (Table 2.5). The index SNP in this 64 kb wide LD block was rs772751511 (P = 4.74 x10-6; OR: 0.2947; 95% CI: 0.1746 - 0.4973), spanning from chr3:12852546 – 12917127. Another SNP, rs149201029 (P = 7.76 x10-6; OR 0.20; 95% CI: 0.098 - 0.40), was found to be part of this haplotype according to the clump function in PLINK825, but was not found to be a suitable proxy SNP for rs772751511 based on the observed r2 value (r2 = 0.63) or according to the hg19/1000 Genomes Nov 2014 EUR population values in LocusZoom827,828. In Table 2.6, the MAFs observed in the dataset used in the current analysis are compared to both the global and regional MAFs for each SNP. None of the SNPs found to be suggestive of association with the DKD phenotype remained significant both before and after adjusting for covariates and therefore these were not carried forward to the DNCRI lookup.

113 Table 2.5| SNPs in NEMGs Associated with the Diabetic Kidney Disease Phenotype in UK-ROI Samples SNP Position Gene LD with index SNP MAF in GENIE UK-ROI Without Covariates UK-ROI With covariates rs12503462 4:41075687 APBB2 NA F_A = 0.06742 OR = 2.29 OR = 2.20 F_U = 0.03063 95% CI: 1.60 – 3.28 95% CI: 1.48 – 3.26 P = 3.68 x 10-6 P = 9.25 × 10-5 GC = 6.06 x 10-6 GC = 9.25 × 10-5 rs772751511 3:12862905 CAND2 r2 = 1.0 F_A = 0.01756 OR = 0.38 OR = 0.30 F_U = 0.04461 95% CI: 0.24– 0.61 95% CI: 0.18 – 0.50 P = 2.78 × 10-5 P = 4.74 x 10-6 GC = 3.28 × 10-5 GC = 4.74 x 10-6 rs11712981 3:12862471 CAND2 r2 = 0.989 F_A = 0.01756 OR = 0.39 OR = 0.30 F_U = 0.04394 95% CI: 0.25 – 0.62 95% CI: 0.18 – 0.51 P = 3.97 × 10-5 P = 6.53 x 10-6 GC = 4.66 × 10-5 GC = 6.53 x 10-6 rs573605571 3:12865647 CAND2 r2 = 0.989 F_A = 0.01826 OR = 0.40 OR = 0.31 F_U = 0.04461 95% CI: 0.25 – 0.63 95% CI: 0.18 – 0.52 P = 4.89 × 10-5 P = 7.74 x 10-6 GC = 5.72 × 10-5 GC = 7.74 x 10-6 rs147776883 3:12867149 CAND2 r2 = 0.967 F_A = 0.01826 OR = 0.40 OR = 0.31 F_U = 0.04461 95% CI: 0.25 – 0.63 95% CI: 0.18 – 0.52 P = 4.89 × 10-5 P = 7.68 x 10-6 GC = 5.72 × 10-5 GC = 7.68 x 10-6 rs6442335 3:12869690 CAND2 r2 =0.935 F_A - A = OR = 0.42 OR = 0.32 rs76813531 3:12869474 0.01896 95% CI: 0.26 – 0.65 95% CI: 0.19 – 0.52 rs75307333 3:12871017 F_U - A= P = 8.41 × 10-5 P = 8.59 x 10-6 rs79510614 3:12872146 0.04461 GC = 9.75 × 10-5 GC = 8.59 x 10-6 rs149201029 3:12868572 CAND2 r2 = 0.634 F_A = 0.009129 OR = 0.29 OR = 0.20 F_U = 0.03063 95% CI: 0.16 – 0.54 95% CI: 0.10 – 0.40 P = 3.55 × 10-5 P = 7.76 x 10-6 GC = 4.18 × 10-5 GC = 7.76 x 10-6 SNP, single nucleotide polymorphism; LD, linkage disequilibrium; rs, reference SNP cluster ID; MAF, minor allele frequency; F_A, frequency in affected; F_U, frequency in unaffected; OR, odds ratio; CI, confidence interval; GC, adjusted for genomic inflation; GENIE, Genetics of Nephropathy - an International Effort

114 Table 2.6| Minor Allele Frequencies in Various Populations for SNPs Associated with Diabetic Kidney Disease Phenotype SNP Position Gene Global MAF/MA count MAF in DKD GBR EAS EUR AFR AMR SAS rs12503462 4:41075687 APBB2 A=0.0152/76 (1000 F_A = 0.0674 G = 0.9341 G = 1 G = 0.9543 G= 0.9985 G = 0.9741 G = 0.9898 A1= A Genomes) F_U = 0.0306 A = 0.0659 A = 0 A = 0.0457 A= 0.0015 A = 0.9985 A = 0.0102 A2= G rs11712981 3:12862471 CAND2 A=0.0132/66 (1000 F_A = 0.0176 G = 0.9725 G = 0.9891 G = 0.9771 G= 0.9924 G = 0.9928 G = 0.9826 A1= A Genomes) F_U = 0.0439 A = 0.0275 A = 0.0109 A = 0.0229 A= 0.0076 A = 0.0072 A = 0.0174 A2= G rs149201029 3:12868572 CAND2 T=0.0044/22 (1000 F_A = 0.0091 C = 0.9780 C = 1 C = 0.9861 C=0.9985 C = 0.9957 C = 0.9969 A1= T Genomes) F_U = 0.0306 T = 0.0220 T = 0 T = 0.0139 T= 0.0015 T = 0.0043 T = 0.0031 A2= C rs772751511 3:12862905 CAND2 A=0.0030/375 F_A = 0.0176 N/A N/A N/A N/A N/A N/A A1= - (TOPMED) F_U = 0.0446 A2= A rs147776883 3:12867149 CAND2 A=0.0122/61 (1000 F_A = 0.0183 G = 0.9725 - = 0.9881 - = 0.9751 - = 0.9985 - = 0.9928 - = 0.9826 A1= GAGCTGG Genomes) F_U = 0.0446 GGACCCA A1 = 0.0275 A1 = 0.0119 A1 = 0.0249 A1 = 0.0015 A1 = 0.0072 A1 = 0.0174 GGGGA A2= G rs148715777 3:12865647 CAND2 -=0.0230/115 (1000 F_A = 0.0183 AT = 0.9725 - = 0.0099 - = 0.0249 - = 0.0386 - = 0.0187 - = 0.0164 A1= A Genomes) F_U = 0.0446 A = 0.0275 T = 0.9901 T = 0.9751 T = 0.9614 T = 0.9813 T = 0.9836 A2= AT rs6442335 3:12869690 CAND2 N/A F_A = 0.0190 G = 0.8736 - - - - - A1= A F_U = 0.0446 A = 0.0275 A2= G rs79510614 3:12872146 CAND2 T=0.0154/77 (1000 F_A = 0.0190 C = 0.9725 C = 0.9891 C = 0.9751 C = 0.9856 C = 0.9928 C = 0.9826 A1= T Genomes) F_U = 0.0446 T = 0.0275 T = 0.0109 T = 0.0249 T = 0.0144 T = T = 0.0174 A2= C 0.0072 rs76813531 3:12869474 CAND2 T=0.0142/71 (1000 F_A = 0.0190 C = 0.9725 C = 0.9891 C = 0.9751 C = 0.9902 C = 0.9928 C = 0.9826 Genomes) 0.0446 T = 0.0109 T = 0.0249 T = 0.0098 T = 0.0072 T = 0.0174 A1= T F_U = T = 0.0275 A2= C rs75307333 A1= C 3:12871017 CAND2 C=0.0154/77 (1000 F_A = 0.0190 T = 0.9725 T = 0.9891 T = 0.9751 T = 0.9856 T = 0.9928 T = 0.9826 A2= T Genomes) F_U = 0.0446 C = 0.0275 C = 0.0109 C = 0.0249 C = 0.0144 C = 0.0072 C = 0.0174 SNP, single nucleotide polymorphism; DKD, diabetic kidney disease; rs, reference SNP cluster ID; MAF, minor allele frequency; F_A, frequency in affected; F_U, frequency in unaffected; GBR, British in England and Scotland; EAS, East Asian; EUR, European; AFR, African; AMR, Ad Mixed American; SAS, South Asian

115 2.3.2.2 NEMGs Associated with ESRD Phenotype in UK-ROI Four SNPs in four different genes were found to be suggestive of significant association with ESRD (Table 2.7). Without adjusting for covariates three SNPs were suggestive of significant association with the ESRD phenotype; rs73314193 (P = 1.70 x 10-6; OR: 7.029; 95% CI: 2.786 - 17.74) located in PTPN21 on chromosome 14 which was also suggestive of association after adjusting for genomic inflation (P = 1.70 x10-6) but was not considered to be suggestive of significant association after adjusting for covariates (P = 2.273 × 10-4); rs185192029 (P = 3.30 x 10-6; OR: 5.895; 95% CI: 2.533- 13.72) located in NCOR2 on chromosome 12, however this was not suggestive of association after adjusting for genomic inflation (P = 3.10 x 10-5) or covariates (P = 1.379 × 10-4); and rs13132845 (P = 4.54 x 10-6; OR: 4.621; 95% CI: 2.258 - 9.457) located in PALLD on chromosome 4, however this did not reach the threshold for suggestive significance after adjusting for genomic inflation (P = 1.87 x 10-5) or covariates (P = 1.646 × 10-5). After adjusting for covariates, one SN�P was found to be suggestive of significant association with the ESRD phenotype, this was rs565709775 (P = 4.74 x10-6; OR: 2.406; 95% CI: 1.652 - 3.504) found in NSD3 on chromosome 8. There was no evidence of LD in any of the significant SNPs in the analysis performed with ESRD phenotype. In Table 2.8, the MAFs observed in the data set used in the current analysis are compared to both the global and regional MAFs for each SNP. None of the SNPs in NEMGs found to be suggestive of association with the ESRD phenotype remained significant both before and after adjusting for covariates and therefore were not carried forward to the DNCRI lookup.

116 Table 2.7| SNPs in NEMGs Significantly Associated with End-Stage Renal Disease in UK-ROI Samples SNP Position Gene MAF in mtGWAS Without Covariates With Covariates rs73314193 14:88950391 PTPN21 F_A = 0.03186 OR = 7.029 OR = 7.378 F_U = 0.004660 95% CI: 2.786 – 17.740 95% CI: 2.550 – 21.350 P = 1.70 x 10-6 P = 2.27 x 10-4 GC = 1.70 x 10-6 GC = 2.27 x 10-4 rs185192029 12:125009681 NCOR2 F_A = 0.03431 OR = 5.895 OR = 6.08 F_U = 0.005992 95% CI: 2.533 – 13.72 95% CI: 2.404 – 15.380 P = 3.30 x 10-6 P = 1.38 x 10-4 GC = 3.07 x 10-5 GC = 4.31 x 10-4 rs13132845 4:169610287 PALLD F_A = 0.04167 OR = 4.621 OR = 5.881 F_U = 0.009321 95% CI: 2.258 – 9.457 95% CI: 2.627 – 13.170 P = 4.54 x 10-6 P = 1.65 x 10-5 GC = 1.88 x 10-5 GC = 5.76 x 10-5 rs565709775 8:38230013 NSD3 F_A = 0.1446 OR = 2.098 OR = 2.406 F_U = 0.07457 95% CI: 1.499 – 2.937 95% CI: 1.652 – 3.504 P = 1.11 x 10-5 P = 4.74 x 10-6 GC =1.64 x 10-5 GC = 4.99 x 10-6 SNP, single nucleotide polymorphism; rs, reference SNP cluster ID; MAF, minor allele frequency; F_A, frequency in affected; F_U, frequency in unaffected; OR, odds ratio; CI, confidence interval; GC, adjusted for genomic inflation; GENIE, Genetics of Nephropathy - an International Effort

117 Table 2.8| Minor Allele Frequencies in Various Populations for SNPs Significantly Associated with End-Stage Renal Disease SNP Position Gene Global MAF in GBR* EAS EUR AFR AMR SAS MAF/MA mtGWAS count rs73314193 14:88950391 PTPN21 T = F_A = G = G = G = 0.9652 G = 0.8215 G = 0.9524 G = A1 = T 0.0799/400 0.0319 0.9670 0.9563 T = 0.0348 T = 0.1785 T = 0.0476 0.9468 A2 = G (1000 F_U = T = 0.0330 T = 0.0437 T = 0.0532 Genomes) 0.0047 rs185192029 12:125009681 NCOR2 C = 0.0052/26 F_A = G = G = 1 G = 0.9851 G = 0.9985 G = 0.9957 G = A1 = C (1000 0.0343 0.9780 C = 0 C = 0.0149 C = 0.0015 C = 0.0043 0.9939 A2 = G Genomes) F_U = C = 0.0220 C = 0.00599 0.0061 rs13132845 4:169610287 PALLD G = 0.0126/63 F_A = A = A = 1 A = 0.9523 A = 0.9985 A = 0.9841 A = A1 = G (1000 0.0417 0.9505 G = 0 G = 0.0477 G = 0.0015 G = 0.0159 0.9980 A2 = A Genomes) F_U = G = G = 0.0093 0.0495 0.0020 rs923553585 8:38230013 NSD3 T = F_A = CT = - = 0.1081 - = 0.1650 - = 0.1339 - = 0.2017 - = 0.0900 A1 = C 0.0012/154 0.1446 0.8242 T = 0.8919 T = 0.8350 T = 0.8661 T = 0.7983 T = 0.9100 A2 = CT (TOPMED) F_U = C = 0.1758 0.0746 SNP, single nucleotide polymorphism; rs, reference SNP cluster ID; MAF, minor allele frequency; F_A, frequency in affected; F_U, frequency in unaffected; GBR, British in England and Scotland; EAS, East Asian; EUR, European; AFR, African; AMR, Ad Mixed American; SAS, South Asian

118 2.3.2.3 NEMGs Associated with eGFR Phenotype in UK-ROI From the results of the association analysis using eGFR as a quantitative phenotype there were 55 SNPS in 16 different genes on ten chromosomes found to be suggestive of significant association with eGFR. All these SNPs were found to have a strong but highly variable positive correlation with eGFR.

2.3.2.3.1 Genes on Chromosome 2 Associated with eGFR In chromosome 2 there were 11 SNPs in one gene found to have a strong but variable positive correlation with eGFR prior to adjusting for covariates (Table 2.9). These SNPs were all found in KLHL29. Out of these SNPs the top ranked was rs55792424 (P = 8.59 x 10-7; beta = 13.18; SE = 4.316) which remained suggestive of significance after adjusting for genomic inflation (P = 8.59 x 10-7) but did not remain suggestive of significance after adjusting for covariates. The other SNPs in this gene found to be suggestively associated with eGFR were rs55947039, rs56311441, rs79860558, rs77064076, rs77434252, rs764530213, rs115852079, rs76247029. Another two SNPs in this LD block were found to be in weaker LD with rs55792424, these were rs199798398 (P = 2.1 x 10-6; beta =11.18; SE =2.343), GC (P = 2.10 x 10-6), LD (r2 = 0.738); and rs58984423 (P = 5.61 x 10-6; beta =10.89; SE = 2.386), GC (P = 5.61 x 10-6), LD (r2 = 0.68). All SNPs in chromosome 2 found to be associated with this phenotype remained suggestive of significance after adjusting for genomic inflation but not after adjusting for covariates (Table 2.9). None of the SNPs in NEMGs on chromosome 2 found to be suggestive of association with eGFR were significant both before and after adjusting for covariates and therefore were not carried forward to the DNCRI lookup.

119 Table 2.9| SNPs in NEMGs on Chromosome 2 Significantly Associated with Estimated Glomerular Filtration Rate in UK-ROI Samples SNP Position Gene LD with Without With Covariates index Covariates SNP rs55792424 2:23920296 KLHL29 r2 = 1.0 Beta = 13.18 Beta = 10.80 SE = 2.662 SE = 2.496 95% CI: 2.662 – 95% CI: 5.912 – 7.963 15.69 P = 8.59 x 10-7 P = 1.65 x 10-5 GC = 8.59 x 10-7 GC = 1.65 x 10-5 rs55947039 2:23917340 KLHL29 r2 = Beta = 13.12 Beta = 10.78 0.993 SE = 2.652 SE = 2.486 95% CI: 7.919 – 95% CI: 5.911 – 18.320 15.66 P = 8.83 x 10-7 P = 1.59 x 10-5 GC = 8.83 x 10-7 GC = 1.59 x 10-5 rs764530213 2:23929321 KLHL29 r2 = Beta = 12.56 Beta = 10.45 rs56311441 2:23923589 0.987 SE = 2.619 SE = 2.453 rs79860558 2:23926350 95% CI: 7.428 – 95% CI: 5.646 – rs77064076 2:23926506 17.700 15.26 rs77434252 2:23928491 P = 1.85 x 10-6 P = 2.21 x 10-5 GC = 1.85 x 10-6 GC = 2.21 x 10-5 rs115852079 2:23931972 KLHL29 r2 = 0.98 Beta = 12.85 Beta = 10.83 rs76247029 2:23935464 SE = 2.628 SE = 2.46 95% CI: 7.701 – 95% CI: 6.007 – 18.000 15.65 P = 1.16 x 10-6 P = 1.19 x 10-5 GC = 1.16 x 10-6 GC = 1.19 x 10-5 rs199798398 2:23936272 KLHL29 r2 = Beta = 11.18 Beta = 9.543 0.738 SE = 2.343 SE = 2.194 95% CI: 6.586 – 95% CI: 5.243 – 15.77 13.84 P = 2.10 x 10-6 P = 1.50 x 10-5 GC = 2.10 x 10-6 GC = 1.50 x 10-5 rs58984423 2:23873010 KLHL29 r2 = 0.68 Beta = 10.89 Beta = 8.259 SE = 2.386 SE = 2.241 95% CI: 6.215 – 95% CI: 3.866 – 15.570 12.65 P = 5.61 x 10-6 P = 2.41 x 10-4 GC = 5.61 x 10-6 GC = 2.41 x 10-4 SNP, single nucleotide polymorphism; LD, linkage disequilibrium; rs, reference SNP cluster ID; F_A, frequency in affected; F_U, frequency in unaffected; SE, standard error; GC, adjusted for genomic inflation; CI, confidence interval

120 2.3.2.3.2 Genes on Chromosome 3 Associated with eGFR In chromosome 3 there were two SNPs in two different genes found to have a strong but highly variable positive correlation with eGFR (Table 2.10). A SNP in SLC25A20 was found to be suggestive of significant association with eGFR, rs933524201 (P = 2.18 x 10-6; beta = 19.9; SE = 4.178) and after adjusting for genomic inflation (P = 9.16 x 10-6) but did not remain suggestive of significance after adjusting for covariates (P = 3.719 x 10-5). The other suggestively significant SNP in chromosome 3 was found in OPA1, rs116696291 (P = 4.26 x 10-6; beta = 26.45; SE = 5.722) but did not remain suggestive of significance after adjustment for genomic inflation (P = 1.65 x 10-5) or covariates (P = 1.955 x 10-4). No LD was found between suggestively significant SNPs in chromosome 3. None of the SNPs in NEMGs on chromosome 3 found to be suggestive of association with eGFR were significant both before and after adjusting for covariates and therefore were not carried forward to the DNCRI lookup.

Table 2.10| SNPs in NEMGs on Chromosome 3 Significantly Associated with Estimated Glomerular Filtration Rate in UK-ROI Samples SNP Position Gene LD with Without Covariates With Covariates Index SNP rs93352420 3:48916232 SLC25A2 NA Beta = 19.90 Beta = 16.24 1 0 SE = 4.178 SE = 3.921 95% CI: 11.710 – 95% CI: 8.556 – 28.090 23.920 P = 2.18 x 10-6 P = 3.72 x 10-5 GC = 9.15 x 10-6 GC = 3.72 x 10-5 rs11669629 3:19331845 OPA1 NA Beta = 26.45 Beta = 20.14 1 3 SE = 5.722 SE = 5.387 95% CI: 15.240 – 95% CI: 9.580 – 37.67 30.70 P = 4.26 x 10-6 P = 1.96 x 10-4 GC = 1.65 x 10-5 GC = 1.96 x 10-4 SNP, single nucleotide polymorphism; LD, linkage disequilibrium; rs, reference SNP cluster ID; F_A, frequency in affected; F_U, frequency in unaffected; SE, standard error; GC, adjusted for genomic inflation; CI, confidence interval

2.3.2.3.3 Genes on Chromosome 4 Associated with eGFR In chromosome 4 there were three SNPs in two different genes found to have a strong but moderately variable positive correlation with eGFR (Table 2.11). An LD block in PPA2 was found to have a suggestively significant positive correlation with eGFR and contained two SNPs, rs72668225 and rs72668234 (P = 2.95 x 10-6; beta = 13.96; SE = 2.97), GC (P = 1.76 x 10- 6) and remained significant after adjusting for covariates (P = 1.76 x 10-6; beta = 13.3; CI: 7.878 – 18.72; SE = 2.767). Another SNP on chromosome 4, found in ANKRD17 was found to be positively correlated with eGFR, rs115487201 (P = 7.31 x 10-6; beta = 11.22; SE = 2.488),

121 however this did not remain significant after adjusting for genomic inflation (P = 7.37 x 10-5) or for covariates (P = 2.62 x 10-5). The two SNPs in PPA2, rs72668225 and rs72668234, were found to be suggestive of significant association with eGFR before and after adjusting for covariates and were therefore included in the DNCRI lookup.

Table 2.11| SNPs in NEMGs on Chromosome 4 Significantly Associated with Estimated Glomerular Filtration Rate in UK-ROI Samples SNP Position Gene LD with Without With index Covariates Covariates SNP rs72668225 4:106325001 PPA2 r2 = 1.0 Beta = 13.96 Beta = 13.30 SE = 2.970 SE = 2.767 95% CI: 8.139 – 95% CI: 7.880 – 19.780 18.72 P = 2.95 x 10-6 P = 1.76 x 10-6 GC = 3.23 x 10-5 GC = 1.01 x 10-5 rs72668234 4:106336012 PPA2 r2 = 1.0 Beta = 13.96 Beta = 13.30 SE = 2.970 SE = 2.767 95% CI: 8.139 – 95% CI: 7.880 – 19.780 18.72 P = 2.95 x 10-6 P = 1.76 x 10-6 GC = 3.23 x 10-5 GC = 1.01 x 10-5 rs115487201 4:73937319 ANKRD17 NA Beta = 11.22 Beta = 9.857 SE = 2.488 SE = 2.334 95% CI: 6.339 – 95% CI: 5.280 – 16.090 14.43 P = 7.31 x 10-6 P = 2.62 x 10-5 GC = 6.67 x 10-5 GC = 1.03 x 10-4 SNP, single nucleotide polymorphism; LD, linkage disequilibrium; rs, reference SNP cluster ID; F_A, frequency in affected; F_U, frequency in unaffected; SE, standard error; GC, adjusted for genomic inflation; CI, confidence interval

2.3.2.3.4 Genes on Chromosome 6 Associated with eGFR In chromosome 6 there were four SNPs in three genes found to be suggestive of significant association with eGFR (Table 2.12). A SNP in FARS2, rs58721031 (P = 7.95 x 10-7; beta = 20.43; SE = 4.113) was positively correlated with eGFR and remained suggestive of significant association after adjusting for genomic inflation (GC) (P = 7.95 × 10-7) and after adjusting for covariates (P = 3.09x 10-7; beta = 19.69; CI: 12.2-27.18; SE = 3.821). This SNP was found to be suggestive of significant association with eGFR before and after adjusting for covariates and was therefore included in the DNCRI lookup. No LD was observed between this and surrounding SNPs (Figure 2.3).

The other three SNPs in chromosome 6 were located between chr6:107025917 – 107104764 and were found to be suggestive of significant association with eGFR without adjusting for covariates and remained suggestive of significance after adjusting for genomic inflation. The

122 top ranked SNP was rs59668897 (P = 6.31 x 10-6; beta = 24.03; SE = 5.293), GC (P = 6.31 × 10- 6), located in QRSL1. Also found in QRSL1 and in LD with rs59668897 (r2 = 0.619) was rs2883074 (P = 6.92 x 10-6; beta = 25.34; SE = 5.607), GC (P = 6.92 × 10-6). The other suggestively significant SNP in this LD block (r2 = 0.691) was found in RTN4IP1; rs4946766 (P = 6.34 x 10-6; beta = 25.98; SE = 5.724), GC (P = 6.34 × 10-6). These did not remain suggestive of significant association after adjusting for covariates and therefore were not carried forward to the DNCRI lookup.

Figure 2.3| Regional Association Plot for rs58721031 (purple diamond) found at 6:5482428 in FARS2. This was the only SNP in chromosome 6 found to be significantly associated with eGFR and was significantly associated before and after adjusting for covariates in the UK- ROI collection and in the minimum adjusted model in the DNCRI lookup

123 Table 2.12| SNPs in NEMGs on Chromosome 6 Significantly Associated with Estimated Glomerular Filtration Rate in UK-ROI Samples SNP Position Gene LD with index SNP Without Covariates With Covariates rs58721031 6:5482428 FARS2 NA Beta = 20.43 Beta = 19.69 SE = 4.113 SE = 3.821 95% CI: 12.37 – 28.49 95% CI: 12.20 – 27.18 P = 7.95 × 10-7 P = 3.09 × 10-7 GC = 7.95 × 10-7 GC = 3.09 × 10-7 rs59668897 6:107095365 QRSL1 r2 = 1.0 Beta = 24.03 Beta = 18.10 SE = 5.293 SE = 4.957 95% CI: 13.65 – 34.40 95% CI: 8.379 – 27.81 P = 6.31 x 10-6 P = 2.75 x 10-4 GC = 6.31 x 10-6 GC = 2.75 x 10-4 rs4946766 6:107025917 RTN4IP1 r2 = 0.691 Beta = 25.98 Beta = 20.76 SE = 5.724 SE = 5.349 95% CI:14.76 – 34.40 95% CI: 10.27 – 31.24 P = 6.34 x 10-6 P = 1.11 x 10-4 GC = 6.34 x 10-6 GC = 1.11 x 10-4 rs2883074 6:107104764 QRSL1 r2 = 0.619 Beta = 25.34 Beta = 19.90 SE = 5.607 SE = 5.247 95% CI: 14.35 – 36.33 95% CI: 9.621 – 30.19 P = 6.92 x 10-6 P = 1.57 x 10-4 GC = 6.92 x 10-6 GC = 1.57 x 10-4 SNP, single nucleotide polymorphism; LD, linkage disequilibrium; rs, reference SNP cluster ID; F_A, frequency in affected; F_U, frequency in unaffected; SE, standard error; GC, adjusted for genomic inflation; CI, confidence interval

124 2.3.2.3.5 Genes on Chromosome 12 Associated with eGFR In chromosome 12 there were two SNPs in two different genes found to have a strong but highly variable positive correlation with eGFR (Table 2.13). A SNP in KRT4 was found to have a positive correlation with eGFR, rs56122389 (P = 4.24 x 10-6; beta = 21.88; SE = 2.97) and remained suggestive of significant association after adjusting for covariates (P = 7.65 x 10-6; beta = 19.88; SE = 4.42; CI: 11.22-28.54) but not after adjusting for genomic inflation (P = 1.38 x 10-5). No LD was observed between rs56122389 and surrounding SNPs (Figure 2.4). This SNP, rs56122389, was found to be suggestive of significant association with eGFR before and after adjusting for covariates and was therefore included in the DNCRI lookup.

The other suggestively significant SNP in chromosome 12 was found in DNM1L, rs56282590 (P = 2.31 x 10-6; beta = 22.42; SE = 4.72) and remained suggestively of significance after adjustment for genomic inflation (P = 8.02 x 10-6) but did not retain significance after adjusting for covariates (P = 2.57 x 10-5). This SNP was found to be in strong LD (r2 > 0.8) with two other SNPs in this region; however, these did not reach significance in the current analysis. This SNP was not found to be suggestive of association with eGFR both before and after adjusting for covariates and therefore was not carried forward to the DNCRI lookup.

Table 2.13| SNPs in NEMGs on Chromosome 12 Significantly Associated with Estimated Glomerular Filtration Rate in UK-ROI Samples SNP Position Gene LD with Without With Covariates index SNP Covariates rs56122389 12:53212107 KRT4 NA Beta = 21.88 Beta = 19.88 SE = 4.732 SE = 4.419 95% CI: 12.61 – 95% CI: 11.22 – 31.15 28.54 P = 4.24 x 10-6 P = 7.65 x 10-6 GC = 1.41 x 10-5 GC = 2.55 x 10-5 rs56282590 12:32885213 DNM1L NA Beta = 22.42 Beta = 18.66 SE = 4.720 SE = 4.414 95% CI: 13.17 – 95% CI: 10.01 – 31.67 27.31 P = 2.31 x 10-6 P = 2.57 x 10-5 GC = 8.21 x 10-6 GC = 7.60 x 10-5 SNP, single nucleotide polymorphism; LD, linkage disequilibrium; rs, reference SNP cluster ID; F_A, frequency in affected; F_U, frequency in unaffected; SE, standard error; GC, adjusted for genomic inflation; CI, confidence interval

125

Figure 2.4| Regional Association Plot for rs56122389 (purple diamond) found at 12:5312107 in KRT4 from association analysis with eGFR phenotype before adjusting covariates. This was the only SNP in chromosome 12 found to be significantly associated with eGFR before adjusting for covariates.

2.3.2.3.6 Genes on Chromosome 17 Associated with eGFR In chromosome 17 there were 17 SNPs in one gene found to be suggestive of significant association in the eGFR results (Table 2.14). All these SNPs were found to have a strong but highly variable positive correlation with eGFR. These were found in ABCA9 and were part of a LD block 87 kb wide ranging from chr17:66971664 – 67059245. The index SNP in this block was rs72851722 (Figure 2.5) and was found to be significantly positively correlated with eGFR before adjusting for covariates (P = 3.56 x 10-6; beta = 21.7; SE = 4.656), GC (P = 3.56 x 10-6) and remained significant after adjusting for covariates (P = 1.46 x 10-6; beta = 21.05; CI: 12.53- 29.57; SE = 4.345). Three of the other SNPs (rs72844573, rs72851720 and rs568286059) in this LD block were also positively correlated with eGFR before adjustment, after adjusting for GC and after adjusting for covariates (Table 2.14). These four SNPs were found to be suggestive of significant association with eGFR before and after adjusting for covariates and therefore these were included in the DNCRI lookup. The remaining SNPs in this LD block were only found to be suggestive of significant association after adjusting for covariates and these can also be seen in Table 2.14. Although LD information was not available in LocusZoom827,828 for three of these SNPs these were found to be in LD with rs72851722 in this data set according to the clump function in PLINK 825 and these were rs72851720 (r2 = 0.978), rs562377691 (r2 = 0.92), rs529256589 (r2 = 0.92).

126 Table 2.14| SNPs in NEMGs on Chromosome 17 Significantly Associated with Estimated Glomerular Filtration Rate in UK-ROI Samples SNP Position Gene LD with index Without Covariates With Covariates SNP rs72851722 17:66972055 ABCA9 r2 = 1.0 Beta = 21.70 Beta = 21.05 SE = 4.656 SE = 4.345 95% CI: 12.58 – 30.83 95% CI: 12.53 – 29.57 P = 3.56 x 10-6 P = 1.46 x 10-6 GC = 3.56 x 10-6 GC = 1.46 x 10-6 rs568286059 17:66976506 ABCA9 r2 = 0.959 Beta = 20.62 Beta = 19.82 SE = 4.535 SE = 4.231 95% CI: 11.73 – 29.51 95% CI: 11.52 – 28.11 P = 6.10 x 10-6 P = 3.21 x 10-6 GC = 6.10 x 10-6 GC = 3.21 x 10-6 rs529256589 17:66989002 ABCA9 r2 =0.92 Beta = 19.31 Beta 19.08; rs72851731 17:67009909 SE = 4.425 SE: 4.121 rs8068047 17:66978062 95% CI: 10.63 – 27.98 CI: 11 - 27.16 rs111770137 17:66979047 P = 1.41 x 10-5 P = 4.14 x 10-6 rs72851723 17:66982630 GC = 1.41 x 10-5 GC = 4.14 x 10-6 rs562377691 17:66989001 rs143067584 17:67001219 rs74825376 17:67006846 rs72851733 17:67015528 rs76119476 17:67021212 rs72851734 17:67021373 rs72844573 17:67028212 ABCA9 r2 =0.899 Beta = 21.75 Beta; 20.19 SE = 4.53 SE: 4.226 95% CI: 12.87 – 30.63 CI: 11.9-28.47 P = 1.80 x 10-6 P = 2.04 x 10-6 GC = 1.80 x 10-6 GC = 2.04 x 10-6 rs7217005 17:66984180 ABCA9 r2 = 0.884 Beta = 19.06 Beta; 18.64

127 rs61697543 17:66991216 SE = 4.372 SE: 4.074; 95% CI: 10.49 – 27.63 CI: 10.65 - 26.62 P = 1.43 x 10-5 P = 5.35 x 10-6 GC = 1.43 x 10-5 GC = 5.35 x 10-6 rs72851720 17:66971664 ABCA9 r2 = 0.878 Beta = 21.81 Beta; 20.77 SE = 4.723 SE: 4.407; 95% CI: 12.55 – 31.07 CI: 12.13 - 29.41 P = 4.36 x 10-6 P = 2.77 x 10-6 GC = 4.36 x 10-6 GC = 2.77 x 10-6 SNP, single nucleotide polymorphism; LD, linkage disequilibrium; rs, reference SNP cluster ID; F_A, frequency in affected; F_U, frequency in unaffected; SE, standard error; GC, adjusted for genomic inflation; CI, confidence interval

128 (A)

(B)

Figure 2.5| Regional Association Plot for rs72851722 found at 17:66972055 in ABCA9 from association analysis of eGFR phenotype without covariates. In part A rs72851722 (purple diamond) has been highlighted and shows strong LD with several surrounding SNPs. Notably only moderate LD was observed between rs72851722 and rs72844573 as well as rs72851731. In part B rs72844573 (purple diamond) has been highlighted and moderate LD was seen with all surrounding SŇPs apart from rs72851731 which was not found to be in LD according to LocusZoom. According to PLINK all significant SNPs in this region were part of this LD block.

129 2.3.2.3.7 Genes on Chromosome 22 Associated with eGFR In chromosome 22 there were eight SNPs in one gene found to have a strong but highly variable positive correlation with eGFR (Table 2.15). These SNPs were all found in RNF185 and were part of a 25kb LD block ranging from chr22:31580766 - 31606652. This LD block was represented by the index SNP, rs142983932 (P = 4.99 x 10-6; beta = 19.81; SE = 4.316) and remained suggestive of significant association after adjusting for covariates (P = 3.36 x 10-7; beta = 20.72; SE = 4.035; CI: 12.81-28.63) but not after adjusting for genomic inflation (P = 6.01 x 10-5). This index SNP, rs142983932, was found to be suggestive of significant association with eGFR before and after adjusting for covariates and therefore these were included in the DNCRI look-up. The remaining SNPs in this LD block were only found to be suggestively significant after adjusting for covariates and these were rs186798260, rs140100403, rs74491643, rs139495640 and rs139285271.

130 Table 2.15| SNPs in NEMGs on Chromosome 22 Significantly Associated with Estimated Glomerular Filtration Rate in UK-ROI Samples SNP Position Gene LD with index SNP Without Covariates With Covariates rs142983932 22:31590757 RNF185 r2 = 1.0 Beta = 19.81 Beta = 20.72 SE = 4.316 SE = 4.035 95% CI: 11.35 – 28.27 95% CI: 12.81 – 28.63 P = 4.99 x 10-6 P = 3.36 x 10-7 GC = 5.89 x 10-5 GC = 6.19 x 10-6 rs186798260 22:31580766 RNF185 r2 = 0.9 Beta = 18.39 Beta = 19.78 rs141390087 22:31582442 SE = 4.223 SE = 3.949 95% CI: 10.11 – 26.66 95% CI: 12.04 – 27.52 P = 1.47 x 10-5 P = 6.44 x 10-7 GC = 1.37 x 10-4 GC = 1.04 x 10-5 rs140100403 22:31590089 RNF185 r2 = 0.918 Beta = 18.39 Beta = 19.49 rs8138024 22:31587943 r2 = 0.903 SE = 4.223 SE = 3.906 95% CI: 10.20 – 26.57 95% CI: 11.83 – 27.14 P = 1.18 x 10-5 P = 7.15 x 10-7 GC = 1.15 x 10-4 GC = 1.13 x 10-5 rs74491643 22:31601051 RNF185 r2 = 0.903 Beta = 17.12 Beta = 18.08 rs139495640 22:31605702 SE = 4.136 SE = 3.870 rs139285271 22:31606652 95% CI: 9.016 – 25.23 95% CI: 10.49 – 25.66 P = 3.76 x 10-5 P = 3.39 x 10-6 GC = 2.87 x 10-4 GC = 3.88 x 10-5 SNP, single nucleotide polymorphism; LD, linkage disequilibrium; rs, reference SNP cluster ID; F_A, frequency in affected; F_U, frequency in unaffected; SE, standard error; GC, adjusted for genomic inflation; CI, confidence interval

131 2.3.2.3.8 Genes on Chromosome 15, 16 and 20 Associated with eGFR

The remaining SNPs found to be suggestive of significant association with eGFR were found in chromosome 15, 16 and 20 and these comprised of eight SNPs in three different genes (Table 2.16). On chromosome 15 there was three SNPs in two different genes found to have a strong but highly variable positive correlation with eGFR. The highest ranked of these was rs17727979 found in CATSPER2 (P = 3.99 x 10-6; beta = 17.11; SE = 3.689) which remained suggestive of significance after adjusting for genomic inflation (P = 3.99 x 10-6) and covariates (P = 3.97 x 10-7; beta = 17.53; SE = 3.434; CI: 10.80 -24.26). This is the index SNP in a LD block spanning 120kb from chr15:43923018 – 44043421. Another two SNPs were positively correlated with eGFR and found to be part of this LD block. The first of these was also found in CATSPER2 and was rs17727871 (P = 5.78 x 10-6; Beta = 16.55; SE = 3.631), LD (r2 = 0.972); which remained suggestive of significance after adjusting for genomic inflation (P = 5.78 x 10- 6) and covariates (P = 7.31 x 10-7; beta = 16.83; SE = 3.377; CI: 10.21 -23.45). The other SNP in this haplotype was found in PDIA3 and was rs28653631 (P = 5.83 10-6; beta = 17.1; SE = 3.753), LD (r2 = 0.972); which remained suggestive of significance after adjusting for genomic inflation (P = 5.83 x 10-6) and covariates (P = 4.95 x 10-7; beta = 17.68; SE = 3.494; CI: 10.83 - 24.53). These three SNPs on chromosome 15 were found to be suggestive of significance before and after adjusting for covariates and were therefore included in the DNCRI lookup. The significant SNPs in chromosome 15 were found to be part of an LD block represented by the index SNP rs17727979 (Figure 2.6).

On chromosome 16 there was three SNPs in one gene found to have a relatively weak but consistent positive correlation with eGFR. These were part of a 1kb LD block ranging from chr16:78386507 – 78388056. The index SNP in this LD block is rs913499133 found in WWOX (P = 9.58 x 10-6; beta = 6.25; SE = 1.405) which was suggestive of significant association without adjustment but did not remain suggestive of significance after adjusting for genomic inflation (P = 3.90 x 10-5) or covariates (P =1.084 x 10-4). This SNP was found to be in high LD with rs9941074 (P = 9.58 x 10-6; Beta = 6.25; SE = 1.405), LD (r2 =1.0); and rs55822922 (P = 9.93 x 10-6; beta = 6.257; SE = 1.409), LD (r2 =0.996). The SNPs on chromosome 16 did not remain suggestive of association with eGFR after adjusting for covariates and were therefore not carried forward to the DNCRI look-up.

On chromosome 20 there were two SNPs in one gene found to have a strong but highly variable positive correlation with eGFR. These were part of a 92kb LD block ranging from chr20:9616983 – 9709936. The top ranked SNP was rs17408288 found in PAK5 (P = 2.45 x 10- 7; beta = 14.2; SE = 2.732) which was in LD (r2 = 0.916) with rs116959262 (P = 6.93 x 10-7; beta

132 = 13.21; SE = 2.645). Both were only found to be suggestive of significance after adjusting for covariates and were therefore not included in the look-up of the full DNCRI collection.

133 Table 2.16| SNPs in NEMGs on Chromosome 15, 16 and 20 Significantly Associated with Estimated Glomerular Filtration Rate in UK-ROI Samples SNP Position Gene LD with index SNP Without Covariates With Covariates rs17727979 15:43929666 CATSPER2 Beta = 17.11 Beta = 17.53 SE = 3.689 SE = 3.434 95% CI: 9.876 – 24.340 95% CI: 10.80 – 24.26 P = 3.99 x 10-6 P = 3.97 x 10-7 GC = 3.99 x 10-6 GC = 3.97 x 10-7 rs17727871 15:43923018 CATSPER2 r2 = 0.972 Beta = 16.55 Beta = 16.83 SE = 3.631 SE = 3.377 95% CI: 9.434 – 23.670 95% CI: 10.21 – 23.45 P = 5.78 x 10-6 P = 7.31 x 10-7 GC = 5.78 x 10-6 GC = 7.31 x 10-7 rs28653631 15:44043421 PDIA3 r2 = 0.972 Beta = 17.1 Beta = 17.68 SE = 3.753 SE = 3.494 95% CI: 9.744 – 24.460 95% CI: 10.83 – 24.53 P = 5.83 x 10-6 P = 4.95 x 10-7 GC = 5.83 x 10-6 GC = 4.95 x 10-7 rs913499133 16:78386507 WWOX Beta = 6.250 Beta = 5.120 SE = 1.405 SE = 1.318 95% CI: 3.496 – 9.004 95% CI: 2.538 – 7.703 P = 9.58 x 10-6 P = 1.08 x 10-4 GC = 3.37 x 10-5 GC = 1.13 x 10-4 rs9941074 16:78387254 WWOX r2 = 1.0 Beta = 6.250 Beta = 5.120 SE = 1.405 SE = 1.318 95% CI: 3.496 – 9.004 95% CI: 2.538 – 7.703 P = 9.58 x 10-6 P = 1.08 x 10-4 GC = 3.37 x 10-5 GC = 1.13 x 10-4 rs55822922 16:78388056 WWOX r2 = 0.996 Beta = 6.257 Beta = 5.119 SE = 1.409 SE = 1.321 95% CI: 3.495 – 9.018 95% CI: 2.529 – 7.709

134 P = 9.93 x 10-6 P = 1.14 x 10-4 GC = 3.48 x 10-5 GC = 1.19 x 10-4 rs17408288 20:9616983 PAK5 Beta = 12.65 Beta = 14.20 SE = 2.940 SE = 2.732 95% CI: 6.888 – 18.410 95% CI: 8.843 – 19.55 P = 1.85 x 10-5 P = 2.45 x 10-7 GC = 2.91 x 10-5 GC = 3.31 x 10-6 rs116959262 20:9686097 PAK5 r2 = 0.916 Beta = 11.17 Beta = 13.21 SE = 2.845 SE = 2.645 95% CI: 5.596 – 16.750 95% CI: 8.027 – 18.40 P = 9.16 x 10-5 P = 6.93 x 10-7 GC = 1.34 x 10-4 GC =7.77 x 10-6 SNP, single nucleotide polymorphism; LD, linkage disequilibrium; rs, reference SNP cluster ID; F_A, frequency in affected; F_U, frequency in unaffected; SE, standard error; GC, adjusted for genomic inflation; CI, confidence interval

135

Figure 2.6| Regional Association Plot for rs17727979 (purple diamond) found at 15:43929666 on CATSPER2 from association analysis of eGFR phenotype without covariates can be seen in the above plot. From the LocusZoom results there was high LD between rs17727979, rs17727871 and rs28653631. After using the clump function in PLINK these SNPs were found to be part of a LD block ranging from chr20:9616983 – 9709936 and rs17727979 was identified as the index SNP.

2.3.2.4 Mitochondrial DNA SNP Associations in UK-ROI Data From these results one SNP was strongly associated (P < 0.01) and three SNPs were weakly associated with DKD (P < 0.05). From the ESRD results there were three SNPs weakly associated (P < 0.05) with this phenotype. For the eGFR phenotype there was one SNP which remained significantly associated after Bonferroni correction (P < 5.88 x 10 -4), and eight SNPs which were weakly associated (P < 0.05) with this phenotype. No LD was observed in any of the significant SNPs in mtDNA. All the SNPs found to be significant in mtDNA can be seen in Table 2.17.

In total four SNPs in mtDNA were found to be associated with DKD. One of these was strongly associated (P < 0.01) before adjusting for covariates, this was rs201336470 (P = 0.008805; OR: 4.632; 95% CI: 1.314 - 16.32) and this remained significant after adjusting for genomic inflation (P = 0.008805) and was weakly associated with DKD (P < 0.05) after adjusting for covariates (P = 0.04549; OR: 3.889; 95% CI: 1.028 - 14.72). The other three SNPs were weakly

136 associated with DKD only after adjusting for covariates. These were rs2853496 (P = 0.0153; OR: 3.115; 95% CI: 1.243 - 7.801); 200610-37 (P = 0.03628; OR: 2.206; 95% CI: 1.052 - 4.626); rs200145866 (P = 0.03766; OR: 3.226; 95% CI: 1.069 - 9.736).

There were three SNPs in mtDNA associated with ESRD. All of these were weakly associated (P < 0.05) and one of these was only associated before adjusting for covariates and this was 2010-08-MT-981 (P = 0.02842; OR: 2.914; 95% CI: 1.072 - 7.922), GC (P= 0.03585). The other two SNPs in mtDNA significantly associated with ESRD were also weakly associated but only after adjusting for covariates and these were rs200145866 (P = 0.04401; OR: 4.579; 95% CI: 1.042 - 20.13) and MitoT6222C (P = 0.03183; OR: 4.509; 95% CI: 1.14 - 17.83).

Eight SNPs in mtDNA were significantly associated with decreased eGFR. One of these showed a strong but highly variable negative correlation with eGFR, and this was rs2853496 (P =0.007111; beta = -9.901; SE=3.671), GC (P=0.009698) and was also strongly associated after adjusting for covariates (P =0.0002832; beta =-24.87; SE=6.828). After adjusting for covariates, this SNP remained significantly associated after Bonferroni correction (P < 5.88 x 10 -4). Another four SNPs found to have a relatively weak but consistent negative correlation with eGFR both before and after adjusting for covariates, these were 2010-08-MT-395, rs28359178, MitoC295T, rs41528348. Two of these were found at the same genetic loci, MitoC295T and rs41528348, and these have since been merged to rs41528348, however both identifiers were included on the genotyping array and data is therefore presented for both. The remaining mtDNA SNPs demonstrated a strong negative correlation with eGFR only after adjusting for covariates and these were MitoT6222C, rs41337244 and rs2853826.

No mtDNA SNPs were included in the DNCRI lookup and therefore these are not presented in Table 2.17.

137 Table 2.17| SNPs in mtDNA Significantly Associated with DKD, ESRD and eGFR in Genotyped Data from UK-ROI Samples SNP Position Alleles MAF in DKD without DKD with ESRD without ESRD with eGFR without eGFR with GENIE Covariates Covariates Covariates Covariates Covariates Covariates rs201336470 mt: 2259 A1: T F_A: 0.01831 OR = 4.632 OR = 3.889 OR = 3.706 OR = 2.252 Beta = -7.527 Beta = -8.315 (exm2216239) A2: C F_U: 95% CI: 1.314 – 95% CI: 1.028 – 95% CI: 0.7425 95% CI: SE = 4.143 SE = 7.743 0.004011 16.32 14.72 – 18.50 0.4095 – 95% CI: -31.3 – CI: -23.49 – -6.86 P = 0.008805** P = 0.04549* P = 0.08708 12.38 1.186 P = 0.2831 GC = 0.008805** GC = 0.04549* GC = 0.1013 P = 0.3506 P = 0.06954 GC = 0.4251 GC = 0.4184 GC = 0.08118 rs2853496 mt:11915 A1: A F_A: 0.02116 OR = 1.999 OR = 3.115 OR = 0.9158 OR = 1.425 Beta = -9.901 Beta = -24.87 (MitoG11915A) A2: G (DKD) 95% CI: 0.8424 – 95% CI: 1.243 – 95% CI: 0.193 95% CI: SE = 3.671 SE = 6.828 F_U: 0.0107 4.745 7.801 – 4.346 0.2451– 8.29 95% CI: -34.19 – CI: -38.25 – (DKD) P = 0.1093 P = 0.0153* P = 0.9119 P = 0.6932 -5.411 -11.49 GC = 0.1093 GC = 0.0153* GC= 0.9156 GC = 0.7323 P = 0.007111** P = GC = 0.0002832*** 0.009698** GC = 0.0691** 200610-37 mt:16484 A1: G F_A: 0.03103 OR = 1.811 OR = 2.206 OR = 2.009 OR = 2.796 Beta = -0.794 Beta = -2.656 A2: A F_U: 0.01738 95% CI: 0.9050 – 95% CI: 1.052 – 95% CI: 0.7909 95% CI: 0.10 – SE = 2.887 SE = 5.39 3.622 4.626 – 5.103 7.89 95% CI: -12.9 – CI: -13.22 – P = 0.08898 P = 0.03628* P = 0.1349 P = 0.05004 - 9.727 -7.909 GC = 0.08898 GC = 0.03628* GC = 0.1523 GC = 0.08932 P = 0.7833 P = 0.6223 GC = 0.7916 GC = 0.7144 rs200145866 mt:11253 A1: C F_A: 0.01693 OR = 2.555 OR = 3.226 OR = 2.968 OR = 4.579 Beta = -4.703 Beta = -7.182 (exm2216399) A2: T (DKD) 95% CI: 0.8955 – 95% CI: 1.069 – 95% CI: 0.7897 95% CI: 1.042 SE = 4.149 SE = 7.739 0.01961 7.289 9.736 – 11.16 – 20.13 95% CI: -25.67– CI: -22.35 – (ESRD) P = 0.06928 P = 0.03766* P = 0.09132 P = 0.04401* - 6.859 -7.986 F_U: GC = 0.06928 GC = 0.03766* GC = 0.1059 GC = 0.08079 P = 0.2573 P = 0.3536 0.006693 GC = 0.2764 GC = 0.4906 (DKD & ESRD)

138 2010-08-MT- mt:93 A1: G F_A: 0.03431 OR = 1.529 OR = 1.519 OR = 2.914 OR = 2.54 Beta = -5.293 Beta = -11.81 981 A2: A F_U: 0.01205 95% CI: 0.649 – 95% CI: 0.610 – 95% CI: 1.072 95% CI: 0.830 SE = 3.817 SE = 7.142 3.600 3.783 – 7.922 – 7.773 95% CI: -25.55 – CI: -25.81 – P = 0.3273 P = 0.369 P = 0.02842* P = 0.1023 - 4.378 - 2.191 GC = 0.3273 GC = 0.369 GC = 0.03585* GC = 0.1566 P = 0.1659 P = 0.09859 GC = 0.1831 GC = 0.2196 MitoT6222C mt:6222 A1: C F_A: 0.01961 OR = 2.126 OR = 2.754 OR = 2.47 OR = 4.509 Beta = -6.002 Beta = -15.02 A2: T (ESRD) 95% CI: 0.7937 – 95% CI: 0.943 – 95% CI: 0.6904 95% CI: 1.14 – SE = 3.971 SE = 7.412 F_U: 5.696 8.049 - 8.837 17.83 95% CI: -27.57 – CI: -29.55 – 0.008032 P = 0.1248 P = 0.06405 P = 0.1508 P = 0.03183* 3.562 -0.4942 (ESRD) GC = 0.1248 GC = 0.06405 GC = 0.1689 GC = 0.06275 P = 0.131 P = 0.04296* GC = 0.1467 GC = 0.1324 2010-08-MT- mt:16070 A1: T OR = 1.035 OR = 1.051 OR = 1.453 OR = 1.394 Beta = -2.863 Beta = -5.684 395 A2: C 95% CI: 0.7463 – 95% CI: 0.7398 95% CI: 95% CI: 0.869 SE = 1.29 SE = 2.416 1.435 – 1.492 0.9305 - 2.27 – 2.253 95% CI: -10.78 – CI: -10.42 – P = 0.8377 P = 0.7822 P = 0.09886 P = 0.1759 -0.6685 -0.948 GC = 0.8377 GC = 0.7822 GC = 0.114 GC = 0.2407 P = 0.0267* P = 0.01885* GC = 0.03323* GC = 0.0808 rs28359178 mt:13708 A1: A OR = 1.023 OR = 1.065 OR 1.407 OR = 1.377 Beta = -2.75 Beta = -5.506 A2: G 95% CI: 0.7494 – 95% CI: 0.7623 95% CI: - 95% CI: 0.867 SE = 1.245 SE = 2.332 1.395 – 1.489 P = 0.1171 – 2.189 95% CI: -10.38 – CI: -10.08 – P = 0.8883 P = 0.7104 GC = 0.1334 P = 0.1756 -0.6202 -0.9361 GC = 0.8883 GC = 0.7104 GC = 0.2403 P = 0.02739* P = 0.01839* GC = 0.03404* GC = 0.07964 MitoC295T mt:295 A1: T OR = 1.056 OR = 1.069 OR = 1.496 OR = 1.429 Beta = -2.857 Beta = -5.586 A2: C 95% CI: 0.76 – 95% CI: 0.7511 95% CI: 0.9564 95% CI: 0.883 SE = 1.3 SE = 2.433 1.466 – 1.521 – 2.34 – 2.314 95% CI: -10.81 – CI: -10.35 – P = 0.7466 P = 0.7118 P = 0.07609 P = 0.1464 -0.6171 -0.817 GC = 0.7466 GC = 0.7118 GC = 0.08937 GC = 0.208 P = 0.02822* P = 0.02189* GC = 0.035* GC = 0.08836

139 rs41528348 mt:295 A1: T OR = 1.041 OR = 1.051 OR = 1.469 OR = 1.402 Beta = -2.794 Beta = -5.473 A2: C 95% CI: 0.751 – 95% CI: 0.740 – 95% CI: 0.940 95% CI: 0.867 SE = 1.295 SE = 2.423 1.444 1.494 – 2.296 – 2.268 95% CI: -10.66 – CI: -10.22 – P = 0.8102 P = 0.7797 P = 0.08991 P = 0.1688 -0.5132 -0.7245 GC = 0.8102 GC = 0.7797 GC = 0.1044 GC = 0.2328 P = 0.03114* P = 0.02409* GC = 0.03837* GC = 0.9356 rs41337244 mt:15758 A1: G OR = 1.13 OR = 1.714 OR = 0.7814 OR = 1.313 Beta = -5.449 Beta = -13.04 A2: A 95% CI: 0.542 – 95% CI: 0.767 – 95% CI: 0.2224 95% CI: SE = 3.345 SE = 6.246 2.358 3.829 - 2.746 0.3452 - 4.998 95% CI: -24.01– CI: -25.28 – P = 0.7446 P = 0.1890 P = 0.6998 P = 0.6892 - 2.213 -0.801 GC = 0.7446 GC = 0.1890 GC = 0.7120 GC = 0.7288 P = 0.1036 P = 0.03702* GC = 0.1178 GC = 0.1211 rs2853826 mt:10398 A1: G OR = 0.959 OR = 1.021 OR = 1.194 OR = 1.323 Beta = -1.553 Beta = -4.016 A2: A 95% CI: 0.743 – 95% CI: 0.775 – 95% CI: 0.8273 95% CI: SE: 1.04 SE = 1.941 1.237 1.344 - 1.722 0.8891 - 1.968 95% CI: -7.183 – CI: -7.821 – P = 0.7473 P = 0.8836 P = 0.3436 P = 0.1676 0.9716 -0.2107 GC= 0.7473 GC = 0.8836 GC = 0.3644 GC = 0.2315 P = 0.1358 P = 0.03885* GC = 0.1518 GC = 0.1246 *= P <0.05; **= P <0.01; ***= P < 5.88 x 10 -4 SNP, single nucleotide polymorphism; DKD, diabetic kidney disease; ESRD, end-stage renal disease; eGFR, estimated glomerular filtration rate; LD, linkage disequilibrium; rs, reference SNP cluster ID; MAF, minor allele frequency; F_A, frequency in affected; F_U, frequency in unaffected; OR, odds ratio; CI, confidence interval; GC, adjusted for genomic inflation; GENIE, Genetics of Nephropathy - an International Effort

140 2.3.2.5 Summary of UK-ROI Analyses Results From the targeted mitochondrial association analysis of the UK-ROI collection, there 81 SNPs were found to be suggestive of significant association with at least one of the phenotypes being investigated. Out of these 81 SNPs, 69 were found in NEMGs and 12 were found in mtDNA (Figure 2.7). There were 14 SNPs in total suggestive of significant association with the DKD phenotype, ten of these were found in NEMGs and four in mtDNA. There were seven SNPs in total suggestive of association with the ESRD phenotype, four of these were found in NEMGs and three in mtDNA. There were 63 SNPs in total suggestive of association with the eGFR phenotype, 55 of these were found in NEMGs and eight in mtDNA. The majority of the 81 suggestively significant SNPs only reached the thresholds for suggestive significance either before or after adjusting for covariates however 12 SNPs in NEMGs and six SNPs in mtDNA were found to be significant both before and after adjusting for covariates (summarised in Figure 2.7). Of these 18 SNPs, ten were also suggestively significant after adjusting for genomic inflation both with and without adjustment for covariates. In mtDNA, three SNPs were found to be suggestive of association in more than one phenotype; rs200145866 (suggestively associated with DKD and ESRD), mitoG11905A (suggestively associated with DKD and significantly associated with eGFR) and mitoT6222C (suggestively associated with ESRD and eGFR). No SNPs in NEMGs were found to be significantly associated or suggestive of significance with more than one phenotype in the current study. Table 2.18 shows SNPs found to be suggestive of significant association before and after adjusting for covariates in the UK-ROI analyses.

141 total genes involved with Nuclear DNA 2,563 Mitochondrial DNA mitochondrial function (autosomes + mtDNA)

2,880,474 SNPs in genes SNPs in nuclear genes 2,880,249 involved with mitochondrial 225 SNPs in mitochondrial associated with mitochondrial function function DNA

532,811 SNPs in NEMGs which passed 85 SNPs in mtDNA which QC passed QC

69 Significant SNPs 12* Significant SNPs

10 SNPs 4 SNPs 55 SNPs 4 SNPs 3 SNPs 8 SNPs associated associated associated with associated associated associated with DKD with ESRD eGFR with DKD with ESRD with eGFR (Table 3) (Table 5) (Tables 7 - 14) (Table 15) (Table 15) (Table 15)

1 without 3 without 33 without 1 without 1 without 5 without covariates covariates covariates covariates covariates covariates 9 with 1 with 34 with 4 with 2 with 8 with covariates covariates covariates covariates covariates covariates * 3 of these SNPs are associated with more than one phenotype Figure 2.7| Summary of SNPs Suggestive of Association with Each Phenotype Investigated in nDNA and mtDNA

142 Table 2.18| SNPs Suggestive of Significant Association in the UK-ROI data and Carried Forward to the DNCRI Lookup SNP Position Gene Pheno. Without With Covariates Covariates rs72668225 4:106325001 PPA2 eGFR Beta = 13.96 Beta = 13.30 SE = 2.97 SE = 2.77 95% CI: 8.14 – 95% CI: 7.88 – 19.78 18.72 P = 2.95 x 10-6 P = 1.76 x 10-6 GC = 3.23 x 10-5 GC = 1.01 x 10-5 rs72668234 4:106336012 PPA2 eGFR Beta = 13.96 Beta = 13.30 SE = 2.970 SE = 2.767 95% CI: 8.14 – 95% CI: 7.88 – 19.78 18.72 P = 2.95 x 10-6 P = 1.76 x 10-6 GC = 3.23 x 10-5 GC = 1.01 x 10-5 rs58721031 6:5482428 FARS2 eGFR Beta = 20.43 Beta = 19.69 SE = 4.11 SE = 3.82 95% CI: 12.37 – 95% CI: 12.20 – 28.49 27.18 P = 7.95 × 10-7 P = 3.09 × 10-7 GC = 7.95 × 10-7 GC = 3.09 × 10-7 rs56122389 12:53212107 KRT4 eGFR Beta = 21.88 Beta = 19.88 SE = 4.73 SE = 4.42 95% CI: 12.61 – 95% CI: 11.22 – 31.15 28.54 P = 4.24 x 10-6 P = 7.65 x 10-6 GC = 1.41 x 10-5 GC = 2.55 x 10-5 rs17727979 15:43929666 CATSPER2 eGFR Beta = 17.11 Beta = 17.53 SE = 3.69 SE = 3.43 95% CI: 9.88 – 95% CI: 10.80 – 24.34 24.26 P = 3.99 x 10-6 P = 3.97 x 10-7 GC = 3.99 x 10-6 GC = 3.97 x 10-7 rs17727871 15:43923018 CATSPER2 eGFR Beta = 16.55 Beta = 16.83 SE = 3.63 SE = 3.38 95% CI: 9.43 – 95% CI: 10.21 – 23.670 23.45 P = 5.78 x 10-6 P = 7.31 x 10-7 GC = 5.78 x 10-6 GC = 7.31 x 10-7 rs28653631 15:44043421 PDIA3 eGFR Beta = 17.1 Beta = 17.68 SE = 3.75 SE = 3.49 95% CI: 9.74 – 95% CI: 10.83 – 24.46 24.53 P = 5.83 x 10-6 P = 4.95 x 10-7 GC = 5.83 x 10-6 GC = 4.95 x 10-7 rs72851722 17:66972055 ABCA9 eGFR Beta = 21.70 Beta = 21.05 SE = 4.66 SE = 4.35 95% CI: 12.58 – 95% CI: 12.53 – 30.83 29.57 P = 3.56 x 10-6 P = 1.46 x 10-6

143 GC = 3.56 x 10-6 GC = 1.46 x 10-6 rs568286059 17:66976506 ABCA9 eGFR Beta = 20.62 Beta = 19.82 SE = 4.54 SE = 4.23 95% CI: 11.73 – 95% CI: 11.52 – 29.51 28.11 P = 6.10 x 10-6 P = 3.21 x 10-6 GC = 6.10 x 10-6 GC = 3.21 x 10-6 rs72844573 17:67028212 ABCA9 eGFR Beta = 21.75 Beta = 20.19 SE = 4.53 SE = 4.23 95% CI: 12.87 – 95% CI: 11.90 – 30.63 28.47 P = 1.80 x 10-6 P = 2.04 x 10-6 GC = 1.80 x 10-6 GC = 2.04 x 10-6 rs72851720 17:66971664 ABCA9 eGFR Beta = 21.81 Beta = 20.77 SE = 4.72 SE = 4.41 95% CI: 12.55 – 95% CI: 12.13 – 31.07 29.41 P = 4.36 x 10-6 P = 2.77 x 10-6 GC = 4.36 x 10-6 GC = 2.77 x 10-6 rs142983932 22:31590757 RNF185 eGFR Beta = 19.81 Beta = 20.72 SE = 4.32 SE = 4.04 95% CI: 11.35 – 95% CI: 12.81 – 28.27 28.63 P = 4.99 x 10-6 P = 3.36 x 10-7 GC = 5.89 x 10-5 GC = 6.19 x 10-6 rs201336470 mt: 2259 MT-RNR2 DKD OR = 4.63 OR = 3.89 (exm2216239) 95% CI: 1.31 - 95% CI: 1.03 – 16.32 14.72 P = 0.0088** P = 0.0455* GC = 0.0088** GC = 0.0455* rs2853496 mt:11915 MT-ND4 eGFR Beta = -9.90 Beta = -24.87 (MitoG11915A) SE = 3.67 SE = 6.83 95% CI: -34.19 – CI: -38.25 – - -5.41 11.49 P = 0.0071** P = 2.83 × 10-4 GC = 0.0097** *** GC = 0.0691** 2010-08-MT- mt:16070 Non- eGFR Beta = -2.863 Beta = -5.684 395 coding SE = 1.29 SE = 2.416 region 95% CI: -10.78 – CI: -10.42 – - -0.67 0.95 P = 0.0267* P = 0.0189* GC = 0.0332* GC = 0.0808 rs28359178 mt:13708 MT-ND5 eGFR Beta = -2.75 Beta = -5.51 SE = 1.245 SE = 2.332 95% CI: -10.38 – CI: -10.08 – - -0.62 0.94 P = 0.0274* P = 0.0184* GC = 0.0340* GC = 0.0796

144 MitoC295T mt:295 Non- eGFR Beta = -2.86 Beta = -5.59 coding SE = 1.30 SE = 2.43 region 95% CI: -10.81 – CI: -10.35 – - -0.62 0.82 P = 0.0282* P = 0.0219* GC = 0.035* GC = 0.0884 rs41528348 mt:295 Non- eGFR Beta = -2.79 Beta = -5.47 coding SE = 1.30 SE = 2.42 region 95% CI: -10.66 – CI: -10.22 – - -0.51 0.73 P = 0.0311* P = 0.0241* GC = 0.0384* GC = 0.9356 SNP, single nucleotide polymorphism; rs, reference SNP cluster ID; OR, odds ratio; CI, confidence interval; GC, adjusted for genomic inflation; UK-ROI is the second to last study in the Direction column, displaying direction of effect for each study in meta-analysis

2.3.3 DNCRI Lookup of Suggestively Significant SNPs Any SNPs found to be suggestive of significant association (P < 1.0x10-5) in the targeted association analyses of the UK-ROI data were carried forward to a look-up of the full DNCRI collection. In this larger dataset, five SNPs were found to be significantly associated with DKD related phenotypes the minimum and maximum adjusted models, one SNP was significantly associated in the maximum adjusted model only and one SNP was significantly associated in the minimum adjusted model only. A summary of the results of the DNCRI look-up can be seen in Table 2.19 and the full set of results can be found in Appendix 5.

Two SNPs on chromosome 4, located in PPA2, rs72668225 and rs72668234, were found to be suggestive of significant association with eGFR before and after adjusting for covariates in the UK-ROI collection and were therefore included in the DNCRI lookup, however, these did not reach the threshold for significance in the DNCRI collection.

On chromosome 6, one SNP in FARS2, rs58721031 (P = 7.95 x 10-7; beta = 20.43; SE = 4.113) was significantly positively correlated with eGFR and remained significant after adjusting for genomic inflation and after adjusting for covariates and was included in the DNCRI lookup. This SNP was found to be associated with ESRD development with albuminuria within two studies in the DNCRI data set (Table 2.19) with the minimum adjusted model only (P = 0.02045; Effect = 0.653; SE = 0.2821).

On chromosome 12, rs56122389 was found to be suggestive of significant association with eGFR before and after adjusting for covariates and was therefore included in the DNCRI lookup. This SNP was found to be associated with eGFR < 45 ml/min/1.73m2, albuminuria and ESRD within three studies in the DNCRI data set (Table 2.19) with minimum adjusted

145 model (P = 0.03856; Effect = 0.3547; SE = 0.1715) and with the maximum adjusted model (P = 0.02301; Effect = 0.4491; SE = 0.1976).

Three SNPs on chromosome 15 included in the DNCRI lookup were in CATSPER2 and PDIA3. Of the two SNPs in CATSPER2, rs17727871 was seen to have a significant positive association the CKD phenotype (P = 0.04224; Effect = 0.3576; SE = 0.176) in four studies in the DNCRI collection in the maximum effect model only. The other SNP in CATSPER2, rs17727979, was seen to have a borderline significant, negative effect (P = 0.05008; Effect = -0.3828; SE = 0.1954) in three studies in the DNCRI collection, however this did not cross the threshold for significance in the current look-up. The remaining SNP on chromosome 15 which was suggestive of significant association in the UK-ROI collect was rs28653631, found in PDIA3, however this SNP did not reach the threshold for significant association for any phenotypes in the DNCRI lookup.

On chromosome 17 four SNPs, all found in ABCA9 were found to be suggestive of significant association with eGFR before and after adjusting for covariates and therefore these were included in the DNCRI lookup. From the result of the DNCRI lookup four SNPs were found to be associated with the CKD phenotype (Table 2.19), two of which (rs568286059 and rs72844573) were seen to have a negative effect in the minimum adjusted model (four studies) and the maximum adjusted model (three studies). The remaining two SNPs (rs72851722 and rs72851720) were found to have a similar effect size but in a positive direction in the minimum adjusted model (four studies) and the maximum adjusted model (three studies).

On chromosome 22, rs142983932, found in RNF185, was found to be suggestive of significant association with eGFR before and after adjusting for covariates and therefore these were included in the DNCRI lookup; however, this did not remain significant in the DNCI collection.

146 Table 2.19| SNPs in nDNA Suggestive of Significant Association in UK-ROI Analyses and DNCRI Lookup SNP Position Gene Pheno Without With Covariates DNCRI Min. Adjusted DNCRI Max. Adjusted Effect direction for type Covariates Model Model each study in DNCRI meta-analysis rs58721031 6:5482428 FARS2 eGFR Beta = 20.43 Beta = 19.69 Effect = 0.6539 Not present in max Min Mode= SE = 4.113 SE = 3.821 Std Error = 0.2821 adjusted model for ???+???+??????? 95% CI: 12.37 – 95% CI: 12.20 – P = 0.02045 EsrdvMacro 28.49 27.18 Associated Trait = phenotype Max Model= P = 7.95 × 10-7 P = 3.09 × 10-7 Esrdvmacro Not present GC = 7.95 × 10-7 GC = 3.09 × 10-7 No. of Studies = 2 does not include UKROI A1= A A2= G rs56122389 12:53212107 KRT4 eGFR Beta = 21.88 Beta = 19.88 Effect = 0.3547 Effect = 0.4491 Min Model= SE = 4.732 SE = 4.419 Std Error = 0.1715 Std Error = 0.1976 ????++??+???????? 95% CI: 12.61 – 95% CI: 11.22 – P = 0.03856 P = 0.02301 31.15 28.54 No. of Studies = 3 No. of Studies = 3 Max Model= P = 4.24 x 10-6 P = 7.65 x 10-6 Associated Trait = Associated Trait = ????++??+???????? GC = 1.41 x 10-5 GC = 2.55 x 10-5 ckddn ckddn does not include does not include UKROI UKROI A1= A A1= A A2= G A2= G rs17727979 15:43929666 CATSPER2 eGFR Beta = 17.11 Beta = 17.53 Effect = -0.2274 Effect = -0.3828 Min Model= SE = 3.689 SE = 3.434 Std Error = 0.1765 Std Error = 0.1954 ????-???+??????-? 95% CI: 9.876 – 95% CI: 10.80 – P = 0.1977 P = 0.05008 24.340 24.26 No. of Studies = 3 No. of Studies = 3 Max Model= P = 3.99 x 10-6 P = 3.97 x 10-7 Associated Trait = Associated Trait = ????-???-??????-? GC = 3.99 x 10-6 GC = 3.97 x 10-7 CKD CKD

147 A1= A A1= A A2= C A2= C rs17727871 15:43923018 CATSPER2 eGFR Beta = 16.55 Beta = 16.83 Effect = 0.2144 Effect = 0.3576 Min Model= SE = 3.631 SE = 3.377 Std Error = 0.1589 Std Error = 0.176 ????++??-??????+? 95% CI: 9.434 – 95% CI: 10.21 – P = 0.1774 P = 0.04224 23.670 23.45 No. of Studies = 4 No. of Studies = 4 Max Model= P = 5.78 x 10-6 P = 7.31 x 10-7 Associated Trait = Associated Trait = ????++??+??????+? GC = 5.78 x 10-6 GC = 7.31 x 10-7 CKD CKD A1= T A1= T A2= C* A2= C* rs72851722 17:66972055 ABCA9 eGFR Beta = 21.70 Beta = 21.05 Effect = 0.3901 Effect = 0.448 Min Model= SE = 4.656 SE = 4.345 Std Error = 0.1797 Std Error = 0.2068 ????+???+???-??+? 95% CI: 12.58 – 95% CI: 12.53 – P = 0.02989 P = 0.03027 30.83 29.57 Associated Trait = Associated Trait = Max Model= P = 3.56 x 10-6 P = 1.46 x 10-6 CKD CKD ????+???????-??+? GC = 3.56 x 10-6 GC = 1.46 x 10-6 No. of Studies = 4 No. of Studies = 3 A1=T A1=T A2= G* A2= G* rs568286059 17:66976506 ABCA9 eGFR Beta = 20.62 Beta = 19.82 Effect = -0.3931 Effect = -0.4342 Min Model= SE = 4.535 SE = 4.231 Std Error = 0.1726 Std Error = 0.1994 ????-???-???+??-? 95% CI: 11.73 – 95% CI: 11.52 – P = 0.02278 P = 0.02939 29.51 28.11 Associated Trait = Associated Trait = Max Model= P = 6.10 x 10-6 P = 3.21 x 10-6 CKD CKD ????-???????+??-? GC = 6.10 x 10-6 GC = 3.21 x 10-6 No. of Studies = 4 No. of Studies = 3 A1= A A1= A A2= G A2= G rs72844573 17:67028212 ABCA9 eGFR Beta = 21.75 Beta = 20.19 Effect = -0.3689 Effect = -0.4529 Min model= SE = 4.53 SE = 4.226 Std Error = 0.1647 Std Error = 0.1933 ????-???-???+??-? 95% CI: 12.87 – 95% CI: 11.90 – P = 0.02508 P = 0.01912 30.63 28.47 Max model=

148 P = 1.80 x 10-6 P = 2.04 x 10-6 Associated Trait = Associated Trait = ????-???????+??-? GC = 1.80 x 10-6 GC = 2.04 x 10-6 CKD CKD No. of Studies = 4 No. of Studies = 3 A1= C A1= C A2= G A2= G rs72851720 17:66971664 ABCA9 eGFR Beta = 21.81 Beta = 20.77 Effect = 0.3906 Effect = 0.4489 Min Model= SE = 4.723 SE = 4.407 Std Error = 0.1802 Std Error = 0.2074 ????+???+???-??+? 95% CI: 12.55 – 95% CI: 12.13 – P = 0.03022 P = 0.03046 31.07 29.41 Associated Trait = Associated Trait = Max Model= P = 4.36 x 10-6 P = 2.77 x 10-6 CKD CKD ????+???????-??+? GC = 4.36 x 10-6 GC = 2.77 x 10-6 No. of Studies = 4 No. of Studies = 3 A1= A A1= A A2*=G A2*=G *Minor Allele is A2 in this case. SNP, single nucleotide polymorphism; rs, reference SNP cluster ID; OR, odds ratio; CI, confidence interval; GC, adjusted for genomic inflation; UK-ROI is the second to last study in the Direction column, displaying direction of effect for each study in meta-analysis.

149 2.4 Discussion

The results presented in the previous section highlight potential associations between DKD and genetic variants required for mitochondrial function. There is a clear relationship between diabetes, hypertension and kidney disease with 80% of ESRD cases worldwide attributable to diabetes, hypertension or a combination of both. Hypertension has been proposed as both a cause and consequence of kidney failure. Systemic hypertension could result in higher intra-glomerular pressure with subsequent damage to the glomerular microcirculation and increased glomerular permeability leading to albuminuria and eventually proteinuria. Increasing glomerular permeability and glomerular fibrosis is accompanied by increased vascular resistance and activation of the renin-angiotensin- aldosterone system resulting in salt and water retention leading to volume-mediated hypertension829–831.

From the analyses of the UK-ROI data there was a total of 12 SNPs in seven NEMGs found to be significantly associated with at least one of the phenotypes being investigated both before and after adjusting for covariates. Out of these, there were eight SNPs in four NEMGs which also reached the threshold for significance in the DNCRI lookup and these were found in FARS2 (rs58721031), KRT4 (rs56122389), ABCA9 (rs72851722, rs568286059, rs72844573 and rs72851720) and CATSPER2 (rs17727979 and rs17727871).

The variant rs58721031 is found at chr6:5482428 within the gene FARS2 and results in the replacement of a guanine nucleotide with an adenine. This SNP has not previously been reported as clinically significant and is not known to be an expression quantitative trail loci (eQTL). The FARS2 gene encodes a mitochondrial phenylalanyl tRNA synthetase which is synthesized outside the mitochondria and transported into the mitochondrial matrix832,833. Mitochondrial aminoacyl-tRNA synthetases are essential for the translation of genes encoding OXPHOS subunits encoded by mtDNA and defects in these genes have been associated with various mitochondrial diseases in humans834–838. Pathogenic variants in FARS2 have been shown to affect complex I and IV activity and have been associated with epileptic encephalopathy839–843 and autosomal recessive hereditary spastic paraplegia844,845. Within the UK-ROI data the SNP, rs58721031 was found to have a positive association with eGFR. The direction of the effect of this SNP was replicated in the DNCRI where this SNP was found to be significantly associated with ESRD in the FinnDiane and Joslin Diabetes Center collections only with a positive effect direction. No SNPs in this gene have previously been associated with DKD or any other form of kidney disease and rs58721031 has not previously been associated with any disease phenotype.

150 The gene KRT4 encodes a type II cytokeratin that is expressed in differentiated layers of the mucosal and oesophageal epithelia along with keratin family member KRT13. Mutations in this gene is known to be associated with White Sponge Nevus846–848. In the UK-ROI data the SNP, rs56122389 was found to have a large positive association with eGFR. This SNP is found at chr12:52818323 in KRT4 and no clinical significance has been reported and this SNP is not a known eQTLs have been identified for this variant. The direction of the effect of this SNP was replicated in the DNCRI where this SNP was found to be significantly associated with severe CKD based on eGFR in three of the studies (FinnDiane, GENEDIAB and JOSLIN) in both the minimally and maximally adjusted models. No SNPs in this gene have previously been associated with DKD or any form of kidney disease and rs56122389 has not previously been associated with any disease phenotype.

ABCA9 is a member of the ABC1 subfamily of ATP-binding cassette (ABC) transporters and expression of this gene is induced during monocyte differentiation into macrophages and is supressed by cholesterol import. This gene, along with other ABC transporters have been associated with ovarian cancer849,850, multidrug resistance in melanoma851 and a potential role in predicting response to oxaliplatin treatment in colorectal cancer852. There were four SNPs in ABCA9 (rs72851722, rs568286059, rs72844573 and rs72851720) found to be significant in the UK-ROI data which also reached the threshold for significance in the DNCRI lookup. These were part of a LD block 87 kb wide ranging from chr17:66971664 – 67059245. The index SNP in this LD block was rs72851722 (Figure 2.5) and was found to be significantly positively correlated with eGFR both before and after adjusting for covariates in the UK-ROI collection and remained significantly associated with the CKD based on eGFR in the follow- up results from the full DNCRI collection (in four collections: FinnDiane, JOSLIN, SDRNT1BIO and UK-ROI). This SNP is thought be an eQTL for tibial nerve tissue853(along with rs72844573 and rs72851720). Two of the SNPs in this LD block (rs72851722 and rs72851720) were found to have a significant positive correlation with eGFR in two studies in the DNCRI lookup in the maximum adjusted model and in three studies in the minimum adjusted model. These SNPs were found to be negatively correlated with eGFR in one study in both the minimally and maximally adjusted models. The other two SNPs in this LD block (rs568286059 and rs72844573) were seen to have a similar effect size but in the opposite direction to rs72851722 and rs72851720 in both the UK-ROI analysis and the DNCRI follow-up. These SNPs had a significant negative correlation with eGFR in two studies in the maximum adjusted model and in three studies in the minimally adjusted model; these were also found to be positively correlated with eGFR in one study in both the minimally and maximally adjusted

151 models in the DNCRI follow-up. The effect direction that these SNPs had was consistent in the UK-ROI study in the initial analysis and in the follow-up in the DNCRI subgroup. No SNPs in ABCA9 have previously been associated with DKD or any form of kidney disease and the four SNPs found to be significant in the UK-ROI and DNCRI collections had not previously been associated with any disease phenotype. No clinical significance has previously been reported for any of the SNPs in ABCA9.

The CATSPER2 gene encodes a cation channel protein which localises to the of spermatozoa and defects in this gene are known to cause male infertility854–857. There were two SNPs in CATSPER2 (rs17727979 and rs17727871) found to be significantly associated with eGFR in the UK-ROI data. These were found to be part of a LD block and the index SNP was identified as rs17727979 (Figure 2.6) and was found to be positively correlated with eGFR both before and after adjusting for covariates in the UK-ROI collection, however this SNP was close to the threshold for significance in the DNCRI cohort but did not cross this threshold (P = 0.05008) and therefore this cannot be said to be significantly associated with the CKD phenotype in the follow-up analysis. As rs17727979 did not reach the threshold for significance in the DNCRI lookup the only SNP in the initial analysis found to be significantly associated with eGFR in CATSPER2 was rs17727871. This was found to be positively correlated with CKD based on eGFR in four studies in the maximum adjusted model (FinnDiane, GENEDIAB, JOSLIN and UK-ROI). In the minimum adjusted model, this SNP was found to be positively correlated with CKD based on eGFR in three studies (FinnDiane, GENEDIAB and UK-ROI) and negatively correlated with this phenotype in one study (JOSLIN). Overall, the effect direction of rs17727871 was found to be consistent between the initial UK-ROI analysis and the lookup of the DNCRI collections. The variant rs17727871 has previously been identified as and eQTL in a wide range of tissue type but no evidence has been reported in kidney tissue858. No SNPs in CATSPER2 have previously been associated with DKD or any form of kidney disease and the SNPs in this gene found to be significant in the UK-ROI and DNCRI collections had not previously been associated with any disease phenotype.

A SNP in PDIA3 (rs28653631) was also found to be in LD with the SNPs in CATSPER2 (Figure 2.6) and this was found to be significantly associated with eGFR in the initial UK-ROI analysis, however this did not remain significant in the DNCRI follow-up. Although this SNP was found to be in LD with rs17727979 and rs17727871 the LD block did not extend across all these genes as can be seen from the low r2 values for SNPs in CKMT1A in Figure 2.6. Another two SNPs in PPA2 (rs72668225 and rs72668234), and one in RNF185 (rs142983932) were found

152 to be significantly associated with eGFR in the UK-ROI collection but these did not appear in any of the studies included in the full DNCRI collection. None of these SNPs have previously been associated with DKD or any form of kidney disease.

In mtDNA, two SNPs (rs201336470 and rs2853496) were found to be significantly associated with at least one of the phenotypes being investigated before and after adjusting for covariates. An additional ten SNPs in mtDNA showed minimal evidence for association with at least one of the phenotypes included. Four of these were found to show minimal evidence for association with eGFR both before and after adjusting for covariates (2010-08-MT-395, rs28359178, MitoC295T and rs41528348) and are therefore will not be focused on in detail in this section.

One of the SNPs in mtDNA, rs201336470, was found to be associated with the DKD phenotype before adjusting for covariates (P = 0.008805) and showed minimal evidence for association after adjusting for covariates (P = 0.04549), however at this lower level of significance there is increased likelihood that a type 1 error may have been made. This SNP, which has not previously been associated with any DKD related phenotype, is found in the MT-RNR2 gene which encodes several proteins; 16s RNA which is a mitochondrial ribosomal RNA; humanin which is a mitochondrial-derived peptide (MDP) containing a three-turn α- helix with no symmetry; and another six MDPs known as small humanin-like peptides (SHLP1, SHLP2, SHLP3, SHLP4, SHLP5 and SHLP6). Humanin is known to exhibit antioxidant properties and has been seen to have a protective role in many diseases primarily due to suppression of apoptosis by binding to BAX therefore inhibiting its transport from the cytosol to mitochondria859,860. The protective role of humanin has been observed in various age-related diseases71 including protecting against neurotoxicity in Alzheimer’s disease861–863. Humanin and its more potent variant, humanin G, have also been seen to protect retinal pigment epithelial cells against oxidative stress in AMD864–867. Most notably humanin has been seen to be significantly decreased in those with impaired fasting glucose868, protects against albuminuria in mouse models of DKD869 and reduces renal microvascular remodelling, inflammation and apoptosis in the early stage of kidney disease in mouse models of atherosclerosis870. This highlights an important role of humanin and its analogues as a possibly therapeutic option for treating neurodegenerative disorders, CVD and diabetic complications71. The SHLP family of MDPs have also been shown to exhibit a similar protective role to humanin in AMD871. SNPs in MT-RNR2 have been associated with AMD in Mexican-American individuals (mt1736)872, with hypertrophic cardiomyopathy (m.2336T>C) in a familial case of this disease873, and with T2DM (m.1888G>A) in Caucasian-Brazilian

153 patients from southern Brazil. No SNPs in this gene have previously been associated with DKD.

In close proximity to MT-RNR2 is MT-ND1 which is a key component of complex 1874 and is located approximately 1 kb upstream of rs201336470. Dysfunction or mutations in this gene are known to be associated with several mitochondrial diseases. The MT-ND1 gene is implicated in Leber’s hereditary optic neuropathy (LHON), a mitochondrial inherited degeneration of retinal ganglion cells that leads to loss of central vision. Several mutations in MT-ND1 are reported to be involved in LHON and these include m.3472T>C875,876, m.3460G>A877,878, m.3394T>C878, m.3890G>A879,880, m.3700G>A881, m.3733G>A878,881, m.3736G>A882, m.4171C>A881, m.3635G>A878, m.3634A>G883 and m.3866T>C878. Haplogroup specific effect variants have also been shown to enhance penetrance, increase susceptibility and also lead to variation in the clinical phenotype of LHON881,884,885. Another distinct clinical phenotype which results from pathogenic mutations in MT-ND1 is mitochondrial myopathy, encephalopathy, lactic acidosis and stroke-like episodes (MELAS) and specific mutations have been shown to result in LHON / MELAS overlap and these include 3376G>A.886,887 Mutations in MT-ND1 have also been associated with MELAS independent of LHON syndrome such as m.3959G>A888, m.3995A>G888, 3380G>A889. A pathogenic mutation in MT-ND1, m.3697G>A, has been associated with several mitochondrial diseases including LHON, MELAS and also Leigh syndrome, so this SNP may be useful as a biomarker of mitochondrial dysfunction890– 892. Mutations in this gene have also been associated with Leigh syndrome independent of MELAS and LHON (m.3928G>C)893, West syndrome leading to Lennox-Gastaut syndrome (m.3946G>A)894, maternally inherited nonsyndromic hearing loss (m.3388C>A)895, cancer progression (m.3571insC)896 and body mass index (mt3336T→G)897. No SNPs or mutations in this gene have previously been associated with DKD. rs2853496 (MitoG11915A) was significantly negatively associated with eGFR before adjusting for covariates (P = 0.007) and showed strong evidence for association after adjusting for covariates (P = 0.0002) and remained significant after Bonferroni correction. This SNP also had minimal association with the DKD phenotype after adjusting for covariates (P = 0.0153) but not before which may indicate that the association is with the covariates rather than the DKD phenotype. rs2853496 is located in MT-ND4 which encodes the ND4 subunit of complex 1 and is known to pay a role in MELAS (m.12015T>C)898 and LHON (m.11778G>A, m.14484T>C, m.14502T>C).899,900 Mutations in MT-ND4 have also been associated with prostate cancer (m.11087T > C)901 and AMD (rs2853497)872. The SNP rs2853496 is found 2kb downstream of MT-ND5 that is another key component of complex 1 of the ETC, encoding

154 the ND5 subunit. Similar to MT-ND1 and MT-ND4, variations and mutations in MT-ND5 have been linked with mitochondrial diseases such as LHON (m.13259G>A)902, Leigh syndrome with ophthalmological manifestations (m.13513G>A)903 and also in mitochondrial cardiomyopathy (c.1315A>G and c.13805A>G)904,905. Mutations in MT-ND5 have also been observed in mtDNA of oesophageal cancer cell lines and such mutations may be related the incidence of oesophageal cancer.906 A large scale study carried out by the CHARGEmtDNA+ Consortium investigated the influence of mtDNA and NEMGs on seven metabolic traits as risk factors for CVD using mtDNA association analyses and meta-analyses identified an mtDNA variant in MT-ND5 which was associated with IR in CVD907 which may suggest a relationship between MT-ND5 and diabetic complications. No SNPs or mutations in MT-ND5 have previously been associated with DKD, however the association of rs2853496 (MitoG11915A) with reduced eGFR in the current study may indicate that this SNP could be a potential biomarker for DKD development in individuals with T1DM however follow-up functional studies are required to investigate this further.

The two mtDNA SNPs found to be significant in the current analyses, rs201336470 and rs2853496, were both found close to genes which encode subunits of respiratory complex I. Mitochondrial complex 1 deficiency can lead to Leigh disease and LHON, and mitochondrial disease has been seen to affect multiple organs including kidneys908–910. Early investigations of mitochondrial DNA identified a large deletion of 5778-bp in mtDNA which may lead to a defective mitochondrial oxidative phosphorylation and contribute to the pathogenesis of several diabetic complications 911. High glucose levels in diabetes are associated with increased ROS production, 94,912,913 upregulate TGF-β1 and increased fibronectin deposition912,914–916 which are features seen in the development of DKD. It has been suggested that generation of cytosolic ROS may lead to or enhance specific defects in respiratory complex I which in turn leads to complex I deficiency, enhances susceptibility to mitochondrial permeability transition and leads to overproduction of superoxide and eventually results in diabetic end-organ injury917. Mouse models have also demonstrated that mitochondrial-derived ROS are increased in diabetic conditions in the kidney918. Recently, it has been suggested that inhibition of complex I can improve glucose metabolism by inducing glycolysis and reducing hepatic glucose output, these findings suggest that respiratory complex I may be a useful drug target to restore mitochondrial integrity and protect against diabetic complications including DKD917–920.

These analyses have identified potential associations between DKD related phenotypes and SNPs in NEMGs and mtDNA which have not previously been reported. Out of the SNPs

155 identified, two SNPs in mtDNA (rs201336470 and rs2853496) and one SNP in nuclear DNA (rs58721031) were found in or near genes coding for key components of respiratory complex I and genes essential for the actions of ROS and antioxidants. Although none of these SNPs has previously been associated with DKD, the kidneys are particularly sensitive to REDOX imbalance. Respiratory complex I dysfunction and aberrant ROS generation are known to impact kidney function921 and excessive ROS production is thought to be a key step in the pathogenicity of DKD922,923. These findings may indicate that these SNPs are suitable biomarkers for identifying individuals predisposed to develop DKD however, significant follow-up work is necessary to validate these observations.

Most SNPs do not directly affect health, and many are synonymous i.e. the nucleotide variation does not alter the amino acid sequence of the protein product due to codon degeneracy924,925. Out of the SNPs found to be significant in these analyses one was located in an intergenic region near KRT4, another six were synonymous SNPs found within genes (rs58721031, rs72844573, rs72851720, rs17727871, rs2853496 and rs201336470) and three were non-synonymous SNPS found within genes (rs72851722, rs568286059 and rs17727979). Non-synonymous SNPs are those which alter amino acid sequence which may directly affect the structure and function of the synthesised protein and as a result contribute to inter-individual variation within biological networks, genetic predisposition to disease, and response to drugs926. Early research utilised a targeted approach to investigate a panel of non-synonymous SNPS for association with common complex diseases927,928 including diabetes929 and diabetic kidney disease289,930.

Although synonymous SNPs do not directly alter amino acid structure there is evidence to suggest that these changes can also have an impact on mRNA folding931, mRNA structure,932 gene expression,933–936 co-translational folding, protein levels931,937–939 and have potentially toxic effects940. More recently codon degeneracy has been used to create E. coli with an entirely synthetic genome using a reduced number of sense codons to encode amino acids and these synthetic E. coli were found to be slightly larger and have an increased doubling time941. This observation demonstrates the functional impact of synonymous mutations on a genome-wide scale in bacteria, although this impact is likely more complex in higher organisms. In humans, it is thought that approximately 5-10% of genes contain at least one site at which synonymous mutations may be damaging942,943. Synonymous SNPs have been shown to be causal variants in human diseases such as Treacher–Collins syndrome944, X-linked infantile spinal muscular atrophy945, Seckel syndrome946 and cystic fibrosis947. Such variations have also been seen to correlate with disease susceptibility and

156 response to treatment while not having a direct link to disease mechanisms939,948–952. Despite these observations, there is still a lack of experimental evidence to characterise the molecular mechanisms by which synonymous mutations affect physiology and lead to human disease953,954. Findings from other GWASs suggest that mutations in non-coding and intergenic regions are also associated with disease955,956. It is possible that both synonymous and non-synonymous mutations in or near mitochondrial genes (mtDNA and NEMGs) identified in the current study may contribute to the pathogenesis of DKD by altering protein levels or conformation without affecting the amino acid sequence. These SNPs may also be associated with susceptibility to DKD without being directly involved in pathogenic mechanisms. Functional follow-up studies are required to further investigate the role of these SNPs in DKD and their impact on gene expression.

The direction of effect from the DNCRI results (including the UK-ROI collection) was found to be in the opposite direction to that observed in the initial UK-ROI analysis for two SNPs in ABCA9 (rs568286059 and rs72844573) and one SNP in CATSPER2 (rs17727979) although this did not reach the threshold for significance in the DNCRI lookup. For the other SNPs in these genes (rs72851722 and rs72851720 in ABCA9; and rs17727871 in CATSPER2), the minor allele reported from the DNCRI lookup was the A2 allele, but for the rest of the SNPs in the DNCRI lookup the minor allele was the A2 allele as can be seen from Table 2.20. Effect directions are also seen to vary between individual studies within the DNCRI lookup. It has been suggested that such observations could be due to linkage disequilibrium between causal SNPs which is known to lead to predicting opposing effect directions between studies957. Differences in effect direction such as those observed in the current study may occur due to differences in LD between populations which are emphasized by genotyping using lower SNP density typical of older GWAS panels958. The UK-ROI data included in the current study was genotyped using the Illumina Omni1-Quad array and the DNCRI data had been genotyped using the Illumina HumanCore BeadChip. Current arrays offer significantly better coverage of annotated SNPs compared to the arrays used in this study. Imputation can be used to overcome these issues, but it may be more accurate to perform new genotyping with a more modern array. The SNPs which show opposite effect directions are found to be in LD with other SNPs significantly associated in the current study and such “flip-flop” in association effect direction is known to occur due to LD between a non-causal variant and the causal variant and when the non-causal variant is positively associated with risk of disease the target allele appears as the risk allele and vice versa when the non-causal variant in negatively associated. DKD is a complex disease involving interaction between multiple genes and with

157 environmental factors and therefore the effect direction at any individual loci may also be influenced or confounded by other disease-associated SNPs or environmental factors959.

The genetic landscape of DKD is complex and the role of mitochondrial genetics is only beginning to emerge. Here we have identified SNPs in nuclear DNA and mtDNA that may be associated with DKD in a white European population. From the UK-ROI data and the DNCRI results most SNPs suggestive of association in NEMGs had MAFs which reflected those values derived from 1000 Genomes Project Phase3 (1000 Genomes) (Table 2.20). None of the significant SNPs in NEMGs from this study was present in East Asian populations according to MAFs derived from 1000 Genomes in dbSNP. The only significant SNPs in NEMGs in the current study found in a South Asian population according to dbSNP (1000 Genomes) were those in ABCA9 (rs72851722, rs568286059, rs72844573 and rs72851720). The MAFs observed in the lookup of the DNCRI data reflected those in dbSNP derived from 1000 Genomes for a European population for all SNPs except those found in CATSPER2 (rs17727979 and rs17727871) which appeared in the DNCRI data at approximately half the expected MAFs derived from 1000 Genomes. These SNPs were also found to have a higher MAF in the UK-ROI collection when compared to the full DNCRI collection and therefore may represent underlying differences in population structure. It is also possible that the presence of the minor allele of this SNP is protective against DKD development. However, from Table 2.20 it can be seen that the MAF for this SNP in a European population is almost four times that of the global MAF and therefore this SNP may be over represented in European populations, however this was not reflected in the results from the DNCRI or the UK-ROI collections. No 1000 Genomes MAF information was available for mtDNA and SNPs in the mitochondrial genome were not included in DNCRI data. Frequency data is available for rs2853496 (MT-ND4) on dbSNP derived from a HapMap population, however, this was genotyped almost 15 years ago with a small number of samples and a smaller SNP panel than is currently available and may not be an accurate representation of the minor allele frequencies in a European population or globally.

158 Table 2.20| MAFs for SNPs Significant in Targeted Association Analysis of UK-ROI Data and DNCRI Lookup ID Model Direction No of Associated Minor Alt MAF MAF Global MAFs African East Asian European South America Studi trait in from from Asian n es DNCRI DNCRI UK-ROI lookup rs58721031 Min ???+???+??????? 2 Esrdvmacr a g 0.0203 0.0205 A=0.0563/282 A=0.187 A=0.000 A=0.020 A=0.00 A=0.02 o 1 (1000Genomes) rs56122389 Min ????++??+???????? 3 ckddn a g 0.0195 0.0133 A=0.007 A=0.000 A=0.000 A=0.021 A=0.00 A=0.02 Max ????++??+???????? 3 ckddn a g 0.0185 3 (33/5008, 1000G) rs72851722 Min ????+???+???-??+? 4 ckd g t 0.0133 0.0160 G=0.049 G=0.160 G=0.000 G=0.012 G=0.01 G=0.02 Max ????+???????-??+? 3 ckd g t 0.0135 6 (243/5008, 1000G) rs568286059 Min ????-???-???+??-? 4 ckd a g 0.0134 0.0167 A=0.038 A=0.121 A=0.000 A=0.012 A=0.01 A=0.02 Max ????-???????+??-? 3 ckd a g 0.0136 5 (191/5008, 1000G) rs72844573 Min ????-???-???+??-? 4 CKD c g 0.015 0.0170 C=0.042 C=0.131 C=0.000 C=0.014 C=0.01 C=0.02 Max ????-???????+??-? 3 CKD c g 0.0151 9 (209/5008, 1000G) rs72851720 Min ????+???+???-??+? 4 CKD g a 0.0133 0.0157 G=0.049 G=0.160 G=0.000 G=0.012 G=0.01 G=0.02 Max ????+???????-??+? 3 CKD g a 0.0134 2 (243/5008, 1000G) rs17727979 Max ????-???-??????-? 3 CKD a c 0.0177 0.0246 A=0.011 A=0.000 A=0.000 A=0.037 A=0.00 A=0.02 1 (55/5008, 1000G) rs17727871 Max ????++??+??????+? 4 CKD c t 0.0191 0.0252 C=0.013 C=0.005 C=0.000 C=0.040 C=0.00 C=0.03 9 (67/5008, 1000G) MAF, minor allele frequency; DNCRI, Diabetes Nephropathy Collaborative Research Initiative; UK-ROI, All Ireland, Warren 3, Genetics Of Kidneys in Diabetes UK; Esrdvmacro, ESRD vs macroalbuminuria; ckddn, No CKD and normal AER vs. CKD and albuminuria/ESRD; ckd, CKD vs no CKD ; rs, reference SNP identifier

159 Ancient adaptive polymorphic changes in mtDNA are known as haplogroups and these correlate with geographic origins of populations960. Each haplogroup has related patterns of mtDNA sequences (haplotypes) that represent that population and these have the potential to alter amino acid sequences and thus affect energy production, ROS formation, apoptosis and cell death961. Such changes mean that different haplogroups may have unique biochemical and bioenergetics properties and this has been correlated with a range of common disorders including age-related diseases962–971. These alterations to amino acid sequence and resulting impact on mitochondrial function have been seen to modulate the penetrance and expressivity of diseases within certain populations900 and may also have a role in differential susceptibility to disease between populations from different geographic areas972,973. Mitochondrial haplogroups have previously been associated with T2DM and many studies have assessed mtDNA variation in T2DM patients.

An mtDNA genome-wide association analysis investigated a potential role for mtDNA in diabetic complications and found that the incidence of retinopathy was increased in those with the mtDNA H haplotype, specifically the H3 haplogroup and haplogroups U3 and V were risk factors for DKD974. It is not clear if these finding can be applied to T1DM in a white European population. The UK-ROI data in the current study contains samples originating from mitochondrial haplogroups H (rs2015062), I (exm-rs3928305, rs2853826 and MitoG16393A), J (rs28359178) and K (rs2853498) which is consistent with the haplogroup distribution expected from a population collected from the United Kingdom and Ireland. These haplogroups have not previously been associated with DKD and no association between individual haplotypes and DKD was observed in the current study. It is noteworthy that most of the SNPs found to be significant in the current analysis are not present in Asian populations and individuals of South-East Asian origin are known to be more susceptible to the risks of impaired fasting glucose and of impaired glucose tolerance compared with populations of a European origin975. Prevalence and rate of progression of DKD, is significantly higher in citizens of Asian origin, as observed both in the United Kingdom976 and in Canada977 and it is thought that this is due to genetic differences and/or lifestyle and lack of awareness of kidney complications of diabetes1,978.

In the current study only one significant SNP was found in a non-coding region, this is due to the targeted approach employed which investigated only SNPs in NEMGs and up to 2kb either side of these genes. While there is evidence that LD can exist over a longer range than this and long distance regulatory elements may be found outside of this range801, within the context of the current analyses 2kb either side of the gene was considered sufficient to

160 reduce the likelihood of false positive results and ensure any associations are with mitochondrial-related genes.

One limitation of the current study is that the SNPs found to be significant in this targeted mitochondrial association analysis may not be associated with DKD but may represent association with underlying disease, comorbidities or uncategorised subtypes of DKD. Non- diabetic kidney disease in diabetes is thought to occur in over 55% of diabetic patients with renal disease979. Several studies have previously shown that DM duration exceeding 10 years was a predictor of DKD in diabetic patients 980–982 and only those with DM for more than 10 years and evidence of kidney disease were included in the current study. While it is therefore likely that the patients selected represent true DKD, however it remains difficult to differentiate DKD from non-diabetic renal disease in a clinical setting without the use of renal biopsy and this may have led to bias in our results.

This research highlights the complexity of DKD and demonstrated that this disorder is associated with SNPs in mtDNA and in NEMGs. The complex interplay of genetic, environmental and behavioural factors contributing to DKD underlines the need for multidisciplinary approaches to research, treatment and prevention of DKD and other diabetic complications. It is also essential to clearly define a consensus renal phenotype that may allow more informative GWAS, fine-mapping and sequencing to uncover rare variants associated with DKD. The following chapter addresses some of these issues by developing a next generation sequencing workflow to further investigate the complex interaction between mtDNA, DNA methylation and gene expression and how these contribute to development of DKD and progression to ESRD.

161 Chapter 3| Designing a Next Generation Sequencing Workflow for mtDNA

3.1 Introduction

Genetic association studies, including genome-wide association studies (GWAS), investigating complex disease have often been subject to criticism due to limited study power, low prior probability and issues with multiple testing which increase the chances of false negatives and false positives983–986. In order to address some of these issues, independent replication of associations in a larger sample is desirable. Another approach which has been adopted is the use of massively parallel sequencing technology, often referred to as next generation sequencing (NGS), that can generate many reads at high coverage and target specific genomic regions to provide better sequencing depth of coverage of specific genomic regions. This approach is an attractive methodology to provide unbiased assessment of genomic material with high throughput and lower costs relative to classical methods such as Sanger sequencing987. Whole exome sequencing (WES) and whole genome sequencing (WGS) approaches have led to the discovery of previously unrecognised genetic and epigenetic markers which are associated with kidney disease45,988–1000. Comprehensive genomic analysis using high-throughput methods allows identification of difficult-to- diagnose diseases and lowers the risk of misdiagnosis as the underlying primary genetic cause can be more easily identified1001. In the case of diseases with more complex aetiologies, such as diabetic kidney disease (DKD), sequencing multiple genes can provide insights into the variability of such diseases by identifying causative mutations and variants in other genes that contribute to pathogenesis1002. NGS has proven to be a useful tool for research and diagnosis of kidney disease. Larsen and colleagues1003 developed an NGS panel covering 301 genes involved with kidney disease which could provide a molecular pathogenesis-based diagnosis in 46% of cases. More recently, the Nephrology Research Group (NephRes) at Queen's University Belfast (QUB) developed an NGS panel to investigate genetic variations in the UMOD gene and their association with kidney disease1004. WES approaches have also been used to identify diagnostic mutations in individuals with chronic kidney disease (CKD) of unknown cause1005. Limited research has been conducted using NGS to investigate DKD, however, in a recent array-based exome-wide association study found 11 loci in eight genes associated with end-stage renal disease (ESRD) in individuals of African American origin with type 2 diabetes1006. However, the use exome sequencing may be prone to overestimation of possible pathogenic variants that may limit its utility in a clinical setting1007. NGS is not limited

162 to WES and WGS but can also be used for a variety of applications including targeted sequencing, transcriptome sequencing (RNA-seq)1008, methylation sequencing (BST- seq)1009,1010, targeted sequencing of transcription factor–binding sites (ChIP-Seq)1011, regulatory regions (FAIRE-Seq and DNase-Seq)1012–1014, and chromatin structure (MNase- Seq)1015,1016. A summary of the types of genetic material that can be investigated using NGS is outlined in Figure 3.11017.

Figure 3.1| High Throughput Methods Used for Detection of Functional Genetic and Epigenetic Features1017 Genetic material for sequencing can come from sources such as normal and diseased tissues, from which NGS experiments can sequence the whole genome (WGS) or targeted genomic/epigenetic elements. Commonly used NGS approaches include whole genome sequencing of genomic DNA (WGS); Targeted “exome” sequencing focusing on protein coding regions of genomic DNA; RNA sequencing which uses cDNA produced from RNA as the source material (RNA-seq); Bisulfite-treated DNA sequencing (Bisulfite-seq); Immunoprecipitated DNA sequencing (ChIP-seq); sequencing cDNA made from immunoprecipitated RNA (RIP-seq); DNase-digested chromatin DNA (DNase-seq); Sequencing of accessible chromatin regions (FAIRE-seq), sequencing nucleosome-associated DNA (MNase-seq); sequencing of captured chromosome conformations (Hi-C/5C-seq; and sequencing of microbial DNA (Metagenomics).

In addition to nuclear DNA, mitochondrial DNA (mtDNA) can also be sequenced using an NGS approach that provides superior sensitivity and specificity of base calling when compared to Sanger sequencing as well as allowing assessment of the degree of heteroplasmy at each individual base1018. NGS for investigation of mtDNA has led to the identification of underlying genetic defects in mitochondrial diseases1019–1021 and other disorders including hepatocellular carcinoma1022, gout1023 and IgA nephropathy1024. The work presented in this chapter outlines an NGS workflow that can be used to investigate features of mtDNA and nuclear encoded mitochondrial genes (NEMGs) that may increase or decrease risk of DKD and ESRD in persons

163 with type 1 diabetes mellitus (T1DM). This workflow makes use of the facilities offered by the Genomic Core Technology Unit (GCTU) at QUB and Illumina BaseSpace for data analysis. This workflow has also been be evaluated to determine the feasibility and resources required to carry out this sequencing project on a larger scale.

3.1.1 Illumina Sequencing Experiments employing NGS methodology will follow a similar protocol involving sample collection and extraction of genetic material followed by library and template preparation, after which platform specific sequencing protocols are used to generate raw read data followed by genome alignment and read assembly during data analysis. Experimental design and sample type may influence the choice of pipeline and sequencing platform. NGS can be used to sequence whole genomes, transcriptomes and genomes from multiple organisms contained within the same sample (metagenomics). Samples may be obtained from many sources provided there is enough genetic material for sequencing and sources include biological tissue, microbiota, soil, food, water and single microorganism. The sequencing work for this project was performed by QUB GCTU which provides tailor-made solutions across a wide range of biological applications using NGS technology (further details of the services offered can be found at on their website1025). Sequencing by the GCTU was performed using Illumina sequencing platforms, specifically the NextSeq500 (mtDNA and RNA sequencing)1026,1027 and the NovaSeq6000 (bisulfite-treated-DNA sequencing)1028. There are four basic steps in an Illumina NGS workflow (Figure 3.2). Firstly, a sequencing library is prepared by randomly fragmenting the DNA or cDNA sample after which 5’ and 3’ adapters are attached to the fragments by ligation. These steps can also be combined into a single step using “tagmentation” which increases efficiency of library preparation1029. Once fragments have had an adapter attached, they are then amplified with PCR and purified on gels. Cluster generation is then performed by loading the library on to a flow cell where the fragments are captured by complementary oligonucleotides and amplified into distinct clusters by bridge amplification. Sequencing is performed using the Illumina sequencing by synthesis (SBS) chemistry that utilises a reversible terminator–based method capable of detecting bases as they are incorporated into the DNA template strands. The use of all four reversible terminator–bound dNTP in each sequencing cycle enables natural competition to minimise incorporation bias and therefore greatly reduces raw error rates when compared to other methods1030,1031. Following sequencing, data generated by Illumina sequencing is available for analysis with BaseSpace which allows encrypted data flow from sequencing instrument to the BaseSpace Sequence Hub where data can be managed and analysed with a range of

164 curated analysis apps1032. This is used for data analysis and alignment of the sequence reads to a reference genome after which many different analyses can be performed within BaseSpace and data can be exported in various formats for further analysis with third party software.

Figure 3.2| Four Main Steps in an Illumina NGS Workflow1032 a| Library preparation in which an NGS library is prepared by fragmenting the DNA sample and ligation of adapters. b| Cluster generation by bridge amplification on a flow cell surface. c| Sequencing by synthesis during which the flow cell is imaged and the emission from each cluster is recorded and used to identify individual bases. d| Alignment to reference genome and data analysis

The time taken for an Illumina sequencing run will depend on the instrument used and this includes cluster generation, SBS chemistry cycles, paired-end chemistry, and wash steps. Table 3.1 summarizes the estimated length of each sequencing step for the platforms used in this sequencing project. For the NovaSeq platform the cycle time will vary depending on the choice of flow cell (SP, S1, S2, or S4)1028. The SP, S1 and S2 flow cells offer quick and powerful sequencing for many high-throughput applications whereas the S4 flow cell is more

165 suited to cost effective WGS and WES. On the NextSeq 500 platform flow cells can be configured to allow high-output or mid-output depending on the number of reads required1026,1027.

Table 3.1| Approximate Times for Each Step on Illumina Platforms Used for Sequencing Instrument cBot On Cycle Template Pair End Quick Cluster Instrument Time Building* Turnaround Wash/Manual Generation Cluster (Minutes) or Generation Maintenance Wash (Minutes) NovaSeq N/A 130 SP/S1/S2 N/A 48 80 Minutes = 4 Minutes S4 = 6 NextSeq N/A 2.5 hr 4.6 Run 60 20 / 90 continued Minutes *Template building time will vary based on cluster density NovaSeq cycle time will vary depending on Flowcell used (SP, S1, S2, S4)

These sequencing instruments make use of paired-end sequencing, dual-indexing and library multiplexing. Paired-end sequencing involves sequencing both ends of the DNA fragment and aligning the forward and reverse reads as pairs which permits more accurate alignment and allows detection of indels which is not possible with single-read data1033. As the distance between each paired read is known this can be used to map reads more precisely allowing better alignment particularly in regions of the genome with repeated sequences. This approach can also be used with dual-indexing in which two index reads are performed following Read 1. The number of cycles in each read will always be 1 more than the number of cycles analysed. For example, a paired-end 150-cycle run performs reads of 151- cycles (2 × 151) for a total of 302 cycles. At the end of the run, 2 x 150 cycles are analysed. The extra cycle is required for phasing and pre-phasing correction calculations. Phasing and pre- phasing occur when a strand is out of phase with the current incorporation cycle. Phasing occurs if a base falls behind and pre-phasing occurs when a base jumps ahead of the current cycle1034. In a typical indexing workflow the Index 1 Read will follow directly after Read 1, however For dual-indexing on a paired-end flow cell (as in the current project) the workflow will differ depending on the platform and the workflows employed by the NextSeq500 and NovaSeq6000 are illustrated in Figure 3.3. On the NovaSeq6000 system the Index 2 Read is performed before Read 2 re-synthesis (Workflow A), so the Index 2 adapter is sequenced on the forward strand whereas on the NextSeq500 the Index 2 Read is performed after Read 2 re-synthesis (Workflow B), creating the reverse complement of the Index 2 index adapter sequence

166 Figure 3.3| Two Methods of Dual Indexed Sequencing used by Illumina Workflow A is used by the NovaSeq system and Workflow B is used by NextSeq1035

The use of multiplexing allows the pooling of multiple libraries that can then be sequenced together in a single sequencing run. This involved the addition of a unique index sequence to each DNA fragment to allow identification and sorting prior to data analysis. The use of paired-end sequencing and multiplexing allows rapid analysis and higher throughput in multi- sample studies. Prior to alignment and data analysis these samples must be de-multiplexed based on the unique index sequence assigned to each fragment1032. In the current project, three methods of sequencing were used to analyse mtDNA, RNA and methylation. For sequencing mtDNA, a targeted sequence approach was used to isolate and sequence only mtDNA. The use of targeted sequencing allows coverage of the target region at 500 – 1000x allowing the identification of rare variants which would not otherwise be captured using WGS. In order to gain insights into transcription and gene expression RNA-seq is used to provide a comprehensive overview of the cellular transcriptional profile at the time the sample was obtained. RNA-seq can be targeted to specific regions and used to sequence small RNA and non-coding RNA. An important step in RNA-seq is the removal of ribosomes prior to conversion to cDNA1032. In order to investigate DNA methylation sodium bisulfite chemistry is used to convert non-methylated cytosines to uracils which are then converted to thymines in the sequence reads. Following this bisulfite treatment, fragments between 100 – 150 bp are isolated and enriched for CpG and promotor containing DNA regions. Libraries can then be constructed using standard NGS protocols1036.

167 3.2 Methods and Materials

This section describes the steps involved in the sequencing and analysis of genetic material to investigate features of mtDNA, gene expression and DNA methylation that may be associated with DKD. This pilot study can then be used to assess the resources required and feasibility of applying the same workflow to a larger sample set. Genomic samples were sent to GCTU for DNA sequencing, RNA sequencing and BST-DNA sequencing of the mitochondrial genome. Samples were sequenced by GCTU and FASTQ files were provided on Illumina BaseSpace for further analyses.

3.2.1 Sample Collections and Comparison Groups The genetic material for this study was obtained from three sources. Samples from patients with ESRD (with and without diabetes) were selected from molecular material collected from recipients of kidney transplant procedures in Northern Ireland performed at the Belfast City Hospital (BCH)(BRTx)). Further details of this collection have been previously described by McCaughan and colleagues1037. All recipients had their primary renal diagnosis classified according to the European Dialysis and Transplantation Association coding system. For the current study, DNA and RNA were extracted from samples obtained from whole blood prior to kidney transplant from 10 patients with diabetes and kidney disease who had received a kidney transplant and 10 patients who had kidney disease with no evidence of diabetes who had received a kidney transplant. DNA and RNA were extracted from whole blood samples from 10 patients with no evidence of kidney disease or diabetes. These samples were previously collected as part of the Northern Ireland Cohort for the Longitudinal study of Ageing (NICOLA) study population, details of which can be found in the Wave 1 report which is available online at: https://www.qub.ac.uk/sites/NICOLA/FileStore/Filetoupload,783215,en.pdf. The NICOLA study began in 2013 with the aim of collecting information regarding ageing to gain a better understanding of the factors that affect social and health outcomes in the older Northern Ireland population. NICOLA has recruited a random sample of 8,309 people over 50 years old who were living in their own homes and 195 spouses or partners aged less than 50 who shared a residency with a participant. Further details of this collection have previously been described elsewhere1038. DNA samples were extracted from blood samples taken from patients with renal disease and diabetes who had not received a kidney transplant collected as part of a study investigating genetic predisposition to DKD in patients with T1DM. Samples for the current study were selected from a collection of white individuals with parents and

168 grandparents born in Ireland (All-Ireland Collection). From this collection nephropathy patients with persistent proteinuria (>0.5 g/24 h) developing more than 10 years after initial diagnosis of T1DM and had hypertension (>135/85 mmHg and/or treatment with antihypertensive agents) were selected for inclusion in the current study. A total of 10 DNA samples were selected from this collection for sequencing and details of this collection have previously been described elsewhere1039. No RNA was available from the All-Ireland collection and therefore RNA from 10 participants from the NICOLA study who had diabetes and renal disease but had not received a renal transplant were sex-matched to the DNA from the All Ireland collection. Only sex-matching was performed, as it was not possible to age- match these due to the inclusion criteria of the NICOLA study. The full list of samples along with age at recruitment, diabetes status and kidney status can be seen in Table 3.2.

Table 3.2| Samples Used for Next-Generation Sequencing Project Anon ID Source Age Sex Diabetes Kidney Status Genetic Range Status Material Type DKD_ESRD1 BRTx 45 - 49 Male Diabetes Type I RRT (CKD Stage 5) DNA & RNA DKD_ESRD2 BRTx 25 - 29 Female Diabetes Type I RRT (CKD Stage 5) DNA & RNA DKD_ESRD3 BRTx 30 - 34 Female Diabetes Type I RRT (CKD Stage 5) DNA & RNA DKD_ESRD4 BRTx 25 - 29 Male Diabetes Type I RRT (CKD Stage 5) DNA & RNA DKD_ESRD5 BRTx 40 - 44 Male Diabetes Type I RRT (CKD Stage 5) DNA & RNA DKD_ESRD6 BRTx 30 - 34 Male Diabetes Type I RRT (CKD Stage 5) DNA & RNA DKD_ESRD7 BRTx 45 - 49 Male Diabetes Type I RRT (CKD Stage 5) DNA & RNA DKD_ESRD8 BRTx 35 - 39 Female Diabetes Type I RRT (CKD Stage 5) DNA & RNA DKD_ESRD9 BRTx 35 - 39 Male Diabetes Type I RRT (CKD Stage 5) DNA & RNA DKD_ESRD10 BRTx 35 - 39 Male Diabetes Type I RRT (CKD Stage 5) DNA & RNA ESRDonly1 BRTx 45 - 49 Male No Diabetes RRT (CKD Stage 5) DNA & RNA ESRDonly2 BRTx 25 - 29 Female No Diabetes RRT (CKD Stage 5) DNA & RNA ESRDonly3 BRTx 35 - 39 Female No Diabetes RRT (CKD Stage 5) DNA & RNA ESRDonly4 BRTx 25 - 29 Male No Diabetes RRT (CKD Stage 5) DNA & RNA ESRDonly5 BRTx 40 - 44 Male No Diabetes RRT (CKD Stage 5) DNA & RNA ESRDonly6* BRTx 30 - 34 Male No Diabetes RRT (CKD Stage 5) DNA & RNA ESRDonly7 BRTx 50 - 54 Male No Diabetes RRT (CKD Stage 5) DNA & RNA ESRDonly8 BRTx 35 - 39 Female No Diabetes RRT (CKD Stage 5) DNA & RNA ESRDonly9 BRTx 35 - 39 Male No Diabetes RRT (CKD Stage 5) DNA & RNA ESRDonly10 BRTx 35 - 39 Male No Diabetes RRT (CKD Stage 5) DNA & RNA Healthy1 NICOLA 40 - 44 Female No Diabetes Healthy (CKD Stage 1) DNA & RNA Healthy2 NICOLA 40 - 44 Female No Diabetes Healthy (CKD Stage 1) DNA & RNA Healthy3 NICOLA 40 - 44 Female No Diabetes Healthy (CKD Stage 1) DNA & RNA Healthy4* NICOLA 45 - 49 Male No Diabetes Healthy (CKD Stage 1) DNA & RNA Healthy5 NICOLA 50 - 54 Male No Diabetes Healthy (CKD Stage 1) DNA & RNA Healthy6 NICOLA 45 - 49 Male No Diabetes Healthy (CKD Stage 1) DNA & RNA

169 Healthy7 NICOLA 50 - 54 Male No Diabetes Healthy (CKD Stage 1) DNA & RNA Healthy8 NICOLA 50 - 54 Male No Diabetes Healthy (CKD Stage 1) DNA & RNA Healthy9 NICOLA 50 - 54 Male No Diabetes Healthy (CKD Stage 1) DNA & RNA Healthy10 NICOLA 50 - 54 Male No Diabetes Healthy (CKD Stage 1) DNA & RNA DKDonly1 Ireland 40 - 44 Female Diabetes Type I Proteinuria DNA DKDonly2 Ireland 70 - 74 Female Diabetes Type I Proteinuria DNA DKDonly3 Ireland 40 - 44 Female Diabetes Type I CRF DNA DKDonly4 Ireland 40 - 44 Male Diabetes Type I Proteinuria DNA DKDonly5 Ireland 55 - 59 Male Diabetes Type I Proteinuria DNA DKDonly6 Ireland 50 - 54 Male Diabetes Type I Proteinuria DNA DKDonly7 Ireland 45 - 49 Male Diabetes Type I Proteinuria DNA DKDonly8 Ireland 45 - 49 Male Diabetes Type I Proteinuria DNA DKDonly9 Ireland 40 - 44 Male Diabetes Type I Proteinuria DNA DKDonly10 Ireland 30 - 34 Male Diabetes Type I Proteinuria DNA DKDonly11 NICOLA 60 - 64 Female Diabetes Type I CKD Stage 3b RNA DKDonly12* NICOLA 60 - 64 Female Diabetes Type I CKD Stage 3b RNA DKDonly13 NICOLA 65 - 69 Female Diabetes Type I CKD Stage 3b RNA DKDonly14 NICOLA 65 - 69 Male Diabetes Type I CKD Stage 3b RNA DKDonly15 NICOLA 65 - 69 Male Diabetes Type I CKD Stage 3b RNA DKDonly16 NICOLA 70 - 74 Male Diabetes Type I CKD Stage 3b RNA DKDonly17 NICOLA 70 - 74 Male Diabetes Type I CKD Stage 3b RNA DKDonly18 NICOLA 75 - 79 Male Diabetes Type I CKD Stage 4 RNA DKDonly19 NICOLA 75 - 79 Male Diabetes Type I CKD Stage 4 RNA DKDonly20 NICOLA 75 - 79 Male Diabetes Type I CKD Stage 4 RNA *RNA from these samples was excluded due to an issue during library preparation DKD_ESRD, DKD with ESRD; ESRDonly, ESRD without Diabetes; Healthy, “Healthy” Controls: DKDonly, DKD without Transplant BRTx, samples from renal transplant patients from Belfast City Hospital; Ireland, All Ireland Collection; NICOLA, Northern Ireland Cohort for the Longitudinal study of Ageing.

For the current feasibility study four distinct phenotypes were chosen for sequencing: DNA and RNA from 10 BRTx samples with diabetes, with renal disease who required a kidney transplant formed the “DKD with ESRD” group (Age 25 - 49, mean age = 37 years); DNA and RNA from 10 BRTx patients without diabetes but with renal diseases requiring a kidney transplant formed the “ESRD without Diabetes” group (Age 25 - 79, mean age = 37 years); 10 DNA and RNA from NICOLA participants without diabetes or renal disease who had not had a kidney transplant formed the “Healthy” Controls” group (Age 40 - 54, mean age = 48 years). DNA from 10 samples from All-Ireland collection (Age 30 - 74, mean age = 47 years) and RNA from 10 NICOLA samples from patients with diabetes and kidney disease who had not received a kidney transplant formed the “DKD without Transplant” group (Age 60 - 79, mean = 70 years). The samples included in each of these groups can be seen in Table 3.3 and these will be used for the analyses detailed in Section 3.2.5.

170 Table 3.3| Phenotype Groups to be Used for Data Analysis Phenotype DKD with ESRD “Healthy” DKD without Transplant group ESRD without Controls Diabetes Molecular DNA and RNA DNA and DNA and DNA RNA Material RNA RNA Anon ID DKD_ESRD1 ESRDonly1 Healthy1 DKDonly1 DKDonly11 DKD_ESRD2 ESRDonly2 Healthy2 DKDonly2 DKDonly12* DKD_ESRD3 ESRDonly3 Healthy3 DKDonly3 DKDonly13 DKD_ESRD4 ESRDonly4 Healthy4* DKDonly4 DKDonly14 DKD_ESRD5 ESRDonly5 Healthy5 DKDonly5 DKDonly15 DKD_ESRD6 ESRDonly6* Healthy6 DKDonly6 DKDonly16 DKD_ESRD7 ESRDonly7 Healthy7 DKDonly7 DKDonly17 DKD_ESRD8 ESRDonly8 Healthy8 DKDonly8 DKDonly18 DKD_ESRD9 ESRDonly9 Healthy9 DKDonly9 DKDonly19 DKD_ESRD10 ESRDonly10 Healthy10 DKDonly10 DKDonly20 *RNA from these samples was excluded due to an issue during library preparation DKD, diabetic kidney disease; ESRD, end-stage renal disease

3.2.2 Sample Preparation and Quality Control Genetic material was collected and extracted as part of NICOLA, BRTx and All-Ireland collections. Stock DNA was stored in TE buffer (10 mM Tris Cl (pH 7.6), 1 mM EDTA (pH 8.0)) and normalised DNA was suspended in TE0.1 (10 mM Tris, 0.1 mM EDTA) and stored at -80oC. TE0.1 has a lower concentration of EDTA than TE buffer as too much EDTA can interfere with downstream reactions. DNA samples from the BRTx collection were prepared in 12 µL aliquots that were normalised to 100 ng/µl and stored at -80oC in the NephRes lab at BCH. RNA from the BRTx collection was also stored at -80oC in NephRes lab at BCH. DNA obtained from the All-Ireland collection was also prepared in 12 µL aliquots and normalised to 100 ng/µl and stored at -80oC in NephRes lab at BCH. No corresponding RNA was available from the All-Ireland collection and therefore samples were sex matched to the NICOLA samples. Due to the NICOLA cohort inclusion criteria it was not possible to age match these samples. The NICOLA RNA was normalised to 50 ng/µL in 5 µL to give 250 ng which was sent to the GCTU at QUB for quality control (QC) and sequencing. Samples were defrosted on ice which took around three hours after which these were normalised to 50 ng/µL in a volume of 5 µL and stored in BCH at -80oC overnight. In total 40 DNA and 40 RNA samples were sent to GCTU for mtDNA sequencing, RNA sequencing and bisulfite-treated DNA (BST-DNA) sequencing.

In Section 3.3.1 the amount of material and estimated concentrations sent to GCTU can be seen along with the actual concentration determined by GCTU. Prior to sequencing genetic material was quantitated using the Illumina Fragment Analyser Pipeline and further details

171 on this can be found in the associated documentation1040. For DNA Qubit results are also presented as this was used for quantification by GCTU. Following quantitation and QC by GCTU four samples were reported as not having enough DNA for sequencing and therefore extra material in a new aliquot was provided for these samples.

3.2.3 Illumina Sequencing by QUB Genomics Core Technology Unit All Illumina sequencing work was performed by GCTU at QUB1025. 40 DNA and 40 RNA samples were sent to GCTU for mitochondrial DNA-sequencing, RNA-sequencing and BST- DNA sequencing. GCTU were provided with 1200 ng of DNA (200 ng for QC, 200 ng for mtDNA sequencing, 500 ng for BST-DNA sequencing and 300 ng excess as a backup) and 250 ng of RNA (50 ng for QC and 200 ng for RNA sequencing) for QC and sequencing.

A summary of work performed by GCTU can be seen in Sections 3.2.4.1 – 3.2.4.3. In order to track and analyse a sequencing experiment all samples must be identifiable and for this, a comma-delimited file known as a Sample Sheet was used. This stored all information required for the sequencing run including a list of the samples, index sequences and the workflow used638. Every BaseSpace run requires an associated sample sheet and this must be named SampleSheet.csv. In the current project the sample sheets were produced by GCTU but Illumina Experiment Manager Software can also be used to create a sample sheet for any library preparation protocol638. The integration of Illumina sequencing platforms with BaseSpace Sequence Hub for downstream analysis requires data to be in a compatible file format and several Illumina sequencing systems provide outputs in binary base call format (BCL) format which can then be converted to FASTQ files for further analysis1041. Raw data files were also available to be downloaded from BaseSpace for use with other software platforms. The NextSeq and NovaSeq Systems produce BCL files that must be converted to FASTQ format using bcl2fastq Conversion Software which de-multiplexes data and converts BCL files to FASTQ file formats for downstream analysis. FASTQ is a text-based sequencing file format that stores raw sequence data with quality scores, and these can be used as input for many secondary data analysis workflows. BaseSpace Sequence Hub can also generate other file formats for example during downstream analysis of NGS data software platforms and BaseSpace applications may convert files from FASTQ format to other formats such as .vcf or.bam which can then be used for additional analysis or data visualisation. Data shared on BaseSpace can be accessed through the Public domain or Private domain both of which are accessed through unique login IDs and data is securely stored in a user’s individual profile accessible through the dashboard. The public domain is available for use by anyone provided

172 they create an account with a valid email address, a free trial is available after which a fee must be paid. In order to use the private domain a separate account must be created and registered to a domain such as the QUB domain.

After sequencing, various QC procedures were performed by GCTU. Limited QC metrics were available for BST-DNA as this was included in the GCTU workflow for BST-DNA sequencing. Raw sequencing data and FASTQ files were also shared by GCTU on Illumina BaseSpace.

3.2.3.1 Mitochondrial DNA Sequencing QUB GCTU performed mitochondrial DNA sequencing, FASTQ Generation and alignment. Mitochondrial DNA sequencing. Library work was performed using the Kapa HyperPlus1042 before capture on the Beckmann FXP automated robot and sequencing with Roche SeqCap EZ Developer (Mitochondrial Genome Design) kit1043. Library QC was performed before and after library preparation on the Agilent Fragment Analyzer1044,1045 with the appropriately sized kit then pooled to 4 nM and denatured before sequencing. The Illumina NextSeq 500 desktop sequencer was used for paired-end sequencing of mtDNA samples. The use of paired-end sequencing provides higher resolution and better alignment of reads across difficult to sequence, repetitive genomic regions. Dual-indexed sequencing was used in which two sequencing reads are performed as well as two index reads. In the current sequencing project 151 sequencing cycles and eight indexing cycles were performed for each read. Following sequencing two reads of 150 cycles were analysed as the extra sequencing cycle was required for phasing and pre-phasing calculations and was not included in the analysis. The NextSeq Mid output protocol was used for sequencing which can sequence up to 96 mtDNA samples. Full details of the NextSeq workflow can be found in the associated documentation from Illumina1026,1027. Sequencing outputs were accessed through BaseSpace via the users’ personal account on the public domain: “(Public Domain) My Data > Runs > PN046A_mitoseq”. Data on BaseSpace was shared by GCTU on the Public Domain and Private Domain, both of which require the user to create a personal account to access the data. FASTQ files were exported from BaseSpace via a web browser from the appropriate Run under the “Runs” tab in “My Data”. In order to export BCL files the BaseSpace Command Line must be used, and this can be accessed through the BaseSpace developer’s website if required.

Data produced from the mtDNA sequencing was in binary base call (BCL) format which was then converted to FASTQ format for use with downstream analysis software. FASTQ files are text files which contain sequence data along with a quality score each base represented as

173 an ASCII character. BCL files generated by GCTU were converted to FASTQ format as part of the sequencing workflow. Output files for each sample were provided by GCTU in FASTQ format and on-target Binary Alignment Map (BAM) format aligned to the NC_012920.1 NCBI reference sequence. QC reports were also provided by GCTU and these are summarised in a MultiQC report produced using MultiQC (Version 1.3.dev0) which is a modular tool used for aggregation of results from bioinformatics analyses into a single report. This report includes aggregated reports from Picard (Version 20.2.6), Samtools (Version 1.9) and FastQC (Version 0.11.8). The Picard toolkit is an open-source set of command line tools used for manipulating NGS data in various formats including SAM/BAM/CRAM and VCF. Samtools is a suite of programs used for reading/writing/editing/indexing/viewing files in SAM/BAM/CRAM format. FastQC is a quality control (QC) tool for use with NGS data and is used in conjunction with Picard for QC of BAM, SAM and FASTQ files.

3.2.3.2 RNA Sequencing QUB GCTU performed RNA library work and sequencing. Library preparation was performed using the KAPA Hyperplus kit before capture on the Beckmann FXP automated robot with the KAPA HyperPrep Kit with RiboErase. All 40 samples were taken in one batch and QC was performed before and after library preparation using the Agilent Fragment analyser. Libraries were pooled to 4 nM and denatured. Sequencing was performed on Illumina NextSeq 500 desktop sequencer using paired-end sequencing (75 base pair paired-end reads) with 20 million reads produced per sample. Dual-indexed library sequencing was employed to allow multiple libraries to be pooled and sequenced together. In the current sequencing project 76 sequencing cycles and 8 indexing cycle were performed for each read. Following sequencing two reads of 75 cycles were analysed as the extra sequencing cycle was required for phasing and pre-phasing calculations and was not included in the analysis. When used for gene expression profiling with RNA the Next Seq platform is capable of sequencing up to 40 samples in a single run1046. Sequencing outputs were accessed through BaseSpace via the users’ personal account on the public domain: “(Public Domain) My Data > Runs > PN0046B”. FASTQ files were exported from BaseSpace via a web browser from the appropriate Run under the “Runs” tab in “My Data”. In order to export BCL files the BaseSpace Command Line must be used, and this can be accessed through the BaseSpace developer’s website if required.

Data produced from the RNA sequencing was in BCL format which was converted to FASTQ format for use with downstream analysis software. BCL files generated by GCTU were

174 automatically converted to FASTQ format when imported into BaseSpace using FASTQ Generation (Version 1.0.0). The outputs from the FASTQ generation were found in BaseSpace in the users’ personal account on the public domain: “(Public Domain) My Data > Analyses > FASTQ Generation 2019-06-25 14:21:04Z” and “My Data > Analyses > FASTQ Generation 2019-06-25 14:22:37Z”. QC reports were also provided by GCTU and these are summarised in a MultiQC report produced using MultiQC (Version 1.3.dev0) and this included aggregated reports from QualiMap (Version 2.2.1), HTSeq count (Version 0.11.1), Picard (Version 20.2.6), STAR (Version 2.7.2a) and FastQC (Version 0.11.8). The Picard toolkit and FastQC have been outlined in section 3.2.4.1. QualiMap is a platform-independent application written in Java and R with both a Graphical User Interface (GUI) and a command-line interface to allow QC of alignment sequencing data and its derivatives. HTSeq Count is part of the Python package HTSeq and provided a list of genomic features and counts of how many reads map to each feature based on aligned sequence reads. STAR is a universal RNA-seq aligner, and this was used to align the sequencing data to the reference genome.

For RNA sequencing data produced by GCTU two FASTQ generation runs were performed to improve accuracy. If a Biosample has more than one FASTQ file all FASTQ files are merged for that Biosample in any downstream analysis on BaseSpace. Only those Biosamples that have been reviewed and unlocked can be used in the analysis and all FASTQ files which were marked as QC passed were aggregated for each sample. In the current analysis two sets of FASTQ files were generated for each sample and aggregated as outlined in Figure 3.41047.

Figure 3.4| Data Aggregation Workflow for Multiple FASTQ Files. When multiple FASTQ files exist for the same BaseSpace sample ID the data are aggregated as outlined above.

175 3.2.3.3 Methylation Sequencing In order to assess methylation patterns in the genome, BST-DNA sequencing was used. BST- DNA sequencing and FASTQ Generation were performed by QUB GCTU. Methylated regions were selected using SeqCap® Epi CpGiant Probes during library preparation. Library QC was performed before and after library preparation on the Agilent Fragment analyser. Paired-end sequencing was performed on Illumina NovaSeq 6000 sequencing system. Full details of the NovaSeq 6000 workflow can be found in the associated documentation from Illumina1028. Single-indexed library sequencing was used to generate two sequencing reads with 101 sequencing cycles and one indexing read with 6 cycles. Following sequencing two reads of 100 cycles were analysed as the extra sequencing cycle was required for phasing and pre- phasing calculations and was not included in the analysis. Sequencing outputs were found in BaseSpace on the user’s personal account on the private domain: “(Private Domain) My Data > Runs > GCTU_PN0047”. FASTQ files can be exported from BaseSpace via a web browser from the appropriate Run under the “Runs” tab in “My Data”, in order to export BCL files the BaseSpace Command Line must be used, and this can be accessed through the BaseSpace developer’s website. Data produced from the methylation sequencing is in binary base call (BCL) format which was then be converted to FASTQ format for use with downstream analysis software. The outputs from the FASTQ generation can be found in BaseSpace on the users’ personal account on the private domain: “(Private Domain) My Data > Analyses > FASTQ Generation 2019-05-03 12:15:12Z”. The GCTU do not provide support for alignment of MethylSeq data and therefore this is not presented at this stage. QC was performed on the FASTQ files using FastQC (Version 0.11.8).

3.2.4 Data Analysis with BaseSpace The following sections describes the BaseSpace Applications used for downstream analysis and step-by-step methods on how this was carried out. Time, cost and storage considerations are presented in the corresponding results section. Cost of work on BaseSpace is reported in iCredits which is the common currency for BaseSpace Sequence Hub services and are used for storage, computation and third-party app fees. For RNA and BST-DNA five analyses were performed as outlined in Figures 3.7 and 3.9. For mtDNA only visualisation of data is possible in BaseSpace, data can also be exported for analysis and comparison.

3.2.4.1 Mitochondrial DNA Sequencing Data Analysis Following FASTQ generation and post sequencing QC, FASTQ files were then used as inputs for the mtDNA Variant Processor (Version 1.0.0) app on BaseSpace. This app allows variant

176 analysis of any part of the circular mitochondrial genome with customisable quality and coverage thresholds. Step-by-step methods on how this was carried out can be seen below. Full details of the mtDNA Variant Processor can be found in the associated documentation1048..

1. After logging to the appropriate BaseSpace domain the Sequence Hub was opened. 2. From the Dashboard menu the Apps tab was then opened (https://basespace.illumina.com/apps/). 3. The mtDNA Variant Processor was then opened by clicking the appropriate icon. 4. This page provided information on how to use the app including an example project, additional release notes and access to the developer’s website from which full documentation and user guides were accessed. 5. After selecting the appropriate version from the dropdown box and the app was launched by clicking “Launch Application”. 6. A storage location to store the mtDNA Variant Processor output was selected from the “Project” heading. 7. Under the “Biosample(s)” heading the samples to be processed in the analysis were selected. This contained all samples in all BaseSpace projects available to the user on the current domain and only those generated from the mtDNA sequencing were chosen to be used with the mtDNA Variant Processor. 8. For “Genome” currently only the Revised Cambridge Reference Sequence (rCRS) is supported. The type of PCR primer used must also be selected under the “PCR Primer Description” dropdown box which includes options for D-loop Amplicon Manifest, Whole Mito Genome Amplicon Manifest and the ability to add a custom manifest with user defined coordinates. In the current analysis the Whole Mito Genome Amplicon Manifest was selected, and other settings used can be seen in Figure 3.5.

177

Figure 3.5| mtDNA Variant Processor User Interface with Settings used for Successful Analyses

9. Once the desired settings, input files, output location and analysis name were entered the analysis was initiated by clicking “Launch Application”. This opened the Analysis window that provided a summary of general information regarding the analysis duration, computing costs in iCredits and a Log with details of what was done by the mtDNA Variant Processor throughout the analysis. 10. Upon completion of the analysis, an email notification was sent to the email address registered to the BaseSpace account. Output files were stored in the project folder previously selected in the “Analyses” tab and named according to the analysis name selected by the user. A summary of the workflow performed by the mtDNA Variant Processor can be seen in Figure 3.6

178 VCF files used with mtDNA Variant Analyser for visualisation

Figure 3.6| Workflow Summary for mtDNA Variant Processor1049

VCF files generated from the mtDNA Variant Processor where used as inputs for the mtDNA Variant Analyser (Version 1.0.0) App which displayed the VCF file in web-based format allowing visualisation and report generation in Excel format (.xlsx). Sequence data can be compared between samples but direct comparison between groups is not possible in BaseSpace for mtDNA and therefore the mtDNA Variant Analyzer was used to produce an Excel report for each phenotype that could be used for further downstream analysis. Further details of the mtDNA Variant Analyzer can be found in the associated documentation1050. Step-by-step methods on how the mtDNA Variant Analyzer was used can be seen below.

1. The Apps tab on BaseSpace was accessed as described previous, and from here the mtDNA Variant Analyzer application was opened by clicking the appropriate icon. 2. After selecting the appropriate version from the dropdown box, the app was launched by clicking “Launch Application”. Pop-ups were enabled on web browser at this stage as the mtDNA Variant Analyzer opened in a new window. 3. After navigating to the project folder containing the output files from the mtDNA Variant Processor the samples to be included in the analysis were selected and submitted. 4. The mtDNA variant Analyser allowed visualisation of genomic features such as SNPs, insertion and deletions and allows comparison to the reference genome or between

179 individual samples. This can be seen in Figure 3.7 with NICOLA mtDNA samples used as an example:

Figure 3.7| mtDNA Variant Analyser Results Viewer Unique Identifiers have been removed to preserve sample anonymity.

5. The “Sample Compare” can be used to compare genetic features between two samples. In the current analysis this was not selected and therefore all samples were compared to the rCRS. 6. An Excel workbook report was produced by “Generate Report” which contained a summary of all variants for each sample selected. A total of four were produced, one for each of the phenotype groups.

3.2.4.2 RNA Sequencing Data Analysis In order to target mitochondrial RNA a custom reference genome was generated for analysis of RNA sequencing results. This was produced using the RNA Reference Genome Builder in BaseSpace using FASTA files and GTF files for mitochondrial gene expression from ENSEMBL as references. This custom reference genome was used for mitochondrial alignment of the FASTQ files using the RNA Seq Alignment App. Due to unforeseen circumstances, depth of coverage for mitochondrial RNA was suboptimal and therefore this analysis was not taken any further. The following analyses were performed with RNA Express using the default alignment settings. Aligned BAM files were provided by GCTU aligned to HG38, however this reference genome is not currently available on BaseSpace and therefore FASTQ files were used as input for RNA Express and analyses were performed with alignment based on HG19 and also with a custom mitochondrial reference genome derived from mitochondrial DNA

180 sequence from HG38. Following FASTQ generation and post sequencing QC, FASTQ files were used as inputs for the RNA Express (Version 1.0.0) App on BaseSpace. This app combines the capabilities of the STAR aligner and DE-Se analysis tools into one workflow which can be seen in Figure 3.9. This allows use of the most frequently used RNA analysis features within a single package, providing alignment of RNA-Seq reads with the STAR aligner that are then assigned to genes and differential gene expression is assessed with DESeq2. Up to 192 total control and comparison samples can be compared. Full details of the RNA Express app can be found in the associated documentation1051. The methods described below outline the steps used for all analyses using aligned RNA sequencing data generated by GCTU.

1. After opening the relevant BaseSpace Domain, the Apps window (https://basespace.illumina.com/apps/) was accessed as described in Section 3.2.4.1 and the RNA Express application was opened by clicking the appropriate icon. 2. This page provided information on how to use the app including an example project, additional release notes and access to the developer’s website from which full documentation and user guides can be accessed. 3. The appropriate version of the app was then selected from the dropdown box and launched by “Launch Application”. 4. Under the “App Session Name” heading, a uniquely identifiable name was given to each analysis. After clicking “Select Project”, under the “Save Results To” heading, a storage location was chosen to store the results of the RNA Express Analysis. 5. Under the “Control Group” heading, an appropriate name was given to this group (i.e. “Controls”) and control samples were selected by clicking “Select Biosample(s)”. This was also done for cases under the “Comparison Group”. The setting used for successful RNA Express analyses can be seen in Figure 3.8

181 Figure 3.8| RNA Express User Interface with Settings Used for Successful Analyses

6. Once the desired settings, input files, output location and analysis name were entered, the analysis was initiated by clicking “Launch Application”. This opened the Analysis window that provided a summary of general information regarding the analysis duration, computing costs in iCredits and a Log with details of what was being done throughout the analysis.

182 7. Upon completion of the analysis, an email notification was sent to the email address registered to the BaseSpace account. Output files were stored in the project folder previously selected in the “Analyses” tab and named according to the analysis name selected by the user. 8. Reports, charts and read metrics produced by RNA Express were then viewed from the “Analyses” tab and all data generated by the analysis was exported from BaseSpace and stored locally from “File > Download > Analysis”. A summary of the workflow performed by RNA Express can be seen in Figure 3.9.

Per-Sample Reads (FASTQ)

Alignment, Trimming (STAR, SAMtools)

Sorting, Indexing (bedtools)

Read counting

Global Expression (R, DESqe2)

Pairwise DE (R, DESqe2)

RNA Express Alignments Coverage Report (BAM) (bedGraph files)

Read Counts Expression Gene Read (count.csv) Heatmap Counts (heatmap.pdf) (counts.genes)

Figure 3.9| Workflow Summary for RNA Express1052

183

The output of RNA express provides aligned reads in BAM format for each sample along with counts of the number of reads mapped to genes and differential gene expression analysis results. The RNA Express app has a compute cost of 3.00 iCredits per hour. A total of five RNA Express analyses were performed in the current study and these are summarised in Figure 3.10.

Figure 3.10| Summary of Comparisons Performed using RNA Express

184

3.2.4.3 Bisulfite-Treated DNA Sequencing Data Analysis As the GCTU do not provide support for alignment of BST-DNA data this was carried out using the MethylSeq (version 2.0.0.0) application in BaseSpace. BST-DNA sequencing data was shared on the private QUB domain on BaseSpace which was accessed from the BaseSpace login screen by selecting “Private domain sign in”. Alignment was performed with a custom mitochondrial DNA template using MethylSeq and DRAGEN Methylation Pipeline. Additional alignment was performed with HG19 as a reference genome using MeythlSeq and DRAGEN Methylation pipeline.

3.2.4.3.1 MethylSeq Alignment The MethylSeq app was used with BST-DNA sequencing data and performs alignment, methyl calling and calculated alignment and methylation metrics. The MethylSeq workflow employs Isas (version 2.29.0) analysis software, SAMtools (version 1.2), Bismark (version 0.14.4) aligner for methyl calling which also includes Bowtie2 (version 2.2.2). Full details of the MethylSeq app can found in the associated documentation1053. The MethylSeq app has a compute cost of 3.00 iCredits per node per hour. The FASTQ files generated by GTUC were used input for MethylSeq. Step-by-step methods for FASTQ alignment performed can be seen below.

1. After opening the relevant BaseSpace Domain, the Apps window (https://basespace.illumina.com/apps/) was accessed as described in Section 3.2.5.1 and the MethylSeq application was opened by clicking the appropriate icon. 2. The following page provided some information on how to use the app including an example project, additional release notes and access to the developers’ website from which full documentation and user guides can be accessed. 3. After selecting the appropriate version of the app from the dropdown box it was launched by clicking “Launch Application”. 4. Under the “Analysis Name” a uniquely identifiable name was given to the analysis and a storage location selected from “Select Project” under the “Save Results To” heading. 5. Samples were then selected under the “Biosample(s)” heading by clicking “Select Biosample(s)”. All 40 BST-DNA samples were selected for alignment to HG19 and methyl calling. For targeted alignment to the mitochondrial genome a custom

185 reference and custom targeted manifest were also included. The custom reference was in FASTA format from UCSC HG38 build for the mitochondrial “chromosome” only1054. The custom manifest was extracted from the design files for the SeqCap® Epi CpGiant Probes1055. Other available options can be seen in Figure 3.11.

Figure 3.11| MethylSeq User Interface with Settings Used for Successful Analyses 6. Once the desired settings, input files, output location and analysis name were entered the analysis was initiated by clicking “Launch Application”. This opened the Analysis window that provided a summary of general information regarding the analysis duration, computing costs in iCredits and a Log with details of what was done throughout the analysis.

186 7. Upon completion of the analysis, an email notification was sent to the email address registered to the BaseSpace account. Output files are stored in the project folder previously selected in the “Analyses” tab and is named according to the analysis name selected by the user. Aligned files can be exported at this stage or used for differential methylation analysis with MethylKit. A summary of the workflow performed by MethylSeq can be seen in Figure 3.12.

Per-Sample Reads (FASTQ)

Read Conversion

C-to-T (converted FASTQ) G-to-A (converted FASTQ)

Alignment (bisulfite converted genome)

Alignment Results (BAM)

Methyl Calling

Bismark Bismark .bedGraph Coverage Report Processing Report

Statistics Reporting

Analysis Report PDF Report Summary Report (*.summary.csv) Figure 3.12| Workflow Summary for MethylSeq1056

This app generated several reports including aligned reads in BAM format; bedGraph; Cytosine Report; Bismark Alignment Report; Per-sample PDF reports containing targeted methylation metrics and the BaseSpace output was used with the MethylKit. (Version: 2.0.1) app for cytosine epigenetic profiling and differential methylation analyses.

3.2.4.3.2 Alignment with DRAGEN Methylation Pipeline From the results presented in Section 3.3.3.3 very little differential methylation was observed and therefore the BST-DNA alignment was repeated using the DRAGEN Methylation Pipeline in order to assess its performance compared to MethylSeq. The DRAGEN Methylation Pipeline app is designed to quickly analyse whole genome and targeted bisulfite DNA

187 sequence data. The workflow performed alignment, methyl calling, and calculated alignment and methylation metrics. The workflow for this app follows the same steps as the MethylSeq app. Full details of the DRAGEN Methylation Pipeline can be seen in the associated documentation1057. The DRAGEN Methylation Pipeline can support up to 96 samples in a single run in FASTQ, BAM or CRAM format and has a compute cost of 5.00 iCredits per hour. The FASTQ files generated by GTUC were used as the input for DRAGEN Methylation Pipeline. Alignment was performed to both HG19 and a custom mitochondrial reference genome produced using the DRAGEN Reference Builder (v3.4.5). To generate the custom reference genome for DRAGEN the FASTA file from UCSC HG38 build for the mitochondrial “chromosome” only1054 was used as input. A seed length of 10 was chosen due to the small size of the mitochondrial genome and the “Methylation” option was selected. All 40 BST-DNA samples were selected for alignment to HG19 and methyl calling. For targeted alignment to the mitochondrial genome a custom reference and custom targeted manifest were also included. The custom reference was in FASTA format from UCSC HG38 build for the mitochondrial “chromosome” only1054. The custom manifest was extracted from the design files for the SeqCap® Epi CpGiant Probes1055. The methods used for DRAGEN Methylation Pipeline analyses are described below.

1. After opening the relevant BaseSpace Domain, the Apps window (https://basespace.illumina.com/apps/) was accessed as described in Section 3.2.4.1 and the DRAGEN Methylation Pipeline application was opened by clicking the relevant icon 2. The following page provided some information on how to use the app including an example project, additional release notes and access to the developers’ website from which full documentation and user guides can be accessed. 3. After selecting the appropriate version of the app from the dropdown box it was launched by clicking “Launch Application”. 4. Under the “Analysis Name” a uniquely identifiable name was given to the analysis. A storage location was then selected from “Select Project” under the “Save Results To” heading.

188 5. FASTQ files were then selected under “Input FASTQs” and samples were selected by clicking “Select Biosample(s)”. All 40 BST-DNA samples were selected for alignment and methyl calling. The “Human (UCSC hg19)” genome build was selected from the “Reference” dropdown for the general alignment and the custom reference produced by the DRAGEN Reference Builder was used for targeted mitochondrial analysis. Other available options are shown in Figure 3.13.

Figure 3.13| DRAGEN Methylation Pipeline User Interface with Settings Used for Successful Analyses 6. Once the desired settings, input files, output location and analysis name were entered the analysis was initiated by clicking “Launch Application”. This opened the Analysis window that provided a summary of general information regarding

189 the analysis duration, computing costs in iCredits and a Log with details of what was done throughout the analysis. 7. Upon completion of the analysis, an email notification was sent to the email address registered to the BaseSpace account. Output files are stored in the project folder previously selected in the “Analyses” tab and is named according to the analysis name selected by the user. Aligned files can be exported at this stage or used for differential methylation analysis with MethylKit

This app generated several reports including aligned reads in BAM format; Cytosine Report (TXT.gz); Insert stats; M-bias TXT; Mapping metrics; Methyl metrics; and Time metrics. The Cytosine Report generated can be used for targeted and genome-wide cytosine epigenetic profiling and differential methylation analyses with MethylKit.

3.2.4.3.3 MethylKit Analysis Following alignment, the MethylKit (Version: 2.0.1) app was used for cytosine epigenetic profiling and differential methylation analyses. Due to issues with depth of coverage of the mitochondrial genome for BST-DNA this analysis was not taken any further. The following analyses were performed with MethylKit using the FASTQ files aligned to HG19 using both DRAGEN Methylation Pipeline and MethylSeq.

1. After opening the relevant BaseSpace Domain the Apps window (https://basespace.illumina.com/apps/) was accessed as described in Section 3.2.4.1 and the MethylKit app was opened by clicking the appropriate icon. 2. The following page provided some information on how to use the app including an example project, additional release notes and access to the developers’ website from which full documentation and user guides were accessed. 3. After selecting the appropriate version of the app from the dropdown box it was opened by clicking “Launch Application”. 4. Under the “App Session Name” a uniquely identifiable name was entered for each analysis. A storage location was chosen by clicking on “Select Project” under the “Save Results To” heading. 5. Cases were then selected from the “SAMPLES FOR CONDITION 1 (EXPERIMENT)” heading by clicking “Select Datasets”. All samples to be used as cases in the current comparison were selected. Controls were selected in a similar way from the “SAMPLES FOR CONDITION 2 (REFERENCE)” heading by clicking “Select Datasets”. Other available options can be seen in Figure 3.14.

190 6. Once the desired settings, input files, output location and analysis name were entered the analysis was initiated by clicking “Launch Application”. This opened the Analysis window that provided a summary of general information regarding the analysis duration, computing costs in iCredits and a Log with details of what was done throughout the analysis. 7. Upon completion of the analysis, an email notification was sent to the email address registered to the BaseSpace account. Output files are stored in the project folder previously selected in the “Analyses” tab and is named according to the analysis name selected by the user. 8. Reports, read metrics and differential methylation data produced by MethylKit were viewed from the “Analyses” tab and all data generated by the analysis was exported from BaseSpace and stored locally from “File > Download > Analysis“

Figure 3.14| MethylKit User Interface with Settings Used for Successful Analyses

191 The MethylKit app has a compute cost of 3.00 iCredits per hour. For the targeted mitochondrial files and the HG19 aligned files 10 MethylKit case-control comparisons were performed (Five with MethylKit output and Five with DRAGEN output) and these are summarised in Figure 3.15.

Figure 3.15| Summary of comparisons performed using MethylKit

3.3 Results

The following section will provide a summary of the results generated at each stage of the sequencing and analysis workflow for all successful analyses described in Section 3.2 for mtDNA, RNA and BST-DNA. This will include estimates of run times, computing costs, storage considerations and requirements for data analysis. Due to the methods used for library

192 preparation and sequencing targeted analysis of the mitochondrial genome was not possible for RNA and BST-DNA and therefore results are not presented for these analyses. In Section 3.4 these workflows will be evaluated and the feasibility of scaling up this project assessed.

3.3.1 Sample Preparation and Quality Control Samples were prepared in the NephRes Labs at BCH and sent to GCTU as described in Section 3.2.1. Prior to sequencing GCTU performed additional QC to ensure all samples were compatible with the sequencing workflow. Following QC by GTUC four DNA samples did not have enough genetic material remaining for mtDNA and BST-DNA sequencing and therefore additional DNA was provided for these. During quantitation by GCTU, it was reported that there was some inconsistency with the concentrations of the samples (from Table 3.4). Due to limited reserves of genetic material and the nature of the current work as a small scale pilot study it was decided that the concentrations of RNA and DNA would be sufficient for this purpose although these inconsistencies in concentrations will have to be addressed for any future work using these samples. The reports provided by GCTU (Table 3.4) show that one DNA sample and ten RNA samples were found to have a low quality score (GQN and RQN, respectively) which is typically an indicator of sample degradation and it is samples with a quality score lower than 7 are not recommended to be used for library preparation and sequencing.1058.

193 Table 3.4| Pre-Sequencing Quality Control and Quantification by GCTU

DNA RNA

Sample ID Fragment analyser results QUBIT Fragment Analyser Sample Total % Size DV200 28S/ GQN Conc. RQN Conc. >2000bp (bp) [%] 18S [ng/ul] [ng/ul] DKD_ESRD1 89.7 26292 9.5 48.9 48.7 2.6 0 120.9 DKD_ESRD2 93.4 19231 9.8 58.8 90.7 7.6 1 59.3 DKD_ESRD3 67.9 - 5 74.7 87.2 7.2 1.4 99.8 DKD_ESRD4 86.5 18462 8.4 73.8 34.4 6.3 0 1.1 DKD_ESRD5 96.3 17885 9.1 65.1 92.5 8.9 1.8 49.4 DKD_ESRD6 91.9 20572 9.7 64.5 92.4 8.1 1.4 152.8 DKD_ESRD7 97.5 24766 9.1 88.8 75.9 7.1 1.2 87.2 DKD_ESRD8 97 29056 9.9 113.7 63.1 7.4 1.5 51.2 DKD_ESRD9 96 17981 9.7 75.9 71 6.1 1.3 172 DKD_ESRD10 85.9 24766 8.4 84.3 89.9 8.2 1.2 55.8 ESRDonly1 94.1 18846 9.5 89.7 86.5 8.1 1.6 74.8 ESRDonly2 87.2 26101 9.8 91.2 93.9 8.6 1.5 76.1 ESRDonly3 93.9 18173 9.5 84 93 8.8 1.5 60.7 ESRDonly4 93.9 12020 9.6 97.2 83.7 7.6 1.5 62.3 ESRDonly5 92.8 21335 9.1 83.4 90.2 7.8 1.4 53.7 ESRDonly6 97.1 27340 9.6 87.9 82.2 7.5 1.5 26.7 ESRDonly7 96.3 26101 9.6 81 91.4 8.5 1.7 47.8 ESRDonly8 96.7 24862 9.7 83.4 91.3 10 0 18.3 ESRDonly9 90.7 27436 9.5 112.8 87.2 7 1.2 25.5 ESRDonly10 89.1 9945 9.4 108.9 91.4 8.4 1.4 88 Healthy1 93.8 30486 9.4 233.7 85 6.9 2 53 Healthy2 91.3 30295 8.8 195 76 6.2 2 44.1 Healthy3 97 28389 9.8 92.7 84.8 8.5 2.4 49.6 Healthy4 94.1 28293 9.4 117.9 81.8 8.6 1.9 49 Healthy5 92.4 30200 9.2 291.3 84.8 8 1.9 30.5 Healthy6 98.2 25910 9.9 195.3 80.7 8.4 2.8 44.9 Healthy7 98.9 31058 9.9 153.6 90.4 7.9 1.9 60.4 Healthy8 94.7 31630 9.4 103.8 84.1 8.6 2.6 42.2 Healthy9 95.5 28770 9.8 152.4 29.8 1.1 0 51.1 Healthy10 99 26292 9.9 179.7 83.7 7.2 2.6 44.6 DKDonly1 86.5 30391 9 217.2 NA NA NA NA DKDonly2 70.4 20858 7.1 351 NA NA NA NA DKDonly3 88.4 11443 9 179.7 NA NA NA NA DKDonly4 96.2 19135 9.7 225.6 NA NA NA NA DKDonly5 86.5 9727 8.7 360 NA NA NA NA

194 DKDonly6 95.9 27626 9.8 236.7 NA NA NA NA DKDonly7 93.9 32774 9.4 264.3 NA NA NA NA DKDonly8 95.3 23813 9.6 124.2 NA NA NA NA DKDonly9 95.4 24671 10 128.4 NA NA NA NA DKDonly10 94.4 23527 10 116.7 NA NA NA NA DKDonly11 NA NA NA NA 84.1 7.8 3.3 41.9 DKDonly12 NA NA NA NA 85.9 7.3 2.3 51.7 DKDonly13 NA NA NA NA 81.6 5.4 1 52.3 DKDonly14 NA NA NA NA 86.5 8.3 2.3 48.6 DKDonly15 NA NA NA NA 79.8 7.6 2 30.1 DKDonly16 NA NA NA NA 85.5 7.2 2.2 48.5 DKDonly17 NA NA NA NA 86 8.5 2.8 53.8 DKDonly18 NA NA NA NA 69.2 2.1 0 51.2 DKDonly19 NA NA NA NA 81.1 2.7 7.3 49 DKDonly20 NA NA NA NA 83.1 1.9 7.6 56.5 GQN, genomic quality number; DV200, percentage of RNA fragments > 200 nucleotides; RQN, RNA quality number

3.3.2 Illumina Sequencing by Genomics Core Technology Unit QUB GCTU performed sequencing and output files were shared via Alfresco and provided on a hard drive. Illumina projects were also shared on BaseSpace to allow further analysis without the need to re-upload the raw sequencing data. Below is a summary of post sequencing QC metrics provided by GCTU and from MultiQC reports. For each sequencing project different QC procedures have been performed by GCTU and details of the QC carried out for each can be found in Section 3.2.4.

3.3.2.1 Mitochondrial DNA Sequencing Sequencing of mtDNA using the Illumina NextSeq 500 took just over 27 hours and generated 27 GB of data in BCL and binary call index (BCI) format. Like BCL files, BCI files are also in binary format, however these contain one record per tile from the sequencing run and are used as an index when converting BCL to FASTQ. Files were exported from BaseSpace from “My Data > Runs > PN046A_mitoseq [mtDNA Sequencing Run] > Summary > File > Download > Run”. This did not include BCL or BCI files and therefore only contained 71.9 MB of data that included details on run parameters and run information as well as files which can be used with the sequence analysis viewer (SAV) to view metrics generated by real-time analysis software on Illumina sequencing systems. Details of data generated during this sequencing run are summarised in Table 3.5.

195 Table 3.5| mtDNA Sequencing Performed by QUB GCTU Input files Library generated from samples sent to GCTU Output files BCL and .BCI files Time taken 27 hours 5 minutes 40 seconds iCredits used 0 Sequencing NextSeq 500 Instrument Size on BaseSpace 27 GB (in 2 586 files) Files exported from Sequencing info, metrics, parameters and Sequence Analysis BaseSpace Viewer data (.SAV files) Size on disk 71.9 MB (in 15 files) BCL, basecall files; BCI, Basecall index files

Following sequencing several metrics (summarised in Table 3.6, 3.7 and Appendix 6) were available on BaseSpace and these were accessed through “My Data > Runs > [mtDNA Sequencing Run] > Metrics”. Summary metrics are presented in Table 3.6. Additional information regarding each sequencing read for mtDNA sequencing found under the “Per Read Metrics” heading and this is presented in Table 3.7. These metrics include number of cycles; yield; projected total yield; percentage of reads from clusters in each tile aligned to PhiX control genome (a control genome derived from the bacteriophage PhiX genome); percentage error; intensity statistic at cycle 1; and percentage of bases with a quality score ≥ 30. Metrics were also available for each lane under the “Per Lane Metrics” heading which included more detailed metrics split out by lane such as percentage of clusters which passed filtering; percentage of bases with a quality score ≥ 30; number of bases sequenced which passed filter; error rate; number of tiles per lane; cluster density; phasing and pre-phasing information; and intensity measures. Per lane metrics can be seen in Appendix 6.

Indexing QC was also available on BaseSpace and was accessed through “My Data > Runs > PN046A_mitoseq [mtDNA Sequencing Run] > Indexing QC”. This provided information for indexes used in the run as stated in the sample sheet. This is only available for indexed runs and does not provide information regarding downstream analysis metrics. A summary of indexing performance for each lane was available as well as additional information regarding the frequency of individual indexes. This allowed samples to be pooled and sequenced at the same time after which the unique indexes are used to identify each sample based on the information recorded in the Sample Sheet. During indexed sequencing the index is sequenced as a separate read and in the case of the current analysis dual-indexed sequencing was used and therefore two additional reads are included (read 2 and read 3).

196 Table 3.6| Overview of Run Metrics from mtDNA Sequencing Performed by GCTU Run Metric Value (Total) Clusters PF 169,166,565 Clusters Raw 195,748,858 % PF 86.42% R1 Phasing 0.148 R1 Pre-phasing 0.149 R1 %Q30 80.35 R1 Error Rate 1.41 R2 Phasing 0.189 R2 Pre-phasing 0.151 R2 %Q30 79.00 R2 Error Rate 1.44 Average %30 80.23% Yield 53.10 Gbp Total clusters and number of clusters which passed filters along with Read 1 and Read 2 metrics

Table 3.7| mtDNA Post-Sequencing QC for Each Read Including Index Reads Cycles Yield Projected Aligne Error Intensity %≥ yield d (%) rate cycle 1 Q30 (%) Read 1 151 25.37 Gbp 25.37 Gbp 1.43 1.41 4,251 80.35 Read 2 (I) 8 1.18 Gbp 1.18 Gbp 0.00 0.00 2,548 91.57 Read 3 (I) 8 1.18 Gbp 1.18 Gbp 0.00 0.00 2,294 92.79 Read 4 151 25.37 Gbp 25.37 Gbp 1.39 1.44 3,608 79.00 Non-Index 302 50.74 Gbp 50.74 Gbp 1.41 1.43 3,930 79.68 Reads Total Totals 318 53.10 Gbp 53.10 Gbp 1.41 1.43 3,175 80.23 Yield and cycles for each read including indexing reads along with quality metrics. Aligned (%) refers to the percentage of reads aligned to the PhiX control genome which is used to estimate error rate and high % of clusters aligned to the PhiX genome may indicate under clustering or imprecise cluster density

Descriptive charts were also produced by BaseSpace to allow visualisation of the QC metrics and simplified assessment of any erroneous sequencing results. These included a flow cell chart; data charts by cycle and by lane; QScore distribution chart; and QScore heatmap. Multiple options were available for each of these charts and detailed information can be found on the help section of BaseSpace1059.

FASTQ Generation was also performed by GCTU using BCL and BCI files as inputs to generate FASTQ files. This took approximately 43 minutes and generated 31.70 GB of data on BaseSpace. This can be accessed on BaseSpace via the users’ personal account on the public domain:“(Public Domain) My Data > Analyses > FASTQ Generation 2019-03-26 16:05:30Z”

197 and can be exported via “Files > Download > Analysis” which downloads a total of 29.2 GB of FASTQ files. A summary of this can be seen in Table 3.8.

Table 3.8| FASTQ Generation for mtDNA Sequencing by QUB GCTU Input files BCL and .BCI files Output files FASTQ Time taken 43 minutes 8 seconds iCredits used 0 Software used FASTQ Generation | Version: 1.0.0 HiSeqFastQ Size on BaseSpace 31.70 GB Files exported from .fastq BaseSpace Size on disk 29.2GB (320 files) BCL, basecall files; BCI, Basecall index files

Following sequencing and alignment additional QC was performed by GCTU. These QC metrics were provided for all 40 samples sent to GCTU and a summary of averages and the range of results can be seen in Table 3.9 and the full set of results from this is available in Appendix 7.

Table 3.9| Post Alignment QC Summary for all 40 mtDNA Samples MitoSeq Total Mapped Paired Mean Median % % GC % M Summary reads reads reads coverage coverage Dups Failed Seqs Mean 7444821 6395467 6279692 54268.4 56721.3 47.80 44.15 20.83 3.71 Median 7346550 6221198 6109124 53018.86 55408 47.01 44.00 25.00 3.65 Min 5362122 4212760 4133695 35569.37 37619 36.04 44.00 12.50 2.70 Max 12385699 11111903 10931935 92092.04 96289 61.73 45.00 25.00 6.20 Range 7023577 6899143 6798240 56523 58670 25.69 1.00 12.50 3.50 % Dups, Percentage of duplicate reads; % GC, per sequence GC content; M seqs, millions of sequences

One DNA sample which was found to have a low GQN however, GCTU considered this to be within an acceptable range for sequencing and from the post-alignment QC of this sample no anomalous results were observed and therefore these results are included in the summary in Table 3.9.

198 3.3.2.2 RNA Sequencing Sequencing of RNA using the Illumina NextSeq 500 took almost 20 hours and generated 32 GB of data in BCL and BCI format. Files were exported from BaseSpace from “My Data > Runs > PN0046B [RNA Sequencing Run] > Summary > File > Download > Run”. This did not include BCL or BCI files and therefore only contained 113 MB of data including details on run parameters and run information as well as files which can be used with the SAV to view metrics generated by real-time analysis software on Illumina sequencing systems. Details of data generated during this sequencing run are summarised in Table 3.10.

Table 3.10| RNA Sequencing Performed by QUB GCTU Input files Library produced from samples sent to GCTU Output files BCL and .BCI files Time taken 19 hours 48 minutes 16 seconds iCredits used 0 Sequencing Instrument NextSeq 500 Size on BaseSpace 32 GB Files exported from Sequencing info, metrics, parameters and Sequence Analysis BaseSpace Viewer data (.SAV files) Size on disk 113 MB (15 files) BCL, basecall files; BCI, Basecall index files

Following the sequencing several metrics (summarised in Tables 3.11, 3.12 and Appendix 8) were available on BaseSpace and these were accessed through “My Data > Runs > PN0046B [RNA Sequencing Run] > Metrics”. Summary metrics of the RNA sequencing run are presented in Table 3.11 and additional metrics were available for each read under the “Per Read Metrics” heading and these are presented in Table 3.12. These metrics include number of cycles; yield; projected total yield; percentage aligned; percentage error; intensity statistic at cycle 1; and percentage of bases with a quality score ≥ 30. Metrics were also available for each lane under the “Per Lane Metrics” heading which included more detailed metrics split out by lane such as percentage of clusters which passed filtering; percentage of bases with a quality score ≥ 30; number of bases sequenced which passed filter; error rate; number of tiles per lane; cluster density; phasing and pre-phasing information; and intensity measures. Per lane metrics can be seen in Appendix 8.

Indexing QC was also available on BaseSpace and this can be accessed through “My Data > Runs > PN0046B [RNA Sequencing Run] > Indexing QC”. This provided information for indexes used in the run as stated in the sample sheet. This is only available for indexed runs and does not provide information regarding downstream analysis metrics. A summary of indexing performance for each lane is available as well as additional information regarding the

199 frequency of individual indexes. This allows samples to be pooled and sequenced at the same time after which the unique indexes can be used to identify each sample based on the information recorded in the Sample Sheet. During indexed sequencing the index is sequenced as a separate read and in the case of the current analysis dual-indexed sequencing was used and therefore two additional reads were included (read 2 and read 3).

Table 3.11| Overview of Run Metrics from RNA Sequencing Performed by GCTU. Run Metrics Value (Totals) Clusters PF 463,834,141 Clusters Raw 498,247,056 % PF 93.09% R1 Phasing 0.119 R1 Pre-phasing 0.120 R1 %Q30 95.46 R1 Error Rate 0.25 R2 Phasing 0.160 R2 Pre-phasing 0.149 R2 %Q30 92.82 R2 Error Rate 0.35 Average %30 94.28% Yield 76.04 Gbp Total clusters and number of clusters which passed filters along with Read 1 and Read 2 metrics.

Table 3.12| RNA Post-Sequencing QC for Each Read Including Index Reads Yield Projected Aligned Error Intensity %≥Q3 Cycles (Gbp) yield (Gbp) (%) rate (%) cycle 1 0 Read 1 76 34.79 34.79 1.06 0.25 6,198 95.46 Read 2 (I) 8 3.25 3.25 0 0 4,171 96.66 Read 3 (I) 8 3.25 3.25 0 0 4,039 94.96 Read 4 76 34.76 34.76 1.04 0.35 6,718 92.82 Non-Index 152 69.54 69.54 1.05 0.3 6,458 94.14 Reads Total Totals 168 76.04 76.04 1.05 0.3 5,281 94.28 Yield (Giga base pairs )and cycles for each read including indexing reads along with quality metrics. Aligned (%) refers to the percentage of reads aligned to the PhiX control genome which is used to estimate error rate and high % of clusters aligned to the PhiX genome may indicate under clustering or imprecise cluster density.

200 Descriptive charts were produced by BaseSpace to allow visualisation of the QC metrics and simplify assessment of any erroneous sequencing results. These included a flow cell chart; data charts by cycle and by lane; QScore distribution chart; and QScore heatmap. Multiple options can be selected for each of these charts and detailed information can be found on the help section of BaseSpace1059.

FASTQ Generation was also performed by GCTU using BCL and BCI files as inputs to generate FASTQ files. For RNA sequencing data, two sets of FASTQ files were generated which in order to improve accuracy in downstream analysis. In downstream analysis any FASTQ files with the same ID will be aggregated as described previously (Figure 3.4). Altogether, FASTQ generation for RNA sequencing data took approximately 1 hour 32 minutes and generated 83.0 GB of data on BaseSpace. This can be accessed on BaseSpace via “((Public Domain) My Data > Analyses > FASTQ Generation 2019-06-25 14:21:04Z” and “(Public Domain) My Data > Analyses > FASTQ Generation 2019-06-25 14:22:37Z”. Each set of FASTQ files was exported from the corresponding Analyses tab via “Files > Download > Analysis” which downloaded a total of 78.5 GB of FASTQ files. A summary of this can be seen in Table 3.13.

Table 3.13| FASTQ generation from RNA Seq performed by GCTU Input files BCL .and BCI files Output files FASTQ files Time taken FASTQ Generation 1: 46 minutes 7 seconds FASTQ Generation 2: 46 minutes 49 seconds iCredits used 0 Software used FASTQ Generation | Version: 1.0.0 Size on BaseSpace FASTQ Generation 1: 40.62 GB FASTQ Generation 2: 42.38 GB Files exported from .fastq BaseSpace Size on disk FASTQ Generation 1:38.4 GB (320 Files) FASTQ Generation 2:40.1 GB (320 Files) BCL, basecall files; BCI, Basecall index files

Following RNA sequencing, three samples were reported by GCTU to have high ribosomal contamination due to an error with the robot and therefore RNA from these samples was excluded from all further analysis. It was also reported by GCTU that the rest of the RNA samples were of poor quality which can be seen in Table 3.9 and possible reasons for this will be discussed in Section 3.4. However, as the current project is a pilot study these were used for downstream analysis in order to assess RNA Express workflow and determine resources required to carry out a similar project on a larger scale. Following RNA sequencing, FASTQ generation and alignment additional QC was performed by GCTU and these metrics are

201 summarised in Table 3.14 and presented in full in Appendix 9. Unlike the additional QC from the mtDNA sequencing, the RNA sequencing QC performed by GCTU did not contain any information regarding paired reads, mean coverage and median coverage and therefore this is not shown in Table 3.14.

Table 3.14| Post Alignment QC Summary for 37 RNA Samples RNASeq Total reads Mapped % % GC % M % % Summary reads Dups Failed Seqs rRNA mRNA Mean 63466610 41226366 45.42 49.76 24 21.9 3.8 43.60 Median 59393804 42130520 41.30 49.50 21 22.4 1.5 41.50 Min 28395348 21406350 31.35 46.00 17 11.2 0.1 29.60 Max 102189000 59742502 73.30 56.00 38 31.7 22.5 66.30 Range 73793652 38336152 41.95 10.00 21 20.5 22.4 36.70 (3 samples excluded due to robot errors) % Dups, Percentage of duplicate reads; % GC, per sequence GC content; M seqs, millions of sequences.

Table 3.14 included 10 RNA samples previously mentioned to have a low RQN score. The values for these were below the recommended value for use with Illumina library preparation protocols which may therefore impact on downstream analysis, however, from the post-alignment QC of these samples no anomalous results were observed (Appendix 9) and these samples were included in the RNA Express analysis.

3.3.2.3 Bisulfite-Treated DNA Sequencing Sequencing of BST-DNA using the Illumina NovaSeq6000 took almost 22 hours and generated 255 GB of data in concatenated basecall (CBCL) format. Files were exported from BaseSpace from “(Private Domain) My Data > Runs > GCTU_PN0047 [BST-DNA Sequencing Run] > Summary > File > Download > Run”. The exported folder contains 213 MB of data which included details on run parameters and run information as well as files which can be used with the SAV to view metrics generated by real-time analysis software on Illumina sequencing systems. Details of data generated during this sequencing run are summarised in Table 3.15.

202 Table 3.15| BST-DNA sequencing run performed by QUB GCTU Input files Library produced from biosamples sent to GCTU Output files CBCL files Time taken 21 hours 40 minutes 34 seconds iCredits used 0 carried out by GCTU Sequencing NovaSeq6000 Instrument Size on BaseSpace 255 GB Files exported from Sequencing info, metrics, parameters and Sequence Analysis BaseSpace Viewer data (.SAV files) Size on disk 213 MB (20 files) CBCL, concatenated basecall files produced by NovaSeq 6000

Following the sequencing several metrics (summarised in Tables 3.16, 3.17 and Appendix 10) were available on BaseSpace and were accessed through “(Private Domain) My Data > Runs > GCTU_PN0047 [BST-DNA Sequencing Run] > Metrics”. Summary metrics of the BST-DNA sequencing run are presented in Table 3.16 and additional metrics are available for each read under the “Per Read Metrics” heading and these are presented in Table 3.17. These metrics include number of cycles; yield; projected total yield; percentage aligned; percentage error; intensity statistic at cycle 1; and percentage of bases with a quality score ≥ 30. Metrics were also available for each lane under the “Per Lane Metrics” heading which included more detailed metrics split out by lane such as percentage of clusters which passed filtering; percentage of bases with a quality score ≥ 30; number of bases sequenced which passed filter; error rate; number of tiles per lane; cluster density; phasing and pre-phasing information; and intensity measures. Per lane metrics can be seen in Appendix 10.

Indexing QC is also available on BaseSpace and this was accessed through “(Private Domain) My Data > Runs > GCTU_PN0047 [BST-DNA Sequencing Run] > Indexing QC”. This provided information for indexes used in the run as stated in the sample sheet. This is only available for indexed runs and does not provide information regarding downstream analysis metrics. A summary of indexing performance for each lane was available as well as additional information regarding the frequency of individual indexes. This allowed samples to be pooled and sequenced at the same time after which the unique indexes can be used to identify each sample based on the information recorded in the Sample Sheet. During indexed sequencing the index is sequenced as a separate read and in the case of the current analysis single- indexed sequencing was used and therefore only one additional read is included (Read 2).

203 Table 3.16| Overview of Run Metrics from BST-DNA Sequencing Performed by GCTU Metric Value Clusters PF 3,968,662,631 Clusters Raw 5,761,400,832 % PF 68.88% R1 Phasing 0.089 R1 Pre-phasing 0.072 R1 %Q30 89.60 R1 Error Rate 0.23 R2 Phasing 0.039 R2 Pre-phasing 0.029 R2 %Q30 81.71 R2 Error Rate 0.97 Average %30 85.64 Yield 813.86 Gbp Total clusters and number of clusters which passed filters along with Read 1 and Read 2 metrics.

Table 3.17| BST-DNA Post-Sequencing QC for Each Read Including Index Reads Cycle Yield Projected Aligned Error Intensity %≥ (Gbp) yield (Gbp) (%) rate (%) cycle 1 Q30 Read 1 101 397.01 397.01 7.56 0.23 1,215 89.60 Read 2 (I) 6 19.84 19.84 0.00 0.00 1,681 85.11 Read 3 101 397.01 397.01 6.27 0.97 1,312 81.71 Non-Index 202 794.02 794.02 6.92 0.60 1,263 85.65 Reads Total Totals 208 813.86 813.86 6.92 0.60 1,403 85.64 Yield (giga base pairs) and cycles for each read including indexing reads along with quality metrics. Aligned (%) refers to the percentage of reads aligned to the PhiX control genome which is used to estimate error rate and high % of clusters aligned to the PhiX genome may indicate under clustering or imprecise cluster density.

Descriptive charts were produced by BaseSpace which allowed visualisation of the QC metrics and simplified assessment of any erroneous sequencing results. These included a flow cell chart; data charts by cycle and by lane; QScore distribution chart; and QScore heatmap. Multiple options were available for each of these charts and detailed information can be found on the help section of BaseSpace1059.

FASTQ Generation was also performed by GCTU using CBCL files as inputs to generate FASTQ files. For BST-DNA sequencing data, two FASTQ generation runs were performed however, the first FASTQ generation failed as the run was re-queued and therefore a partial set of FASTQ files was also found on BaseSpace, however only the completed FASTQ generation marked as “QC Passed” was included in downstream analysis and exported from BaseSpace.

204 FASTQ generation for BST-DNA sequencing data took approximately 2 hours 10 minutes and generated 403 GB of data on BaseSpace. This was accessed on BaseSpace via “(Private Domain) My Data > Analyses > FASTQ Generation 2019-05-03 12:15:12Z”. FASTQ files were exported from the Analyses tab via “(Private Domain) My Data > Analyses > FASTQ Generation 2019-05-03 12:15:12Z > Files > Download > Analysis” which downloaded a total of 331 GB of FASTQ files. A summary of this can be seen in Table 3.18.

Table 3.18| FASTQ Generation from BST-DNA Sequencing Performed by GCTU Input files CBCL Files Output files FASTQ files Time taken 2 hours 10 minutes 48 seconds iCredits used 0 Software used FASTQ Generation | Version: 1.0.0 Size on BaseSpace 403.73 GB Files exported from BaseSpace .fastq Size on disk 331 GB (40 Files) CBCL, concatenated basecall files produced by NovaSeq 6000

Unlike in mtDNA and RNA sequencing, alignment was not performed by GCTU for BST-DNA sequencing data and therefore the results of this are not presented in this section (See Section 3.3.3.3). A summary of the QC metrics produced using FastQC can be seen in Table 3.19 and these are presented in full in Appendix 11. In contrast to mtDNA sequencing and RNA sequencing alignment was not performed by GCTU for BST-DNA and therefore only limited metrics are presented in Table 3.19 (FastQC output only).

Table 3.19| BST-DNA FastQC summary for all 40 samples

% Dups % GC Length % Failed M Seqs Mean 36.12 29.10 99 bp 15.08 76.5 Median 31.70 29.00 99 bp 18.00 75.6 Min 24.10 28.00 99 bp 9.00 59.0 Max 73.30 32.00 99 bp 27.00 103.8 Range 49.20 4.00 99 bp 18.00 44.8 % Dups, Percentage of duplicate reads; % GC, per sequence GC content; M seqs, millions of sequences.

3.3.3 Data Analysis with BaseSpace

3.3.3.1 Mitochondrial DNA Sequencing Analysis The FASTQ files produced from mtDNA sequencing were used as input for the mtDNA Variant processor application and this analysis took approximately 1 hour 7 minutes to produce 22.63 GB of data that contains binary alignment map files (BAM); binary indexed bam files

205 (.bam.bai); VCF files; and meta-data in .json format. This data was accessed via “(Public Domain) My Data > Analyses > mtDNA Variant Processor 07/04/2019 11:01:36”. Details on the mtDNA Variant Processor run can be seen in Table 3.20.

Table 3.20| Summary of mtDNA Variant Processor Output from BaseSpace Input files FASTQ Output files .bam; .bam.bai; .vcf Time taken 1 hour 7 minutes 17 seconds iCredits used 0 credits Software used mtDNA Variant Processor | Version: 1.0.0 Size on BaseSpace 22.63 GB Files exported from .bam; .bam.bai; .vcf BaseSpace Size on disk 22.6 GB (200 files) .bam, binary sequence alignment file; .bam.bai, index file for corresponding .bam file; .vcf, variant call format file

Following this the VCF and BAM output files from the mtDNA Variant Processor were used as input for mtDNA Variant Analyser and this can be visualised on BaseSpace as seen in Section 3.2.4.1. Direct comparison of case-control data from mtDNA is not possible using the mtDNA Variant Analyser and this data was exported in Excel format which can be used for downstream analysis. A summary of the Excel outputs can be seen in Table 3.21 and further information is not presented here in order to preserve sample anonymity. There were eight SNPs which appeared in all samples in this project and these were 263G, 315.1C, 750G, 1438G, 3107d, 4769G, 8860G and 15326G. In the samples taken from those with ESRD without Diabetes from the BRTx collection one SNP was found to be unique to these samples and this was 16304C which was found in two samples in this group. In the healthy control group obtained from the NICOLA study samples, two SNPs were found to be unique to this study group and these were 513.1C and 513.2a 3 which were both found in the same three samples in this group. In the samples taken from individuals with DKD with no evidence of ESRD, two SNPs were found to be unique to these samples and these were 513A and 4820A both of which appeared in two DNA samples, one of which contained both these SNPs. One SNP was found to be unique to individuals with DKD regardless of ESRD status and this was 150T. Due to small sample size robust statistical testing was not possible from these results.

206 Table 3.21| Summary of Results Exported from mtDNA Variant Analyser

DKD with ESRD DNA DKD_ESRD1 DKD_ESRD2 DKD_ESRD3 DKD_ESRD4 DKD_ESRD5 DKD_ESRD6 DKD_ESRD7 DKD_ESRD8 DKD_ESRD9 DKD_ESRD10 Single Nucleotide Variants 31 12 42 13 16 16 16 33 15 12 Insertions 1 1 7 1 2 2 1 2 2 1 Deletions 3 1 1 1 1 3 1 1 1 1 ESRD without Diabetes DNA ESRDonly1 ESRDonly2 ESRDonly3 ESRDonly4 ESRDonly5 ESRDonly6 ESRDonly7 ESRDonly8 ESRDonly9 ESRDonly10 Single Nucleotide Variants 31 13 10 14 32 11 10 16 37 14 Insertions 4 1 3 3 2 1 2 2 1 1 Deletions 1 1 1 2 1 1 1 1 1 3 “Healthy” Controls DNA Healthy1 Healthy2 Healthy3 Healthy4 Healthy5 Healthy6 Healthy7 Healthy8 Healthy9 Healthy10 Single Nucleotide Variants 34 38 13 15 35 15 29 11 11 10 Insertions 2 3 4 2 1 2 3 1 1 1 Deletions 1 1 1 1 1 3 1 1 1 1 DKD without Transplant DNA DKDonly1 DKDonly2 DKDonly3 DKDonly4 DKDonly5 DKDonly6 DKDonly7 DKDonly8 DKDonly9 DKDonly10 Single Nucleotide Variants 15 13 42 30 18 29 14 13 19 32 Insertions 2 1 3 2 2 2 2 1 2 1 Deletions 1 1 1 1 1 1 1 2 1 1

207 3.3.3.2 RNA Sequencing The FASTQ files produced from RNA sequencing were used as input for the RNA Express application. General information regarding the use of RNA express can be seen in Table 3.22. A total of 5 case-control comparisons were carried out and altogether this took a total of 58 hours and 9 minutes and 176 iCredits to produce 230.20 GB of data from a total of 121 samples. Details of the number of samples in each comparison run and the time taken for each of these is presented in Table 3.23. On average the time taken to analyse one FASTQ file from one RNA sample was just over 30 minutes and generated approximately 2.07 GB of data per RNA sample. For each comparison group the output contained BAM files, .bam.bai files, gene counts and DESeq2 results and this data was accessed via “(Public Domain) My Data > Analyses > [User defined RNA Express output name]”.

Table 3.22| Summary of RNA Express Output from BaseSpace General Info General info Input files FASTQ files Output files .bam; .bam.bai; .VCF files; Gene counts; DESeq2 results Software used RNA Express | Version: 1.1.0 which include STAR aligner and DE-Seq Files exported from .bam; .bam.bai; .VCF files; and FASTQ files if required BaseSpace .bam, binary sequence alignment file; .bam.bai, index file for corresponding .bam file; .vcf, variant call format file

Table 3.23| Summary of RNA Express Output from BaseSpace Run Specific Info Run 1 Run 2 Run 3 Run 4 Run 5 Time taken 8 hours 32 9 hours 29 9 hours 10 13 hours 36 17 hours 22 minutes 18 minutes 20 minutes 21 minutes 38 minutes 9 seconds seconds seconds seconds seconds iCredits 26.00 29.00 28.00 41.00 52.00 used Size on 50.58 GB 54.21 GB 51.41 GB 78.10 GB 99.76 GB BaseSpace Size on disk 50.5 GB 54.2 GB 51.4 GB (324 78.1 GB(494 99.7GB (647 (340 Files) (341 Files) Files) Files) Files) Number of Cases: 10 Cases: 10 Cases: 9 Cases: 19 Cases: 19 Samples Controls: 9 Controls: 9 Controls: 9 Controls: 9 Controls: 18 Run 1, DKD with ESRD vs ESRD without Diabetes; Run 2, DKD with ESRD vs “Healthy” Controls; Run 3, ESRD without Diabetes vs “Healthy” Controls; Run 4, All ESRD vs “Healthy” Controls; Run 5, All Diabetes vs No Diabetes

RNA Express also provided primary analysis information, alignment information, read counts, counts of the number of significantly differentially expressed genes between cases and controls, a sample correlation matrix heatmap and control vs. comparison plots and tables

208 for all genes that could be filtered by status, significance level or fold change. BAM files and DeSeq Counts and results were downloaded from “(Public Domain) My Data > Analyses > [User defined RNA Express run name] > Reports”. A summary of gene counts and the number of differentially expressed genes for all comparisons can be seen in Table 3.24. Significantly differentially expressed genes can be seen in the table at the bottom of the RNA express report on BaseSpace and this was saved as an Excel file.

Table 3.24| Summary of Differential Expression Counts from RNA Express Output Comparison group Annotation Gene Assessed Gene Differentially Expressed Gene Count Count count DKD with ESRD vs 23,710 12,006 0 ESRD without diabetes DKD with ESRD vs 23,710 12,252 2,907 “Healthy” Controls ESRD without 23,710 12,232 3,938 Diabetes vs “Healthy” Controls ALL ESRD vs 23,710 12,177 4,162 “Healthy” Controls ALL DIABETES vs 23,710 12,364 2 NO DIABETES DKD, diabetic kidney disease: ESRD, end-stage renal disease

3.3.3.3 Bisulfite-Treated DNA Sequencing Unlike in the mtDNA and RNA sequencing projects, for BST-DNA alignment is not performed by GCTU. Alignment can be performed with MethylSeq or DRAGEN Methylation Pipeline and both were carried out. Output files from both apps were then be analysed with MethylKit. The input for MethylSeq was the FASTQ files provided by GCTU and the alignment took approximately 23 hours and 14 minutes to generate 593.67 GB at a cost of 1,716.00 iCredits. Data generated was in BAM and .bam.bai format on BaseSpace and these were found at “(Private Domain) My Data > Analyses > MethylSeq 07/01/2019 9:46:02”. These outputs were used with MethylKit for further analysis. Downloading the MethylSeq output also exports alignment reports and BED files which are tab-delimited files with one line for each genomic region. A summary of the MethylSeq alignment can be seen in Table 3.25.

209 Table 3.25| BST-DNA FASTQ Alignment performed with MethylSeq App in BaseSpace Input files Fastq files Output files .bam; .bam.bai Time taken 23 hours 14 minutes iCredits used 1,716 iCredits Software used MethylSeq | Version: 2.0.0 Size on BaseSpace 593.67 GB Files exported from .bam ; .bai; .bed; and reports BaseSpace Size on disk 593 GB (1077 Files) .bam, binary sequence alignment file; .bam.bai, index file for corresponding .bam file; .bed, tab delimited file with genome annotation data

The input for DRAGEN Methylation Pipeline was also the FASTQ files provided by GCTU and the alignment took approximately 1 hour and 10 minutes to generate 884.63 GB of data at a cost of 187 iCredits. Data generated was in BAM, .bam.bai and Cytosine Reports in TXT.gz format on BaseSpace and these were found at “(Private Domain) My Data > Analyses > DRAGEN Methylation Pipeline 09/03/2019 12:48:35”. These outputs can be used with MethylKit for further analysis. Downloading the DRAGEN Methylation Pipeline output exports 884 GB of data including alignment reports and BED files which are tab-delimited files with one line for each genomic region. A summary of the MethylSeq alignment can be seen in Table 3.26.

Table 3.26| BST-DNA FASTQ Alignment Performed with DRAGEN Methylation Pipeline in BaseSpace Input files Fastq files Output files .bam; .bai; and Cytosine Report TXT.gz Time taken 1 hour 10 minutes 34 seconds iCredits used 187.00 iCredits Software used DRAGEN Methylation Pipeline | Version: 3.3.7 Size on BaseSpace 884.63 GB Files exported from .bam; .bai; .bed; and reports BaseSpace Size on disk 884 GB .bam, binary sequence alignment file; .bam.bai, index file for corresponding .bam file

3.3.3.3.1 Analysis of MethylSeq Output The BAM and .bam.bai files produced by MethylSeq during BST-DNA FASTQ alignment were used as input for the MethylKit application. General information regarding these analyses with MethylKit can be seen in Table 3.27. A total of 5 case-control comparisons were carried out and altogether this took approximately 96 hours and 287 iCredits to produce 87.20 MB of data from a total of 130 samples. Details of the number of samples in each comparison

210 run and the time taken for each of these is presented in Table 3.28. On average the time taken to carry out methylation analysis on one BST-DNA sample was around 45 minutes and generated approximately 0.67 MB of data per sample. For each comparison group the output contained BAM files, .bam.bai files, BED files and reports. These data were accessed via (Public Domain) My Data > Analyses > [User defined MethylKit output name].

Table 3.27| General Info from Methyl Kit Analyses with MethylSeq Output General info Input files MethylSeq Output for each sample Output files Differential Methylation in .csv and .bigwig format Software used MethylKit | Version: 2.0.1 Files exported from BaseSpace PDF Reports

Table 3.28| Summary of Methyl Kit Analyses using MethylSeq Output Run 1 Run 2 Run 3 Run 4 Run 5 Time taken 13 hours 54 14 hours 2 16 hours 38 21 hours 37 1 day 5 minutes 50 minutes 38 minutes 22 minutes 59 hours 28 seconds seconds seconds seconds minutes 37 seconds iCredits used 42.00 42.00 50.00 65.00 88.00 iCredits iCredits iCredits iCredits iCredits Size on 20.80 MB 20.97 MB 14.31 MB 21.59 MB 8.91 MB BaseSpace Size on disk 21.00 MB 21.2 MB 14.5 MB 21.9 MB 9.39 MB (125 Files) (125 Files) (126 Files) (175 Files) (225 Files) Number of Cases: 10 Cases: 10 Cases: 10 Cases: 20 Cases: 20 Samples Controls: 10 Controls: 10 Controls: 10 Controls: 10 Controls: 20 Run 1, DKD with ESRD vs ESRD without Diabetes; Run 2, DKD with ESRD vs “Healthy” Controls; Run 3, ESRD without Diabetes vs “Healthy” Controls; Run 4, All ESRD vs “Healthy” Controls; Run 5, All Diabetes vs No Diabetes

MethylKit provided a summary report which was accessed via “(Private Domain) My Data > Analyses > [MethylKit Run Name] > Report” and this gave a summary of the samples used as cases and controls along with details of the number of CpGs used for the analysis per sample. This also contained the MethylKit App Summary which included number of CpGs compared, total differentially methylated bases (DMB), hyper DMB, hypo DMB, total differentially methylated regions (DMR), hyper DMR and hypo DMR. Detailed information was accessed by clicking on “Report” and this included Coverage Stats Plot for each sample; Methylation Stats Plot for each sample; Methylation Correlation Plot; Differential Methylation Summary Table (Per Chromosome); Differential Methylation Regions (in csv file and bigwig file); Methylation Stats Summary; Methylation Stats Percentile Information; and Pdf Reports. A

211 summary of the differential methylation results from these analyses can be seen in Table 3.29.

Table 3.29| Differential Methylation Summary from MethylKit Runs using MethylSeq Output Comparison # of CpG Total Hyper Hypo Total Hyper Hypo Compared DMB DMB DMB DMR DMR DMR DKD with ESRD 1154647 0 0 0 0 0 0 vs ESRD without diabetes DKD with ESRD 1172398 0 0 0 0 0 0 vs “Healthy” Controls ESRD without 2682361 3 0 3 0 0 0 Diabetes vs “Healthy” Controls ALL ESRD vs 1152309 0 0 0 0 0 0 “Healthy” Controls ALL DIABETES 1142613 0 0 0 0 0 0 vs NO DIABETES DKD, diabetic kidney disease: ESRD, end-stage renal disease

3.3.3.3.2 Analysis of DRAGEN Methylation Pipeline Output The Cytosine Report files produced by DRAGEN Methylation Pipeline during BST-DNA FASTQ alignment were used as input for the MethylKit application. General information regarding the current MethylKit analyses can be seen in Table 3.30. A total of 5 case-control comparisons were carried out and altogether this took approximately 119 hours and 362 iCredits to produce 23.88 MB of data from a total of 130 samples. Details of the number of samples in each comparison run and the time taken for each of these is presented in Table 3.31. On average the time taken to carry out methylation analysis on one BST-DNA was around 55 minutes and generated approximately 0.2 MB of data per sample. For each comparison group the output contained BAM files, .bam.bai files, BED files and reports. These data can be accessed via (Public Domain) My Data > Analyses > [User defined MethylKit output name]. Summary reports produced are the same as those described in Section 3.3.3.3.1 and a summary of the differential methylation results from these analyses can be seen in Table 3.32.

212 Table 3.30| General Info from Methyl Kit Analyses with DRAGEN Methylation Pipeline Output General info Input files Cytosine Reports from DRAGEN Output files Differential Methylation in .csv and .bigwig format Software used MethylKit | Version: 2.0.1 Files exported from BaseSpace PDF Reports

Table 3.31| Summary of Methyl Kit Analyses using DRAGEN Methylation Pipeline Output Run 1 Run 2 Run 3 Run 4 Run 5 Time taken 17 h 46 min 16 h 44 min 18 h 49 min 27 h 11 min 38 h 18 min 14 s 27 s 24 s 8 s iCredits used 54.00 54.00 57.00 82.00 115 iCredits iCredits iCredits iCredits iCredits Size on 1.86 MB 1.86 MB 6.26 MB 10.15 MB 3.75 MB BaseSpace Size on disk 2.14 MB 2.15 MB 6.53 MB 10.50 MB 4.25 MB Number of Cases: 10 Cases: 10 Cases: 10 Cases: 20 Cases: 20 Samples Controls: 10 Controls: 10 Controls: 10 Controls: 10 Controls: 20 Run 1, DKD with ESRD vs ESRD without Diabetes; Run 2, DKD with ESRD vs “Healthy” Controls; Run 3, ESRD without Diabetes vs “Healthy” Controls; Run 4, All ESRD vs “Healthy” Controls; Run 5, All Diabetes vs No Diabetes

Table 3.32| Summary of Differential Methylation from the MethylKit Analyses Using the DRAGEN Methylation Pipeline Output Comparison # of CpG Total Hyper Hypo Total Hyper Hypo Compared DMB DMB DMB DMR DMR DMR DKD with ESRD vs ESRD 2911436 136 134 2 5 5 0 without diabetes DKD with ESRD vs “Healthy” 2916591 41 25 16 0 0 0 Controls ESRD without Diabetes vs 3395955 14 1 13 0 0 0 “Healthy” Controls ALL ESRD vs “Healthy” 2884748 2 0 2 0 0 0 Controls ALL DIABETES 0 0 0 0 0 0 0 vs NO DIABETES

213 3.3.4 Workflow Overview A summary of the total time, total storage required and total cost in iCredits for each of the successful sequencing workflows (mtDNA, RNA and BST-DNA) are presented in Table 3.33. In Figure 3.16 the workflow followed for the analyses for which the sequencing kits where optimised are shown. Each workflow is presented along with time, required storage and cost in iCredits. Details for targeted mitochondrial RNA and BST-DNA analysis are not presented as library preparation and sequencing was not optimised for this approach. The full workflow for mtDNA took approximately 29 hours for library preparation, sequencing and data analysis (computation time only) to generate 81.33 GB of data on BaseSpace and 52.52 GB of data was downloaded. This did not require any iCredits and additional Excel files were produced which can be used for comparison. The full workflow for RNA took approximately 78 hours 45 minutes for library preparation, sequencing and data analysis (computation time only) to generate 407.56 GB of data on BaseSpace and 373.31 GB of data was downloaded. This had a total cost of 176.00 for all five analyses with RNA Express. The full workflow for BST-DNA using the MethylSeq app for FASTQ alignment took approximately 142 hours 47 minutes for library preparation, sequencing and data analysis (computation time only) to generate 1252.46 GB of data on BaseSpace and 924.30 GB of data was downloaded. This had a total cost of 2003 iCredits for all five analyses with MethylKit. The full workflow for BST-DNA using the DRAGEN Methylation Pipeline app for FASTQ alignment took approximately 143 hours 51 minutes for library preparation, sequencing and data analysis (computation time only) to generate 1543.35 GB of data on BaseSpace and 1215.24 GB of data was downloaded. This had a total cost of 549 iCredits for all 5 analyses with MethylKit. Altogether all sequencing, alignment and data analysis for mtDNA, RNA and BST-DNA took approximately 250 hours 29 minutes when the MethylSeq app was used for BST-DNA alignment. This cost a total of 2,179 iCredits, required 1741.35 GB of storage space on BaseSpace and to download this data 1350.13 GB of disk space was required. With the DRAGEN Methylation Pipeline app used for FASTQ alignment the BST-DNA workflow took approximately 251 hours 32 minutes. This cost 725 iCredits and required 2032.24 GB of storage space on BaseSpace and to download this 1641.07 GB of disk space was required. Multiple data analyses can be performed in parallel for RNA and BST-DNA. For example, all five analyses with MethylKit using the output from DRAGEN Methylation Pipeline were performed simultaneously which greatly reduced computation time during data analysis. For RNA sequencing this equated to a reduction in computing time of almost 41 hours allowing the RNA sequencing workflow to be performed in around 38 hours. For BST-DNA with FASTQ alignment performed with MethylSeq this

214 equated to a reduction in computing time of almost 66 hours to allow the full BST workflow to be performed in 76 hours 33 minutes. For BST-DNA with FASTQ alignment performed with DRAGEN Methylation Pipeline this equated to a reduction in computing time of 80 hours 31 minutes to allow the full BST workflow to be performed in 63 hours 20 minutes. For sequencing, FASTQ generation and alignment the values given in Figure 3.10 are for 40 samples for each of mtDNA, RNA and BST-DNA. For data analysis values shown are for 40 samples for mtDNA, 121 samples for RNA and 130 samples and averages for one sample were calculated based on this. For BST-DNA. For RNA sequencing, FASTQ generation totals in Table 3.34 are the average of two runs as typically only one set of FASTQ files would be generated. In Figure 3.10 the totals for RNA FASTQ generation can be seen with the averages in brackets. During FASTQ generation for BST-DNA sequencing a second run was initiated but this was restarted, and this is not included in the summary statistics in Table 3.34 or Figure 3.16.

Table 3.33| Summary of Totals for Full Workflow Including Time, Data Generated and Cost Size of Files Size of Downloaded Time Taken online (GB) Files (GB) iCredits mtDNA 28 h 56 min 05 s 81.33 52.52 0.00 RNA 78 h 45 min 30 s 407.56 373.31 176.00 BST-DNA 142 h 47 min 48 MethylSeq s 1252.46 924.30 2003.00 143 h 51 min 09 BST-DNA DRAGEN s 1543.35 1215.24 549.00 Total 250 h 29 min 23 w/MethylSeq s 1741.35 1350.13 2179.00 251 h 32 min 44 Total w/DRAGEN s 2032.24 1641.07 725.00 Targeted mitochondrial RNA and BST-DNA sequencing procedures were suboptimal and therefore details for these are not presented

215 Figure 3.16| Full workflow for sequencing and analysis of mtDNA, RNA and BST-DNA Metrics presented were generated using the analyses performed with default settings as targeted mitochondrial analyses of RNA and BST-DNA was not possible due to unforeseen circumstances.

216 3.4 Discussion

The workflows established in this chapter highlight the steps involved in mtDNA, RNA and BST-DNA sequencing using Illumina NGS technology in order to investigate mitochondrial features, gene expression and methylation patterns which may be associated with DKD. The methods described used a total of 40 samples. Due to small sample size and quality issues with genetic material it was not suitable to perform additional statistical analysis on the results, however, the aim of this work was to design and evaluate a workflow to be used to investigate mitochondrial genetic and epigenetic features associated with DKD. It was hoped that the NGS facilities provided by GCTU could be used to investigate the mitochondrial DNA for DNA variations, differential gene expression and DNA methylation which might be involved with DKD development. After sequencing by QUB GCTU it was found that the library preparation methods used for RNA sequencing (KAPA RNA HyperPrep Kit with RiboErase)1060 and BST-DNA sequencing (Roche SeqCap Epi CPGiant capture)1055 were not optimised to target mitochondrial RNA and mitochondrial BST-DNA, respectively and therefore downstream analysis targeted towards mitochondrial RNA and BST-DNA was unsuccessful. Additional bioinformatic tools outside BaseSpace could potentially be used to extract only mitochondrial DNA reads from RNA and BST-DNA from FASTQ files generated by GCTU which may allow more in-depth investigation of mitochondrial methylation and gene expression. However, the development of a targeted sequencing approach used with a specialised bioinformatics pipeline for analysis of gene expression and methylation in mitochondrial DNA specifically may allow greater depth of coverage of the mitochondrial genome only and provide more robust results1000.

Single nucleotide polymorphisms (SNPs) and Indels in mtDNA were investigated using mtDNA Variant Processor and mtDNA Variant Analyser. From the results of mtDNA sequencing, one SNP was unique to samples from those with ESRD without diabetes and this was T16304C found in the hypervariable region 1 of mtDNA and has not previously been associated with ESRD. This is a defining SNP for the H5 mitochondrial haplogroup and it is likely that these two individuals belong to this haplogroup1061. The H5 haplogroup is found throughout western Eurasia and North Africa but is more abundant is central Europeans and the highest frequencies are observed in Wales (8.5%)1062. This haplogroup has previously been proposed as a risk factor for late onset Alzheimer’s disease1063. In the healthy control samples from the NICOLA collection a C and an A were found to be inserted at mt:513 in three samples in this group (513.1c and 513.2a). This site is found in the D-loop region and has not been reported as a transcription binding siteDNase1-protected site, or splice cut site334,1064. This is a known

217 site of heteroplasmy which may explain the presence of multiple Insertions at this site in the same samples. In the samples taken from individuals with DKD with no evidence of ESRD two SNPs were found to be unique to these samples and these were G513A and G4820A both of which appeared in two samples, one of which contained both these SNPs. The presence of both of these SNPs are characteristic of haplogroup A2i and A2k. Haplogroup A2 and its subclades are found almost exclusively in the Americas and the presence of this Haplogroup in individuals from Northern Ireland would not be expected however, the G513A SNP is commonly found in several subclades of haplogroup M which is thought to have originated after the out-of-Africa migration event and is found in individuals in all continents. One SNP was found to be unique to individuals with DKD regardless of ESRD status. This was C150T which was found in 7 out of 20 samples with DKD. This SNP is found in many haplogroups but has a significantly higher frequency in centenarians and has been tentatively associated with stress resistance, ROS production and longevity1065–1067. This SNP is also used to define Haplogroup U3 which is found most frequently in the eastern Mediterranean and the Caucasus. The U3a1c is a subclade of this haplogroup found in Ireland and these samples may represent individuals from this haplogroup rather than true association between C150T and DKD. These SNPs may represent genetic differences in population structure and ancestral origin, however, due to small sample sizes robust statistical testing was not possible based on these results.

The number of genes found to be differentially expressed and differentially methylated in more than 1 comparison group can be seen in Table 3.34 and Table 3.35, respectively. As these sequencing workflows were not optimised to specifically target mitochondrial DNA, this was not available in the results and therefore any differential gene expression or differential methylation in NEMGS have been mentioned.

Table 3.34| Number of Genes Differentially Expressed in More than One Comparison All genes (NEMGS) Run 5 Run 4 Run 3 Run 2 Run 2 2(1) 2715(416) 2225(333) 2907(439) Run 3 0(0) 3277(469) 3938(546) Run 4 2(1) 4162(620) Run 5 2(1) (no genes found to be differentially expressed in DKD with ESRD vs ESRD without Diabetes) Run 2, DKD with ESRD vs “Healthy” Controls; Run 3, ESRD without Diabetes vs “Healthy” Controls; Run 4, All ESRD vs “Healthy” Controls; Run 5, All Diabetes vs No Diabetes

218 Table 3.35| Number of Differentially Methylated Bases in More than One Comparison DMB Run4 Run3 Run2 Run1 Run1 0 1 16 136 Run2 2 1 41 Run3 1 14 Run4 2 (No genes found to be differentially methylated in comparison 5)

The analyses comparing samples from individuals with DKD and ESRD compared to those with ESRD without diabetes found no differential gene expression using RNA express. From the MethylKit analysis there were 134 hypermethylated bases and two hypomethylated bases throughout the genome. A total of five differentially methylated regions were identified in this analysis, three in chromosome 16 and two in chromosome 7. In chromosome 16, differentially methylated sites where found across the whole range of the chromosome, however, only four bases were found to be significantly differentially methylated. One of the differentially methylated regions in chromosome 7 was found between Chr7: 158188408 and 158188451 within the gene PTPRN2 which encodes a major islet autoantigen for T1DM1068. Out of the 136 differentially methylated bases, 11 of these were found in, or within 5 kb of, genes with a possible role in mitochondrial function and previously included in the mtGWAS in Chapter 2 and these can be seen in Table 3.36. Six of these were found to be significantly differentially methylated after logistic regression.

Table 3.36| Differentially Methylated Bases from Individuals with DKD and ESRD Compared to those with ESRD without Diabetes Chromosome DMB position strand pvalue qvalue meth.diff Gene chr11 45923465 + 8.49E-14 6.74E-12 25.4028 MAPK8IP1 chr11 112096708 + 0 0 27.05543 PTS chr14 103243018 + 0 0 25.9741 TRAF3 chr16 457900 + 0 0 33.36039 NME4 chr16 457914 + 1.52E-11 7.42E-10 26.40161 NME4 chr16 81132730 + 2.22E-16 2.83E-14 25.31791 GCSH chr17 17409174 + 0 0 34.53202 PEMT chr19 19625627 + 0 0 35.68995 NDUFA13 chr5 1085197 + 1.92E-09 5.66E-08 25.23638 SLC12A7 chr6 73951870 + 1.11E-15 1.26E-13 26.22002 KHDC1 chr7 963736 + 1.39E-09 4.26E-08 25.31126 COX19

Most notable of these are COX19, MAPK8IP1, NME4, GCSH, PEMT and NDUFA13. COX19 encodes an assembly protein involved with Complex IV assembly1069. MAPK8IP1 codes for a protein involved with regulation of apoptosis in pancreatic beta-cells and is reportedly

219 silenced in insulin secreting beta cells1070. This gene is thought to be a susceptibility gene for type 2 diabetes. NME4 encodes one of a group of nucleoside diphosphate kinases which localise to the mitochondria and have a major role in synthesising nucleoside triphosphates1071,1072. GCSH localises to the mitochondria and encodes the H protein of the glycine cleavage system and transfers the methylamine group of glycine from the P protein to the T protein during glycine degradation1071,1072. Protein isoforms encoded by PEMT localise to the endoplasmic reticulum and mitochondrial-associated membranes. One of the products of this gene is an enzyme which converts phosphatidylethanolamine to phosphatidylcholine by sequential methylation in the liver1071,1072. NDUFA13 encodes a subunit of Complex I of NADH dehydrogenase (respiratory chain Complex I) and is required for complex I assembly and electron transfer. Single nucleotide polymorphisms in complex I genes were previously identified in Chapter 2.

The analyses of samples from individuals with DKD and ESRD compared to the Healthy” Controls from NICOLA found 2,907 differentially expressed genes between cases and controls. Out of these genes 187 were not found in the results of any other comparison. Differential expression was observed in 439 genes involved with mitochondrial function, 23 of which were not found in any other comparison. The analysis with MethylKit identified 25 hypermethylated bases and 16 hypomethylated bases. Out of these there were 11 significantly hypermethylated bases across six chromosomes. In chromosome 20 there were five significantly hypermethylated bases located between Chr20:60553691-60553730 within the gene MIR646HG which is an RNA coding gene and is associated with the miRNA. There were also 11 significantly hypomethylated bases across nine chromosomes. Within chromosome 7 there were three significantly hypomethylated base and one significantly hypermethylated base. Out of the hypermethylated bases two were found in DNAJB6 which encodes a member of the DNAJ protein family which are molecular chaperones involved in various cellular events, including protein folding and oligomeric protein complex assembly. One of the hypomethylated bases, chr2:11737065 (p = 1.01E-14, q = 4.17E-13, meth.diff = - 27.6376) was found in GREB1 which may have some localisation to mitochondria and was previously included in the targeted mitochondrial GWAS in Chapter 2.

A third comparison investigated features associated with ESRD in individuals with no evidence of T1DM compared with “Healthy” Controls from NICOLA. Using RNA Express a total of 3,938 genes were found to be differentially expressed between cases and controls in this analysis. Out of these genes 656 were not found in the results of any other comparison. Differential expression was observed in 546 genes involved with mitochondrial function, 77

220 of which were not found in any other comparison. From the results of the methylation analysis performed with these samples there were 14 differentially methylated bases including one hypermethylated base and 13 hypomethylated bases spread across 10 chromosomes. Significantly hypomethylated bases on chromosome 7 were found between Chr7:158188381-158188416 which was located within PTPNR2. Differential methylation was previously observed in this gene in the analysis comparing samples from those with DKD and ESRD to those with ESRD without Diabetes in which hypermethylation was observed in sample from patients with DKD and ESRD. The opposing differential methylation patterns would be expected as the controls used in the first MethylKit analysis are used as cases in the current analysis. One of the hypomethylated bases, Chr16:85682810 ((p = 8.24E-13, q = 3.66E-11, meth.diff = -26.2295) was located in GSE1 which is thought to have localisation in the mitochondria and may function as an oncogene associated with breast cancer. This was previously included in the targeted mitochondrial GWAS in Chapter 2.

The next comparison group was investigating All ESRD samples (DKD with ESRD and ESRD without Diabetes) compared with “Healthy” Controls from NICOLA. The RNA Express analysis identified 4,162 differentially expressed genes. Out of these genes, 390 were not found in the results of any other comparison. Differential expression was observed in 620 genes involved with mitochondrial function, 68 of which were not found in any other comparison. From the MethylSeq analysis, only two differentially expressed bases were found both of which were hypomethylated. Only one of these was found to be significantly differentially methylated after logistic regression and this was found on chromosome 4 at chr4:73970898 which was located between CXCL1 and CXCL5. None of the differentially methylated bases identified in this analysis were found in or near NEMGs.

The final comparison in the current study was investigating all samples from individuals with T1DM (DKD with ESRD and DKD no ESRD) compared to all samples with no evidence of diabetes (ESRD without Diabetes and “Healthy” Controls). Only two genes, OAS1 and HLA- DRB6, were found to be differentially expressed in the RNA Express analysis. The gene OAS1 is induced by interferons and encodes 2',5'-oligoadenylates (2-5As). There is some evidence to suggest association with polymorphisms in this gene and T1DM1073–1075. Isoforms of this gene have previously been reported to localise to mitochondria. HLA-DRB6 is a related to immune response and G-protein signalling. No differential methylation was observed in this analysis.

Further investigation of the significantly differentially expressed genes identified from RNA Express analyses found 188 unique to the analyses comparing samples from those with DKD

221 and ESRD compared to the Healthy” Controls from NICOLA. This analysis was investigating genetic features which are present in individuals with DKD with ESRD that do not appear in individuals with no diabetes or renal disease and these 188 genes may represent a gene expression profile associated with DKD and this will be discussed further in Chapter 4. There were 657 significantly differentially expressed genes unique to the analysis comparing samples from those with ESRD without T1DM with “Healthy” Controls from NICOLA. Individuals in the ESRD without diabetes group had ESRD resulting from various forms of CKD (excluding DKD) and the genetic profile of these individuals may vary to a larger degree than that of the DKD with ESRD group which may be why there are many more differentially expressed genes unique to those with ESRD without diabetes. When all ESRD samples were compared with healthy controls, 391 significantly differentially expressed genes were found which did not appear in the results of any other RNA Express analyses in the current project. This may represent a genetic profile of ESRD in general as this was comprised of individuals with ESRD both with and without diabetes.

The MethylKit results discussed were from the analyses performed using the DRAGEN Methylation Pipeline output. By comparing the MethylKit analysis with the output from DRAGEN Methylation Pipeline to that performed with the MethylSeq output many more significantly differentiated genes were identified using the DRAGEN Methylation Pipeline output. This may indicate an issue with the MethylSeq app and closer inspection revealed that an updated version of the MethylKit was released after the analyses of the MethylSeq output were completed and this may explain the discrepancy between these results. The DRAGEN Methylation Pipeline is a more recently developed app than MethylKit and this may address some of the limitations with the latter.

The main aim of these NGS analyses was to establish a suitable workflow which could be used to investigate mitochondrial genetic and epigenetic features of DKD at a larger scale. Due to previously mentioned issues, sequencing for RNA and BST-DNA was not optimised to specifically target these features in mtDNA and therefore projections in Table 3.38 are based on the successful analyses using HG19 as a reference genome. The applications used for data analysis for mtDNA1048 and BST-DNA1053 sequencing data can support up to 96 samples in a single run and the RNA express1051 app is able to analyse up to 192 case-control samples. For a future study, up to 96 case-control samples could be analysed using this workflow to allow comprehensive investigation of mtDNA, RNA and DNA methylation in samples obtained from suitable individuals. In order to investigate all of the phenotypes previously mentioned at the maximum capacity of the mtDNA and BST-DNA apps 48 case samples and 48 control samples

222 would be required for each study group (DKD with ESRD, ESRD without Diabetes, “Healthy” Controls, DKD without Transplant). Projections for time, storage and cost for sequencing of 48 samples for each phenotype are illustrated in Table 3.37.

Table 3.37| Projections for Larger Study Sequencing FASTQ Alignment Data Generation Analysis* mtDNA time 32:30:48 00:51:46 NA 02:41:29 Size on BaseSpace 32.4 38.04 NA 54.312 Size of Downloaded files 86.28 35.04 NA 54.24 iCredits 0 0 0 0 RNA time 23:45:55 00:55:46 NA 46:09:32 Size on BaseSpace 38.4 49.8 NA 265.0393 Size of Downloaded files 135.6 47.16 NA 264.9124 iCredits 0 0 0 139.6364 BST-DNA with DRAGEN Sequencing FASTQ Alignment Data workflow Generation Analysis* time 26:00:41 02:36:58 01:24:41 87:44:39 Size on BaseSpace 306 484.44 1061.556 17.28 MB Size of Downloaded files 255.6 397.2 1060.8 19.2 MB iCredits 0 0 224.4 267.3231 Sequencing FASTQ generation and alignment estimates are for a single study group made up of 48 samples *estimates for data analysis are for one comparison run with 48 cases and 48 controls but multiple runs can be carried out simultaneously

It is important to note that no diabetic control samples were available as these samples were from a historical collection with ethical approval for DNA analysis. If this additional group was included there would be a total of 240 samples for sequencing and analysis. Although the RNA Express app can support up to 192 samples, the NextSeq 5001027 is only capable of sequencing up to 40 RNA samples for gene expression profiling in a single run and therefore six NextSeq runs would be required to generate data from all 240 samples. Due to this bottleneck and the large amount of storage space and analysis time required to analyse these data it may be more feasible to limit sample group size to a maximum of 30 samples for each study group which would reduce the projections in Table 3.35 by 37.5%. The times included in these estimates are for computational time only and more time needs to be allocated for sample preparation, sample transport to GCTU, hands on processing of sample, data transfer, manual inspection of data and any additional analysis. An important step when establishing how many samples should be included typically involves calculating study power however

223 unlike traditional experiments or microarrays, calculating power for NGS studies requires simultaneous consideration of sample size, sequencing depth and count-data which comes with additional statistical challenges. Various methods are currently available for power calculation for high throughput experiments1076,1077. In future larger scale analysis using NGS technology to investigate these study groups a suitable method for power calculation to ensure accurate results which are free from bias will need to be determined.

In the current study it was reported by GCTU that concentrations of the RNA samples were inconsistent, and the RNA was of low quality which may have had an impact on further analyses. It was suggested that the quality of these samples was what would be expected when using formalin-fixed paraffin-embedded (FFPE) sample1040. Although the genetic material sent to GCTU was not obtained from FFPE samples it was also reported by GCTU that samples which have been frozen for a long time may result in similar problems. Of note, when preparing RNA samples some clumping of genetic material was observed and this may have had an impact on the nucleic acid concentrations within samples sent to GCTU. These issues with RNA quantitation highlight an area of concern which will need to be addressed prior to future sequencing work using this workflow. Due to these issues with sample size and quality downstream statistical analyses have not been performed; however, a future study could make use of this to identify more robust associations between DKD and mtDNA variants, RNA expression patterns and DNA methylation patterns.

The work presented in this chapter has established an Illumina sequencing pipeline for mtDNA RNA and BST-DNA and pilot results seem to indicate that this can identify genetic and epigenetic features associated with DKD. Estimates for the resources required for a more in- depth analysis can be made based on the projections presented. A major limitation of this workflow is that the library preparation methods were not optimised to target mitochondrial RNA and mitochondrial DNA methylation and therefore these could not be investigated as originally intended. The services offered by GCTU are very useful for a wide range of sequencing purposes. However, due to this ‘black box’ approach in which samples are given to GCTU and data is returned with no information regarding protocols followed it is often difficult to discern the exact procedures followed during QC and sequencing. There are several concerns which must be addressed in future, including the issues with RNA quantitation and QC and close liaison with GCTU is necessary to identify the source of these issues prior to a large-scale sequencing project. For these reasons, the targeted sequencing mitochondrial RNA and methylated regions of mitochondrial DNA will require a customised approach to library preparation and sequencing. While this may be possible through GCTU it

224 may be more desirable to use NephRes sequencing facilities at a future time provided resources are available. The use of methylation sequencing and RNA sequencing specifically focused on mtDNA may identify differential gene expression and differential methylation at sites close to or in linkage disequilibrium with SNPs or indels which may provide useful insight into mitochondrial genes which influence DKD development and may have a use as biomarkers or potential targets for gene therapy, however significant work will need to be done to overcome the issues previously discussed. Genetic material is a precious resource and for a project on such a scale it would be essential to ensure high quality samples in order to ensure data generated is free from bias and identify true genetic and epigenetic associations between mitochondrial genes and DKD.

225 Chapter 4| Conclusions and Future Directions

4.1 Introduction

Recent projections estimate that incidence rate of end stage renal disease (ESRD) will increase by up to 18% over the next decade in the United States due to populations changes in age, obesity and diabetes prevalence1078. Similar projections for England estimate that by 2036 chronic kidney disease will affect 8.3% of the population and a substantial proportion of this will be due to diabetes1079. Along with the increasing burden of kidney disease the cost of diabetes treatment is also expected to increase which may result in insulin rationing amongst those who cannot afford treatment which can lead to diabetic ketoacidosis and death1080. Unlike the US, public health care is available in the United Kingdom (UK) and therefore individual patients are unlikely to be impacted by these rising costs of treatment at present. Nevertheless, the increasing treatment costs, rising prevalence of CKD and an aging population will also place considerable strain on the Health and Social Care Services in Northern Ireland (NI) and the National Health Service in the rest of the UK, both financially and in terms of availability of staff and resources.

The work presented in this thesis utilised mitochondrial-targeted Genome-Wide Association Studies and Next-Generation Sequencing to investigate mitochondrial genetic and epigenetic features of diabetic kidney disease in mitochondrial DNA (mtDNA) and in nuclear genes required for mitochondrial function (NEMGs) using several case-control studies. There is considerable evidence to suggest that interaction between genetic, epigenetic and environmental risk factors contribute to DKD development and also interact with environmental factors277–279,290,293,294,507–511,529,530,532,533. Much of this evidence has been derived from genetic studies and there are more than 150 genes which have been associated with DKD1081. With regards to the role of mitochondria in kidney disease development recent evidence investigating CKD (including DKD) has found that reduced expression of TFAM may contribute to disease development by compromising mitochondrial integrity. This can result in the release of mtDNA into the cytoplasm which in turn activates the cytosolic cGAS- stimulator of interferon genes (STING) DNA sensing pathway enhancing cytokine expression and leading to inflammation and interstitial fibrosis (summarised in Figure 4.1). The decreased levels of TFAM in individuals with CKD correlated with the degree of renal fibrosis and such mechanisms may help to explain the mitochondrial dysfunction in DKD.

226 .

Figure 4.1| Reduced TFAM Expression May Contribute to Inflammation and Fibrosis in CKD and DKD. Loss of TFAM in kidney tubule cells of mice with tubule-specific deletion of TFAM lead to severe mitochondrial loss and energetic deficit by 6 weeks of age. Loss of kidney function resulting in fibrosis, immune cell infiltration, and progressive azotemia leading to death were seen by 12 weeks of age1082.

Epigenetic studies in DKD may provide insights into how environmental factors can influence gene expression, without directly altering the nucleotide sequence, and how epigenetic phenomena may be involved in DKD. The focus of this thesis was on investigating the mitochondrial genome in DKD. Like nuclear DNA, mtDNA is also subject to epigenetic modification which adds to the complexity of investigating DKD. Due to these complex interactions the use of combined genetic, epigenetic and phenotypic studies may reveal previously unknown pathogenic pathways and allow identification of new biomarkers for early diagnosis and prediction of disease progression1081. This may in turn allow the development of specialised prevention programmes and novel treatment targets for DKD which may improve outcome for those with DKD and reduce the burden on the health service. Epigenetics studies of DKD have examined potentially heritable changes in gene expression that occur without variation in the original DNA nucleotide sequence and the impact this has on disease development and pathogenesis39,277,1081,1083–1087. Combined

227 genetic, epigenetic and phenotypic studies together may generate information to understand new pathogenic pathways and to search for new biomarkers for early diagnosis and novel treatment targets as part of prevention programs in DKD1081,1084.

4.2 Summary of Findings

Over the last 15 years genome-wide association studies (GWAS) have been used to study a vast range of diseases and genetic variation. This approach has previously been used to identify single nucleotide polymorphisms (SNPs) associated with kidney phenotypes including diabetic kidney disease (DKD). In Chapter 2 association studies using the GWAS methodology but focusing only on NEMGs and mtDNA were employed to investigate traits related to DKD using PLINK with data obtained from the All Ireland, Warren 3, Genetics Of Kidneys in Diabetes UK (UK-ROI) case-control collection826 with follow-up in up to 19,406 individuals from up to 17 independent collections from the Diabetes Nephropathy Collaborative Research Initiative (DNCRI). Prior to the targeted mitochondrial GWAS (mtGWAS) NEMGs were identified from a search of recent literature as well as several online databases. Association analyses were performed using PLINK with DNA from the UK-ROI collection exploring SNPs in mitochondrial DNA (mtDNA, n=225 total SNPs) and 2,526 NEMGs (n=2,880,249 total SNPs). Significant associations were found between traits associated with DKD and SNPs in mtDNA and NEMG SNPs. The strongest of these associations was between rs2853496 (MitoG11915A) and the eGFR phenotype which remained significant after Bonferroni correction (P < 5.88 x 10-4) following adjustment for covariates. This was also found to be associated with the DKD phenotype (individuals over 18 years of age with T1DM for at least 10 years, diagnosed with persistent proteinuria and hypertension) in the UK-ROI data. The SNP rs201336470 (exm2216239) was strongly associated (P<0.01) with eGFR before adjusting for covariates and remained weakly associated after adjustment for covariates. Another three SNPs were only weakly associated (P<0.05) with eGFR in mtDNA in the UK-ROI data before and after adjusting for covariates but not after adjusting for genomic inflation. These were rs28359178, rs41528348 and a C>T variant located at mt:16070. Six SNPs representing three independent signals in three NEMGs were found to be suggestive of significant association in UK-ROI mtGWAS and crossed the threshold for significance in the look-up of the DNCRI data in both minimum and maximum adjusted models. These SNPs were rs56122389 in KRT4 which was associated with CKD and albuminuria/ESRD; rs17727871 in CATSPER2 associated with CKD; and rs72851722 ABCA9 associated with CKD and this was the index SNP in a linkage disequilibrium block which included rs568286059, rs72844573, and rs72851720.

228 The work presented in Chapter 3 aimed to establish a next-generation sequencing (NGS) workflow to target mtDNA for analysis of SNPs, gene expression and DNA methylation. This would allow further study of the SNPs identified in Chapter 2 as well as a comprehensive investigation of genetic and epigenetic features in mtDNA in samples taken from healthy individuals, those with DKD that has progressed to ESRD, those with DKD without ESRD and those with ESRD without diabetes. The workflow established for mtDNA sequencing can be used to investigate SNPs and indels in mtDNA, however the RNA and bisulfite-treated DNA sequencing was not optimized to specifically target mtDNA which may have impacted downstream analysis due to poor coverage of mtDNA1088. No differential gene expression or differential methylation was observed in mtDNA. Although the mtDNA sequencing workflow can be used successfully further work is required to establish more suitable methodology to specifically target mitochondrial gene expression and methylation.

The gene CATSPER2 was found to be differentially expressed in those with ESRD without Diabetes compared to “Healthy” controls and in All ESRD samples (with and without diabetes) compared to “Healthy” controls. One of these SNPs, rs17727871, was significantly associated with CKD in the minimum and maximum adjusted models within four studies in the DNCRI lookup. This may represent an association with DKD or ESRD in general, however further work is necessary to verify these associations and the impact of differential expression of this gene. No differential methylation was observed in the regions surrounding SNPs identified in Chapter 2. Two SNPs in mtDNA which were weakly associated with eGFR in the GWAS were also found in mtDNA sequencing results in multiple samples. These were rs41528348 and rs28359178 which were found in two samples in the DKD without ESRD, two samples in DKD with ESRD and two samples in “Healthy” control groups. One of these, rs28359178, was also found in one sample in each of the ESRD only, DKD with ESRD and DKD without ESRD group. Further sequencing with larger sample sizes, along with a refined workflow for investigating gene expression and methylation specifically in mtDNA, is necessary to establish if these SNP associations are statistically significant.

4.3 Strengths and Limitations of Research

Data for the mtGWAS project was previously collected as part of the All Ireland, Warren 3, Genetics Of Kidneys in Diabetes UK (UK-ROI) case-control collection. Putatively significant SNPs were included in a look-up of DNCRI GWAS data which has assembled up to 19,406 patients with type 1 diabetes and of European origin from 17 cohorts. The UK-ROI samples were also included in this with updated genotyping using the HumanCore BeadChip (Illumina,

229 San Diego, CA, USA) and updated imputation to approximately 49 million SNPs. Genotyping of the UK-ROI data and DNCRI collections were performed as described by Salem and colleagues294. One strength of this research is that both the UK-ROI and DNCRI data include similar, clearly defined case-control phenotypes and the DNCRI collection has recently been used to identify 16 genome-wide significant risk loci associated with DKD many of which were previously correlated with kidney function and likely represent true DKD associations. The large sample size included in the DNCRI data provides greater power to detect variants associated with DKD in those with type 1 diabetes. The use of uniform genotyping and QC procedures along with standardised imputation across all studies reduced the degree of between-study heterogeneity and increased the likelihood that variants detected represented true associations with DKD rather than underlying differences in population structure. In the current project the focus was on mtDNA and NEMGs within UK-ROI and DNCRI data and therefore a subset of the total SNPs in these collections was investigated. One notable limitation is that mtDNA was not genotyped for the full DNCRI dataset and therefore mtDNA SNPs were not available in the look-up of the meta-analysis results. Variants identified in this research were found in three or four collections which would limit the power of these associations and indicate that these SNPs will require validation to confirm association. Only one variant was found to be associated with CKD and albuminuria/ESRD whereas the remaining SNPs in NEMGs were associated with CKD only in the full DNCRI dataset and these may represent associations with CKD in general rather than DKD specifically. Replication in independent samples and confirmation of function will be necessary to validate the significance of all SNPs identified in these analyses. This highlights a common criticism of GWAS in that there is typically poor reproducibility of results and inconsistencies between GWAS investigating the similar phenotypes40,1089 These inconsistencies may be a result of underpowered studies or bias due to previously unidentified differences in population structure which highlight the need for reliable and standardised methods to correct for multiple testing and detection of false positives and false negative results in GWAS794,1081,1090,1091. The results from mtDNA GWAS in the UK-ROI data found one SNP to remain significant after Bonferroni correction which is a conservative method for adjusting for multiple testing and avoiding Type I errors1092. The use of Bonferroni correction highlights the strength of the association between rs2853496 (MitoG11915A) and the eGFR phenotype, although this method of multiple testing is prone to Type II errors. The use of a selection of genes for GWAS allowed only genes which are known or thought to be involved with mitochondrial function to be investigated and SNPs 5kb upstream and downstream of NEMGs were also included in an effort to identify any SNPs which may be in

230 LD with the NEMGs investigated. However, the 5kb cut off point may have excluded long- range regulatory elements which may affect genes up to 2 million base pairs away1093,1094. By limiting the SNPs in the GWAS to only those in NEMGs some important variants may be missed as new genes and mechanisms involved with mitochondrial function are continuing to be revealed. Therefore the use of a subset of NEMGs may introduce bias and exclude genes with a currently unknown role in mitochondrial function1095. The findings from the GWASs performed in this thesis will require verification in independent populations as well as functional follow-up to investigate their role in DKD and to identify if the associations identified are truly due to DKD rather than underlying phenotypic differences or unknown differences in populations structure.

A key strength of the NGS work performed in this thesis is that the mtDNA sequencing and analysis workflow can successfully be used to detect SNPs and indels in mtDNA obtained from individuals with DKD. A major limitation of this was the small sample size which prohibited robust statistical associations in this pilot study and future replication of this workflow with a suitable sample size representing DKD may be used to confirm or refute the associations identified in Chapter 2. The workflow followed in Chapter 3 had some notable limitations in that the library preparation and sequencing methods used for RNA sequencing and bisulfite- treated DNA sequencing were not optimised to specifically target mtDNA which may have impacted depth of coverage of this small element of genetic material1088. It should also be noted that most differential methylation observed in the NGS analysis was at individual bases. Early epigenetic studies focused on individual base sites; however, methylation levels have been shown to be strongly correlated across the genome and functionally relevant regions such as CpG islands or CpG island shores have previously been associated with diseases1096–1099. Although changes at individual bases may impact function, the functional impact of differentially methylated regions (DMRs) are more apparent and more reliably replicated1100,1101. These are genomic regions (rather than single bases) with different methylation patterns in multiple samples1102. In statistical terms by testing methylated regions rather than individual nucleotide bases the number of tests conducted is reduced which in turn increases discovery power. The use of NGS technology has permitted significant breakthroughs for identifying and treating disease related genes, however there are also several challenges which come with this. The large number of variants identified through such studies may lead to over interpretation or misinterpretation of the impact of individual variants. As more sequencing data continues to be generated it will be necessary to address these issues through careful collection and annotation of NGS data from a various disease in

231 populations representing all ethnic backgrounds in order to gain better insight into the impact of genetic variants on disease development and susceptibility in individual patients1001.

Another potential limitation which may have impacted the results presented in this thesis (and in genetic studies in general) is in the criteria used to define disease phenotypes. Control samples in the mtGWAS were taken from individuals with type 1 diabetes of at least 15 years duration with no evidence of kidney disease, however this study did not involve long term follow up of patients and therefore some of these “controls” may have developed DKD at a later time or some of those with DKD without ESRD may have gone on to develop ESRD294. This issue may be further complicated by potential subtypes of DKD which may have different pathogenicities1103. For the NGS project RNA was not available from individuals with DKD with no evidence of ESRD from the All-Ireland Collection and therefore these were matched to samples from the NICOLA collection. Due to the inclusion criteria of the NICOLA study it was not possible to age match all samples and therefore some of those with DKD with no evidence of ESRD may have developed ESRD at a later time. The use of RNA and DNA from different individuals highlight a potential source of error as these samples are likely to have different genetic profiles. Another issue which arose was the low quality of RNA in the sequencing results which may have occurred because the samples used had been frozen for a long time and this issue will need to be addressed in future studies.

The source and quality of genetic material for genotyping in GWAS and for use in NGS sequencing may also impact the findings. For the NGS project genetic material was collected from peripheral blood samples, which is common in genetic and epigenetic studies of DKD as these are more convenient and less invasive than kidney biopsy. There is evidence to suggest that SNPs1104–1106, DNA methylation1107 and gene expression1108 exhibit tissue specificity and the site that a sample is taken from may affect the observations as samples taken from peripheral blood may introduce bias due to mixed cell types1081. The research in this project focused on mtDNA which may add a further layer of complexity to this problem due to mitochondrial heteroplasmy which has also been shown to show tissue specificity1109–1111.

4.4 Future Work

There are several challenges which must be overcome in future studies of mtDNA using NGS approaches in order to better understand the cause and consequences of mitochondrial genetic and epigenetic changes in DKD. A possible approach could be to design a custom NGS panel to include mtDNA and NEMGs, however, due to the expanding list of

232 mitochondrial genes this will inevitably overlook some potentially important genes. Whole exome sequencing can be used to detect mtDNA variants and indels1112–1114 however this is more challenging when used to detect mitochondrial gene expression and methylation. In order to explore methylation in mtDNA a specialised bisulfite treatment protocol could be used such as that used by Sirard which involves extraction of mtDNA and linearization with a restriction enzyme prior to bisulfite treatment1115. Future studies investigating mitochondrial epigenetics should also consider the role of mitochondrial non-coding RNAs in relation to DKD1116–1118. It has been reported that data on mtDNA variants and heteroplasmy can be extracted from RNA sequencing data based on the presence of mitochondrial in nuclear DNA1114. However, for comprehensive investigation of the mitochondrial transcriptome direct sequencing of mtDNA such as that described by Rackham and Filipovska could be used in a future sequencing study1119. BaseSpace offers a user-friendly interface for bioinformatic analysis of NGS data, however, for the targeted analysis of mitochondrial RNA sequencing data and mitochondrial BST-DNA data it would likely be more efficient to develop a robust and flexible bioinformatic pipeline incorporating additional tools to specifically target mtDNA in NGS data. If used in combination with DNA, RNA and BST DNA sequencing protocols designed for this purpose this could allow comprehensive investigation of mtDNA not only for associations with DKD but also with many other diseases with mitochondrial involvement. It may also be possible to utilise additional bioinformatics tools to extract mitochondrial specific data regarding gene expression and methylation from the NGS sequencing data from GCTU.

The samples used in this thesis were from three separate studies. A longitudinal study of DKD with samples taken from multiple sites from each of the participants at various time points from diabetes diagnosis may provide valuable insight into mitochondrial genetic and epigenetic features of DKD and how these change over the course of the disease. Substantial financial resources would be needed to support a longitudinal cohort study coupled with streamlined sequencing workflows and sophisticated bioinformatics. The methods and frequency of sample collection would need to be carefully considered as repeated kidney biopsy from the time of type 1 diabetes diagnosis would raise ethical concerns. There would also be concerns about how to handle incidental findings in a longitudinal study using NGS as the presence of known disease risk genes may exclude the participant but could seriously adversely impact on their lives.

The use of renal biopsy biobanks may allow in depth study of a range of renal phenotypes including DKD. One such biobank has been established by the Karolinska Institute which has

233 more than 750 renal biopsies1081. Biobanks in many countries could be used to build a large collection of samples from all ethnic groups and mitochondrial haplotypes which would allow comprehensive omics profiling including all kidney diseases. This would allow identification of useful biomarkers for risk prediction and treatment response as well as permitting investigation in the molecular pathology and mechanisms involved with kidney disease in general. Profiling of the mitochondrial genome in such samples would also be possible and allow greater insight into the role of mitochondria in DKD.

In this thesis the term “Next Generation Sequencing” has been used to refer to second generation sequencing technology which allows massively paralleled sequencing which dramatically reduced cost of sequencing compared to previous methods. Third generation sequencing is now a reality. This uses single molecule level sequencing to generate very long reads which reduces the burden of genome assembly and removes the need to make some of the approximations necessary for variant calling in second generation sequencing1120. This can also be used to detect epigenetic modifications without the need for bisulfite treatment and can sequence whole transcripts bypassing the assembly stage. As with any new sequencing technology there are a number of hurdles which must be overcome before the widespread use of third generation sequencing and currently available third generation sequencing platforms each have their own strengths and weaknesses which can be seen in Table 4.11120.

Table 4.1| Pros and Cons of Third Generation Sequencing Platforms1120 Platform Pros Cons PacBio Real long reads Expensive sequencer (Sequel list price: Extremely high accuracy with US$350 000) and relatively high cost CCS (>99.999% at 20 passes) per Gb Direct detection of epigenetic Large amounts of starting material modifications; also here, high required for library preparation level of accuracy with CCS High error rate at single pass ( 15%) No problem with repeats, Only one sequencer available (Sequel) low/high %GC with limited throughput per SMRT∼ Cell ( 10 Gb) Maximum read length limited by polymerase∼ processivity ( 80 kb)

Oxford Real (ultra-) long reads; in High overall error rate (1D:∼ 15%, 1D2: Nanopore principle no upper limit to 3%*) and systematic errors with Technology read length ( 1-Mb reads homopolymers ∼ have been obtained) Large∼ amount of starting material Cost-effective∼ sequencers required for library preparation (MinION, GridION) Frequent changes of software versions, Direct detection of epigenetic flow cells, and kits

234 Platform Pros Cons modifications Extremely fast library preparation (‘rapid sequencing kit’) Portability (MinION and SmidgION) Scalability; from extremely small and portable (SmidgION, MinION) to extremely powerful and high throughput (PromethION) Direct sequencing of RNA and detection of RNA modifications Illumina/10X Based on classical Illumina No real long reads Genomics sequencing; low error rates, PCR amplification required for library Synthetic :Long high throughput preparation Reads Relatively low cost per Gb No direct detection of epigenetic Illumina SLR: no specialized modifications equipment required Illumina SLR: limited partitioning 10X Genomics: small capacity (384 wells and indexes) amounts of material required Illumina SLR: 10 kb maximum SLR for library preparation length ( 1 ng) *Although 1D2 reduces error rates, it also reduces throughput and it is not very efficient. ∼

The use of high-throughput sequencing in a clinical setting has the potential to transform healthcare by enabling faster diagnosis, more accurate risk prediction and targeted treatment. It is essential to establish standardised procedures for collection and interpretation of clinical and genomic data along with sequencing facilities and robust bioinformatics infrastructure to maximise the benefit of genomic medicine1121. Large scale data sets are necessary to enable this and the UK Government plan to sequence 5 million genomes over the next five years which will considerably enhance the utility of genomic medicine for diagnosing rare disease712. As this field continues to develop multi-omic sequencing could become routine clinical practice for all diseases permit stratified health care, reliable diagnosis, and early detection of disease.

4.5 Concluding Remarks

The work in this thesis has utilised GWAS and NGS to investigate mitochondrial genetic variants which may be used to identify predisposition to DKD in individuals with type 1 diabetes. Every effort was made to ensure accurate classification for all phenotypes used and

235 that any significant results were replicated in additional samples. The use of NGS to investigate mitochondrial gene expression and methylation identified several challenges which must be overcome in order to better understand the genetics and epigenetics of DKD. Robust bioinformatics pipelines will also be required for the targeted analysis of genetic and epigenetic features of DKD. It is also important to remember that association is not causation and any variants identified through sequencing studies will require functional validation. While substantial progress has been made in DKD research there is still a long way to go and genetics and epigenetics represent only a small piece of the puzzle. The widespread use of precision medicine will require seamless integration of clinical data with full “omics” profiles for each patient with large data repositories which can identify and generate treatment options at an individual patient level. The use of gene-editing technology in combination with this could eventually permit gene therapy targeted to specific mutations to treat DKD and any disease resulting from genetic defects.

236 Bibliography

1. NCD Risk Factor Collaboration. Worldwide trends in diabetes since 1980: a pooled analysis of 751 population- based studies with 4.4 million participants. Lancet 387, 1513–1530 (2016). 2. Cho, N. H. et al. IDF Diabetes Atlas: Global estimates of diabetes prevalence for 2017 and projections for 2045. Diabetes Res. Clin. Pract. 138, 271–281 (2018). 3. World Health Organization. Global Report on Diabetes. (2016). 4. International Diabetes Federation. Global Diabetes Plan 2011-2021. (2011). 5. World Health Organization. Global action plan for the prevention and control of noncommunicable diseases 2013- 2020. (2013). 6. Di Iorgi, N. et al. Diabetes insipidus - Diagnosis and management. Horm. Res. Paediatr. 77, 69–84 (2012). 7. Guariguata, L. et al. Global estimates of diabetes prevalence for 2013 and projections for 2035. Diabetes Res. Clin. Pract. 103, 137–149 (2014). 8. Zaccardi, F., Webb, D. R., Yates, T. & Davies, M. J. Pathophysiology of type 1 and type 2 diabetes mellitus: a 90- year perspective. Postgrad. Med. J. 92, 63–69 (2016). 9. Holman, N., Young, B. & Gadsby, R. Current prevalence of Type 1 and Type 2 diabetes in adults and children in the UK. Diabet. Med. 32, 1119–1120 (2015). 10. American Diabetes Association. Diagnosis and classification of diabetes mellitus. Diabetes Care 36, S67–S74 (2013). 11. Atkinson, M. A., Eisenbarth, G. S. & Michels, A. W. Type 1 diabetes. Lancet 383, 69–82 (2014). 12. Bluestone, J. A., Herold, K. & Eisenbarth, G. Genetics, pathogenesis and clinical interventions in type 1 diabetes. Nature 464, 1293–1300 (2010). 13. Stout, R. W. Glucose tolerance and ageing. J. R. Soc. Med. 87, 608–609 (1994). 14. Laakso, M. & Pyorala, K. Age of onset and type of diabetes. Diabetes Care 8, 114–117 (1985). 15. Naik, R. G., Brooks-Worrell, B. M. & Palmer, J. P. Latent autoimmune diabetes in adults. J. Clin. Endocrinol. Metab. 94, 4635–4644 (2009). 16. Leslie, R. D. Predicting adult-onset autoimmune diabetes clarity from complexity. Diabetes 59, 330–331 (2010). 17. Steenkamp, D. W., Alexanian, S. M. & Sternthal, E. Approach to the patient with atypical diabetes. CMAJ 186, 678– 684 (2014). 18. Lašaite, L., Ostrauskas, R., Žalinkevičius, R., Jurgevičiene, N. & Radzevičiene, L. Diabetes distress in adult type 1 diabetes mellitus men and women with disease onset in childhood and in adulthood. J. Diabetes Complications 30, 133–137 (2016). 19. Kerr, M. Inpatient Care for People with Diabetes: The Economic Case for Change. NHS Diabetes, Insight Heal. Econ. 1–52 (2011). 20. Hex, N., Bartlett, C., Wright, D., Taylor, M. & Varley, D. Estimating the current and future costs of Type 1 and Type 2 diabetes in the UK, including direct health costs and indirect societal and productivity costs. Diabet. Med. 29, 855–862 (2012). 21. Pharmaceutical Services Negotiating Committee. Essential facts, stats and quotes relating to diabetes. psnc.org (2019). Available at: https://psnc.org.uk/services-commissioning/essential-facts-stats-and-quotes-relating-to- diabetes/. (Accessed: 26th July 2019) 22. International Diabetes Federation. IDF Diabetes Atlas, 8th edn. (2017). 23. Yau, J. W. Y. et al. Global prevalence and major risk factors of diabetic retinopathy. Diabetes Care 35, 556–564 (2012). 24. Danaei, G., Lawes, C. M. M., Vander Hoorn, S., Murray, C. J. L. & Ezzati, M. Global and regional mortality from ischaemic heart disease and stroke attributable to higher-than-optimum blood glucose concentration: comparative risk assessment. Lancet 368, 1651–1659 (2006). 25. The Emerging Risk Factors Collaboration. Diabetes mellitus, fasting blood glucose concentration, and risk of vascular disease: a collaborative meta-analysis of 102 prospective studies. Lancet 375, 2215–2222 (2010). 26. Moxey, P. W. et al. Lower extremity amputations--a review of global variability in incidence. Diabet. Med. 28,

237 1144–1153 (2011). 27. Coresh, J. et al. Decline in estimated glomerular filtration rate and subsequent risk of end-stage renal disease and mortality. JAMA 311, 2518–2531 (2014). 28. Fox, C. S. et al. Associations of kidney disease measures with mortality and end-stage renal disease in individuals with and without diabetes: a meta-analysis. Lancet 380, 1662–1673 (2012). 29. The Renal Association. Annual Report. (2018). 30. Rosolowsky, E. T. et al. Risk for ESRD in Type 1 Diabetes Remains High Despite Renoprotection. J. Am. Soc. Nephrol. 22, 545–553 (2011). 31. Reutens, A. T. Epidemiology of Diabetic Kidney Disease. Med. Clin. North Am. 97, 1–18 (2013). 32. Groop, P. et al. The Presence and Severity of Chronic Kidney Disease Predicts All-Cause Mortality in Type 1 Diabetes. Diabetes 58, 1651–1658 (2009). 33. Rao Kondapally Seshasai, S. et al. Diabetes Mellitus, Fasting Glucose, and Risk of Cause-Specific Death. N. Engl. J. Med. 364, 829–841 (2011). 34. Tuttle, K. R. et al. Diabetic Kidney Disease: A Report from an ADA Consensus Conference. Am. J. Kidney Dis. 64, 510–533 (2014). 35. Molitch, M. E. et al. Diabetic kidney disease: a clinical update from Kidney Disease: Improving Global Outcomes. Kidney Int. 87, 20–30 (2015). 36. Hill, C. J. et al. Chronic kidney disease and diabetes in the National Health Service: A cross-sectional survey of the UK National Diabetes Audit. Diabet. Med. 31, 448–454 (2014). 37. Barbosa, J. Is diabetic microangiopathy genetically heterogeneous? HLA and diabetic nephropathy. Horm. Metab. Res. Suppl. 11, 77–80 (1981). 38. Buffon, M. P., Sortica, D. A., Gerchman, F., Crispim, D. & Canani, L. H. FRMD3 gene: its role in diabetic kidney disease. A narrative review. Diabetol. Metab. Syndr. 7, 118 (2015). 39. Swan, E. J., Maxwell, A. P. & McKnight, A. J. Distinct methylation patterns in genes that affect mitochondrial function are associated with kidney disease in blood-derived DNA from individuals with Type 1 diabetes. Diabet. Med. 32, 1110–1115 (2015). 40. van Zuydam, N. R. et al. A Genome-Wide Association Study of Diabetic Kidney Disease in Subjects With Type 2 Diabetes. Diabetes 67, 1414–1427 (2018). 41. Sandholm, N. et al. New susceptibility loci associated with kidney disease in type 1 diabetes. PLoS Genet. 8, e1002921 (2012). 42. Germain, M. et al. SORBS1 gene, a new candidate for diabetic nephropathy: results from a multi-stage genome- wide association study in patients with type 1 diabetes. Diabetologia 58, 543–548 (2015). 43. Ma, R. C. W. & Cooper, M. E. Genetics of Diabetic Kidney Disease-From the Worst of Nightmares to the Light of Dawn? J. Am. Soc. Nephrol. 28, 389–393 (2017). 44. Böger, C. A. & Sedor, J. R. GWAS of Diabetic Nephropathy: Is the GENIE out of the Bottle? PLOS Genet. 8, e1002989 (2012). 45. Sandholm, N. et al. The Genetic Landscape of Renal Complications in Type 1 Diabetes. J. Am. Soc. Nephrol. 28, 557–574 (2017). 46. The Expert Committee on the Diagnosis and Classification of Diabetes Mellitus. Report of the Expert Committee on the Diagnosis and Classification of Diabetes Mellitus. Diabetes care 20, (1997). 47. WHO. Use of glycated haemoglobin (HbA1c) in the diagnosis of diabetes mellitus. Abbreviated Rep. a WHO Consult. 1–25 (2011). 48. The International Expert Committee. International Expert Committee Report on the Role of the A1C Assay in the Diagnosis of Diabetes. Diabetes Care 32, 1327–1334 (2009). 49. Manley, S. et al. Diagnosis of diabetes : HbA 1c versus WHO criteria. Diabetes Prim. Care 12, 87–96 (2010). 50. John, W. G. Use of HbA 1c in the diagnosis of diabetes mellitus in the UK. The implementation of World Health Organization guidance 2011. Diabet. Med. 29, 1350–1357 (2012). 51. Jenkins, G. W., Kemnitz, C. P. & Tortora, G. J. The Endocrine System. in Anatomy and Physiology: From science to life (ed. Roesch, B.) 653 (John Wiley & Sons, Ltd., 2013).

238 52. Mayfield, J. Diagnosis and classification of diabetes mellitus: new criteria. Am. Fam. Physician 58, 1355-1362,1369- 1370 (1998). 53. National Institute of Diabetes and Digestive and Kidney Diseases. Health Information - Diabetes. NIH (2016). Available at: https://www.niddk.nih.gov/health-information/diabetes/overview/tests-diagnosis. (Accessed: 27th March 2019) 54. Lundgren, V. M. et al. GAD Antibody Positivity Predicts Type 2 Diabetes in an Adult Population. Diabetes 59, 416– 422 (2010). 55. Maahs, D. M. & Rewers, M. Editorial: Mortality and renal disease in type 1 diabetes mellitus--progress made, more to be done. The Journal of clinical endocrinology and metabolism 91, 3757–3759 (2006). 56. Mordes, J. P. et al. The BB/Wor rat and the balance hypothesis of autoimmunity. Diabetes. Metab. Rev. 12, 103– 109 (1996). 57. Makino, S. et al. Breeding of a non-obese, diabetic strain of mice. Jikken Dobutsu. 29, 1–13 (1980). 58. DiLorenzo, T. P. & Serreze, D. V. The good turned ugly: immunopathogenic basis for diabetogenic CD8+ T cells in NOD mice. Immunol. Rev. 204, 250–263 (2005). 59. Serreze, D. V et al. B lymphocytes are critical antigen-presenting cells for the initiation of T cell-mediated autoimmune diabetes in nonobese diabetic mice. J. Immunol. 161, 3912–3918 (1998). 60. Han, B. et al. Developmental control of CD8(+) T cell–avidity maturation in autoimmune diabetes. J. Clin. Invest. 115, 1879–1887 (2005). 61. Burton, A. R. et al. On the pathogenicity of autoantigen-specific T-cell receptors. Diabetes 57, 1321–1330 (2008). 62. Roep, B. O. & Peakman, M. Antigen targets of type 1 diabetes autoimmunity. Cold Spring Harb. Perspect. Med. 2, a007781 (2012). 63. Marcovecchio, L., Dunger, D. B., Peakman, M. & Taylor, K. W. Aetiology of type 1 diabetes mellitus - genetics, autoimmunity and trigger factors. in Evidence-based Paediatric and Adolescent Diabetes (eds. Allgrove, J. & Swift, P. G. F.) 26–41 (Blackwell Publishing Ltd, 2007). 64. Parthasarathy, S. & Choudhary, P. Epidemiology, aetiology and pathogenesis of type 1 diabetes. in Advanced Nutrition and Dietetics in Diabetes (eds. Goff, L. & Dyson, P.) 53–59 (John Wiley & Sons, Ltd., 2016). 65. Colman, P. G. et al. The Melbourne Pre-Diabetes Study: prediction of type 1 diabetes mellitus using antibody and metabolic testing. Med. J. Aust. 169, 81–84 (1998). 66. Siljander, H. T. A. et al. Predictive Characteristics of Diabetes-Associated Autoantibodies Among Children With HLA-Conferred Disease Susceptibility in the General Population. Diabetes 58, 2835 LP – 2842 (2009). 67. Bradfield, J. P. et al. A Genome-Wide Meta-Analysis of Six Type 1 Diabetes Cohorts Identifies Multiple Associated Loci. PLoS Genet. 7, e1002293 (2011). 68. Nyaga, D. M., Vickers, M. H., Jefferies, C., Perry, J. K. & O’Sullivan, J. M. Type 1 Diabetes Mellitus-Associated Genetic Variants Contribute to Overlapping Immune Regulatory Networks . Frontiers in Genetics 9, 535 (2018). 69. Koulmanda, M. et al. Modification of adverse inflammation is required to cure new-onset type 1 diabetic hosts. Proc. Natl. Acad. Sci. U. S. A. 104, 13074–13079 (2007). 70. Böni-Schnetzler, M. et al. Increased Interleukin (IL)-1β Messenger Ribonucleic Acid Expression in β-Cells of Individuals with Type 2 Diabetes and Regulation of IL-1β in Human Islets by Glucose and Autostimulation. J. Clin. Endocrinol. Metab. 93, 4065–4074 (2008). 71. Gong, Z., Tas, E. & Muzumdar, R. Humanin and age-related diseases: a new link? Front. Endocrinol. (Lausanne). 5, 210 (2014). 72. Lee, Y.-J., Han, S. B., Nam, S.-Y., Oh, K.-W. & Hong, J. T. Inflammation and Alzheimer’s disease. Arch. Pharm. Res. 33, 1539–1556 (2010). 73. Tuppo, E. E. & Arias, H. R. The role of inflammation in Alzheimer’s disease. Int. J. Biochem. Cell Biol. 37, 289–305 (2005). 74. Akiyama, H. et al. Inflammation and Alzheimer’s disease. Neurobiol. Aging 21, 383–421 (2000). 75. Libby, P. Inflammation in atherosclerosis. Nature 420, 868–874 (2002). 76. Hansson, G. K. Inflammation, Atherosclerosis, and Coronary Artery Disease. N. Engl. J. Med. 352, 1685–1695 (2005).

239 77. Kinney, J. W. et al. Inflammation as a central mechanism in Alzheimer’s disease. Alzheimer’s Dement. Transl. Res. Clin. Interv. 4, 575–590 (2018). 78. National Collaborating Centre for Women’s and Children’s Health. Diabetes (type 1 and type 2) in children and young people: diagnosis and management. Diabetes UK (2015). 79. Roncarolo, M.-G. & Battaglia, M. Regulatory T-cell immunotherapy for tolerance to self antigens and alloantigens in humans. Nat. Rev. Immunol. 7, 585–598 (2007). 80. Matthews, D. R. Epidemiology, aetiology and pathogenesis of type 2 diabetes. in Advanced Nutrition and Dietetics in Diabetes (eds. Goff, L. & Dyson, P.) 95–102 (John Wiley & Sons, Ltd., 2016). 81. Meier, J. J. & Bonadonna, R. C. Role of Reduced β-Cell Mass Versus Impaired β-Cell Function in the Pathogenesis of Type 2 Diabetes. Diabetes Care 36, S113 LP-S119 (2013). 82. Salehi, M. & D’Alessio, D. A. Going With the Flow: Adaptation of β-Cell Function to Glucose Fluxes After Bariatric Surgery. Diabetes 62, 3671–3673 (2013). 83. Lim, E. L. et al. Reversal of type 2 diabetes: normalisation of beta cell function in association with decreased pancreas and liver triacylglycerol. Diabetologia 54, 2506–2514 (2011). 84. DeFronzo, R. A. et al. Type 2 diabetes mellitus. Nat. Rev. Dis. Prim. 1, 15019 (2015). 85. Sakamoto, J. et al. Activation of human peroxisome proliferator-activated receptor (PPAR) subtypes by pioglitazone. Biochem. Biophys. Res. Commun. 278, 704–711 (2000). 86. Kung, J. & Henry, R. R. Thiazolidinedione safety. Expert Opin. Drug Saf. 11, 565–579 (2012). 87. Rosenbloom, A. L., Silverstein, J. H., Amemiya, S., Zeitler, P. & Klingensmith, G. J. ISPAD Clinical Practice Consensus Guidelines 2006-2007. Type 2 diabetes mellitus in the child and adolescent. Pediatr. Diabetes 9, 512–526 (2008). 88. Wadden, T. A., Brownell, K. D. & Foster, G. D. Obesity: responding to the global epidemic. J. Consult. Clin. Psychol. 70, 510–525 (2002). 89. Felber, J.-P. & Golay, A. Pathways from obesity to diabetes. Int. J. Obes. Relat. Metab. Disord. 26 Suppl 2, S39-45 (2002). 90. Hossain, P., Kawar, B. & El Nahas, M. Obesity and diabetes in the developing world--a growing challenge. N. Engl. J. Med. 356, 213–215 (2007). 91. Lewis, G. F., Carpentier, A., Adeli, K. & Giacca, A. Disordered fat storage and mobilization in the pathogenesis of insulin resistance and type 2 diabetes. Endocr. Rev. 23, 201–229 (2002). 92. Ravussin, E. & Smith, S. R. Increased fat intake, impaired fat oxidation, and failure of fat cell proliferation result in ectopic fat storage, insulin resistance, and type 2 diabetes mellitus. Ann. N. Y. Acad. Sci. 967, 363–378 (2002). 93. Lois, K. & Kumar, S. Obesity and diabetes. Endocrinol. y Nutr. 56, 38–42 (2009). 94. Lee, H. I. B., Yu, M., Yang, Y., Jiang, Z. & Ha, H. Reactive Oxygen Species-Regulated Signaling Pathways in Diabetic Nephropathy. 241–245 (2003). 95. Sladek, R. et al. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature 445, 881– 885 (2007). 96. Tsai, F.-J. et al. A Genome-Wide Association Study Identifies Susceptibility Variants for Type 2 Diabetes in Han Chinese. PLoS Genet 6, e1000847 (2010). 97. Flannick, J. & Florez, J. C. Type 2 diabetes: genetic data sharing to advance complex disease research. Nat. Rev. Genet. 17, 535–549 (2016). 98. Scott, L. J. et al. A Genome-Wide Association Study of Type 2 Diabetes in Finns Detects Multiple Susceptibility Variants. Science (80-. ). 316, 1341–1345 (2007). 99. Ling, C. et al. Epigenetic regulation of PPARGC1A in human type 2 diabetic islets and effect on insulin secretion. Diabetologia 51, 615–622 (2008). 100. Moon, M. K. et al. Genetic polymorphisms in peroxisome proliferator-activated receptor gamma are associated with type 2 diabetes mellitus and obesity in the Korean population. Diabet. Med. 22, 1161–1166 (2005). 101. Wada, J. & Nakatsuka, A. Mitochondrial Dynamics and Mitochondrial Dysfunction in Diabetes. Acta Med. Okayama 70, 151–158 (2016). 102. Chand, S. et al. Analysis of single nucleotide polymorphisms implicate mTOR signalling in the development of new- onset diabetes after transplantation. BBA Clin. 5, 41–45 (2016).

240 103. Frayling, T. M. et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 316, 889–894 (2007). 104. World Health Organization. Diagnostic Criteria and Classification of Hyperglycaemia First Detected in Pregnancy. World Heal. Organ. 1–63 (2013). 105. Bellamy, L., Casas, J.-P., Hingorani, A. D. & Williams, D. Type 2 diabetes mellitus after gestational diabetes: a systematic review and meta-analysis. Lancet (London, England) 373, 1773–1779 (2009). 106. Homko, C., Sivan, E., Chen, X., Reece, E. A. & Boden, G. Insulin secretion during and after pregnancy in patients with gestational diabetes mellitus. J. Clin. Endocrinol. Metab. 86, 568–573 (2001). 107. Barbour, L. A. et al. Cellular mechanisms for insulin resistance in normal pregnancy and gestational diabetes. Diabetes Care 30 Suppl 2, S112-9 (2007). 108. Dornhorst, A. & Misra, S. Epidemiology, aetiology and pathogensis of diabetes in pregnancy. in Advanced Nutrition and Dietetics in Diabetes (eds. Goff, L. & Dyson, P.) 159–167 (John Wiley & Sons, Ltd., 2016). 109. Thanabalasingham, G. & Owen, K. R. Diagnosis and management of maturity onset diabetes of the young (MODY). BMJ 343, (2011). 110. Martin, D. et al. Long-Term Follow-Up of Oral Glucose Tolerance Test–Derived Glucose Tolerance and Insulin Secretion and Insulin Sensitivity Indexes in Subjects With Glucokinase Mutations (MODY2). Diabetes Care 31, 1321 LP – 1323 (2008). 111. Pearson, E. R. et al. Switching from insulin to oral sulfonylureas in patients with diabetes due to Kir6.2 mutations. N. Engl. J. Med. 355, 467–477 (2006). 112. Leslie, R. D. G., Williams, R. & Pozzilli, P. Clinical review: Type 1 diabetes and latent autoimmune diabetes in adults: one end of the rainbow. J. Clin. Endocrinol. Metab. 91, 1654–1659 (2006). 113. Edwards, C. M. B. International Textbook of Diabetes Mellitus. J. R. Soc. Med. 97, 554 (2004). 114. Li, R., Zhang, P., Barker, L. E., Chowdhury, F. M. & Zhang, X. Cost-effectiveness of interventions to prevent and control diabetes mellitus: a systematic review. Diabetes Care 33, 1872–1894 (2010). 115. Herman, W. H. et al. Early Detection and Treatment of Type 2 Diabetes Reduce Cardiovascular Morbidity and Mortality: A Simulation of the Results of the Anglo-Danish-Dutch Study of Intensive Treatment in People With Screen-Detected Diabetes in Primary Care (ADDITION-Europe). Diabetes Care 38, 1449–1455 (2015). 116. International Diabetes Federation. Access to Medicines and Supplies for People with Diabetes: a global survey on patients’ and health professionals’ perspective. (International Diabetes Federation, 2016). 117. Scott, A., Chambers, D., Goyder, E. & O’Cathain, A. Socioeconomic inequalities in mortality, morbidity and diabetes management for adults with type 1 diabetes: A systematic review. PLoS One 12, e0177210 (2017). 118. Brinkman, A. K. Management of Type 1 Diabetes. Nurs. Clin. North Am. 52, 499–511 (2017). 119. International Diabetes Federation. No child should die of diabetes - Life for Child and International Diabetes Federation programme. International Diabetes Federation (2015). 120. Bailey, C. J., Tahrani, A. A. & Barnett, A. H. Future glucose-lowering drugs for type 2 diabetes. Lancet Diabetes Endocrinol. 4, 350–359 (2016). 121. Rena, G., Hardie, D. G. & Pearson, E. R. The mechanisms of action of metformin. Diabetologia 60, 1577–1585 (2017). 122. El-Mir, M. Y. et al. Dimethylbiguanide inhibits cell respiration via an indirect effect targeted on the respiratory chain complex I. J. Biol. Chem. 275, 223–228 (2000). 123. Zhou, G. et al. Role of AMP-activated protein kinase in mechanism of metformin action. J. Clin. Invest. 108, 1167– 1174 (2001). 124. Owen, M. R., Doran, E. & Halestrap, A. P. Evidence that metformin exerts its anti-diabetic effects through inhibition of complex 1 of the mitochondrial respiratory chain. Biochem. J. 348 Pt 3, 607–614 (2000). 125. Bridges, H. R., Jones, A. J. Y., Pollak, M. N. & Hirst, J. Effects of metformin and other biguanides on oxidative phosphorylation in mitochondria. Biochem. J. 462, 475–487 (2014). 126. Deschemin, J.-C., Foretz, M., Viollet, B. & Vaulont, S. AMPK is not required for the effect of metformin on the inhibition of BMP6-induced hepcidin gene expression in hepatocytes. Sci. Rep. 7, 12679 (2017). 127. Pryor, H. J., Smyth, J. E., Quinlan, P. T. & Halestrap, A. P. Evidence that the flux control coefficient of the respiratory chain is high during gluconeogenesis from lactate in hepatocytes from starved rats. Implications for the

241 hormonal control of gluconeogenesis and action of hypoglycaemic agents. Biochem. J. 247, 449–457 (1987). 128. Hsu, W.-H. et al. Effect of metformin on kidney function in patients with type 2 diabetes mellitus and moderate chronic kidney disease. Oncotarget 9, 5416–5423 (2018). 129. Lalau, J.-D. et al. Metformin Treatment in Patients With Type 2 Diabetes and Chronic Kidney Disease Stages 3A, 3B, or 4. Diabetes Care dc172231 (2018). 130. Marín-Peñalver, J. J., Martín-Timón, I., Sevillano-Collantes, C. & Del Cañizo-Gómez, F. J. Update on the treatment of type 2 diabetes mellitus. World J. Diabetes 7, 354–395 (2016). 131. Genuth, S. Should Sulfonylureas Remain an Acceptable First-Line Add-on to Metformin Therapy in Patients With Type 2 Diabetes? No, It’s Time to Move On! Diabetes Care 38, 170 LP – 175 (2015). 132. Eldor, R. & Raz, I. Diabetes therapy-focus on Asia: second-line therapy debate: insulin/secretagogues. Diabetes. Metab. Res. Rev. 28, 85–89 (2012). 133. Gerich, J., Raskin, P., Jean-Louis, L., Purkayastha, D. & Baron, M. A. PRESERVE-beta: two-year efficacy and safety of initial combination therapy with nateglinide or glyburide plus metformin. Diabetes Care 28, 2093–2099 (2005). 134. McIntosh, B. et al. Second-line therapy in patients with type 2 diabetes inadequately controlled with metformin monotherapy: a systematic review and mixed-treatment comparison meta-analysis. Open Med. 5, e35–e48 (2011). 135. DiNicolantonio, J. J., Bhutani, J. & O’Keefe, J. H. Acarbose: safe and effective for lowering postprandial hyperglycaemia and improving cardiovascular outcomes. Open Hear. 2, e000327 (2015). 136. Chiasson, J. L. et al. The STOP-NIDDM Trial: an international study on the efficacy of an alpha-glucosidase inhibitor to prevent type 2 diabetes in a population with impaired glucose tolerance: rationale, design, and preliminary screening data. Study to Prevent Non-Insulin-Depe. Diabetes Care 21, 1720–1725 (1998). 137. Chiasson, J.-L. et al. Acarbose treatment and the risk of cardiovascular disease and hypertension in patients with impaired glucose tolerance: the STOP-NIDDM trial. JAMA 290, 486–494 (2003). 138. Watkins, P. B. & Whitcomb, R. W. Hepatic dysfunction associated with troglitazone. The New England journal of medicine 338, 916–917 (1998). 139. DeFronzo, R. A., Triplitt, C. L., Abdul-Ghani, M. & Cersosimo, E. Novel Agents for the Treatment of Type 2 Diabetes. Diabetes Spectr. 27, 100 LP – 112 (2014). 140. BC Campus. Gross Anatomy of the Kidney. in Anatomy and Physiology 172 (BC Open Textbook project, 2019). 141. O’Callaghan, C. The kidney: Functional overview. in The Renal System at a Glance 4–6 (John Wiley & Sons, Ltd., 2017). 142. Patton, K. T. & Thibodeau, G. A. Urinary System. in The Human Body in Health and Disease 558–559 (Elsevier Inc., 2018). 143. The Editors of Encyclopaedia Britannica. Nephron. Encyclopædia Britannica (2019). Available at: https://www.britannica.com/science/nephron. (Accessed: 29th September 2019) 144. Jha, V. et al. Chronic kidney disease: Global dimension and perspectives. Lancet (London, England) 382, 260–272 (2013). 145. Hill, N. R. et al. Global prevalence of chronic kidney disease – A systematic review and meta-analysis. PLoS One 11, e0158765 (2016). 146. Lozano, R. et al. Global and regional mortality from 235 causes of death for 20 age groups in 1990 and 2010: A systematic analysis for the Global Burden of Disease Study 2010. Lancet 380, 2095–2128 (2012). 147. Foreman, K. J. et al. Forecasting life expectancy, years of life lost, and all-cause and cause-specific mortality for 250 causes of death: reference and alternative scenarios for 2016-40 for 195 countries and territories. Lancet (London, England) 392, 2052–2090 (2018). 148. Anand, S. et al. Prevalence of chronic kidney disease and risk factors for its progression: A cross-sectional comparison of Indians living in Indian versus U.S. cities. PLoS One 12, e0173554 (2017). 149. Tonelli, M. et al. How to advocate for the inclusion of chronic kidney disease in a national noncommunicable chronic disease program. Kidney Int. 85, 1269–1274 (2014). 150. Liyanage, T. et al. Worldwide access to treatment for end-stage kidney disease: a systematic review. Lancet 385, 1975–1982 (2015). 151. KDIGO. KDIGO 2012 Clinical practice guideline for the evaluation and management of chronic kidney disease. Off.

242 J. Int. Soc. Nephrol. 3, (2013). 152. Astor, B. C. et al. Lower estimated glomerular filtration rate and higher albuminuria are associated with mortality and end-stage renal disease. A collaborative meta-analysis of kidney disease population cohorts. Kidney Int. 79, 1331–1340 (2011). 153. Gansevoort, R. T. et al. Lower estimated GFR and higher albuminuria are associated with adverse kidney outcomes. A collaborative meta-analysis of general and high-risk population cohorts. Kidney Int. 80, 93–104 (2011). 154. Matsushita, K. et al. Association of estimated glomerular filtration rate and albuminuria with all-cause and cardiovascular mortality in general population cohorts: a collaborative meta-analysis. Lancet (London, England) 375, 2073–2081 (2010). 155. van der Velde, M. et al. Lower estimated glomerular filtration rate and higher albuminuria are associated with all- cause and cardiovascular mortality. A collaborative meta-analysis of high-risk population cohorts. Kidney Int. 79, 1341–1352 (2011). 156. Thomas, B. Proteinuria. Medscape (2018). Available at: https://emedicine.medscape.com/article/238158- overview. (Accessed: 11th September 2019) 157. Andrade, M. & Knight, J. Exploring the anatomy and physiology of ageing: Part 4--the renal system. Nurs. Times 104, 22–23 (2008). 158. Poggio, E. D. et al. Demographic and clinical characteristics associated with glomerular filtration rates in living kidney donors. Kidney Int. 75, 1079–1087 (2009). 159. Denic, A., Glassock, R. J. & Rule, A. D. Structural and functional changes with the aging kidney. Adv. Chronic Kidney Dis. 23, 19–28 (2016). 160. Noone, D. & Licht, C. Chronic kidney disease: a new look at pathogenetic mechanisms and treatment options. Pediatr. Nephrol. 29, 779–792 (2014). 161. Brenner, B. M., Hostetter, T. H. & Humes, H. D. Molecular basis of proteinuria of glomerular origin. N. Engl. J. Med. 298, 826–833 (1978). 162. Brenner, B. M., Meyer, T. W. & Hostetter, T. H. Dietary protein intake and the progressive nature of kidney disease: the role of hemodynamically mediated glomerular injury in the pathogenesis of progressive glomerular sclerosis in aging, renal ablation, and intrinsic renal disease. N. Engl. J. Med. 307, 652–659 (1982). 163. Hostetter, T. H., Olson, J. L., Rennke, H. G., Venkatachalam, M. A. & Brenner, B. M. Hyperfiltration in remnant nephrons: a potentially adverse response to renal ablation. Am. J. Physiol. 241, F85-93 (1981). 164. Klag, M. J. et al. Blood pressure and end-stage renal disease in men. N. Engl. J. Med. 334, 13–18 (1996). 165. Ruggenenti, P., Cravedi, P. & Remuzzi, G. Mechanisms and treatment of CKD. J. Am. Soc. Nephrol. 23, 1917–1928 (2012). 166. Slee, A. D. Exploring metabolic dysfunction in chronic kidney disease. Nutr Metab 9, (2012). 167. Siew, E. D. & Ikizler, T. A. Insulin resistance and protein energy metabolism in patients with advanced chronic kidney disease. Semin Dial 23, (2010). 168. Dounousi, E. et al. Oxidative stress is progressively enhanced with advancing stages of CKD. Am J Kidney Dis 48, (2006). 169. Cachofeiro, V. et al. Oxidative stress and inflammation, a link between chronic kidney disease and cardiovascular disease. Kidney Int Suppl 111, (2008). 170. Brownlee, M. Biochemistry and molecular cell biology of diabetic complications. Nature 414, 813–820 (2001). 171. Silverthorn, D. U. The Kidneys. in Human Physiology: An Integrated Approach 621–622 (Pearson Education Limited, 2016). 172. Grahammer, F. et al. A flexible, multilayered protein scaffold maintains the slit in between glomerular podocytes. JCI insight 1, e86177 (2016). 173. Lemley, K. V. Glomerular pathology and the progression of chronic kidney disease. Am. J. Physiol. Physiol. 310, F1385–F1388 (2016). 174. Chaudhry, S. Chronic Kidney Disease. McMaster Pathophysiology Reveiw (2013). Available at: http://www.pathophys.org/ckd/#Pathophysiology. (Accessed: 29th September 2019) 175. Lopez-Novoa, J. M. & Nieto, M. A. Inflammation and EMT: an alliance towards organ fibrosis and cancer

243 progression. EMBO Mol. Med. 1, 303–314 (2009). 176. Schiffer, M. et al. Apoptosis in podocytes induced by TGF-β and Smad7. J. Clin. Invest. 108, 807–816 (2001). 177. Kopp, J. B. et al. Transgenic mice with increased plasma levels of TGF-beta 1 develop progressive renal disease. Lab. Invest. 74, 991–1003 (1996). 178. Liu, Y. Epithelial to mesenchymal transition in renal fibrogenesis: pathologic significance, molecular mechanism, and therapeutic intervention. J. Am. Soc. Nephrol. 15, 1–12 (2004). 179. Tang, J., Liu, N. & Zhuang, S. Role of epidermal growth factor receptor in acute and chronic kidney injury. Kidney Int. 83, 804–810 (2013). 180. Hutton, H. L., Ooi, J. D., Holdsworth, S. R. & Kitching, A. R. The NLRP3 inflammasome in kidney disease and autoimmunity. Nephrology 21, 736–744 (2016). 181. Souza, A. C. P. et al. TLR4 mutant mice are protected from renal fibrosis and chronic kidney disease progression. Physiol. Rep. 3, (2015). 182. Go, A. S., Chertow, G. M., Fan, D., McCulloch, C. E. & Hsu, C. Chronic kidney disease and the risks of death, cardiovascular events, and hospitalization. N. Engl. J. Med. 351, 1296–1305 (2004). 183. Coresh, J. et al. Prevalence of chronic kidney disease in the United States. JAMA 298, 2038–2047 (2007). 184. Abramson, J. L., Jurkovitz, C. T., Vaccarino, V., Weintraub, W. S. & McClellan, W. Chronic kidney disease, anemia, and incident stroke in a middle-aged, community-based population: the ARIC Study. Kidney Int. 64, 610–615 (2003). 185. Van Biesen, W. et al. The glomerular filtration rate in an apparently healthy population and its relation with cardiovascular mortality during 10 years. Eur. Heart J. 28, 478–483 (2007). 186. Krittayaphong, R. et al. Prevalence of chronic kidney disease associated with cardiac and vascular complications in hypertensive patients: a multicenter, nation-wide study in Thailand. BMC Nephrol. 18, 115 (2017). 187. Toyoda, K. & Ninomiya, T. Stroke and cerebrovascular diseases in patients with chronic kidney disease. Lancet. Neurol. 13, 823–833 (2014). 188. Ene-Iordache, B. et al. Chronic kidney disease and cardiovascular risk in six regions of the world (ISN-KDDC): a cross-sectional study. Lancet. Glob. Heal. 4, e307-19 (2016). 189. Schiffrin, E. L., Lipman, M. L. & Mann, J. F. E. Chronic kidney disease: effects on the cardiovascular system. Circulation 116, 85–97 (2007). 190. Tomey, M. I. & Winston, J. A. Cardiovascular pathophysiology in chronic kidney disease: opportunities to transition from disease to health. Ann. Glob. Heal. 80, 69–76 (2014). 191. Small, D. M., Coombes, J. S., Bennett, N., Johnson, D. W. & Gobe, G. C. Oxidative stress, anti-oxidant therapies and chronic kidney disease. Nephrology (Carlton). 17, 311–321 (2012). 192. Granata, S., Dalla Gassa, A., Tomei, P., Lupo, A. & Zaza, G. Mitochondria: A new therapeutic target in chronic kidney disease. Nutr. Metab. (Lond). 12, 49 (2015). 193. Gamboa, J. L. et al. Mitochondrial dysfunction and oxidative stress in patients with chronic kidney disease. Physiol. Rep. 4, e12780 (2016). 194. Chen, J. et al. EGFR signaling promotes TGFbeta-dependent renal fibrosis. J. Am. Soc. Nephrol. 23, 215–224 (2012). 195. Limou, S., Vince, N. & Parsa, A. Lessons from CKD-related genetic association studies–moving forward. Clin. J. Am. Soc. Nephrol. 13, 140–152 (2018). 196. Neuen, B. L., Chadban, S. J., Demaio, A. R., Johnson, D. W. & Perkovic, V. Chronic kidney disease and the global NCDs agenda. BMJ Glob. Heal. 2, e000380 (2017). 197. Wang, H. et al. Global, regional, and national life expectancy, all-cause mortality, and cause-specific mortality for 249 causes of death, 1980–2015: a systematic analysis for the Global Burden of Disease Study 2015. Lancet 388, 1459–1544 (2016). 198. MacIsaac, R. J., Jerums, G. & Watts, G. F. Diabetic Chronic Kidney Disease. in Diabetes: Chronic Complications (eds. Shaw, K. M. & Cummings, M. H.) 34–66 (John Wiley & Sons, Ltd., 2012). 199. Batuman, V., J Schmidt, R., Soman, A. S. & Soman, S. S. Diabetic Nephropathy. Medscape (2019). Available at: https://emedicine.medscape.com/article/238946-overview. (Accessed: 26th July 2019) 200. Yip, J. W., Jones, S. L., Wiseman, M. J., Hill, C. & Viberti, G. Glomerular Hyperfiltration in the Prediction of

244 Nephropathy in IDDM: A 10-Year Follow-Up Study. Diabetes 45, 1729 LP – 1733 (1996). 201. Reidy, K., Kang, H. M., Hostetter, T. & Susztak, K. Molecular mechanisms of diabetic kidney disease. J. Clin. Invest. 124, 2333–2340 (2014). 202. MacIsaac, R. J., Ekinci, E. I. & Jerums, G. Markers of and Risk Factors for the Development and Progression of Diabetic Kidney Disease. Am. J. Kidney Dis. 63, S39–S62 (2014). 203. Stratton, I. M. et al. Association of glycaemia with macrovascular and microvascular complications of type 2 diabetes (UKPDS 35): prospective observational study. BMJ 321, 405 LP – 412 (2000). 204. Zoungas, S. et al. Association of HbA1c levels with vascular complications and death in patients with type 2 diabetes: evidence of glycaemic thresholds. Diabetologia 55, 636–643 (2012). 205. De Boer, I. H. et al. Intensive diabetes therapy and glomerular filtration rate in type 1 diabetes. N. Engl. J. Med. 365, 2366–2376 (2011). 206. Holman, R. R., Paul, S. K., Bethel, M. A., Matthews, D. R. & Neil, H. A. W. 10-Year Follow-up of Intensive Glucose Control in Type 2 Diabetes. N. Engl. J. Med. 359, 1577–1589 (2008). 207. UK Prospective Diabetes Study (UKPDS) Group. Intensive blood-glucose control with sulphonylureas or insulin compared with conventional treatment and risk of complications in patients with type 2 diabetes (UKPDS 33). Lancet 352, 837–853 (1998). 208. Rodríguez-Gutiérrez, R. & Montori, V. M. Glycemic Control for Patients With Type 2 Diabetes Mellitus: Our Evolving Faith in the Face of Evidence. Circ. Cardiovasc. Qual. Outcomes 9, 504–512 (2016). 209. Giorgino, F., Home, P. D. & Tuomilehto, J. Glucose Control and Vascular Outcomes in Type 2 Diabetes: Is the Picture Clear? Diabetes Care 39, 187–195 (2016). 210. Seaquist, E. R., Goetz, F. C., Rich, S. & Barbosa, J. Familial Clustering of Diabetic Kidney Disease. N. Engl. J. Med. 320, 1161–1165 (1989). 211. Smyth, L. J., Canadas-Garre, M., Cappa, R. C., Maxwell, A. P. & McKnight, A. J. Genetic associations between genes in the renin-angiotensin-aldosterone system and renal disease: a systematic review and meta-analysis. BMJ Open 9, e026777 (2019). 212. Said, S. M. & Nasr, S. H. Silent diabetic nephropathy. Kidney Int. 90, 24–26 (2016). 213. Tuttle, K. R. et al. Effect of strict glycemic control on renal hemodynamic response to amino acids and renal enlargement in insulin-dependent diabetes mellitus. N. Engl. J. Med. 324, 1626–1632 (1991). 214. Tuttle, K. R. & Bruton, J. L. Effect of insulin therapy on renal hemodynamic response to amino acids and renal hypertrophy in non-insulin-dependent diabetes. Kidney Int. 42, 167–173 (1992). 215. Grabias, B. M. & Konstantopoulos, K. The physical basis of renal fibrosis: effects of altered hydrodynamic forces on kidney homeostasis. Am. J. Physiol. Physiol. 306, F473–F485 (2014). 216. Premaratne, E. et al. The impact of hyperfiltration on the diabetic kidney. Diabetes Metab. 41, 5–17 (2015). 217. Tuttle, K. R. Back to the future: Glomerular hyperfiltration and the diabetic kidney. Diabetes 66, 14–16 (2017). 218. Heerspink, H. J. L., Perkins Bruce, A., Fitchett David, H., Husain, M. & Cherney David Z, I. Sodium Glucose Cotransporter 2 Inhibitors in the Treatment of Diabetes Mellitus. Circulation 134, 752–772 (2016). 219. Wolf, G., Butzmann, U. & Wenzel, U. O. The renin-angiotensin system and progression of renal disease: from hemodynamics to cell biology. Nephron. Physiol. 93, P3-13 (2003). 220. Lewis, E. J., Hunsicker, L. G., Bain, R. P. & Rohde, R. D. The effect of angiotensin-converting-enzyme inhibition on diabetic nephropathy. The Collaborative Study Group. N. Engl. J. Med. 329, 1456–1462 (1993). 221. Wilmer, W. A. et al. Remission of nephrotic syndrome in type 1 diabetes: long-term follow-up of patients in the Captopril Study. Am. J. Kidney Dis. 34, 308–314 (1999). 222. Lewis, E. J. et al. Renoprotective effect of the angiotensin-receptor antagonist irbesartan in patients with nephropathy due to type 2 diabetes. N. Engl. J. Med. 345, 851–860 (2001). 223. Brenner, B. M. et al. Effects of losartan on renal and cardiovascular outcomes in patients with type 2 diabetes and nephropathy. N. Engl. J. Med. 345, 861–869 (2001). 224. Benz, K. & Amann, K. Endothelin in diabetic renal disease. Contrib. Nephrol. 172, 139–148 (2011). 225. Toth-Manikowski, S. & Atta, M. G. Diabetic Kidney Disease: Pathophysiology and Therapeutic Targets. Journal of Diabetes Research 2015, (2015).

245 226. Du, X.-L. et al. Hyperglycemia-induced mitochondrial superoxide overproduction activates the hexosamine pathway and induces plasminogen activator inhibitor-1 expression by increasing Sp1 glycosylation. Proceedings of the National Academy of Sciences of the United States of America 97, 12222–12226 (2000). 227. Nishikawa, T. et al. Normalizing mitochondrial superoxide production blocks three pathways of hyperglycaemic damage. Nature 404, 787–790 (2000). 228. Beckerman, P. & Susztak, K. Sweet Debate: Fructose versus Glucose in Diabetic Kidney Disease. Journal of the American Society of Nephrology : JASN 25, 2386–2388 (2014). 229. Lanaspa, M. A. et al. Endogenous Fructose Production and Fructokinase Activation Mediate Renal Injury in Diabetic Nephropathy. Journal of the American Society of Nephrology : JASN 25, 2526–2538 (2014). 230. Schleicher, E. D. & Weigert, C. Role of the hexosamine biosynthetic pathway in diabetic nephropathy. Kidney Int. Suppl. 77, S13-8 (2000). 231. Chen, S., Jim, B. & Ziyadeh, F. N. Diabetic nephropathy and transforming growth factor-beta: transforming our view of glomerulosclerosis and fibrosis build-up. Semin. Nephrol. 23, 532–543 (2003). 232. Sheetz, M. J. & King, G. L. Molecular understanding of hyperglycemia’s adverse effects for diabetic complications. JAMA 288, 2579–2588 (2002). 233. Forbes, J. M., Cooper, M. E., Oldfield, M. D. & Thomas, M. C. Role of advanced glycation end products in diabetic nephropathy. J. Am. Soc. Nephrol. 14, S254-8 (2003). 234. Fukami, K., Yamagishi, S.-I., Ueda, S. & Okuda, S. Role of AGEs in diabetic nephropathy. Curr. Pharm. Des. 14, 946– 952 (2008). 235. Alicic, R. Z. & Tuttle, K. R. Novel therapies for diabetic kidney disease. Adv. Chronic Kidney Dis. 21, 121–133 (2014). 236. Noh, H. & King, G. L. The role of protein kinase C activation in diabetic nephropathy. Kidney Int. Suppl. S49-53 (2007). 237. Craven, P. A., Davidson, C. M. & DeRubertis, F. R. Increase in diacylglycerol mass in isolated glomeruli by glucose from de novo synthesis of glycerolipids. Diabetes 39, 667–674 (1990). 238. Prabhakar, S. S. Role of nitric oxide in diabetic nephropathy. Semin. Nephrol. 24, 333–344 (2004). 239. Tessari, P. Nitric oxide in the normal kidney and in patients with diabetic nephropathy. J. Nephrol. 28, 257–268 (2015). 240. Menne, J. et al. Diminished loss of proteoglycans and lack of albuminuria in protein kinase C-alpha-deficient diabetic mice. Diabetes 53, 2101–2109 (2004). 241. Garcia-Garcia, P. M., Getino-Melian, M. A., Dominguez-Pimentel, V. & Navarro-Gonzalez, J. F. Inflammation in diabetic kidney disease. World J. Diabetes 5, 431–443 (2014). 242. Pickup, J. C. Inflammation and activated innate immunity in the pathogenesis of type 2 diabetes. Diabetes Care 27, 813–823 (2004). 243. Sanz, A. B. et al. NF-kappaB in renal inflammation. J. Am. Soc. Nephrol. 21, 1254–1262 (2010). 244. Mezzano, S. et al. NF-kappaB activation and overexpression of regulated genes in human diabetic nephropathy. Nephrol. Dial. Transplant 19, 2505–2512 (2004). 245. Ohga, S. et al. Thiazolidinedione ameliorates renal injury in experimental diabetic rats through anti-inflammatory effects mediated by inhibition of NF-kappaB activation. Am. J. Physiol. Renal Physiol. 292, F1141-50 (2007). 246. Sakai, N. et al. p38 MAPK phosphorylation and NF-kappa B activation in human crescentic glomerulonephritis. Nephrol. Dial. Transplant 17, 998–1004 (2002). 247. Chuang, P. Y. & He, J. C. JAK/STAT signaling in renal diseases. Kidney international 78, 231–234 (2010). 248. Berthier, C. C. et al. Enhanced Expression of Janus Kinase–Signal Transducer and Activator of Transcription Pathway Members in Human Diabetic Nephropathy. Diabetes 58, 469–477 (2009). 249. Navarro, J. F., Milena, F. J., Mora, C., Leon, C. & Garcia, J. Renal pro-inflammatory cytokine gene expression in diabetic nephropathy: effect of angiotensin-converting enzyme inhibition and pentoxifylline administration. Am. J. Nephrol. 26, 562–570 (2006). 250. Sekizuka, K. et al. Detection of serum IL-6 in patients with diabetic nephropathy. Nephron 68, 284–285 (1994). 251. Moriwaki, Y. et al. Elevated levels of interleukin-18 and tumor necrosis factor-alpha in serum of patients with type 2 diabetes mellitus: relationship with diabetic nephropathy. Metabolism. 52, 605–608 (2003).

246 252. Navarro, J. F., Mora, C., Muros, M. & Garcia, J. Urinary tumour necrosis factor-alpha excretion independently correlates with clinical markers of glomerular and tubulointerstitial injury in type 2 diabetic patients. Nephrol. Dial. Transplant 21, 3428–3434 (2006). 253. Navarro-Gonzalez, J. F. & Mora-Fernandez, C. The role of inflammatory cytokines in diabetic nephropathy. J. Am. Soc. Nephrol. 19, 433–442 (2008). 254. Dalla Vestra, M. et al. Acute-phase markers of inflammation and glomerular structure in patients with type 2 diabetes. J. Am. Soc. Nephrol. 16 Suppl 1, S78-82 (2005). 255. Marino, E. & Cardier, J. E. Differential effect of IL-18 on endothelial cell apoptosis mediated by TNF-alpha and Fas (CD95). Cytokine 22, 142–148 (2003). 256. DiPetrillo, K., Coutermarsh, B. & Gesek, F. A. Urinary tumor necrosis factor contributes to sodium retention and renal hypertrophy during diabetes. Am. J. Physiol. Renal Physiol. 284, F113-21 (2003). 257. Badal, S. S. & Danesh, F. R. New Insights Into Molecular Mechanisms of Diabetic Kidney Disease. American journal of kidney diseases : the official journal of the National Kidney Foundation 63, S63-83 (2014). 258. Marshall, S. M. Natural history and clinical characteristics of CKD in type 1 and type 2 diabetes mellitus. Adv. Chronic Kidney Dis. 21, 267–272 (2014). 259. Perez-Gomez, M. V. et al. Horizon 2020 in Diabetic Kidney Disease: The Clinical Trial Pipeline for Add-On Therapies on Top of Renin Angiotensin System Blockade. J. Clin. Med. 4, 1325–1347 (2015). 260. Mizuiri, S. & Ohashi, Y. ACE and ACE2 in kidney disease. World J. Nephrol. 4, 74–82 (2015). 261. Nadarajah, R. et al. Podocyte-specific overexpression of human angiotensin-converting enzyme 2 attenuates diabetic nephropathy in mice. Kidney Int. 82, 292–303 (2012). 262. Judge, P., Haynes, R., Landray, M. J. & Baigent, C. Neprilysin inhibition in chronic kidney disease. Nephrol. Dial. Transplant 30, 738–743 (2015). 263. Skrtic, M. & Cherney, D. Z. I. Sodium-glucose cotransporter-2 inhibition and the potential for renal protection in diabetic nephropathy. Curr. Opin. Nephrol. Hypertens. 24, 96–103 (2015). 264. Stanton, R. C. Sodium glucose transport 2 (SGLT2) inhibition decreases glomerular hyperfiltration: is there a role for SGLT2 inhibitors in diabetic kidney disease? Circulation 129, 542–544 (2014). 265. Cherney, D. Z. et al. The effect of empagliflozin on arterial stiffness and heart rate variability in subjects with uncomplicated type 1 diabetes mellitus. Cardiovasc. Diabetol. 13, 28 (2014). 266. Cherney, D. Z. I. et al. Renal hemodynamic effect of sodium-glucose cotransporter 2 inhibition in patients with type 1 diabetes mellitus. Circulation 129, 587–597 (2014). 267. Perkins, B. A. et al. Sodium-glucose cotransporter 2 inhibition and glycemic control in type 1 diabetes: results of an 8-week open-label proof-of-concept trial. Diabetes Care 37, 1480–1483 (2014). 268. Ferrannini, E. & Solini, A. SGLT2 inhibition in diabetes mellitus: rationale and clinical prospects. Nat. Rev. Endocrinol. 8, 495–502 (2012). 269. Lytvyn, Y. et al. Glycosuria-mediated urinary uric acid excretion in patients with uncomplicated type 1 diabetes mellitus. Am. J. Physiol. Renal Physiol. 308, F77-83 (2015). 270. Musso, G., Gambino, R., Cassader, M. & Pagano, G. A novel approach to control hyperglycemia in type 2 diabetes: sodium glucose co-transport (SGLT) inhibitors: systematic review and meta-analysis of randomized trials. Ann. Med. 44, 375–393 (2012). 271. Tonneijck, L., Smits, M. M., van Raalte, D. H. & Muskiet, M. H. A. Incretin-based drugs and renoprotection-is hyperfiltration key? Kidney international 87, 660–661 (2015). 272. Groop, P.-H. et al. Linagliptin lowers albuminuria on top of recommended standard treatment in patients with type 2 diabetes and renal dysfunction. Diabetes Care 36, 3460–3468 (2013). 273. Lewis, E. J. et al. Pyridorin in type 2 diabetic nephropathy. J. Am. Soc. Nephrol. 23, 131–136 (2012). 274. Tuttle, K. R. et al. Baricitinib in diabetic kidney disease: results from a phase 2, multicenter, randomized, double- blind, placebo-controlled study. in American Diabetes Association Meeting 114, (2015). 275. Lytvyn, Y., Bjornstad, P., Pun, N. & Cherney, D. Z. I. New and old agents in the management of diabetic nephropathy. Curr. Opin. Nephrol. Hypertens. 25, 232–239 (2016). 276. Karter, A. J. et al. Ethnic disparities in diabetic complications in an insured population. Jama 287, 2519–2527 (2002).

247 277. Cañadas-Garre, M., Anderson, K., McGoldrick, J., Maxwell, A. P. & McKnight, A. J. Genomic approaches in the search for molecular biomarkers in chronic kidney disease. J. Transl. Med. 16, 292 (2018). 278. Smyth, L. J., Duffy, S., Maxwell, A. P. & McKnight, A. J. Genetic and epigenetic factors influencing chronic kidney disease. Am. J. Physiol. Renal Physiol. 307, F757-76 (2014). 279. McKnight, A. J., McKay, G. J. & Maxwell, A. P. Genetic and epigenetic risk factors for diabetic kidney disease. Adv. Chronic Kidney Dis. 21, 287–296 (2014). 280. Tang, Z. H., Zeng, F. & Zhang, X. Z. Human genetics of diabetic nephropathy. Ren Fail. 37, 363–71. doi: 10.3109/0886022X.2014.1000801. Epub 2 (2015). 281. Slatkin, M. Linkage disequilibrium--understanding the evolutionary past and mapping the medical future. Nat. Rev. Genet. 9, 477–85 (2008). 282. Tziastoudi, M., Stefanidis, I., Stravodimos, K. & Zintzaras, E. Identification of Chromosomal Regions Linked to Diabetic Nephropathy: A Meta-Analysis of Genome-Wide Linkage Scans. Genet. Test. Mol. Biomarkers 23, 105–117 (2019). 283. McKnight, A. J., O’Donoghue, D. & Peter Maxwell, A. Annotated chromosome maps for renal disease. Hum. Mutat. 30, 314–320 (2009). 284. Froguel, P. et al. Close linkage of glucokinase locus on chromosome 7p to early-onset non-insulin-dependent diabetes mellitus. Nature 356, 162–164 (1992). 285. Yamagata, K. et al. Mutations in the hepatocyte nuclear factor-1α gene in maturity-onset diabetes of the young (MODY3). Nature 384, 455–458 (1996). 286. Grimsley, C. M. et al. Dock180 and ELMO1 Proteins Cooperate to Promote Evolutionarily Conserved Rac- dependent Cell Migration. J. Biol. Chem. 279, 6087–6097 (2003). 287. Shimazaki, A. et al. Genetic Variations in the Gene Encoding ELMO1 Are Associated With Susceptibility to Diabetic Nephropathy. Diabetes 54, 1171–1178 (2005). 288. Pezzolesi, M. G. et al. Confirmation of Genetic Associations at ELMO1 in the GoKinD Collection Supports Its Role as a Susceptibility Gene in Diabetic Nephropathy. Diabetes 58, 2698–2702 (2009). 289. McKnight, A. J., Currie, D., Patterson, C. C., Maxwell, A. P. & Fogarty, D. G. Targeted genome-wide investigation identifies novel SNPs associated with diabetic nephropathy. Hugo J. 3, 77–82 (2009). 290. McKnight, A. J., Duffy, S. & Maxwell, A. P. Genetics of Diabetic Nephropathy: a Long Road of Discovery. Curr. Diab. Rep. 15, 41 (2015). 291. Fagerholm, E. et al. SNP in the genome-wide association study hotspot on chromosome 9p21 confers susceptibility to diabetic nephropathy in type 1 diabetes. Diabetologia 55, 2386–2393 (2012). 292. Saxena, R. et al. Genome-Wide Association Analysis Identifies Loci for Type 2 Diabetes and Triglyceride Levels. Science (80-. ). 316, 1331–1336 (2007). 293. Ahlqvist, E., van Zuydam, N. R., Groop, L. C. & McCarthy, M. I. The genetics of diabetic complications. Nat Rev Nephrol 11, 277–287 (2015). 294. Salem, R. M. et al. Genome-Wide Association Study of Diabetic Kidney Disease Highlights Biology Involved in Glomerular Basement Membrane Collagen. J. Am. Soc. Nephrol. (2019). 295. Cooper, G. The Cell: A molecular approach. (Sinauer Associates, 2000). 296. Nunnari, J. & Suomalainen, A. Mitochondria: In sickness and in health. Cell 148, 1145–1159 (2012). 297. Gray, M. W., Burger, G. & Lang, B. F. Mitochondrial evolution. Science (80-. ). 283, 1476 LP – 1481 (1999). 298. Stewart, J. B. et al. Strong Purifying Selection in Transmission of Mammalian Mitochondrial DNA. PLoS Biol. 6, e10 (2008). 299. Ma, K. Mitochondrion. Wikimedia (2019). Available at: https://commons.wikimedia.org/wiki/File:Mitochondrion_mini.svg. (Accessed: 30th September 2019) 300. Anderson, S. et al. Sequence and organization of the human mitochondrial genome. Nature 290, 457–465 (1981). 301. Schatz, G. The Protein Import System of Mitochondria. J. Biol. Chem. 31763–31767 (1996). 302. Chomyn, A. et al. Six unidentified reading frames of human mitochondrial DNA encode components of the respiratory-chain NADH dehydrogenase. Nature 314, 592–597 (1985). 303. Nicholls, T. J. & Minczuk, M. In D-loop: 40 years of mitochondrial 7S DNA. Exp. Gerontol. 56, 175–181 (2014).

248 304. Clayton, D. A. Replication of animal mitochondrial DNA. Cell 28, 693–705 (1982). 305. Brown, T. A., Cecconi, C., Tkachuk, A. N., Bustamante, C. & Clayton, D. A. Replication of mitochondrial DNA occurs by strand displacement with alternative light-strand origins, not via a strand-coupled mechanism. Genes Dev. 19, 2466–2476 (2005). 306. Clayton, D. A. & Larsson, N.-G. Mitochondrial DNA replication and human disease. in DNA Replication and Human Disease (ed. DePamphilis, M. L.) 47, 547–560 (Cold Spring Harbor Laboratory Press, 2006). 307. Robberson, D. L. & Clayton, D. A. Pulse-labeled components in the replication of mitochondrial deoxyribonucleic acid. J. Biol. Chem. 248, 4512–4514 (1973). 308. Brown, T. A. & Clayton, D. A. Release of replication termination controls mitochondrial DNA copy number after depletion with 2′,3′-dideoxycytidine. Nucleic Acids Res. 30, 2004–2010 (2002). 309. Bogenhagen, D. & Clayton, D. A. Mechanism of mitochondrial DNA replication in mouse L-cells: Kinetics of synthesis and turnover of the initiation sequence. J. Mol. Biol. 119, 49–68 (1978). 310. Zhang, H. & Pommier, Y. Mitochondrial topoisomerase I sites in the regulatory D-Loop region of mitochondrial DNA. Biochemistry 47, 11196–11203 (2008). 311. Holt, I. J. et al. Mammalian mitochondrial nucleoids: Organizing an independently minded genome. Mitochondrion 7, 311–321 (2007). 312. He, J. et al. The AAA+ protein ATAD3 has displacement loop binding properties and is involved in mitochondrial nucleoid organization. J. Cell Biol. 176, 141 LP – 146 (2007). 313. Hubstenberger, A., Merle, N., Charton, R., Brandolin, G. & Rousseau, D. Topological analysis of ATAD3A insertion in purified human mitochondria. J. Bioenerg. Biomembr. 42, 143–150 (2010). 314. Holt, I. J., Lorimer, H. E. & Jacobs, H. T. Coupled Leading- and Lagging-Strand Synthesis of Mammalian Mitochondrial DNA. Cell 100, 515–524 (2000). 315. Fuste, J. M. et al. Mitochondrial RNA polymerase is needed for activation of the origin of light-strand DNA replication. Mol. Cell 37, 67–78 (2010). 316. Yang, M. Y. et al. Biased Incorporation of Ribonucleotides on the Mitochondrial L-Strand Accounts for Apparent Strand-Asymmetric DNA Replication. Cell 111, 495–505 (2002). 317. Yasukawa, T. et al. Replication of vertebrate mitochondrial DNA entails transient ribonucleotide incorporation throughout the lagging strand. EMBO J. 25, 5358–5371 (2005). 318. Ciesielski, G. L., Oliveira, M. T. & Kaguni, L. S. Animal Mitochondrial DNA Replication. Enzym. 39, 255–292 (2016). 319. Bandy, B. & Davison, A. J. Mitochondrial mutations may increase oxidative stress: implications for carcinogenesis and aging? Free Radic. Biol. Med. 8, 523–539 (1990). 320. Stewart, J. B. & Chinnery, P. F. The dynamics of mitochondrial DNA heteroplasmy: implications for human health and disease. Nature reviews. Genetics 16, 530–542 (2015). 321. Hoekstra, R. F. Evolutionary origin and consequences of uniparental mitochondrial inheritance. Hum. Reprod. 15 Suppl 2, 102–111 (2000). 322. Stewart, J. B. & Larsson, N.-G. Keeping mtDNA in Shape between Generations. PLOS Genet. 10, e1004670 (2014). 323. Bergstrom, C. T. & Pritchard, J. Germline bottlenecks and the evolutionary maintenance of mitochondrial genomes. Genetics 149, 2135–2146 (1998). 324. Orrenius, S., Gogvadze, V. & Zhivotovsky, B. Mitochondrial oxidative stress: Implications for cell death. Annu. Rev. Pharmacol. Toxicol. 47, 143–183 (2007). 325. Niyazov, D. M., Kahler, S. G. & Frye, R. E. Primary Mitochondrial Disease and Secondary Mitochondrial Dysfunction: Importance of Distinction for Diagnosis and Treatment. Mol. Syndromol. 7, 122–137 (2016). 326. Di Re, M. et al. The accessory subunit of mitochondrial DNA polymerase γ determines the DNA content of mitochondrial nucleoids in human cultured cells. Nucleic Acids Res. 37, 5701–5713 (2009). 327. Tyynismaa, H. et al. Twinkle helicase is essential for mtDNA maintenance and regulates mtDNA copy number. Hum. Mol. Genet. 13, 3219–3227 (2004). 328. Yang, C., Curth, U., Urbanke, C. & Kang, C. Crystal structure of human mitochondrial single-stranded DNA binding protein at 2.4 A resolution. Nat Struct Mol Biol 4, 153–157 (1997). 329. Cerritelli, S. M. et al. Failure to produce mitochondrial DNA results in embryonic lethality in RNaseH1 null mice.

249 Mol. Cell 11, 807–815 (2003). 330. Simsek, D. et al. Crucial roles for DNA ligase III in mitochondria but not in XRCC1-dependent repair. Nature 471, 245–248 (2011). 331. Chan, S. S. L. & Copeland, W. C. DNA polymerase gamma and mitochondrial disease: Understanding the consequence of POLG mutations. Biochim. Biophys. Acta - Bioenerg. 1787, 312–319 (2009). 332. Milenkovic, D. et al. TWINKLE is an essential mitochondrial helicase required for synthesis of nascent D-loop strands and complete mtDNA replication. Hum. Mol. Genet. 22, 1983–1993 (2013). 333. Sarzi, E. et al. Twinkle helicase (PEO1) gene mutation causes mitochondrial DNA depletion. Ann. Neurol. 62, 579– 587 (2007). 334. Mercer, T. R. et al. The Human Mitochondrial Transcriptome. Cell 146, 645–658 (2011). 335. Suzuki, T. Biosynthesis and function of tRNA wobble modifications. in Fine-Tuning of RNA Functions by Modification and Editing (ed. Grosjean, H.) 23–69 (Springer Berlin Heidelberg, 2005). 336. Rogalski, M., Karcher, D. & Bock, R. Superwobbling facilitates translation with reduced tRNA sets. Nat. Struct. Mol. Biol. 15, 192–198 (2008). 337. Sengupta, S., Yang, X. & Higgs, P. G. The Mechanisms of Codon Reassignments in Mitochondrial Genetic Codes. J. Mol. Evol. 64, 662–688 (2007). 338. Lind, C., Sund, J. & Åqvist, J. Codon-reading specificities of mitochondrial release factors and translation termination at non-standard stop codons. Nat. Commun. 4, 2940 (2013). 339. Shi, Y. et al. Mitochondrial transcription termination factor 1 directs polar replication fork pausing. Nucleic Acids Res. 44, 5732–5742 (2016). 340. Ryan, K. R. & Jensen, R. E. Protein translocation across mitochondrial membranes: What a long, strange trip it is. Cell 83, 517–519 (1995). 341. Bourgeron, T. et al. Mutation of a nuclear succinate dehydrogenase gene results in mitochondrial respiratory chain deficiency. Nat. Genet. 11, 144–149 (1995). 342. Alston, C. L. et al. Recessive germline SDHA and SDHB mutations causing leukodystrophy and isolated mitochondrial complex II deficiency. J. Med. Genet. 49, 569–577 (2012). 343. Jackson, C. B. et al. Mutations in SDHD lead to autosomal recessive encephalomyopathy and isolated mitochondrial complex II deficiency. J. Med. Genet. 51, 170–175 (2014). 344. Gupta, S. et al. RECQL4 and p53 potentiate the activity of polymerase gamma and maintain the integrity of the human mitochondrial genome. Carcinogenesis 35, 34–45 (2014). 345. Wang, J.-T., Xu, X., Alontaga, A. Y., Chen, Y. & Liu, Y. Impaired p32 regulation caused by the lymphoma-prone RECQ4 mutation drives mitochondrial dysfunction. Cell Rep. 7, 848–858 (2014). 346. De, S. et al. RECQL4 is essential for the transport of p53 to mitochondria in normal human cells in the absence of exogenous stress. J. Cell Sci. 125, 2509–2522 (2012). 347. Ding, L. & Liu, Y. Borrowing nuclear DNA helicases to protect mitochondrial DNA. Int. J. Mol. Sci. 16, 10870–10887 (2015). 348. Peng, G. et al. Human Nuclease/helicase DNA2 Alleviates Replication Stress by Promoting DNA End Resection. Cancer Res. 72, 2802–2813 (2012). 349. Duxin, J. P. et al. Human DNA2 Is a Nuclear and Mitochondrial DNA Maintenance Protein. Mol. Cell. Biol. 29, 4274– 4282 (2009). 350. Zheng, L. et al. Human DNA2 is a mitochondrial nuclease/helicase for efficient processing of DNA replication and repair intermediates. Mol. Cell 32, 325–336 (2008). 351. Ronchi, D. et al. Mutations in DNA2 Link Progressive Myopathy to Mitochondrial DNA Instability. Am. J. Hum. Genet. 92, 293–300 (2013). 352. George, T. et al. Human PIF1 helicase unwinds synthetic DNA structures resembling stalled DNA replication forks. Nucleic Acids Res. 37, 6491–6502 (2009). 353. Futami, K., Shimamoto, A. & Furuichi, Y. Mitochondrial and nuclear localization of human PIF1 helicase. Biol. Pharm. Bull. 30, 1685–1692 (2007). 354. Gagou, M. E. et al. Human PIF1 helicase supports DNA replication and cell growth under oncogenic-stress.

250 Oncotarget 5, 11381–11398 (2014). 355. Cheng, X. & Ivessa, A. S. Association of the yeast DNA helicase Pif1p with mitochondrial membranes and mitochondrial DNA. Eur. J. Cell Biol. 89, 742–747 (2010). 356. Cheng, X., Dunaway, S. & Ivessa, A. S. The role of Pif1p, a DNA helicase in Saccharomyces cerevisiae, in maintaining mitochondrial DNA. Mitochondrion 7, 211–222 (2007). 357. Foury, F. & Kolodynski, J. pif mutation blocks recombination between mitochondrial rho+ and rho- genomes having tandemly arrayed repeat units in Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. U. S. A. 80, 5345–5349 (1983). 358. Bharti, S. K. et al. DNA sequences proximal to human mitochondrial DNA deletion breakpoints prevalent in human disease form G-quadruplexes, a class of DNA structures inefficiently unwound by the mitochondrial replicative Twinkle helicase. J. Biol. Chem. 289, 29975–29993 (2014). 359. Wang, D. D.-H., Shu, Z., Lieser, S. A., Chen, P.-L. & Lee, W.-H. Human mitochondrial SUV3 and polynucleotide phosphorylase form a 330-kDa heteropentamer to cooperatively degrade double-stranded RNA with a 3’-to-5’ directionality. J. Biol. Chem. 284, 20812–20821 (2009). 360. Wang, Z. & Choi, M. E. Autophagy in Kidney Health and Disease. Antioxidants & Redox Signaling 20, 519–537 (2014). 361. Szczesny, R. J. et al. Human mitochondrial RNA turnover caught in flagranti: involvement of hSuv3p helicase in RNA surveillance. Nucleic Acids Res. 38, 279–298 (2010). 362. Veno, S. T. et al. The human Suv3 helicase interacts with replication protein A and flap endonuclease 1 in the nucleus. Biochem. J. 440, 293–300 (2011). 363. Wang, Z. et al. Specific metabolic rates of major organs and tissues across adulthood: Evaluation by mechanistic model of resting energy expenditure. Am. J. Clin. Nutr. 92, 1369–1377 (2010). 364. O’Connor, P. M. Renal oxygen delivery: Matching delivery to metabolic demand. Clin. Exp. Pharmacol. Physiol. 33, 961–967 (2006). 365. Soltoff, S. P. ATP and the regulation of renal cell function. Annu. Rev. Physiol. 48, 9–31 (1986). 366. Bhargava, P. & Schnellmann, R. G. Mitochondrial energetics in the kidney. Nat. Rev. Nephrol. 13, 629 (2017). 367. Lightowlers, R. N., Taylor, R. W. & Turnbull, D. M. Mutations causing mitochondrial disease: What is new and what challenges remain? Science 349, 1494–1499 (2015). 368. Emma, F., Montini, G., Parikh, S. M. & Salviati, L. Mitochondrial dysfunction in inherited renal disease and acute kidney injury. Nat. Rev. Nephrol. 12, 267–280 (2016). 369. Yanagihara, C., Oyama, A., Tanaka, M., Nakaji, K. & Nishimura, Y. An autopsy case of mitochondrial encephalomyopathy with lactic acidosis and stroke-like episodes syndrome with chronic renal failure. Intern. Med. 40, 662–665 (2001). 370. Iwasaki, N. et al. Prevalence of A-to-G mutation at nucleotide 3243 of the mitochondrial tRNA(Leu(UUR)) gene in Japanese patients with diabetes mellitus and end stage renal disease. J. Hum. Genet. 46, 330–334 (2001). 371. Manouvrier, S. et al. Point mutation of the mitochondrial tRNA(Leu) gene (A 3243 G) in maternally inherited hypertrophic cardiomyopathy, diabetes mellitus, renal failure, and sensorineural deafness. J. Med. Genet. 32, 654– 656 (1995). 372. Desbats, M. A., Lunardi, G., Doimo, M., Trevisson, E. & Salviati, L. Genetic bases and clinical manifestations of coenzyme Q10 (CoQ 10) deficiency. J. Inherit. Metab. Dis. 38, 145–156 (2015). 373. Saiki, R. et al. Coenzyme Q10 supplementation rescues renal disease in Pdss2kd/kd mice with mutations in prenyl diphosphate synthase subunit 2. Am. J. Physiol. Renal Physiol. 295, F1535-44 (2008). 374. Quinzii, C. M. et al. Tissue-specific oxidative stress and loss of mitochondria in CoQ-deficient PDSS2 mutant mice. FASEB J. 27, 612–621 (2013). 375. Peng, M. et al. Primary coenzyme Q deficiency in Pdss2 mutant mice causes isolated renal disease. PLoS Genet. 4, e1000061 (2008). 376. Wilson, F. H. et al. A cluster of metabolic defects caused by mutation in a mitochondrial tRNA. Science 306, 1190– 1194 (2004). 377. Klootwijk, E. D. et al. Mistargeting of peroxisomal EHHADH and inherited renal Fanconi’s syndrome. N. Engl. J. Med. 370, 129–138 (2014).

251 378. Fernandez-Marcos, P. J. & Auwerx, J. Regulation of PGC-1α, a nodal regulator of mitochondrial biogenesis. Am. J. Clin. Nutr. 93, 884S-890S (2011). 379. Zaza, G. et al. Downregulation of nuclear-encoded genes of oxidative metabolism in dialyzed chronic kidney disease patients. PLoS One 8, e77847 (2013). 380. Granata, S. et al. Mitochondrial dysregulation and oxidative stress in patients with chronic kidney disease. BMC Genomics 10, 388 (2009). 381. Choi, B., Kang, K.-S. & Kwak, M.-K. Effect of redox modulating NRF2 activators on chronic kidney disease. Molecules 19, 12727–12759 (2014). 382. Coughlan, M. T. et al. Deficiency in apoptosis-inducing factor recapitulates chronic kidney disease via aberrant mitochondrial homeostasis. Diabetes 65, 1085 LP – 1098 (2016). 383. Sedeek, M., Nasrallah, R., Touyz, R. M. & Hebert, R. L. NADPH oxidases, reactive oxygen species, and the kidney: Friend and foe. J. Am. Soc. Nephrol. 24, 1512–1518 (2013). 384. Nlandu Khodo, S. et al. NADPH-oxidase 4 protects against kidney fibrosis during chronic renal injury. J. Am. Soc. Nephrol. 23, 1967–1976 (2012). 385. Chen, W.-T. et al. Impaired leukocytes autophagy in chronic kidney disease patients. Cardiorenal Med. 3, 254–264 (2013). 386. Chen, P.-L. et al. Mitochondrial genome instability resulting from SUV3 haploinsufficiency leads to tumorigenesis and shortened lifespan. Oncogene 32, 1193–1201 (2013). 387. Lee, J., Giordano, S. & Zhang, J. Autophagy, mitochondria and oxidative stress: Cross-talk and redox signalling. Biochem. J. 441, 523–540 (2012). 388. Kimura, T. et al. Autophagy protects the proximal tubule from degeneration and acute ischemic injury. J. Am. Soc. Nephrol. 22, 902–913 (2011). 389. Edlich, F. et al. Bcl-x(L) retrotranslocates Bax from the mitochondria into the cytosol. Cell 145, 104–116 (2011). 390. Chen, H.-C. et al. An interconnected hierarchical model of cell death regulation by the BCL-2 family. Nat. Cell Biol. 17, 1270–1281 (2015). 391. Burlaka, I. et al. Prevention of apoptosis averts glomerular tubular disconnection and podocyte loss in proteinuric kidney disease. Kidney Int. 90, 135–148 (2017). 392. Massague, J. TGFβ signalling in context. Nat. Rev. Mol. Cell Biol. 13, 616–630 (2012). 393. Gentle, M. E. et al. Epithelial cell TGFβ signaling induces acute tubular injury and interstitial inflammation. J. Am. Soc. Nephrol. 24, 787–799 (2013). 394. Fogo, A. B. PPARγ and chronic kidney disease. Pediatr. Nephrol. 26, 347–351 (2011). 395. Granata, S. et al. NLRP3 inflammasome activation in dialyzed chronic kidney disease patients. PLoS One 10, (2015). 396. Ding, W. et al. Mitochondrial reactive oxygen species-mediated NLRP3 inflammasome activation contributes to aldosterone-induced renal tubular cells injury. Oncotarget 7, 17479–17491 (2016). 397. Gong, W. et al. NLRP3 deletion protects against renal fibrosis and attenuates mitochondrial abnormality in mouse with 5/6 nephrectomy. Am. J. Physiol. Renal Physiol. 310, F1081-8 (2016). 398. Guo, H., Bi, X., Zhou, P., Zhu, S. & Ding, W. NLRP3 deficiency attenuates renal fibrosis and ameliorates mitochondrial dysfunction in a mouse unilateral ureteral obstruction model of chronic kidney disease. Mediators Inflamm. 2017, (2017). 399. Yamamoto-Nonaka, K. et al. Cathepsin D in podocytes is important in the pathogenesis of proteinuria and CKD. J. Am. Soc. Nephrol. 1–16 (2016). 400. El-Meanawy, A. et al. Identification of nephropathy candidate genes by comparing sclerosis-prone and sclerosis- resistant mouse strain kidney transcriptomes. BMC Nephrol. 13, 61 (2012). 401. Yang, C. et al. Negative regulation of the E3 ubiquitin ligase itch via Fyn-mediated tyrosine phosphorylation. Mol. Cell 21, 135–141 (2006). 402. Azakir, B. A. & Angers, A. Reciprocal regulation of the ubiquitin ligase Itch and the epidermal growth factor receptor signaling. Cell. Signal. 21, 1326–1336 (2009). 403. Magnifico, A. et al. WW domain HECT E3s target Cbl RING finger E3s for proteasomal degradation. J. Biol. Chem. 278, 43169–43177 (2003).

252 404. Bai, Y., Yang, C., Hu, K., Elly, C. & Liu, Y.-C. Itch E3 ligase-mediated regulation of TGFβ signaling by modulating SMAD2 phosphorylation. Mol. Cell 15, 825–831 (2004). 405. Gluzman-Poltorak, Z., Cohen, T., Herzog, Y. & Neufeld, G. Neuropilin-2 is a receptor for the vascular endothelial growth factor (VEGF) forms VEGF-145 and VEGF-165 [corrected]. J. Biol. Chem. 275, 18040–18045 (2000). 406. Qi, H. et al. Glomerular endothelial mitochondrial dysfunction is essential and characteristic of diabetic kidney disease susceptibility. Diabetes 66, 763–778 (2017). 407. Che, R., Yuan, Y., Huang, S. & Zhang, A. Mitochondrial dysfunction in the pathophysiology of renal diseases. Am J Physiol Ren. Physiol 306, (2014). 408. Turrens, J. F. Mitochondrial formation of reactive oxygen species. J. Physiol. 552, 335–344 (2003). 409. Murphy, M. P. How mitochondria produce reactive oxygen species. Biochemical Journal 417, 1–13 (2009). 410. Hallan, S. & Sharma, K. The Role of Mitochondria in Diabetic Kidney Disease. Curr. Diab. Rep. 16, 61 (2016). 411. Myung, S.-K. et al. Efficacy of vitamin and antioxidant supplements in prevention of cardiovascular disease: systematic review and meta-analysis of randomised controlled trials. BMJ 346, (2013). 412. Sharma, K. Mitochondrial hormesis and diabetic complications. Diabetes 64, 663–672 (2015). 413. Mackenzie, R. M. et al. Mitochondrial reactive oxygen species enhance AMP-activated protein kinase activation in the endothelium of patients with coronary artery disease and diabetes. Clinical Science (London, England : 1979) 124, 403–411 (2013). 414. Ristow, M. et al. Antioxidants prevent health-promoting effects of physical exercise in humans. Proceedings of the National Academy of Sciences of the United States of America 106, 8665–8670 (2009). 415. Czajka, A. et al. Altered Mitochondrial Function, Mitochondrial DNA and Reduced Metabolic Flexibility in Patients With Diabetic Nephropathy. EBioMedicine 2, 499–512 (2015). 416. Lee, J. E. et al. Higher mitochondrial DNA copy number is associated with lower prevalence of microalbuminuria. Exp Mol Med 41, 253–258 (2009). 417. Malik, A. N. & Czajka, A. Is mitochondrial DNA content a potential biomarker of mitochondrial dysfunction? Mitochondrion 13, 481–492 (2013). 418. Wallace, D. C. & Chalkia, D. Mitochondrial DNA Genetics and the Heteroplasmy Conundrum in Evolution and Disease. Cold Spring Harbor Perspectives in Biology 5, (2013). 419. Mohammedi, K. et al. Manganese Superoxide Dismutase (SOD2) Polymorphisms, Plasma Advanced Oxidation Protein Products (AOPP) Concentration and Risk of Kidney Complications in Subjects with Type 1 Diabetes. PLoS One 9, 1–11 (2014). 420. Lemasters, J. J. Selective mitochondrial autophagy, or mitophagy, as a targeted defense against oxidative stress, mitochondrial dysfunction, and aging. Rejuvenation Res. 8, 3–5 (2005). 421. Huber, T. B. et al. Emerging role of autophagy in kidney function, diseases and aging. Autophagy 8, 1009–1031 (2012). 422. Kim, S. Il et al. Autophagy Promotes Intracellular Degradation of Type I Collagen Induced by Transforming Growth Factor (TGF)-β1. The Journal of Biological Chemistry 287, 11677–11688 (2012). 423. Wang, W. et al. Mitochondrial Fission Triggered by Hyperglycemia Is Mediated by {ROCK1} Activation in Podocytes and Endothelial Cells. Cell Metab. 15, 186–200 (2012). 424. Heijmans, B. T. et al. Persistent epigenetic differences associated with prenatal exposure to famine in humans. Proc. Natl. Acad. Sci. U. S. A. 105, 17046–17049 (2008). 425. Lumey, L. H. et al. Cohort profile: The dutch hunger winter families study. Int. J. Epidemiol. 36, 1196–1204 (2007). 426. Moore, L. D., Le, T. & Fan, G. DNA methylation and its basic function. Neuropsychopharmacology 38, 23–38 (2013). 427. Eden, A., Gaudet, F., Waghmare, A. & Jaenisch, R. Chromosomal instability and tumors promoted by DNA hypomethylation. Science (80-. ). 300, 455 (2003). 428. Esteller, M. CpG island hypermethylation and tumor suppressor genes: A booming present, a brighter future. Oncogene 21, 5427 (2002). 429. Bannister, A. J. & Kouzarides, T. Regulation of chromatin by histone modifications. Cell Res. 21, 381–395 (2011). 430. Geiman, T. M. & Robertson, K. D. Chromatin remodeling, histone modifications, and DNA methylation—how does it all fit together? J. Cell. Biochem. 87, 117–125 (2002).

253 431. Ho, L. & Crabtree, G. R. Chromatin remodelling during development. Nature 463, 474–484 (2010). 432. Frías-Lasserre, D. & Villagra, C. A. The Importance of ncRNAs as Epigenetic Mechanisms in Phenotypic Variation and Organic Evolution. Front. Microbiol. 8, 2483 (2017). 433. Park, K.-Y. & Pfeifer, K. Epigenetic interplay. Nat. Genet. 34, 126 (2003). 434. van der Harst, P., de Windt, L. J. & Chambers, J. C. Translational Perspective on Epigenetics in Cardiovascular Disease. J. Am. Coll. Cardiol. 70, 590 LP – 606 (2017). 435. Deaton, A. M. & Bird, A. CpG islands and the regulation of transcription. Genes Dev. 25, 1010–1022 (2011). 436. Barres, R. et al. Non-CpG methylation of the PGC-1alpha promoter through DNMT3B controls mitochondrial density. Cell Metab. 10, 189–198 (2009). 437. Jones, P. A. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat. Rev. Genet. 13, 484– 492 (2012). 438. Bell, C. G. et al. Genome-wide DNA methylation analysis for diabetic nephropathy in type 1 diabetes mellitus. BMC Medical Genomics 3, 33 (2010). 439. Smyth, L. J., McKay, G. J., Maxwell, A. P. & McKnight, A. J. DNA hypermethylation and DNA hypomethylation is present at different loci in chronic kidney disease. Epigenetics 9, 366–376 (2014). 440. Ko, Y.-A. et al. Cytosine methylation changes in enhancer regions of core pro-fibrotic genes characterize kidney fibrosis development. Genome Biol. 14, R108 (2013). 441. Rakyan, V. K. et al. An integrated resource for genome-wide identification and analysis of human tissue-specific differentially methylated regions (tDMRs). Genome Res. 18, 1518–1529 (2008). 442. Liu, R., Lee, K. & He, J. C. Genetics and Epigenetics of Diabetic Nephropathy. Kidney Dis. 1, 42–51 (2015). 443. Hausser, J., Syed, A. P., Bilen, B. & Zavolan, M. Analysis of CDS-located miRNA target sites suggests that they can effectively inhibit translation. Genome Res. 23, 604–615 (2013). 444. Kozomara, A. & Griffiths-Jones, S. miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res. 42, D68-73 (2014). 445. Wu, H. et al. The role of microRNAs in diabetic nephropathy. J. Diabetes Res. 2014, 920134 (2014). 446. Wang, B. et al. Suppression of microRNA-29 Expression by TGF-β1 Promotes Collagen Expression and Renal Fibrosis. Journal of the American Society of Nephrology : JASN 23, 252–265 (2012). 447. Reddy, M. A., Park, J. T. & Natarajan, R. Epigenetic Modifications in the Pathogenesis of Diabetic Nephropathy. Seminars in nephrology 33, 341–353 (2013). 448. Kato, M. et al. MicroRNA-192 in diabetic kidney glomeruli and its function in TGF-beta-induced collagen expression via inhibition of E-box repressors. Proc. Natl. Acad. Sci. U. S. A. 104, 3432–3437 (2007). 449. Wang, B. et al. E-cadherin expression is regulated by miR-192/215 by a mechanism that is independent of the profibrotic effects of transforming growth factor-beta. Diabetes 59, 1794–1802 (2010). 450. Krupa, A. et al. Loss of MicroRNA-192 promotes fibrogenesis in diabetic nephropathy. J. Am. Soc. Nephrol. 21, 438–447 (2010). 451. Babu, D. A., Deering, T. G. & Mirmira, R. G. A Feat of Metabolic Proportions: Pdx1 Orchestrates Islet Development and Function in the Maintenance of Glucose Homeostasis. Molecular genetics and metabolism 92, 43–55 (2007). 452. Deering, T. G., Ogihara, T., Trace, A. P., Maier, B. & Mirmira, R. G. Methyltransferase Set7/9 Maintains Transcription and Euchromatin Structure at Islet-Enriched Genes. Diabetes 58, 185–193 (2009). 453. Musri, M. M., Gomis, R. & Párrizas, M. A chromatin perspective of adipogenesis. Organogenesis 6, 15–23 (2010). 454. Lefterova, M. I. et al. Cell-Specific Determinants of Peroxisome Proliferator-Activated Receptor γ Function in Adipocytes and Macrophages . Molecular and Cellular Biology 30, 2078–2089 (2010). 455. Farmer, S. R. Transcriptional control of adipocyte formation. Cell Metab. 4, 263–273 (2006). 456. Ross, S. et al. Smads orchestrate specific histone modifications and chromatin remodeling to activate transcription. The EMBO Journal 25, 4490–4502 (2006). 457. Kanwar, Y. S., Sun, L., Xie, P., Liu, F. & Chen, S. A Glimpse of Various Pathogenetic Mechanisms of Diabetic Nephropathy. Annual review of pathology 6, 395–423 (2011). 458. Sanchez, A. P. & Sharma, K. Transcription factors in the pathogenesis of diabetic nephropathy. Expert Rev. Mol.

254 Med. 11, e13 (2009). 459. Iacobazzi, V., Castegna, A., Infantino, V. & Andria, G. Mitochondrial DNA methylation as a next-generation biomarker and diagnostic tool. Mol. Genet. Metab. 110, 25–34 (2013). 460. Dawid, I. B. 5-Methylcytidylic acid: Absence from mitochondrial DNA of frogs and HeLa cells. Science (80-. ). 184, 80–81 (1974). 461. Cummings, D. J., Tait, A. & Goddard, J. M. Methylated bases in DNA from Paramecium aurelia. Biochim. Biophys. Acta - Nucleic Acids Protein Synth. 374, 1–11 (1974). 462. Nass, M. M. K. Differential methylation of mitochondrial and nuclear DNA in cultured mouse, hamster and virus- transformed hamster cells In vivo and in vitro methylation. J. Mol. Biol. 80, 155–175 (1973). 463. Pollack, Y., Kasir, J., Shemer, R., Metzger, S. & Szyf, M. Methylation pattern of mouse mitochondrial DNA. Nucleic Acids Res. 12, 4811–4824 (1984). 464. Shmookler Reis, R. J. & Goldstein, S. Mitochondrial DNA in mortal and immortal human cells. Genome number, integrity, and methylation. J. Biol. Chem. 258, 9078–9085 (1983). 465. Groot, G. S. P. & Kroon, A. M. Mitochondrial DNA from various organisms does not contain internally methylated cytosine in -CCGG- sequences. Biochim. Biophys. Acta - Nucleic Acids Protein Synth. 564, 355–357 (1979). 466. Infantino, V. et al. Impairment of methyl cycle affects mitochondrial methyl availability and glutathione level in Down’s syndrome. Mol. Genet. Metab. 102, 378–382 (2011). 467. Shock, L. S., Thakkar, P. V, Peterson, E. J., Moran, R. G. & Taylor, S. M. DNA methyltransferase 1, cytosine methylation, and cytosine hydroxymethylation in mammalian mitochondria. Proc. Natl. Acad. Sci. 108, 3630 LP – 3635 (2011). 468. Chestnut, B. A. et al. Epigenetic regulation of motor neuron cell death through DNA methylation. J. Neurosci. 31, 16619 LP – 16636 (2011). 469. Bellizzi, D. et al. The control region of mitochondrial DNA shows an unusual CpG and non-CpG methylation pattern. DNA Res. 20, 537–547 (2013). 470. Sun, C., Reimers, L. L. & Burk, R. D. Methylation of HPV16 genome CpG sites is associated with cervix precancer and cancer. Gynecol. Oncol. 121, 59–63 (2011). 471. Chen, X. J. & Butow, R. A. The organization and inheritance of the mitochondrial genome. Nat. Rev. Genet. 6, 815– 825 (2005). 472. Kukat, C. et al. Super-resolution microscopy reveals that mammalian mitochondrial nucleoids have a uniform size and frequently contain a single copy of mtDNA. Proc. Natl. Acad. Sci. U. S. A. 108, 13534–13539 (2011). 473. Kukat, C. & Larsson, N.-G. mtDNA makes a U-turn for the mitochondrial nucleoid. Trends Cell Biol. 23, 457–463 (2013). 474. Choi, Y.-S. et al. Shot-gun proteomic analysis of mitochondrial D-loop DNA binding proteins: Identification of mitochondrial histones. Mol. Biosyst. 7, 1523–1536 (2011). 475. Bogenhagen, D. F., Rousseau, D. & Burke, S. The layered structure of human mitochondrial DNA nucleoids. J. Biol. Chem. 283, 3665–3675 (2008). 476. Wang, Y. & Bogenhagen, D. F. Human mitochondrial DNA nucleoids are linked to protein folding machinery and metabolic enzymes at the mitochondrial inner membrane. J. Biol. Chem. 281, 25791–25802 (2006). 477. Kasashima, K., Sumitani, M., Satoh, M. & Endo, H. Human prohibitin 1 maintains the organization and stability of the mitochondrial nucleoids. Exp. Cell Res. 314, 988–996 (2008). 478. Schon, E. A. & Gilkerson, R. W. Functional complementation of mitochondrial DNAs: Mobilizing mitochondrial genetics against dysfunction. Biochim. Biophys. Acta 1800, 245–249 (2010). 479. Gilkerson, R. et al. The mitochondrial nucleoid: Integrating mitochondrial DNA into cellular homeostasis. Cold Spring Harb. Perspect. Biol. 5, a011080 (2013). 480. Kren, B. T. et al. MicroRNAs identified in highly purified liver-derived mitochondria may play a role in apoptosis. RNA Biol. 6, 65–72 (2009). 481. Bian, Z. et al. Identification of mouse liver mitochondria-associated miRNAs and their potential biological functions. Cell research 20, 1076–1078 (2010). 482. Barrey, E. et al. Pre-microRNA and mature microRNA in human mitochondria. PLoS One 6, e20220 (2011).

255 483. Carrer, M. et al. Control of mitochondrial metabolism and systemic energy homeostasis by microRNAs 378 and 378*. Proc. Natl. Acad. Sci. U. S. A. 109, 15330–15335 (2012). 484. Li, P., Jiao, J., Gao, G. & Prabhakar, B. S. Control of mitochondrial activity by miRNAs. J. Cell. Biochem. 113, 1104– 1110 (2012). 485. Sripada, L. et al. Systematic analysis of small RNAs associated with human mitochondria by deep sequencing: Detailed analysis of mitochondrial associated miRNA. PLoS One 7, e44873 (2012). 486. Tomasetti, M., Neuzil, J. & Dong, L. MicroRNAs as regulators of mitochondrial function: Role in cancer suppression. Biochim. Biophys. Acta - Gen. Subj. 1840, 1441–1453 (2014). 487. Rackham, O. et al. Long noncoding RNAs are generated from the mitochondrial genome and regulated by nuclear- encoded proteins. RNA 17, 2085–2093 (2011). 488. Yang, K.-C. et al. Deep RNA sequencing reveals dynamic regulation of myocardial noncoding RNAs in failing human heart and remodeling with mechanical circulatory support. Circulation 129, 1009–1021 (2014). 489. Lung, B. et al. Identification of small non-coding RNAs from mitochondria and chloroplasts. Nucleic Acids Res. 34, 3842–3852 (2006). 490. Ro, S. et al. The mitochondrial genome encodes abundant small noncoding RNAs. Cell Res. 23, 759–774 (2013). 491. The ENCODE Project Consortium. An Integrated Encyclopedia of DNA Elements in the Human Genome. Nature 489, 57–74 (2012). 492. Rosenbloom, K. R. et al. ENCODE whole-genome data in the UCSC Genome Browser: update 2012. Nucleic Acids Research 40, D912-7 (2012). 493. Jones, P. A. et al. Moving AHEAD with an international human epigenome project. Nature 454, 711–715 (2008). 494. Bradbury, J. Human epigenome project--up and running. PLoS Biol. 1, 82 (2003). 495. Jirtle, R. L. & Skinner, M. K. Environmental epigenomics and disease susceptibility. Nat. Rev. Genet. 8, 253–262 (2007). 496. Wang, J. et al. Nutrition, epigenetics, and metabolic syndrome. Antioxid. Redox Signal. 17, 282–301 (2012). 497. Baylin, S. B. & Jones, P. A. A decade of exploring the cancer epigenome — biological and translational implications. Nature Reviews. Cancer 11, 726–734 (2011). 498. Putta, S. et al. Inhibiting MicroRNA-192 Ameliorates Renal Fibrosis in Diabetic Nephropathy. Journal of the American Society of Nephrology : JASN 23, 458–469 (2012). 499. Barker, D. J. et al. Type 2 (non-insulin-dependent) diabetes mellitus, hypertension and hyperlipidaemia (syndrome X): relation to reduced fetal growth. Diabetologia 36, 62–67 (1993). 500. Barker, D. J., Osmond, C., Golding, J., Kuh, D. & Wadsworth, M. E. Growth in utero, blood pressure in childhood and adult life, and mortality from cardiovascular disease. BMJ 298, 564–567 (1989). 501. Eriksson, J. G. et al. Effects of size at birth and childhood growth on the insulin resistance syndrome in elderly individuals. Diabetologia 45, 342–348 (2002). 502. Hallan, S. et al. Effect of intrauterine growth restriction on kidney function at young adult age: the Nord Trondelag Health (HUNT 2) Study. Am. J. Kidney Dis. 51, 10–20 (2008). 503. Law, C. M., Barker, D. J., Osmond, C., Fall, C. H. & Simmonds, S. J. Early growth and abdominal fatness in adult life. J. Epidemiol. Community Health 46, 184–186 (1992). 504. Li, S. et al. Low birth weight is associated with chronic kidney disease only in men. Kidney Int. 73, 637–642 (2008). 505. White, S. L. et al. Is low birth weight an antecedent of CKD in later life? A systematic review of observational studies. Am. J. Kidney Dis. 54, 248–261 (2009). 506. Han, Q., Zhu, H., Chen, X. & Liu, Z. Non-genetic mechanisms of diabetic nephropathy. Front. Med. 11, 319–332 (2017). 507. Reddy, M. A. & Natarajan, R. Recent developments in epigenetics of acute and chronic kidney diseases. Kidney Int. 88, 250–261 (2015). 508. Wing, M. R., Ramezani, A., Gill, H. S., Devaney, J. M. & Raj, D. S. Epigenetics of progression of chronic kidney disease: Fact or fantasy? Semin. Nephrol. 33, 10.1016/j.semnephrol.2013.05.008 (2013). 509. Beckerman, P., Ko, Y.-A. & Susztak, K. Epigenetics: A new way to look at kidney diseases. Nephrol. Dial. Transplant. 29, 1821–1827 (2014).

256 510. Reddy, M. A. & Natarajan, R. Epigenetics in diabetic kidney disease. J. Am. Soc. Nephrol. 22, 2182–2185 (2011). 511. Reddy, M. A., Zhang, E. & Natarajan, R. Epigenetic mechanisms in diabetic complications and metabolic memory. Diabetologia 58, 443–455 (2015). 512. Bechtel, W. et al. Methylation determines fibroblast activation and fibrogenesis in the kidney. Nat. Med. 16, 544– 550 (2010). 513. Godfrey, K. M. & Barker, D. J. Fetal programming and adult health. Public Health Nutr. 4, 611–624 (2001). 514. Breton, C. V et al. Prenatal tobacco smoke exposure affects global and gene-specific DNA methylation. Am. J. Respir. Crit. Care Med. 180, 462–467 (2009). 515. Breton, C. V et al. Prenatal tobacco smoke exposure is associated with childhood DNA CpG methylation. PLoS One 9, e99716 (2014). 516. Joubert, B. R. et al. Maternal smoking and DNA methylation in newborns: in utero effect or epigenetic inheritance? Cancer Epidemiol. Biomarkers Prev. 23, 1007–1017 (2014). 517. Suter, M. et al. In utero tobacco exposure epigenetically modifies placental CYP1A1 expression. Metabolism. 59, 1481–1490 (2010). 518. Suter, M. et al. Maternal tobacco use modestly alters correlated epigenome-wide placental DNA methylation and gene expression. Epigenetics 6, 1284–1294 (2011). 519. Sapienza, C. et al. DNA methylation profiling identifies epigenetic differences between diabetes patients with ESRD and diabetes patients without nephropathy. Epigenetics 6, 20–28 (2011). 520. Smiraglia, D. J., Kulawiec, M., Bistulfi, G. L., Gupta, S. G. & Singh, K. K. A novel role for mitochondria in regulating epigenetic modification in the nucleus. Cancer Biol. Ther. 7, 1182–1190 (2008). 521. Wallace, D. C. & Fan, W. Energetics, Epigenetics, Mitochondrial Genetics. Mitochondrion 10, 12–31 (2010). 522. Borrelli, E., Nestler, E. J., Allis, C. D. & Sassone-Corsi, P. Decoding the epigenetic language of neuronal plasticity. Neuron 60, 961–974 (2008). 523. Mi, X., Tang, W., Chen, X., Liu, F. & Tang, X. Mitofusin 2 attenuates the histone acetylation at collagen IV promoter in diabetic nephropathy. J. Mol. Endocrinol. 57, 233–249 (2016). 524. Kabra, D. G., Gupta, J. & Tikoo, K. Insulin induced alteration in post-translational modifications of histone H3 under a hyperglycemic condition in L6 skeletal muscle myoblasts. Biochimica et Biophysica Acta 1792, 574–583 (2009). 525. Minocherhomji, S., Tollefsbol, T. O. & Singh, K. K. Mitochondrial regulation of epigenetics and its role in human diseases. Epigenetics 7, 326–334 (2012). 526. Golestaneh, L., Melamed, M. L. & Hostetter, T. H. Uremic memory: The role of acute kidney injury in long-term outcomes. Kidney Int. 76, 813–814 (2009). 527. Keating, S. T. & El-Osta, A. Glycemic memories and the epigenetic component of diabetic nephropathy. Curr. Diab. Rep. 13, 574–581 (2013). 528. McCaughan, J. A., McKnight, A. J., Courtney, A. E. & Maxwell, A. P. Epigenetics: Time to translate into transplantation. Transplantation 94, 1–7 (2012). 529. Swan, E. J. et al. Genetic risk factors affecting mitochondrial function are associated with kidney disease in people with type 1 diabetes. Diabet. Med. 32, 1104–1109 (2015). 530. Peng, R. et al. Promoter hypermethylation of let-7a-3 is relevant to its down-expression in diabetic nephropathy by targeting UHRF1. Gene 570, 57–63 (2015). 531. Alhosin, M. et al. Signalling pathways in UHRF1-dependent regulation of tumor suppressor genes in cancer. J. Exp. Clin. Cancer Res. 35, 174 (2016). 532. Reddy, M. A. et al. Losartan reverses permissive epigenetic changes in renal glomeruli of diabetic db/db mice. Kidney Int. 85, 362–373 (2014). 533. DiMauro, S. & Davidzon, G. Mitochondrial DNA and disease. Ann. Med. 37, 222–232 (2005). 534. Hauberg, M. E. et al. Large-Scale Identification of Common Trait and Disease Variants Affecting Gene Expression. Am. J. Hum. Genet. 100, 885–894 (2017). 535. Kim, S. et al. A comprehensive gene-environment interaction analysis in Ovarian Cancer using genome-wide significant common variants. Int. J. cancer 144, 2192–2205 (2019). 536. Group, G. and U. D. P. S. et al. Common variants near ATM are associated with glycemic response to metformin in

257 type 2 diabetes. Nat. Genet. 43, 117–120 (2011). 537. Zhou, K. et al. Variation in the glucose transporter gene SLC2A2 is associated with glycemic response to metformin. Nat. Genet. 48, 1055–1059 (2016). 538. Gloyn, A. L. et al. Activating mutations in the gene encoding the ATP-sensitive potassium-channel subunit Kir6.2 and permanent neonatal diabetes. N. Engl. J. Med. 350, 1838–1849 (2004). 539. Lynch, C. J. & Adams, S. H. Branched-chain amino acids in metabolic signalling and insulin resistance. Nat. Rev. Endocrinol. 10, 723–736 (2014). 540. Sarraju, A. & Knowles, J. W. Genetic Testing and Risk Scores: Impact on Familial Hypercholesterolemia. Front. Cardiovasc. Med. 6, 5 (2019). 541. Pedersen, H. K. et al. Human gut microbes impact host serum metabolome and insulin sensitivity. Nature 535, 376–381 (2016). 542. Kruglyak, L. & Nickerson, D. A. Variation is the spice of life. Nat. Genet. 27, 234 (2001). 543. The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010). 544. Carlton, V. E. H., Ireland, J. S., Useche, F. & Faham, M. Functional single nucleotide polymorphism-based association studies. Hum. Genomics 2, 391–402 (2006). 545. Albert, P. R. What is a functional genetic polymorphism? Defining classes of functionality. J. Psychiatry Neurosci. 36, 363–365 (2011). 546. Altshuler, D. M. et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010). 547. Bush, W. S. & Moore, J. H. Chapter 11: Genome-wide association studies. PLoS Comput. Biol. 8, e1002822– e1002822 (2012). 548. Iyengar, S. K. et al. Linkage analysis of candidate loci for end-stage renal disease due to diabetic nephropathy. J. Am. Soc. Nephrol. 14, S195-201 (2003). 549. McDonough, C. W. et al. Genetic analysis of diabetic nephropathy on chromosome 18 in African Americans: linkage analysis and dense SNP mapping. Hum. Genet. 126, 805–817 (2009). 550. Davoudi, S. & Sobrin, L. Novel Genetic Actors of Diabetes-Associated Microvascular Complications: Retinopathy, Kidney Disease and Neuropathy. Rev. Diabet. Stud. 12, 243–259 (2015). 551. Berstein, Y., McCarthy, S. E., Kramer, M. & McCombie, W. R. Detection of rare disease-related genetic variants using the birthday model. bioRxiv 464842 (2018). 552. Bomba, L., Walter, K. & Soranzo, N. The impact of rare and low-frequency genetic variants in common disease. Genome Biol. 18, 77 (2017). 553. Zhang, Q. Associating rare genetic variants with human diseases . Frontiers in Genetics 6, 133 (2015). 554. Iyengar, S. K. & Elston, R. C. The genetic basis of complex traits: rare variants or ‘common gene, common disease’? Methods Mol. Biol. 376, 71–84 (2007). 555. Gibson, G. Rare and common variants: twenty arguments. Nat. Rev. Genet. 13, 135–145 (2012). 556. Schork, N. J., Murray, S. S., Frazer, K. A. & Topol, E. J. Common vs. rare allele hypotheses for complex diseases. Curr. Opin. Genet. Dev. 19, 212–219 (2009). 557. Reich, D. E. & Lander, E. S. On the allelic spectrum of human disease. Trends Genet. 17, 502–510 (2001). 558. Price, A., Spencer, C. & Donnelly, P. Progress and promise in understanding the genetic basis of common diseases. Proc. R. Soc. B Biol. Sci. 282, 20151684 (2015). 559. Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. U. S. A. 106, 9362–9367 (2009). 560. Auer, P. L. et al. Rare and low-frequency coding variants in CXCR2 and other genes are associated with hematological traits. Nat. Genet. 46, 629–634 (2014). 561. Cohen, J. et al. Low LDL cholesterol in individuals of African descent resulting from frequent nonsense mutations in PCSK9. Nat. Genet. 37, 161–165 (2005). 562. Tsai, C.-W. et al. Both rare and common variants in PCSK9 influence plasma low-density lipoprotein cholesterol level in American Indians. J. Clin. Endocrinol. Metab. 100, E345–E349 (2015).

258 563. Han, K. et al. SHANK3 overexpression causes manic-like behaviour with unique pharmacogenetic properties. Nature 503, 72–77 (2013). 564. Sztainberg, Y. & Zoghbi, H. Y. Lessons learned from studying syndromic autism spectrum disorders. Nat. Neurosci. 19, 1408–1417 (2016). 565. Lin, A. et al. Mapping 22q11.2 Gene Dosage Effects on Brain Morphometry. J. Neurosci. 37, 6183–6199 (2017). 566. Toro, R. et al. Key role for gene dosage and synaptic homeostasis in autism spectrum disorders. Trends Genet. 26, 363–372 (2010). 567. Emond, M. J. et al. Exome sequencing of extreme phenotypes identifies DCTN4 as a modifier of chronic Pseudomonas aeruginosa infection in cystic fibrosis. Nat. Genet. 44, 886–889 (2012). 568. Dorfman, R. et al. Complex two-gene modulation of lung disease severity in children with cystic fibrosis. J. Clin. Invest. 118, 1040–1049 (2008). 569. Posey, J. E. et al. Resolution of Disease Phenotypes Resulting from Multilocus Genomic Variation. N. Engl. J. Med. 376, 21–31 (2017). 570. Corvol, H. et al. Genome-wide association meta-analysis identifies five modifier loci of lung disease severity in cystic fibrosis. Nat. Commun. 6, 8382 (2015). 571. Cirulli, E. T. & Goldstein, D. B. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat. Rev. Genet. 11, 415–425 (2010). 572. Blighe, K. et al. Gene editing in the context of an increasingly complex genome. BMC Genomics 19, 595 (2018). 573. Pritchard, J. K. & Cox, N. J. The allelic architecture of human disease genes: common disease-common variant...or not? Hum. Mol. Genet. 11, 2417–2423 (2002). 574. Bodmer, W. & Bonilla, C. Common and rare variants in multifactorial susceptibility to common diseases. Nat. Genet. 40, 695–701 (2008). 575. Gorlov, I. P., Gorlova, O. Y., Sunyaev, S. R., Spitz, M. R. & Amos, C. I. Shifting paradigm of association studies: value of rare single-nucleotide polymorphisms. Am. J. Hum. Genet. 82, 100–112 (2008). 576. Pritchard, J. K. Are rare variants responsible for susceptibility to complex diseases? Am. J. Hum. Genet. 69, 124– 137 (2001). 577. Gorski, M. M. et al. Whole-exome sequencing to identify genetic risk variants underlying inhibitor development in severe hemophilia A patients. Blood 127, 2924–2933 (2016). 578. Alves, M. M. et al. Contribution of rare and common variants determine complex diseases-Hirschsprung disease as a model. Dev. Biol. 382, 320–329 (2013). 579. Diogo, D. et al. Rare, low-frequency, and common variants in the protein-coding sequence of biological candidate genes from GWASs contribute to risk of rheumatoid arthritis. Am. J. Hum. Genet. 92, 15–27 (2013). 580. Yang, J. et al. The contribution of rare and common variants in 30 genes to risk nicotine dependence. Mol. Psychiatry 20, 1467–1478 (2015). 581. Fritsche, L. G. et al. A large genome-wide association study of age-related macular degeneration highlights contributions of rare and common variants. Nat. Genet. 48, 134–143 (2016). 582. Timpson, N. J., Greenwood, C. M. T., Soranzo, N., Lawson, D. J. & Richards, J. B. Genetic architecture: the shape of the genetic contribution to human traits and disease. Nat. Rev. Genet. 19, 110–124 (2018). 583. Mohammad, M. et al. A female with X-linked Alport syndrome and compound heterozygous COL4A5 mutations. Pediatr. Nephrol. 29, 481–485 (2014). 584. Pei, Y. et al. Bilineal disease and trans-heterozygotes in autosomal dominant polycystic kidney disease. Am. J. Hum. Genet. 68, 355–363 (2001). 585. Kalatharan, V., Lemaire, M. & Lanktree, M. B. Opportunities and Challenges for Genetic Studies of End-Stage Renal Disease in Canada. Can. J. kidney Heal. Dis. 5, 2054358118789368–2054358118789368 (2018). 586. Steinthorsdottir, V. et al. Identification of low-frequency and rare sequence variants associated with elevated or reduced risk of type 2 diabetes. Nature genetics 46, 294–298 (2014). 587. Fuchsberger, C. et al. The genetic architecture of type 2 diabetes. Nature 536, 41–47 (2016). 588. Huyghe, J. R. et al. Exome array analysis identifies new loci and low-frequency variants influencing insulin processing and secretion. Nat. Genet. 45, 197–201 (2013).

259 589. Albrechtsen, A. et al. Exome sequencing-driven discovery of coding polymorphisms associated with common metabolic phenotypes. Diabetologia 56, 298–310 (2013). 590. Merino, J., Udler, M. S., Leong, A. & Meigs, J. B. A Decade of Genetic and Metabolomic Contributions to Type 2 Diabetes Risk Prediction. Curr. Diab. Rep. 17, 135 (2017). 591. Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009). 592. Visscher, P. M. et al. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet. 101, 5– 22 (2017). 593. Sved, J. A. & Hill, W. G. One Hundred Years of Linkage Disequilibrium. Genetics 209, 629–636 (2018). 594. Park, L. Linkage Disequilibrium Decay and Past Population History in the Human Genome. PLoS One 7, e46603 (2012). 595. Baird, S. J. E. Exploring linkage disequilibrium. Mol. Ecol. Resour. 15, 1017–1019 (2015). 596. Darvasi, A., Yakir, B., Kuypers, J., Kokoris, M. & Shifman, S. Linkage disequilibrium patterns of the human genome across populations. Hum. Mol. Genet. 12, 771–776 (2003). 597. Pritchard, J. K. & Przeworski, M. Linkage Disequilibrium in Humans: Models and Data. Am. J. Hum. Genet. 69, 1–14 (2001). 598. McEvoy, B. P., Powell, J. E., Goddard, M. E. & Visscher, P. M. Human population dispersal ‘Out of Africa’ estimated from linkage disequilibrium and allele frequencies of SNPs. Genome Res. 21, 821–829 (2011). 599. Devlin, B. & Risch, N. A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29, 311– 322 (1995). 600. The International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005). 601. Hirschhorn, J. N. & Daly, M. J. Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet. 6, 95–108 (2005). 602. Li, M., Li, C. & Guan, W. Evaluation of coverage variation of SNP chips for genome-wide association studies. Eur. J. Hum. Genet. 16, 635–643 (2008). 603. Fisher, S. A. et al. Direct or indirect association in a complex disease: the role of SLC22A4 and SLC22A5 functional variants in Crohn disease. Hum. Mutat. 27, 778–785 (2006). 604. Ohashi, J. & Tokunaga, K. The power of genome-wide association studies of complex disease genes: statistical limitations of indirect approaches using SNP markers. J. Hum. Genet. 46, 478–482 (2001). 605. Howie, B., Marchini, J. & Stephens, M. Genotype imputation with thousands of genomes. G3 (Bethesda). 1, 457– 470 (2011). 606. Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 11, 499–511 (2010). 607. Bertsimas, D., Pawlowski, C. & Zhuo, Y. D. From predictive methods to missing data imputation: An optimization approach. Journal of Machine Learning Research 18, (2018). 608. Zhang, Z. Missing data imputation: focusing on single imputation. Ann. Transl. Med. 4, 9 (2016). 609. Jakobsen, J. C., Gluud, C., Wetterslev, J. & Winkel, P. When and how should multiple imputation be used for handling missing data in randomised clinical trials - A practical guide with flowcharts. BMC Med. Res. Methodol. 17, 1–10 (2017). 610. Huque, M. H., Carlin, J. B., Simpson, J. A. & Lee, K. J. A comparison of multiple imputation methods for missing data in longitudinal studies. BMC Med. Res. Methodol. 18, 168 (2018). 611. Oetting, W. S., Jacobson, P. A. & Israni, A. K. Validation Is Critical for Genome-Wide Association Study-Based Associations. Am. J. Transplant 17, 318–319 (2017). 612. Schena, M., Shalon, D., Davis, R. W. & Brown, P. O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470 (1995). 613. Smyth, L., Kerr, K., Duffy, S., Kilner, J. & McKnight, A. Genetic strategies to understand human diabetic nephropathy: wet-lab approaches. in (Springer Nature, 2019). 614. Cañadas-Garre, M., Smyth, L., Anderson, K., Kerr, K. & McKnight, A. Genetic strategies to understand human diabetic nephropathy: in silico strategies for molecular data - association studies. in (Springer Nature, 2019). 615. Clarke, G. M. et al. Basic statistical analysis in genetic case-control studies. Nat. Protoc. 6, 121–133 (2011).

260 616. Pearce, N. What Does the Odds Ratio Estimate in a Case-Control Study? Int. J. Epidemiol. 22, 1189–1192 (1993). 617. Waltoft, B. L., Pedersen, C. B., Nyegaard, M. & Hobolth, A. The importance of distinguishing between the odds ratio and the incidence rate ratio in GWAS. BMC Med. Genet. 16, 71 (2015). 618. Sul, J. H., Martin, L. S. & Eskin, E. Population structure in genetic studies: Confounding factors and mixed models. PLOS Genet. 14, e1007309 (2018). 619. Gershon, D. Microarray technology: An array of possibilities. Nature 416, 885–891 (2000). 620. Lipshutz, R. J. & Fodor, S. P. A. Advanced DNA sequencing technologies. Curr. Opin. Struct. Biol. 4, 376–380 (1994). 621. Lamy, P., Grove, J. & Wiuf, C. A review of software for microarray genotyping. Hum. Genomics 5, 304–309 (2011). 622. LaFramboise, T. Single nucleotide polymorphism arrays: a decade of biological, computational and technological advances. Nucleic Acids Res. 37, 4181–4193 (2009). 623. Li, M. & Pezzolesi, M. G. Advances in understanding the genetic basis of diabetic kidney disease. Acta Diabetol. 55, 1093–1104 (2018). 624. Kent, W. J. et al. The Human Genome Browser at UCSC. Genome Res. 12, 996–1006 (2002). 625. Dickinson, H. et al. Towards a Next-Generation Sequencing Diagnostic Service for Tumour Genotyping: A Comparison of Panels and Platforms. Biomed Res. Int. 2015, 1–6 (2015). 626. Torkamaneh, D., Laroche, J. & Belzile, F. Genome-Wide SNP Calling from Genotyping by Sequencing (GBS) Data: A Comparison of Seven Pipelines and Two Sequencing Technologies. PLoS One 11, e0161333 (2016). 627. Purcell, S. et al. PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. Am. J. Hum. Genet. 81, 559–575 (2007). 628. Aulchenko, Y. S., Ripke, S., Isaacs, A. & van Duijn, C. M. GenABEL: an R library for genome-wide association analysis. Bioinformatics 23, 1294–1296 (2007). 629. Chen, M.-H. & Yang, Q. GWAF: an R package for genome-wide association analyses with family data. Bioinformatics 26, 580–581 (2010). 630. Gonzalez, J. R. et al. SNPassoc: an R package to perform whole genome association studies. Bioinformatics 23, 644–645 (2007). 631. Clayton, D. & Leung, H.-T. An R package for analysis of whole-genome association studies. Hum. Hered. 64, 45–51 (2007). 632. Fong, C. et al. GWAS analyzer: integrating genotype, phenotype and public annotation data for genome-wide association study analysis. Bioinformatics 26, 560–564 (2010). 633. Chen, W., Liang, L. & Abecasis, G. R. GWAS GUI: graphical browser for the results of whole-genome association studies with high-dimensional phenotypes. Bioinformatics 25, 284–285 (2009). 634. Narayanan, K. & Li, J. MAVEN: a tool for visualization and functional analysis of genome-wide association results. Bioinformatics 26, 270–272 (2010). 635. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006). 636. Jin, Y., Schaffer, A. A., Sherry, S. T. & Feolo, M. Quickly identifying identical and closely related subjects in large databases using genotype data. PLoS One 12, e0179106 (2017). 637. Saggar, a et al. Multiple Susceptibility Variants. Science (80-. ). 316, 1341–1345 (2007). 638. McRae, A. F. Analysis of Genome-Wide Association Data. in Bioinformatics: Volume II: Structure, Function, and Applications, Methods in Molecular Biology (ed. Keith, J. M.) 1526, 161–173 (2017). 639. Pe’er, I., Yelensky, R., Altshuler, D. & Daly, M. J. Estimation of the multiple testing burden for genomewide association studies of nearly all common variants. Genet. Epidemiol. 32, 381–385 (2008). 640. Chanock, S. J. et al. Replicating genotype-phenotype associations. Nature 447, 655–660 (2007). 641. Dehghan, A. Genome-Wide Association Studies. in Genetic Epidemiology (ed. Evangelou, E.) 37–49 (Humana Press, 2018). 642. Risch, N. & Merikangas, K. The future of genetic studies of complex human diseases. Science 273, 1516–1517 (1996). 643. Wray, N. R., Goddard, M. E. & Visscher, P. M. Prediction of individual genetic risk to disease from genome-wide

261 association studies. Genome Res. 17, 1520–1528 (2007). 644. Visscher, P. M., Brown, M. A., McCarthy, M. I. & Yang, J. Five years of GWAS discovery. Am. J. Hum. Genet. 90, 7– 24 (2012). 645. Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. Nat. Genet. 47, 1236– 1241 (2015). 646. Cross-Disorder Group of the Psychiatric Genomics Consortium. Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs. Nat. Genet. 45, 984–994 (2013). 647. Visscher, P. M. & Yang, J. A plethora of pleiotropy across complex traits. Nat. Genet. 48, 707–708 (2016). 648. Yang, J. et al. Ubiquitous polygenicity of human complex traits: genome-wide analysis of 49 traits in Koreans. PLoS Genet. 9, e1003355–e1003355 (2013). 649. Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44, 821– 824 (2012). 650. Lippert, C. et al. FaST linear mixed models for genome-wide association studies. Nat. Methods 8, 833–835 (2011). 651. Yu, J. et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 38, 203–208 (2006). 652. Kang, H. M. et al. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42, 348–354 (2010). 653. Yang, J., Zaitlen, N. A., Goddard, M. E., Visscher, P. M. & Price, A. L. Advantages and pitfalls in the application of mixed-model association methods. Nat. Genet. 46, 100–106 (2014). 654. Loh, P.-R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015). 655. Svishcheva, G. R., Axenovich, T. I., Belonogova, N. M., van Duijn, C. M. & Aulchenko, Y. S. Rapid variance components-based method for whole-genome association analysis. Nat. Genet. 44, 1166–1170 (2012). 656. Yang, J. et al. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat. Genet. 44, 369-S3 (2012). 657. Liu, J. Z. et al. A versatile gene-based test for genome-wide association studies. Am. J. Hum. Genet. 87, 139–145 (2010). 658. Li, M.-X., Gui, H.-S., Kwan, J. S. H. & Sham, P. C. GATES: a rapid and powerful gene-based association test using extended Simes procedure. Am. J. Hum. Genet. 88, 283–293 (2011). 659. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015). 660. Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42, 565– 569 (2010). 661. Smith, G. D. & Ebrahim, S. ‘Mendelian randomization’: can genetic epidemiology contribute to understanding environmental determinants of disease? International journal of epidemiology 32, 1–22 (2003). 662. Bowden, J., Davey Smith, G. & Burgess, S. Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. Int. J. Epidemiol. 44, 512–525 (2015). 663. Burgess, S., Butterworth, A. & Thompson, S. G. Mendelian randomization analysis with multiple genetic variants using summarized data. Genet. Epidemiol. 37, 658–665 (2013). 664. Zhu, Z. et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 48, 481–487 (2016). 665. Gusev, A. et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 48, 245–252 (2016). 666. Hoffmann, T. J. et al. Genome-wide association analyses using electronic health records identify new loci influencing blood pressure variation. Nat. Genet. 49, 54–64 (2017). 667. Bulik-Sullivan, B. K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015). 668. Pickrell, J. K. et al. Detection and interpretation of shared genetic influences on 42 human traits. Nat. Genet. 48, 709–717 (2016).

262 669. Evangelou, E. & Ioannidis, J. P. A. Meta-analysis methods for genome-wide association studies and beyond. Nat. Rev. Genet. 14, 379 (2013). 670. Ikeda, M., Saito, T., Kondo, K. & Iwata, N. Genome-wide association studies of bipolar disorder: A systematic review of recent findings and their clinical implications. Psychiatry Clin. Neurosci. 72, 52–63 (2018). 671. Sud, A., Kinnersley, B. & Houlston, R. S. Genome-wide association studies of cancer: current insights and future perspectives. Nat. Rev. Cancer 17, 692–704 (2017). 672. Cuyvers, E. & Sleegers, K. Genetic variations underlying Alzheimer’s disease: evidence from genome-wide association studies and beyond. Lancet. Neurol. 15, 857–868 (2016). 673. Kessler, T., Vilne, B. & Schunkert, H. The impact of genome-wide association studies on the pathophysiology and therapy of cardiovascular disease. EMBO Mol. Med. 8, 688–701 (2016). 674. Leu, C., Coppola, A. & Sisodiya, S. M. Progress from genome-wide association studies and copy number variant studies in epilepsy. Curr. Opin. Neurol. 29, 158–167 (2016). 675. Zhang, Y.-P., Zhang, Y.-Y. & Duan, D. D. From Genome-Wide Association Study to Phenome-Wide Association Study: New Paradigms in Obesity Research. Prog. Mol. Biol. Transl. Sci. 140, 185–231 (2016). 676. Wang, X. et al. Genetic markers of type 2 diabetes: Progress in genome-wide association studies and clinical application for risk prediction. J. Diabetes 8, 24–35 (2016). 677. Wuttke, M. & Kottgen, A. Insights into kidney diseases from genome-wide association studies. Nat. Rev. Nephrol. 12, 549–562 (2016). 678. Shimazaki, A. et al. ELMO1 increases expression of extracellular matrix proteins and inhibits cell adhesion to ECMs. Kidney Int. 70, 1769–1776 (2006). 679. Wu, H. Y. et al. Association of ELMO1 gene polymorphisms with diabetic nephropathy in Chinese population. J. Endocrinol. Invest. 36, 298–302 (2013). 680. Leak, T. S. et al. Variants in Intron 13 of the ELMO1 Gene are Associated with Diabetic Nephropathy in African Americans. Ann. Hum. Genet. 73, 152–159 (2009). 681. Hanson, R. L. et al. ELMO1 variants and susceptibility to diabetic nephropathy in American Indians. Mol. Genet. Metab. 101, 383–390 (2010). 682. Williams, W. et al. Association Testing of Previously Reported Variants in a Large Case-Control Meta-analysis of Diabetic Nephropathy. Diabetes 61, 2187–2194 (2012). 683. Pezzolesi, M. G. et al. Family-based association analysis confirms the role of the chromosome 9q21.32 locus in the susceptibility of diabetic nephropathy. PLoS One 8, e60301 (2013). 684. Freedman, B. I. et al. Differential effects of MYH9 and APOL1 risk variants on FRMD3 Association with Diabetic ESRD in African Americans. PLoS Genet. 7, e1002150 (2011). 685. Palmer, N. D. et al. Evaluation of candidate nephropathy susceptibility genes in a genome-wide association study of African American diabetic kidney disease. PLoS One 9, e88273 (2014). 686. Martini, S. et al. From single nucleotide polymorphism to transcriptional mechanism: a model for FRMD3 in diabetic nephropathy. Diabetes 62, 2605–2612 (2013). 687. Wang, Y. et al. COL4A3 Gene Variants and Diabetic Kidney Disease in MODY. Clin. J. Am. Soc. Nephrol. 13, 1162– 1171 (2018). 688. Marian, A. J. Clinical applications of molecular genetic discoveries. Transl. Res. 168, 6–14 (2016). 689. Abraham, G. & Inouye, M. Genomic risk prediction of complex human disease and its clinical application. Curr. Opin. Genet. Dev. 33, 10–16 (2015). 690. Sanseau, P. et al. Use of genome-wide association studies for drug repositioning. Nature biotechnology 30, 317– 320 (2012). 691. Zhang, J. et al. Use of genome-wide association studies for cancer research and drug repositioning. PLoS One 10, e0116477 (2015). 692. Nelson, M. R. et al. The support of human genetic evidence for approved drug indications. Nat. Genet. 47, 856– 860 (2015). 693. Hueber, W. et al. Effects of AIN457, a fully human antibody to interleukin-17A, on psoriasis, rheumatoid arthritis, and uveitis. Sci. Transl. Med. 2, 52ra72 (2010).

263 694. McInnes, I. B. et al. Efficacy and safety of secukinumab, a fully human anti-interleukin-17A monoclonal antibody, in patients with moderate-to-severe psoriatic arthritis: a 24-week, randomised, double-blind, placebo-controlled, phase II proof-of-concept trial. Ann. Rheum. Dis. 73, 349–356 (2014). 695. Sieper, J. et al. Secukinumab efficacy in anti-TNF-naive and anti-TNF-experienced subjects with active ankylosing spondylitis: results from the MEASURE 2 Study. Ann. Rheum. Dis. 76, 571–592 (2017). 696. Duerr, R. H. et al. A genome-wide association study identifies IL23R as an inflammatory bowel disease gene. Science 314, 1461–1463 (2006). 697. Moschen, A. R., Tilg, H. & Raine, T. IL-12, IL-23 and IL-17 in IBD: immunobiology and therapeutic targeting. Nat. Rev. Gastroenterol. Hepatol. 16, 185–196 (2019). 698. Neale, B. M. et al. Meta-analysis of genome-wide association studies of attention-deficit/hyperactivity disorder. J. Am. Acad. Child Adolesc. Psychiatry 49, 884–897 (2010). 699. Ripke, S. et al. Schizophrenia Psychiatric Genome-Wide Association Study (GWAS) Consortium. Genome-wide association study identifies five new schizophrenia loci. Nat. Genet. 43, 969–976 (2011). 700. De Rubeis, S. et al. Synaptic, transcriptional and chromatin genes disrupted in autism. Nature 515, 209–215 (2014). 701. Fromer, M. et al. De novo mutations in schizophrenia implicate synaptic networks. Nature 506, 179–184 (2014). 702. Girard, S. L. et al. Increased exonic de novo mutation rate in individuals with schizophrenia. Nat. Genet. 43, 860– 863 (2011). 703. Xu, B. et al. Exome sequencing supports a de novo mutational paradigm for schizophrenia. Nat. Genet. 43, 864– 868 (2011). 704. Marshall, C. R. et al. Contribution of copy number variants to schizophrenia from a genome-wide study of 41,321 subjects. Nat. Genet. 49, 27–35 (2017). 705. Iossifov, I. et al. The contribution of de novo coding mutations to autism spectrum disorder. Nature 515, 216–221 (2014). 706. Georgieva, L. et al. De novo CNVs in bipolar affective disorder and schizophrenia. Hum. Mol. Genet. 23, 6677–6683 (2014). 707. Jiang, Y. et al. Detection of clinically relevant genetic variants in autism spectrum disorder by whole-genome sequencing. Am. J. Hum. Genet. 93, 249–263 (2013). 708. McCarthy, S. E. et al. De novo mutations in schizophrenia implicate chromatin remodeling and support a genetic overlap with autism and intellectual disability. Mol. Psychiatry 19, 652–658 (2014). 709. de Jong, S., Vidler, L. R., Mokrab, Y., Collier, D. A. & Breen, G. Gene-set analysis based on the pharmacological profiles of drugs to identify repurposing opportunities in schizophrenia. J. Psychopharmacol. 30, 826–830 (2016). 710. Giacomini, K. M. et al. Genome-wide association studies of drug response and toxicity: an opportunity for genome medicine. Nat. Rev. Drug Discov. 16, 1 (2017). 711. Ashley, E. A. Towards precision medicine. Nat. Rev. Genet. 17, 507 (2016). 712. GOV.UK. Matt Hancock announces ambition to map 5 million genomes. Department of Health and Social Care (2018). Available at: https://www.gov.uk/government/news/matt-hancock-announces-ambition-to-map-5- million-genomes. (Accessed: 27th September 2019) 713. Levy, S. E. & Myers, R. M. Advancements in Next-Generation Sequencing. Annu. Rev. Genomics Hum. Genet. 17, 95–115 (2016). 714. Lee, H. et al. Third-generation sequencing and the future of genomics. bioRxiv 48603 (2016). 715. Maxam, A. M. & Gilbert, W. A new method for sequencing DNA. Proc. Natl. Acad. Sci. 74, 560 LP – 564 (1977). 716. Sanger, F., Nicklen, S. & Coulson, A. R. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. U. S. A. 74, 5463–5467 (1977). 717. Sanger, F. & Coulson, A. R. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J. Mol. Biol. 94, 441–448 (1975). 718. Smith, L. M. et al. Fluorescence detection in automated DNA sequence analysis. Nature 321, 674–679 (1986). 719. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001). 720. Pettersson, E., Lundeberg, J. & Ahmadian, A. Generations of sequencing technologies. Genomics 93, 105–111

264 (2009). 721. Service, R. F. Gene sequencing. The race for the $1000 genome. Science (New York, N.Y.) 311, 1544–1546 (2006). 722. Check Hayden, E. The $1,000 genome. Nature 4–5 (2014). 723. Berg, J. S., Khoury, M. J. & Evans, J. P. Deploying whole genome sequencing in clinical practice and public health: meeting the challenge one bin at a time. Genetics in medicine : official journal of the American College of Medical Genetics 13, 499–504 (2011). 724. Green, R. C. et al. ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genet. Med. 15, 565–574 (2013). 725. Wetterstrand, K. A. DNA Sequencing Costs: Data. National Human Genome Research Institute (2019). Available at: https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Costs-Data. (Accessed: 27th September 2019) 726. Head, S. R. et al. Library construction for next-generation sequencing: overviews and challenges. Biotechniques 56, 61-passim (2014). 727. Albert, T. J. et al. Direct selection of human genomic loci by microarray hybridization. Nat. Methods 4, 903–905 (2007). 728. Okou, D. T. et al. Microarray-based genomic selection for high-throughput resequencing. Nat. Methods 4, 907–909 (2007). 729. Gnirke, A. et al. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat. Biotechnol. 27, 182–189 (2009). 730. Gaudin, M. & Desnues, C. Hybrid Capture-Based Next Generation Sequencing and Its Application to Human Infectious Diseases. Front. Microbiol. 9, 2924 (2018). 731. Mamanova, L. et al. Target-enrichment strategies for next-generation sequencing. Nat. Methods 7, 111–118 (2010). 732. Verma, M., Kulshrestha, S. & Puri, A. Genome Sequencing. Methods Mol. Biol. 1525, 3–33 (2017). 733. Shendure, J. & Ji, H. Next-generation DNA sequencing. Nat. Biotechnol. 26, 1135–1145 (2008). 734. Mardis, E. R. Next-generation sequencing platforms. Annu. Rev. Anal. Chem. (Palo Alto. Calif). 6, 287–303 (2013). 735. Blazej, R. G., Kumaresan, P. & Mathies, R. A. Microfabricated bioprocessor for integrated nanoliter-scale Sanger DNA sequencing. Proc. Natl. Acad. Sci. 103, 7240–7245 (2006). 736. Augustin, M. A., Ankenbauer, W. & Angerer, B. Progress towards single-molecule sequencing: enzymatic synthesis of nucleotide-specifically labeled DNA. J. Biotechnol. 86, 289—301 (2001). 737. Metzker, M. L. Sequencing in real time. Nat. Biotechnol. 27, 150 (2009). 738. Mardis, E. R. Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum. Genet. 9, 387–402 (2008). 739. ATDBio. Next Generation Sequencing. ATDBio (2019). Available at: https://www.atdbio.com/content/58/Next- generation-sequencing#Library-preparation. (Accessed: 27th September 2019) 740. Dressman, D., Yan, H., Traverso, G., Kinzler, K. W. & Vogelstein, B. Transforming single DNA molecules into fluorescent magnetic particles for detection and enumeration of genetic variations. Proc. Natl. Acad. Sci. 100, 8817 LP – 8822 (2003). 741. Nakano, M. et al. Single-molecule PCR using water-in-oil emulsion. J. Biotechnol. 102, 117–124 (2003). 742. Kanagal-Shamanna, R. Emulsion PCR: Techniques and Applications BT - Clinical Applications of PCR. in (eds. Luthra, R., Singh, R. R. & Patel, K. P.) 33–42 (Springer New York, 2016). 743. Zhu, Z. et al. Single-molecule emulsion PCR in microfluidic droplets. Anal. Bioanal. Chem. 403, 2127–2143 (2012). 744. Hert, D. G., Fredlake, C. P. & Barron, A. E. Advantages and limitations of next-generation sequencing technologies: A comparison of electrophoresis and non-electrophoresis methods. Electrophoresis 29, 4618–4626 (2008). 745. Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333 (2016). 746. Fedurco, M., Romieu, A., Williams, S., Lawrence, I. & Turcatti, G. BTA, a novel reagent for DNA attachment on glass and efficient generation of solid-phase amplified DNA colonies. Nucleic Acids Res. 34, e22 (2006). 747. Adessi, C. et al. Solid phase DNA amplification: characterisation of primer attachment and amplification

265 mechanisms. Nucleic Acids Res. 28, E87–E87 (2000). 748. Porreca, G. J. Genome sequencing on nanoballs. Nat. Biotechnol. 28, 43 (2010). 749. Xu, Y. et al. A new massively parallel nanoball sequencing platform for whole exome research. BMC Bioinformatics 20, 153 (2019). 750. Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–380 (2005). 751. Loman, N. J. et al. Performance comparison of benchtop high-throughput sequencing platforms. Nat. Biotechnol. 30, 434–439 (2012). 752. Rothberg, J. M. et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature 475, 348–352 (2011). 753. Valouev, A. et al. A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence- dictated positioning. Genome Res. 18, 1051–1063 (2008). 754. Mardis, E. R. The impact of next-generation sequencing technology on genetics. Trends Genet. 24, 133–141 (2008). 755. Chen, F. et al. The History and Advances of Reversible Terminators Used in New Generations of Sequencing Technology. Genomics. Proteomics Bioinformatics 11, 34–40 (2013). 756. Li, Z. et al. A photocleavable fluorescent nucleotide for DNA sequencing and analysis. Proc. Natl. Acad. Sci. 100, 414–419 (2003). 757. Guo, J., Yu, L., Turro, N. J. & Ju, J. An Integrated System for DNA Sequencing by Synthesis Using Novel Nucleotide Analogues. Acc. Chem. Res. 43, 551–563 (2010). 758. Ardui, S., Ameur, A., Vermeesch, J. R. & Hestand, M. S. Single molecule real-time (SMRT) sequencing comes of age: applications and utilities for medical diagnostics. Nucleic Acids Res. 46, 2159–2168 (2018). 759. Illumina. TruSeq® Synthetic Long-Read DNA Library Prep Kit for Whole Human Genome Phasing. (2014). 760. Illumina. Long Read Sequencing. Illumina.com (2019). Available at: https://www.illumina.com/science/technology/next-generation-sequencing/long-read-sequencing.html. (Accessed: 14th August 2019) 761. Jain, M., Olsen, H. E., Paten, B. & Akeson, M. The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol. 17, 239 (2016). 762. Kumar, K. R., Cowley, M. J. & Davis, R. L. Next-Generation Sequencing and Emerging Technologies. Semin Thromb Hemost (2019). 763. Frommer, M. et al. A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proc. Natl. Acad. Sci. U. S. A. 89, 1827–1831 (1992). 764. Booth, M. J. et al. Oxidative bisulfite sequencing of 5-methylcytosine and 5-hydroxymethylcytosine. Nat. Protoc. 8, 1841 (2013). 765. Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63 (2009). 766. Holt, R. A. & Jones, S. J. M. The new paradigm of flow cell sequencing. Genome Res. 18, 839–846 (2008). 767. Morin, R. et al. Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques 45, 81–94 (2008). 768. Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M. & Gilad, Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18, 1509–1517 (2008). 769. Cloonan, N. et al. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat. Methods 5, 613–619 (2008). 770. Barbazuk, W. B., Emrich, S. J., Chen, H. D., Li, L. & Schnable, P. S. SNP discovery via 454 transcriptome sequencing. Plant J. 51, 910–918 (2007). 771. Emrich, S. J., Barbazuk, W. B., Li, L. & Schnable, P. S. Gene discovery and annotation using LCM-454 transcriptome sequencing. Genome Res. 17, 69–73 (2007). 772. Thompson, J. & E Steinmann, K. Single Molecule Sequencing with a HeliScope Genetic Analysis System. Curr. Protoc. Mol. Biol. Chapter 7, Unit7.10 (2010). 773. Sboner, A., Mu, X. & Greenbaum, D. The real cost of sequencing: higher than you think! Genome Biol 12, 125

266 (2011). 774. Darriba, D., Flouri, T. & Stamatakis, A. The State of Software for Evolutionary Biology. Mol. Biol. Evol. 35, 1037– 1046 (2018). 775. Knapp, M. & Hofreiter, M. Next Generation Sequencing of Ancient DNA: Requirements, Strategies and Perspectives. Genes 1, (2010). 776. Borsting, C. & Morling, N. Next generation sequencing and its applications in forensic genetics. Forensic Sci. Int. Genet. 18, 78–89 (2015). 777. Yang, Y., Xie, B. & Yan, J. Application of Next-generation Sequencing Technology in Forensic Science. Genomics. Proteomics Bioinformatics 12, 190–197 (2014). 778. Alvarez-Cubero, M. J. et al. Next generation sequencing: an application in forensic sciences? Ann. Hum. Biol. 44, 581–592 (2017). 779. Zhang, Y. et al. BBS mutations modify phenotypic expression of CEP290-related ciliopathies. Hum. Mol. Genet. 23, 40–51 (2014). 780. Genin, E., Feingold, J. & Clerget-Darpoux, F. Identifying modifier genes of monogenic disease: strategies and difficulties. Hum. Genet. 124, 357–368 (2008). 781. Iatropoulos, P. et al. Discordant phenotype in monozygotic twins with renal coloboma syndrome and a PAX2 mutation. Pediatr. Nephrol. 27, 1989–1993 (2012). 782. Renkema, K. Y., Stokman, M. F., Giles, R. H. & Knoers, N. V. A. M. Next-generation sequencing for research and diagnostics in kidney disease. Nat. Rev. Nephrol. 10, 433 (2014). 783. Yohe, S. & Thyagarajan, B. Review of Clinical Next-Generation Sequencing. Arch. Pathol. Lab. Med. 141, 1544–1557 (2017). 784. Znaor, A., Lortet-Tieulent, J., Laversanne, M., Jemal, A. & Bray, F. International Variations and Trends in Renal Cell Carcinoma Incidence and Mortality. Eur. Urol. (2014). 785. Pezzolesi, M. G., Skupien, J., Mychaleckyj, J. C., Warram, J. H. & Krolewski, A. S. Insights to the genetics of diabetic nephropathy through a genome-wide association study of the GoKinD collection. Semin. Nephrol. 30, 126–140 (2010). 786. Pezzolesi, M. G. et al. Genome-wide association scan for diabetic nephropathy susceptibility genes in type 1 diabetes. Diabetes 58, 1403–1410 (2009). 787. Teumer, A. et al. Genome-wide Association Studies Identify Genetic Loci Associated With Albuminuria in Diabetes. Diabetes 65, 803–817 (2016). 788. Iyengar, S. K. et al. Genome-Wide Association and Trans-ethnic Meta-Analysis for Advanced Diabetic Kidney Disease: Family Investigation of Nephropathy and Diabetes (FIND). PLoS Genet. 11, e1005352 (2015). 789. Tanaka, N. et al. Association of Solute Carrier Family 12 (Sodium/Chloride) Member 3 With Diabetic Nephropathy, Identified by Genome-Wide Analyses of Single Nucleotide Polymorphisms. Diabetes 52, 2848–2853 (2003). 790. Mueller, P. W. et al. Genetics of Kidneys in Diabetes (GoKinD) Study: A Genetics Collection Available for Identifying Genetic Susceptibility Factors for Diabetic Nephropathy in Type 1 Diabetes. J. Am. Soc. Nephrol. 17, 1782–1790 (2006). 791. Pezzolesi, M. G. et al. An intergenic region on chromosome 13q33.3 is associated with the susceptibility to kidney disease in type 1 and 2 diabetes. Kidney Int. 80, 105–111 (2011). 792. Maeda, S. et al. Replication study for the association between four Loci identified by a genome-wide association study on European American subjects with type 1 diabetes and susceptibility to diabetic nephropathy in Japanese subjects with type 2 diabetes. Diabetes 59, 2075–2079 (2010). 793. Sandholm, N. et al. Chromosome 2q31.1 associates with ESRD in women with type 1 diabetes. J. Am. Soc. Nephrol. 24, 1537–1543 (2013). 794. Sandholm, N. et al. Confirmation of GLRA3 as a susceptibility locus for albuminuria in Finnish patients with type 1 diabetes. Sci. Rep. 8, 12408 (2018). 795. Genovese, G. et al. Association of trypanolytic APOL1 variants with kidney disease in African Americans. Science 329, 841–845 (2010). 796. Kao, W. H. L. et al. MYH9 is associated with nondiabetic end-stage renal disease in African Americans. Nat. Genet. 40, 1185–1192 (2008).

267 797. Kopp, J. B. et al. MYH9 is a major-effect risk gene for focal segmental glomerulosclerosis. Nat. Genet. 40, 1175– 1184 (2008). 798. Pattaro, C. et al. Genetic associations at 53 loci highlight cell types and biological pathways relevant for kidney function. Nat. Commun. 7, 10023 (2016). 799. Sandholm, N. & Groop, P.-H. Genetic basis of diabetic kidney disease and other diabetic complications. Curr. Opin. Genet. Dev. 50, 17–24 (2018). 800. Yoo, Y. J. et al. Were genome-wide linkage studies a waste of time? Exploiting candidate regions within genome- wide association studies. Genet. Epidemiol. 34, 107–118 (2010). 801. Benson, M. D., Khor, C. C., Gage, P. J. & Lehmann, O. J. A targeted approach to genome-wide studies reveals new genetic associations with central corneal thickness. Mol. Vis. 23, 952–962 (2017). 802. Brown, S. M. et al. Polymorphisms in key pulmonary inflammatory pathways and the development of acute respiratory distress syndrome. Exp. Lung Res. 41, 155–162 (2015). 803. Park, S. L. et al. Association of cancer susceptibility variants with risk of multiple primary cancers: The population architecture using genomics and epidemiology study. Cancer Epidemiol. Biomarkers Prev. 23, 2568–2578 (2014). 804. Sharma, P. R. et al. An Islet-Targeted Genome-Wide Association Scan Identifies Novel Genes Implicated in Cytokine-Mediated Islet Stress in Type 2 Diabetes. Endocrinology 156, 3147–3156 (2015). 805. Tabebi, M. et al. Association study of apoptosis gene polymorphisms in mitochondrial diabetes: A potential role in the pathogenicity of MD. Gene 639, 18–26 (2018). 806. Vyshkina, T. et al. Association of common mitochondrial DNA variants with multiple sclerosis and systemic lupus erythematosus. Clin. Immunol. 129, 31–35 (2008). 807. Abrantes, P. et al. Mitochondrial genome association study with peripheral arterial disease and venous thromboembolism. Atherosclerosis 252, 97–105 (2016). 808. Rosa, A. et al. Ulcerative Colitis Is Under Dual (Mitochondrial and Nuclear) Genetic Control. Inflamm. Bowel Dis. 22, 774–781 (2016). 809. Dankowski, T. et al. Male-specific association between MT-ND4 11719 A/G polymorphism and ulcerative colitis: a mitochondria-wide genetic association study. BMC Gastroenterol. 16, 118 (2016). 810. Persad, P. J. et al. Joint Analysis of Nuclear and Mitochondrial Variants in Age-Related Macular Degeneration Identifies Novel Loci TRPM1 and ABHD2/RLBP1. Invest. Ophthalmol. Vis. Sci. 58, 4027–4038 (2017). 811. Flaquer, A. et al. Mitochondrial genetic variants identified to be associated with posttraumatic stress disorder. Transl. Psychiatry 5, e524 (2015). 812. Tranah, G. J. et al. Mitochondrial DNA sequence variation in multiple sclerosis. Neurology 85, 325–330 (2015). 813. Mohideen, A. M. S. H., Dicks, E., Parfrey, P., Green, R. & Savas, S. Mitochondrial DNA polymorphisms, its copy number change and outcome in colorectal cancer. BMC Res. Notes 8, 272 (2015). 814. Knoll, N. et al. Mitochondrial DNA variants in obesity. PLoS One 9, e94882 (2014). 815. Xu, J. et al. Single nucleotide polymorphisms in the D-loop region of mitochondrial DNA is associated with the kidney survival time in chronic kidney disease patients. Ren. Fail. 37, 108–112 (2015). 816. Calvo, S. E., Clauser, K. R. & Mootha, V. K. MitoCarta2.0: an updated inventory of mammalian mitochondrial proteins. Nucleic Acids Res. 44, D1251-7 (2016). 817. Pagliarini, D. J. et al. A mitochondrial protein compendium elucidates complex I disease biology. Cell 134, 112–123 (2008). 818. Cotter, D., Guda, P., Fahy, E. & Subramaniam, S. MitoProteome: mitochondrial protein sequence database and annotation system. Nucleic Acids Res. 32, D463-7 (2004). 819. Taylor, S. W. et al. Characterization of the human heart mitochondrial proteome. Nat. Biotechnol. 21, 281–286 (2003). 820. Smith, A. C. & Robinson, A. J. MitoMiner v3.1, an update on the mitochondrial proteomics database. Nucleic Acids Res. 44, D1258-61 (2016). 821. Brandon, M. C. et al. MITOMAP: a human mitochondrial genome database—2004 update. Nucleic Acids Res. 33, D611–D613 (2005). 822. Zerbino, D. R. et al. Ensembl 2018. Nucleic Acids Res. 46, D754–D761 (2018).

268 823. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45, D158–D169 (2017). 824. Wain, H. M. et al. Guidelines for human gene nomenclature. Genomics 79, 464–470 (2002). 825. Purcell, S. & Chang, C. PLINK [1.09]. (2019). 826. McKnight, A. J. et al. A GREM1 gene variant associates with diabetic nephropathy. J. Am. Soc. Nephrol. 21, 773– 781 (2010). 827. Pruim, R. J. et al. LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics 26, 2336–2337 (2010). 828. Clark, C. P. et al. LocusZoom.js: Web-based plugin for interactive analysis of genome and phenome wide association studiesTitle. (2016). 829. Shea, L. & Ricchetti, C. Chronic kidney disease and hypertension: A destructive combination. U. S. pharmacist 37, (2012). 830. Keane, W. F. & Eknoyan, G. Proteinuria, albuminuria, risk, assessment, detection, elimination (PARADE): a position paper of the National Kidney Foundation. Am. J. Kidney Dis. 33, 1004–1010 (1999). 831. Yoshioka, T., Rennke, H. G., Salant, D. J., Deen, W. M. & Ichikawa, I. Role of abnormally high transmural pressure in the permselectivity defect of glomerular capillary wall: a study in early passive Heymann nephritis. Circ. Res. 61, 531–538 (1987). 832. Van Haute, L. et al. Mitochondrial transcript maturation and its disorders. J. Inherit. Metab. Dis. 38, 655–680 (2015). 833. Pearce, S. F. et al. Regulation of Mammalian Mitochondrial Gene Expression: Recent Advances. Trends Biochem. Sci. 42, 625–639 (2017). 834. Sofou, K. et al. Whole exome sequencing reveals mutations in NARS2 and PARS2, encoding the mitochondrial asparaginyl-tRNA synthetase and prolyl-tRNA synthetase, in patients with Alpers syndrome. Mol. Genet. genomic Med. 3, 59–68 (2015). 835. Musante, L. et al. Mutations of the aminoacyl-tRNA-synthetases SARS and WARS2 are implicated in the etiology of autosomal recessive intellectual disability. Hum. Mutat. 38, 621–636 (2017). 836. Diodato, D., Ghezzi, D. & Tiranti, V. The Mitochondrial Aminoacyl tRNA Synthetases: Genes and Syndromes. Int. J. Cell Biol. 2014, 787956 (2014). 837. Schwartzentruber, J. et al. Mutation in the nuclear-encoded mitochondrial isoleucyl-tRNA synthetase IARS2 in patients with cataracts, growth hormone deficiency with short stature, partial sensorineural deafness, and peripheral neuropathy or with Leigh syndrome. Hum. Mutat. 35, 1285–1289 (2014). 838. Hallmann, K. et al. A homozygous splice-site mutation in CARS2 is associated with progressive myoclonic epilepsy. Neurology 83, 2183–2187 (2014). 839. Raviglione, F. et al. Clinical findings in a patient with FARS2 mutations and early-infantile-encephalopathy with epilepsy. Am. J. Med. Genet. A 170, 3004–3007 (2016). 840. Cho, J. S. et al. FARS2 mutation and epilepsy: Possible link with early-onset epileptic encephalopathy. Epilepsy Res. 129, 118–124 (2017). 841. Shamseldin, H. E. et al. Genomic analysis of mitochondrial diseases in a consanguineous population reveals novel candidate disease genes. J. Med. Genet. 49, 234–241 (2012). 842. Elo, J. M. et al. Mitochondrial phenylalanyl-tRNA synthetase mutations underlie fatal infantile Alpers encephalopathy. Hum. Mol. Genet. 21, 4521–4529 (2012). 843. Almalki, A. et al. Mutation of the human mitochondrial phenylalanine-tRNA synthetase causes infantile-onset epilepsy and cytochrome c oxidase deficiency. Biochim. Biophys. Acta 1842, 56–64 (2014). 844. Yang, Y. et al. A Newly Identified Missense Mutation in FARS2 Causes Autosomal-Recessive Spastic Paraplegia. Hum. Mutat. 37, 165–169 (2016). 845. Vernon, H. J., McClellan, R., Batista, D. A. S. & Naidu, S. Mutations in FARS2 and non-fatal mitochondrial dysfunction in two siblings. Am. J. Med. Genet. A 167A, 1147–1151 (2015). 846. Zhang, J. et al. Keratin 4 regulates the development of human white sponge nevus. J. oral Pathol. Med. Off. Publ. Int. Assoc. Oral Pathol. Am. Acad. Oral Pathol. 47, 598–605 (2018). 847. Kurklu, E. et al. Clinical features and molecular genetic analysis in a Turkish family with oral white sponge nevus. Med. Oral Patol. Oral Cir. Bucal 23, e144–e150 (2018).

269 848. Cai, W. et al. Current approaches to the diagnosis and treatment of white sponge nevus. Expert Rev. Mol. Med. 17, e9 (2015). 849. Elsnerova, K. et al. Gene expression of membrane transporters: Importance for prognosis and progression of ovarian carcinoma. Oncol. Rep. 35, 2159–2170 (2016). 850. Hedditch, E. L. et al. ABCA transporter gene expression and poor outcome in epithelial ovarian cancer. J. Natl. Cancer Inst. 106, (2014). 851. Chen, K. G., Valencia, J. C., Gillet, J.-P., Hearing, V. J. & Gottesman, M. M. Involvement of ABC transporters in melanogenesis and the development of multidrug resistance of melanoma. Pigment Cell Melanoma Res. 22, 740– 749 (2009). 852. Kap, E. J. et al. SNPs in transporter and metabolizing genes as predictive markers for oxaliplatin treatment in colorectal cancer patients. Int. J. cancer 138, 2993–3001 (2016). 853. The Broad Institute of MIT and Harvard. rs72851722. GTEx Portal (2019). Available at: https://gtexportal.org/home/snp/rs72851722. (Accessed: 28th September 2019) 854. Lobley, A., Pierron, V., Reynolds, L., Allen, L. & Michalovich, D. Identification of human and mouse CatSper3 and CatSper4 genes: characterisation of a common interaction domain and evidence for expression in testis. Reprod. Biol. Endocrinol. 1, 53 (2003). 855. Jin, J.-L. et al. Catsper3 and encode two cation channel-like proteins exclusively expressed in the testis. Biol. Reprod. 73, 1235–1242 (2005). 856. Hildebrand, M. S. et al. Genetic male infertility and mutation of CATSPER ion channels. Eur. J. Hum. Genet. 18, 1178–1184 (2010). 857. Zhang, Y. et al. Sensorineural deafness and male infertility: a contiguous gene deletion syndrome. J. Med. Genet. 44, 233–240 (2007). 858. The Broad Institute of MIT and Harvard. rs17727871. GTEx Portal (2019). Available at: https://gtexportal.org/home/snp/rs17727871. (Accessed: 28th September 2019) 859. Ma, Z.-W. & Liu, D.-X. Humanin decreases mitochondrial membrane permeability by inhibiting the membrane association and oligomerization of Bax and Bid proteins. Acta Pharmacol. Sin. 39, 1012–1021 (2018). 860. Guo, B. et al. Humanin peptide suppresses apoptosis by interfering with Bax activation. Nature 423, 456–461 (2003). 861. Hashimoto, Y. et al. A rescue factor abolishing neuronal cell death by a wide spectrum of familial Alzheimer’s disease genes and Abeta. Proc. Natl. Acad. Sci. U. S. A. 98, 6336–6341 (2001). 862. Tajima, H. et al. Evidence for in vivo production of Humanin peptide, a neuroprotective factor against Alzheimer’s disease-related insults. Neurosci. Lett. 324, 227–231 (2002). 863. Romeo, M. et al. Humanin Specifically Interacts with Amyloid-beta Oligomers and Counteracts Their in vivo Toxicity. J. Alzheimers. Dis. 57, 857–871 (2017). 864. Minasyan, L., Sreekumar, P. G., Hinton, D. R. & Kannan, R. Protective Mechanisms of the Mitochondrial-Derived Peptide Humanin in Oxidative and Endoplasmic Reticulum Stress in RPE Cells. Oxid. Med. Cell. Longev. 2017, 1675230 (2017). 865. Sreekumar, P. G. et al. The Mitochondrial-Derived Peptide Humanin Protects RPE Cells From Oxidative Stress, Senescence, and Mitochondrial Dysfunction. Invest. Ophthalmol. Vis. Sci. 57, 1238–1253 (2016). 866. Nashine, S. et al. Humanin G (HNG) protects age-related macular degeneration (AMD) transmitochondrial ARPE-19 cybrids from mitochondrial and cellular damage. Cell Death Dis. 8, e2951 (2017). 867. Solanki, A. et al. Humanin Nanoparticles for Reducing Pathological Factors Characteristic of Age-Related Macular Degeneration. Curr. Drug Deliv. 16, 226–232 (2019). 868. Voigt, A. & Jelinek, H. F. Humanin: a mitochondrial signaling peptide as a biomarker for impaired fasting glucose- related oxidative stress. Physiol. Rep. 4, (2016). 869. Singh, B. K. & Mascarenhas, D. D. Bioactive peptides control receptor for advanced glycated end product-induced elevation of kidney insulin receptor substrate 2 and reduce albuminuria in diabetic mice. Am. J. Nephrol. 28, 890– 899 (2008). 870. Zhang, X. et al. Humanin prevents intra-renal microvascular remodeling and inflammation in hypercholesterolemic ApoE deficient mice. Life Sci. 91, 199–206 (2012).

270 871. Nashine, S., Cohen, P., Nesburn, A. B., Kuppermann, B. D. & Kenney, M. C. Characterizing the protective effects of SHLP2, a mitochondrial-derived peptide, in macular degeneration. Sci. Rep. 8, 15175 (2018). 872. Restrepo, N. A. et al. Mitochondrial variation and the risk of age-related macular degeneration across diverse populations. Pac. Symp. Biocomput. 243–254 (2015). 873. Li, S. et al. Mitochondrial Dysfunctions Contribute to Hypertrophic Cardiomyopathy in Patient iPSC-Derived Cardiomyocytes with MT-RNR2 Mutation. Stem cell reports 10, 808–821 (2018). 874. Lim, S. C. et al. Loss of mitochondrial DNA-encoded protein ND1 results in disruption of complex I biogenesis during early stages of assembly. FASEB J. Off. Publ. Fed. Am. Soc. Exp. Biol. 30, 2236–2248 (2016). 875. Martinez-Romero, I. et al. New MT-ND1 pathologic mutation for Leber hereditary optic neuropathy. Clin. Experiment. Ophthalmol. 42, 856–864 (2014). 876. Sheremet, N. L. et al. Previously Unclassified Mutation of mtDNA m.3472T>C: Evidence of Pathogenicity in Leber’s Hereditary Optic Neuropathy. Biochemistry. (Mosc). 81, 748–754 (2016). 877. Yu-Wai-Man, P., Griffiths, P. G., Hudson, G. & Chinnery, P. F. Inherited mitochondrial optic neuropathies. J. Med. Genet. 46, 145 LP – 158 (2009). 878. Ji, Y. et al. Mitochondrial ND1 Variants in 1281 Chinese Subjects With Leber’s Hereditary Optic Neuropathy. Invest. Ophthalmol. Vis. Sci. 57, 2377–2389 (2016). 879. Caporali, L. et al. Cybrid studies establish the causal link between the mtDNA m.3890G>A/MT-ND1 mutation and optic atrophy with bilateral brainstem lesions. Biochim. Biophys. Acta 1832, 445–452 (2013). 880. Murray, J. J., Nolan, K. W., McClelland, C. & Lee, M. S. Leber Hereditary Optic Neuropathy: Visual Recovery in a Patient With the Rare m.3890G>A Point Mutation. J. Neuroophthalmol. 37, 166–171 (2017). 881. Achilli, A. et al. Rare primary mitochondrial DNA mutations and probable synergistic variants in Leber’s hereditary optic neuropathy. PLoS One 7, e42242 (2012). 882. Zou, Y. et al. The MT-ND1 and MT-ND5 genes are mutational hotspots for Chinese families with clinical features of LHON but lacking the three primary mutations. Biochem. Biophys. Res. Commun. 399, 179–185 (2010). 883. Carreno-Gago, L. et al. Identification and characterization of the novel point mutation m.3634A>G in the mitochondrial MT-ND1 gene associated with LHON syndrome. Biochim. Biophys. acta. Mol. basis Dis. 1863, 182– 187 (2017). 884. Hudson, G. et al. Clinical Expression of Leber Hereditary Optic Neuropathy Is Affected by the Mitochondrial DNA– Haplogroup Background. Am. J. Hum. Genet. 81, 228–233 (2007). 885. Ji, Y. et al. Mitochondrial DNA Haplogroups M7b1′2 and M8a Affect Clinical Expression of Leber Hereditary Optic Neuropathy in Chinese Families with the m.11778G→A Mutation. Am. J. Hum. Genet. 83, 760–768 (2008). 886. Patsi, J. et al. LHON/MELAS overlap mutation in ND1 subunit of mitochondrial complex I affects ubiquinone binding as revealed by modeling in Escherichia coli NDH-1. Biochim. Biophys. Acta 1817, 312–318 (2012). 887. Blakely, E. L. et al. LHON/MELAS overlap syndrome associated with a mitochondrial MTND1 gene mutation. Eur. J. Hum. Genet. 13, 623–627 (2005). 888. Lin, J. et al. Novel mutations m.3959G>A and m.3995A>G in mitochondrial gene MT-ND1 associated with MELAS. Mitochondrial DNA 25, 56–62 (2014). 889. Horvath, R., Reilmann, R., Holinski-Feder, E., Ringelstein, E. B. & Klopstock, T. The role of complex I genes in MELAS: a novel heteroplasmic mutation 3380G>A in ND1 of mtDNA. Neuromuscul. Disord. 18, 553–556 (2008). 890. Negishi, Y. et al. Homoplasmy of a mitochondrial 3697G>A mutation causes Leigh syndrome. J. Hum. Genet. 59, 405–407 (2014). 891. Spangenberg, L. et al. 3697G>A in MT-ND1 is a causative mutation in mitochondrial disease. Mitochondrion 28, 54–59 (2016). 892. Spruijt, L. et al. A MELAS-associated ND1 mutation causing leber hereditary optic neuropathy and spastic dystonia. Arch. Neurol. 64, 890–893 (2007). 893. Wray, C. D. et al. A new mutation in MT-ND1 m.3928G>C p.V208L causes Leigh disease with infantile spasms. Mitochondrion 13, 656–661 (2013). 894. Delmiro, A. et al. Whole-exome sequencing identifies a variant of the mitochondrial MT-ND1 gene associated with epileptic encephalopathy: west syndrome evolving to Lennox-Gastaut syndrome. Hum. Mutat. 34, 1623–1627 (2013).

271 895. Gutierrez Cortes, N. et al. Novel mitochondrial DNA mutations responsible for maternally inherited nonsyndromic hearing loss. Hum. Mutat. 33, 681–689 (2012). 896. Iommarini, L. et al. Different mtDNA mutations modify tumor progression in dependence of the degree of respiratory complex I impairment. Hum. Mol. Genet. 23, 1453–1466 (2014). 897. Flaquer, A. et al. Mitochondrial genetic variants identified to be associated with BMI in adults. PLoS One 9, e105116 (2014). 898. Endres, D. et al. New Variant of MELAS Syndrome With Executive Dysfunction, Heteroplasmic Point Mutation in the MT-ND4 Gene (m.12015T>C; p.Leu419Pro) and Comorbid Polyglandular Autoimmune Syndrome Type 2. Frontiers in immunology 10, 412 (2019). 899. Catarino, C. B. et al. Characterization of a Leber’s hereditary optic neuropathy (LHON) family harboring two primary LHON mutations m.11778G>A and m.14484T>C of the mitochondrial DNA. Mitochondrion 36, 15–20 (2017). 900. Zhang, J. et al. Mitochondrial haplotypes may modulate the phenotypic manifestation of the LHON-associated m.14484T>C (MT-ND6) mutation in Chinese families. Mitochondrion 13, 772–781 (2013). 901. Gomez-Zaera, M. et al. Identification of somatic and germline mitochondrial DNA sequence variants in prostate cancer patients. Mutat. Res. 595, 42–51 (2006). 902. Seong, M.-W., Choi, J., Park, S. S., Kim, J. Y. & Hwang, J.-M. Novel MT-ND5 gene mutation identified in Leber’s hereditary optic neuropathy patient using mitochondrial genome sequencing. Journal of the neurological sciences 375, 301–303 (2017). 903. Han, J. et al. Ophthalmological manifestations in patients with Leigh syndrome. Br. J. Ophthalmol. 99, 528–535 (2015). 904. Zhou, N. et al. Whole-exome sequencing reveals a novel mutation of MT-ND5 gene in a mitochondrial cardiomyopathy pedigree: Patients who show biventricular hypertrophy, hyperlactacidemia, pulmonary hypertension, and decreased exercise tolerance. Anatol. J. Cardiol. 21, 18–24 (2019). 905. Ayalon, N., Flore, L. A., Christensen, T. G. & Sam, F. Mitochondrial encoded NADH dehydrogenase 5 (MT-ND5) gene point mutation presents as late onset cardiomyopathy. International journal of cardiology 167, e143-5 (2013). 906. Liu, Z.-W. et al. High incidence of coding gene mutations in mitochondrial DNA in esophageal cancer. Mol. Med. Rep. 16, 8537–8541 (2017). 907. Kraja, A. T. et al. Associations of Mitochondrial and Nuclear Mitochondrial Variants and Genes with Seven Metabolic Traits. Am. J. Hum. Genet. 104, 112–138 (2019). 908. Blanchet, L., Buydens, M. C., Smeitink, J. A. M., Willems, P. H. G. M. & Koopman, W. J. H. Isolated mitochondrial complex I deficiency: explorative data analysis of patient cell parameters. Curr. Pharm. Des. 17, 4023–4033 (2011). 909. Distelmaier, F. et al. Mitochondrial complex I deficiency: from organelle dysfunction to clinical disease. Brain 132, 833–842 (2009). 910. McFarland, R., Taylor, R. W. & Turnbull, D. M. A neurological perspective on mitochondrial disease. Lancet. Neurol. 9, 829–840 (2010). 911. Hinokio, Y. et al. A new mitochondrial DNA deletion associated with diabetic amyotrophy, diabetic myoatrophy and diabetic fatty liver. Muscle Nerve. Suppl. 3, S142-9 (1995). 912. Ha, H. & Lee, H. B. Reactive oxygen species as glucose signaling molecules in mesangial cells cultured under high glucose. Kidney Int. 58, S19–S25 (2000). 913. Ha, H., Yu, M. R., Choi, Y. J., Kitamura, M. & Lee, H. B. Role of High Glucose-Induced Nuclear Factor-κB Activation in Monocyte Chemoattractant Protein-1 Expression by Mesangial Cells. J. Am. Soc. Nephrol. 13, 894–902 (2002). 914. Iglesias-De La Cruz, M. C. et al. Hydrogen peroxide increases extracellular matrix mRNA through TGF-beta in human mesangial cells. Kidney Int. 59, 87–95 (2001). 915. Studer, R. K., Craven, P. A. & DeRubertis, F. R. Antioxidant inhibition of protein kinase C-signaled increases in transforming growth factor-beta in mesangial cells. Metabolism 46, (1997). 916. Ha, H., Lee, S. H. & Kim, K. H. Effects of Rebamipide in a Model of Experimental Diabetes and on the Synthesis of Transforming Growth Factor- ␤ and Fibronectin , and Lipid Peroxidation Induced by High Glucose in Cultured Mesangial Cells 1. 281, 1457–1462 (1997). 917. Coughlan, M. T. et al. RAGE-Induced Cytosolic ROS Promote Mitochondrial Superoxide Generation in Diabetes. J.

272 Am. Soc. Nephrol. 20, 742 LP – 752 (2009). 918. Galvan, D. L. et al. Real-time in vivo mitochondrial redox assessment confirms enhanced mitochondrial reactive oxygen species in diabetic nephropathy. Kidney Int. 92, 1282–1287 (2017). 919. Effect of intensive blood-glucose control with metformin on complications in overweight patients with type 2 diabetes (UKPDS 34). UK Prospective Diabetes Study (UKPDS) Group. Lancet (London, England) 352, 854–865 (1998). 920. Hou, W.-L. et al. Inhibition of mitochondrial complex I improves glucose metabolism independently of AMPK activation. J. Cell. Mol. Med. 22, 1316–1328 (2018). 921. Gorin, Y. The Kidney: An Organ in the Front Line of Oxidative Stress-Associated Pathologies. Antioxid. Redox Signal. 25, 639–641 (2016). 922. Jha, J. C., Banal, C., Chow, B. S. M., Cooper, M. E. & Jandeleit-Dahm, K. Diabetes and Kidney Disease: Role of Oxidative Stress. Antioxid. Redox Signal. 25, 657–684 (2016). 923. Forbes, J. M., Coughlan, M. T. & Cooper, M. E. Oxidative Stress as a Major Culprit in Kidney Disease in Diabetes. Diabetes 57, 1446 LP – 1454 (2008). 924. Lehmann, J. & Libchaber, A. Degeneracy of the genetic code and stability of the base pair at the second position of the anticodon. RNA 14, 1264–1269 (2008). 925. Bennetzen, J. L. & Hall, B. D. Codon selection in yeast. J. Biol. Chem. 257, 3026–3031 (1982). 926. Karchin, R. et al. LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources. Bioinformatics 21, 2814–2820 (2005). 927. Wellcome Trust Case Control Consortium. Association scan of 14,500 nonsynonymous SNPs in four diseases identifies autoimmunity variants. Nat. Genet. 39, 1329–1337 (2007). 928. Hampe, J. et al. A genome-wide association scan of nonsynonymous SNPs identifies a susceptibility variant for Crohn disease in ATG16L1. Nat. Genet. 39, 207–211 (2007). 929. Smyth, D. J. et al. A genome-wide association study of nonsynonymous SNPs identifies a type 1 diabetes locus in the interferon-induced helicase (IFIH1) region. Nat. Genet. 38, 617–619 (2006). 930. Savage, D. A. et al. Genetic association analyses of non-synonymous single nucleotide polymorphisms in diabetic nephropathy. Diabetologia 51, 1998–2002 (2008). 931. Kudla, G., Murray, A. W., Tollervey, D. & Plotkin, J. B. Coding-Sequence Determinants of Gene Expression in Escherichia coli. Science (80-. ). 324, 255 LP – 258 (2009). 932. Hamasaki-Katagiri, N. et al. The importance of mRNA structure in determining the pathogenicity of synonymous and non-synonymous mutations in haemophilia. Haemophilia 23, e8–e17 (2017). 933. Curran, J. F. & Yarus, M. Rates of aminoacyl-tRNA selection at 29 sense codons in vivo. J. Mol. Biol. 209, 65–77 (1989). 934. Cho, B.-K. et al. The transcription unit architecture of the Escherichia coli genome. Nat. Biotechnol. 27, 1043 (2009). 935. Li, G.-W., Oh, E. & Weissman, J. S. The anti-Shine–Dalgarno sequence drives translational pausing and codon choice in bacteria. Nature 484, 538 (2012). 936. Sørensen, M. A. & Pedersen, S. Absolute in vivo translation rates of individual codons in Escherichia coli: The two glutamic acid codons GAA and GAG are translated with a threefold difference in rate. J. Mol. Biol. 222, 265–280 (1991). 937. Quax, T. E. F., Claassens, N. J., Söll, D. & van der Oost, J. Codon Bias as a Means to Fine-Tune Gene Expression. Mol. Cell 59, 149–161 (2015). 938. Zhang, G., Hubalewska, M. & Ignatova, Z. Transient ribosomal attenuation coordinates protein synthesis and co- translational folding. Nat. Struct. &Amp; Mol. Biol. 16, 274 (2009). 939. Kimchi-Sarfaty, C. et al. A ‘silent’ polymorphism in the MDR1 gene changes substrate specificity. Science 315, 525– 528 (2007). 940. Mittal, P., Brindle, J., Stephen, J., Plotkin, J. B. & Kudla, G. Codon usage influences fitness through RNA toxicity. Proc. Natl. Acad. Sci. 115, 8639 LP – 8644 (2018). 941. Fredens, J. et al. Total synthesis of Escherichia coli with a recoded genome. Nature 569, 514–518 (2019).

273 942. Chamary, J. V & Hurst, L. D. The Price of Silent Mutations. Sci. Am. 300, 46–53 (2009). 943. Schattner, P. & Diekhans, M. Regions of extreme synonymous codon selection in mammalian genes. Nucleic Acids Res. 34, 1700–1710 (2006). 944. Macaya, D. et al. A synonymous mutation in TCOF1 causes Treacher Collins syndrome due to mis-splicing of a constitutive exon. Am. J. Med. Genet. Part A 149A, 1624–1627 (2009). 945. Ramser, J. et al. Rare missense and synonymous variants in UBE1 are associated with X-linked infantile spinal muscular atrophy. Am. J. Hum. Genet. 82, 188–193 (2008). 946. O’Driscoll, M., Ruiz-Perez, V. L., Woods, C. G., Jeggo, P. A. & Goodship, J. A. A splicing mutation affecting expression of ataxia–telangiectasia and Rad3–related protein (ATR) results in Seckel syndrome. Nat. Genet. 33, 497–501 (2003). 947. Bartoszewski, R. A. et al. A synonymous single nucleotide polymorphism in DeltaF508 CFTR alters the secondary structure of the mRNA and the expression of the mutant protein. J. Biol. Chem. 285, 28741–28748 (2010). 948. Stark, K. et al. Genetic association study identifies HSPB7 as a risk gene for idiopathic dilated cardiomyopathy. PLoS Genet. 6, e1001167 (2010). 949. Ho, P. A. et al. WT1 synonymous single nucleotide polymorphism rs16754 correlates with higher mRNA expression and predicts significantly improved outcome in favorable-risk pediatric acute myeloid leukemia: a report from the children’s oncology group. J. Clin. Oncol. 29, 704–711 (2011). 950. Ambudkar, S. V, Kim, I.-W. & Sauna, Z. E. The power of the pump: mechanisms of action of P-glycoprotein (ABCB1). Eur. J. Pharm. Sci. 27, 392–400 (2006). 951. van der Veldt, A. A. M. et al. Genetic polymorphisms associated with a prolonged progression-free survival in patients with metastatic renal cell cancer treated with sunitinib. Clin. Cancer Res. 17, 620–629 (2011). 952. Herrlinger, K. R. et al. ABCB1 single-nucleotide polymorphisms determine tacrolimus response in patients with ulcerative colitis. Clin. Pharmacol. Ther. 89, 422–428 (2011). 953. Livingstone, M. et al. Investigating DNA-, RNA-, and protein-based features as a means to discriminate pathogenic synonymous variants. Hum. Mutat. 38, 1336–1347 (2017). 954. Sauna, Z. E. & Kimchi-Sarfaty, C. Understanding the contribution of synonymous mutations to human disease. Nat. Rev. Genet. 12, 683 (2011). 955. Treutlein, J. et al. Genome-wide Association Study of Alcohol DependenceGWAS of Alcohol Dependence. JAMA Psychiatry 66, 773–784 (2009). 956. Glinskii, A. B. et al. Identification of intergenic trans-regulatory RNAs containing a disease-linked SNP sequence and targeting cell cycle progression/differentiation pathways in multiple common human disorders. Cell Cycle 8, 3925–3942 (2009). 957. Gibert, J.-M. et al. Strong epistatic and additive effects of linked candidate SNPs for Drosophila pigmentation have implications for analysis of genome-wide association studies results. Genome Biol. 18, 126 (2017). 958. Zanetti, D. & Weale, M. E. Transethnic differences in GWAS signals: A simulation study. Ann. Hum. Genet. 82, 280– 286 (2018). 959. Lin, P.-I., Vance, J. M., Pericak-Vance, M. A. & Martin, E. R. No Gene Is an Island: The Flip-Flop Phenomenon. Am. J. Hum. Genet. 80, 531–538 (2007). 960. Wallace, D. C. A mitochondrial paradigm of metabolic and degenerative diseases, aging, and cancer: a dawn for evolutionary medicine. Annu. Rev. Genet. 39, 359–407 (2005). 961. Kenney, M. C., Ferrington, D. A. & Udar, N. Mitochondrial Genetics of Retinal Disease. in Retina (ed. Sadda, S.) 635–641 (W.B. Saunders, 2013). 962. Castro, M. G. et al. Mitochondrial DNA haplogroups in Spanish patients with hypertrophic cardiomyopathy. Int. J. Cardiol. 112, 202–206 (2006). 963. Torroni, A. et al. Classification of European mtDNAs from an analysis of three European populations. Genetics 144, 1835–1850 (1996). 964. Ruiz-Pesini, E., Mishmar, D., Brandon, M., Procaccio, V. & Wallace, D. C. Effects of purifying and adaptive selection on regional variation in human mtDNA. Science 303, 223–226 (2004). 965. Mishmar, D. et al. Natural selection shaped regional mtDNA variation in humans. Proc. Natl. Acad. Sci. 100, 171 LP – 176 (2003).

274 966. van der Walt, J. M. et al. Mitochondrial polymorphisms significantly reduce the risk of Parkinson disease. Am. J. Hum. Genet. 72, 804–811 (2003). 967. van der Walt, J. M. et al. Analysis of European mitochondrial haplogroups with Alzheimer disease risk. Neurosci. Lett. 365, 28–32 (2004). 968. Ridge, P. G., Ebbert, M. T. W. & Kauwe, J. S. K. Genetics of Alzheimer’s disease. Biomed Res. Int. 2013, 254954 (2013). 969. Hassani-Kumleh, H. et al. Mitochondrial D-loop variation in Persian multiple sclerosis patients: K and A haplogroups as a risk factor!! Cell. Mol. Neurobiol. 26, 119–125 (2006). 970. Dunaief, J. L., Dentchev, T., Ying, G.-S. & Milam, A. H. The role of apoptosis in age-related macular degeneration. Arch. Ophthalmol. (Chicago, Ill. 1960) 120, 1435–1442 (2002). 971. Huerta, C. et al. Mitochondrial DNA polymorphisms and risk of Parkinson’s disease in Spanish population. J. Neurol. Sci. 236, 49–54 (2005). 972. Kenney, M. C. et al. Molecular and bioenergetic differences between cells with African versus European inherited mitochondrial DNA haplogroups: implications for population susceptibility to diseases. Biochim. Biophys. Acta 1842, 208–219 (2014). 973. Collins, D. W. et al. Association of primary open-angle glaucoma with mitochondrial variants and haplogroups common in African Americans. Mol. Vis. 22, 454–471 (2016). 974. Achilli, A. et al. Mitochondrial DNA backgrounds might modulate diabetes complications rather than T2DM as a whole. PLoS One 6, e21029 (2011). 975. Gray, L. J. et al. Screening for type 2 diabetes in a multiethnic setting using known risk factors to identify those at high risk: a cross-sectional study. Vasc. Health Risk Manag. 6, 837–842 (2010). 976. Lightstone, L. et al. High incidence of end-stage renal disease in Indo-Asians in the UK. QJM 88, 191–195 (1995). 977. Barbour, S. J., Er, L., Djurdjev, O., Karim, M. & Levin, A. Differences in progression of CKD and mortality amongst Caucasian, Oriental Asian and South Asian CKD patients. Nephrol. Dial. Transplant 25, 3663–3672 (2010). 978. Wilkinson, E. et al. Lack of awareness of kidney complications despite familiarity with diabetes: a multi-ethnic qualitative study. J. Ren. Care 37, 2–11 (2011). 979. Liu, D. et al. The modern spectrum of biopsy-proven renal disease in Chinese diabetic patients-a retrospective descriptive study. PeerJ 6, e4522 (2018). 980. Chang, T. I. et al. Renal outcomes in patients with type 2 diabetes with or without coexisting non-diabetic renal disease. Diabetes Res. Clin. Pract. 92, 198–204 (2011). 981. Chong, Y.-B. et al. Clinical predictors of non-diabetic renal disease and role of renal biopsy in diabetic patients with renal involvement: a single centre review. Ren. Fail. 34, 323–328 (2012). 982. Sharma, S. G. et al. The modern spectrum of renal biopsy findings in patients with diabetes. Clin. J. Am. Soc. Nephrol. 8, 1718–1724 (2013). 983. Lohmueller, K. E., Pearce, C. L., Pike, M., Lander, E. S. & Hirschhorn, J. N. Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nat. Genet. 33, 177–182 (2003). 984. Wacholder, S., Chanock, S., Garcia-Closas, M., El Ghormli, L. & Rothman, N. Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J. Natl. Cancer Inst. 96, 434–442 (2004). 985. Ewen Callaway. Genome studies attract criticism. Nature 546, 463 (2017). 986. Tam, V. et al. Benefits and limitations of genome-wide association studies. Nat. Rev. Genet. 20, 467–484 (2019). 987. Treangen, T. J. & Salzberg, S. L. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 13, 36–46 (2011). 988. Nicolaou, N., Renkema, K. Y., Bongers, E. M. H. F., Giles, R. H. & Knoers, N. V. A. M. Genetic, environmental, and epigenetic factors involved in CAKUT. Nat. Rev. Nephrol. 11, 720 (2015). 989. Gulati, A. & Somlo, S. Whole exome sequencing: a state-of-the-art approach for defining (and exploring!) genetic landscapes in pediatric nephrology. Pediatr. Nephrol. 33, 745–761 (2018). 990. Mann, N. et al. Whole-Exome Sequencing Enables a Precision Medicine Approach for Kidney Transplant Recipients. J. Am. Soc. Nephrol. 30, 201–215 (2019).

275 991. Warejko, J. K. et al. Whole Exome Sequencing of Patients with Steroid-Resistant Nephrotic Syndrome. Clin. J. Am. Soc. Nephrol. 13, 53–62 (2018). 992. Connaughton, D. M. et al. Monogenic causes of chronic kidney disease in adults. Kidney Int. 95, 914–928 (2019). 993. Ashraf, S. et al. ADCK4 mutations promote steroid-resistant nephrotic syndrome through CoQ10 biosynthesis disruption. J. Clin. Invest. 123, 5179–5189 (2013). 994. Lin, F. et al. Whole exome sequencing reveals novel COL4A3 and COL4A4mutations and resolves diagnosis in Chinese families with kidney disease. BMC Nephrol. 15, 175 (2014). 995. Braun, D. A. et al. Whole exome sequencing identifies causative mutations in the majority of consanguineous or familial cases with childhood-onset increased renal echogenicity. Kidney Int. 89, 468–475 (2016). 996. Mallawaarachchi, A. C. et al. Whole-genome sequencing overcomes pseudogene homology to diagnose autosomal dominant polycystic kidney disease. Eur. J. Hum. Genet. 24, 1584 (2016). 997. Cabezas, O. R. et al. Polycystic Kidney Disease with Hyperinsulinemic Hypoglycemia Caused by a Promoter Mutation in Phosphomannomutase 2. J. Am. Soc. Nephrol. 28, 2529–2539 (2017). 998. van der Ven, A. T. et al. Whole-Exome Sequencing Identifies Causative Mutations in Families with Congenital Anomalies of the Kidney and Urinary Tract. J. Am. Soc. Nephrol. 29, 2348 LP – 2361 (2018). 999. Yao, T. et al. Integration of Genetic Testing and Pathology for the Diagnosis of Adults with FSGS. Clin. J. Am. Soc. Nephrol. 14, 213–223 (2019). 1000. Groopman, E. E. et al. Diagnostic Utility of exome sequencing for kidney disease. N. Engl. J. Med. 380, 142–151 (2019). 1001. Kesselheim, A., Ashton, E. & Bockenhauer, D. Potential and pitfalls in the genetic diagnosis of kidney diseases. Clin. Kidney J. 10, 581–585 (2017). 1002. Lu, J. T., Campeau, P. M. & Lee, B. H. Genotype-phenotype correlation--promiscuity in the era of next-generation sequencing. N. Engl. J. Med. 371, 593–596 (2014). 1003. Larsen, C. P., Durfee, T., Wilson, J. D. & Beggs, M. L. A Custom Targeted Next-Generation Sequencing Gene Panel for the Diagnosis of Genetic Nephropathies. American journal of kidney diseases : the official journal of the National Kidney Foundation 67, 992–993 (2016). 1004. Bailie, C., Kilner, J., Maxwell, A. P. & McKnight, A. J. Development of next generation sequencing panel for UMOD and association with kidney disease. PLoS One 12, e0178321 (2017). 1005. Lata, S. et al. Whole-Exome Sequencing in Adults With Chronic Kidney Disease: A Pilot Study. Ann. Intern. Med. 168, 100–109 (2018). 1006. Guan, M. et al. An Exome-wide Association Study for Type 2 Diabetes-Attributed End-Stage Kidney Disease in African Americans. Kidney Int. reports 3, 867–878 (2018). 1007. Rasouly, H. M. et al. The Burden of Candidate Pathogenic Variants for Kidney and Genitourinary Disorders Emerging From Exome Sequencing. Ann. Intern. Med. (2018). 1008. Mutz, K.-O., Heilkenbrinker, A., Lonne, M., Walter, J.-G. & Stahl, F. Transcriptome analysis using next-generation sequencing. Curr. Opin. Biotechnol. 24, 22–30 (2013). 1009. Hajkova, P. et al. DNA-methylation analysis by the bisulfite-assisted genomic sequencing method. Methods Mol. Biol. 200, 143–154 (2002). 1010. Tost, J. & Gut, I. G. DNA methylation analysis by pyrosequencing. Nat. Protoc. 2, 2265–2275 (2007). 1011. Park, P. J. ChIP-seq: advantages and challenges of a maturing technology. Nat. Rev. Genet. 10, 669–680 (2009). 1012. Giresi, P. G., Kim, J., McDaniell, R. M., Iyer, V. R. & Lieb, J. D. FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin. Genome Res. 17, 877–885 (2007). 1013. Song, L. et al. Open chromatin defined by DNaseI and FAIRE identifies regulatory elements that shape cell-type identity. Genome Res. 21, 1757–1767 (2011). 1014. Song, L. & Crawford, G. E. DNase-seq: A High-Resolution Technique for Mapping Active Gene Regulatory Elements across the Genome from Mammalian Cells. Cold Spring Harb. Protoc. 2010, pdb.prot5384 (2010). 1015. Jiang, C. & Pugh, B. F. Nucleosome positioning and gene regulation: advances through genomics. Nat. Rev. Genet. 10, 161–172 (2009). 1016. Zhang, Z. & Pugh, B. F. High-resolution genome-wide mapping of the primary structure of chromatin. Cell 144,

276 175–186 (2011). 1017. Rizzo, J. M. & Buck, M. J. Key Principles and Clinical Applications of ‘Next-Generation’ DNA Sequencing. Cancer Prev. Res. 5, 887–900 (2012). 1018. Palculict, M. E., Zhang, V. W., Wong, L.-J. & Wang, J. Comprehensive Mitochondrial Genome Analysis by Massively Parallel Sequencing BT - Mitochondrial DNA: Methods and Protocols. in (ed. McKenzie, M.) 3–17 (Springer New York, 2016). 1019. Calvo, S. E. et al. Molecular diagnosis of infantile mitochondrial disease with targeted next-generation sequencing. Sci. Transl. Med. 4, 118ra10 (2012). 1020. Lechowicz, U. et al. Application of nextgeneration sequencing to identify mitochondrial mutations: Study on m.7511T>C in patients with hearing loss. Mol. Med. Rep. 17, 1782–1790 (2018). 1021. Xu, B. et al. Novel mutation of ND4 gene identified by targeted next-generation sequencing in patient with Leigh syndrome. J. Hum. Genet. 62, 291–297 (2017). 1022. Yu, C. et al. Deciphering the Spectrum of Mitochondrial DNA Mutations in Hepatocellular Carcinoma Using High- Throughput Sequencing. Gene Expr. 18, 125–134 (2018). 1023. Tseng, C.-C. et al. Next-generation sequencing profiling of mitochondrial genomes in gout. Arthritis Res. Ther. 20, 137 (2018). 1024. Douglas, A. P. et al. Next-generation sequencing of the mitochondrial genome and association with IgA nephropathy in a renal transplant population. Sci. Rep. 4, 7379 (2014). 1025. Genomic Core Techonology Unit. QUB Genomics. (2019). Available at: https://www.qub.ac.uk/sites/core- technology-units/Genomics/. (Accessed: 9th September 2019) 1026. Illumina. NextSeqTM 500 Sequencing System. (2014). 1027. Illumina. NextSeqTM 550 Sequencing System. Illumina Datasheet (2019). 1028. Illumina. NovaSeqTM 6000 Sequencing System. Data Sheet 4, 1–4 (2019). 1029. Illumina. NexteraTM DNA Flex Library Preparation Kit. 4–7 (2017). 1030. Ross, M. G. et al. Characterizing and measuring bias in sequence data. Genome Biol. 14, R51 (2013). 1031. Bentley, D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53 (2008). 1032. Illumina. Illumina sequencing introduction. 1–8 (2017). 1033. Nakazato, T., Ohta, T. & Bono, H. Experimental Design-Based Functional Mining and Characterization of High- Throughput Sequencing Data in the Sequence Read Archive. PLoS One 8, (2013). 1034. Illumina. NextSeqTM 500 System Guide. (2015). 1035. Illumina. Indexed Sequencing Overview Guide. San Diego Calif. (2019). 1036. Illumina. Field Guide to Methylation Methods. (2014). 1037. McCaughan, J. A., McKnight, A. J. & Maxwell, A. P. Genetics of New-Onset Diabetes after Transplantation. J. Am. Soc. Nephrol. 25, 1037–1049 (2014). 1038. Cruise, S. et al. Early key findings from a study of older people in Northern Ireland The NICOLA Study Northern Ireland Cohort for the Longitudinal Study of Ageing … Understanding today for a healthier tomorrow. NICOLA Report (2017). 1039. Mcknight, A. et al. Investigation of DNA polymorphisms in SMAD genes for genetic predisposition to diabetic nephropathy in patients with type 1 diabetes mellitus. Diabetologia 52, 844–849 (2009). 1040. Illumina. Scalable Nucleic Acid Quality Assessments for Illumina Next-Generation Sequencing Library Prep - The Fragment Analyser. 1–4 (2017). 1041. Illumina. bcl2fastq2 Conversion Software v2.20. Illumina Software Guide (2019). 1042. KAPABIOSYSTEMS. KAPA HyperPlus Kit. 1–20 (2017). 1043. Roche. SeqCap EZ Target Enrichment System. (2017). 1044. Agilent. Quality Analysis of Eukaryotic Total RNA with the Agilent 5200 Fragment Analyzer System. (2019). 1045. Agilent. Assessment of Genomic DNA Quality with the Agilent 5200 Fragment Analyzer System. (2019).

277 1046. Illumina. Application Note: RNA Sequencing. Application Note: RNA Sequencing (2015). 1047. Illumina. Illumina Data Aggregation. (2019). Available at: https://support.illumina.com/help/BaseSpace_Sequence_Hub/Source/Informatics/BS/AutomaticDataAggregation _swBS.htm. (Accessed: 9th September 2019) 1048. Illumina. mtDNA Variant Processor. User Guide (2016). 1049. Illumina. mtDNA Variant Processor App Guide. (2019). Available at: https://support.illumina.com/help/BS_App_MDProcessor_Online_1000000007932/Content/Source/Informatics/A pps/Workflow_appMD_P.htm. (Accessed: 9th September 2019) 1050. Illumina. mtDNA Variant Analyzer. User Guide (2016). 1051. Illumina. RNA Express. User Guide (2014). 1052. Illumina. RNA Express App Guide. (2019). Available at: https://support.illumina.com/help/BaseSpace_App_RNAExpress_help/RNAExpress_Apps_help.htm. (Accessed: 9th September 2019) 1053. Illumina. MethylSeq. User Guide (2015). 1054. UCSC. Sequence and Annotation Download. UCSC Genome Browser (2019). Available at: https://hgdownload.soe.ucsc.edu/downloads.html. (Accessed: 29th September 2019) 1055. Roche. SeqCap Epi Enrichment System. (2018). 1056. Illumina. MethylSeq App Guide. (2019). Available at: http://support.illumina.com/help/BaseSpace_App_MethylSeq_help/methylseq-app-help.htm. (Accessed: 9th September 2019) 1057. Illumina. Illumina DRAGEN Bio-IT Platorm. User Guid. (2019). 1058. Chaitankar, V. et al. Next generation sequencing technology and genomewide data analysis: Perspectives for retinal research. Prog. Retin. Eye Res. 55, 1–31 (2016). 1059. Illumina. Understanding Run Charts in BaseSpace Sequencing Hub. (2019). Available at: https://help.basespace.illumina.com/articles/descriptive/runs-charts/. (Accessed: 9th September 2019) 1060. Roche. KAPA RNA HyperPrep Kit with RiboErase. Product Overview (2019). Available at: https://sequencing.roche.com/en-us/products-solutions/by-category/library-preparation/rna-library- preparation/kapa-rna-hyperprep-kit-with-riboerasehmr.html. (Accessed: 24th September 2019) 1061. SNPedia. SNPedia: Gs1040. SNPedia.com (2011). Available at: https://www.snpedia.com/index.php/Gs1040. (Accessed: 24th September 2019) 1062. Hay, M. Haplogroup H. Eupedia (2017). Available at: https://www.eupedia.com/europe/Haplogroup_H_mtDNA.shtml. (Accessed: 24th September 2019) 1063. Santoro, A. et al. Evidence for sub-haplogroup h5 of mitochondrial DNA as a risk factor for late onset Alzheimer’s disease. PLoS One 5, e12037–e12037 (2010). 1064. Marinov, G. K., Wang, Y. E., Chan, D. & Wold, B. J. Evidence for site-specific occupancy of the mitochondrial genome by nuclear transcription factors. PLoS One 9, e84713–e84713 (2014). 1065. Ahirwar, A. K. et al. Interaction of iNOS Gene (C150T) Polymorphism and Endothelial Dysfunction in Pathophysiology of Metabolic Syndrome. Tokai J. Exp. Clin. Med. 43, 24–29 (2018). 1066. Chen, A., Raule, N., Chomyn, A. & Attardi, G. Decreased Reactive Oxygen Species Production in Cells with Mitochondrial Haplogroups Associated with Longevity. PLoS One 7, e46473 (2012). 1067. Guney, O., Ak, H., Atay, S., Ozkaya, A. B. & Aydin, H. H. Mitochondrial DNA polymorphisms associated with longevity in the Turkish population. Mitochondrion 17, 7–13 (2014). 1068. GeneCards. PTPRN2. GeneCards Human Gene Database (2019). 1069. GeneCards. COX19. GeneCards Human Gene Database (2019). Available at: https://www.genecards.org/cgi- bin/carddisp.pl?gene=COX19. (Accessed: 24th September 2019) 1070. Abderrahmani, A. et al. The Transcriptional Repressor REST Determines the Cell-Specific Expression of the Human MAPK8IP1 Gene Encoding IB1 (JIP-1). Mol. Cell. Biol. 21, 7256–7267 (2001). 1071. Lacombe, M.-L., Tokarska-Schlattner, M., Boissan, M. & Schlattner, U. The mitochondrial nucleoside diphosphate kinase (NDPK-D/NME4), a moonlighting protein for cell homeostasis. Lab. Investig. 98, 582–588 (2018).

278 1072. Hendrickson, S. L. et al. Genetic variants in nuclear-encoded mitochondrial genes influence AIDS progression. PLoS One 5, e12862 (2010). 1073. Go, M. J. et al. New susceptibility loci in MYL2, C12orf51 and OAS1 associated with 1-h plasma glucose as predisposing risk factors for type 2 diabetes in the Korean population. J. Hum. Genet. 58, 362–365 (2013). 1074. Lundberg, M., Krogvold, L., Kuric, E., Dahl-Jørgensen, K. & Skog, O. Expression of Interferon-Stimulated Genes in Insulitic Pancreatic Islets of Patients Recently Diagnosed With Type 1 Diabetes. Diabetes 65, 3104 LP – 3110 (2016). 1075. Qu, H.-Q., Polychronakos, C. & Consortium, T. I. D. G. Reassessment of the type I diabetes association of the OAS1 locus. Genes Immun. 10 Suppl 1, S69–S73 (2009). 1076. Li, C.-I., Samuels, D. C., Zhao, Y.-Y., Shyr, Y. & Guo, Y. Power and sample size calculations for high-throughput sequencing-based experiments. Brief. Bioinform. 19, 1247–1255 (2018). 1077. Wang, G. T., Li, B., Santos-Cortez, R. P. L., Peng, B. & Leal, S. M. Power analysis and sample size estimation for sequence-based association studies. Bioinformatics 30, 2377–2378 (2014). 1078. McCullough, K. P., Morgenstern, H., Saran, R., Herman, W. H. & Robinson, B. M. Projecting ESRD Incidence and Prevalence in the United States through 2030. J. Am. Soc. Nephrol. 30, 127–135 (2019). 1079. Public Health England. Chronic kidney disease prevalence model About Public Health England. (2014). 1080. Diabetes Voice. Priced out: the impact of rising insulin costs. Diabetes Voice (2019). Available at: https://diabetesvoice.org/en/news/priced-out-the-impact-of-rising-insulin-costs/. (Accessed: 26th September 2019) 1081. Gu, H. F. Genetic and Epigenetic Studies in Diabetic Kidney Disease. Front. Genet. 10, 507 (2019). 1082. Chung, K. W. et al. Mitochondrial Damage and Activation of the STING Pathway Lead to Renal Inflammation and Fibrosis. Cell Metab. (2019). 1083. Hirakawa, Y., Tanaka, T. & Nangaku, M. Mechanisms of metabolic memory and renal hypoxia as a therapeutic target in diabetic kidney disease. J. Diabetes Investig. 8, 261–271 (2017). 1084. Kato, M. & Natarajan, R. Epigenetics and epigenomics in diabetic kidney disease and metabolic memory. Nat. Rev. Nephrol. 15, 327–345 (2019). 1085. Singh, S., Sonkar, S. K., Sonkar, G. K. & Mahdi, A. A. Diabetic kidney disease: A systematic review on the role of epigenetics as diagnostic and prognostic marker. Diabetes. Metab. Res. Rev. 35, e3155 (2019). 1086. Keating, S. T., van Diepen, J. A., Riksen, N. P. & El-Osta, A. Epigenetics in diabetic nephropathy, immunity and metabolism. Diabetologia 61, 6–20 (2018). 1087. Thomas, M. C. Epigenetic Mechanisms in Diabetic Kidney Disease. Curr. Diab. Rep. 16, 31 (2016). 1088. Groopman, E. E. et al. Diagnostic Utility of Exome Sequencing for Kidney Disease. N. Engl. J. Med. 380, 142–151 (2018). 1089. Mooyaart, A. L. Genetic associations in diabetic nephropathy. Clin. Exp. Nephrol. 18, 197–200 (2014). 1090. Berger, M. et al. Hidden population substructures in an apparently homogeneous population bias association studies. Eur. J. Hum. Genet. 14, 236–244 (2006). 1091. Saeed, M. Locus and gene-based GWAS meta-analysis identifies new diabetic nephropathy genes. Immunogenetics 70, 347–353 (2018). 1092. Armstrong, R. A. When to use the Bonferroni correction. Ophthalmic Physiol. Opt. 34, 502–508 (2014). 1093. Noonan, J. P. & McCallion, A. S. Genomics of long-range regulatory elements. Annu. Rev. Genomics Hum. Genet. 11, 1–23 (2010). 1094. van Heyningen, V. & Bickmore, W. Regulation from a distance: long-range control of gene expression in development and disease. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 368, 20120372 (2013). 1095. Legati, A. et al. New genes and pathomechanisms in mitochondrial disorders unraveled by NGS technologies. Biochim. Biophys. Acta 1857, 1326–1335 (2016). 1096. Pedersen, B. S., Schwartz, D. A., Yang, I. V & Kechris, K. J. Comb-p: software for combining, analyzing, grouping and correcting spatially correlated P-values. Bioinformatics 28, 2986–2988 (2012). 1097. Butcher, L. M. & Beck, S. Probe Lasso: a novel method to rope in differentially methylated regions with 450K DNA methylation data. Methods 72, 21–28 (2015).

279 1098. Jaenisch, R. & Bird, A. Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nat. Genet. 33 Suppl, 245–254 (2003). 1099. Irizarry, R. A. et al. The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nat. Genet. 41, 178–186 (2009). 1100. Rutten, B. P. F. et al. Longitudinal analyses of the DNA methylome in deployed military servicemen identify susceptibility loci for post-traumatic stress disorder. Mol. Psychiatry 23, 1145–1156 (2018). 1101. Ventham, N. T. et al. Integrative epigenome-wide analysis demonstrates that DNA methylation may mediate genetic risk in inflammatory bowel disease. Nat. Commun. 7, 13507 (2016). 1102. Neidhart, M. Chapter 1 - DNA Methylation – Introduction. in (ed. Neidhart, M. B. T.-D. N. A. M. and C. H. D.) 1–8 (Academic Press, 2016). 1103. Montero, R. M. et al. Defining Phenotypes in Diabetic Nephropathy: a novel approach using a cross-sectional analysis of a single centre cohort. Sci. Rep. 8, 53 (2018). 1104. Kryuchkova-Mostacci, N. & Robinson-Rechavi, M. Tissue-Specificity of Gene Expression Diverges Slowly between Orthologs, and Rapidly between Paralogs. PLoS Comput. Biol. 12, e1005274 (2016). 1105. Gómez-Ramos, A. et al. Similarities and Differences between Exome Sequences Found in a Variety of Tissues from the Same Individual. PLoS One 9, e101412 (2014). 1106. Kryuchkova-Mostacci, N. & Robinson-Rechavi, M. A benchmark of gene expression tissue-specificity metrics. Brief. Bioinform. 18, 205–214 (2017). 1107. Dick, K. J. et al. DNA methylation and body-mass index: a genome-wide analysis. Lancet (London, England) 383, 1990–1998 (2014). 1108. Yang, R. et al. A systematic survey of human tissue-specific gene expression and splicing reveals new opportunities for therapeutic target identification and evaluation. (2018). 1109. Li, M., Schröder, R., Ni, S., Madea, B. & Stoneking, M. Extensive tissue-related and allele-related mtDNA heteroplasmy suggests positive selection for somatic mutations. Proc. Natl. Acad. Sci. U. S. A. 112, 2491–2496 (2015). 1110. Herbers, E., Kekäläinen, N. J., Hangas, A., Pohjoismäki, J. L. & Goffart, S. Tissue specific differences in mitochondrial DNA maintenance and expression. Mitochondrion 44, 85–92 (2019). 1111. Krjutškov, K. et al. Tissue-specific mitochondrial heteroplasmy at position 16,093 within the same individual. Curr. Genet. 60, 11–16 (2014). 1112. Wagner, M. et al. Mitochondrial DNA mutation analysis from exome sequencing-A more holistic approach in diagnostics of suspected mitochondrial disease. J. Inherit. Metab. Dis. 42, 909–917 (2019). 1113. Bosworth, C. M., Grandhi, S., Gould, M. P. & LaFramboise, T. Detection and quantification of mitochondrial DNA deletions from next-generation sequence data. BMC Bioinformatics 18, 407 (2017). 1114. Zhang, P. et al. Mitochondria sequence mapping strategies and practicability of mitochondria variant detection from exome and RNA sequencing data. Brief. Bioinform. 17, 224–232 (2016). 1115. Sirard, M.-A. Distribution and dynamics of mitochondrial DNA methylation in oocytes, embryos and granulosa cells. Sci. Rep. 9, 11937 (2019). 1116. Kim, K. M., Noh, J. H., Abdelmohsen, K. & Gorospe, M. Mitochondrial noncoding RNA transport. BMB Rep. 50, 164–174 (2017). 1117. Dong, Y., Yoshitomi, T., Hu, J.-F. & Cui, J. Long noncoding RNAs coordinate functions between mitochondria and the nucleus. Epigenetics Chromatin 10, 41 (2017). 1118. Vendramin, R., Marine, J.-C. & Leucci, E. Non-coding RNAs: the dark side of nuclear–mitochondrial communication. EMBO J. 36, 1123–1133 (2017). 1119. Rackham, O. & Filipovska, A. Analysis of the human mitochondrial transcriptome using directional deep sequencing and parallel analysis of RNA ends. Methods in molecular biology 1125, (2014). 1120. van Dijk, E. L., Jaszczyszyn, Y., Naquin, D. & Thermes, C. The Third Revolution in Sequencing Technology. Trends Genet. 34, 666–681 (2018). 1121. Manolio, T. A. et al. Opportunities, resources, and techniques for implementing genomics in clinical care. Lancet 394, 511–520 (2019).

280