Genetic and Epigenetic Analysis of Type 2 Diabetes among Qatari Families

Wadha Al Muftah, MD

Department of Genomics of Common Disease School of Public Health Faculty of Medicine Imperial College London

Thesis Submitted for the degree of Doctor of Philosophy

Copyright Declaration

‘The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work’.

2

Abstract

Type 2 diabetes (T2D) is a complex multifactorial disorder driven by both genetic and environmental factors. The rapid increase of T2D in Qatar -prevalence of 16.3% in 2014 according to the International Diabetes Federation (IDF) - motivated the introduction of genetic studies among this under-presented population.

Major progress to study the genetic basis of T2D came from studying common variants. However, these variants were of mild effect sizes. Studies have been shifted from applying the common variants hypothesis towards the investigation of other genetic variables including rare variants, copy number variants (CNV) and epigenetic mechanisms. This PhD project is focused on identifying genomic risk factors of T2D among the Qatari population.

Eight Qatari families were analysed using the advances in genotyping and sequencing technologies. Three analyses were carried out; the aim was to identify known or novel rare variants within candidate T2D , identification of potential large CNVs related to T2D and detection of methylation associations with T2D.

Three data sets were generated for this PhD project; these were genome-wide genotyping arrays (Illumina HumanOmni2.5M bead chip), whole genome sequencing (WGS) (Illumina hiSeq2500 platform) and genome-wide methylation arrays (Illumina 450K BeadChip capturing more than 485,577 probes). I performed three analytical experiments for the aim of this PhD project. First, linkage analysis and runs of homozygosity in combination with WGS were used to map rare variants. Second, PennCNV prediction algorithm for CNV calling was applied aiming to identify rare T2D-related CNVs. Third, association test using a linear mixed model was used for the identification of methylation associations with T2D.

The findings included the identification of three rare variants detected through WGS. One variant identified within known monogenic T2D (GCKR) and two variants detected within known T2D genes (IGFBP2 and TGM2). These genes are functionally related to T2D and replication analysis will be required to assess their contribution to T2D among Qataris. A number of large rare CNVs detected from CNV analysis in the second experiment of the project, but potential T2D genes were not found within candidate CNVs. Also, methylation analysis replicated a significant association with T2D at cg19693031 in TXNIP (p=5.28x10¯⁶) among Qataris.

The identified candidate genes were proved to be involved in the pathogenesis of T2D by causing insulin resistance and beta-cell dysfunction.

3

Table of Contents

Copyright Declaration ………………………………………………………………………………………………………………. 2

Abstract ……………………………………………………………………………………………………………………………………. 3

Table of Contents ……………………………………………………………………………………………………………………… 4

List of Tables …………………………………………………………………………………………………………………………….. 11

List of Figures ……………………………………………………………………………………………………………………………. 13

Acknowledgements ………………………………………………………………………………………………………………….. 15

Declaration of Originality …………………………………………………………………………………………………………. 16

List of Publications ……………………………………………………………………………………………………………………. 17

Abbreviation …………………………………………………………………………………………………………………………….. 20

1 Introduction ……………………………………………………………………………………………………………. 24

1.1 Prevalence of Type 2 Diabetes ………………………………………………………………………………………. 25

1.2 Prevalence of Type 2 Diabetes in Qatar ……………………………………………………………………….. 25

1.3 Gene-Environment Interaction in T2D ………………………………………………………………………….. 26

1.4 Obesity risk in Developing Type 2 Diabetes ………………………………………………………………….. 28

1.5 Heritability of Type 2 Diabetes ……………………………………………………………………………………… 29

1.6 Genetic Basis of Type 2 Diabetes …………………………………………………………………………………… 29

1.6.1 Monogenic Form of T2D ……………………………………………………………………………………. 30

1.6.2 Polygenic Form of T2D ………………………………………………………………………………………. 32

1.7 Gene Mapping Approaches to Identify T2D Genes ………………………………………………………. 32

1.7.1 Linkage Analysis ……………………………………………………………………………………………………… 32

1.7.2 Candidate Gene Studies …………………………………………………………………………………………. 34

1.7.3 Genome Wide Association Studies …………………………………………………………………………. 35

1.8 Missing Heritability of Complex Diseases ……………………………………………………………………….. 42

1.8.1 Rare Variants ………………………………………………………………………………………………………….. 43

1.8.2 Genomic Structural Variation (GSV) ………………………………………………………………………… 44

1.8.3 Epigenetic Modifications ………………………………………………………………………………………… 44

1.8.4 Parent-of-Origin ……………………………………………………………………………………………………… 44 4

1.9 Next Generation Sequencing (NGS) ……………………………………………………………………………….. 45

1.10 Copy Number Variation (CNV) ……………………………………………………………………………………... 47

1. 10.1 Copy Number Variants: Mechanisms of formation ……………………………………………… 49

1. 10.2 Copy Number Variants prediction and Validation ……………………………………………….. 50

1.10.3 Role of Copy Number Variants in T2D Development ………………………………………….. 52

1.11 Epigenetic of Type 2 Diabetes ………………………………………………………………………………………. 53

1.11.1 Epigenetic Mechanisms ……………………………………………………………………………………….. 53

1.11.1.1 DNA methylation ………………………………………………………………………………….. 54

1.11.1.2 Histone Modifications …………………………………………………………………………… 55

1.11.1.3 RNA interference ………………………………………………………………………………….. 56

1.11.1.4 Environmental factors …………………………………………………………………………… 56

1.11.2 Heritability Estimates of DNA methylation ……………………………………………………………. 58

1.11.3. Role of Epigenetics in Developing T2D ………………………………………………………………… 59

1. 11.4 Characterization of T2D Epigenetic Risk ………………………………………………………………. 59

1.12 Genetic of Type 2 Diabetes in Qatar …………………………………………………………………………….. 61

1. 13 Aims and Objectives ……………………………………………………………………………………………………. 62

1.13.1 Overall Aim ………………………………………………………………………………………………………….. 62

1.13.2 Objectives ………………………………………………………………………………………………………….. 62

2 Material and Methods …………………………………………………………………………………………… 63

2.1 Ethical Approval ………………………………………………………………………………………………………………. 64

2.2 Study Subjects …………………………………………………………………………………………………………………. 64

2.2.1 Qatari Families with Early Onset of Type 2 Diabetes ………………………………………………. 64

2.2.2 Qatari Families with Type 2 Diabetes and Obesity ………………………………………………….. 68

2.2.3 Replication Samples ……………………………………………………………………………………………….. 68

5

2.3 DNA extraction ………………………………………………………………………………………………………………… 68

2.4 Sample Preparation ………………………………………………………………………………………………………….. 69

2.5 Whole Genome Genotyping Arrays ………………………………………………………………………………….. 69

2.6 Prediction of Copy Number Variants associated with T2D ………………………………………………. 70

2.6.1 Generation of B allele frequency and Log R ratio using signal intensity files …………… 71

2.6.2 PennCNV algorithm for CNV prediction ………………………………………………………………….. 72

2.6.3 Filtration of Samples with low Quality Calls ……………………………………………………………. 72

2.6.4 Merging adjacent CNV calls ……………………………………………………………………………………. 73

2.6.5 Filtering Large and Rare CNVs ………………………………………………………………………………… 73

2.6.6 Exclude false CNV calls …………………………………………………………………………………………… 74

2.7 Assessment of Candidate CNVs ………………………………………………………………………………………. 74

2.7.1 Identify rare CNVs …………………………………………………………………………………………………. 74

2.7.2 Visualization of candidate CNVs …………………………………………………………………………….. 74

2.7.3 Replicate CNVs in other dataset …………………………………………………………………………….. 75

2.8 Identification of Rare T2D Variants ………………………………………………………………………………….. 75

2.8.1 Linkage Analysis …………………………………………………………………………………………………….. 76

2.8.1.1 GenomeStudio ……………………………………………………………………………………….. 76

2.8.1.2 SNP array Quality Control ……………………………………………………………………….. 77

2.8.1.2.1 PLINK ………………………………………………………………………………………. 77

2.8.1.2.2 MERLIN …………………………………………………………………………………… 78

2.8.1.3 Marker Selection ……………………………………………………………………………………. 78

2.8.1.3.1 Informativeness …………………………………………………………….……….. 78

2.8.1.3.2 LD Pruning ……………………………………………………………………………… 80

2.8.1.4 Running Linkage Analysis using MERLIN ………………………………………………… 80

2.8.1.5 Runs of Homozygosity (ROH) ……………………………………………………….………… 81

6

2.8.2 Whole Genome Sequencing (WGS) ……………………………………………………………………….. 82

2.8.2.1 Quality control measures for WGS …………………………………………………………. 82

2.8.2.1.1 QC of raw data using FASTQC ………………………………………………………………. 83

2.8.2.1.2 QC of aligned reads using QPLOT …………………………………………………………. 86

2.8.2.2 Generation of variant calls using GATK best practices pipeline ……………….. 87

2.8.2.3 Variants filtration and identification of rare variants using Qiagen Ingenuity

Variant Analysis (IVA) ………………………………………………………………………………………….. 89

2.8.2.4 IVA filtration options used at different WGS analysis stages ……………………. 90

2.8.2.4.1 Confidence ………………………………………………………………………………. 90

2.8.2.4.2 Frequency ……………………………………………………………………………….. 90

2.8.2.4.3 Predicted Deleterious ……………………………………………………………… 91

2.8.2.4.4 Biological Context …………………………………………………………………… 91

2.8.2.4.5 Genetic Analysis...... 91

2.8.2.5 Validation of Candidate genes using PCR-based method ……………………….. 92

2.8.2.6 Sanger Sequencing ………………………………………………………………………………… 94

2.9 Whole Genome Methylation Array ……………………………………………………………………………….. 95

2.9.1 Data preprocessing …………………………………………………………………………………… 95

2.9.2 Lumi: QN+BMIQ pipeline ………………………………………………………………………….. 98

2.9.3 Quantification of Leukocytes Composition ……………………………………………….. 101

2.9.4 Statistical Analysis …………………………………………………………………………………….. 102

3 Mapping Rare Genetic Variants of T2D among Qatari Families ……………………………………………… 103

3.1 Background ……………………………………………………………………………………………………………………. 104

3.2 SNP array Quality Control ……………………………………………………………………………………………… 104

3.3 Identification of Candidate Regions ………………………………………………………………………………. 105

7

3.4 Linkage Analysis ……………………………………………………………………………………………………………… 105

3.4.1 Running Linkage Analysis using markers obtained from Illumina 2.5M platform … 106

3.4.1.1 Non-parametric Linkage Analysis ……………………………………………………………. 106

3.4.1.2 Variance Component Linkage Analysis ……………………………………………………. 106

3.4.1.3 LOD Score Calculation …………………………………………………………………………….. 106

3.4.1.4 Per-family Linkage Analysis …………………………………………………………………….. 108

3.4.2 Running Linkage Analysis using Panel V markers ………………………………………………….. 110

3.5 Runs of Homozygosity …………………………………………………………………………………………………….. 111

3.5.1 Detection of Runs of Homozygosity (ROH) ……………………………………………………………… 111

3.6 Identification of T2D Candidate Variants through Whole Genome Sequencing Analysis .. 112

3.7 Qiagen Ingenuity Variant Analysis (IVA) …………………………………………………………...... 112

3.8 Candidate Variants within T2D genes ……………………………………………………………………………… 112

3.9 Variants Identified from each analysis ……………………………………………………………………………. 115

3.9.1 Variants within Monogenic Genes of T2D ………………………………………………………………. 115

3.9.1.1 Filtering options ……………………………………………………………………………………… 115

3.9.1.2 Variants Identified within T2D monogenic genes ……………………………………. 116

3.9.2 Variants within Candidate Regions …………………………………………………………………………. 117

3.9.2.1 Filtering options ……………………………………………………………………………………… 117

3.9.2.2 Variants Identified within overlapping regions ……………………………………….. 120

3.9.2.3 Variants Identified within Linkage regions ………………………………………………. 121

3.9.2.4 Variants Identified within Homozygosity regions …………………………………….. 122

3.10 Variants from Global Analysis ……………………………………………………………………………………….. 123

3.11 Validation of Variants within Candidate T2D genes ……………………………………………………… 123

3.11.1 Known monogenic T2D genes ………………………………………………………………………………. 123

3.11.2 Potential T2D genes ……………………………………………………………………………………………… 126 8

3.12 Discussion …………………………………………………………………………………………………………………….. 129

3.12.1 Linkage Analysis and Runs of Homozygosity …………………………………………………………. 129

3.12.1 Variant detection through WGS ……………………………………………………………………………. 130

3.13 Summary ……………………………………………………………………………………………………………………….. 134

3.14 Limitations …………………………………………………………………………………………………………………….. 135

3.15 Future work …………………………………………………………………………………………………………………… 135

4 Assessments of Copy Number Variants among Qatari Families with Type 2 Diabetes .. 136

4.1 Background ……………………………………………………………………………………………………………………… 137

4.2 CNV Prediction Algorithms ………………………………………………………………………………………………. 137

4.3 CNV Prediction using PennCNV ……………………………………………………………………………………….. 138

4.3.1 Data Quality Control ……………………………………………………………………………………………….. 138

4.3.2 Description of Large and Dense CNVs used in the Analysis ……………………………………… 139

4.3.2.1 Size Distribution of Copy Number ……………………………………………………………. 139

4.3.2.2 Frequency of Copy Number Variants …………………………………………………. 141

4.4 Analysis of Large and Dense CNV calls ……………………………………………………………………………… 141

4.4.1 Merging adjunct CNVs using PennCNV ……………………………………………………………………. 142

4.4.2 Identification of Large CNVs ……………………………………………………………………………………. 142

4.5 CNV Calling using PennCNV ………………………………………………………………………………………………. 143

4.6 Assessment of candidate CNVs …………………………………………………………………………………………. 144

4.6.1 CNVs shared between samples and overlap with genes ………………………………………….. 145

4.6.2 CNVs in one sample and overlap with genes …………………………………………………………… 146

4.6.3 Visualization of candidate CNVs ………………………………………………………………………………. 146

4.6.4 Candidate CNVs overlapping with CNVs in other dataset ………………………………………… 150

4.6.5 Identification of potential T2D genes within candidate CNVs ………………………………….. 150

4.6.5.1 CNVs shared between samples and overlap with genes ………………………….. 150

4.6.5.2 CNVs in single samples and overlap with genes ………………………………………. 151

4.7 Discussion …………………………………………………………………………………………………………………….. 153

9

4.8 Summary ………………………………………………………………………………………………………………………. 155

4.9 Limitations ……………………………………………………………………………………………………………………. 155

4.10 Future work ………………………………………………………………………………………………………………….. 156

4.10.1 Samples ……………………………………………………………………………………………………………….. 156

4.10.2 Trio-based CNV calling using PennCNV ………………………………………………………………… 156

4.10.3 CNV calling using another prediction algorithm …………………………………………………… 156

5 Epigenetic Associations of Type 2 Diabetes among Qatari Families ……………………………………….. 157

5.1 Background …………………………………………………………………………………………………………………….. 158

5.2 Methylation Analysis Pipeline ………………………………………………………………………………………… 159

5.3 Qatari and TwinsUK Cohort ……………………………………………………………………………………………. 159

5.4 Estimation of Differentially Methylated Loci …………………………………………………………………. 160

5.5 CpG Sites Selected for Replication …………………………………………………………………………………. 162

5.6 Qatari and TwinsUK Meta-analysis ………………………………………………………………………………… 162

5.7 T2D-related Differential Methylation …………………………………………………………………………….. 164

5.7.1 Replication of T2D associated sites in the Qatari Families ……………………………………… 164

5.7.2 Replication of T2D association in the TwinsUK ………………………………………………………. 164

5.8 Discussion ………………………………………………………………………………………………………………………. 168

5.9 Summary ………………………………………………………………………………………………………………………… 171

5.10 Limitations ……………………………………………………………………………………………………………………… 171

5.11 Future work ……………………………………………………………………………………………………………………. 171

6 Conclusions ……………………………………………………………………………………………………………………………… 176

References ………………………………………………………………………………………………………………………………….. 182

Appendix A …………………………………………………………………………………………………………………………………. 205

Appendix B …………………………………………………………………………………………………………………………………. 215

Appendix C ………………………………………………………………………………………………………………………………….. 234

10

List of Tables

Table 1.1 ……………………………………………………………………………………………………………………………………… 31

Table 1.2 ……………………………………………………………………………………………………………………………………… 39

Table 2.1 ……………………………………………………………………………………………………………………………………… 67

Table 2.2 ……………………………………………………………………………………………………………………………………… 94

Table 2.3 ……………………………………………………………………………………………………………………………………… 94

Table 3.1 ………………………………………………………………………………………………………………………………………. 109

Table 3.2 ………………………………………………………………………………………………………………………………………. 110

Table 3.3 ……………………………………………………………………………………………………………………………………… 110

Table 3.4 ……………………………………………………………………………………………………………………………………… 110

Table 3.5 ……………………………………………………………………………………………………………………………………… 111

Table 3.6 ……………………………………………………………………………………………………………………………………… 111

Table 3.7 ……………………………………………………………………………………………………………………………………… 113

Table 3.8 ……………………………………………………………………………………………………………………………………… 115

Table 3.9 ……………………………………………………………………………………………………………………………………… 119

Table 3.10 ……………………………………………………………………………………………………………………………………. 119

Table 3.11 ……………………………………………………………………………………………………………………………………. 120

Table 3.12 ……………………………………………………………………………………………………………………………………. 122

Table 4.1 ……………………………………………………………………………………………………………………………………… 139

Table 4.2 ………………………………………………………………………………………………………………………………. 141

Table 4.3 ……………………………………………………………………………………………………………………………………… 143

Table 4.4 ……………………………………………………………………………………………………………………………………… 144

Table 4.5 ……………………………………………………………………………………………………………………………………… 146

11

Table 4.6 ……………………………………………………………………………………………………………………………………… 147

Table 4.7 ……………………………………………………………………………………………………………………………………… 151

Table 4.8 ……………………………………………………………………………………………………………………………………… 152

Table 4.9 ……………………………………………………………………………………………………………………………………… 153

Table 5.1 ……………………………………………………………………………………………………………………………………… 164

Table 5.2 ……………………………………………………………………………………………………………………………………… 168

12

List of Figure

Figure 1.1 ……………………………………………………………………………………………………………………………….. 27

Figure 2.1 ……………………………………………………………………………………………………………………………….. 67-68

Figure 2.2 ……………………………………………………………………………………………………………………………….. 71

Figure 2.3 ……………………………………………………………………………………………………………………………….. 72

Figure 2.4 ……………………………………………………………………………………………………………………………….. 74

Figure 2.5 ……………………………………………………………………………………………………………………………..… 76

Figure2.6 ………………………………………………………………………………………………………………………………… 80

Figure 2.7 ……………………………………………………………………………………………………………………………….. 81

Figure 2.8 ……………………………………………………………………………………………………………………………….. 85

Figure 2.9 ……………………………………………………………………………………………………………………………….. 86

Figure 2.10 ……………………………………………………………………………………………………………………………… 87

Figure 2.11 ……………………………………………………………………………………………………………………………… 89

Figure 2.12 ……………………………………………………………………………………………………………………………… 90

Figure 2.13 ……………………………………………………………………………………………………………………………… 93

Figure 2.14 ……………………………………………………………………………………………………………………………… 97

Figure 2.15 ……………………………………………………………………………………………………………………………… 98

Figure 2.16 ……………………………………………………………………………………………………………………………… 100

Figure 2.17 ……………………………………………………………………………………………………………………………… 100

Figure 2.18 ……………………………………………………………………………………………………………………………… 101

Figure 2.19 ……………………………………………………………………………………………………………………………… 101

Figures3.1. (a) …………………………………………………………………………………………………………………………. 108

Figures3.1. (b) …………………………………………………………………………………………………………………………. 108

Figure 3.2 ………………………………………………………………………………………………………………………………… 116

Figure 3.3 ………………………………………………………………………………………………………………………………… 121

Figure 3.4 ………………………………………………………………………………………………………………………………… 126

Figure 3.5 ………………………………………………………………………………………………………………………………… 128

Figure 3.6 ………………………………………………………………………………………………………………………………… 129

Figure 4.1 ………………………………………………………………………………………………………………………………… 139

Figure 4.2 ………………………………………………………………………………………………………………………………… 140

13

Figure 4.3 ………………………………………………………………………………………………………………………………… 141

Figure 4.4 ………………………………………………………………………………………………………………………………… 142

Figure 4.5 ………………………………………………………………………………………………………………………………… 145

Figure 4.6 ………………………………………………………………………………………………………………………………… 148

Figure 4.7 ………………………………………………………………………………………………………………………………… 149

Figure 4.8 ………………………………………………………………………………………………………………………………… 150

Figure 5.1 ………………………………………………………………………………………………………………………………… 162

Figure 5.2 ………………………………………………………………………………………………………………………………… 166

Figure 5.3 ………………………………………………………………………………………………………………………………… 167

Figure 5.4 ………………………………………………………………………………………………………………………………… 173

Figure 5.5 ………………………………………………………………………………………………………………………………… 174

Figure 5.6 ………………………………………………………………………………………………………………………………… 175

14

Acknowledgements

The PhD journey was a life-changing experience and full of intellectual challenges. The continuous support, guidance and encouragement that I received from so many different people helped achieving the aims of this PhD project.

I would like first to thank God for giving me the power, the faith and the strength to pursue my PhD studies, and putting together all the work in this PhD thesis.

Many thanks to all my supervisors Dr Mario Falchi, Prof Karsten Suhre and Prof Philippe Froguel for their contribution in this PhD project. I would like first to thank my primary supervisor Dr Mario Falchi for his support and guidance throughout this journey to successfully answer the PhD questions. Also, a special thanks to Prof Karsten Suhre (my secondary supervisor at Weill Cornell Medical College in Qatar (WCMC-Q)) for the time that he dedicated to discuss the results of this PhD work. I would like to thank Prof Philippe Froguel, my secondary supervisor and head of department, for his continuous support.

Many thanks also to my colleagues at Imperial College Sadia Saeed, Dr Julia El-Sayed Mustafa and Dr Haya Al-Saud for their encouragement and support, and for the help they offered.

Special thanks to my colleagues at WCMC-Q starting with Prof Khalid Machaca, the associate dean of research at WCMC-Q, for his contribution and valuable input during the scientific discussions throughout my PhD. I would like also to thank Dr Pankaj Kumar, Dr Noha Yousri and Dr Shaza Zaghlool for their guidance and advice throughout my PhD. Also, I would like to thank Dr Cindy McKeon for her time and energy to help in recruiting the study subjects, and for being the best companion. Thanks to my colleagues (Mashael, Dennis, Marjonneke, Sweety, Fatima, Omar, Sara, Ania and Shahina) for their friendship and for being the best buddies.

I am grateful to my parents and my siblings, Maha, Reem, Jassim and Mohammed. I would like to thank all of them for being by my side and giving me the courage to face the challenges during this journey. Big thanks to my great husband, Fahad, for believing in me and being proud of my work.

I would like to dedicate this thesis to my husband and our newly born baby Mariam. “Dear Mariam, I am very fortunate to have you in my life and I will always make you proud of me. Love you so much …”

Declaration of Originality

I, Wadha Al-Muftah declare that all the data included in this thesis is the result of my own work and research under the supervision of Dr Mario Falchi, Prof Karsten Suhre and Prof Philippe Froguel. All information derived from published and unpublished work of others has been acknowledged in the thesis and a list of references is given.

Wadha Al Muftah

July 2015

16

List of Publications

 Wadha A. Al Muftah, Mashael Al-Shafai, Shaza B. Zaghlool, Alessia Visconti, Pei-Chien Tsai, Pankaj Kumar, Tim Spector, Jordana Bell, Mario Falchi, and Karsten Suhre (2015). Epigenetic Associations of Type 2 diabetes and Obesity in an Arab Population (accepted for Clinical Epigenetics Journal).

 Alessia Visconti, Wadha A. Al Muftah, Mashael Al-Shafai, Karsten Suhre and Mario Falchi (2015). PopPANTE: Population and Pedigree Association Testing of the Epigenome (submitted).

 Yousri, N. A., Mook-Kanamori DO, Selim MM, Takiddin AH, Al-Homsi H, Al-Mahmoud KA, Karoly ED, Krumsiek J, Do KT, Neumaier U, Mook-Kanamori MJ, Rowe J, Chidiac OM, McKeon C, Al Muftah WA, Kader SA, Kastenmüller G, Suhre K (2015). A systems view of type 2 diabetes- associated metabolic perturbations in saliva, blood and urine at different timescales of glycaemic control. Diabetologia: 1-13.

 Zaghlool SB, Al-Shafai M, Al Muftah WA, Kumar P, Falchi M, Suhre K (2015). Association of DNA methylation with age, gender, and smoking in an Arab population. Clinical Epigenetics 7(1): 6.

 Mook-Kanamori MJ, El-Din Selim MM, Takiddin AH, Al-Mahmoud KA, Al-Homsi H, McKeon C, Al Muftah WA, Kader SA, Mook-Kanamori DO, Suhre K (2014). Elevated HbA1c levels in individuals not diagnosed with type 2 diabetes in Qatar: a pilot study. Qatar medical journal 2014(2): 106.

 Kumar P, Al-Shafai M, Al Muftah WA, Chalhoub N, Elsaid MF, Aleem AA, Suhre K (2014). Evaluation of SNP calling using single and multiple-sample calling algorithms by validation against array base genotyping and Mendelian inheritance. BMC Research Notes, 7:747, 2014

 Mook-Kanamori DO, El-Din Selim MM, Takiddin AH, Al-Homsi H, Al-Mahmoud KA, Al-Obaidli A, Zirie MA, Rowe J, Yousri NA, Karoly ED, Kocher T, Sekkal Gherbi W, Chidiac OM, Mook-Kanamori MJ, Abdul Kader S, Al Muftah WA, McKeon C, Suhre K (2014). 1, 5-anhydroglucitol in saliva is a non-invasive marker of short-term glycemic control, J Clin Endocrinol Metab, 99:E479-483, 2014.

17

 Mook-Kanamori MJ, Selim MM, Takiddin AH, Al-Homsi H, Al-Mahmoud KA, Al-Obaidli A, Zirie MA, Rowe J, Gherbi WS, Chidiac OM, Kader SA, Al Muftah WA, McKeon C, Suhre K, Mook- Kanamori DO (2013). Ethnic and gender differences in advanced glycation end products measured by skin auto-fluorescence. Dermato-Endocrinology, 5:325-330, 2013.

Conference Abstracts

 Gaurav Thareja, Manish Kumar, Mashel Al-Shafai, Ziad Kronfol, Wadha Ahmed Al Muftah, Karsten Suhre+, Pankaj Kumar (2015). Y- haplogroup diversity in the State of Qatar. WCMC-Q research retreat, February 2015. Doha, Qatar.

 Shaza Zaghlool, Mashael Al-shafai, Wadha Al Muftah, Pankaj Kumar, Karsten Suhre (2014). Inheritance of Methylation in the Qatari Population. Qatar Foundation Annual Research Forum, November 2014, Doha, Qatar.

 Wadha Al-Muftah, Mashael Al-Shafai, Cindy McKeon, Philippe Froguel, Mario Falchi and Karsten Suhre (2014). Genetics of Type 2 Diabetes among Qatari Families. System Biology of Diabetes: Towards Precision Medicine, January 2014, Doha, Qatar.

 Mashael Nedham Al-shafai, Wadha Al Muftah, Pankaj Kumar, Khaled Machaca, Karsten Suhre, Mario Falchi (2014). The 16p11.2 Deletion in an Extremely Obese Patient from Qatar. Qatar Foundation Annual Research Forum, November 2014, Doha, Qatar.

 Shaza Zaghlool, Mashel Al-Shafai, Wadha Ahmed Al Muftah, Pankaj Kumar, Karsten (2013) Suhre Study of Methylation Differences in Qatari Population. The 5th Pan Arab Human Genetics Conference, November 2013, Dubai, United Arab Emirates.

 Wadha Al-Muftah, Amal Adam, Mario Falchi, Karsten Suhre, Philippe Froguel (2012). Genetics of Type 2 Diabetes among Qatari Families. Translational Systems Biology and Bioinformatics Symposium, June 2012, Doha, Qatar.

 Wadha Al-Muftah, Ramin Badii, Aisha Al Zeyara, Amelie Bonnefond, Mario Falchi, Philippe Froguel (2011). Genetics of Type 2 Diabetes among Qatari Families. Qatar Foundation Annual Research Forum, November 2011, Doha, Qatar.

18

 Wadha Al-Muftah, Philippe Froguel, Alexandra IF Blakemore (2011). Genetic Analysis of severe Obesity in Qatar and European Children. XVII International DALM Symposium on Diabetes, Obesity, and the Metabolic Syndrome, March 2011, Doha, Qatar.

19

Abbreviations

aCGH Array Comparative Genomic Hybridization anti-GAD Glutamic Acid Decarboxylase Autoantibodies

BAC Bacterial Artificial Chromosome

BMI Body Mass Index

BAF B-Allele Frequency

BAM Binary Alignment Map

BMIQ Beta-Mixture Quantile Normalization

Bp

CNV Copy Number Variation

CN Copy Number

Cm Centimeter

CMC Combined Multivariate and Collapsing

CGH Comparative Genomic Hybridization

CpG —C—phosphate—G—

DNA Deoxyribonucleic Acid

DIAGRAM DIAbetes Genetics Replication and Meta-analysis

DGI Diabetes Genetics Initiative

DMR Differentially Methylated Regions

DZ Dizygotic

DNMT Deoxyribonucleic Acid Methyltransferase dbSNP Single Nucleotide Polymorphism Database

ELAND Efficient Large-Scale Alignment of Nucleotide Databases

EWAS Epigenome Wide Association Study

EDTA EthyleneDiamine Tetra-Acetic acid

FUSION Finland–United States Investigation of NIDDM Genetics

FISH Fluorescence in situ Hybridization

FoSTeS Fork Stalling and Template Switching

FDR False Discovery Rate

GWAS Genome Wide Association Study

20

GATK Genome Analysis Toolkit

GSV Genomic structural variant

HWE Hardy-Weinberg equilibrium

HMM Hidden Markov Model

HBD Homozygosity by Descent

IL-6 Interleukin-6

IDF International Diabetes Federation

IBD Identity by Descent

IBS Identity-by-State

IVA Ingenuity Variant Analysis

Kg Kilograms

Kb Kilobase

KoGES Korean Genome and Epidemiology Study

LD Linkage Disequilibrium

LOD Logarithm of the Odds

LRR Log R Ratio mtDNA Mitochondrial DNA mRNA Messenger Ribonucleic Acid

MODY Maturity Onset Diabetes of the Young

MAF Minor Allele Frequency

MAP Mapping File

ME Mendelian Error

MCP-1 monocyte chemoattractant -1

MHC Major Histocompatibility Complex m Meter m2 Meter squared

Mb Megabase mmol/l Millimoles per litre

MZ Monozygotic

NIDDM Non-Insulin Dependent Diabetes Mellitus

ND Neonatal Diabetes

21

NGS Next Generation Sequencing

NPL Non-Parametric Linkage

NEFA Non-Esterified Fatty Acids

NAHR Non-Allelic Homologous Recombination

NHEJ Non-Homologous End-Joining

OGTT Oral Glucose Tolerance Test

PNDM permanent neonatal diabetes mellitus

PCR Polymerase Chain Reaction

PFB Population Frequency of B-allele

PED Pedigree File

POE Parent of Origin Effect

P P-value

QF Qatar Foundation

QDA Qatar Diabetes Association

QC Quality Control qPCR Quantitative Fluorescent Polymerase Chain Reaction

QN Quantile Normalization

QBRI Qatar Biomedical Research Institute

RNA Ribonucleic Acid

RNAi Ribonucleic Acid interference

ROH Runs of Homozygosity

RBP4 Retinol-binding Protein-4

ROMA Representational Oligonucleotide Microarray Analysis

RT-PCR Real Time Polymerase Chain Reaction

SNP Single Nucleotide Polymorphism

SCH Supreme Council of Health

SAM Sequence Alignment Map

SIFT Scale-Invariant Feature Transform

T2D Type 2 Diabetes

T1D Type 1 Diabetes

TNDM Transient Diabetes Mellitus

22

TNF-α Tumor Necrosis Factor alpha

VCF Variant Call Format

VC Variance Component

VNTR Number of Tandem Repeats

WCMC-Q Weill Cornell Medical College in Qatar

WHO World Health Organization

WGS Whole Genome Sequencing

WES Whole Exome Sequencing

WTCCC Wellcome Trust Case Control Consortium

23

CHAPTER 1

INTRODUCTION

24

1 Introduction

1.1 Prevalence of Type 2 Diabetes

Type 2 Diabetes Mellitus (T2D) is a chronic condition formerly known as non-insulin dependent diabetes mellitus or adult onset diabetes mellitus. It is characterized by a high blood glucose level as a result of impaired insulin resistance and insulin secretion (Rung, Cauchi et al. 2009). T2D has become a major challenging health issue worldwide in the current decade (Doria, Patti et al. 2008, Bener, Zirie et al. 2009). Recent reports by the International Diabetes Federation have shown that the number of individuals with diabetes is 387 million globally, with an estimate of up to 592 million cases by 2035.The prevalence of T2D is growing rapidly with an estimate of 8.3% global increase annually (www.IDF.com). The number of cases is dramatically increasing worldwide showing greater effects in low and middle income countries due to lack of health care (Drong, Lindgren et al. 2012). It is reported that T2D is directly related to 4.9 million deaths per year, from which one person dies every seven seconds worldwide (http://www.idf.org/diabetesatlas/update-2014).

1.2 Prevalence of Type 2 Diabetes in Qatar

The prevalence of T2D in Qatar is increasing rapidly every year. Qatar and other countries from the Gulf region are ranked within the top 10 countries with the highest incidence of T2D. In 2014, the international diabetes federation reports had shown a prevalence of 16.3% among population of Qatar (http://www.idf.org/membership/mena/qatar). In 2012, the Supreme Council of Health (SCH) in Qatar conducted the World Health Organization (WHO) STEP survey. This survey is a well- established method that allows countries to identify the risk factors of chronic non-communicable diseases aiming to develop a complete outline of their population. A total number of 2,496 adults participated in the STEPwise survey. Reports from the SCH in Qatar showed a prevalence of 12.7% of diagnosed diabetes Qatari respondents to the STEPwise survey. The numbers reported by the IDF and the STEPwise survey brought the attention to the rapid increase in number of diabetic cases among the Qatari population.

The population of Qatar is descended from several migratory tribes who travelled to Qatar from different areas within the Arabian Peninsula. The Qataris are group of Bedouins, Persian or South- Asian mixture, and from African origin (Zaghlool, Al-Shafai et al. 2015). The majority of the population remains tribal and consists of a large number of consanguineous families. Consanguinity or consanguineous marriages are defined as the union between two individuals who share the same

25

ancestor within population isolates, with a coefficient of inbreeding (F) equal or larger than 0.0156 (Bittles 2001, Hamamy 2012); inbreeding coefficient calculates the likelihood of two genes at any locus in an offspring from a consanguineous marriage. These genes are expected to be identical and inherited from both parents (Tadmouri, Nair et al. 2009, Zaghlool, Al-Shafai et al. 2015). The Consanguinity rate in Qatar is about 51% with a coefficient of inbreeding of 0.024 according to Bener et al (Bener, Hussain et al. 2007). From the STEPwise survey, a consanguinity rate of 37% was detected among the survey respondents.

Several risk factors have been found to be associated with T2D, such as central obesity, hypertension, cardiovascular diseases and other metabolic disorders (Bener, Zirie et al. 2009). In addition, studies in the State of Qatar and other Gulf countries found that a family history of T2D was more likely higher among consanguineous families. Direct association of T2D and consanguinity was not well explained, but this may suggested a recessive transmission of T2D (Gosadi, Goyder et al. 2014). Reports have documented that consanguineous marriages resulted in a high genetically homogenous population (Bener, Hussain et al. 2007). Consanguinity played an important risk factor for T2D and multiple chronic disorders among Qataris caused by excess of homozygosity acting on recessive alleles (Bener, Hussain et al. 2007).

1.3 Gene-Environment Interaction in T2D

T2D and its complications are life-threatening conditions affecting multiple organs including cardiac, renal, nervous and vascular systems. Extensive efforts have been made to provide better understanding of gene-environment interactions that contributed to the development of T2D (Ahlqvist, Ahluwalia et al. 2011, Langenberg, Sharp et al. 2014). Complex interaction between genetic and environmental factors, such as changes in life style, high calorie intake and reduced physical activities has been explained (Sladek, Rocheleau et al. 2007). This complexity has made significant contribution to the rapid development of T2D among genetically susceptible individuals (Sladek, Rocheleau et al. 2007, Jafar-Mohammadi and Mccarthy 2008). Despite the fact of environmental factors effects, studies reported that genetic factors were major determinant of T2D. Although, a change in the environment with more exposure to westernized lifestyle affects the prevalence of T2D globally, genetic factors played a key role in defining individuals’ response to such changes. This meant that over years of environmental changes, genetic variants explained that some individuals were more subjective to unhealthy lifestyle and disease development (Ridderstråle and Groop 2009).

26

T2D is thought to be a result of the combined effects of insulin resistance and impaired function of pancreatic β cells resulting in impaired insulin uptake, which led to increasing glucose level in circulating blood (Kahn 1994). Although, the role of these effects in the pathogenesis of T2D is still argued, insulin resistance is thought to arise earlier during the development of the disease before any indication of glucose intolerance and impairment of pancreatic β-cells function (Martin, Warram et al. 1992). Different pathways are believed to interact during the development of T2D with an underlying effect of environmental and genetic factors (figure 1.1 shows the pathogenesis of T2D). There were compelling evidences of the involvement of genetic risk factors in T2D development. These included familial aggregation of insulin sensitivity and secretion, higher concordance rate reported among monozygotic twins as opposed to dizygotic twins, and prevalence of T2D among different ethnic group (Flegal, Ezzati et al. 1991, Weijnen, Rich et al. 2002). Genetic factors have been estimated to account for 30%-70% of developing T2D (Poulsen, Kyvik et al. 1999) with the hypothesis that T2D resulted from an interaction of multiple genes (Rich 1990, Kahn, Vicent et al. 1996).

Figure 1.1 Pathogenesis of Type 2 Diabetes. This figure illustrates the complex pathogenesis of T2D showing the genetic and environmental interaction that may influence T2D development by affecting insulin resistance and/or insulin sensitivity. Source: Doria, A., et al. Cell Metab 8(3): 186-200 (2008).

27

1.4 Obesity risk in Developing Type 2 Diabetes

Several studies have linked obesity with insulin resistance (Reaven 1988). Changes that occured during a normal life cycle, such as puberty, pregnancy and aging observed with insulin resistance led to variations in insulin sensitivity (DeFronzo 1979, Buchanan, Metzger et al. 1990, Moran, Jacobs et al. 1999). On the other hand, enhanced insulin sensitivity was usually associated with increased carbohydrate intake that could lead to obesity (Chen, Bergman et al. 1988).

Prevalence of obesity was increasing rapidly in modern societies and was thought to result from increased sedentary lifestyle, high calorie intake and genetic factors (Kahn, Hull et al. 2006). Risk of associations between obesity genes and developing T2D had been considered when studying T2D genes. The most successful discovery was the detection of a strong association of the FTO locus with developing T2D (Dina, Meyre et al. 2007, Frayling, Timpson et al. 2007). However, genome-wide association studies (GWAS) could not replicate such association when cases and controls when matched by BMI.

Different metabolites that were obesity related were involved in the development of insulin resistance. These included secretion of large amounts of non-esterified fatty acids (NEFA), glycerol, hormones –including adiponectin - and cytokines by adipose tissues in obesity (Wellen and Hotamisligil 2005, Scherer 2006, Shoelson, Lee et al. 2006). Insulin resistance was induced by retinol- binding protein-4 (RBP4) through reduced signaling of phosphatidylinositol-3-OH kinase in muscles and increased expression of glucogneogenic enzymes in the liver. Fatty acid oxidation was stimulated by adiponectin, which acted as insulin sensitizer (Yang, Graham et al. 2005, Kadowaki, Yamauchi et al. 2006, Scherer 2006). In addition, increased release of proinflammatory cytokines, such as tumor necrosis factor (TNF-α), interleukin-6 (IL-6), monocyte chemoattractant protein-1 (MCP-1) and other inflammatory factors acted as potential inflammatory mediators. Inflammatory mediators were involved in different pathways that lead to in insulin resistance (Fain, Madan et al. 2004, Wellen and Hotamisligil 2005). Increased production of NEFA by adipose tissue in obese and diabetic individuals was found to be the most critical factor correlated with insulin resistance (Roden, Price et al. 1996, Boden 1997). Different metabolites were believed to interact with increased NEFA release, which had been proposed to reduce insulin-receptor signaling. Decreased insulin level due to β cells dysfunction resulted in dysregulation of hepatic glucose production, in addition to impaired adipocyte metabolism. This could result in increased level of NEFA release due to lipolysis. Combining the effect of increased NEFA and β cells dysfunction could be associated with weight gain, which could exacerbate insulin resistance (Kahn, Hull et al. 2006).

28

1.5 Heritability of Type 2 Diabetes

The risk of developing T2D has been found to be driven by a complex interplay between genetic and environmental factors, as explained above in section 1.3. Genetic studies provided small evidence of familial clustering of T2D (Groop and Pociot 2014); and more comprehensive understanding of T2D heritability has been made through the twin studies of human T2D. These studies have explained less than 20% of T2D risk (Groop and Pociot 2014). The importance of twin studies came from two main reasons. First, heritability estimates can be derived by comparing the phenotypic agreements between monozygotic (MZ) and dizygotic (DZ) twin pairs. Second, the contribution of environmental factors in underlying genetic variations can be explained by phenotypic discordance in MZ twins (Boomsma, Busjahn et al. 2002, Kaminsky, Tang et al. 2009). Twin studies have shown that the concordance rate among monozygotic twin pairs is estimated by 70% and the concordance rate among dizygotic twins is estimated by 20%-30% (Newman, Selby et al. 1987, Kaprio, Tuomilehto et al. 1992, Hari Kumar and Modi 2014, Vaag, Brons et al. 2014). The clustering of T2D in families allowed the estimation of the life-time risk of developing the disease. It has been estimated that there was about 40% risk of developing T2D for offspring with one affected parent and this risk is increased to almost 70% when both parents are affected (Kaprio, Tuomilehto et al. 1992, Groop, Forsblom et al. 1996). Interestingly, the risk was greater if the mother is affected suggesting a potential epigenetic influence on T2D risk (Meigs, Cupples et al. 2000).

1.6 Genetic Basis of Type 2 Diabetes

Successful studies in genetics of diabetes have resulted from investigating both monogenic and syndromic forms of the disease. Although these studies have identified a small number of susceptibility genetic variants, discovery of common diabetes genes provided better understanding of pathogenesis of complex forms of T2D (Jafar-Mohammadi and Mccarthy 2008). Defining the T2D causative genes will provide concrete understanding of the disease genetic architecture and will improve our knowledge about disease diagnosis, treatment and prognosis (O'Rahilly, Barroso et al. 2005, Jafar-Mohammadi and McCarthy 2008).

29

1.6.1 Monogenic Form of T2D

Monogenic diabetes is referred to a form of T2D that resulted from mutation in a single gene. The genetic predisposition to T2D has been investigated for more than 20 years with relative success in defining the molecular basis of T2D. Familial clustering of T2D has been speculated in the 80s (Zimmet 1982), but the underlying genetic basis of familial segregation is not fully understood (Bonnefond and Froguel 2015). In 1985, T2D has been found to be transmitted as an autosomal dominant mode of inheritance (Vionnet, Hani et al. 2000). Evidence for familial clustering was first observed in 1992 by studying French families using familial linkage analysis (discussed in section

1.7.1) (Froguel, Vaxillaire et al. 1992). Monogenic genes including maturity onset diabetes of the young (MODY) genes were involved in different conditions of insulin resistance, neonatal diabetes genes and mitochondrial diabetes genes. In addition to, other rare genetic syndromes were involved. These genes were clinically diagnosed at early age of onset and exhibited Mendelian pattern of inheritance. To date, at least eight MODY genes were reported and clinically described as familial forms of non-insulin dependent diabetes with dominant mode of inheritance. Likewise, infants diagnosed with T2D in the first few months of life defined clinically as neonatal diabetes and were classified into permanent neonatal diabetes mellitus (PNDM). While infants who have non- permanent neonatal diabetes with episodes of re-emission are known to have transient diabetes mellitus (TNDM) (Froguel and Velho 2001, Hattersley, Bruining et al. 2006, Murphy, Ellard et al. 2008). It has been reported that approximately 20 different gene mutations or chromosomal realignments causing MODY and neonatal diabetes resulted in pancreatic beta-cell insufficiency (Ridderstråle and Groop 2009). This has led to further investigations of monogenic forms of T2D, which has in turn shed light on the different physiological mechanisms of T2D involving beta-cell function (Murphy, Ellard et al. 2008, Vaxillaire and Froguel 2008, Vaxillaire, D et al. 2009). In addition, mitochondrial diabetes was found to co-segregate in families having maternal history of T2D. Mitochondrial diabetes was found to be caused by point mutations, deletions, or duplications of mtDNA (Gerbitz, van den Ouweland et al. 1995, Luft and Landau 1995).

Estimations of type 2 diabetes cases with monogenic forms of the disease have been accounted for less than 2%-5% (Duncan, Navas et al. 1998). Monogenic T2D subtypes revealed the underlying biological pathways involved in pancreatic β-cells functions and glucose homeostasis (Duncan, Navas et al. 1998, Jafar-Mohammadi and McCarthy 2008). Table 1.1 included a list of genes involved in monogenic forms of T2D.

30

Table1.1. List of monogenic genes of type 2 diabetes

Gene Location Gene Name Diabetes Phenotype Maturity Onset Diabetes of the Young (MODY) HNF4A 20q12 hepatocyte nuclear factor 4α MODY 1; β-cell dysfunction GCK 7p15 glucokinase MODY 2; hyperglycaemia HNF1A 12q24 hepatocyte nuclear factor 1α MODY 3; β-cell dysfunction PDX1 13q12.1 insulin promoter factor-1 MODY 4; pancreatic agenesis and neonatal diabetes HNF1B 17q21 hepatocyte nuclear factor 1β MODY 5; pancreatic atrophy and neonatal diabetes NEUROD1 2q32 neurogenic differentiation 1 MODY 6; β-cell dysfunction KLF11 2p25 Kruppel-like factor 11 MODY 7; inactivation of insulin promoter CEL 9q34 Bile salt dependent lipase MODY 8; β-cell dysfunction PAX4 7q32.1 Paired Box Gene 4 MODY 9; β-cell dysfunction INS 11p15.5 Insulin MODY 10; impaired insulin secretion BLK 8p23-p22 B Lymphoid Tyrosine Kinase MODY 11; β-cell dysfunction Neonatal Diabetes KCNJ11 11p15.1 Potassium Channel, Inwardly Rectifying Permanent neonatal Subfamily J, Member 11 diabetes, reduced insulin secretion ABCC8 11p15.1 ATP-Binding Cassette, Sub-Family C Permanent neonatal (CFTR/MRP), Member 8 diabetes, reduced insulin secretion PLAGL1/ 6q24 Pleomorphic adenoma gene 1; Transient neonatal diabetes HYMAI hydatidiform mole transcript GATA6 18q11.2 GATA Binding Protein 6 pancreatic agenesis GATA4 8p23.1 GATA Binding Protein 4 pancreatic agenesis GLIS3 9p24.2 Zinc Finger Protein 515 β-cell dysfunction IER3IP1 18q12 Immediate Early Response 3 Interacting Permanent neonatal Protein 1 diabetes MNX1 7q36 Motor Neuron And Pancreas Homeobox 1 pancreatic hypoplasia NEUROG3 10q21.3 Neurogenin 3 Permanent neonatal diabetes, β-cell dysfunction NKX2-2 20p11.22 NK2 Homeobox 2 reduced insulin secretion PTF1A 10p12 Pancreas Specific Transcription Factor, 1a Permanent neonatal diabetes, cerebellar agenesis RFX6 6q22.1 Regulatory Factor X, 6 pancreatic hypoplasia SLC2A2 3q26.2 Solute Carrier Family 2 (Facilitated Transient neonatal diabetes, Glucose Transporter), Member 2 lethal neonatal diabetes SLC19A2 1q24.2 Solute Carrier Family 19 (Thiamine thiamine-responsive Transporter), Member 2 megaloblastic anaemia syndrome

31

1.6.2 Polygenic Form of T2D

Polygenic diabetes is referred to common forms of diabetes that resulted from combined effects of multiple common genetic variants and environmental factors. Common variants were thought to play a crucial role in developing T2D and understanding its physiology (Froguel and Velho 2001). Studies of polygenic forms of T2D revealed that common variants within several T2D genes plus MODY genes were potentially involved in β-cells functions. Also, they were found to be involved in glucose hemostasis pathways.

1.7 Gene Mapping Approaches to Identify T2D Genes

Genetic mapping approaches have been used in the latest decades to understand the genetic basis of most Mendelian inherited diseases and complex traits (Altshuler, Daly et al. 2008). Genetic mapping was a powerful approach used to identify genes and biological processes involved in developing inherited phenotypic traits (Altshuler, Daly et al. 2008). Different gene mapping approaches enabled the identification of both rare and common variants that are predisposing to T2D. Traditionally, two approaches have been widely used to identify the underlying genetic etiology of a disease phenotype. This included linkage analysis to determine genomic regions that might harbor disease-associated variants shared between affected subjects within a family. Also, earlier attempts included candidate gene studies to define a correlation between a disease phenotype with variation in genomic DNA sequences (Vaxillaire and Froguel 2009). Linkage analysis and candidate gene studies have shown limited success in the discovery of causative genetic variants. This led to the improvement of single nucleotide polymorphisms (SNPs) genotyping technology, which included the possibility of genotyping millions of variants. In addition to, the recent advances in the new technologies of next generation sequencing (NGS) including exome and genome sequencing to identify causal variants.

The above mentioned gene mapping approaches are discussed in this section.

1.7.1 Linkage Analysis Linkage analysis is a classical strategy that is used to map disease causing genes. Linkage is usually implemented to identify genomic regions that harbor a locus or multiple loci that modulated the expression of a disease trait (Ferreira 2004). This method is performed by using SNPs or microsatellite markers to look for co-segregations between markers and disease of interest. Marker

32

co-segregation with a disease phenotype that is only shared between affected relatives provided evidenced for linkage (Kruglyak, Daly et al. 1996, Jafar-Mohammadi and Mccarthy 2008). This method tested for inheritance patterns with different disease traits and used two different inheritance models, which were parametric and non-parametric analysis. The former assumed a trait-causing specific genetic model and the latter didn’t assume an inheritance model (Kruglyak, Daly et al. 1996, Jafar-Mohammadi and Mccarthy 2008). The specification of a precise genetic model to perform parametric linkage analysis has limited the use of this sort of analysis to discrete traits with Mendelian mode of inheritance. However, multiple discrete traits might have complex mode of inheritance that might be the result of multiple gene interaction. Therefore, non-parametric linkage (a model free linkage approach) better to be chosen. This method didn’t require a specification for the mode of inheritance. Also, it is based on finding similarities between the genotype and phenotype of interest that are shared between affected individuals (Ferreira 2004).

Intensive investigations using family linkage scan were successful in revealing a number of genes that were associated with T2D: calpin 10 (CAPN10), adiponectin, C1Q and collagen domain containing (ADIPOQ) (Vionnet, Hani et al. 2000), hepatocyte nuclear factor 4α (HNF4A) (Silander, Mohlke et al. 2004), ectonucleotide pyrophosphatase/phosphodiesterase 1 (ENPP1) (Meyre, Bouatia-Naji et al. 2005), and transcription factor-7- like 2 (TCF7L2) (Grant, Thorleifsson et al. 2006, Bonnefond and Froguel 2015). However, successful association studies had replicated the risk loci associated with T2D within HNF4A and TCF7L2 only (Sladek, Rocheleau et al. 2007, Kooner, Saleheen et al. 2011, Bonnefond and Froguel 2015).

The most successful association identified with T2D before the application of association studies (discussed in section 1.7.3) was the detction of TCF7L2 . This gene included a common variant that was found to have the strongest effect on T2D risk with an odds ratio (OR) of 1.50 (Grant, Thorleifsson et al. 2006). The association risk variants within TCF7L2 with T2D was identified in chromosome 10q among Icelandic and Mexican-American populations (Duggirala, Blangero et al. 1999, Reynisdottir, Thorleifsson et al. 2003, Ahlqvist, Ahluwalia et al. 2011). The risk association between a number of genetic markers and T2D within TCF7L2 had been confirmed in other studies from different ethnicities (Tong, Lin et al. 2009). The underlying mechanisms by which TCF7L2 predisposed to T2D were not explained yet. Risk variants within this gene had been reported to be associated with impaired insulin secretion, as well as, pancreatic β-cells dysfunctions (Lyssenko, Lupi et al. 2007). In 1996, a group of researchers were successful in reporting a significant linkage region located on chromosome 2q37 within sib pairs of Mexican Americans (Hanis, Boerwinkle et al. 1996). The same

33

results were investigated again and found to have an interaction with another locus within chromosome 15 (Cox, Frigge et al. 1999). Subsequently, the identified region was narrowed down to 7 cM representing 1.7 megabases only of the genomic DNA. The overlapping gene was genotyped with 21 SNPs and three markers were found to be associated with T2D within the intronic region of the gene coding for CAPN 10 (Bouchard, Pérusse et al. 1987, Carey, Nguyen et al. 1996, Horikawa, Oda et al. 2000, Ling, Poulsen et al. 2004). Although, the association of CAPN10 with T2D risk had been extensively studied in several association studies, the results couldn’t confirm any association. However, a recent meta-analysis study included 2758 T2DM cases and 2762 controls suggested an association of CAPN10 with increased risk of developing T2D (Yan, Li et al. 2014).

1.7.2 Candidate Gene Studies

Candidate gene studies were used for investigating the genetic variants among the general populations on the basis of association testing. The concept of candidate gene studies was to analyze disease variants within genes that had biological relevance to the disease of interest. The reasoning behind this approach was that focusing on variants within specific biologically-related genomic regions would lead to the detection of variant mutations influencing the studied trait (Ahlqvist, Ahluwalia et al. 2011). Early detection of T2D common variants was made through testing candidate genes: 1) genes with known biological function in glucose regulation; 2) variants within MODY genes; 3) genes proved to be associated with T2D tested in animal models; and 4) genes included in inherited diseases with diabetic phenotype (Froguel and Velho 2001).

The application of candidate gene studies-involving functional and/or positional candidate genes (Doria, Patti et al. 2008) in more complex forms of T2D have been in some cases successful (Huang, Cheng et al. 2006, Guan, Pluzhnikov et al. 2008, Wheeler and Barroso 2011). Functional candidate gene studies focused on demonstrating genetic variants within T2D known genes, whilst positional candidate gene studies designed to detect genes in linkage regions. Several risk loci associated with T2D have been discovered by using the candidate gene approach, namely the peroxisome proliferators-activated receptor-γ (PPRG- γ), insulin receptor substrate 1 (IRS1), potassium inwardly- rectifying channel, subfamily J, member 11 (KCNJ11), Wolfram syndrome 1 (WFS1), HNF1 homeobox A (HNF1A), and HNF1 homeobox B (HNF1B) (Gloyn, Weedon et al. 2003, Altshuler, Daly et al. 2008, Wheeler and Barroso 2011),(Ahlqvist, Ahluwalia et al. 2011). Associations of these variants have been replicated in large association studies. However, two candidate genes showed strong associations with T2D.

34

The first gene with evidences of T2D association was the PPARG gene, which encoded the nuclear receptor PPAR-γ (Deeb, Fajas et al. 1998). This gene was a strong T2D gene, as PPAR-γ receptor was targeted by the insulin-sensitizing class of drugs (thiazolidinedione) used to treat T2D (Ahlqvist, Ahluwalia et al. 2011). The PPARG gene was expressed in adipose tissue; 15% of Europeans showed that a substitution of a proline for alanine at position 12 (Pro12Ala) in PPARG was associated with insulin sensitivity, elevated transcriptional activity and protective against T2D (Deeb, Fajas et al. 1998). Although initial replication studies showed negative outcomes due to power limitations, meta-analysis studies combining all published data showed a strong evidence for association between Pro12Ala variant and T2D (Altshuler, Hirschhorn et al. 2000, Tönjes, Scholz et al. 2006, Gouda, Sagoo et al. 2010).

IRS1 is a protein that is phosophorylated by insulin receptor tyrosine kinase and involved in insulin function. The IRS1 Gly972Arg polymorphism was first reported in 1993 (Almind, Bjo̸rbaek et al. 1993) to be associated with T2D, but findings from subsequent studies have been inconsistent. This might be due to small sample sizes or due to interaction with body fat (Van Dam, Hoebee et al. 2004). However, a few years ago a study showed a strong association of Gly972Arg polymorphism with TD (Rung, Cauchi et al. 2009).

Associations of other common variants within KCNJ11, WFS1, HNF1A, and HNF1B have been replicated in various large association studies (Gudmundsson, Sulem et al. 2007, Sandhu, Weedon et al. 2007, Saxena, Voight et al. 2007, Scott, Mohlke et al. 2007, Zeggini, Weedon et al. 2007, Franks, Rolandsson et al. 2008, Voight, Scott et al. 2010).

1.7.3 Genome Wide Association Studies

The term Genome Wide Association Study (GWAS), also known as common-variants association study referred to the investigation of common genetic variants associated with a disease phenotype. GWAS has been established based on the improved knowledge of human genetic variations and patterns of SNP markers at linkage disequilibrium. This was combined with advances of genotyping technologies that provided the possibility to perform an efficient and cost-effective genome-wide genotyping using very large number of markers (Ku, Loy et al. 2010). Basically, these studies compared between cases and controls using a large number of markers distributed over the genome (including SNPs and sometimes copy number variants – CNVs) (Bonnefond, Durand et al. 2010). GWAS offered high resolution genome scans using more than a million probes to examine the associations between common variants and disease traits. This was usually followed by replication

35

studies in independent cohorts to confirm findings of GWAS. Associations were strictly determined when the obtained associations reached genome-wide significance level with P-values of <5x10¯⁸ as suggested by Risch and Merikangas (Risch and Merikangas 1996, Dudbridge and Gusnanto 2008, Ku, Loy et al. 2010, Visscher, Brown et al. 2012). Statistically significant associations could be estimated after multiple comparisons adjustments; to correct the statistical confidence measures of a given dataset based on the number of hypotheses being tested (Noble 2009).

The advances in GWAS provided better understanding of the genetic architecture of T2D (Doria, Patti et al. 2008), which were poorly understood when GWAS was first implemented in 2007 (Bonnefond and Froguel 2015). A major breakthrough in the understanding of genetic basis of T2D had been made by the introduction of GWA studies in 2007.

Five large GWA studies of T2D were published in 2007 (Saxena, Voight et al. 2007, Scott, Mohlke et al. 2007, Sladek, Rocheleau et al. 2007, Steinthorsdottir, Thorleifsson et al. 2007, Zeggini, Weedon et al. 2007) followed by another five smaller genome wide screening studies for T2D (Florez, Manning et al. 2007) (Hanson, Bogardus et al. 2007, Hayes, Pluzhnikov et al. 2007, Rampersaud, Damcott et al. 2007, Salonen, Uimari et al. 2007). The large GWA studies were performed at two stages starting with a screening cohort that included a large number of cases and controls. Then, the most significant associations were replicated in an independent dataset. Both the screening and replication cohorts included Europeans; an exception was made for the Decode study that involved replication data of Chinese and West Africans. About 7,000 cases and 12,000 controls were involved in the GWA screening set. A larger set of cases (20,000) and controls (26,000) were included in the replication studies. These studies were conducted using Illumina 300K genotyping arrays and Affymetrix 500K array platforms; both platforms covered almost 80% of common genetic variants present in Caucasians’ genomes.

The GWA screening studies identified multiple potential variants. However, 15 loci were found to be related to the risk of developing T2D. Three loci have been previously identified using classical mapping approaches (TCF7L2, KCNJ11, PPARG); the remaining 12 loci were potential novel T2D- related genes, which were ADAMTS9, CDC123/CAMK1D, CDKAL1, CDKN2A/B, IGF2BP2, JAZF1, NOTCH2, RBMS1, THADA, and TSPAN8/LGR5 (Saxena, Voight et al. 2007, Scott, Mohlke et al. 2007, Sladek, Rocheleau et al. 2007, Steinthorsdottir, Thorleifsson et al. 2007, Zeggini, Weedon et al. 2007).

The major advances of the GWAS method inspired other researchers to apply this approach to a larger population sample (Rung, Cauchi et al. 2009). The initial meta-analysis study was the DIAbetes

36

Genetics Replication and Meta-analysis (DIAGRAM) consortium that combined association data from three groups of European descendants. The study included the Diabetes Genetics Initiative (DGI), Finland–United States Investigation of NIDDM Genetics (FUSION) and Wellcome Trust Case Control Consortium (WTCCC) (Zeggini, Scott et al. 2008). This meta-analysis (including 10,128 Europeans) and consequent replication (including 53,975) identified six novel in JAZF1, CDC123-CAMK1D, TSPAN8-LGR5, THADA, ADAMTS9 and NOTCH2 (Zeggini, Scott et al. 2008). In addition, the association of IRS1 with T2D, which was identified through candidate gene studies, had been confirmed in this meta-analysis study, details described above (Rung, Cauchi et al. 2009). IRS1 risk variants were further associated with insulin resistance and hyperinsulinemia (Rung, Cauchi et al. 2009). A large meta-analysis study (DIAGRAM+) added to the DIAGRAM study five European cohorts (DGDG, KORA, Rotterdam, DeCODE, EUROSPAN). These cohorts included a total of 8,130 cases and 38,987 controls. Replication analysis for 34,412 cases and 59,925 controls was followed. Twelve novel associations were detected within the genomic regions of KLF14, TP53INP1, HNF1A, ZFAND6, BCL11A, ZBED3, CHCHD9, KCNQ1, CENTD2, HMGA2, PRC1, and DUSP9. Two loci identified within MTNR1B, IRS1 had been previously reported using samples that partially overlapped with participants of this meta-analysis (Voight, Scott et al. 2010). Data from DIAGRAM+ study was combined with data from European individuals including 12,171 cases and 56,862 controls creating the largest current GWAS dataset (DIAGRAMv3). The combined GWA studies dataset provided the basis of SNPs marker selection to customize Illumina CardioMetabochip (Metabochip). In 2012, a meta-analysis study was conducted on variants of the Illumina Metabochip involving more than 150 individuals (Morris, Voight et al. 2012) . Eight new genetic loci identified within ZMIZ1, ANK1, KLHDC5, TLE1, ANKRD55, CILP2, MC4R and BCAR1, achieved genome-wide significant level. In addition, two genetic loci with significant associations identified within HMG20A and GRB14 (Morris, Voight et al. 2012). These two associations had been reported previously in South Asian population (Kooner, Saleheen et al. 2011).

Table 1.2 includes all T2D-associated genes identified through GWAS and meta-analysis.

For a better understanding of T2D genetic basis, a recent trans-ethnic meta-analysis association study was conducted using data collected from European, Mexican and Mexican American, south Asian and East Asian descendants. Seven new loci predisposing to T2D (within TMEM154, SSR1- RREB1, FAF1, POU5F1-TCF19, LPP, ARL15 and MPHOSPH9) observed to have strong association signals. This study highlighted the importance of joining GWA studies from various ancestry to provide better characterization of common variants contributing to T2D risk (Consortium, Consortium et al. 2014). Both GWAS and subsequent meta-analysis studies have confirmed the

37

presence of more than 120 genetic variants contributing to the risk of T2D and other glycemic traits (Prasad and Groop 2015) (Singh 2015). Together, these studies have explained only 10% of T2D risk (Singh 2015). Most of the T2D associated SNPs discovered by the GWAS approach were involved in beta-cell function (Bonnefond, Durand et al. 2010) and insulin resistance (Doria, Patti et al. 2008). Although, GWA studies have been successful in identifying more than 500 genetic markers with minor allele frequency (MAF) of >5%, however, these associations had modest effect sizes (Maher 2008). The “common disease, common variant” hypothesis underling GWA studies had explained a small fraction of the genetic component for these traits (Maher 2008).

38

Table1.2. Genetic variants identified through large GWAS and met-analysis

Gene SNP Mechanism loci Reference PPARG rs18012824 Insulin 3p25.2 (Barroso, Gurnell et al. Sensitivity 1999, Agostini, Gurnell et al. 2004) KCNJ11/ ABCC8 rs5219/rs757110 Beta-cell 11p15.1 (Gloyn, Weedon et al. dysfunction 2003, Florez, Burtt et al. 2004) TCF7L2 rs7903146 Beta-cell 10q25.2 (Grant, Thorleifsson et dysfunction al. 2006, Zhang, Qi et al. 2006) WFS1 rs1801214 Beta-cell 4p16.1 (Sandhu, Weedon et al. dysfunction 2007) CDKN2A/2B rs10811661 Beta-cell 9p21.3 (Steinthorsdottir, dysfunction Thorleifsson et al. 2007, Zeggini, Weedon et al. 2007) CDKAL1 rs7754840 Beta-cell 6p22.3 (Steinthorsdottir, dysfunction Thorleifsson et al. 2007) SLC30A8 rs13266634 Decreased β - 8q24.11 (Sladek, Rocheleau et cell function al. 2007) HHEX/IDE rs1111875 Beta-cell 10q23.33 (Sladek, Rocheleau et dysfunction al. 2007) HNF1B (TCF2) rs757210 Beta-cell 17q12 (Winckler, Graham et dysfunction al. 2005) FTO rs8050136 obesity, 16q12.2 (Burton, Clayton et al. Increased 2007, Frayling, Timpson triglycerides et al. 2007) and cholesterol IGF2BP2 rs4402960 Beta-cell 3q28 (Zeggini, Weedon et al. dysfunction 2007, Omori, Tanaka et al. 2009) CDC123/CAMK1D rs12779790 Beta-cell 10p13 (Zeggini, Scott et al. dysfunction 2008, Omori, Tanaka et al. 2009) JAZF1 rs864745 Beta-cell 7p15.1 (Zeggini, Scott et al. dysfunction 2008, Omori, Tanaka et al. 2009) TSPAN8/LGR5 rs7961581 Beta-cell 12q21.1 (Zeggini, Scott et al. dysfunction 2008, Omori, Tanaka et al. 2009) THADA rs7578597 Beta-cell 2p21 (Zeggini, Scott et al. dysfunction 2008, Omori, Tanaka et al. 2009) ADAMTS9 rs4607103 Decreased 3p14.1 (Zeggini, Scott et al. insulin 2008, Omori, Tanaka et sensitivity al. 2009)

39

Gene SNP Phenotype Location Reference

NOTCH2 rs10923931 Unknown 1p11.2 (Zeggini, Scott et al. (Membrane 2008, Omori, Tanaka et receptor) al. 2009) KCNQ1 rs2237892/rs231362 Beta-cell 11p15.5 (Voight, Scott et al. dysfunction 2010) IRS1 rs2943641/rs7578326 Increased 2q36.3 (Voight, Scott et al. insulin 2010) resistance DGKB-TMEM195 rs2191349 Decreased β - 7p21.2 (Dupuis, Langenberg et cell function, al. 2010) increased fasting glucose GCK rs4607517 Insulin 7p13 (Dupuis, Langenberg et sensitivity al. 2010) GCKR rs780094 Increased 2p23.3 (Dupuis, Langenberg et insulin al. 2010) resistance PROX1 rs340874 Decreased β - 1q32.3 (Dupuis, Langenberg et cell function, al. 2010) increased fasting glucose ADCYS rs11708067 Decreased 3q21.1 (Dupuis, Langenberg et insulin al. 2010) sensitivity DUSP9 rs5945326 Beta-cell Xq28 (Voight, Scott et al. dysfunction 2010) BCL11A rs243021 Beta-cell 2p16.1 (Voight, Scott et al. dysfunction 2010) ZBED3 rs4457053 Beta-cell 5q13.3 (Voight, Scott et al. dysfunction 2010) KLF14 rs972283 Insulin action 7q32.3 (Voight, Scott et al. 2010) TP53INP1 rs896854 Unknown 8q22.1 (Voight, Scott et al. 2010) CHCHD9 rs13292136 Unknown 9q21.31 (Voight, Scott et al. 2010) CENTD2/ARAP1 rs1552224 Beta-cell 11q13.4 (Voight, Scott et al. dysfunction 2010) HMGA2 rs1531343 Transcriptional 12q14.3 (Voight, Scott et al. regulator 2010) HNF1A rs7957197 Beta-cell 12q24.31 (Voight, Scott et al. dysfunction 2010) ZFAND6 rs11634397 Beta-cell 15q25.1 (Voight, Scott et al. dysfunction 2010) PRC1 rs8042680 Unknown 15q26.1 (Voight, Scott et al. ( Cytokinesis 2010) regulator)

40

Gene SNP Phenotype Location Reference

MTNR1B Decreased β - 11q14.3 (Bouatia-Naji, rs10830963/rs1387153 cell function, Bonnefond et al. 2009, increased Lyssenko, Nagorny et fasting glucose al. 2009, Prokopenko, Langenberg et al. 2009, Voight, Scott et al. 2010) ZMIZ1 rs12571751 Beta-cell 10q22.3 (Morris, Voight et al. dysfunction 2012) ANK1 rs516946 Beta-cell 8p11.21 (Morris, Voight et al. dysfunction 2012) KLHDC5 rs10842994 Beta-cell 12p11.22 (Morris, Voight et al. dysfunction 2012) TLE1 rs2796441 Beta-cell 9q21.31 (Morris, Voight et al. dysfunction 2012) ANKRD55 rs459193 Insulin 5q11.2 (Morris, Voight et al. sensitivity 2012) CILP2 rs10401969 Beta-cell 1913.11 (Morris, Voight et al. dysfunction 2012) MC4R rs12970134 Beta-cell 18q21.32 (Morris, Voight et al. dysfunction 2012) BCAR1 rs7202877 Beta-cell 16q23.1 (Morris, Voight et al. dysfunction 2012) HMG20A rs7177055 Beta-cell 15q24.3 (Morris, Voight et al. dysfunction 2012) GRB14 rs13389219/rs3923113 Insulin 2q24.3 (Morris, Voight et al. sensitivity 2012, Consortium, Consortium et al. 2014) ST6GAL1 rs16861329 Insulin 3q27.3 (Consortium, sensitivity Consortium et al. 2014) VPS26A rs1802295 Beta-cell 10q22.3 (Consortium, dysfunction Consortium et al. 2014) HMG20A rs7178572 Beta-cell 15q24.3 (Consortium, dysfunction Consortium et al. 2014) AP3S2 rs2028299 Beta-cell 15q26.1 (Consortium, dysfunction Consortium et al. 2014) HNF4A rs4812829 Beta-cell 20q13.12 (Consortium, dysfunction Consortium et al. 2014) TMEM154 rs6813195 Decreased β - 4q31.3 (Consortium, cell function Consortium et al. 2014) SSR1-RREB1 rs9505118 Insulin 6p25.1 (Consortium, sensitivity Consortium et al. 2014) FAF1 rs17106184 unknown 1p33 (Consortium, Consortium et al. 2014) POU5F1-TCF19 rs3130501 Beta-cell 6p21.33 (Consortium, dysfunction Consortium et al. 2014) LPP rs6808574 unknown 3q27.3 (Consortium, Consortium et al. 2014)

41

Gene SNP Phenotype Location Reference ARL15 rs702634 unknown 5q11.2 (Consortium, Consortium et al. 2014) MPHOSPH9 rs4275659 Decreased β - 12q24.31 (Consortium, cell function Consortium et al. 2014)

1.8 Missing Heritability of Complex Diseases

Heritability is usually referred to the percentage of phenotypic variation that resulted from an additive genetic effect (Prasad and Groop 2015). Genetic studies have shown that multiple complex diseases such as T2D are known to be partially clustered within the families and are thought to be predisposed by complex interplay between genetic and environmental factors. However, the underlying genetic components contributing to the risk of complex diseases are not fully explained yet. Although, several genetic risk loci have been identified, comprehensive understanding of disease heritability is still challenged (Hardy and Singleton 2009, Manolio, Collins et al. 2009). The genetic architecture of complex diseases can be better explained by detecting the heritable genetic components contributing to the disease risk (Prasad and Groop 2015). Inability to outline such genetic components is usually defined as missing heritability (Groop and Pociot 2014). Partial heritability estimates of complex phenotypes have been made by using the improved technology of genome-wide association studies (GWAS). These studies have revealed hundreds of common genetic variants that are believed to be associated with common complex disorders (Frazer, Ballinger et al. 2007, Hardy and Singleton 2009, Manolio, Collins et al. 2009). However, common variants were not the only contributors for the genetic variance in developing complex diseases (Collins, Guyer et al. 1997, Pritchard 2001, Reich and Lander 2001). In addition, these variants were found to have small effect sizes, which can be explained partially by the limited GWA studies to specific ethnic group, and to an inadequate design of GWAS (Pearson and Manolio 2008). Different potential sources of missing heritability have been discussed in literature which may possibly make inaccurate definition of some complex diseases and difficult identification of genetic causes of such diseases. These sources included, rare variants with strong effects that were not captured by the existing genotyping arrays to detect variants with minor-allele frequency of 5% or more in a population; structural variants that were not well-tagged by the available assays; inability to provide better understanding of shared environment between family members (Manolio, Collins et al. 2009); distorted parent-of-origin effects (Prasad and Groop 2015); gene-gene interaction (Prasad and Groop 2015); the unexplained role of gut microbiome in the development of T2D (I Naseer, Bibi et al. 2014); 42

and the early studies of defining the role of epigenetic modifications related to T2D (Prasad and Groop 2015).

Here I will focus only on potential sources of missing heritability that have been examined in this PhD project;

1.8.1 Rare Variants GWASs have assumed a potential role of low frequency genetic variants to the missing heritability of complex diseases. Genetic variants with low minor allele (MAF) are usually referred to variants with 0.005

43

1.8.2 Genomic Structural Variation (GSV) Additional strategies have been adopted to identify more genomic factors to explain the missing heritability, for example investigating the genomic structural variations including copy number variants, inversions and translocations. Such variations are not well-tagged by the available genotyping assays (Manolio, Collins et al. 2009). Also, the potential contribution of the different forms of these structural variations has not been explicitly investigated through most GWAS. The investigation of copy number variants (CNV) has gained special attention as the methods of identifying CNVs have been highly improved (Scherer, Lee et al. 2007, McCarroll 2008). Studies have reported that both common and rare CNVs are related to disease risk. Disease-related common CNVs with modest effect sizes are found to be small (20–45 kb), while highly penetrant rare CNVs with large effect sizes are found to be very large (ranging from 600kb to 3Mb) (de Vries, Pfundt et al. 2005, Sebat, Lakshmi et al. 2007, Xu, Roos et al. 2008). Several studies proved the contribution of CNVs to explain missing heritability in multiple complex diseases such as obesity (Walters, Coin et al. 2013)) and neurological disorders (Rudd, Axelsen et al. 2014). However, such CNVs are found to have relatively small effects on risk of T2D (Groop and Pociot 2014). More details about CNVs, what they are? Mechanisms of CNV formation and their implications to T2D are discussed in section 1.10.

1.8.3 Epigenetic Modifications The environment and other external factors such as the age could influence the gene expression and eventually the disease phenotype (Prasad and Groop 2015). Despite the fact that the nucleotide sequence is not altered, the phenotype is changed by epigenetic changes. These changes included DNA methylation, histone modifications, or activation of microRNAs (Skinner 2011, Prasad and Groop 2015). Studies have reported that changes in the epigenome are found to be constant and heritable during cell divisions (Chong and Whitelaw 2004, Liu, Li et al. 2008). More details about epigenetics are discussed in section 1.11.

1.8.4 Parent-of-Origin The differential expression of a disease phenotype that is reliant on whether the transmission of the allele comes from the father or the mother is usually referred to parent-of-origin effects (POE) (Rampersaud, Mitchell et al. 2008). The contribution of POE in explaining the missing heritability of complex diseases such as obesity and T2D is still under investigation (Rampersaud, Mitchell et al. 2008). However, few association studies have shown evidences of POE. Studies on effects of parental origins gained particular attention especially when studying T2D, since POE have shown to

44

be involved in the development of T2D (Rampersaud, Mitchell et al. 2008, Groop and Pociot 2014). More details are presented in Chapter five section 5.9.

Despite the fact that hundreds of genetic variations have been determined, little has been achieved in explaining the missing heritability of common complex diseases. Comprehensive strategies to identify sources of genetic variations could enhance our understanding of disease risk (Manolio, Collins et al. 2009).

1.9 Next Generation Sequencing (NGS)

NGS is a high throughput, fast, low cost and efficient tool that provides full DNA sequence generating millions of short DNA reads at a single time. This approach enables the identification of various gene mutations that are essential in biological and medical implications (Bentley, Balasubramanian et al. 2008). Following the post-GWAS era, many studies have been shifted towards the application of sequencing technologies. This has been triggered by incomplete information about biological relevance of variants detected through GWAS (Marjoram and Thomas 2014). Recent advances in next generation sequencing technology (including whole exome and whole genome) has been used to detect rare variants that might be of etiological significance for T2D (Metzker 2010). It is suggested that rare and deleterious variants may have strong effects on complex diseases (Cirulli, Kasperavičiūtė et al. 2010). In family studies, it is expected that rare variants in the general population could be common in some families due to variants enrichment in that particular family (Shi and Rao 2011). Successful attempts proved the importance of using the new high throughput next generation sequencing technologies. An important example has been published in 2010 by Bonnefond et al (Bonnefond, Durand et al. 2010); the study has assessed the feasibility of implementing the high throughput technology of NGS for molecular diagnosis of monogenic neonatal T2D. In this study patients carrying known heterozygous point mutations within ABCC8, KCNJ11 and INS have been sequenced using whole exome sequencing (Bonnefond, Durand et al. 2010). Initial diagnosis was based on direct Sanger sequencing of about 42 PCR reactions from ABCC8, KCNJ11 and INS. Results from WES of patients presented with neonatal diabetes have shown novel mutation within the ABCC8 (Bonnefond, Durand et al. 2010). Applying WES for the identification of causal variants of these three genes was found to be more accurate, rapid and cost-effective compared to Sanger

45

sequencing. (Bonnefond, Durand et al. 2010, Greeley, John et al. 2011, Bonnefond, Philippe et al. 2014). Another study has published recently discussed the implementation of a new approach based on combining two technologies, the PCR enrichment in microdroplets (RainDance Technologies) and NGS (Bonnefond, Philippe et al. 2014). The application of this approach has shown to be rapid, efficient, and cost-effective. Also, more accurate diagnosis of various complex disorders such as T2D can be achieved using this approach (Bonnefond, Philippe et al. 2014). Other studies described different analytical strategies for better understanding of rare variants characteristics. Two strategies are usually applied to map rare genetic variants using next generation sequencing, namely family-based phenotypes and population extreme phenotypes (Cirulli and Goldstein 2010). The first strategy suggested initial sequencing of distant relatives and affected individuals to identify shared variants with the assumption that a small number of variants will be shared between the distant relatives. The identified variants will be filtered based on their function, frequency and segregation within affected individuals. Incorporation of linkage evidence could facilitate the identification of variants by focusing on regions of interest (Cirulli and Goldstein 2010). The combination of linkage evidence and functional association could improve the detection of disease-causal variants. Second, individuals who had extreme disease phenotype better to be sequenced. Since, disease genes are expected to have high frequencies among individuals with extreme phenotype. Sequencing a small number of individuals with extreme phenotypes could identify potential variants associated with the disease (Cirulli and Goldstein 2010). Although, the identification of disease related variants could be successful with sequencing a small number of individuals, defining the causative variants will be highly dependent on their function. To complement this strategy, it is better to be followed with elucidating variants co-segregating among family members with extreme phenotype (Cirulli and Goldstein 2010). More analytical strategies have been introduced looking for rare variants, for example methods for collapsing rare genetic, as complex diseases may result from interaction of multiple rare variants; and group of rare variants could be associated to a specific disease phenotype (Schork, Wessel et al. 2008). Several methods could be implemented to detect rare/low frequency variants based on collapsing or accumulating rare variants into a single locus within a selected region (Li and Leal 2008). Multiple computational tools could be used for collapsing rare variants, such as Combined Multivariate and Collapsing (CMC) method, and Generalized C-alpha. These methods could be implemented using binary and quantitative traits on family-based subjects, and assess combined effects of traits on disease phenotype.

46

In this study, I will take the advantage of the advanced technology of next generation sequencing in sequencing all families selected for the project combined with using linkage analysis to define region of interest.

1.10 Copy Number Variation (CNV)

Over the past ten years, ample genomic variabilities of an intermediate size known as structural variations had been recognized. These variations included deletions, duplications, insertions, inversions, translocations and copy number variants. CNVs formed the largest element of structural variants (Zarrei, MacDonald et al. 2015). Copy number variants (CNVs) defined as large DNA segments that were ranging in size from 1 kilobase to several megabases and were considered as an important source for genetic diversity. CNVs could result from deletions, insertions, duplications, triplications or translocations (Redon, Ishikawa et al. 2006, Stankiewicz and Lupski 2010, Zarrei, MacDonald et al. 2015). It had been noted that copy number variants had substantial effects on chromosomal domains and could be found in coding or non-coding regions within genomes (Beckmann, Sharp et al. 2008, Girirajan, Campbell et al. 2011, Zarrei, MacDonald et al. 2015). Comprehensive understanding of the total number of CNVs, size, position, genomic content, distribution of CNVs among population and their effects on disease risk is still required (Stankiewicz and Lupski 2010, Zarrei, MacDonald et al. 2015). Different phenotypes might be affected by CNVs depending on the overlapping genes and its function within the CNV region (de Smith, Walters et al. 2008). Alterations in gene expression levels could be affected by variations of copy number in the genome (de Smith, Walters et al. 2008). Heritable changes in gene expression had been reported to be affected by CNVs with an estimate of 17.7% (Stranger, Forrest et al. 2007). Different genetic components were found to contribute to the development of complex disorders for example, CNVs within known genes and associated variants within known genes (Fanciulli, Norsworthy et al. 2007, Stranger, Forrest et al. 2007, de Smith, Walters et al. 2008). There were a number of successful studies, which reported the association of copy number with complex disorders such as, diseases related to the immune system, cancer, neurological, psychiatric disorders including bipolar and schizophrenia, obesity and T2D (Conrad, Andrews et al. 2006, de Smith, Tsalenko et al. 2007, Fanciulli, Norsworthy et al. 2007, Wong, deLeeuw et al. 2007, Lee, Moon et al. 2014, Zhang, Li et al. 2015). Major interests of CNV research was focused on investigating large de novo CNVs with the hypothesis that such variant were more disease relevant. Large CNV regions could harbor large number of disease relevant variants (Lupski 2007, Stankiewicz and Lupski 2010, Coe, Witherspoon et

47

al. 2014). In addition, the contribution of rare CNVs to the risk of complex diseases have been studied lately, since multiple studies have shown that analysing rare CNVs are disease-associated (de Smith, Walters et al. 2008). While the role of rare variants in disease risk has established, studying rare CNVs might partially explain heritability of the disease (Manolio, Collins et al. 2009, Lettre 2014, Li, Gong et al. 2015).

CNVs have been primarily investigated using the bacterial artificial chromosome (BAC) arrays-based comparative genomic hybridization (CGH) to assess gross genetic abnormalities in genomes (Shizuya, Birren et al. 1992, Iafrate, Feuk et al. 2004). The BACs involved in these arrays were DNA fragments that had been cloned and transformed in bacteria, usually Escherichia coli. These large (>300kb) cloned DNA fragments could be maintained even after a series of growth over 100 generations. The cloning effectiveness and constant maintenance of cloned DNA allowed the construction of DNA libraries with full representation of complex genomic structure.

The CGH technique was developed and designed to identify genomic alterations including small chromosomal rearrangements at a large scale. The array CGH technique was modified from fluorescence in situ hybridization (FISH) technology. Basically, array CGH used fluorescently labelled probes that differentiated between test and reference genomes after competitive hybridization to metaphase (Kallioniemi, Kallioniemi et al. 1992, Iafrate, Feuk et al. 2004, Sebat, Lakshmi et al. 2004).

In 2004, two successful studies identified widespread copy number variants within the based on using the microarray-based CGH (Iafrate, Feuk et al. 2004, Sebat, Lakshmi et al. 2004). Different technologies have been implemented in these studies, one used bacterial artificial chromosome (BAC) array CGH, which identified 255 variant loci in 39 individuals (Iafrate, Feuk et al. 2004). While the other used Representational Oligonucleotide Microarray Analysis (ROMA) for random identification of 76 CNVs within 20 individuals (Sebat, Lakshmi et al. 2004). A catalogue for all CNV variants was established and called Database of Genomic Variants (DGV) (http://dgv.tcag.ca/dgv/app/home) (Iafrate, Feuk et al. 2004).

Following the usage of the above technologies, CNVs research had been developed and different experimental technologies had been implemented. A study was published in 2005 used a custom- designed array involved 2000 BAC probes that targeted 130 chromosomal aberrations within the human genome (Sharp, Locke et al. 2005). In this study, 47 individuals were examined and 160 CNVs were detected (Sharp, Locke et al. 2005). Another study adapted the idea and used a custom- designed array targeting Mendelian inheritance errors. This had identified 586 variants (Conrad,

48

Andrews et al. 2006). Additional studies had been conducted using a wide range of array technologies with high resolution (including 244,000 probes approximately) leding to the identification of more variants (de Smith, Tsalenko et al. 2007). Another progress of array technology was made by Conrad et al. In his study, they used 20 NimbleGen arrays including more than 2 million of array probes for accurate CNVs detection among 41 European samples (Conrad, Pinto et al. 2010).

Major advances in the analysis of structural variants had been made lately with the introduction of whole genome sequencing technology, which had been confirmed to be a powerful method (Xie and Tammi 2009). For example, The 1000 Genomes Project had conducted WGS for 185 individuals and about 20,000 novel structural variants were identified including 6,000 large CNV deletions. (Mills, Walter et al. 2011, Zarrei, MacDonald et al. 2015).

1. 10.1 Copy Number Variants: Mechanisms of formation

There are different mechanisms involved in CNVs formation and account for the majority of chromosomal rearrangements within the human genome. These mechanisms involved non-allelic homologous recombination (NAHR), non-homologous end-joining (NHEJ) and Fork Stalling and Template Switching (FoSTeS) (Stankiewicz and Lupski 2010).

Briefly, complex CNVs and regions with high crossover between multi-allelic DNA sequence repeats were more likely to mediate both duplications and deletions. Complex CNV regions were enriched with duplications that were not commonly seen within simple CNV regions. The complex variation of CNVs structure might reveal that CNVs could possibly had different ages within the human genome (Stankiewicz and Lupski 2002, Redon, Ishikawa et al. 2006). Conserved CNVs that were resulted from early mutational events might be formed by double-stranded DNA repair mechanisms as seen with non-homologous end-joining mechanism (McCarroll, Hadnott et al. 2006). This mechanism was used by eukaryotic cells to repair DNA breaks that are resulted from ionizing radiation, or reactive oxygen.

Another possible mechanism for genomic rearrangement could be explained by FoSTeS model (Stankiewicz and Lupski 2010). The advances in the array-based CGH provided the ability to identify complex genomic rearrangements. Basically, this model was based on DNA replication mechanism. Here, a switch between the original DNA templates with another DNA replication fork could restart new DNA synthesis (Lee, Carvalho et al. 2007, Zhang, Khajavi et al. 2009).

Recently, a number of tandem repeats (VNTR) were found to contribute in formation of CNVs mechanisms and accounts for 11.2% of validated CNVs (Conrad, Pinto et al. 2010). NAHR mechanism

49

was found to mediate large CNVs. Small CNVs were found to be mediated by VNTR and accounted for 3.5 times of forming small CNVs than NAHR. About 13.5% of CNV formation was mediated by NAHR and 11.2% by VNTRs (Conrad, Pinto et al. 2010).

CNVs could be inherited or might be de novo, which resulted in further complexity of understanding the mechanisms of CNVs formation. The lack of large-scale analysis of family trios had limited the understanding of the occurrence of de novo CNVs within the human genome (Den Dunnen, Grootscholten et al. 1989, van Ommen 2005). However, the occurrence of de novo CNVs had been estimated with than 3.3% (Wang, Li et al. 2007, McCarroll, Kuruvilla et al. 2008).

1. 10.2 Copy Number Variants prediction and Validation

Over the past few years, various techniques had been established in order to identify and validate CNVs in the human genome. This section provided brief description of these different available techniques focusing on the application of genotyping array techniques.

CNV detection methods could be divided into array-based techniques and sequencing- based methods (Carter 2007). Array-based methods allowed CNV analysis at genome-wide level; these methods included SNP genotyping array platform and aCGH. Sequencing-based methods utilizing from the advanced NGS technologies had increased the accuracy of detecting different genomic variations including CNVs (Wang, Nettleton et al. 2014). This method provided rapid, accurate and cost-effective technology for CNV screening compared to microarray technologies. However, the application of NGS was limited by the lack of effective statistical analysis methods.

SNP genotyping array technology provided a high resolution gene scan allowing the detection of SNPs and CNVs (Winchester, Yau et al. 2009). The application of genome-wide SNP genotyping technology yielded a large number of CNVs that were associated with risk of complex disorders. The main advantage of using this method was the feasibility of analyzing both SNPs and CNVs. In addition, this method required low amount of DNA samples compared to other methods, such as aCGH. Also, the method was cost-effective and more samples could be tested at low budget. SNP array technique allowed targeted detection of genomic variations –including CNVs- at genome wide level; using this technique for targeted CNV detection was cheaper compared to other high throughput technologies such as NGS (Winchester, Yau et al. 2009). The most common SNP genotyping arrays used for CNVs detection were generated by two manufacturers, Illumina Inc. (Dan Diego, USA) and Affymetrix Inc. (Santa Clara, USA). The assays used in both techniques detected CNVs at high coverage (Winchester, Yau et al. 2009). 50

Array CGH was a high resolution technique that was commonly used to compare copy number status of the sample with a reference DNA sample using fluorescence in situ hybridization (FISH) (Carter 2007, Winchester, Yau et al. 2009). This method was first established to investigate genomic variations in tumors. The key advantage of aCGH was the capability of detecting DNA copy alterations –including deletion, duplication or amplification- at several genomic regions. The limitation of this method was that the array resolution was restricted by the size of targeted region (Carter 2007).

CNV prediction algorithms had been developed over the past to enable CNV calling using the signal intensity data produced by SNP array and/or aCGH (Coin, Asher et al. 2010). Also, most of the SNP genotyping arrays used for CNV detection were generated using two platforms Illumina Inc. (Dan Diego, USA) and Affymetrix Inc. (Santa Clara, USA). High density oligonucleotide arrays for SNP genotyping were implemented in these two high throughput platforms. Signal intensity data generated by both illumina and Affymetrix platforms were used to calculate Log R Ratio (LRR) – a measure of total intensity signals from all probes for each SNP- and B allele Frequency (BAF) - the proportion of intensity signals between two probes for each SNP- to call SNP alleles used for CNV calling. Most of the existing calling algorithms such as, PennCNV, cnvHAP and QuantiSNP were based on the hidden Markov model (HMM) which was based on using a one-dimensional longitudinal information from each individual (Colella, Yau et al. 2007, Wang, Li et al. 2007, Coin, Asher et al. 2010). Other algorithms were based on different method, such method used the signal intensities from all individuals focusing on a subset of probes each time (Coin, Asher et al. 2010).

Methods of the CNVs experimental validation were designed to measure CNVs, such as multiple ligation-dependent probe amplification (MLPA). This method was high throughput and commonly used to target up to 50 DNA sequences (Eijk-Van Os and Schouten 2011). It was a multiplex polymerase chain reaction (PCR)-based method that allowed copy number measurement through targeted copy number genotyping at multiple loci using a single pair of primers (Schouten, McElgunn et al. 2002). Each MLPA probe contained two oligonucleotides, one was recognized by forward primer and the other by reverse primer. When both were hybridized near the target DNA, ligation would occur successfully. Then, ligated probes would be amplified (Eijk-Van Os and Schouten 2011). Another non-array based measurement of copy numbers was quantitative fluorescent PCR (qPCR) or real-time PCR.

This method depended on the idea that through the exponential step of PCR amplification, the amount of produced PCR was directly related to the quantity of the target DNA used in the initial

51

sample. The use of equivalent quantities of starting DNA, variations in copy number at regions of interest could be comparable between samples (Mackay, Arden et al. 2002).

1.10.3 Role of Copy Number Variants in T2D Development

Multiple studies have focused on identifying CNV associations with T2D, for example a study was conducted by Shtir et al. to study the CNVs associations with T2D among 194 Caucasian using two CNV calling algorithms (Shtir, Pique-Regi et al. 2009). In this study, they were successful to report little evidences of association; however the two methods indicated low level of agreement and further analysis was required (Shtir, Pique-Regi et al. 2009). Another study published by Craddock et al. conducted a large genome-wide study to assess CNV associations with eight common diseases using the advances of array CGH to type 19,000 individuals (Craddock, Hurles et al. 2010). He reported a potential association of common CNV within TSPAN8 gene with T2D and suggested that common CNVs were unlikely to have a significant role in developing T2D (Craddock, Hurles et al. 2010). A study was conducted among Korean population included 1,204 individuals from the Korean Genome and Epidemiology Study (KoGES) to detect CNV associations with T2D in the leptin receptor gene (Jeon, Shim et al. 2010, Jeon, Shim et al. 2010). Genome-wide genotyping (using Affymetrix 50K SNP chip) data was extracted from candidate T2D and obesity genes in order to identify CNVs within or near these candidate genes. CNVs around the leptin receptor gene (LEPR) were detected and found to be associated with T2D risk and other metabolic traits, such as high fasting glucose and total cholesterol levels (Jeon, Shim et al. 2010, Jeon, Shim et al. 2010). Bae et al identified three potential T2D risk CNVs in a Korean population (Bae, Cheong et al. 2011). In this study, a total of 275 unrelated T2D cases and 496 unrelated controls were genotyped using Illumina HumanHap300 platform and PennCNV algorithm was used for CNV calling. Three potential associated CNVs were detected, however the underlying mechanisms of these associations were unknown and required further investigations (Bae, Cheong et al. 2011). Recently, a case-control study was carried out among 216 Thai individuals with diabetes and 192 individuals with no diabetes aiming to identify CNVs within CAPN10 SNP-44 region (Plengvidhya, Chanprasert et al. 2015); this had been reported previously to be associated with T2D (Arslan, Acik et al. 2014). In this study, CNVs were detected within CAPN10 SNP-44 causing deviations from Hardy–Weinberg equilibrium (HWE). This suggested that the distribution of certain genotype showing deviations from HWE might be due to the existence of CNVs. However, no association was detected between the identified CNVs and T2D (Plengvidhya, Chanprasert et al. 2015).

52

Recent reports showed evidences of CNVs associations with T2D, but intensive investigations are still required to understand the underlying mechanisms of CNVs in T2D risk.

1.11 Epigenetic of Type 2 Diabetes

Risk of developing T2D could be due to several environmental and genetic predisposing factors. Several studies revealed an association between epigenetic mechanisms and risk of developing different diseases, such as T2D (Maier and Olek 2002, Kirchner, Osler et al. 2013). Epigenetic alterations could be described as changes that did not modify the DNA sequence itself, but could be due to biological events such as DNA methylation, histone modification and changes of chromatin structure (Riggs 1975, Jaenisch and Bird 2003). These epigenetic events are believed to elucidate the association between environmental changes and phenotypes, and also could explain the hidden heritability of complex disorders (Manolio, Collins et al. 2009). Early studies on epigenetic characteristics of T2D have been mainly focused on investigating the methylation status of specific set of CpG sites within known T2D candidate genes. Relatively few studies on DNA methylation candidate gene to characterize the epigenetic basis of T2D risk have been published (Ling, Del Guerra et al. 2008, Liu, Chen et al. 2012, Yang, Dayeh et al. 2012). This could be partially explained by the underpowered methylation candidate gene studies that were difficult to be replicated (Drong, Lindgren et al. 2012). The feasibility of conducting epigenetic studies have been improved with the introduction of genome-wide Illumina HumanMethylation450 BeadChip that targetd more than 450,000 CpG sites (Bibikova, Le et al. 2009). However, only few studies have been published lately using the advances of EWAS technology (Barres, Osler et al. 2009, Bell, Finer et al. 2010, Toperoff, Aran et al. 2012, Volkmar, Dedeurwaerder et al. 2012, Dayeh, Volkov et al. 2014, Chambers, Loh et al. 2015). Implementing EWAS was still challenging; this could be due to technical issues, access to tissue samples and the high bioinformatics skills needed (Rakyan, Down et al. 2011). Studies used this technology to characterize epigenetic risk of T2D will be discussed later in section 1.11.4.

1.11.1 Epigenetic Mechanisms Different biological events involved in epigenetic changes were DNA methylation and histone modifications. Another mechanism of epigenetic control included gene transcription and translation by small non-coding RNAs. However, multiple environmental factors were involved in changes of gene expression that represented changes in phenotype such as, diet, aging and hormonal changes during pregnancy (Feil and Fraga 2011, Kirchner, Osler et al. 2013).

53

1.11.1.1 DNA methylation DNA methylation could be referred to genetic changes affecting gene activity or gene functions (Holliday and Pugh 1975, Riggs 1975). The only nucleic acid that could be modified and methylated was cytosine at the 5-carbon position of a DNA (Lister, Pelizzola et al. 2009). It had been shown that at a single-base resolution maps of methylation changes and under normal physiological circumstances, around 5% of almost all cytosines were methylated (Suzuki and Bird 2008, Lister, Pelizzola et al. 2009). Though, variable patterns of DNA methylation could be observed among vertebrates. Primarily, DNA methylation of cytosine nucleic acid occurs exclusively on CpG nucleotides and methylated cytosine was usually followed by guanine. However, non-CpG methylation that was found in the context of CpA, CpT and CpC (i.e., methylated cytosine followed by the other DNA bases adenine, thymine or cytosine) had been detected in plants, animals and humans. (Meyer, Niedenhof et al. 1994, Barres, Osler et al. 2009, Yan, Zierath et al. 2011). DNA methylation was involved primarily in gene silencing to regulate gene expression. Accordingly, chromosomal regions including telomeres, centromeres and X-chromosomes showed high level of DNA methylation (Riggs 1975, Kirchner, Osler et al. 2013). Also, DNA methylation was an essential step during normal cell development and was linked to different processes including genomic imprinting, X-chromosome inactivation and cancer development (Li, Bestor et al. 1992, Okano, Bell et al. 1999).

It had been hypothesized that there were different roles of DNA methylation depending whether methylation occurred within the promoter region or within the gene body. It was thought that DNA methylation in promoter region silenced gene activity, while intragenic methylation enhanced the gene activity. However, these observations were not well understood yet (Rideout, Coetzee et al. 1990, Nguyen, Gonzales et al. 2001).

Methylation usually occured at the DNA base cytosine and two enzymatic activities were thought to be involved, maintenance methylation and de novo methylation. DNA methyltransferase (DNMT) enzymes including DNMT1, DNMT3a and DNMT3b were involved in these activities (Phillips 2008). The de novo DNMTs (DNMT3a and DNMT3b) were responsible for the initial pattern of DNA methylation during early process of cellular development. The role of de novo DNMTs was not well defined yet, but studies had reported that these enzymes could be involved in chromatin- remodeling complexes (Bourc'his, Xu et al. 2001, Ooi, Qiu et al. 2007, Phillips 2008). The role of maintenance enzyme DNMT1 was to replicate the methylation patterns from an existing DNA strand

54

to the daughter strand to ensure the maintenance of DNA methylation during cellular cycles (Phillips 2008).

Preservation of DNA methylation over generations had been widely investigated. Epigenetic modulations could be driven by the environment and the status of the parent when the methylation association was transmitted to the fetus. This idea could be termed as intrauterine programing, which was involved in the development of several disorders such as diabetes (Lucas 1994, Sasaki and Matsui 2008, Kirchner, Osler et al. 2013).

1.11.1.2 Histone Modifications Histones are protein elements of chromatin with projecting N-terminal tails and have an important impact on regulating gene expression process. Histones are found in eukaryotes and involved in DNA packaging into nucleosomes (a nucleosome is made up of 146 bp of DNA sequence around histone constructing the bulk of chromatin fibers). There were five classes of histone proteins, which were H1, H2A, H2B, H3 and H4 (Cedar and Bergman 2009, Djupedal and Ekwall 2009, Zhang, Wen et al. 2012, Kirchner, Osler et al. 2013). There were two known types of chromatin fibers, the accessible that is referred to DNA coding genes involved in early replication and is usually linked to RNA polymerases (euchromatin). While the inaccessible highly condensed chromatin is referred to heterochromatin included non-coding repeat sequences. Histone proteins were subject to different modifications at N-terminal tails including acetylation, methylation, phosphorylation, ADP- ribosylation and ubiquitination (Li, Carey et al. 2007, Djupedal and Ekwall 2009, Zhang, Wen et al. 2012). These chemical modifications had an extensive effect on gene expression and chromatin structure; these effects were dependent on the location of histone modification. Various amino acids were affected by these chemical modification, but the most studied modification was at lysine amino acid (Koch, Andrews et al. 2007, Zhang, Wen et al. 2012). Methylation at lysine residues is linked to activation or inactivation of gene expression depending on the methylated residues. Additionally, methylated patterns of lysine residues can occur in a mono-, di-, or tri-methylation. For example, methylation changes on class histone H3 lysine 4 (H3K4), H3K36, and H3K79 is correlated to gene expression activation, while gene silencing is usually related to di- and tri-methylation on H3K9, H3K27, and H4K20. Histone lysine methylation is known to be regulated by approximately 80 enzymes (Zhang, Wen et al. 2012). These enzymes included lysine methyltransferases (KMTs) family (Shi, Lan et al. 2004, Qian and Zhou 2006).

55

1.11.1.3 RNA interference RNA interference (RNAi) is a biological process that is used in multiple organisms where double- stranded RNA molecules could intervene with gene silencing and gene expression. This could occur by causing the destruction of messenger RNA (mRNA) in the cytoplasm, which could prevent protein production (Montgomery, Xu et al. 1998). RNAi provided the technical knowledge to investigate gene functions through knockdown of concordant mRNA (Jackson and Standart 2007). Recent experiments conducted in histones of yeast provided an improved understanding of heterochromatin function and mechanisms of heterochromatin formation (Levenson and Sweatt 2005, Djupedal and Ekwall 2009). RNAi pathway interruption resulted in moderation of heterochromatin function that caused a reduction in methylation of histone3, and inaccurate expression of normally silent transcripts (Hall, Shankaranarayana et al. 2002, Volpe, Kidner et al. 2002). The mechanism of heterochromatin formation is directly promoted by a protein complex that is associated with small RNAs, which are non-coding RNA molecules (Verdel, Jia et al. 2004). There were some examples that explained the regulation process of gene expression in eukaryotes through RNAi; these examples included X-chromosome inactivation (Brown and Chow 2003, Chow and Brown 2003), generation of circadian rhythmicity (Sauman and Reppert 1996, Crosthwaite 2004) and microRNAs (miRNA)(Ambros 2004). MicroRNAs were small non-coding RNA molecules consisting of approximately 22 nucleotides, miRNA could be found in animals, plants and some viruses (Ambros 2004).

1.11.1.4 Environmental factors Epigenetic modifications could be influenced by environmental factors, allowing them to play a role in the pathogenesis of T2D (Ling and Groop 2009, Feil and Fraga 2012). Investigations on environmental epigenetics have been mainly focused on DNA methylation changes due to its importance for development (Feil and Fraga 2012). Insights into mechanisms of DNA methylation and chromatin modifications associated with environmentally triggered phenotypes are not well understood yet, particularly in mammals (Jirtle and Skinner 2007). The exposure to various external factors including diet, pregnancy and aging could induce epigenetic alterations and subsequently health-related issues (Feil and Fraga 2012). Examples for the effect of these factors are discussed here. Briefly, DNA methylation pattern has been proved to be age-related (Fraga and Esteller 2007, Bollati, Schwartz et al. 2009, Zaghlool, Al-Shafai et al. 2015). Multiple studies introduced the relation between aging and methylation patterns. Recently, a replication study using Qatari samples confirmed such an association (Zaghlool, Al-Shafai et al. 2015). In this study, associations of known methylated sites with age identified in Caucasian were replicated among Qataris. This replication

56

was achieved regardless of ethnic background and environmental exposure (Zaghlool, Al-Shafai et al. 2015). In addition, aging has been linked to T2D risk, as both aging and T2D showed a reduction in oxidative capacity and mitochondrial function (Chinnusamy and Zhu 2009, Sánchez 2010, Feil and Fraga 2012). The underlying mechanisms of these defects could be driven by genetic and environmental factors (Feil and Fraga 2012). Also, epigenetic alterations could affect important genes in mitochondrial respiratory chain during the course of life, such as COX7A1 (Feil and Fraga 2012). This gene was a protein-coding gene involved in complex 4 of the respiratory chain, showed a reduction of expression levels in muscles of diabetic patients, associated to changes in DNA methylation during aging (Feil and Fraga 2012). DNA methylation patterns at the promoter region of COX7A1 was higher in skeletal muscle of old twins compared to young ones, while COX7A1 gene expression showed opposite pattern of DNA methylation (Feil and Fraga 2012). Furthermore, increased in vivo glucose uptake in skeletal muscle was linked to levels of COX7A1 (Feil and Fraga 2012). These observations proved a role of age in modulating DNA methylation, gene expression, and subsequently in vivo metabolism.

Few studies have shown that nutritional components contributing to epigenetic changes in humans could have long term effects on different phenotypes (Feil and Fraga 2012). A good example of is shown in the agouti mouse; the agouti gene encoded a paracrine-signaling molecule, which stimulated the production of yellow pigments from melanocytes. Melanocytes produced yellow coat pigment instead of black and made mice susceptible to some disease development, such as diabetes and obesity (Baccarelli, Wright et al. 2009, Steegers-Theunissen, Obermann-Borst et al. 2009, Hoyo, Murtha et al. 2011). The level of agouti gene methylation modulated its expression and thus coat color and disease risk. Moreover, introduction of dietary supplements, such as folic acid, vitamin B12, choline, or betadine to pregnant mice raised the methylation level of agouti gene in the offspring. This resulted in reduced gene expression and a brown coat color (Bollati, Baccarelli et al. 2007). Another example was that intrauterine growth retardation (IUGR) that was associated with Pdx1 epigenetic silencing. Pdx1 was involved in the development of pancreatic cells and β-cell differentiation- impairment of β-cell function and development of T2D in adult offspring (Christensen, Houseman et al. 2009, Daxinger and Whitelaw 2010). Another evidence of environmental role in epigenetic has been described by Gorden et al in a study published in 2012. This study discussed the implementation genome-wide methylation assay using cord blood mononuclear cells (CBMCs), vascular endothelial cells (HUVECs) and placenta to study methylation of neonatal twins. The study reported a low heritability rate and methylation changes were highly influenced by intra-uterine environment (Gordon, Joo et al. 2012).

57

The above examples brought the attention to the role of the extrinsic factors on epigenetic modifications to investigate the effects of these factors on epigenetics.

1. 11.2 Heritability Estimates of DNA methylation

Studies have reported differences in DNA methylation patterns between related individuals; these changes have driven the need to study the normal patterns of DNA methylation and understand the basis of these genetic differences (Flanagan, Popendikyte et al. 2006, Bock, Walter et al. 2008, Kaminsky, Tang et al. 2009, Schneider, Pliushch et al. 2010). Variations in DNA methylation patterns could be influenced by multiple factors including genetic and environmental factors, and heritable changes involved in epigenetic mechanisms (Bell and Spector 2011). Different studies have shown evidences of heritability and DNA methylation variability among twins. A study was conducted by Gervin et al. among 49 MZ twins and 40 DZ twins to examine heritability of methylation in the major histocompatibility complex (MHC) region have shown an overall low heritability rate among the different pairs. The study concluded that the genetic influence on DNA methylation at specific CpG site was relatively small, and large variabilities in DNA methylation were likely to be driven by non-genetic factors (Gervin, Hammerø et al. 2011). Generally, studies found a moderately low heritability estimates of DNA methylation across the genome (Boks, Derks et al. 2009, Kaminsky, Tang et al. 2009). Recent figures of DNA methylation heritability rate at CpG-site- specific reported an estimate of 12-18% in blood (Bell, Tsai et al. 2012, Gordon, Joo et al. 2012), 5% in placenta (Gordon, Joo et al. 2012), and 7% in HUVECs (Gordon, Joo et al. 2012). Available data on methylation heritability rate were mostly resulted from investigating a subset of CpG sites (Boks, Derks et al. 2009, Kaminsky, Tang et al. 2009). However, implementations of high resolution genome-wide methylation studies measured methylation profiles at each CpG to reflect the amount of methylation and to provide better heritability estimates (Bell and Spector 2012).

1. 11.3. Role of Epigenetics in Developing T2D

The potential involvement of epigenetic changes in T2D risk has been supported by different evidences. First, the fetal origins hypothesis has established the basis of ‘metabolic programing’, which suggestd that the exposure to different nutritional elements during the pre-natal and neonatal periods has long-term effects during adult life (Hales and Barker 1992). Fetal origins hypothesis is supported by strong epidemiological data connecting early life events to health

58

changes and disease susceptibility. For example the study of Dutch hunger winter dry season have reported a link between the exposure to famine and DNA methylation alterations. The study reported that the exposure to famine in the peri-conceptional period has caused metabolic and mental disorders in the next generation (Heijmans, Tobi et al. 2008). A second observation came from the study of intra-uterine effects on predisposition to phenotype differences (Pettitt, Aleck et al. 1988, Hales and Barker 1992). A follow-up study was conducted on Pima Indians (the Native Americans) with high prevalence of T2D showed a substantially higher risk of developing T2D among offspring born to mothers with gestational diabetes (45%) than offspring born to mothers with no gestational diabetes (1.4%) (Pettitt, Baird et al. 1983, Pettitt, Aleck et al. 1988). The variations in disease risk couldn’t be completely explained by genetic transmission, since the differentiation between disease susceptibility among siblings born to the same mother is preserved. This meant that offspring born to a mother with gestational diabetes had higher risk than their siblings who were born to the same mother while not having gestational diabetes (Dabelea, Hanson et al. 2000).

Also, experimental studies on animal models suggested a causal relation between methylation and different nutritional components leading to T2D (Seki, Williams et al. 2012).

In summary, there were multiple studies with scientific evidence supporting the role of epigenetic alterations to the pathogenesis of T2D and familial aggregation. However, experimental studies to specifically determine the disease susceptibility loci were still needed (Drong, Lindgren et al. 2012).

1. 11.4 Characterization of T2D Epigenetic Risk

As discussed earlier, few studies have been published the characterization of epigenetic contribution to T2D risk. Most of these studies are based on investigating the methylation status of specific set of CpG sites within candidateT2D candidate genes (Drong, Lindgren et al. 2012). There was a small number of EWASs that have been conducted on T2D up to date (Barrès, Osler et al. 2009, Bell, Finer et al. 2010) (Toperoff, Aran et al. 2012, Volkmar, Dedeurwaerder et al. 2012, Dayeh, Volkov et al. 2014, Chambers, Loh et al. 2015).

A large EWAS was performed among approximately 1,150 T2D cases and controls using peripheral blood samples. This EWAS showed that differentially methylated sites were highly detected within previously reported T2D genes identified from genetic studies (Toperoff, Aran et al. 2012). An interesting region within BMI-associated gene (FTO) was identified in this study (Toperoff, Aran et al. 2012). This region showed high level of methylation in the first intron of FTO that was linked to a genotype close to BMI associated SNPs. Multiple studies had replicated the associations between

59

FTO risk allele and hyper-methylation (Bell, Finer et al. 2010, Almén, Jacobsson et al. 2012, Toperoff, Aran et al. 2012). However, methylation changes couldn’t clearly explain the causal link between T2D and FTO risk allele (Toperoff, Aran et al. 2012).

Recently, another large EWAS was carried out to study the methylation patterns associated with T2D among 1,074 UK Indian Asians compared to 1,141 Europeans using peripheral blood (Chambers, Loh et al. 2015). This study reported increased T2D risk among Indian Asians independent of other risk factors, such as physical activity, family history of T2D or glycemic measures compared to Europeans. Methylation patterns were detected at five genomic loci found to be linked to T2D risk and may explain the increased risk of T2D among Indian Asians. These five loci showed strong associations with differential methylation levels and T2D risk; the identified associations were within ABCG1, PHOSPHO1, SOCS3, SREBF1 and TXNIP. The underlying mechanism of epigenetic changes and T2D risk within these changes are still unknown. However, abnormal methylation patterns of TXNIP could be an early indication of glucose hemostasis impairment since expression of TXNIP was highly correlated with glucose concentration (Chambers, Loh et al. 2015, Lehne, Drong et al. 2015). In addition, this study showed that methylated regions within ABCG1, PHOSPHO1, SOCS3 and SREBF1 were closely linked to BMI and insulin level. Methylation measures at these loci could act as biomarkers of adiposity and insulin resistance (Chambers, Loh et al. 2015).

These recently published studies identified several T2D genes and discussed the importance of DNA methylation T2D risk. Moreover, further studies are still needed to study the role of DNA methylation on gene expression (Riggs 1975, Petersen, Zeilinger et al. 2014).

1.12 Genetic of Type 2 Diabetes in Qatar

It has been shown that T2D is growing rapidly in Qatar and the genetic basis of T2D risk among Qataris has not been established yet. Therefore, to better understand the genetic basis of T2D, extensive research using native Qatari samples is required. From the existing data, few studies have been conducted to outline the genetic and environmental factors contributing to the future risk of T2D. A study was conducted in 2005 to identify the association between T2D and risk factors of T2D risk. These factors included consanguineous marriages, obesity, smoking, physical activity and T2D related conditions (including high blood pressure, other risk factors such as smoking) (Bener, Zirie et al. 2005). In this study, a survey was carried out at the primary health care centers among 338 individuals with diabetes and 338 individuals with no diabetes. The survey included details about the participants, such as age, gender, anthropometric measures, smoking and lifestyle. In addition,

60

health condition was assessed through physical examination and laboratory investigations including blood glucose, full blood count and lipid profile. The study revealed that diabetes was more frequent among individuals who had other health issues, such as hypertension and had sedentary life style. Another study was conducted in 2006 by Dr. Ramin Badii in collaboration with Prof. Philippe Froguel to conduct the first study on the association between peroxisome proliferators-activated receptor γ (PPAR-γ2) gene and T2D among Qataris. The Pro12Ala polymorphism with the PPAR-γ2 has been reported in literature as one of the most common genetic variants identified for T2D in multiple ethnic groups. The allele frequency of this variant was varying among different ethnicities. It has been observed in the Caucasians from different countries that the frequency of Pro12Ala polymorphism of the PPAR-γ2 gene was higher compared to Asians (Beamer, Yen et al. 1998, Frederiksen, Brodbaek et al. 2002, Rosmond, Chagnon et al. 2003). The study was carried out among the Qatari population; 400 diabetic cases and 450 controls aged 35-60 years with a BMI range between 25 up to 35 kg/m² were recruited through primary health care centers. The study investigated the association of the Pro12Ala polymorphism of PPAR-γ2 with T2D among Qataris (Badii, Bener et al. 2008). That study reported absence of association of the Pro12Ala polymorphism with T2D, and low frequency of the Pro12Ala variant among both diabetics (5.5%) and controls (5.9%) compared to other ethnic groups (Badii, Bener et al. 2008).

61

1. 13 Aims and Objectives

1.13.1 Overall Aim

The majority of genetic determinants of T2D came from the study of common variants that were expected to explain T2D risk. However, missing heritability of T2D could be partially explained by rare variants, epigenetic changes and other genetic components. Rare genetic variations to better understand the genetic involvement of T2D risk are not fully identified. Such variants are expected to be enriched in inbred population. The project proposed here will use the state-of-the-art technologies to seek novel of known genetic sources of T2D among Qataris. Qatari families with increased high of T2D individuals will be investigated. During my PhD I applied the state-of-the-art technologies (genome-wide genotyping arrays, next generation sequencing and epigenome-wide methylation array) to seek new rare causative loci and DNA methylation changes contributing to T2D risk.

1.13.2 Objectives

1- To identify known or novel rare causative loci by investigating consanguineous Qatari families with high prevalence of T2D. The investigation of consanguineous families could facilitate the discovery of rare genetic variations due high inter-relatedness and low genetic heterogeneity.

2- To identify potential large rare copy number variants related to T2D among Qataris. Since the study population is inbred with high consanguinity rate, large de novo CNVs are expected to be found. Such CNVs usually overlap with disease relevant genes.

3- To identify differentially methylated loci associated with candidate T2D genes. The study power is limited; therefore replicating known methylation associations is the main objective.

62

CHAPTER 2

Material and Methods

63

1 Material and Methods

2.1 Ethical Approval

I obtained the ethical approval from the Institutional Review Board of Weill Cornell Medical College in Qatar in concordance with the Helsinki declaration of ethical principles for medical research involving human subjects. Ethical approval number for this study is 2012-003 and 2012-0025.

2.2 Study Subjects

2.2.1 Qatari Families with Early Onset of Type 2 Diabetes

Study subjects were identified and collected from the diabetic clinics at Qatar Diabetes Association (QDA), a member of Qatar Foundation for Education, Science and Community Development (QF). QDA is a secondary health care center, provides health services to diabetic patients and to those at risk with a mission of raising the awareness of healthy lifestyle to prevent and manage diabetes.

Diabetic probands were selected carefully from the diabetes clinics at QDA during their regular visits. I personally attended all the probands follow up visits and identified the subjects with the assistance of an endocrinologist. Clinical history was collected from all probands and family visits were scheduled for blood draw.

A team consisted of a clinician, a nurse and I visited all the families that were identified for the study at their homes. Family members selected for the study provided written consent forms prior to blood extraction. Also, anthropometric measures (including height in centimeter and weight in kilograms), phenotypic data (including age, gender, body mass index, disease onset and smoking) and lifestyle questionnaires were completed and collected from all study participants. Additionally, data on HbA1c was available for all individuals with history of diabetes. Other quantitative traits related to T2D, such as fasting blood sugar and insulin level were not available.

Eight Qatari multi-generational families with a total number of 40 subjects (including 27 cases and 13 controls) were recruited for the study. The aim was to recruit families with at least two consecutive generations and preferably with a history of first or second degree consanguinity. However, one family was consisted of siblings only due to missing information from parents and other family member. Each family had at least three diabetic individuals and at least two controls with a normal

64

Body Mass Index (BMI). Pedigrees were generated using Haplopainter (http://haplopainter.sourceforge.net/).

Following a standardized protocol, peripheral blood samples were collected by venipuncture from all cases and controls (blood extraction was performed by all the recruiting team, including myself) using butterfly needles and were collected in EDTA anticoagulant tubes and PAXgene tubes (PAXgene Blood RNA tubes) for potential future studies. Blood samples were transferred to WCMC- Q clinical research labs for DNA extraction and blood storage (all samples were stored at -80°C).

The set of Qatari families collected here have been analyzed to detect rare genetic variants, copy number variants associated with T2D and differentially methylated loci associated with T2D.

Details of subjects’ phenotypes are found in table 2.1.

2.2.2 Qatari Families with Type 2 Diabetes and Obesity

In collaboration with my colleague Mashael Al Shafai, another set of eight Qatari families with history of obesity was added to the Qatari diabetes families. Samples were selected from obesity clinics at Qatar Diabetes Association (QDA), following the same procedure explained above. A total of 16 Qatari families including 123 individuals were available. All participants provided written consent forms. Anthropometric measurements were collected by a trained nurse, and questionnaires about general health and life style were completed by all participants. Each of obesity family has at least three obese subjects (with a BMI ≥ 30) and two normal weight controls. Whole blood was extracted from all participants and collected in EDTA tubes.

Obese families were added to the diabetes families to increase the power of the analysis. These families were used to perform methylation association with T2D only. Methylation association with obesity was part of Mashael Al shafai PhD projects; and results of obesity methylation association were discussed in her thesis. Details of subjects’ phenotypes are found in table 2.1.

65

Table 2.1 Description of all samples used in this PhD project. The table includes the number of families, number of cases and controls. Mean and standard error of the mean (SEM) for the age, height, weight, bmi and HbA1c are also included. SEM was calculated in excel by dividing the standard deviation of the mean by the square root of n. The (*) indicates the samples that are collected by my colleague Mashael Al Shafai that she used for her PhD project. Obesity families have been combined with the diabetes families when running the methylation association test only. Three samples from the obesity families were diabetics.

Phenotype Number Cases Control Mean±SEM Mean±SEM of Mean±SEM of Mean±SEM of Mean±SEM of Available data of Age Height (cm) Weight (kg) BMI ((kg/m2) HbA1c Of Families T2D 8 30 15 47.1±5.5 168.4±4.7 64±4.1 26.6±6.1 5.6±2.41 Genotyping,

Sequencing and *Obesity 8 49 29 36.5±15.7 165.2±6.4 75±2.1 32.2±7.7 Not applicable methylation

66

67

Figure2.1. The pedigrees above represent eight Qatari families with T2D. The blue symbols show diabetic cases, the grey symbols represent the controls and the black is unknown

68

2.2.3 Replication Samples

A replication cohort consisting of 108 Qatari subjects was used to determine population-specific statistics on genetic variants. Available whole genome sequencing data for 108 subjects including diabetics and non-diabetics individuals were used to replicate results from mapping rare variants and copy number variants.

2.3 DNA extraction

DNA was extracted from fresh whole blood samples (2 mL was used for extraction) using QIAamp blood midi Kit (Qiagen, spin protocol) catalogue number (51183). DNA was extracted following the steps described by manufacturer protocol for midi kit; DNA was extracted using the recommended proteinase K provided in the kit. The process of DNA extraction included several steps of ethanol DNA washing, hydration and precipitation. Isolated DNA was quantified with NanoDrop (NanoDrop, Thermo Fischer Scientific Inc.) and Qubit 2.0 fluorometer (Invitrogen) using Qubit dsDNA Broad Range Assay Kit (catalogue numbers Q32850, Q32853) to ensure that all DNAs were of high quality.

See Appendix A1 for details of DNA concentration and quality.

2.4 Sample Preparation

The bioinformatics core at WCMC-Q had service contracts with Illumina Fast Track Services to carry out whole genome genotyping array, whole genome sequencing and whole genome methylation array. Taking the advantage of these contracts, all samples were shipped to Illumina to perform genotyping, sequencing and methylation profiling. According to Illumina requirements, DNA was prepared with a minimum concentration of 30ng/ul (total 3 ug) with a total volume of 100ul for genotyping and sequencing. For methylation arrays the minimum required concentration was 50ng/ul (total 5 ug) with a fixed volume of 40ul. DNA quantitation was measured using Qubit fluorometer (as described above) and the required amount of DNA was prepared accordingly. DNA samples were loaded into 96 well midi plates (supplied by Illumina) sealed with silicon caps and shipped frozen to Illumina.

2.5 Whole Genome Genotyping Arrays

Whole genome genotyping was performed by Illumina Fast Track Genotyping Services Infinium assay based on a fee-for-service (described above). Genotyping was performed using HumanOmni2.5-8M BeadChip for all available family members (including cases and controls). Around 2.5 million markers were featured in Illumina HumanOmni2.5-8 arrays that provided an optimal and comprehensive set of markers including both common and rare variants from the 1000 Genome Project. Illumina bead chip system offered a high throughput genotyping tool including SNP tagging and support full copy number variation (CNV) applications (www.illumina.com).

The Illumina Infinium assay protocol used a low concentration of DNA sample (about 200ng) and left for an overnight amplification. Also, the steps of genotyping include DNA enzymatic fragmentation, DNA suspension, and alcohol precipitation, hybridization, annealing and fluorescently stained.

2.6 Prediction of Copy Number Variants associated with T2D

This section included the methods that were used to generate CNV calls using PennCNV detection algorithms. This section included also the strategy that was followed to identify candidate T2D causing CNVs.

Figure 2.1 shows a flowchart of the study strategy. Results from this analysis will be described later in Chapter Four.

70

Figure 2.2 Detection of Copy Number Variants associated with T2D using Illumina 2.5M SNP arrays. The above flow chart illustrates the strategy of predicting T2D causing CNVs among Qatari diabetes families.

71

2.6.1 Generation of B allele frequency and Log R ratio using signal intensity files

Generating signal intensity data including Log R ratio (LRR) and B-allele frequency (BAF) was crucial for subsequent CNV prediction. The LRR is calculated as the ratio between normalized intensity of the experimental sample to the expected intensity of a probe of normal samples. BAF is the normalized ratio of the quantity of the B-allele to the total quantity of both alleles.

GenomeStudio genotyping module was used to export Log R ratio and B-allele frequency using signal intensity files to a text file. Following the steps described in PennCNV manual http://www.openbioinformatics.org/penncnv/penncnv_input.html, Illumina SNP array files including raw signal intensity files, cluster files and the sample sheet were loaded into GenomeStudio before data export. Then, the exported data were selected from GenomeStudio options to include LRR, BAF and genotype data for each subject. Also, the final exported intensity file included marker name, chromosome number and position. Signal intensity file should look like the figure 2.2 below,

Figure 2.3 shows a screenshot of signal intensities in GenomeStudio

72

The exported signal intensity file, containing the LRR and BAF information was then split into individual signal intensity files using the split option of the kcolumn.pl provided in PennCNV. The split LRR and BAF files for each individual contained the marker name, chromosome, position, individual identifier, B-allele frequency and log R ratio.

2.6.2 PennCNV algorithm for CNV prediction

PennCNV algorithm integrates a hidden Markov model (HMM), and is developed to predict CNVs from high-density Illumina SNP genotyping arrays (Wang, Li et al. 2007). Before running the PennCNV detection algorithm, population frequency of B-allele (PFB) for each marker is calculated from my dataset. PFB file is created from family members using compile_pfb.pl program in the PennCNV package to identify chromosome coordinates and PFB value for each marker. CNV prediction was run using the detect_cnv.pl script provided with the PennCNV package. PennCNV output file contained information about the chromosome region, the size of each CNV in base pair (bp) and the number of SNPs contained within each CNV. Also, the actual copy number estimates including deletions (CN=0 or 1) or duplications (CN=3 or 4) was provided in the CNV output file.

2.6.3 Filtration of Samples with low Quality Calls

In this step raw CNV calls generated by PennCNV were examined and CNV calls from low quality samples were excluded based on the LRR variance threshold. First, I used the PennCNV default threshold of 0.135, which was too stringent and multiple samples did not pass the default threshold. PennCNV, also recommended using a threshold of 0.24, 0.3 or even 0.35. The LRR variance was calculated with a threshold of 0.24; the range of LRR variance in this study was 0.12 to 0.3, so one sample was excluded from the analysis. The LRR variance of each sample was plotted and shown in the figure 2.3 below:

73

LRR Variance 0.3

0.25 0.2

0.15

0.1

0.05 LRR Variance perSample 0 0 10 20 30 40 50 Samples

Figure 2.4: The plot represents the calculated LRR variance per sample, illustrating the sample LRR variance range which is 0.12 to 0.3; one sample exceeded the threshold used (0.24) and was excluded from further analysis.

2.6.4 Merging adjacent CNV calls

PennCNV tends to split large CNVs, for example large CNVs of more than 500kb will be divided into smaller parts of two or three 150kb CNV calls. PennCNV package included the perl script clean_cnv.pl, which is used to merge adjacent small CNVs. Large CNV calls of ≥500kb are included in the analysis.

2.6.5 Filtering Large and Rare CNVs

Large CNV calls were filtered in this step based on the number of overlapping SNPs using the program filter_cnv.pl, developed by PennCNV. Large CNVs that overlap with <20 SNPs were excluded from the analysis to avoid the possibility of including false positive CNVs.

74

2.6.6 Exclude false CNV calls

A list of large and dense CNV calls was examined carefully to exclude CNVs at centromeric and telomeric regions. It is believed that centromeric and telomeric regions tend to include false CNV calls. This might be partially due to the low coverage of these regions using the available genotyping arrays (Glessner, Li et al. 2013).

CNV coordinates were examined and visualized using UCSC genome browser, Build hg 19, to detect CNV calls at centromeric and telomeric regions. All CNVs identified within centromeric and telomeric regions were excluded and removed from the analysis.

2.7 Assessment of Candidate CNVs

2.7.1 Identify rare CNVs

The analysis strategy was mainly focused on searching for rare T2D causing CNVs. This was done by further analysis of large CNVs that are found to be rare and highly penetrant, and more disease relevant. These large CNVs could be either homozygous or large heterozygous deletions/duplications.

Following the study criteria a list of candidate CNVs was generated. Potential CNVs were annotated to identify overlapping genes using the perl script scan_region.pl. Annotated CNVs were later examined to identify CNVs that overlap with interesting T2D genes; those CNVs were classified into two groups:

 CNVs shared between samples and overlapping with genes  CNVs that are not shared and overlap with genes.

2.7.2 Visualization of candidate CNVs Candidate CNV calls were visually examined by plotting the actual signal intensities (LRR and BAF) across the CNV region. This would allow confidence on the identified CNVs and exclude false-positive CNV prediction.

75

2. 7.3 Replicate CNVs in other dataset A list of large candidate CNVs were compared to other available dataset to check for potential replication of interesting CNVs. CNV calls from 108 Qatari genomes were examined carefully to identify potential overlap between my study CNVs and the CNVs predicted in the 108 Qatari genomes.

2.8 Identification of Rare T2D Variants

This part will include the strategy and the details about the methods used to map rare genetic variants associated with type 2 diabetes. Potential variants identified from this analysis will be presented and discussed in following chapters. Flowchart of the study strategy is shown in figure 2.4.

Figure 2.5: The above flow chart shows the strategy for the identification of rare genetic variants associated with T2D using genotyping arrays and whole genome sequencing.

76

2.8.1 Linkage Analysis

Linkage analysis was used to extract the inheritance information from the family pedigrees to identify chromosomal segments with SNPs for an allele segregates among family members. Here, two analyses were performed, non-parametric linkage analysis (NPL) and variance component linkage analysis (VC), which will be discussed in section 2.8.1.4 and Chapter Three section 3.4, on all available Qatari families.

Dense SNP panels cannot be used for linkage analysis as such markers are more likely to produce correlated data due to high linkage disequilibrium between them (Abecasis and Wigginton 2005). I compared the performance of two different subsets of markers for linkage analyses: a subset of SNPs was selected from the Illumina 2.5M panel and a subset of SNPs that was included in Illumina Panel V. The Illumina Panel V included a set of SNPs that have been selected for accurate linkage analysis by using the best markers with optimized information (Murray, Oliphant et al. 2004); these markers were designed and selected from multiplexed assay system. However, Panel V has been generated specifically for the Caucasian population; therefore I tried to select an alternative subset of the most informative SNPs in the Qatari families in this study.

Different quality control measures (will be explained later in this chapter) and marker selections were applied before performing linkage analysis. In order to implement these steps, two files should be generated to start the analysis. The two files were generated using GenomeStudio, these were pedigree file (PED) and mapping file (MAP file).

2.8.1.1 GenomeStudio

GenomeStudio genotyping module (Illumina) was used to generate PED and MAP files. The Pedigree file (PED) described the individuals’ relationships in the dataset and included all families’ phenotypic data and genotypes. The MAP file included all SNP markers including the marker name with the physical and genetic positions. Raw data including signal intensity file and cluster file were uploaded to GenomeStudio. The files required for linkage analysis (PED and MAP files) were exported in PLINK format.

77

2.8.1.2 SNP array Quality Control

Genotyping array data quality was crucial for accurate gene mapping and essential for obtaining better results from linkage analysis. Quality control (QC) thresholds were implemented on a large data set of more than 2 million SNPs (Illumina HumanOmni2.5M). Quality parameters such as gender errors, missingness, and minor allele frequency, and Hardy-Weinberg disequilibrium, Mendelian and non-Mendelian errors were applied to eliminate low quality of SNPs. Panel V included 6,056 SNPs, these SNP markers were constructed to provide the optimal solution to detect linkage regions based on the information content (discussed in section 2.8.1.3). The only quality measure implemented on panel V markers was non-Mendelian inheritance. Figure 2.6 showed a flowchart of all quality parameters implemented on both custom Illumina 2.5M markers and Illumina Panel V marker. Different publically-available computational tools were used to evaluate and examine the SNP data quality namely PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml) and MERLIN (Abecasis, Cherny et al. 2002) (http://www.sph.umich.edu/csg/abecasis/merlin/tour/disequilibrium.html).

2.8.1.2.1 PLINK

Starting key measures of data quality were performed using PLINK to generate lists of genotype missingness and individual missingness across the samples with a cut-off point of 5% missingness. Minor allele frequency (MAF) threshold of (<0.05) was implemented to rule out very rare variants (Goddard, Hopkins et al. 2000). Markers that showed deviations from Hardy-Weinberg equilibrium (HWE), which assumed that both allele and genotype frequencies would stay constant through generations in a given population regardless of external evolutionary influences, with a threshold of 1E-6 were excluded. (Neale and Purcell 2008). Since the study population was small and inbred that could violate Hardy-Weinberg assumptions, a stringent threshold was used. HWE included several assumptions, such as random mating, large sample size, and equal allele frequencies in both genders and non-overlapping generations. These starting quality filters were implemented on the markers obtained from Illumina 2.5M platform.

78

2.8.1.2.2 MERLIN

Mendelian errors (MEs) mean that an individual has not inherited an allele from his/her biological parents. This error could be due to genotyping errors, incorrect identification of the individuals as relatives or de novo mutations. MEs were tested using PEDSTATS software available with MERLIN package (http://www.sph.umich.edu.lcsg/abecasis/PedStats/).

PEDSTATS is a command-line program that offers the option to check the pedigree structures to ensure that all family members are consistent and connected (Wigginton and Abecasis 2005).

Non-Mendelian errors are usually referred to genetic inconsistencies that are not linked to Mendelian errors. All non-Mendelian errors detected in Illumina 2.5M and panel V markers are removed using MERLIN.

In addition, SNPs that are not uniquely mapped to a single position are excluded from the MAP file for both subsets of SNP (Illumina 2.5M and panel V). Some markers could have duplicated position because obligate recombination events occurred between two markers that have the same map position.

2.8.1.3 Marker Selection

After implementing the quality control and before running the linkage analysis, marker selection was performed to ensure that the most informative set of markers was included in the analysis.

2.8.1.3.1 Informativeness

Information content estimates the total inheritance information in the genome provided by the available genetic markers. Therefore, selecting informative markers for linkage analysis will measure the efficiency of extracting inheritance information, which is related to the power of identifying potential genes for the disease. Information content was calculated across all chromosomes using two different sets of markers (Illumina 2.5M and Illumina panel V) and the most informative set of markers extracted from the Illumina 2.5M was used for running linkage analysis. Illumina panel V markers were expected to provide the highest information (details in section 2.8.1.2). Information content was measured on a subset of SNP data extracted from Illumina 2.5M panel (around 994,351 markers) that were retained after quality steps as explained above, and on 2,056 SNP markers remained from panel V after QC. The information content of panel V was expected to be high with improved possibility of detecting a recombination event. The high informativness of these markers was due to density and increased heterozygosity. MERLIN was used to measure the information 79

content of each marker included in both panels (panel V and Illumina 2.5M). Then, the obtained values from the two sets of markers were compared and the average information content was plotted. The 25% of less informative SNPs were calculated for Illumina 2.5M markers and excluded.

Figure 2.5 shows a comparison of the average information content between the subset of SNPs extracted from the Illumina 2.5M array and Panel V after the quality control. From the figure, Panel V markers showed slightly lower average information content across the genome compared to Illumina 2.5M markers. This was expected since the total number of markers used in this panel was low compared to the subset of markers in Illumina 2.5M. Although, Illumina panel V markers were designed to provide the most reliable and informative SNPs, information content of these SNPs were lower than Illumina 2.5M markers. This observation might be due to low marker density and different ethnic group. Further analysis was carried out using the most informative subset of Illumina 2.5M markers.

0.78 0.76

0.74 0.72 0.7 0.68 0.66 0.64

information content 0.62 0.6 0.58 HumanOmni2.5 Panel V

Figure2.6.The plot shows the average information content of HumanOmni2.5M markers and Illumina panel V markers. It shows that HumanOmni2.5M markers have higher information content genome-wide with a value of 0.78, while panel V markers have a value of 0.65

80

2.8.1.3.2 LD Pruning

LD pruning was performed on the selected set of markers (Illumina 2.5M markers) to remove SNPs in high linkage disequilibrium, which meant excluding the markers that were more likely to be inherited together than to be inherited by chance. This was performed using pair-wise option with a threshold of 0.1 LD to ensure that high regions of LD will not be called as shared IBD segments (Purcell, Neale et al. 2007) (Abecasis and Wigginton 2005).

Figure2.7. The diagram illustrates QC steps showing the number of markers excluded at each step

2.8.1.4 Running Linkage Analysis using MERLIN

Multipoint non-parametric linkage analysis (NPL) and variance component linkage analysis (VC) were performed using MERLIN (Version 1.1.2) software, as mentioned above in section 2.8.1. NPL wass used to map candidate loci on specific chromosomal region associated with the disease phenotype. This method assumed no specific disease-gene model and examined whether the form of inheritance varied from the expected inheritance pattern. The analysis was performed with –npl option implemented in MERLIN. Also, VC analysis, which was usually referred to the amount of variation of a disease trait that was linked with disease phenotype, was performed to identify genetic involved in HbA1c variability. The families were assessed for their contribution to the linkage signals using MERLIN option --perFamily.

81

Logarithm of odds (LOD) score was estimated using MERLIN when running the analysis; LOD score assessed whether markers and disease co-segregation in a given pedigree was due to linkage or to chance. A significant linkage peak should have a LOD score >3, a LOD score of >1 was selected as a threshold to determine regions of interest in this study, given the limited samples size (details in Chapter Three section 3.4.1.3). Simulation test, which was used on simulated data to check how often the exact results could be obtained by chance, was performed to estimate empirical P-values.

2.8.1.5 Runs of Homozygosity (ROH)

Runs of homozygosity (ROH) referred to several homozygous SNP markers spanning a specific distance. This method has been found to be an efficient strategy to map rare genes involved in recessive traits among inbred populations (Génin and Todorov 2006). Homozygosity mapping was performed after running linkage analysis to narrow the regions identified through linkage analysis. The aim was to focus the analysis on identifying rare variants that might contribute to recessive form of disease of interest (McQuillan, Leutenegger et al. 2008), and to identify candidate regions overlapping between linkage and homozygosity regions. Also, to detect regions identical and homozygosity by descent (HBD) assuming a recessive disease model. The advantage of applying this method on an inbred population could easily detect segments of identity by descent (IBD) that was shared among affected individuals due to inheritance from a common ancestor. ROH was performed using the options described in PLINK software version v1.07 (http://pngu.mgh.harvard.edu/~purcell/plink/) to map disease susceptibility genes within homozygous regions assuming a recessive disease model. In this analysis, the linkage analysis subset SNPs that include the most informative markers was used to detect runs of homozygosity. PLINK default option for length of runs of homozygosity was selected and set at 5000kb. A window of 50 SNPs was selected that would slide across the genome allowing for 3 heterozygous SNPs and 15 missing SNPs. The proportion of overlapping homozygous windows was calculated at a threshold of 0.05 to identify SNPs at homozygous region. The option –group was used as well to find overlapping pools and matching homozygous regions.

82

2.8.2 Whole Genome Sequencing (WGS)

WGS was performed on all family members using Illumina Fast Track Sequencing Services. Whole genome sequencing was performed using Illumina hiSeq2500 platform at a minimum of 40x average coverage.

Processing of Illumina sequences consisted of three stages; these were library preparation, sequencing amplification and data processing of raw sequences. Illumina sequencing protocol required a minimum DNA concentration of 3 µg; and the protocol basically started with DNA shearing, enzymatic fragmentation and adapter ligation specific to sequencing platform. Illumina new sequencing technology implemented a PCR-free based method to reduce amplification bias due to effects of PCR processing. Therefore, PCR amplification step was discarded and replaced with bead size selection

2.8.2.1 Quality control measures for WGS

Quality control measurement was crucial to ensure the quality of samples sequences and enhance downstream analysis. Whole genome sequences were susceptible to quality problems, which included duplicate reads, low quality sequences, sequence artifacts and contamination. In addition, number of N bases in the sequence reads; this is referred to incorrect identification of base pair by the sequencer due to poor sample quality or the analysis has started before accurate sequence begins. Addressing these problems allowed the correction of sequencing reads that could decrease the computational time for downstream analysis.

Quality control was performed using two quality control tools FastQC and Qplot at two stages of data processing pipeline. These two stages were implemented at raw data level (FASTQ files) and aligned read data (BAM files) level. The FASTQ file was a text-based format file that included the nucleotide sequences and its quality score. The BAM file was a compressed binary version of the Sequence Alignment/Map (SAM) file, which was a text based format file that included aligned sequence reads.

83

2.8.2.1.1 QC of raw data using FASTQC

FastQC was a simple way to provide a quick overview of data quality. High throughput sequencers generated over than 2 million of sequences in a single run. Therefore, implementing a simple quality measure to check the quality of raw sequences would help in drawing biological conclusions. FASTQ files were tested to ensure the overall quality of the sequences value per base and per read, base composition and distribution of GC content.

The overall quality of sequence values per base assessed the confidence of the sequencer in order to recognize the correct base and to give the quality score, which was the Phred score. High quality scores were detected among the sequencing data of the Qatari families and the average quality score was about 38%. Also, no N content was detected in the sequences. If the quality of the read was low, high number of N will be given to the unrecognized bases. No base distribution bias was detected. The base distribution for each read determined by the overall base composition percentages. Base composition biases could be due to sequencing artifacts. Figure 2.7 showed the overall quality across all bases. No GC content biasness detected across the bases. The distribution of GC content showed the percentage of G or C base in a sequence. GC biases variation could be due to overrepresented sequences. Figure 2.8 showed the distribution of GC content.

84

Figure 2.8 shows the sequence quality score among all bases. The Y-axis represents the quality score. The green colour indicates the good quality scores. The orange determines the normal quality scores and the red colour shows the bad quality scores. It is expected to observe reduced quality scores at the end of the read due to high number of cycle runs.

85

Figure2.9. Shows the normal distribution of GC content. A little shift is observed created by systematic bias independent of base composition. Shifted distribution is considered as normal and not due to errors in GC distribution.

86

2.8.2.1.2 QC of aligned reads using QPLOT

QPLOT was used to examine the sequencing quality of sequence reads by assessing the percentage of sequence alignment to the reference genome. This program was used in R to generate the quality plot. It provided a summary statistics mainly the empirical Phred scores that were calculated based on the background mismatch (the difference of the sequenced bases from the reference genome). Figure 2.9 shows a graph of the obtained empirical Phred score.

Figure2.10. The above QPLOT shows that the reported Phred scores obtained for samples 1 and 2 are consistent with the calculated empirical Phred scores. This indicates that no base quality recalibration is needed. The observed drop in the score at the end is because of the high number of read cycles.

87

2.8.2.2 Generation of variant calls using GATK best practices pipeline

The raw data sequencing reads were aligned to the human reference genome build 37 using Illumina ELAND alignment tool. Illumina process sequencing reads using CASAVA pipeline. Illumina delivered the data in two files, BAM and VCF files. The marker and genotype data were stored in the Variant Call Format (VCF) file, which was a text based format file.

GATK best practices pipeline (GATK 2.4 version) was implemented in-house by the bioinformatics core. In this PhD project, GATK multiple-sample variant calling was used allowing the detection of more accurate genotypes. Multi-sample SNP calling provided the ability to distinguish between homozygous reference genotype and missing genotype from a cohort of multiple samples. Also, the ability to identify genotype calls at low coverage sites, where these sites in other samples included a confident variant (Kumar, Al-Shafai et al. 2014).

GATK pipeline involved several steps in order to create the annotated VCF files (DePristo, Banks et al. 2011). Bowtie2 (Langmead and Salzberg 2012) was used to align raw sequencing reads (FASTQ files) to the human reference genome build 37 and produced the Sequence Alignment Map (SAM files). Aligned SAM files were sorted and indexed using Novosort tool for sorting and indexing, and to create compressed and indexed BAM files. Then, replicated DNA sequencing reads would be removed using Picard tools (http://picard.sourceforge.net/command-live-overview.shtml). The initial alignment step created many mismatch errors and mapping artifacts. Furthermore, local realignment was implemented to realign sequence reads to Indels and removed initial realignment artifacts using GATK Indel Realigner. Base quality recalibration tools (GATK Base Score Recalibrator and print reads) were implemented to get accurate base quality and to remove score base errors. Variant calling including Indels and SNPs was performed using GATK UnifiedGenotyper for each family; GATK variant calls were recalibrated by Gaussian Mixture Model using GATK walker VariantScoreRecalibrator to ensure the selection of high quality genotypes. Variants were mapped and annotated using SNPeff and GATK walker VariantAnnotator that provided information on frequency and functionality of called variants. Figure 2.10 shows GATK pipeline steps.

88

Figure 2.11: the diagram illustrates the different steps used in GATK pipeline

SNPs concordance between Illumina HumanOmni2.5-8M array and sequencing data (using both CASAVA and GATK pipelines) were checked and showed very high per-chromosome concordance rate of about 99%. The figure 2.11 below showed a high concordance rate between Illumina sequencing variant calling and genotyping SNPs data. The concordance rate calculated the percentage of matched genotypes to the number of unmatched genotypes.

89

Figure 2.12: The figure represents the level of per-chromosome concordance rate (~99%) between sequencing and genotyping data. Low concordance rate is observed in chromosome Y; this could be due to the high number of repeated sequences in chromosome Y, which may lead to missing genotypes (Wei, Ayub et al. 2013).

2.8.2.3 Variants filtration and identification of rare variants using Qiagen Ingenuity Variant Analysis (IVA)

This part described the method and the strategy used to analyze WGS data. IVA was selected from Qiagen to analyze all variants extracted from GATK pipeline. IVA is a web-based application that allows the users to rapidly identify causal variants related to phenotype of interest with the focus on the biological significance of these variants using sequencing data (whole genome/whole exome). IVA software features different filtering characteristics to enable the users prioritize the disease causal variants; these features include function of variants based on the published biological evidence, variants frequency and other filters described later in section 2.8.2.4. Also, variants from different biological perspectives can be incorporated into the software to determine promising potential variants associated with phenotype of interest.

Licenses were obtained to upload sequencing data into IVA and different filtration steps were selected based on the strategy of this PhD project.

90

The study analysis strategy included three stages:

 Identifying known or novel mutations within known monogenic diabetes genes.  Focused analysis on variants within candidate regions. This included linkage and ROH regions, as well as overlapping regions between linkage and Homozygosity regions.  Focused analysis on rare variants. In this analysis, rare variants that were present with an allele frequency of <1% in 1000 genome project or complete genomics were analyzed. Also, the analysis will be focused on exonic variants that were predicted to be deleterious and related to disease phenotype.

2.8.2.4 IVA filtration options used at different WGS analysis stages

IVA program incorporated different filtration parameters that facilitate the identification of disease causing variants. Figure 2.9 showd an example of the different IVA filtering cascades that were applied in the analysis.

IVA offerd an option of using extra information and including data required for the analysis, for example using pedigree information or uploading a list of variants. Before implementing any filtration parameter, the pedigree information of all families was uploaded using the annotation option when opening a new analysis project. Also, candidate regions were arranged in different sets of tables and uploaded into IVA.

After starting a new analysis project, IVA parameters were adjusted according to the strategy explained above. Five different parameters suggested by IVA were adjusted for the analysis:

2.8.2.4.1 Confidence

Following the recommended default settings of these filtering parameters, the option was set to keep only variants that pass the minimum call quality value, calculated based on Phred- scaled scores, of at least 20 in cases and controls.

91

2.8.2.4.2 Frequency

According to the study hypothesis of looking for rare variants contributing to the development of T2D; the common variants filter was adjusted to exclude variants that are observed with an allele frequency of 1% in the 1000 Genome project and 1% in the public Complete Genomics genomes, as recommended by IVA default settings of the frequency filtering parameter.

2.8.2.4.3 Predicted Deleterious

Only variants that were identified by the software to be deleterious or observed to affect gene activity based on SIFT (Sorting Intolerant From Tolerant) function were selected. SIFT is a computational tool that allows users to prioritize gene of interest based on predicting whether protein function is affected by changes in the amino acid (Ng and Henikoff 2003). This option was adjusted to keep pathogenic or likely pathogenic variants that were experimentally related to T2D according to published literature. Also, to keep variants those were observed in literature to cause gain or loss-of-function in a gene.

2.8.2.4.4 Biological Context

This filter was adjusted to focus on variants that were known to affect or to have biological link to type 2 diabetes. The disease phenotype (such as, T2D for this study) was selected from the drop down list of phenotypes and only known T2D causing genes were considered in the analysis. Also, this filter allowed uploading a list of genes to identify variants within this list.

One of the project strategy analyses was to identify variants within known T2D monogenic genes, so a list of T2D monogenic genes that are published in literature was prepared by myself and uploaded to this filter (a list of monogenic genes was provided in the Chapter One, section 1.6.2, table 1.1).

2.8.2.4.5 Genetic Analysis

Here, the recommended settings for identifying variants with inferred gain or loss of function that were restricted to be transmitted in a family pedigree were selected. The analysis was focused on identifying variants that were homozygous, heterozygous and compound heterozygous or associated with gain of function.

92

Figure2.13. Above figure represents an example of the different filtering cascades that are implemented in IVA and used in this PhD project.

2.8.2.5 Validation of Candidate genes using PCR-based method

Primers for each candidate gene were designed using Primer 3 software (http://bioinfo.ut.ee/primer3-0.4.0/). The designed PCR forward and reverse primers were checked for their specificity and fast performance using UCSC In-Silico PCR (https://genome.ucsc.edu/cgi- bin/hgPcr). The length of primers was between 18 and 22 nucleotides; this length was enough for specificity and could easily bind to template DNA. The GC content in total was between 40 to 60%. The primers melting temperature (Tm) was between 56 to 63 °C. Details of the families sequenced and primers sequences were provided in Table 2.2.

93

All PCR reactions were prepared with a concentration of 10mmol of PCR primers and a total volume of 25μl of PCR reaction mix was used with 1μl of DNA (details of PCR reaction mix are described in Table 2.3). DNA gel electrophoresis was performed to examine all PCR products. First, 2% agarose gels were prepared with 1% tris-acetate EDTA (TAE) buffer, and 0.1ng/μl ethidium bromide was added to enable visualizing PCR products under the ultraviolet light. Gel electrophoresis was run and an electrical field was applied with 85V for one hour. DNA ladder (1 kb) was added to the gel to estimate the approximate size of PCR fragments.

Table 2.2: shows the details of primer sequences and the families sequenced

Candidate Gene Primer Sequences Family Sequenced Forward Primer Reverse Primer GCKR ctctgggcttgaggtgct ctaaaatcagaggccccaagg Family 1 TGM2 ctgctgtgtgggacagagtg cccatagcatcaccctcatc Family 2 IGFBP2 ttcagagccatggagaagct tctcttacagcacagcccc Family 5

Table 2.3: shows the PCR reaction mix for GCKR, TGM2 and IGFBP2 sequencing

Reagent Volume for one reaction Volume for 20 reactions Forward primer 0.5μl 10μl Reverse primer 0.5μl 10μl H2O 10.5μl 210μl Ampli-Taq 12.5μl 250μl Total 25μl 480μl

Touchdown PCR was used for all PCR reactions. This method was used to avoid amplification of non- specific sequences during a PCR reaction. The specificity of primer annealing was usually determined by the annealing temperature during the reaction. The upper limit of an annealing temperature was determined by the melting point of the forward and reverse primers. Therefore, very specific base pairing between the set of primers and the template will occur at temperatures just below the melting point. Primers would bind with less specificity at lower temperatures. PCR results would be obscured by non-specific primer binding as non-specific sequences during the early steps of

94

amplification would remove any specific sequences due to the exponential nature of the reaction amplification (De Smith, Walters et al. 2008).

The touchdown PCR protocol started with a high annealing temperature, which was decreased by 1°C subsequently for every two cycles. Accordingly, the primers would anneal at high temperatures and non-specific binding would be avoided. The first sequence amplified occurs at regions of primers high specificity that was most likely the sequence of interest. Further amplifications of these regions occur at lower temperatures of subsequent rounds. Touchdown PCR conditions are shown below.

Touchdown PCR Conditions

95°C for 15 minutes

95°C for 15 seconds

Tm+10°C for 45 seconds -1/cycle

72°C for 1 minute (if the PCR product size is < 1kb)

Go to 2 for 9 cycles

95°C for 45 seconds

Tm for 45 seconds

72°C for 1 minute

Go to 6 for 29 cycles

72°C for 10 minutes

Hold at 4°C forever

2.8.2.6 Sanger Sequencing

DNA sequencing method started with cleaning up the PCR product. The PCR product was cleaned-up using 2μl of EXOSAP (USB) for each 5μl of PCR product. 10μl of PCR product was cleaned with 4μl of EXOSAP; the reaction was placed in a thermocycler at 37°C for 15 minutes and at 80°C for another 15 minutes. Then, the samples were prepared for the sequencing by adding 1μl of forward and reverse sequencing primers to 4μl of cleared PCR product to get a total of 5μl of cleared sequencing PCR product.

95

All sequencing samples were submitted to Genomics Core lab at Weill Cornell Medical College in Qatar to run the Sanger sequencing. Sequencing was performed for all detected potential T2D causing genes (GCKR, TGM2 and IGFBP2). The analysis of sequencing results was performed using the software Mutation Surveyor (http://www.softgenetics.com/mutationSurveyor.html) (mutation Surveyor is user friendly software allows the detection of homozygous mutations, heterozygous mutations, Indels, heterozygous insertions and deletions).

2.9 Whole Genome Methylation Array

DNA methylation for all family members was performed using Illumina Infinium HumanMethylation450 BeadChip (450K) array. Illumina 450K BeadChip included 485,577 probes on bisulfate-treated DNA and targeted about 96% of CpG islands. Illumina Infinium 450K used a complex technology consisting of 2 different assays, Infinium I and Infinium II. The two probes used in Infinium I were designed to hybridize methylated (M) CpG site and unmethylated (U) CpG. Infinium I used one color channel for both methylated and unmethylated sites; the hypothesis was that the probes and targeted sites were equally methylated. Infinium II assay used one probe for each CpG site and implemented two color channels (green, red) for M and U sites, respectively. This method employed the same concept of Infinium I assay.

2.9.1 Data preprocessing

Illumina 450K BeadChip used two chemical assays that resulted in signal bias and technical inconsistency (as explained above). Therefore, choosing the appropriate pipeline was critical to provide a better adjustment for technical variability and probe bias elimination. Marabita et al. compared the performance of 6 different pipelines; based on this comparison Lumi: QN+BMIQ pipeline was selected (Marabita, Almgren et al. 2013). The pipeline was a lumi-based, which used lumi package in R followed by a quantile normalization and beta-mixture quantile normalization (BMIQ) that would be described later.

Initial filtering steps were implemented on the methylation data of 123 subjects using GenomeStudio methylation module to check for the distribution of the overall signal intensity, and the distribution of beta values (known as the ratio of methylated probe intensity and the total signal intensity) and M values (log2 ratio of methylated and unmethylated probe intensities). No samples were excluded at this stage since none have shown low or abnormal methylation profile (Figure 2.9).

96

Then, the percentage of detected sites was examined and all samples were observed to have an average of 99.5% of methylation detected sites, as shown below in figure 2.10. Therefore, no samples were filtered out based on these filtering criteria.

Figure 2.14: The plots show normal distribution of probe intensities (above) and methylation profiles (below).

97

% Detected sites (0.01)

100.0%

99.5%

99.0%

98.5%

98.0% 0 50 100 150 200 250

Figure 2.15: The plot shows a percentage of 99% of detected sites

GenomeStudio different filtration parameters were implemented to ensure the quality of the data. Illumina 450K methylation assay provided a total of 485,577 methylation probes from which 3091 probes were filtered out due to missing cg identifiers (CpH probes). Probes detected on X chromosome (11,135 probes) and Y chromosome (416 probes) were excluded to remove artifacts resulting from gender disproportion, in addition to 65 probes tag SNPs provided in the illumina manifest were also excluded. These SNPs were included in probes positions to add quality level and to allow DNA fingerprint. Then, 2398 probes were filtered out with a detection p-value of >0.01 in more than 5% of total number of samples and 468,472 probes remained. Finally, SNPs at probe binding sites were detected using WGS data and all methylation values were set to missing when detecting SNPs within a region of +/-110 base pairs of the CpG site (Zaghlool, Al-Shafai et al. 2015). This could avoid any interference of SNPs at probe binding sites.

98

2.9.2 Lumi: QN+BMIQ pipeline

As explained above, Illumina 450K arrays used 2 two bead types, Infinium I and Infinium II (Bibikova, Le et al. 2009, Bibikova, Barnes et al. 2011). The assay used two color channels to label the methylated and unmethylated probes. This complex assay designed results in color channel bias and probe type bias. Therefore, selecting an effective pipeline to eradicate the probe bias and color channel bias was vital to ensure the quality of the data (Marabita, Almgren et al. 2013) (Du, Kibbe et al. 2008).

The Lumi: QN+BMIQ pipeline was run using ‘lumi’ package from the Bioconductor. As mentioned above, illumina Infinium technology used two assays in which Infinium I used two bead types with a single color channel for both methylated and unmethylated sites. While Infinium II used one bead type with red and green color channels, and the color of the signal was depending on the methylation probe site (either methylated or unmethylated). Consequently, some of the sites would be measured in red channel, whereas others would be measured in the green channel. This resulted in different CpG sites measures. Inconsistencies of labeling efficiency and color channels could lead to differences between color channels, which could be detected with intensity measures. The variability of the Infinium technology produced color inequity as the assay used two different color channels to measure the probe intensities of methylated and unmethylated sites. Color imbalance adjustment was considered as a normalization method between the methylated and unmethylated probes before applying the quantile normalization (QN) (Marabita, Almgren et al. 2013). The density distribution of the two color channels within methylated and unmethylated probes intensities was assumed to be similar in all samples, but there were different numbers of probes within each color channel. Figure 2.11 illustrated the density distribution of probe intensities. Thus, SmoothQuantileNormalization method was selected as a replacement for the regular QN to adjust intensities imbalance.

QN adjusted the total CpG methylation variances between the samples and focus the signal between arrays to the middle. Adjustment of methylation differences was performed on the intensity signals in order to normalize the methylated and unmethylated probe intensities. An extra normalization step was added using Beta Mixture Quantile Dilation (BMIQ) to remove the probe bias as QN was didn’t adjust the bias completely (Marabita, Almgren et al. 2013). The figures 2.12 showed the results of color balance and intensity distribution after adjustments.

99

Figure 2.16: The scatter plot (left) shows color imbalance of methylated and unmethylated probe intensities. The plot (right) shows the differences in CpG-sites intensity distribution with obvious color imbalance

Figure 2.17: The scatter plot (left) shows balanced color distribution between methylated and unmethylated probe intensities after color adjustment. The plot (right) shows the distribution of CpG-sites intensities after QN.

100

As mentioned above, the differences between methylated and unmethylated probes in the two assays (Infinium I and Infinium II) resulted in density distribution imbalance of probe intensities among samples. BMIQ normalization used three states for the methylation probes. These were unmethylated, hemi-methylated and unmethylated sites. The methylation mixture differences between Infinium I and II were eliminated as shown in the plots below Figures 2.13 and 2.14.

Figure 2.18: The plots show the variation between Infinium I and II. Around 72% of the 485,577 probes are Infinium II (different color channels) while the remaining 28% are Infinium I (single color channel).

Figure 2.19: The plot shows the peaks of Infinium I and II after BMIQ normalization. The peaks of the Infinium assay are matching after normalization.

101

2.9.3 Quantification of Leukocytes Composition

Estimation of blood cells composition was crucial in DNA methylation studies to understand the variability in DNA methylation profiles and disease pathogenesis. Therefore, blood cell heterogeneity could act as a confounder factor when studying DNA differential methylation. Blood cell types can simply be measured using the available biological methods that were based on flow cytometry. Flow cytometry based approach required large amounts of fresh blood and labor intensive antibody tagging. MethylSpectrum (or the Houseman method), the method used in this project (Houseman, Accomando et al. 2012), was based on the concept of using differentially methylated regions (DMR) as markers to distinguish cell lineage and identify specific cell lineage. This approach was used to determine cell types by using DNA methylation signatures to identify an association with a disease. DNA methylation signature blood cell composition (including B cells, granulocytes, monocytes, Natural Killer cells, and T cells subset (CD8⁺-T-cells, and CD4⁺-T-cells) was fitting into framework models of measurements error resulting in biased estimates when using noisy markers. Model calibration and bias correction can be approached by obtaining internal or external validation data (Houseman, Accomando et al. 2012).

MethylSpectrum implemented a deconvolution of DNA methylation array into source contributions from different blood cell types by defining the white blood cells alignment from methylation data when assayed using whole blood. This depended on the DNA methylation signature of each of whole blood cell components. The DNA methylation signatures were used as surrogate of white blood cells distribution that could be used to estimate a disease state. Since the DNA methylation signature was believed to be associated to the distribution of white blood cells, using a noisy surrogate marker could lead to error models and bias estimation of disease association (Houseman, Accomando et al. 2012). This bias could be corrected using validation data.

102

2.9.4 Statistical Analysis

Association tests of DNA differential methylation and the disease phenotype was performed using a variance component framework. This would model the similarity between all family members. Variance component analysis to test the associations between DNA methylation and disease phenotype was performed using a linear mixed model where the total phenotypic variance was divided into polygenic and environmental variances. DNA methylation levels were treated as fixed effect, while the polygenic and environmental variances were treated as random effects. Different covariates were added as fixed effects including age, gender and the six different blood cell types. Two models were used in this study, for T2D the age, gender, BMI, and the six Houseman cell type coefficients (including B cells, granulocytes, monocytes, Natural Killer cells and T cells) were included in the model. For the methylation associations with the BMI the model included also age, gender, diabetes, and the six Houseman cell coefficients. The IlluminaHumanMethylation450k.db annotation package in R was used to annotate all DNA differentially methylated sites.

103

Chapter 3

Mapping Rare Genetic Variants of T2D among Qatari Families

104

3 Mapping Rare Genetic Variants of T2D among Qatari Families

3.1 Background

Some complex diseases and common phenotypes are found to be clustered in families and are thought to be influenced by both environmental and genetic factors. The genetic contribution to such common phenotypes has been partially explained (Manolio, Collins et al. 2009). Also, several studies are focused on the identification of common variants associated to complex diseases (Cirulli and Goldstein 2010). The effects of common variants on common phenotypes were relatively small and not well understood yet (Pritchard and Cox 2002, Gibbs, Belmont et al. 2003, Plomin, Haworth et al. 2009). It is believed that multiple rare variants contribute substantially to the risk of complex disorders and might have stronger effects more than common variants on disease risk (Pritchard 2001, Bodmer and Bonilla 2008, Schork, Murray et al. 2009). Investigation of families may facilitate the identification of rare variants and determine their associations with the disease phenotype, because affected relatives will share such variants at high frequency (Manolio, Collins et al. 2009). Elucidating the genetic components of familial form of specific condition could provide better understanding of disease biology, especially when investigating extended families from inbred population (Manolio, Collins et al. 2009). Such population with high degree of inter-relatedness with low genetic variations and decreased allelic heterogeneity can be useful in detecting rare variants contributing to disease phenotype.

This chapter describes the strategy implemented for the identification of known or novel rare variants within known monogenic and common genes associated with T2D. In this analysis eight Qatari families, partially consanguineous, were investigated using genome-wide linkage scan, homozygosity mapping and the state-of-the art whole genome sequencing (WGS). All the work presented in this chapter was performed by me.

3.2 SNP array Quality Control

Different quality control measures were implemented to ensure the quality of SNP calls and to obtain better results. These measures are described in details in Chapter Two, section 2.8.1.2.

Quality control was performed using two publically-available computational tools namely PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml) and MERLIN (Version 1.1.2) (http://www.sph.umich.edu/csg/abecasis/merlin/tour/disequilibrium.html). Eight native Qatari families were included in the analysis.

105

3.3 Identification of Candidate Regions

Genomic regions of interest were detected using two statistical methods, linkage analysis and homozygosity mapping. Using a total number of eight Qatari families, candidate regions were identified first using Linkage analysis. Linkage was followed by ROH to narrow the region of interest. Then, analysis of WGS data was performed to identify rare variants within candidate regions and within known monogenic T2D genes; global analysis looking for rare variants with low allele frequency (<1%) in 1000 genome and complete genomics was implemented also (details of the analysis strategy are described in Chapter two). All members of the families (including affected and non-affected) were included in the analysis.

3.4 Linkage Analysis

Linkage analysis was used to extract the inheritance information from the family pedigrees through the identification of chromosomal segments for an allele segregated among family members. MERLIN software was used to run linkage analysis with a set of markers extracted from Illumina 2.5M platform. Then, linkage was run using a set of markers extracted from panel V to compare between linkage peaks obtained from both sets of markers (details are described in Chapter Two, section 2.8.1).

Two analysis were performed, first non-parametric linkage analysis (NPL) was conducted to map candidate loci on specific chromosomal region associated with the disease phenotype. Second, variance component (VC) analysis was performed to assess the amount of variation of the disease quantitative traits that may have random-effects on disease phenotype.

Additionally, the contribution of each family to the linkage signals was assessed using per family option in MERLIN software. The families were genotyped using high dense SNPs generated by Illumina platform, from which a subset of genotypes was used to identify chromosomal regions segregated with the disease phenotype (Kruglyak, Daly et al. 1996).

106

3.4.1 Running Linkage Analysis using markers obtained from Illumina 2.5M platform

3.4.1.1 Non-parametric Linkage Analysis

Non-parametric multi-point linkage (NPL) analysis was performed using clusters of markers after removing the markers that were in high LD (Abecasis and Wigginton 2005). NPL analysis was implemented for affection status as discrete trait. Age and gender were taken into consideration as covariates, but no important differences were observed in linkage plots or LOD scores. I additionally used BMI as a covariate in a separate run, given the close interaction of BMI with T2D (Elbein, Das et al. 2009). Adding BMI as covariate didn’t have any impact on the observed LOD scores and linkage peaks.

3.4.1.2 Variance Component Linkage Analysis

Variance component analysis was performed using the same cluster of markers used in NPL analysis.

HbA1c was examined as a quantitative trait using variance component analysis (Chen and Abecasis 2006) after applying quantile normalization.

3.4.1.3 LOD Score Calculation

LOD scores were calculated through the NPL and variance component linkage analysis. Given the limited sample size, a LOD score of >1 was arbitrary chosen as threshold for the identification of regions to be further explored. Linkage signal boundaries for the regions of interest were calculated by (1-LOD drop support interval). From NPL analysis, relatively interesting LOD score peaks of 1.3 and 1.4 where identified on chromosomes 1, 2 and 6, respectively. When compared to the results from vc linkage analysis for HbA1c, the LOD score peaks of chromosomes 1, 2 and 6 showed low LOD scores of 0.8, 0.9 and 0.24, respectively. Also, vc linkage analysis showed higher linkage peaks on chromosome 13 with a LOD score of 1.5 and chromosome 18 with a LOD score of 1.4 (see figures 3.1) (see appendix B2 for details of linkage peaks). Then, empirical P-values were calculated by running a simulation test to check how often the same results could be obtained by chance. Results showed significant empirical P-values for all peaks obtained by both NPL and VC. Table 3.1 showed a summary of NPL and VC results.

107

Figures3.1. (a) Highest linkage peaks obtained through NPL on chromosomes 1, 2 and 6

Figures3.1. (b) show the highest linkage peaks obtained through VC in chromosomes 13 and 18 using the custom Illumina 2.5M map.

108

Table3.1 shows summary of linkage analysis results (NPL and variance component) for chromosomes 1, 2, 13 and 18 that showed higher LOD peaks.

Position 1 LOD Linkage Chromoso Position LOD Empirical SNP (GRCh37/ Confidence Position Trait me (Mb) Peak P-value hg19) Interval Coordinates

232,349,100- -3 1 244.291 rs780225 1q42.2 T2D 1.3 9.6 x 10 1q42.2-1q43 242,675,279

69,305,497- -3 2 103.918 rs1583148 2p12 T2D 1.4 4.8 x 10 2p13.3-2q14.3 124,035,792

145,582,592- -3 6 155.116 rs803447 6q25.1 T2D 1.3 8.5 x 10 6q24.1-6q27 164,665,735

24,215,762- -3 13 21.816 rs9512673 13q12.2 13q12.12- HBA1c 1.5 4.3 x 10 13q13.1 33,678,221

65,332,430- -3 18 100.209 rs9962600 18q22.2 18q22.1-18q2 HBA1c 1.4 4.8 x 10 2.3 70,068,454

3.4.1.4 Per-family Linkage Analysis

NPL and VC Linkage analysis were run per-family using MERLIN option –perFamily. The chromosomes with the highest LOD scores obtained from running NPL and VC linkage analyses for all families together were selected to perform the per family analysis. This run was performed to identify the families with the highest contribution to the strongest linkage peaks detected when running the analysis with all families together. Some of the LOD scores obtained within the chromosomes above (chromosomes 1, 2, 6, 13, 18) might be higher as a result of family-specific co- segregation of the linkage region. The per family NPL analysis showed that families 1, 2, 3, 5, 7 and 9 had positive contribution to linkage signals on chromosomes 1 and 2; and families 1, 2, 3, 5, 8 and 9 had positive contribution to linkage signals on chromosome 6. Also, VC per family analysis showed that families 1, 3, 5, 7, 8 and 9 had positive contribution to linkage signals on chromosome 13; and families 3, 5, 7, 8 and 9 had positive contribution to linkage signals on chromosome 18. The tables below 3.2 to 3.6 show a summary of the highest LOD score and the families contributing to these linkage signals.

109

Table 3.2: show a summary of LOD score detected in chromosome 1 at the highest linkage peak (rs780225) with NPL analysis per family

Chromosome Most significant SNP Family Trait LOD score within the peak 1 rs780225 5 T2D 0.897179 1 rs780225 8 T2D 0.41901 1 rs780225 3 T2D 0.366111 1 rs780225 1 T2D 0.31636 1 rs780225 7 T2D 0.312616 1 rs780225 9 T2D 0.201219

Table 3.3: show a summary of LOD score detected in chromosome 2 at the highest linkage peak (rs1583148) with NPL analysis per family

Chromosome SNP with highest peak Family Trait LOD score 2 rs1583148 5 T2D 0.668944 2 rs1583148 7 T2D 0.667325 2 rs1583148 8 T2D 0.437959 2 rs1583148 3 T2D 0.366102 2 rs1583148 1 T2D 0.31636 2 rs1583148 9 T2D 0.201219

Table 3.4: show a summary of LOD scores detected in chromosome 6 at the highest linkage peak (rs1583148) with NPL analysis per family

Chromosome SNP with highest peak Family Trait LOD score 6 rs803447 5 T2D 0.667349 6 rs803447 7 T2D 0.525256 6 rs803447 8 T2D 0.4444 6 rs803447 3 T2D 0.366114 6 rs803447 1 T2D 0.277065 6 rs803447 9 T2D 0.201217

110

Table 3.5: show a summary of LOD scores detected in chromosome 13 at the highest linkage peak (rs9512673) with VC linkage analysis per family

Chromosome SNP with highest peak Family Trait LOD score 13 rs9512673 3 HBA1c 1.36949 13 rs9512673 5 HBA1c 0.63132 13 rs9512673 7 HBA1c 0.48376 13 rs9512673 9 HBA1c 0.21814 13 rs9512673 8 HBA1c 0.05126 13 rs9512673 1 HBA1c 0.01405

Table 3.6: show a summary of LOD scores detected in chromosome 18 at the highest linkage peak (rs9512673) with VC linkage analysis per family

Chromosome SNP with highest peak Family Trait LOD score 13 rs9962600 3 HBA1c 1.09418 13 rs9962600 5 HBA1c 0.45741 13 rs9962600 7 HBA1c 0.13962 13 rs9962600 9 HBA1c 0.12075 13 rs9962600 8 HBA1c 0.05462

3.4.2 Running Linkage Analysis using Panel V markers

Linkage analysis was run also using genome-wide linkage map called panel V. This panel included optimized set of markers that were constructed to deliver the most accurate linkage results (details in Chapter Two). As discussed previously in Chapter Two, section 2.8.1.3, the average information content for panel V (0.65) was slightly lower than the average information content for Illumina 2.5M markers (0.78). Therefore, the highest linkage peaks obtained (chromosomes 1, 2, 6, 13 and 18) using Illumina 2.5M set of markers were compared to the linkage peaks obtained from Illumina panel V set of markers (panel V linkage peaks are available in Appendix B1). As expected, linkage analysis using Illumina panel V markers showed lower peaks than the Illumina 2.5M markers, thus suggesting that the Illumina 2.5M panel was better capturing the chromosomal inheritance pattern in the samples. Further analysis was focused on Illumina 2.5M markers.

111

3.5 Runs of Homozygosity

Homozygosity mapping is a common technique used to identify segments with homozygous SNPs for an allele segregates among affected members and discordant with unaffected family members. PLINK software version (v1.07) was used to identify homozygosity regions in all families included in the linkage analysis. More details about the parameters used in the analysis can be found in Chapter Two, section 2.8.1.5.

3.5.1 Detection of Runs of Homozygosity (ROH) PLINK was used to run homozygosity mapping, as explained Chapter Two. Different homozygous regions were detected in all chromosomes, and found to be shared between individuals from one family or different families. The identified homozygous and candidate linkage regions were compared together to identify the overlapping regions between them in order to help detecting recessive homozygous variants. Overlapping regions between linkage and homozygous stretches were found in chromosomes 1, 2 and 6. Enrichment was calculated for homozygous stretches that were present in cases more than controls and overlap with linkage regions following the equation below: Enrichment= (cases with ROH*controls without ROH) / (cases without ROH*controls with ROH)

Homozygous regions with enrichment value of >1 were considered in the analysis.

Table 3.7 represented a summary of the overlapping regions between linkage and homozygous segments.

112

Table 3.7 shows a summary of the overlapping regions between linkage and homozygous segments

Linkage Regions total No Homozygous Regions of cases controls cases controls chromosome trait start end subjects start end with with without without Enrichment Position Position with Position Position ROH ROH ROH ROH (bp) (bp) overlap (bp) (bp)

1 T2D 230863396 241878353 3 235148098 241878353 3 1 21 12 1.7 1 T2D 230863396 241878353 3 233422186 241878353 3 1 21 12 1.7 1 T2D 230863396 241878353 3 230863396 241878353 3 1 21 12 1.7 1 T2D 230863396 241878353 1 230863396 241878353 2 1 22 12 1.1 2 T2D 69305747 125362837 2 69305747 72206006 4 1 20 12 2.4 2 T2D 69305747 125362837 2 69305747 72206006 4 1 20 12 2.4 2 T2D 69305747 125362837 3 73368662 79018187 3 1 21 12 1.7 2 T2D 69305747 125362837 3 73368662 90237441 3 1 21 12 1.7 2 T2D 69305747 125362837 3 76610800 77827848 3 1 21 12 1.7 2 T2D 69305747 125362837 2 69305747 72206006 3 1 21 12 1.7 2 T2D 69305747 125362837 2 69305747 72206006 3 1 21 12 1.7 2 T2D 69305747 125362837 2 73368662 90237441 2 1 22 12 1.1 2 T2D 69305747 125362837 2 79458752 81668770 2 1 22 12 1.1 6 T2D 142679572 165963574 2 160560845 165963574 2 1 22 12 1.1 6 T2D 142679572 165963574 2 163371020 165963574 2 1 22 12 1.1

113

3.6 Identification of T2D Candidate Variants through Whole Genome Sequencing Analysis

The identified genomic regions were explored further to identify and prioritize a list of rare variants segregating within the families using WGS. This analysis was focused on T2D variants’ discovery within T2D genes within the candidate genomic regions (linkage and homozygosity regions). In addition to, analyzing candidate overlapping regions between linkage and homozygosity stretches. In this study WGS was carried out on all family members. After quality control and variant calling, sequencing data were analyzed using Qiagen Ingenuity Variant Analysis (IVA). Different filtering parameters provided by IVA software were implemented according to the study strategy (more details in Chapter Two, section 2.8.2.3) in order to prioritize a list of T2D variants.

3.7 Qiagen Ingenuity Variant Analysis (IVA) After data quality, the identification of rare variants within T2D genes was performed using IVA (http://www.ingenuity.com/products/variant-analysis). IVA is a web-based application offers different analytical options to prioritize variants of interest. A list of rare variants was prioritized according to the study analysis strategy. The analysis strategy included the identification of known or novel variants within known T2D monogenic genes, within candidate regions and rare variants with a frequency of <1% in 1000 genomes (details are in Chapter Two).

3.8 Candidate Variants within T2D genes

Three variants were detected in total after setting the IVA filtration parameters according to the study strategy. Three different analyses were performed to prioritize the list of potential T2D variants. One novel variant was identified within known T2D monogenic genes, a variant within GCKR. Two variants were identified within homozygous regions, variants within TGM2 and IGFBP2 in chromosomes 20 and 3, respectively. No variants were detected in overlapping regions or linkage regions. Table 3.8 represented a summary potential candidate variants resulted from each analysis.

114

Table 3.8 shows a summary of candidate variants identified from each analysis.

Case Control SIFT Reference Sample Gene Gene Protein Samples Samples Translation Chromosome Position Function Assessment Allele Allele Region Symbol Variant With With Impact Prediction Variant Variant

Variants Identified within Known T2D genes

Likely 2 27726416 G A Exonic GCKR p.R227Q 3 1 missense Tolerated Pathogenic

Variants Identified within Homozygous regions

Likely 20 36776379 G A Exonic TGM2 p.R222Q 2 0 missense Damaging Pathogenic

Likely 3 185407162 A G Exonic IGF2BP2 p.T220A 2 0 missense Damaging Pathogenic

115

3.9 Variants Identified from each analysis

3.9.1 Variants within Monogenic Genes of T2D

3.9.1.1 Filtering options

This analysis was focused on the identification of known or novel variants within T2D monogenic genes. The IVA filtration parameters (including confidence, frequency, predicted deleterious, biological context and genetic analysis) were adjusted according to the aim of this analysis (details of filtering options in IVA are described in Chapter Two, section 2.8.2.4). The biological context filter was designed to detect variants with biological relevance to disease of interest. Also, it allowed the analysis of specific genes of interest. In this analysis, a list of known T2D monogenic genes (including MODY and neonatal genes) were uploaded and the filter was set to only keep variants within the list of uploaded genes. Figure 3.2 shows an example of the biological context filter applied in IVA.

Figure 3.2 shows an example of options provided in the biological context filter provided by IVA

116

3.9.1.2 Variants Identified within T2D monogenic genes

After filtration and prioritization steps, four variants were identified in total. The four variants were examined carefully to select T2D-causing variants within known T2D monogenic genes; variants that found to be pathogenic and segregated in cases more than controls were selected. According to these selection criteria, one variant was selected to be assessed further.

The variant detected in GCKR, glucokinase regulatory protein, a monogenic T2D-causing gene. GCKR is one of the maturity onset diabetes of the young genes (MODY), type 2 (MODY 2). The GCKR variant is in the exonic region of chromosome 2 at 27726416 bp position. It is a missense, heterozygous mutation within the protein variant R227Q that is likely to be pathogenic according to the biological assessment provided by IVA.

The R227Q mutation was novel and was not detected in dbSNP or the 1000 genomes. Table 3.8 shows a summary of potential variant within T2D monogenic genes.

The mutation was detected in three cases and one control in family 1 transmitted over three generations (from the grandmother to daughter to grandson and granddaughter). All samples passed the minimum selected call quality value (at least 20); all samples scored 255 call quality. Also, the read depth per-samples carrying the mutation were ranging between 30 and 50 passing the minimum required value. This mutation was looked for within the WGS data of 108 Qataris and was not identified.

The identified variant mutation was examined manually within the results obtained from linkage analysis and homozygosity mapping. As shown above, a significant linkage peak was detected in chromosome 2 with a LOD score of 1.4, and family 1 contributed positively to this linkage peak with a LOD score of 0.32. The identified variant was not located within the linkage region in chromosome 2. Also, was not located in any homozygous pool within chromosome 2.

117

3.9.2 Variants within Candidate Regions

Having identified the candidate regions from the linkage analysis, homozygosity mapping and regions overlapping between linkage and homozygous stretches, the analysis here was focused on identifying rare variants within these candidate regions. First, the overlapping regions between linkage and homozygosity regions were analyzed to prioritize potential rare variants.

Second, the analysis was focused on identifying rare variants within the linkage regions. As described above, there were five significant linkage regions detected in chromosomes 1, 2, 6, 13 and 18. Significant peaks associated with T2D were identified in chromosomes 1, 2 and 6; and two peaks associated with HBA1c trait were detected in chromosomes 13 and 18 (details of linkage results described above, section 3.4.1). Finally, rare variants within homozygous regions were investigated.

3.9.2.1 Filtering options

The different IVA filtration parameters (including confidence, frequency, predicted deleterious, biological context and genetic analysis) were adjusted according to the aim of this analysis (details of filtering options in IVA are described in Chapter Two, section 2.8.2.4). After creating a project and setting the filtration selections, IVA offered an option to creating lists and uploading more columns to the analysis. A list of overlapping regions was created and each region was given a specific number randomly (indicated as score in tables from 3.9 to 3.11) to link the identified variants with candidate regions. Also, lists of candidate linkage regions and homozygous regions were created and each region was given a score. All lists were uploaded into IVA and the analysis was focused on regions specified in each list.

Table 3.9, 3.10 and 3.11 showed all lists uploaded to IVA including overlapping candidate regions, linkage and homozygous regions.

118

Table 3.9 presents the identified overlapping regions between linkage and ROH in each chromosome. The regions are defined by starting and end position of the chromosomes.

Chromosome ChromStart ChromEnd Empty Score

1 235148098 241878353 2

2 73368662 90237441 9 6 160560845 165963574 15

Table 3.10 includes the five candidate linkage regions. Each region is given a different number or score. Chromosome ChromStart ChromEnd Empty Score

1 230863396 241878353 1

2 69305747 125362837 2

6 142679572 165963574 6

13 24535111 33307709 13 18 64850490 70068204 18

119

Table 3.11 shows an example of homozygous regions identified in each chromosome. Detailed table is provided in appendix A2.

Chromosome ChromStart ChromEnd Empty Score 1 235148098 249210707 1 2 43386134 44900846 10 3 101613557 111631274 4 4 74983 7384439 21 5 70307464 84621404 14 6 23490117 25060346 7 7 12745968 13900784 12 8 18668608 21176834 10 9 126926107 129293900 5 10 52910580 54525678 6 11 40143676 42656681 12 12 37858073 40264547 15 13 40984601 55668296 2 14 47236994 48647168 15 15 46747377 48539026 23 16 53540295 65856443 1 17 4448756 6021541 7 18 20781212 21850617 7 19 46118127 48511888 1 20 252757 6255188 1 21 22393029 24810710 11 22 22597173 48705291 4

120

3.9.2.2 Variants Identified within overlapping regions

After implementing the filtration and prioritization steps based on the analysis strategy, no variants with biological relevance were detected. Figure 3.3 showed the number of variants detected in each filtering parameter.

Figure 3.3 shows the output from each filtering parameter. Figure (A) shows that no variants with biological relevance are found within the overlapping regions. Figure (B) shows no rare recessive variants within regions of interest.

A) B)

121

3.9.2.3 Variants Identified within Linkage regions

This analysis was focused on identifying rare variants within the linkage regions. A described above, there were five significant linkage regions detected in chromosomes 1, 2, 6, 13 and 18. Significant peaks associated with T2D were identified in chromosomes 1, 2 and 6; and two peaks associated with HBA1c trait were detected in chromosomes 13 and 18 (details of linkage results described above, section 3.4.1).

The list of variants was prioritized according to the analysis strategy and filtration criteria. One variant was detected within the linkage region. The variant detected in SLC22A1, solute carrier family 22 (organic cation transporter), member 1. This gene is essential for metformin transportation into the liver by the organic cation transporter 1. The SLC22A1 variant was in the exonic region of chromosome 6 at 160543080 bp position. It was a missense, heterozygous mutation within the protein variant G38D that had a damaging effect according to SIFT function and likely pathogenic according to IVA assessment. The R227Q mutation was novel and was not detected in dbSNP or the 1000 genomes.

The variant was detected in one case and one control in family 9. The selection criteria of candidate variants was based on detecting variants that segregated in cases more than controls. Since the SLC22A1 variant was detected in one case and one control, further assessment of this variant was not required. Table 3.12 shows a summary of variant identified in SLC22A1.

Table 3.12 presents a summary of variant identified in SLC22A1 within linkage region.

Variant With Variant Chromosome Position Reference Allele AlleleSample RegionGene Gene Symbol Protein Case Samples With Variant Control Samples Translation Impact SIFT Function Prediction Assessment

Damagin Likely 6 160543080 G A Exonic SLC22A1 p. G38D 1 1 missense g Pathogenic

122

3.9.2.4 Variants Identified within Homozygosity regions

Different homozygous regions have been detected in each chromosome and the analysis here was focused on detecting rare variants within these identified regions. A range between three and ten homozygous regions has been identified in each chromosome. All homozygous regions were included in this analysis.

Following the study analysis strategy the filtration parameters have been adjusted and a list of rare variants within the candidate regions has been prioritized. Seventeen variants were detected in homozygous regions and only two variants were selected for further assessment (variants enriched in cases more than controls were selected as potential candidate variants). The first variant detected in TGM2, Transglutaminase 2. The TGM2 variant was found in the exonic region of chromosome 20 at 36776379 bp position. It was a missense, heterozygous mutation within the protein variant R222Q that had a damaging effect according to SIFT function and likely pathogenic according to IVA assessment. The R222Q mutation was known and was detected in dbSNP (rs200551434). The mutation was identified in two cases from family 2 and no controls found to have the mutation.

TGM2 variant was found in a homozygous region within chromosome 20 that ran from SNP rs2424794 position 29507776 bp to SNP kgp5392677 position 45881388 bp. Table 3.8 showed a summary of variant identified in TGM2.

The second variant detected in IGF2BP2, Insulin-Like Growth Factor 2 MRNA-Binding Protein 2. The IGF2BP2 variant was found in the exonic region of chromosome 3 at 185407162 bp position. It was a missense, heterozygous mutation within the protein variant T220A that had a damaging effect according to SIFT function and likely pathogenic according to IVA assessment. The T220A mutation was novel and was not detected previously in dbSNP or the 1000 genome. The mutation was identified in two cases from family 5 and no controls found to have the mutation.

IGF2BP2 variant was found in a homozygous region within chromosome 3 that ran from SNP rs9869582 position 183786678 bp to SNP kgp8031547 position 187942700 bp. Table 3.8 shows a summary of variant identified in IGF2BP2.

123

3.10 Variants from Global Analysis

Here, the analysis was focused on the discovery of exonic rare variants with biological relevance to disease phenotype, and have had a frequency of <1% in 1000 genome project or complete genomics datasets. Filtration parameters were adjusted to exclude frequent variants, control samples with homozygous and heterozygous genotypes, and to keep pathogenic variants (details of filtering options are discussed in Chapter Two, section 2.8.2.4).

Four variants were detected in this analysis. According to our selection criteria, one variant was found to be enriched in cases more than controls and found to have functional role in T2D. The variant was identified in TGM2 with SNP rs200551434. The same variant identified and discussed above in the previous analysis (variants identified within homozygous region).

3.11 Validation of Variants within Candidate T2D genes

The identified candidate variants from the WGS data were validated and sequenced using Sanger sequencing. A set of forward and reverse primers were designed to sequence each of the candidate genes. Details of the primer designs and PCR conditions used for sequencing were described in Chapter Two, section 2.8.2.5.

3.11.1 Known monogenic T2D genes

One known monogenic T2D gene was identified, GCKR, and sequenced. The GCKR variant mutation was detected in family 1 among three cases and one control. The mutation was a novel heterozygous mutation with damaging effect according to SIFT function. This mutation had not been reported previously in Qatari families. All affected subjects have been diagnosed with early onset T2D (2 subjects were diagnosed at an average age of 35 and one was diagnosed at age of 21) by an endocrinologist. All cases carrying the heterozygous mutation were non-obese, and according to WHO classifications all affected subjects have been classified as normal weight with a body mass index (BMI) of <25 (kg/m2), (http://apps.who.int/bmi/index.jsp?introPage=intro_3.html ).

124

The GCKR variant was validated using Sanger sequencing in all family members, but DNA amplification with polymerase chain reaction (PCR) and gel electrophoresis were performed prior to Sanger sequencing. The PCR product size was 500bp and PCR products from all subjects showed the expected band size on gel electrophoresis (see figure 3.4, a.). Then, samples were prepared and submitted to genetics core lab at WCMC-Q to run Sanger sequencing. The Results from Sanger confirmed the existence of the p.R227Q mutation in all the three cases and the control. Also, one case has been found to have the mutation by Sanger sequencing (no WGS data available for this sample). Therefore, a total of four cases and one control were carrying the mutation (see figure 3.4, c.). Figure 3.4 shows the results of PCR, pedigree for the family carrying the mutation and Sanger sequencing results.

125

a.

b.

c.

Figure 3.4: The figures show the results of PCR product on gel electrophoresis of all family 1 (a), pedigree of family 1 with subjects carrying GCKR heterozygous mutation; the star indicates the subjects carrying the mutation (b), and a chromatogram of the mutation showing nucleotide G from the normal allele and nucleotide A from the mutant allele.

126

3.11.2 Potential T2D genes

Two variants within known T2D genes were identified, TGM2 and IGFBP2, and sequenced. The TGM2 variant mutation was detected in two cases of family 2. The identified mutation was a heterozygous form of allele mutation with damaging effect according to SIFT function. This mutation was the first time to be reported in Qatar. Diagnosis and clinical assessment of T2D in the two affected cases have been made by an endocrinologist. Cases with the heterozygous mutation were non-obese and classified as normal weight with a BMI < 25, according to WHO classifications (http://apps.who.int/bmi/index.jsp?introPage=intro_3.html ).

The identified TGM2 variant was present in dbSNP with a SNP id rs200551434 (present in dbSNP with a minor allele frequency of 0.0008). The mutation was validated using Sanger sequencing for all members of family 2; DNA amplification with polymerase chain reaction (PCR) and gel electrophoresis were performed prior to Sanger sequencing. The PCR product size was 400bp and PCR products from all subjects showed the expected band size on gel electrophoresis (see figure 3.5, a.). Then, Sanger sequencing was performed by genetics core lab at WCMC-Q. The Results from Sanger confirmed the existence of the p.R222Q mutation among cases identified through WGS. Figure 3.5 presents an overview of PCR product results, pedigree of family 2 with cases carrying the mutation and Sanger sequencing results.

The IGFBP2 variant was identified in two individuals in family 5. IGFBP2 was a known T2D genes discovered through GWAS. The detected variant was a novel mutation and was not reported previously in Qatari population. The subjects carrying the mutation were diagnosed with early onset T2D, non-obese and had BMI of 24.5 and 23. DNA amplification, gel electrophoresis and Sanger sequencing were conducted on all family 5 members. The p.T220A mutation was confirmed among the individuals that were identified by WGS data. Figure 3.6 presents an overview of PCR product results, pedigree of family 5 with cases carrying the mutation and Sanger sequencing results.

127

a.

b.

c.

Figure 3.5: The figures show results of PCR product on gel electrophoresis of all family 2 (a), pedigree of family 2 with subjects carrying TGM2 heterozygous mutation; the star indicates the subjects carrying the mutation (b), and a chromatogram of the mutation showing nucleotide G from the normal allele and nucleotide A from the mutant allele (c).

128

a.

b.

c.

c.

Figure 3.6: figures show results of PCR product on gel electrophoresis of all family 5 (a), pedigree of family 2 with subjects carrying IGFBP2 heterozygous mutation, the star indicates the subjects carrying the mutation (b), and a chromatogram of the mutation showing nucleotide A from the normal allele and nucleotide G from the mutant allele (c).

129

3.12 Discussion

Studies about type 2 diabetes over the past years have identified few genes that contribute to the development of this disease. Most of the identified genes using different gene mapping approaches have been replicated in several GWAS and Meta-analysis studies (Jafar-Mohammadi and Mccarthy 2008, Jafar-Mohammadi and McCarthy 2008, Prokopenko, McCarthy et al. 2008). Although these studies have been successful in T2D gene discovery, these genes have shown modest effect sizes leaving a substantial hidden heritability (Auer and Lettre 2015).

Different study designs and analytical approaches have been implemented in order to identify rare genetic variants with large effect sizes to explain the hidden heritability of complex genetic phenotypes (Manolio, Collins et al. 2009). These study designs included, the study of population isolates (Helgason, Sigurðardóttir et al. 2000, Helgason, Hickey et al. 2001), sampling of extreme phenotypes (Kryukov, Shpunt et al. 2009, Li, Lewinger et al. 2011) and family studies. It has been proven that family studies with multiple affected members played an important role in the identification of T2D-causing genes. These genes have shown modest effect sizes on a small fraction of familial aggregation when studying common complex traits (Jafar-Mohammadi and McCarthy 2008, Drong, Lindgren et al. 2012)

This chapter presented a full description of eight Qatari families using linkage analysis combined with application of the state-of-the art technology of WGS for the discovery of T2D rare variants. Implementation of these different analytical methods led to the detection of three candidate T2D variants, which were GCKR, TGM2 and IGFBP2.

3.12.1 Linkage Analysis and Runs of Homozygosity

Linkage analysis was used to look for co-segregation of disease and marker loci through analysing a set of markers in families. The analysis was performed with eight native Qatari families in order to identify rare genetic variants that might be of etiological association with T2D. The best genetic model for linkage analysis was difficult to be chosen due to the interaction between genetic and environmental factors leading to T2D. Therefore, non-parametric linkage model was selected because this model generates results without assuming degree of penetrance (Strauch, Fimmers et al. 2000).

Multi-point NPL and VC analysis using (HbA1c) as a trait was implemented, discussed above. The obtained linkage peaks and LOD scores observed with NPL and VC analysis were considered as interesting regions to be explored further. Also, different patterns of linkage results were observed 130

using Illumina panel V markers. Although, Illumina panel V have been designed with high informative set of markers that proved to provide more reliable linkage results (Murray, Oliphant et al. 2004), however the observed linkage results using panel V showed low quality of linkage peaks (see appendix B3). This might be due to the lower informativeness of panel V markers compared with the study custom map extracted from the Illumina 2.5M panel.

As mentioned above, interesting peaks on chromosomes 1, 2, 6 and 13 were observed. Location of these peaks was looked for at published literature to find regions or variants that were associated with T2D and overlap with the results of this study. Interesting LOD peak of 1.3 was observed on chromosome 1 at 1q42.2 cytogenic location, which has been reported in a previous linkage study (Avery, Freedman et al. 2004). In this study, multipoint linkage analysis was performed on 3,153 hypertensive patients presented with T2D from different ethnicities; the results of this study have shown a suggestive LOD score of 1.9 at chromosome 1 located on the same cytogenic location (1q42.2) observed in my analysis (Avery, Freedman et al. 2004). There were 2 genes associated with type 2 diabetes have been reported on this location, which are ACTN2 (actinin, alpha 2) and AGT (angiotensinogen). ACTN2, a cytoskeletal protein with multiple roles in different cell types, has been identified and replicated in GWAS studies. Four SNPs reported and replicated on Mexican American have shown strong association with T2D at P value <0.001 and located in ACTN2 gene (Hayes, Pluzhnikov et al. 2007). AGT, a protein encoded gene act as blood pressure and body fluid regulator, previous studies have reported functional association of AGT with T2D with small effect size (Inoue, Nakajima et al. 1997, Rotimi, Cooper et al. 1997). However, a study was conducted by Conen et. al to assess the role of AGT in the pathogenesis of T2D could not confirm any association between AGT and T2D (Conen, Glynn et al. 2008).

A further promising locus was on chromosome 13 (13q13.1), where vc linkage analysis showed an interesting LOD score of 1.5. The locus observed was reported in association studies performed on East Asian population and proved to be associated with T2D (Shu, Long et al. 2010, Ali, Chopra et al. 2013).

Per-family linkage analyses have shown little differences in LOD scores in chromosomes 1, 2, 6 and 13. LOD scores obtained from per-family analysis were lower in these chromosomes. This could be due to family specific co-segregation.

Despite the study has low power to detect significantly linkage regions for a complex disease, because of limited sample size. I believe that linkage signals obtained through NPL and variance component linkage analysis were more likely to underlie variants for T2D in the Qatari samples and would be used to prioritize regions for the analysis of rare variants using the NGS data. 131

Since linkage regions tended to be large, runs of homozygosity was performed to narrow down the candidate linkage regions to focus on analysing homozygous regions that may harbour potential rare variants. Three homozygous stretches were found to overlap with linkage regions in chromosomes 1, 2 and 6. These candidate regions were located between 235148098 bp and 241878353 bp in chromosome 1, 73368662 bp and 90237441 bp in chromosome 2, and 160560845 bp and 165963574 bp in chromosome 6.

3.12.2 Variant detection through WGS

Following the identification of candidate regions resulting from family-based co-segregation analysis (using linkage analysis and runs of homozygosity), whole genome sequencing was performed. The advances of next generation sequencing technologies and bioinformatics analytical tools allowed the application of WGS to identify rare genetic variants of different complex traits with affordable costs (Kumar, Al-Shafai et al. 2014).

WGS data from a total of eight Qatari T2D pedigrees were analysed using three different analytical strategies as explained above. All affected and healthy individuals in the eight Qatari families were sequenced using Illumina hiSeq2500 platform with a deep coverage of 40x (Kumar, Al-Shafai et al. 2014). The analysis of all family members could allow the identification of shared variants mutations among affected taking into consideration the relation between family members. GATK multi-sample variant calling was applied to call variants from multiple samples in each family.

In this study, high coverage WGS was achieved for the purpose of identifying T2D-causing rare variants. Even though all family members were sequenced and analysed, only three variants were identified and found to be enriched in cases more than controls. Although, the study statistical power was low, initial results were positive and will be discussed further.

According to the strategy of this study, the candidate linkage regions were narrowed down using linkage analysis and homozygosity mapping. Candidate regions including linkage, homozygosity regions and regions overlapping between them were the main focus of the WGS analysis. From the linkage results, one candidate region on chromosome 1q42.2 cytogenic location was found to have a linkage signal with T2D (Avery, Freedman et al. 2004). This region was highlighted first when analysing variants within linkage regions.

132

Then, the analysis of variants within other linkage and candidate regions was prioritised. The analysis was divided into three stages, analysis of variants within candidate regions, variants within known T2D monogenic genes and rare variants from global analysis.

First, from the analysis focused on known T2D monogenic genes one variant was detected within GCKR in family 1. This was a novel GCKR mutation (p.R227Q) and the first to be reported in Qatari and Arab population. GCKR, also known as glucokinase (hexokinase 4) regulator, was a known T2D monogenic gene, and it was a form of maturity onset diabetes in the young (MODY). GCKR was a protein expressed in hepatocytes and bound to glucokinase enzyme during glucose metabolism (Ling, Li et al. 2011). Glucokinase (GCK) played a major role in carbohydrate metabolism and a key glucose phosphorylation enzyme that catalysed the first step of the glycolytic pathway, which included the transformation of glucose to glucose 6-phosphate. Also, GCK was involved in the regulation of glucose metabolism in the hepatocytes and stimulated insulin secretion from pancreatic beta cells (Matschinsky 1996). GCKR regulated the activity of GCK depending on fructose 6-phosphate and fructose 1-phosphate (SCHAFTINGEN 1989, Chu, Fujimoto et al. 2004). Deficiency of GCKR in mice showed a reduction in the level of GCK protein and its activity in liver cells. Also, it displayed an impairment of postprandial glycaemic control (Farrelly, Brown et al. 1999, Grimsby, Coffey et al. 2000). A study published by Slosberg et al. reported a significant improvement of insulin sensitivity and glucose tolerance in mice associated with over-expression of GCKR gene through hepatic cells (Slosberg, Desai et al. 2001).

The identified mutation in family 1 was detected in three cases and one control using WGS data. After running Sanger sequencing for all individuals in family 1, one more case was found to have the mutation as well. All subjects carrying the mutation were heterozygous and showed a Mendelian mode of inheritance as shown in figure 3.4(b). Multiple forms of MODY genes exhibited Mendelian inheritance and were transmitted by a defect in a single gene (Fajans, Bell et al. 2001, Rich, Norris et al. 2008). Since all subjects with T2D were found to have normal BMI and diagnosed with the disease at an average age between 21 and 35, this variant could be a potential variant causing T2D in this family.

From the analysis of candidate regions, rare causal variants within overlapping regions between linkage and homozygous regions were difficult to be identified. Also, analysis of candidate linkage regions was focused first on variants within chromosome 1q42.2 since it was reported previously with T2D. Unfortunately, no potential T2D rare variants were identified in this region. One variant mutation R227Q in SLC22A1 was detected in chromosome 6 located within the linkage signal in chromosome 6. This variant was excluded from further analysis because it was identified in one case

133

and one control, which didn’t meet the study selection criteria of potential T2D causing variants. The discovery of candidate T2D variants within the study candidate regions was difficult. This could be explained by different reasons including sample size, strict filtration criteria and no significant linkage signals detection.

However, two variants were identified within homozygous regions in TGM2 and IGFBP2 genes. These genes were known T2D genes and have been reported previously to have functional role in T2D.

The TGM2 mutation variant had a SNP number rs200551434. It was a damaging heterozygous mutation identified in two cases from family 2. A novel missense mutation (p.R222Q) was identified in TGM2 and the first to be reported in Qatari population. TGM2, also known as transglutaminase 2, was one gene from the transglutaminases family which included nine genes. It was a protein coding gene and calcium-dependant enzyme. TGM2 had multiple functions that catalysis the intracellular G protein signalling through an acyl-transfer reaction. TGM2 knock-out mice showed an impairment of insulin secretion and glucose intolerance, which suggested a key biological role for this gene in pancreatic beta cell (Bernassola, Federici et al. 2002). Also, TGM2 was the only transglutaminase that was highly expressed in the human pancreatic beta cells (Porzio, Massa et al. 2007), which suggested an interesting relation between the reduction of TGM2 activity and glucose metabolism disorders (Vaxillaire and Froguel 2013).

The other variant was identified in a known T2D gene, IGFBP2 that was discovered through GWAS studies (Rajpathak, Gunter et al. 2009). The mutation was a novel missense heterozygous mutation with a damaging effect identified with two cases in family 5. IGFBP2, also known as insulin-like growth factor binding protein 2, was a protein coding gene that altered the developmental rate of insulin growth factor. The insulin-like growth factor (IGF) proteins including IGFBP-1 to 6 were involved in the regulation processes of cell growth, proliferation and cell survival (Rajpathak, Gunter et al. 2009). IGFBP2 decreased the availability of insulin-like growth factor-I (IGF-I). Down regulation of IGFBP2 increased the free IGF-I levels in association with insulin resistance, which was related to the risk of developing T2D.

The biological role for all the above T2D causing genes has been well explained in literature. This suggested that identified variants within these genes might play a role in the pathology of T2D. The etiological relevance of these variants was not known. Therefore, further assessment to confirm the co-segregation of these variant mutations using larger samples from each family carrying the mutation was needed. Also, expression analysis of the identified novel mutation will demonstrate if the identified novel protein coding T2D variants were expressed in human tissues.

134

The 108 Qatari genomes were used as replication cohort and to look for the identified novel T2D variants among unrelated Qatari individuals. None of the above variant mutations were detected in the 108 Qatari genomes. Larger replication cohorts were needed to assess the frequency of the potential T2D variants among Qataris and other population.

3.13 Summary

Potential rare causative T2D variants may exist in the Qatari population, but detailed assessment is required to confirm this existence. The identification of mutations within GCKR, TGM2 and IGFBP2 genes suggested that the development of T2D might be driven by known monogenic genes in some Qatari families and common genes in others. This could explain the pathogenesis of T2D in this inbred population by looking for other novel genes specific to this population. Evaluation of rare variants co-segregating in affected within larger number of Qatari families will facilitate the identification of recessive forms of T2D transmitted in families. Also, it may provide new insights on the genetic determinants of T2D.

The analysis of WGS data in this study enabled a substantial investigation of novel or known variants mutations co-sharing with affected individuals in the Qatari families. This approach allowed the identification of new T2D susceptibility genes.

3.14 Limitations

There were different factors limiting the study:

 The small sample size with the small number of affected individuals in each family. This affected the significance of linkage peaks.  Affected samples in the families might have not been well phenotyped, in which some might have been misdiagnosed with type 1 diabetes T1D. This might be due to obtaining the history of the disease directly from the participants and results of anti-GAD test (A Glutamic Acid Decarboxylase Autoantibodies test that is performed to diagnose T1D) were not available.  Candidate variants might be missed when adjusting for filtration options or might be incorrectly called.

135

3.15 Future work

 Increase the sample size by recruiting more consanguineous families and increase the number of individuals in the previous and new families. Also, include the most distant co-affected relatives in each family to facilitate the discovery of overlapping novel mutation.

 Identification of underlying mechanisms of the potential T2D genes identified in this chapter. Preliminary evaluation of the possible functional effects of these variants can be done by using the publically-available bioinformatics tool such as PolyPhen functional annotation tool. Additionally, experimental-based functional analysis can be performed to study the mechanism of the potential variants by implementing different techniques, such as messenger RNA (mRNA) quantification using microarrays. This can be performed by carrying out real-time PCR (RT-PCR). In this method, sequence-specific DNA nucleotides are labelled with fluorescent. Hybridization of theses DNA nucleotides allows its complementary sequence to measure the mRNA.

136

Chapter 4

Assessment of Copy Number Variants among Qatari Families with Type 2 Diabetes

137

4 Assessments of Copy Number Variants among Qatari Families with Type 2 Diabetes

4.1 Background

CNVs are usually referred to DNA segments of one kilobase or larger (Redon, Ishikawa et al. 2006, Stankiewicz and Lupski 2010, BoonPeng and Yusoff 2013). CNVs gained substantial attention as a key source of genetic diversity (Redon, Ishikawa et al. 2006, Stankiewicz and Lupski 2010). The contribution of CNVs in developing complex diseases have been widely investigated, but further studies are still needed (Kidd, Cooper et al. 2008). There are different approaches that have been used to identify the missing heritability of type 2 diabetes and other complex diseases (BoonPeng and Yusoff 2013). One promising approach was through the investigation of large CNVs with a size of 500kb or larger that were enriched in cases more than controls. Large CNVs are usually rare and highly penetrant (Girirajan, Campbell et al. 2011).

Different technical approaches can be applied to look for CNVs, such as comparative genomic hybridization (aCGH) (de Smith, Walters et al. 2008) and using high density SNP arrays genome-wide (Illumina HumanOmni 2.5M) (Matsunami, Hadley et al. 2013). CNVs can be predicted using several CNV calling algorithms such as, PennCNV, cnvHAP and QuantiSNP (Colella, Yau et al. 2007, Wang, Li et al. 2007, Coin, Asher et al. 2010). Further details about CNVs are described in Chapter 1.

Full descriptions of large CNVs are discussed in this chapter. These CNVs are expected to disrupt known or novel T2D genes that could provide better understanding of the underlying genetic component of T2D in the Qatari population. Findings described in this chapter are the result of analysing T2D families only. All the work presented here is performed by me.

4.2 CNV Prediction Algorithms

CNV detection was performed using one of the commonly used CNV prediction algorithms, PennCNV (Wang, Li et al. 2007, Eleftherohorinou, Andersson-Assarsson et al. 2011). CNV prediction was generated from signal intensity data including log R ratio calculation and B allele frequency normalization (detailed methods are described in Chapter Two, section 2.6).

This chapter describes the analysis strategy and results generated using the PennCNV algorithms in identifying potential T2D causing genes.

138

4.3 CNV Prediction using PennCNV

4.3.1 Data Quality Control

The overall sample-level quality measure has been determined using a 0.24 threshold of LRR variance as recommended by PennCNV manual. All samples passed this threshold were included in the analysis, only one sample was eliminated because it exceeded the selected threshold (0.24). The LRR variance was calculated for each sample; and the LRR variance ranged from 0.12 to 0.3. Figure 4.1 illustrates the LRR variance for each sample. Raw CNV calls have been processed following different filtering steps, as recommended by PennCNV manual. The aim of applying filtering steps was to exclude samples with low quality calls and to ensure the quality of CNVs. The details of sample quality control and filtering strategy were described in Chapter two, section 2.6. The table (table 4.1) below summarizes the different quality measure used to identify the candidate CNVs (an illustrative figure is presented in chapter two, figure 2.1).

Figure 4.1 per-sample LRR variance (details are in chapter two, section 2.6.3)

Table 4.1 Filtering strategy to identify candidate CNVs

Filtering Step Number of CNVs Raw CNVs 2420 Good CNVs 2386 Merged CNVs 2334 Large CNVs (>500kb) 947 Dense CNVs (>20 SNPs) 21 CNVs found in Centromeres 5 Candidate CNVs 16 139

4.3.2 Description of Large and Dense CNVs used in the Analysis

Briefly, PennCNV algorithm was applied to call CNVs from Illumina high-density SNP genotyping data. Different filtering steps were implemented to detect large rare CNVs, in order to identify potential T2D causing genes. In this part, description of large CNVs is provided.

4.3.2.1 Size Distribution of Copy Number

After excluding CNVs from samples with low quality calls, neighbouring CNV calls were merged together to form a large single call. The total number of CNVs detected among cases and controls after merging was 2334. Figure 4.2 and 4.3 shows CNV distribution within cases and controls. Table 4.2 describes mean and median sizes of each CNV class within cases and controls.

Size Distribution of CNV classes in Cases 700

600

500

400 <100kb 300 100kb-500 Numberof CNVs 200 500kb-5MB 100

0 CN=0 CN=1 CN=3 CN=4 Frequency of CNV Classes

Figure4.2. Distribution of CNV sizes among cases. The figure shows the size range and number of CNVs of the different copy number class (CN=0, CN=1, CN=3, CN=4) detected in all cases.

140

Size Distribution of CNV classes in Controls

600

500

400

300 <100kb

100kb-500 Numberof CNVs 200 500kb-5MB

100

0 CN=0 CN=1 CN=3 CN=4 Frequency of CNV Classes

Figure 4.3 Distribution of CNV sizes among controls. The figure shows the size range and number of CNVs of the different copy number class (CN=0, CN=1, CN=3, CN=4) detected in all controls. was plotted

Table 4.2 Mean and Median sizes for each CNV class found in cases and controls

CN=0 CN=1 CN=3 CN=4 Mean Median Mean Median Mean Median Mean Median (bp) (bp) (bp) (bp) (bp) (bp) (bp) (bp) Cases 9,473 2,979 32,155 10,512 62,794 24,103 17,189 16,206 Control 16,940 4,425 48,462 10,738 50,912 25,102 33,271 33,271

141

4.3.2.2 Frequency of Copy Number Variants

After merging the neighbouring CNVs, a total of 2334 CNVs were detected among cases and controls. The frequency of each copy number state was different in cases versus controls. The figure below (figure 4.4) represents the frequency of each copy number class between cases and controls.

Frequency of CNVs 800 700 672 603 580

600 500 401 400 Cases 300 Controls Numberof CNVs 200

100 37 30 9 2 0 CN=0 CN=1 CN=3 CN=4 Type of CNV

Figure4.4. Frequency of CNVs determined in all cases and controls. The figure shows a general description of copy number class distribution and number of CNVs detected in cases and controls. The numbers above each bar refer to the number of CNVs that are found in each class. The number of CNVs found in cases is more than the number of CNVs found in controls.

4.4 Analysis of Large and Dense CNV calls

The contribution of large CNVs to genetic and phenotypic variability has been investigated in previous studies (Itsara, Cooper et al. 2009). Large CNV segments are usually rare, highly penetrant, occur de novo or transmitted within a pedigree (Stankiewicz and Lupski 2002). Previous studies have reported that enrichment of rare CNVs among affected individuals could suggest pathogenicity of these variants (Sebat, Lakshmi et al. 2007, Walsh, McClellan et al. 2008).

142

4.4.1 Merging adjunct CNVs using PennCNV

CNV calling in all subjects, including cases and controls, was performed using PennCNV algorithm (details in chapter two, section 2.6.2). Large CNV calls are usually difficult to be predicted from raw CNV calls; accurate estimations of the total number of large CNVs were difficult because PennCNV split large CNV calls into several smaller CNV calls. Therefore, the clean_cnv.pl script (provided by PennCNV) was used to merge the small neighbouring CNV calls.

To avoid the inclusion of false positive calls, large CNV calls with a minimum number of 20 probes or more were included for further downstream analysis. Also, CNVs within centromeric and telomeric regions were excluded from the analysis. Five CNVs were found within centromeric region and were removed. Table 4.3 shows the CNVs found within centromere.

Table 4.3 describes the five CNVs found to overlap with centromeric region

Chromosom e Start Position (hg19) End Position (hg19) Number of SNPs (bp) Size class CNV Phenotype Genomic Region 5 46174864 49530087 185 3,355,224 deletion Control centromere 5 46370657 49530087 61 3,159,431 deletion Case centromere 11 50476268 51373124 165 896,857 deletion Control centromere 6 58751305 61941992 32 3,190,688 deletion Control centromere 7 62046766 62729224 287 682,459 duplication Case centromere

4.4.2 Identification of Large CNVs

The analysis was focused on the identification of large CNVs with a size of more than 500kb. There were 947 large CNVs (>500kb) detected in the data after merging the small adjacent CNVs. Twenty- one CNVs were large and contained more than 20 probes. Five CNVs found within centromeric region were removed (see above), 16 large CNVs remained for further analysis. All CNVs included in the analysis were found in heterozygous state and no homozygous CNVs were detected. All CNVs were heterozygous duplication. Detailed description of identified large CNVs and the overlapping genes are discussed below. Table 4.4 includes the details of all large dense CNVs.

143

Table 4.4 Details of CNVs large (> 500kb) and dens (>20 SNPs) CNVs identified in T2D Qatari subjects

r of

Chromosome Start position (hg19) End position (hg19) Numbe SNPs (bp) Size class CNV Phenotype 19 782378 2480622 1326 1,698,245 duplication Case

1 838260 1876015 593 1,037,756 duplication Case

16 1079743 1587511 434 507,769 duplication Case

14 105986579 106948749 132 962,171 duplication Control 14 106067375 107029183 136 961,809 duplication Control

14 106350394 107029183 111 678,790 duplication Case

14 106350394 106891524 67 541,131 duplication Case

14 106350394 106891524 67 541,131 duplication Case

14 106350394 106891524 67 541,131 duplication Case

14 106350394 106851960 60 501,567 duplication Case 14 106515973 107056354 125 540,382 duplication Control 6 110883633 111591563 402 707,931 duplication Case 6 110883633 111591563 402 707,931 duplication Case 6 110887169 111591563 397 704,395 duplication Case

9 139084388 140414051 924 1,329,664 duplication Case

8 144790997 145761188 471 970,192 duplication Case

4.5 CNV Calling using PennCNV

As mentioned previously, PennCNV algorithm was used for CNV calling (Wang, Li et al. 2007). In this project, large and rare copy-number variants were detected with the aim of identifying the contribution of large CNVs to type 2 diabetes. High-resolution CNV detection was applicable using whole-genome SNP genotyping arrays from eight Qatari families. Copy number variants were called using log R ratio (LRR) and B allele frequency (BAF); LRR and BAF were calculated from the signal intensity data generated using Illumina Infinium platform (details in Chapter Two, section 2.6.1).

144

4.6 Assessment of candidate CNVs

Assessment of candidate CNVs was performed at different stages aiming to identify biologically relevant T2D genes. Figure 4.5 describes the different analyzing steps of candidate copy-number variants. Following the filtering criteria (as described in Chapter Two, section 2.6); CNVs were divided into two categories, compared with different dataset and examined visually to ensure the quality of the candidate CNVs. Then, candidate CNVs were prioritized based on the function of the overlapping genes.

Figure 4.5 Flow chart of Candidate CNV assessment. The flow chart describes the steps through which 21 candidate CNVs were examined to prioritize the most promising CNVs after excluding five CNVs found to overlap with centromere region.

145

4.6.1 CNVs shared between samples and overlap with genes

There were 11 CNVs found to be shared between two samples and more. All observed CNVs found to overlap with genes. CNVs were prioritized and considered as candidate CNVs based on the biological relevance of the overlapping genes. All 11 CNVs were heterozygous duplication and the size range of these CNVs was between 500kb to more than 900kb.

Homozygous CNVs are usually rare and since the study is of an inbred population, it is expected to find homozygous CNVs in the study samples. In this data, no homozygous CNVs are detected.

Table 4.5 describes the details of identified CNVs.

Table4.5. Details of CNVs detected in two or more samples

Subject Phenotype Chromosome Number Start End Size (bp) CNV class ID of SNPs Position Position (hg19) (hg19) 2004 Control 14 136 106067375 107029183 961,809 duplication 7006 Case 14 111 106350394 107029183 678,790 duplication 5003 Case 14 67 106350394 106891524 541,131 duplication 7005 Case 14 67 106350394 106891524 541,131 duplication 5004 Case 14 67 106350394 106891524 541,131 duplication 9004 Case 14 60 106350394 106851960 501,567 duplication 7001 Control 14 125 106515973 107056354 540,382 duplication 3003 Case 6 402 110883633 111591563 707,931 duplication 3001 Case 6 402 110883633 111591563 707,931 duplication 3002 Case 6 397 110887169 111591563 704,395 duplication

146

4.6.2 CNVs in one sample and overlap with genes

Five CNVs were observed in single samples. These five CNVs found to overlap with genes. Following the selection criteria of candidate CNVs, CNVs were prioritized based on the functional role of the overlapping genes. All five CNVs are heterozygous duplicated; the size range of these CNVs was between 500kb to more than 1Mb. In this data, no homozygous CNVs were detected and all identified CNVs were heterozygous.

Table 4.6 describes the details of identified CNVs.

Table 4.6 Details of CNVs detected in one sample

Subject Phenotype Chromosome Number Start End Size (bp) del/dup

ID of SNPs Position Position (hg19) (hg19) 3003 case 9 924 139084388 140414051 1,329,664 duplication 3003 case 8 471 144790997 145761188 970,192 duplication 3003 case 19 1326 782378 2480622 1,698,245 duplication 3002 case 1 593 838260 1876015 1,037,756 duplication 3001 case 16 434 1079743 1587511 507,769 duplication

4.6.3 Visualization of candidate CNVs

Shared and single CNVs were examined visually to check if CNV calls were reliable or not, and to remove CNVs that were incorrectly merged. Images of CNV calls were generated automatically using the visualize_cnv.pl program provided in PennCNV software. Visual examination of candidate CNVs was based on plotting signal intensities of all subjects. A total of 16 large CNVs were visualized and no CNVs were excluded. The 16 candidate CNVs were heterozygous duplications. The figures below (figure 4.6, 4.7, 4.8) show examples of the different CNV classes observed in the data.

147

Figure 4.6 shows an example of CNV call; the plot is generated using the signal intensities (LRR/BAF). The example given here is for CN=0, it is observed that LRR is dropped to a very low level and BAF is randomly distributed between 0 and 1. The CNV region is marked by the two grey vertical lines.

148

Figure 4.7 shows an example of CNV call; the plot is generated using the signal intensities (LRR/BAF). The example given here is for CN=1, it is observed that LRR is dropped to around -0.5 and BAF is clustered at 0 and 1. The CNV region is marked by the two grey vertical lines.

149

Figure 4.8 shows an example of CNV call; the plot is generated using the signal intensities (LRR/BAF). The example given here is for CN=3, it is observed that LRR is around 0 and BAF is clustered between 0.4 and 0.8. The CNV region is marked by the two grey vertical lines.

150

4.6.4 Candidate CNVs overlapping with CNVs in other dataset

Here, the list of seventeen candidate CNVs were examined in other dataset to find any overlap between this dataset and the other. A list of copy number variants from 108 Qatari genomes was available. The 108 Qatari genomes included phenotypic information about 55 subjects with T2D (including 26 female and 29 male) and 53 controls (including 28 female and 25 male). There were six CNVs from the study dataset overlap with CNVs from the 108 genome. Table 4.7 shows the details of the study candidate CNVs that overlap with CNVs from 108 dataset.

Table 4.7 Summary of CNVs shared between samples and overlapping CNVs in other dataset

Chromosome trait Number Start End Size (bp) del/ 108 cases of SNPs Position Position dup and controls 9 case 924 139084388 140414051 1,329,664 dup case 1 case 593 838260 1876015 1,037,756 dup 1 case 1control 14 Control 136 106067375 107029183 961,809 dup control 14 Case 67 106350394 106891524 541,131 dup control 14 Case 67 106350394 106891524 541,131 dup control 14 Case 67 106350394 106891524 541,131 dup control

4.6.5 Identification of potential T2D genes within candidate CNVs

Genes overlapping with CNVs identified above were examined for their functional role in association with T2D development. Candidate CNVs were examined to identify potential T2D candidate genes. Here, all genes overlapping with candidate CNVs were examined to identify their biological role. Then, genes with potential role in developing T2D will be prioritized for further investigation.

4.6.5.1 CNVs shared between samples and overlap with genes

The function of all overlapping genes within the study candidate CNVs were examined and checked using GeneCards (http://www.genecards.org/). Also, the biological relevance of the detected genes with T2D was searched through literature. The biological role for most of the identified genes has not been well studied yet, except for KIAA0125 that showed functional association with cancer (Long Noncoding RNA KIAA0125 Potentiates Cell Migration and Invasion in Gallbladder Cancer).

151

Also, a CNV shared between three samples found to overlap with AMD1. A study has shown an association of a variant within AMD1 with obesity and plasma level of leptin, and adult metabolic disorders, for example type 2 diabetes. There is no direct evidence reported in the study with adult type 2 diabetes and further investigation is required to confirm such possibility (Tabassum, Jaiswal et al. 2012). No potential T2D candidate genes were identified. Summary of identified CNVs and overlapping genes are described in (table 4.8).

Table 4.8 Summary of CNVs shared between samples and overlapping with genes

Number Chr Start End SNPs Size CNV class Genes of Position Position (average range Samples (hg19) (hg19) number) (bp) 7 14 106350394 107029183 90.4 501,567- duplication KIAA0125, 961,809 LINC00221 LINC00226 3 6 110883633 111591563 400 704,395- duplication AMD1, 707,931 CDK19, GSTM2P1, KIAA1919 GTF3C6, RPF2

4.6.5.2 CNVs in single samples and overlap with genes

Genes overlapping with five CNVs were examined using GeneCards (http://www.genecards.org/) to look for their functional association with the study phenotype. Also, the biological relevance of all detected genes with T2D was searched through literature. The biological role for of the identified genes has not shown direct or indirect associations with T2D development. Among the genes detected, few genes were related to different types of cancer, such as DAZAP1 which plays an important a key role in gene splicing expressed in cancer cell lines and inhibits cancer cell divisions (Choudhury, Roy et al. 2014). The identified genes were not prioritized as candidate T2D causing genes and downstream analysis of these genes was not required. Table 4.9 describes a summary of CNVs found within single samples and overlap with genes.

152

Table 4.9 Summary of CNVs found in single samples and overlapping with genes

Number Chr Start End number Size (bp) del/ Genes of of SNPs dup Samples 1 9 139084388 140414051 924 1,329,664 dup AS1, DKFZP434A062, ENTPD2,EXD3,FA M166A,FAM69B SSNA1, SAPCD2, RNF208 1 8 144790997 145761188 471 970,192 dup TONSL, MROH1, CPSF1, OPLAH, LRRC24,KIAA187 5,LRRC24, SPATC1, FAM203A, TPRN 1 19 782378 2480622 1326 1,698,245 dup ABHD17A,ADAT3 ,ARID3A, AP3D1,ATP5D,AZ U1,CNN2, DAZAP1,EFNA2, GAMT, IZUMO4, JSRP1,LINGO3, PLEKHJ1, UQCR11 1 1 838260 1876015 593 1,037,756 dup ACAP3,ANKRD65, ATAD3A,ATAD3B, ATAD3C,B3GALT 6,CCNL2,CPSF3L, DVL1 ,GLTPD1,HES4,KL HL17,LOC14841, MIR200A,PUSL1, SAMD11 1 16 1079743 1587511 434 507,769 dup BAIAP3, C1QTNF8, CCDC154, GNPTG, PTX4, TELO2, UNKL

153

4.7 Discussion

Recently, copy number variants have been studied to identify the genetic variation of complex disease in order to gain knowledge about disease heritability (Freeman, Perry et al. 2006, Redon, Ishikawa et al. 2006). Different studies reported the contribution of CNVs to the risk of disease development, such as autism, schizophrenia, Parkinson disease, obesity, osteoporosis and rheumatoid arthritis (Yang, Chen et al. 2008, Pinto, Pagnamenta et al. 2010, Bronstad, Wolff et al. 2011, Pankratz, Dumitriu et al. 2011, Chew, Dastani et al. 2012, Kirov, Pocklington et al. 2012). Successful studies of CNV associations with T2D were relatively small because genetic markers were rarely used for the investigation of T2D-related CNVs. However, few studies confirmed CNV associations with T2D in CAPN10 Indel19 (Plengvidhya, Chanprasert et al. 2012). Also, CNV associations within chromosome 4p16.3 associated with early onset T2D among Japanese patients was reported (Ye, Niu et al. 2010, Kudo, Emi et al. 2011). Another association was reported in leptin receptor gene (LEPR) (Jeon, Shim et al. 2010). LEPR is a known obesity gene that is involved in the regulation of satiety and energy expenditure. Leptin receptors are mainly found in the hypothalamus and in other tissues, such as pancreatic beta-cells. It is considered as candidate T2D gene because of its role in the inhibition of insulin secretion.

In this PhD project, Illumina HumanOmni2.5-8 BeadChip including more than 2.5 million probes was used for CNV prediction. Small and large CNVs were detected using PennCNV prediction algorithm (Wang, Li et al. 2007). A large number of raw CNV calls (n=2420) was generated. Since the study was focused on the analysis of large rare CNVs, small adjacent raw CNVs were merged to have large CNVs of more than 500kb. The results of the study showed different size distribution for each copy number type (see above). Also, the overall number of large CNVs with a size of >500kb were observed in cases more than controls. All detected variants were analysed further to identify candidate CNVs.

There were sixteen candidate CNVs detected in this dataset after implementing several filtration steps. The use of array-based technologies for copy number detections, for example Illumina 2.5M arrays platform enabled the identification of copy number variants at higher resolution (Wang, Li et al. 2007). The application of different filtration parameters was necessary to ensure the quality of CNV calls because PennCNV algorithm was likely to generate false-positive or false-negative CNV calls (Sebat, Lakshmi et al. 2004). CNVs included in the analysis were large (>500 kb) and included a minimum of 20 probes, this ruled out the inclusion of false-positive calls in downstream analysis. The observed candidate CNVs were found to be heterozygous duplications and no deletions were identified. Also the detected CNVs were assessed carefully to identify potential T2D causing genes

154

within the candidate CNV regions. All genes included within the candidate CNV regions in this dataset were examined to identify their biological role in relation to T2D. Unfortunately, evidence of diabetes relevance of the identified genes was not reported. This could be due to the small number of study subject. Therefore, it was necessary to use large-scale samples in order to identify significant CNV associations with T2D CNVs (Bae, Cheong et al. 2011). Also, it has been suggested that large scale family-based CNV calling could result in better understanding of T2D genes overlapping with detected CNVs (Bae, Cheong et al. 2011). In this study, the generated CNVs were individual-based CNV calls and not family-based CNV calls, which were expected to provide more accurate results. It was difficult to collect data from parents due to different reasons including cultural background and the age of the parents. In addition, PennCNV algorithm had a relatively small bias in calling large CNVs. This could result in generating CNV calling errors including either false-positive or false- negative errors. Therefore, it is suggested to use different CNV prediction algorithms in order to identify more accurate candidate CNVs followed by estimating the concordance/discordance between the different methods (Eckel-Passow, Atkinson et al. 2011).

Multiple studies have shown the contribution of CNVs to different complex human disease such as schizophrenia and autism. However, the number of studies reporting CNV associations with T2D was relatively small and more investigations will be still required (Craddock, Hurles et al. 2010). This could also partially explain the difficulty in the identification of potential T2D causing genes in this study.

155

4.8 Summary

CNVs may contribute to the risk of several complex human disorders and influence the development of multiple phenotypes such as, neuropsychiatric disorders, obesity and diabetes. In this study, large (>500kb) rare CNVs are detected, which are expected to characterize the risk of T2D. Unfortunately, the identified CNVs didn’t overlap with known T2D genes or potential T2D genes that were of biological importance. It is necessary to use large-scale samples or family-based algorithm for CNV calling in order to identify more reliable CNVs and to gain better understanding of the detected CNV role. Further investigation is required to demonstrate the disease susceptibility by increasing the sample size and using other CNV calling algorithm, such as cnvHap (Coin, Asher et al. 2010)

4.9 Limitations

There are some limitations in this study, these include:

 Small sample size. PennCNV algorithm usually generates less false-positive calls when using larger sample size.  CNV calls are generated on individual-basis ignoring the family structure. PennCNV provides the option to include the family structure to generate accurate CNV calls, but this option is available to carry out trio-based and quartet based CNV prediction only.  As the study dataset included extended families with missing data of one or both parents in most of the families, trio-based CNV calling was not applicable.

156

4.10 Future work

4.10.1 Samples

More families will be recruited with the focus on collecting data from parents in order to generate better CNV calls (as discussed above).

4.10.2 Trio-based CNV calling using PennCNV

PennCNV offers the possibility of CNV calling considering the family structure to generate more reliable CNV calls and correlates CNV information from related individuals who are more likely to share the same CNV region. PennCNV family-based CNV calling is designed to generate CNV calls from trios (father, mother, and offspring) only. After collecting more families with their parent, the extended families will be divided into trios and quartet for CNV calling. Then CNV calls will be united together into one combined calls.

4.10.2 CNV calling using another prediction algorithm

Another CNV prediction algorithm will be implemented to evaluate the agreement of candidate rare CNVs that are detected using different algorithms. Genome-wide association of copy number variants will be implemented using family-based calling algorithm, such as famCNV algorithm (Eleftherohorinou, Andersson-Assarsson et al. 2011). Family-based association test of copy-number variants will be performed aiming to identifying the CNV associations with T2D risk. An integrative population-based CNV algorithm will be used, such as cnvHap. This algorithm is established by Lachlan Coin form the Department of Genomics of Common Disease at Imperial College London. CnvHap algorithm predicts CNV calls from the signal intensity data provided with the genotyping arrays at each probe position. This approach is based on a hidden Markov model to construct a continuous model of CNVs among the genome using the signal intensity data at each marker. Also, it incorporates data from various samples and platforms in order to identify genuine CNVs. The combination of data generated from different algorithms will enhance the sensitivity of detected CNVs (Coin, Asher et al. 2010).

157

Chapter 5

Epigenetic Associations of Type 2 diabetes among Qatari Families

158

5 Epigenetic Associations of Type 2 Diabetes among Qatari Families

5.1 Background

Type 2 diabetes (T2D) is a common complex disorder with rapidly increasing figures worldwide, mainly within urbanising countries such as Qatar. The increased prevalence of T2D due to the rapid demographic transition of populations occurred during a short period of time and observed among two to three generations. The demographic transition partially explained the risk of T2D as a result of environmental changes, which could be intervened by epigenetic modifications. Different epigenetic changes including DNA methylation and histone modification are involved in chromatin structure that regulated gene expression (Riggs 1975, Jaenisch and Bird 2003). These epigenetic events contributed to the associations between environmental changes and disease phenotypes. Also, these epigenetic changes could partially explain the hidden heritability of complex disorders (Manolio, Collins et al. 2009). Recently, investigating the role of DNA methylation in disease risk has been improved due to the available advances of array technologies combined with bisulfide treatment (Thompson, Fazzari et al. 2010).

A small number of epigenome-wide association studies (EWASs) has been reported with T2D up to date (Barrès, Osler et al. 2009, Toperoff, Aran et al. 2012, Volkmar, Dedeurwaerder et al. 2012, Dayeh, Volkov et al. 2014, Chambers, Loh et al. 2015); most of these EWASs have been conducted on European populations. Translation of findings from these studies to populations from different ethnicities and genetic backgrounds is not clear yet.

This chapter describes results of an epigenome-wide DNA methylation association with T2D among Qataris. The analysis included 123 subjects from sixteen native Qatari families. Since the statistical power of this study is limited, a full EWAS study was not applicable. Therefore, the primary goal was to replicate known CpG loci in Qatari population using data from the TwinUK that had different environmental and genetic backgrounds.

The analysis described here is a collaborative work with my colleague Mashael Al Shafai to identify differentially methylated loci associated with T2D and BMI. Results of methylation associations with BMI are presented in Mashael Al Shafai’s thesis, which is focused on studying sources of genetic variations of obesity among Qatari families.

This chapter is focused on the results obtained from the association of differentially methylated loci with T2D only. The analyses presented here are performed by me, Mashael Al Shafai and my colleague Dr Alessia Visconti, who ran the meta-analysis (more details are described later)

159

5.2 Methylation Analysis Pipeline

Whole genome methylation array was performed using Illumina Infinium HumanMethylation450 BeadChip (450K) array. The Lumi: quantile normalization and Beta Mixture Quantile dilation (BMIQ) pipeline was selected to reduce the technical variability of Illumina assay and to correct for the probe type bias as recommended by Marabita et al (Marabita, Almgren et al. 2013). Different pre- processing steps were implemented using GenomeStudio before running the pipeline. The pipeline included different normalization steps and all the details are described in Chapter Two, section 2.9.2.

5.3 Qatari and TwinsUK Cohort

Qatari subjects and methylation data generated for this analysis are described in details in Chapter Two, section 2.2. The TwinsUK cohort was first introduced in 1992 including both monozygotic (MZ) and dizygotic (DZ) twins (Moayyeri et al. 2013). Subjects included in the TwinsUK cohort were healthy Caucasians; the majority of them were female with an age range between 16 and 98 years old. There were more than 13,000 twin participants included in the twin cohort recruited from different regions across the United Kingdom. DNA methylation measures were available for 810 individuals that were included in the analysis; all of them were female Caucasians and details about diabetes phenotypes and BMI were available. Of the 810 subjects, 32 were previously identified with T2D and the average BMI was 27.8 kg/m2. DNA methylation measurements were generated using the Infinium HumanMethylation450 BeadChip (Illumina Inc, San Diego, CA). The experimental approaches and the technical biases corrections that were applied to the TwinsUK cohort were similar to the approaches implemented on the Qatari dataset; the details are described in Chapter Two, section 2.9.2. A number of DNA methylation probes were excluded because they were incorrectly mapped or mapped to multiple locations in the reference sequence. Additional probes were also excluded due to missing p-values having a detection p-value of more than 0.05. Adjustments of blood cell type coefficients from the methylation data were performed using the method described by Houseman et al (Houseman, Accomando et al. 2012)

160

5.4 Estimation of Differentially Methylated Loci

After quality control a total number of 468,472 probes included in the analysis to estimate the differentially methylated loci associated with T2D. The association analysis of T2D and DNA methylation levels was evaluated using a linear mixed model within a variance component framework (details are described in Chapter Two, section 2.10). The association model included age, gender, and the six blood cell type coefficients (including B cells, granulocytes, monocytes, Natural Killer cells and T cells). In addition, BMI was included in the association test as a confounder.

The IlluminaHumanMethylation450k.db annotation package in R was used to annotate all DNA differentially methylated sites. As mentioned earlier, the methylation analysis described here was focused on replicating known differentially methylated loci associated with T2D. As the statistical power was limited due to the small sample size, a preliminary epigenome-wide association study was performed to avoid any possible inflation of the association statistics that could result from running the analysis on selected CpG sites. Lambda inflation coefficient was calculated and quantile- quantile (QQ) plot was used to plot the P-values, to ensure that the data was not inflated. A negligible inflation of p-value statistics has been shown when plotting the Q-Q plot (the genomic inflation factor was 1.10); this could be possibly due to the existence of hidden confounders. Figure 5.1 shows a QQ plot with an insignificant inflation.

161

Figure 5.1: The figure above shows the QQ-plot for T2D association results of each differentially methylated site. The genomic inflation f a c t o r = 1.10. The replicated CpG site cg19693031 in TXNIP had a detection p-value of 5.3x10⁻⁶. From the association results and as shown in the figure, one CpG site cg06721411 within DQX1 has reached genome-wide significant with a detection p-value of 2.04x10-10. No biological role has been reported for this gene in relation to T2D.

162

5.5 CpG Sites Selected for Replication

Two large EWASs for T2D were available (Chambers, Loh et al. 2015, Kulkarni, Kos et al. 2015) at the time this methylation analysis was conducted. The main focus of the analysis presented here was to replicate the most significantly associated CpG sites with T2D for all genes reported in these EWASs. There were a total of eight CpG sites that had reached genome-wide significance associated with T2D; seven CpG sites (PROC, C7orf29, SRABF1, PHOSPHO1, SOCS3) were reported by Chambers et al (Chambers, Loh et al. 2015), and one (CPT1A) was reported by Kos et al (Kulkarni, Kos et al. 2015), and two (TNXIP1 and ABCG1) were reported in both studies. Correction for multiple testing was implemented using the Bonferroni method, considering a genome-wide significance level of the replicated associations when their p-values were lower than 6.25x10-3 (0.05/8). For the EWAS, false discovery rate (FDR) using Story q-values (Storey 2002) was conducted to correct for multiple testing and to control the expected type I and type II error rates (Benjamini and Hochberg 1995)

Here, I attempted to replicate each of the reported genes that had the most significant association with T2D.

Table 5.1 contains the previously reported CpG sites that are considered in the replication analysis.

5.6 Qatari and TwinsUK Meta-analysis

Meta-analysis was conducted combining results obtained from Qatari samples with results from the TwinsUK. The aim of running the meta-analysis was to replicate findings of this study and to determine sources of agreement between both datasets. The meta-analysis test was performed using the publically available software namely GWAMA (Genome-Wide Association Meta-Analysis) (Magi and Morris 2010). The meta-analysis test was carried out using a fixed-effect model with reverse variance to combine the regression coefficients and their standard errors from both Qatari and TwinsUK methylation data. Heterogeneity was estimated using Cochran’s Q test by calculating the amount of inconsistency between the Qatari and the TwinsUK, which is explained by heterogeneity (I2 estimates) (Higgins et al. 2003), both are implemented in GWAMA software. Heterogeneity detected in the methylation probe was also combined in the meta-analysis using a random effect model.

This part of the analysis was performed in UK by my colleague Dr Alessia Visconti.

163

Table 5.1: Known CpG site reported in previous EWASs among Caucasians and their associations with T2D in the Qatari family study.

Probe Chromosome Position Gene Gene Name P value Beta References (hg19) Symbol cg19693031 1 145441552 TXNIP Thioredoxin interacting protein 5.3x10-6 -2.55 (Chambers, Loh et al. 2015) cg06500161 21 43656587 ABCG1 ATP-bindng cassette, sub-family G (WHITE), 0.087 2.56 (Kulkarni, Kos et member 1 al. 2015) cg00574958 11 68607622 CPT1A Carnitine Palmitoyltransferase 1A 0.040 -4.19 (Kulkarni, Kos et al. 2015) cg11024682 17 17730094 SREBF1 Sterol regulatory element binding transcription 0.060 2.42 factor 1 (Chambers, Loh cg09152259 2 128156114 PROC Protein C (inactivator of coagulation factors Va 0.088 -1.19 et al. 2015) and VIIIa) cg02650017 17 47301614 PHOSPHO1 Phosphatase, orphan 1 0.231 -3.15 cg04999691 7 150027050 C7orf29 Chromosome 7 Open Reading Frame 29 0.313 1.41 cg18181703 17 76354621 SOCS3 Suppressor of cytokine signaling 3 0.550 -0.54

164

5.7 T2D-related Differential Methylation

5.7.1 Replication of T2D associated sites in the Qatari Families

A total of 123 samples were used to carry out the differential methylation analysis for T2D. After quality control, about 468,472 CpG sites were included in the epigenome-wide analysis. A total number of 249 and 31894 differentially methylated CpG sites associated with T2D were detected at 1% and 5% false discovery rate, respectively.

The association between methylation and T2D was replicated in the Qatari family dataset at one

CpG site (cg19693031) in TXNIP locus with a detection p-value of 5.3x10-6 (Table 5.1). The methylation values for cg19693031 were plotted using boxplot (Figure 5.2), and histogram (Figure 5.3) to check for the distribution of these values. The mean methylation beta-value of cg19693031 in TXNIP locus among subjects with diabetes (6.01%) was slightly lower compared to subjects with no diabetes. Interestingly, four of the non-replicated CpG associations with T2D still presented nominal or indicative levels of significance. One association at cg00574958 within CPT1A had a significance level of p-value <0.05, and three associations at cg11024682 (SREBF1), cg06500161 (ABCG1), and cg09152259 (PROC) had a significance level of p-value<0.1. The directions of these non-replicated associations were consistent with previous EWASs.

As mentioned earlier in this chapter that the study had a limited power to carry out a full EWAS due to the small sample size of the Qatari families. The most significant association identified by the

EWAS was obtained at cg06721411 in DQX1 locus with a detection p-value of 2.04×10-10. Looking into literature this gene has not been reported with T2D. Since the DQX1 locus was the only site reached genome-wide significance, it was replicated in the TwinsUK (p-value 9x10⁻ᶟ).

5.7.2 Replication of T2D association in the TwinsUK

The replicated association at cg19693031 in the Qatari families was tested for its association in the TwinsUK cohort, and a meta-analysis was carried out to combine the effects of these associations. The results of meta-analysis showed a substantial existence of heterogeneity between the two studies for the association result at cg19693031 in TXNIP locus (I²=93.7%; Cochran’s heterogeneity statistic's p-value=6.86x10¯⁵). Distributions of methylation beta-value at cg19693031 probe in the two datasets indicated that the modifications in the levels of background methylation was not the source of the heterogeneity; Wilcoxon test was performed and showed non-significant results (p- value= 0.79). The meta-analysis results from the fixed effect showed a combined p-value of

165

3.03x103, details are shown in table 5.2. As TXNIP indicated heterogeneity between the populations of Qatar and UK, another meta-analysis was carried out using a random effect model. The random effect meta-analysis showed a non-significant combined p-value (p-value=0.19). This was due to the strong heterogeneity between the two different populations.

Figure5.2. Boxplot of methylation beta-values at cg19693031 in TXNIP locus against the diabetes state. The medians of the data are represented by the middle lines while the boxes represent the 25 to 75 percentiles. The whiskers extend to include 99% of the data; the circles represent the outliers. Beta -value distributions at cg19693031 in the diabetic and non-diabetic subjects showed a difference in the background of methylation level (Wilcoxon test p-value=2.14x10-3).

166

A

B

Figure 5.3 The above figures show the distribution of methylation values for the replicated probe cg19693031 in TXNIP locus. A) The histogram shows the distribution of methylation b-values for cg19693031. The distribution is skewed left which could be due to the differences in the b-value distribution between cases and controls. B) The plot shows the distribution of beta-values at cg19693031 for the Qatari (red) and TwinsUK (blue) cohorts. The distribution of the methylation values is in the same direction between the two populations.

167

Table 5.2: Meta-analysis of the CpG site (cg19693031) replicated in the Qatari families with the results obtained from the TwinsUK results. The results presented in the table below were obtained using a fixed-effect model with reverse variance to combine the regression coefficients and their standard errors from both dataset. The table includes results of p-values, effect sizes (Beta) and their standard errors (se) obtained from Qatari, TwinsUK and meta-analysis. The meta-analysis results also includes upper/lower 95% CI for beta (Beta 95U /95L), and heterogeneity estimates (I²). The CpG locus cg19693031 showed heterogeneity and was combined in a meta-analysis using a random effect model, and heterogeneity between the Qatari and Caucasian populations resulted in combined non-significant p-value (p-value=0.19).

Qatari Cohort TwinsUK Cohort Meta-analysis

Beta se p-value Beta se p-value Beta se p-value I2 (95U/95L)

-2.55 0.53 5.28×10-6 -0.39 0.13 1.72×10-3 -0.51 0.12 3.03×10-5 93.7 (-0.77/-0.27) %

168

5.8 Discussion

The genetic and epigenetic basis of T2D among Arab population has not been established yet; the rise in T2D prevalence among Qatari population provoked the introduction of such studies in this region. To the best of my knowledge, this association analysis between methylation levels and T2D was the first to be conducted among Arab population. Given the limited statistical power of this analysis, a full EWAS was difficult to be implemented. Therefore, previously known CpG sites related to T2D were attempted for replication; these CpG sites included (TNXIP1, PROC, C7orf29, SRABF1, PHOSPHO1, SOCS3, ABCG1 and CPT1A). One association was replicated in TXNIP, and four non-replicated associations at (CPT1A, SREBF1, ABCG1, and PROC) have shown nominal or indicative significance. Although this chapter was mainly focused on replicating T2D methylation associations, three differentially methylated CpG sites were replicated with BMI. These CpG sites were replicated in ABCG1, CPT1A and SOCS3 (details of these replicated sites are presented in my colleague Mashael Al Sahfi PhD thesis).

Mechanisms linking DNA methylation of TXNIP with T2D are not fully understood yet. Though, this gene has been functionally related to diabetes metabolic phenotypes. Interestingly, two of the non- replicated associations at (CPT1A and ABCG1) have been also linked to metabolic phenotypes.

TXNIP is a protein coding gene that acts as metabolism regulator and antioxidant thioredoxins inhibitor. Also, this genes acts as a pro-apoptotic beta-cell factor, and knockout mice of beta-cell specific TXNIP have shown decreased beta-cell apoptosis of about 50-fold (Parikh, Carlsson et al. 2007, Chen, Hui et al. 2008). A study has been published lately described the involvement of TXNIP in glucose regulation through the control of insulin sensitivity in human peripheral body. In addition, the expression level of TXNIP is increased in the skeletal muscles of patients with T2D (Parikh, Carlsson et al. 2007). The results presented in this chapter were concordant, and methylation level of TXNIP was low among individuals with T2D. Therefore, suggesting elevated expression profiles of TXNIP among these participants.

Remarkably, the replicated association in TXNIP in addition to two of non-replicated associations in ABCG1 and CPT1A have been strongly linked with alpha-hydroxybutyrate (AHB) in a recent EWAS by (Petersen, Zeilinger et al. 2014) between methylation and serum metabolic traits (cg19693031 in TXNIP, P=7.2x10-8; cg06500161 in ABCG1, P=7.8x10-6; cg00574958 in CPT1A, P=1.3x10-10). AHB is pre- diabetic biomarker that is produced from the ketone metabolism; high levels of AHB indicate potential insulin resistance (Gall et al. 2010). The concentration of AHB biomarker is measured in QuantoseTM test used for the assessment of pre-diabetes state (Tripathy et al. 2015).

169

T2D-specific biomarkers and pre-diabetes have been reported previously in association with methylation changes in TXNIP, ABCG1 and CPT1A; these biomarkers included glycine (Haeusler, Astiarraga et al. 2013), 3-methyl-2-oxovalerate (Menni, Fauman et al. 2013), various lipid traits, such as phosphatidylcholines (PCs) (Suhre, Meisinger et al. 2010), chylomicrons, VLDL and IDL cholesterol elements. Diabetes effect direction of all these biomarkers was coherent with the effect direction detected in this analysis (methylation measures were low with T2D).

The association direction of these metabolic biomarkers was in agreement with the associations at TXNIP and CPT1A being in the same direction (low level of methylation values are associated with T2D) and the association at ABCG1 was in the opposite direction. These remarks supported the claim that reduced level of methylation values at CPT1A and TXNIP and elevated methylation levels at ABCG1 related to diabetes-specific metabolic phenotypes, which were reflected by the link of these associations with the clinical phenotypes.

The association replicated at TXNIP in this analysis was also confirmed in the TwinsUK. Results of meta-analysis showed robust heterogeneity of the effect sizes for TXNIP (I2=93.7%), though association at TXNIP showed concordant directions, and background methylation level was comparable between the two populations. Heterogeneity of the effects at TXNIP CpG locus between the two studies suggested that the underlying mechanisms could be influenced by variations of the genetic background and environmental factors.

There was a number of limiting factors existed in this analysis that I was aware of. First of all, a subset of methylation CpG sites was targeted by using Illumina Infinium HumanMethylation450K arrays. Illumina array-based technology could result in technical artifacts due to the presence of genetic SNP variants at the probe binding site. The best way to address this issue was by excluding these probes based on the SNP annotations provided by Illumina annotation manifest. Though, the annotated data were based on tagging SNP technologies that could not provide full information, particularly among these under presented Arab populations. The advantage of this analysis was the ability to exclude any confounding effects induced by such genetic variants due to the availability of deep coverage WGS data from all the study subjects. Second, DNA methylation profiling was performed using DNA extracted from whole blood as it was the only accessible tissue sample and could not be representative of the most relevant tissues to the disease phenotype, for example pancreatic cells. Also, the estimated methylation measures could be biased because blood cells contain different cell types (including B cells, Natural Killer cells, monocytes, granulocytes and T cells subset). This problem was corrected by estimating the blood cell type coefficients using a known algorithm described by Houseman et al (Houseman, Accomando et al. 2012), more details about this

170

method were described in Chapter Two, section 2.9.3.. Therefore, the association results presented here were strong against the presence of such artefacts.

Since the epigenome-wide technologies are still in a primary stage, a small number of EWASs has been published with T2D. Nevertheless, the number of epigenetic studies is projected to accelerate given the available epigenome-wide technology alongside the required computational tools. The novelty of the analysis discussed here is the study of methylation changes with T2D among this new isolated population.

171

5.9 Summary

Genetic and Epigenetic studies in Qatar are still in their infancy. Though, there is an evidence for the contribution of epigenetic changes to the development of T2D among the Qatari population. The replicated association in TXNIP confirmed its importance in T2D risk regardless of population ethnicity. Although, the additional associations within ABCG1 and CPT1A were not replicated with T2D, these two hits are supported by their metabolic phenotype on diabetes.

A full EWAS with large number of individuals will enhance the discovery of more CpG associations with T2D among Qataris. In addition, other reported associations might be detected also within Qataris.

5.10 Limitations

There are different factors limiting the study, most of them have been addressed in the discussion section 5.6. Here to add one more limiting factor:

 As mentioned earlier in this chapter, the study has limited statistical power to conduct a full EWAS due to the small sample size.

5.11 Future work

 Identify parent-of-origin effect of KCNQ1

Parent-of-origin effects (POE) are basically defined as the transmission of genetic variation from the father or the mother (Niemitz 2014). The parent-specific inheritance effects could be due to genomic imprinting, intrauterine environmental effects or due to the genes inherited from the maternal mitochondrial chromosome (Rampersaud, Mitchell et al. 2008). Figure 5.4 represents the mechanisms of POE. POE are commonly linked with the idea of imprinting; this means that the effect of an allele is silenced (occur through epigenetic modifications for example DNA methylation) when it is transmitted from one parent and expressed when the effect is transmitted from the other (Rampersaud, Mitchell et al. 2008).

172

Figure5.4. Describes the mechanisms of POE. (A) The effect of the maternal intrauterine environment on gene expression patterns, (B) exclusive inheritance of maternal functional genes in the mitochondrial genome, and (C) differential expression of gene copies depending on from which parent, mother or father, they are inherited (paternally-dependent expression shown). Pathways corresponding to parent-of-origin effects indicated by dashed arrows.

Variants associated with T2D have been mainly identified through GWAS, therefore the discovery of susceptibility variants of specific parental origin has been limited (Rampersaud, Mitchell et al. 2008). This could be due to missing parental information in these studies.

Multiple variants associated with T2D have been reported to have parental effects and detected within imprinted regions of specific genes such as KCNQ1 (Hanson, Guo et al. 2013). KCNQ1 is a known T2D candidate gene that is located within an imprinted domain in chromosome 11p15.5. The most common variants associated with T2D in KCNQ1 with evidence of POE were rs231362, rs2237892, rs2237895, rs2237897 and rs2299620.

The association analysis performed in this PhD project has shown a strong association within KCNQ1 at cg26417642 located at 2466095bp with a p-value of 5.2x10-7. A correlation analysis was performed to assess the correlation between T2D and inherited parental methylation values. The results have shown strong paternal correlation (p-value of 0.03064) and no significant maternal correlation was observed (p-value of 0.9464). Figures 5.4 shows scattered plots for the correlation of methylation values with maternal and paternal diabetic individuals.

173

A

B

Figure5.5. Scattered plots showing the correlation of methylation probe (cg26417642) in KCNQ1 with paternal (A) and maternal (B) methylation values. Figure (A) shows a linear correlation between the total methylation value and paternal methylation values, while the correlation between total methylation value and maternal methylation value is flat (B).

174

Then, methylation data of diabetic versus non-diabetic was plotted using boxplot command in R. The data was plotted after regressing out the effects of all covariates including age, gender, BMI and the six blood cell types. Figure 5.5 shows the association of cg26417642 methylation values with diabetes. From the boxplot the mean methylation of KCNQ1 among individuals with diabetes (0.042) is higher than the mean of methylation observed among individuals with no diabetes (0.036).

Figure5.6. Association of methylation in KCNQ1 with T2D. The methylation b-value observed in diabetic subjects (0.042) is higher than values observed in subjects with no-diabetes (0.036). This is concordant with other studies that reported an elevated methylation profile among diabetic individuals.

The preliminary results showed an evidence of association between paternal methylation and T2D, which suggested that the locus is imprinted. The variants with the most significant association found in the region of the methylation probe against methylation levels were rs2283169 and 2580299. The link between this locus and T2D has to be identified; a potential variant in the maternal haplotype may confer the risk as the paternal is methylated and not expressed. To better understand this association, the following will be performed:

175

 The SNPs around the identified markers have to be extracted to run POE test in QTDT and detect the POE effects of these SNPs.

 Haplotype re-construction using MERLIN to follow the inheritance of the risk allele of the selected SNPs in the pedigree, and identify whether the risk allele is inherited from the father or the mother.

 The parental origin of risk alleles of associated SNPs will be validated through haplotype reconstruction using other publically available tools.

176

Chapter Six

Conclusion

177

6 Conclusions

Although a major progress in identifying the genetic basis of common T2D has been achieved, genetic involvement in human variation and disease development remains a major challenge (Doria, Patti et al. 2008). Most of our knowledge about the genetic variants contributing to the risk of T2D has been primarily limited to the European population and less is known about other populations, such as the population of Qatar. However, these variants have explained less than 15% of T2D inheritance leading to question how to find the remaining hidden heritability. Studies have been shifted lately to identify sources of genetic variations that could explain the missing heritability of complex diseases (Manolio, Collins et al. 2009). In particular, the investigation of rare genetic variants, copy number variants and epigenetic alterations. The analyses described in this thesis involved the investigation of these different genetic variations using multi-generational families from Qatar.

The first aspect of this thesis was the investigation of rare genetic variants predisposing to T2D, and results were discussed in chapter three. In that chapter, I presented genome-wide genotyping arrays (Illumina HumanOmni2.5-8M) and whole-genome sequencing (Illumina hiSeq2500) using multi- generational Qatari families to map novel and/or known T2D variants. Linkage analysis was initially implemented to identify candidate regions that might harbour promising rare variants followed by ROH and WGS. Genome-wide linkage scan of the Qatari families have generated five linkage regions. Interestingly, two linkage regions were reported previously and have been replicated among the families of this project; these regions were 1q42.2 (Avery, Freedman et al. 2004, Avery, Freedman et al. 2004) and 13q13.1 (Shu, Long et al. 2010, Ali, Chopra et al. 2013). The analysis was followed by the identification of overlapping linkage region with ROH; this could indicate the existence of potential rare variants. Three overlapping regions were detected within chromosomes 1, 2 and 6. In addition, multiple ROH regions were identified. Further analysis of candidate regions was carried out using the WGS data. The analysis was performed at three steps; identifying potential rare variants within candidate linkage, homozygous and overlapping regions. In addition to, elucidating rare variants within known monogenic T2D genes. Rare variants have been mapped using WGS data after implementing different filtering parameters that were explained in details in Chapter two, section 2.8.2.4; briefly variants with call quality rate of at least 20 (based on Phred score) were kept. Also, variants with an allele frequency of 1% or more were excluded. Given the advantage of the possibility to incorporate additional data into IVA, information regarding candidate regions were added (more details are described in Chapter two, section 2.8.2.4).

178

Three promising rare variants were identified within known T2D genes that might have functional impact in the pathogenesis of T2D. Of the three variants, one novel rare heterozygous missense mutation (p.R227Q) was identified in GCKR, a known monogenic T2D gene. This mutation was identified in family 1 among four individuals with diabetes and one with no diabetes. GCKR is expressed in the liver and pancreatic β-cells, which reduces the affinity of GCK for glucose (Bonetti, Trombetta et al. 2011). GCKR variations have been found to be involved in β-cell functions and associated with T2D phenotypes (Bonetti, Trombetta et al. 2011). This mutation is transmitted over three generations from the grandmother to daughters and grandchildren. Additionally, the identified mutation is found to be transmitted among normal weight T2D patients, which may exclude the potential role of obesity in developing T2D. A heterozygous form of the mutant allele was identified among all subjects carrying the mutation. The subject who is presented as a control (with no diabetes) was carrying the same heterozygous mutation; this subject could be possibly in a pre- diabetic state or even diabetic. A pilot study has been published recently in Qatar among 191 individuals (the majority of them were Qataris) to examine the risk of T2D with elevated HbA1c. The study has shown that 20% of the study respondents, who presented themselves as normal with no diabetes, had elevated HbA1c (Mook-Kanamori, Selim et al. 2014). It is difficult at this stage to conclude whether the p.R227Q mutation is the cause of diabetes among this family. Further assessment to investigate the impact of the identified mutation on GCKR to T2D is essential and more members of the family have to be examined.

From the analysis of candidate regions, two novel mutations were detected within homozygous regions. A gene of interest was identified in family five in two individuals. A novel rare missense mutation (p.T220A) was identified within known T2D gene, IGFBP2- this gene was discovered through GWAS. This gene is expressed in the liver, skeletal muscle and pancreatic β-cells, and has been shown to be associated with T2D (Parikh, Lyssenko et al. 2009). Another rare heterozygous mutation was identified within TGM2 among two affected subjects from family 2. This gene was found to be associated with T2D; reduced activity of TGM2 had a potential effect on glucose metabolism causing impaired insulin secretion (Porzio, Massa et al. 2007). Also, results from the analysis of candidate linkage region have shown one variant within chromosome 6; this variant was not of interest as the variant was present in a single case and control. Analysis of other reported linkage regions have shown no variants. Furthermore, results from analysing the overlapping regions (between candidate linkage and ROH) have not identified any variants.

Another aspect of this thesis was the investigation of eight Qatari families to identify large copy number variants associated with T2D. In Chapter four, I presented a genome-wide analysis of large and rare copy number variants using Illumina HumanOmni2.5M arrays. PennCNV prediction

179

algorithm was used and detailed description of results is provided in that chapter. In order to select the candidate CNVs, each one of these CNVs were examined. CNVs with a size of >500kb and included >20 SNPs were excluded from subsequent analysis. Candidate CNVs were annotated; overlapping and neighbouring genes were identified. These genes were prioritized according to their functional relevance to T2D. Most of these genes have been reported previously with cancer and none of these genes have shown biological significance to T2D risk. Relatively small number of studies has been successful in identifying CNVs contributing to T2D (Lee, Moon et al. 2014). In addition, larger sample size will increase the statistical power to identify T2D-related CNVs. It was essential to highlight that the analysis conducted here was based on individual-based CNV calling. The family structure was ignored and individual-based CNV calls were generated due to missing data from one or both parents in most of the families. Family-based CNV calling is required and PennCNV algorithm can be implemented in the future. PennCNV family-based CNV calling is designed to generate CNV calls from trios (father, mother, and offspring) only. Thus, the extended families will be divided into trios and quartet for CNV calling. Then CNV calls will be united together into one combined calls.

In Chapter five, I presented what was to the best of my knowledge the first epigenome-wide association study (EWAS) between DNA methylation and T2D in 123 native Qatari individuals. The investigation of epigenetic modifications in disease risk, such as T2D has been widely implemented in recent years. Epigenetic alterations involving the mechanisms of DNA methylation and histone modifications resulted in changes of gene expression, and subsequent regulation of multiple biological processes. Therefore, the investigation of such epigenetic patterns could provide new insights to the pathogenesis of T2D. About six large EWASs have been published lately reporting strong associations between T2D with the degree of DNA methylation (Barrès, Osler et al. 2009, Toperoff, Aran et al. 2012, Volkmar, Dedeurwaerder et al. 2012, Dayeh, Volkov et al. 2014, Chambers, Loh et al. 2015). Though, these EWASs have been carried out in Caucasians. This has led to the question of whether these associations can be identified in other populations from different ethnicities and genetic background. Since a full EWAS was not applicable due to the limited sample size, the analysis described in chapter five was focused on replicating the most significant T2D associated CpG loci.

From the most recent EWASs (Chambers, Loh et al. 2015, Kulkarni, Kos et al. 2015) eight CpG loci (TNXIP1, PROC, C7orf29, SRABF1, PHOSPHO1, SOCS3, ABCG1 and CPT1A) that are significantly associated with T2D have been selected for the replication analysis. Of the eight most significantly

180

associated CpGs, one locus (cg19693031) within the TNXIP1 has shown a significant association with T2D in the Qataris. Interestingly, methylation of cg19693031 in the TNXIP1 has been shown to be strongly linked with alpha-hydroxybutyrate (pre-diabetic biomarker) in a recent study by (Suhre, Meisinger et al. 2010). In addition, methylation of another two CpG sites (ABCG1 and CPT1A) is linked to the same metabolic biomarker. However, these loci have shown significant associations between DNA methylation and BMI instead of T2D in the Qatari dataset. These associations are supported by findings of the metabolic phenotype on diabetes emphasised the potential contribution of these methylated loci to T2D susceptibility regardless of ethnic background.

The findings of this project highlighted the potential role of genetic and epigenetic variations to the risk of T2D in the Qatari population. However, the analysis described in the thesis was carried out with some difficulties due to the presence of potential caveats. For instance, the sample size was relatively small and has affected the power of the study. This was particularly obvious when analysing rare variants and copy number variants. A small number of rare mutations were detected in these analyses due to the limited sample size, in addition to the restricted analysing criteria. However, these variants have been found within known genes that have vital role on T2D development. To address this problem, all families used in this project are being contacted currently to collect samples from missing family members. Also, the sample size will be increased by recruiting more consanguineous families with a larger number of individuals. In addition, the most distant co- affected relatives in each family will be included to facilitate the discovery of overlapping novel mutation. More phenotypic data will be collected from all individuals including HbA1c, insulin level and oral glucose tolerance test (OGTT).

Moreover, to follow up the results presented in this thesis and to improve the understanding of the underlying genetic variations contributing to T2D in the Qatari population, further analysis is still required. After recruiting more families, and collecting the data from missing parents and other family members. The impact of the identified novel mutations will be investigated to confirm their contribution to T2D risk. Initial evaluation of the functional role of these variants will be examined by using the publically-available bioinformatics tool such as PolyPhen functional annotation tool. In addition, experimental-based functional analysis will be implemented, such as messenger RNA (mRNA) quantification using microarrays. This can be carried out using real-time PCR (RT-PCR).

Following the results obtained from CNV analysis, family-based CNV calling will be implemented using PennCNV algorithm. Family members are more likely to share the same CNV regions; since CNV information can be borrowed and correlated from all family members more accurate CNVs are

181

expected to be generated. As I mentioned above, PennCNV family-based CNV calling is designed to generate CNV calls from trios. Therefore, all families will be separated into trios and quartet for CNV calling. Also, enrichment analysis to look for enriched T2D genes within CNV candidate regions that are mainly shared between cases could be conducted. The availability of high-coverage WGS data for all family members enables the identification of more accurate CNVs through the application of sequencing-based CNV calling. The development of next generation sequencing technologies will allow a combined detection of large and small pathogenic CNVs (<50 bp in size), including deletions and duplications that are not detected with microarray technologies (Boone, Campbell et al. 2013, Hehir-Kwa, Pfundt et al. 2015). Such data can provide better CNV characterization with high levels of accuracy given that content and positional information are available. The explosion of NGS technologies offers a high capacity of identifying all forms of genetic variations from SNPs to different forms of structural variations (Boone, Campbell et al. 2013, Hehir-Kwa, Pfundt et al. 2015).

Likewise, the families that are being recruited enable the performance of full EWAS using Qatari samples.

The availability of various genomic data generated from this thesis, particularly WGS data provides potential opportunities for further T2D research. One option could be the investigation of metagenomics. The human gut microbiome is referred to colonies of micro-organisms in the human gut, containing enormous numbers of genes compared to the human genome that are associated with changes in health and human metabolism (Ji and Nielsen 2015). Reports have shown that gut microbiome is considered as a potential source of genetic diversity contributing to the risk of complex diseases, such as T2D (Guarner 2015). Evidences of associations between gut microbiome and their genes with T2D have been shown in a recent metagenomic-wide association study of European woman with T2D (Karlsson, Tremaroli et al. 2013); significant associations of genes within particular gut microbiome (such as, Roseburia species and Faecalibacterium prausnitzii) with T2D have been reported (Karlsson, Tremaroli et al. 2013). Further investigations will be required to understand the interactions between variation in the gut microbiome and the human genes (Guarner 2015).

Despite the study has low power, the combined analysis of rare variants, copy number variants and DNA methylation have identified potential sources of genetic associations with T2D among Qataris. The novelty of this project was the investigation of both genetic and epigenetic basis of T2D using Qatari family samples. The population of Qatar is genetically isolated showing a high level of inter- relatedness and subsequent reduction in genetic diversity. The investigation of such population

182

could enhance the identification of rare variants related to complex diseases and provide better understanding of disease biology. Since, the contribution of genetic components to T2D risk among Qataris is unknown. In addition, replicating known variants identified through the study of Europeans could sometimes lead to negative results, such as the negative results produced from validating PPARγ among Qataris (details are in Chapter one). Also, the genetic variants associated with T2D might be specific to the Qatari population and require extensive analysis. It is essential to investigate the underlying genetic causes of T2D among this under-presented population.

Given the early state in the field of genetic research in Qatar, the different aspects of this thesis provided to the best of my knowledge the first evidence of genetic involvement in T2D among this underexplored population. This thesis could open exciting avenues for further genetic research in the future in Qatar. Also, encourages deeper understanding of disease pathogenesis through the identification of genetic loci that are associated to complex diseases.

Genetic research is moving very fast in Qatar and this thesis supports Qatar vision to become a world-class hub for genetic research in diabetes. Different institutes such as Qatar Biomedical Research Institute (QBRI) and Hamad Medical Corporation (HMC) have started their pilot studies in genetics of T2D. Also, the Qatar Biobank with availability of large number of Qatari samples will improve the research in Qatar by providing human biological samples for the discovery of new causative loci and for replication studies.

183

Refrences Bae, J. S., et al. (2011). "The genetic effect of copy number variations on the risk of type 2 diabetes in a Korean population." PloS one 6(4): e19091. BoonPeng, H. and K. Yusoff (2013). "The utility of copy number variation (CNV) in studies of hypertension-related left ventricular hypertrophy (LVH): rationale, potential and challenges." Mol Cytogenet 6(8). Bronstad, I., et al. (2011). "Genome-wide copy number variation (CNV) in patients with autoimmune Addison's disease." BMC Med Genet 12: 111. Chew, S., et al. (2012). "Copy number variation of the APC gene is associated with regulation of bone mineral density." Bone 51(5): 939-943. Choudhury, R., et al. (2014). "The splicing activator DAZAP1 integrates splicing control into MEK/Erk- regulated cell proliferation and migration." Nature communications 5. Coin, L. J., et al. (2010). "cnvHap: an integrative population and haplotype–based multiplatform model of SNPs and CNVs." Nature methods 7(7): 541-546. Colella, S., et al. (2007). "QuantiSNP: an Objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data." Nucleic Acids Res 35(6): 2013- 2025. Craddock, N., et al. (2010). "Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls." Nature 464(7289): 713-720. de Smith, A. J., et al. (2008). "Human genes involved in copy number variation: mechanisms of origin, functional effects and implications for disease." Cytogenet Genome Res 123(1-4): 17-26. Eckel-Passow, J. E., et al. (2011). "Software comparison for evaluating genomic copy number variation for Affymetrix 6.0 SNP array platform." BMC Bioinformatics 12: 220. Eleftherohorinou, H., et al. (2011). "famCNV: copy number variant association for quantitative traits in families." Bioinformatics 27(13): 1873-1875. Freeman, J. L., et al. (2006). "Copy number variation: new insights in genome diversity." Genome Res 16(8): 949-961. Girirajan, S., et al. (2011). "Human copy number variation and complex genetic disease." Annual review of genetics 45: 203-226. Itsara, A., et al. (2009). "Population analysis of large copy number variants and hotspots of human genetic disease." The American Journal of Human Genetics 84(2): 148-161. Jeon, J. P., et al. (2010). "Copy number variation at leptin receptor gene locus associated with metabolic traits and the risk of type 2 diabetes mellitus." BMC genomics 11: 426. Kidd, J. M., et al. (2008). "Mapping and sequencing of structural variation from eight human genomes." Nature 453(7191): 56-64. Kirov, G., et al. (2012). "De novo CNV analysis implicates specific abnormalities of postsynaptic signalling complexes in the pathogenesis of schizophrenia." Mol Psychiatry 17(2): 142-153. Kudo, H., et al. (2011). "Frequent loss of genome gap region in 4p16.3 subtelomere in early-onset type 2 diabetes mellitus." Exp Diabetes Res 2011: 498460.

184

Matsunami, N., et al. (2013). "Identification of rare recurrent copy number variants in high-risk autism families and their prevalence in a large ASD population." PloS one 8(1): e52239. Pankratz, N., et al. (2011). "Copy number variation in familial Parkinson disease." PloS one 6(8): e20988. Pinto, D., et al. (2010). "Functional impact of global rare copy number variation in autism spectrum disorders." Nature 466(7304): 368-372. Plengvidhya, N., et al. (2012). "Identification of copy number variation of CAPN10 in Thais with type 2 diabetes by multiplex PCR and denaturing high performance liquid chromatography (DHPLC)." Gene 506(2): 383-386. Redon, R., et al. (2006). "Global variation in copy number in the human genome." Nature 444(7118): 444-454. Sebat, J., et al. (2007). "Strong association of de novo copy number mutations with autism." Science 316(5823): 445-449. Sebat, J., et al. (2004). "Large-scale copy number polymorphism in the human genome." Science 305(5683): 525-528. Stankiewicz, P. and J. R. Lupski (2002). "Genome architecture, rearrangements and genomic disorders." TRENDS in Genetics 18(2): 74-82. Stankiewicz, P. and J. R. Lupski (2010). "Structural variation in the human genome and its role in disease." Annual review of medicine 61: 437-455. Tabassum, R., et al. (2012). "Genetic variant of AMD1 is associated with obesity in Urban Indian Children." Walsh, T., et al. (2008). "Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia." Science 320(5875): 539-543. Wang, K., et al. (2007). "PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data." Genome research 17(11): 1665-1674. Yang, T. L., et al. (2008). "Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis." Am J Hum Genet 83(6): 663-674. Ye, Z. Q., et al. (2010). "Analyses of copy number variation of GK rat reveal new putative type 2 diabetes susceptibility loci." PloS one 5(11): e14077. Abecasis, G. R. and J. E. Wigginton (2005). "Handling Marker-Marker Linkage Disequilibrium: Pedigree Analysis with Clustered Markers." Am J Hum Genet 77(5): 754-767.

Ali, S., et al. (2013). "Replication of type 2 diabetes candidate genes variations in three geographically unrelated Indian population groups." PloS one 8(3): e58881. Auer, P. L. and G. Lettre (2015). "Rare variant association studies: considerations, challenges and opportunities." Genome medicine 7(1): 16. Avery, C. L., et al. (2004). "Linkage Analysis of Diabetes Status Among Hypertensive Families The Hypertension Genetic Epidemiology Network Study." Diabetes 53(12): 3307-3312. Avery, C. L., et al. (2004). "Linkage analysis of diabetes status among hypertensive families: the Hypertension Genetic Epidemiology Network study." Diabetes 53(12): 3307-3312.

185

Bernassola, F., et al. (2002). "Role of transglutaminase 2 in glucose tolerance: knockout mice studies and a putative mutation in a MODY patient." The FASEB journal 16(11): 1371-1378. Bodmer, W. and C. Bonilla (2008). "Common and rare variants in multifactorial susceptibility to common diseases." Nature genetics 40(6): 695-701. Chen, W. M. and G. R. Abecasis (2006). "Estimating the power of variance component linkage analysis in large pedigrees." Genet Epidemiol 30(6): 471-484. Chu, C. A., et al. (2004). "Rapid translocation of hepatic glucokinase in response to intraduodenal glucose infusion and changes in plasma glucose and insulin in conscious rats." American Journal of Physiology-Gastrointestinal and Liver Physiology 286(4): G627-G634. Cirulli, E. T. and D. B. Goldstein (2010). "Uncovering the roles of rare variants in common disease through whole-genome sequencing." Nature Reviews Genetics 11(6): 415-425. Conen, D., et al. (2008). "Renin–angiotensin and endothelial nitric oxide synthase gene polymorphisms are not associated with the risk of incident type 2 diabetes mellitus: a prospective cohort study." J Intern Med 263(4): 376-385. Drong, A. W., et al. (2012). "The genetic and epigenetic basis of type 2 diabetes and obesity." Clin Pharmacol Ther 92(6): 707-715. Elbein, S. C., et al. (2009). "Genome-wide linkage and admixture mapping of type 2 diabetes in African American families from the American Diabetes Association GENNID (Genetics of NIDDM) Study Cohort." Diabetes 58(1): 268-274. Fajans, S. S., et al. (2001). "Molecular mechanisms and clinical pathophysiology of maturity-onset diabetes of the young." New England Journal of Medicine 345(13): 971-980. Farrelly, D., et al. (1999). "Mice mutant for glucokinase regulatory protein exhibit decreased liver glucokinase: a sequestration mechanism in metabolic regulation." Proceedings of the National Academy of Sciences 96(25): 14511-14516. Gibbs, R. A., et al. (2003). "The international HapMap project." Nature 426(6968): 789-796. Grimsby, J., et al. (2000). "Characterization of glucokinase regulatory protein-deficient mice." Journal of Biological Chemistry 275(11): 7826-7831. Hayes, M. G., et al. (2007). "Identification of type 2 diabetes genes in Mexican Americans through genome-wide association studies." Diabetes 56(12): 3033-3044. Helgason, A., et al. (2001). "mtDNA and the islands of the North Atlantic: estimating the proportions of Norse and Gaelic ancestry." The American Journal of Human Genetics 68(3): 723-737. Helgason, A., et al. (2000). "Estimating Scandinavian and Gaelic ancestry in the male settlers of Iceland." The American Journal of Human Genetics 67(3): 697-717. Inoue, I., et al. (1997). "A nucleotide substitution in the promoter of human angiotensinogen is associated with essential hypertension and affects basal transcription in vitro." Journal of Clinical Investigation 99(7): 1786. Jafar-Mohammadi, B. and M. I. McCarthy (2008). "Genetics of type 2 diabetes mellitus and obesity-a review." Annals of medicine 40(1): 2-10. Jafar-Mohammadi, B. and M. I. Mccarthy (2008). "Genetics of type 2 diabetes mellitus and obesity - a review." Annals of Medicine 40(1): 2-10.

186

Kruglyak, L., et al. (1996). "Parametric and nonparametric linkage analysis: A unified multipoint approach." American Journal of Human Genetics 58(6): 1347-1363. Kryukov, G. V., et al. (2009). "Power of deep, all-exon resequencing for discovery of human trait genes." Proceedings of the National Academy of Sciences 106(10): 3871-3876. Kumar, P., et al. (2014). "Evaluation of SNP calling using single and multiple-sample calling algorithms by validation against array base genotyping and Mendelian inheritance." BMC research notes 7(1): 747. Li, D., et al. (2011). "Using extreme phenotype sampling to identify the rare causal variants of quantitative traits in association studies." Genetic epidemiology 35(8): 790-799. Ling, Y., et al. (2011). "Associations of common polymorphisms in GCKR with type 2 diabetes and related traits in a Han Chinese population: a case-control study." BMC Med Genet 12(1): 66. Manolio, T. A., et al. (2009). "Finding the missing heritability of complex diseases." Nature 461(7265): 747-753. Matschinsky, F. M. (1996). "A lesson in metabolic regulation inspired by the glucokinase glucose sensor paradigm." Diabetes 45(2): 223-241. Murray, S. S., et al. (2004). "A highly informative SNP linkage panel for human genetic studies." Nature methods 1(2): 113-117. Plomin, R., et al. (2009). "Common disorders are quantitative traits." Nature Reviews Genetics 10(12): 872-878. Porzio, O., et al. (2007). "Missense mutations in the TGM2 gene encoding transglutaminase 2 are found in patients with early‐onset type 2 diabetes." Human mutation 28(11): 1150-1150. Pritchard, J. K. (2001). "Are rare variants responsible for susceptibility to complex diseases?" The American Journal of Human Genetics 69(1): 124-137. Pritchard, J. K. and N. J. Cox (2002). "The allelic architecture of human disease genes: common disease–common variant… or not?" Human molecular genetics 11(20): 2417-2423. Prokopenko, I., et al. (2008). "Type 2 diabetes: new genes, new understanding." Trends in Genetics 24(12): 613-621. Rajpathak, S. N., et al. (2009). "The role of insulin‐like growth factor‐I and its binding proteins in glucose homeostasis and type 2 diabetes." Diabetes/metabolism research and reviews 25(1): 3-12. Rich, S. S., et al. (2008). "Genes Associated With Risk of Type 2 Diabetes Identified by a Candidate- Wide Association Scan As a Trickle Becomes a Flood." Diabetes 57(11): 2915-2917. Rotimi, C., et al. (1997). "Hypertension, serum angiotensinogen, and molecular variants of the angiotensinogen gene among Nigerians." Circulation 95(10): 2348-2350. SCHAFTINGEN, E. (1989). "A protein from rat liver confers to glucokinase the property of being antagonistically regulated by fructose 6‐phosphate and fructose 1‐phosphate." European Journal of Biochemistry 179(1): 179-184. Schork, N. J., et al. (2009). "Common vs. rare allele hypotheses for complex diseases." Current opinion in genetics & development 19(3): 212-219. Shu, X. O., et al. (2010). "Identification of new genetic risk variants for type 2 diabetes." PLoS genetics 6(9): e1001127.

187

Slosberg, E. D., et al. (2001). "Treatment of type 2 diabetes by adenoviral-mediated overexpression of the glucokinase regulatory protein." Diabetes 50(8): 1813-1820. Strauch, K., et al. (2000). "Parametric and nonparametric multipoint linkage analysis with imprinting and two-locus–trait models: application to mite sensitization." The American Journal of Human Genetics 66(6): 1945-1957. Vaxillaire, M. and P. Froguel (2013). Monogenic diabetes in the young, pharmacogenetics and relevance to multifactorial forms of type 2 diabetes, Endocrine Society. Abecasis, G. R., et al. (2002). "Merlin—rapid analysis of dense genetic maps using sparse gene flow trees." Nature genetics 30(1): 97-101.

Abecasis, G. R. and J. E. Wigginton (2005). "Handling Marker-Marker Linkage Disequilibrium: Pedigree Analysis with Clustered Markers." Am J Hum Genet 77(5): 754-767.

Agostini, M., et al. (2004). "Tyrosine agonists reverse the molecular defects associated with dominant-negative mutations in human peroxisome proliferator-activated receptor γ." Endocrinology 145(4): 1527-1538.

Ahlqvist, E., et al. (2011). "Genetics of type 2 diabetes." Clinical chemistry 57(2): 241-254.

Ali, S., et al. (2013). "Replication of type 2 diabetes candidate genes variations in three geographically unrelated Indian population groups." PloS one 8(3): e58881.

Almén, M. S., et al. (2012). "Genome wide analysis reveals association of a FTO gene variant with epigenetic changes." Genomics 99(3): 132-137.

Almind, K., et al. (1993). "Aminoacid polymorphisms of insulin receptor substrate-1 in non-insulin- dependent diabetes mellitus." The Lancet 342(8875): 828-832.

Altshuler, D., et al. (2008). "Genetic mapping in human disease." Science 322(5903): 881-888.

Altshuler, D., et al. (2000). "The common PPARγ Pro12Ala polymorphism is associated with decreased risk of type 2 diabetes." Nature genetics 26(1): 76-80.

Ambros, V. (2004). "The functions of animal microRNAs." Nature 431(7006): 350-355.

Ambros, V. (2004). "The functions of animal microRNAs." Nature 431(7006): 350-355.

Arslan, E., et al. (2014). "The effect of calpain-10 gene polymorphism on the development of type 2 diabetes mellitus in a Turkish population." Endokrynologia Polska 65(2): 90-95.

Auer, P. L. and G. Lettre (2015). "Rare variant association studies: considerations, challenges and opportunities." Genome medicine 7(1): 16.

188

Avery, C. L., et al. (2004). "Linkage Analysis of Diabetes Status Among Hypertensive Families The Hypertension Genetic Epidemiology Network Study." Diabetes 53(12): 3307-3312.

Avery, C. L., et al. (2004). "Linkage analysis of diabetes status among hypertensive families: the Hypertension Genetic Epidemiology Network study." Diabetes 53(12): 3307-3312.

Baccarelli, A., et al. (2009). "Rapid DNA methylation changes after exposure to traffic particles." American journal of respiratory and critical care medicine 179(7): 572-578.

Badii, R., et al. (2008). "Lack of association between the Pro12Ala polymorphism of the PPAR-gamma 2 gene and type 2 diabetes mellitus in the Qatari consanguineous population." Acta Diabetol 45(1): 15-21.

Bae, J. S., et al. (2011). "The genetic effect of copy number variations on the risk of type 2 diabetes in a Korean population." PloS one 6(4): e19091.

Bae, J. S., et al. (2011). "The genetic effect of copy number variations on the risk of type 2 diabetes in a Korean population." PloS one 6(4): e19091.

Barres, R., et al. (2009). "Non-CpG methylation of the PGC-1alpha promoter through DNMT3B controls mitochondrial density." Cell Metab 10(3): 189-198.

Barrès, R., et al. (2009). "Non-CpG methylation of the PGC-1α promoter through DNMT3B controls mitochondrial density." Cell Metab 10(3): 189-198.

Barroso, I., et al. (1999). "Dominant negative mutations in human PPARγ associated with severe insulin resistance, diabetes mellitus and hypertension." Nature 402(6764): 880-883.

Beamer, B. A., et al. (1998). "Association of the Pro12Ala variant in the peroxisome proliferator- activated receptor-gamma2 gene with obesity in two Caucasian populations." Diabetes 47(11): 1806- 1808.

Beckmann, J., et al. (2008). "CNVs and genetic medicine (excitement and consequences of a rediscovery)." Cytogenetic and genome research 123(1): 7.

Bell, C. G., et al. (2010). "Integrated genetic and epigenetic analysis identifies haplotype-specific methylation in the FTO type 2 diabetes and obesity susceptibility locus." PloS one 5(11): e14040.

Bell, J. T. and T. D. Spector (2011). "A twin approach to unraveling epigenetics." TRENDS in Genetics 27(3): 116-125.

Bell, J. T. and T. D. Spector (2012). "DNA methylation studies using twins: what are they telling us." Genome Biol 13(10): 172.

189

Bell, J. T., et al. (2012). "Epigenome-wide scans identify differentially methylated regions for age and age-related phenotypes in a healthy ageing population." PLoS genetics 8(4): e1002629. Bener, A., et al. (2007). "Consanguineous marriages and their effects on common adult diseases: studies from an endogamous population." Medical Principles and Practice 16(4): 262.

Bener, A., et al. (2007). "Consanguineous marriages and their effects on common adult diseases: studies from an endogamous population." Med Princ Pract 16(4): 262-267.

Bener, A., et al. (2005). "Genetics, obesity, and environmental risk factors associated with type 2 diabetes." Croat Med J 46(2): 302-307.

Bener, A., et al. (2009). "Prevalence of diagnosed and undiagnosed diabetes mellitus and its risk factors in a population-based study of Qatar." Diabetes Res Clin Pract 84(1): 99-106.

Benjamini, Y. and Y. Hochberg (1995). "Controlling the false discovery rate: a practical and powerful approach to multiple testing." Journal of the Royal Statistical Society. Series B (Methodological): 289- 300.

Bentley, D. R., et al. (2008). "Accurate whole human genome sequencing using reversible terminator chemistry." Nature 456(7218): 53-59.

Bernassola, F., et al. (2002). "Role of transglutaminase 2 in glucose tolerance: knockout mice studies and a putative mutation in a MODY patient." The FASEB journal 16(11): 1371-1378.

Bibikova, M., et al. (2011). "High density DNA methylation array with single CpG site resolution." Genomics 98(4): 288-295.

Bibikova, M., et al. (2009). "Genome-wide DNA methylation profiling using Infinium(R) assay." Epigenomics 1(1): 177-200.

Bibikova, M., et al. (2009). "Genome-wide DNA methylation profiling using Infinium® assay." Epigenomics 1(1): 177-200. Bittles, A. (2001). "Consanguinity and its relevance to clinical genetics." Clin Genet 60(2): 89-98. Bock, C., et al. (2008). "Inter-individual variation of DNA methylation and its implications for large- scale epigenome mapping." Nucleic Acids Res 36(10): e55-e55. Boden, G. (1997). "Role of fatty acids in the pathogenesis of insulin resistance and NIDDM." Diabetes 46(1): 3-10. Bodmer, W. and C. Bonilla (2008). "Common and rare variants in multifactorial susceptibility to common diseases." Nature genetics 40(6): 695-701. Boks, M. P., et al. (2009). "The relationship of DNA methylation with age, gender and genotype in twins and healthy controls." PloS one 4(8): e6767.

190

Bollati, V., et al. (2007). "Changes in DNA methylation patterns in subjects exposed to low-dose benzene." Cancer research 67(3): 876-880. Bollati, V., et al. (2009). "Decline in genomic DNA methylation through aging in a cohort of elderly subjects." Mechanisms of ageing and development 130(4): 234-239.

Bonetti, S., et al. (2011). "Variants of GCKR Affect Both β-Cell and Kidney Function in Patients With Newly Diagnosed Type 2 Diabetes The Verona Newly Diagnosed Type 2 Diabetes Study 2." Diabetes Care 34(5): 1205-1210.

Bonnefond, A., et al. (2010). "Molecular diagnosis of neonatal diabetes mellitus using next- generation sequencing of the whole exome."

Bonnefond, A., et al. (2010). "Molecular diagnosis of neonatal diabetes mellitus using next- generation sequencing of the whole exome." PloS one 5(10): e13630.

Bonnefond, A. and P. Froguel (2015). "Rare and Common Genetic Events in Type 2 Diabetes: What Should Biologists Know?" Cell metabolism 21(3): 357-368.

Bonnefond, A., et al. (2014). "Highly sensitive diagnosis of 43 monogenic forms of diabetes or obesity through one-step PCR-based enrichment in combination with next-generation sequencing." Diabetes Care 37(2): 460-467.

Boomsma, D., et al. (2002). "Classical twin studies and beyond." Nature Reviews Genetics 3(11): 872- 882.

Boone, P. M., et al. (2013). "Deletions of recessive disease genes: CNV contribution to carrier states and disease-causing alleles." Genome research 23(9): 1383-1394.

BoonPeng, H. and K. Yusoff (2013). "The utility of copy number variation (CNV) in studies of hypertension-related left ventricular hypertrophy (LVH): rationale, potential and challenges." Mol Cytogenet 6(8).

Bouatia-Naji, N., et al. (2009). "A variant near MTNR1B is associated with increased fasting plasma glucose levels and type 2 diabetes risk." Nature genetics 41(1): 89-94.

Bouchard, C., et al. (1987). "Inheritance of the amount and distribution of human body fat." International journal of obesity 12(3): 205-215.

Bourc'his, D., et al. (2001). "Dnmt3L and the establishment of maternal genomic imprints." Science 294(5551): 2536-2539.

Bronstad, I., et al. (2011). "Genome-wide copy number variation (CNV) in patients with autoimmune Addison's disease." BMC Med Genet 12: 111.

191

Brown, C. J. and J. C. Chow (2003). Beyond sense: the role of antisense RNA in controlling Xist expression. Seminars in cell & developmental biology, Elsevier.

Buchanan, T. A., et al. (1990). "Insulin sensitivity and B-cell responsiveness to glucose during late pregnancy in lean and moderately obese women with normal glucose tolerance or mild gestational diabetes." American journal of obstetrics and gynecology 162(4): 1008-1014.

Burton, P. R., et al. (2007). "Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls." Nature 447(7145): 661-678.

Carey, D., et al. (1996). "Genetic influences on central abdominal fat: a twin study." International journal of obesity and related metabolic disorders: journal of the International Association for the Study of Obesity 20(8): 722-726.

Carter, N. P. (2007). "Methods and strategies for analyzing copy number variation using DNA microarrays." Nature genetics 39: S16-S21.

Cedar, H. and Y. Bergman (2009). "Linking DNA methylation and histone modification: patterns and paradigms." Nat Rev Genet 10(5): 295-304.

Chambers, J. C., et al. (2015). "Epigenome-wide association of DNA methylation markers in peripheral blood from Indian Asians and Europeans with incident type 2 diabetes: a nested case- control study." The Lancet Diabetes & Endocrinology.

Chen, J., et al. (2008). "Thioredoxin-interacting protein deficiency induces Akt/Bcl-xL signaling and pancreatic beta-cell mass and protects against diabetes." The FASEB journal 22(10): 3581-3594.

Chen, M., et al. (1988). "Insulin Resistance and β-Cell Dysfunction in Aging: The Importance of Dietary Carbohydrate*." The Journal of Clinical Endocrinology & Metabolism 67(5): 951-957.

Chen, W. M. and G. R. Abecasis (2006). "Estimating the power of variance component linkage analysis in large pedigrees." Genet Epidemiol 30(6): 471-484.

Chew, S., et al. (2012). "Copy number variation of the APC gene is associated with regulation of bone mineral density." Bone 51(5): 939-943.

Chinnusamy, V. and J.-K. Zhu (2009). "Epigenetic regulation of stress responses in plants." Current opinion in plant biology 12(2): 133-139.

Chong, S. and E. Whitelaw (2004). "Epigenetic germline inheritance." Current opinion in genetics & development 14(6): 692-696.

192

Choudhury, R., et al. (2014). "The splicing activator DAZAP1 integrates splicing control into MEK/Erk- regulated cell proliferation and migration." Nature communications 5.

Chow, J. and C. Brown (2003). "Forming facultative heterochromatin: silencing of an X chromosome in mammalian females." Cellular and molecular life sciences CMLS 60(12): 2586-2603.

Christensen, B. C., et al. (2009). "Aging and environmental exposures alter tissue-specific DNA methylation dependent upon CpG island context." PLoS Genet 5(8): e1000602.

Chu, C. A., et al. (2004). "Rapid translocation of hepatic glucokinase in response to intraduodenal glucose infusion and changes in plasma glucose and insulin in conscious rats." American Journal of Physiology-Gastrointestinal and Liver Physiology 286(4): G627-G634.

Cirulli, E. T. and D. B. Goldstein (2010). "Uncovering the roles of rare variants in common disease through whole-genome sequencing." Nature Reviews Genetics 11(6): 415-425.

Cirulli, E. T., et al. (2010). "Common genetic variation and performance on standardized cognitive tests." European Journal of Human Genetics 18(7): 815-820.

Coe, B. P., et al. (2014). "Refining analyses of copy number variation identifies specific genes associated with developmental delay." Nature genetics 46(10): 1063-1071.

Coin, L. J., et al. (2010). "cnvHap: an integrative population and haplotype–based multiplatform model of SNPs and CNVs." Nature methods 7(7): 541-546.

Colella, S., et al. (2007). "QuantiSNP: an Objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data." Nucleic Acids Res 35(6): 2013- 2025.

Collins, F. S., et al. (1997). "Variations on a theme: cataloging human DNA sequence variation." Science 278(5343): 1580-1581.

Conen, D., et al. (2008). "Renin–angiotensin and endothelial nitric oxide synthase gene polymorphisms are not associated with the risk of incident type 2 diabetes mellitus: a prospective cohort study." J Intern Med 263(4): 376-385.

Conrad, D. F., et al. (2006). "A high-resolution survey of deletion polymorphism in the human genome." Nature genetics 38(1): 75-81.

Conrad, D. F., et al. (2010). "Origins and functional impact of copy number variation in the human genome." Nature 464(7289): 704-712.

Consortium, D. S. D., et al. (2014). "Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility." Nature genetics 46(3): 234-244.

193

Cox, N. J., et al. (1999). "Loci on chromosomes 2 (NIDDM1) and 15 interact to increase susceptibility to diabetes in Mexican Americans." Nature genetics 21(2): 213-215.

Craddock, N., et al. (2010). "Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls." Nature 464(7289): 713-720.

Craddock, N., et al. (2010). "Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls." Nature 464(7289): 713-720.

Crosthwaite, S. K. (2004). "Circadian clocks and natural antisense RNA." FEBS letters 567(1): 49-54.

Dabelea, D., et al. (2000). "Intrauterine exposure to diabetes conveys risks for type 2 diabetes and obesity: a study of discordant sibships." Diabetes 49(12): 2208-2211.

Daxinger, L. and E. Whitelaw (2010). "Transgenerational epigenetic inheritance: more questions than answers." Genome research 20(12): 1623-1628.

Dayeh, T., et al. (2014). "Genome-wide DNA methylation analysis of human pancreatic islets from type 2 diabetic and non-diabetic donors identifies candidate genes that influence insulin secretion." PLoS Genet 10(3): e1004160.

Dayeh, T., et al. (2014). "Genome-wide DNA methylation analysis of human pancreatic islets from type 2 diabetic and non-diabetic donors identifies candidate genes that influence insulin secretion." PLoS genetics 10(3): e1004160. de Smith, A. J., et al. (2007). "Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases." Human molecular genetics 16(23): 2783-2794.

De Smith, A. J., et al. (2008). "Small deletion variants have stable breakpoints commonly associated with alu elements." PloS one 3(8): e3104. de Smith, A. J., et al. (2008). "Human genes involved in copy number variation: mechanisms of origin, functional effects and implications for disease." Cytogenet Genome Res 123(1-4): 17-26. de Vries, B. B., et al. (2005). "Diagnostic genome profiling in mental retardation." The American Journal of Human Genetics 77(4): 606-616.

Deeb, S. S., et al. (1998). "A Pro12Ala substitution in PPARγ2 associated with decreased receptor activity, lower body mass index and improved insulin sensitivity." Nature genetics 20(3): 284-287.

DeFronzo, R. A. (1979). "Glucose intolerance and aging: evidence for tissue insensitivity to insulin." Diabetes 28(12): 1095-1101.

194

Den Dunnen, J., et al. (1989). "Topography of the Duchenne muscular dystrophy (DMD) gene: FIGE and cDNA analysis of 194 cases reveals 115 deletions and 13 duplications." American journal of human genetics 45(6): 835.

DePristo, M. A., et al. (2011). "A framework for variation discovery and genotyping using next- generation DNA sequencing data." Nat Genet 43(5): 491-498.

Dina, C., et al. (2007). "Variation in FTO contributes to childhood obesity and severe adult obesity." Nat Genet 39(6): 724-726.

Djupedal, I. and K. Ekwall (2009). "Epigenetics: heterochromatin meets RNAi." Cell Res 19(3): 282- 295.

Doria, A., et al. (2008). "The emerging genetic architecture of type 2 diabetes." Cell Metab 8(3): 186- 200.

Doria, A., et al. (2008). "The emerging genetic architecture of type 2 diabetes." Cell Metab 8(3): 186- 200.

Drong, A., et al. (2012). "The genetic and epigenetic basis of type 2 diabetes and obesity." Clinical Pharmacology & Therapeutics 92(6): 707-715.

Drong, A. W., et al. (2012). "The Genetic and Epigenetic Basis of Type 2 Diabetes and Obesity." Clinical Pharmacology & Therapeutics 92(6): 707-715.

Drong, A. W., et al. (2012). "The genetic and epigenetic basis of type 2 diabetes and obesity." Clin Pharmacol Ther 92(6): 707-715.

Du, P., et al. (2008). "lumi: a pipeline for processing Illumina microarray." Bioinformatics 24(13): 1547-1548.

Dudbridge, F. and A. Gusnanto (2008). "Estimation of significance thresholds for genomewide association scans." Genetic epidemiology 32(3): 227.

Duggirala, R., et al. (1999). "Linkage of type 2 diabetes mellitus and of age at onset to a genetic location on chromosome 10q in Mexican Americans." The American Journal of Human Genetics 64(4): 1127-1140.

Duncan, S. A., et al. (1998). "Regulation of a transcription factor network required for differentiation and metabolism." Science 281(5377): 692-695.

Dupuis, J., et al. (2010). "New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk." Nature genetics 42(2): 105-116.

195

Eckel-Passow, J. E., et al. (2011). "Software comparison for evaluating genomic copy number variation for Affymetrix 6.0 SNP array platform." BMC Bioinformatics 12: 220.

Eijk-Van Os, P. G. and J. P. Schouten (2011). Multiplex ligation-dependent probe amplification (MLPA®) for the detection of copy number variation in genomic sequences. PCR Mutation detection protocols, Springer: 97-126.

Elbein, S. C., et al. (2009). "Genome-wide linkage and admixture mapping of type 2 diabetes in African American families from the American Diabetes Association GENNID (Genetics of NIDDM) Study Cohort." Diabetes 58(1): 268-274.

Eleftherohorinou, H., et al. (2011). "famCNV: copy number variant association for quantitative traits in families." Bioinformatics 27(13): 1873-1875.

Fain, J. N., et al. (2004). "Comparison of the release of adipokines by adipose tissue, adipose tissue matrix, and adipocytes from visceral and subcutaneous abdominal adipose tissues of obese humans." Endocrinology 145(5): 2273-2282.

Fajans, S. S., et al. (2001). "Molecular mechanisms and clinical pathophysiology of maturity-onset diabetes of the young." New England Journal of Medicine 345(13): 971-980.

Fanciulli, M., et al. (2007). "FCGR3B copy number variation is associated with susceptibility to systemic, but not organ-specific, autoimmunity." Nat Genet 39(6): 721-723.

Farrelly, D., et al. (1999). "Mice mutant for glucokinase regulatory protein exhibit decreased liver glucokinase: a sequestration mechanism in metabolic regulation." Proceedings of the National Academy of Sciences 96(25): 14511-14516.

Feil, R. and M. F. Fraga (2011). "Epigenetics and the environment: emerging patterns and implications." Nat Rev Genet 13(2): 97-109.

Feil, R. and M. F. Fraga (2012). "Epigenetics and the environment: emerging patterns and implications." Nature Reviews Genetics 13(2): 97-109.

Ferreira, M. A. (2004). "Linkage analysis: principles and methods for the analysis of human quantitative traits." Twin Research 7(05): 513-530.

Flanagan, J. M., et al. (2006). "Intra-and interindividual epigenetic variation in human germ cells." The American Journal of Human Genetics 79(1): 67-84.

Flegal, K. M., et al. (1991). "Prevalence of diabetes in Mexican Americans, Cubans, and Puerto Ricans from the Hispanic Health and Nutrition Examination Survey, 1982-1984." Diabetes Care 14(7): 628- 638.

196

Florez, J. C., et al. (2004). "Haplotype structure and genotype-phenotype correlations of the sulfonylurea receptor and the islet ATP-sensitive potassium channel gene region." Diabetes 53(5): 1360-1368.

Florez, J. C., et al. (2007). "A 100K genome-wide association scan for diabetes and related traits in the Framingham Heart Study: replication and integration with other genome-wide datasets." Diabetes 56(12): 3063-3074.

Fraga, M. F. and M. Esteller (2007). "Epigenetics and aging: the targets and the marks." TRENDS in Genetics 23(8): 413-418.

Franks, P., et al. (2008). "Replication of the association between variants in WFS1 and risk of type 2 diabetes in European populations." Diabetologia 51(3): 458-463.

Frayling, T. M., et al. (2007). "A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity." Science 316(5826): 889-894.

Frazer, K. A., et al. (2007). "A second generation human haplotype map of over 3.1 million SNPs." Nature 449(7164): 851-861.

Frederiksen, L., et al. (2002). "Comment: studies of the Pro12Ala polymorphism of the PPAR-gamma gene in the Danish MONICA cohort: homozygosity of the Ala allele confers a decreased risk of the insulin resistance syndrome." J Clin Endocrinol Metab 87(8): 3989-3992.

Freeman, J. L., et al. (2006). "Copy number variation: new insights in genome diversity." Genome Res 16(8): 949-961.

Froguel, P., et al. (1992). "Close linkage of glucokinase locus on chromosome 7p to early-onset non- insulin-dependent diabetes mellitus." Nature 356(6365): 162-164.

Froguel, P. and G. Velho (2001). "Genetic determinants of type 2 diabetes." Recent progress in hormone research 56: 91-106.

Génin, E. and A. A. Todorov (2006). "Homozygosity Mapping." eLS.

Gerbitz, K. D., et al. (1995). "Mitochondrial diabetes mellitus: a review." Biochim Biophys Acta 1271(1): 253-260.

Gervin, K., et al. (2011). "Extensive variation and low heritability of DNA methylation identified in a twin study." Genome research 21(11): 1813-1821.

Gibbs, R. A., et al. (2003). "The international HapMap project." Nature 426(6968): 789-796.

197

Girirajan, S., et al. (2011). "Human copy number variation and complex genetic disease." Annual review of genetics 45: 203-226.

Glessner, J. T., et al. (2013). "ParseCNV integrative copy number variation association software with quality tracking." Nucleic Acids Res: gks1346.

Gloyn, A. L., et al. (2003). "Large-scale association studies of variants in genes encoding the pancreatic β-cell KATP channel subunits Kir6. 2 (KCNJ11) and SUR1 (ABCC8) confirm that the KCNJ11 E23K variant is associated with type 2 diabetes." Diabetes 52(2): 568-572.

Gloyn, A. L., et al. (2003). "Large-scale association studies of variants in genes encoding the pancreatic beta-cell KATP channel subunits Kir6.2 (KCNJ11) and SUR1 (ABCC8) confirm that the KCNJ11 E23K variant is associated with type 2 diabetes." Diabetes 52(2): 568-572.

Goddard, K. A., et al. (2000). "Linkage disequilibrium and allele-frequency distributions for 114 single-nucleotide polymorphisms in five populations." Am J Hum Genet 66(1): 216-234.

Goldstein, D. B., et al. (2013). "Sequencing studies in human genetics: design and interpretation." Nature Reviews Genetics 14(7): 460-470.

Gordon, L., et al. (2012). "Neonatal DNA methylation profile in human twins is specified by a complex interplay between intrauterine environmental and genetic factors, subject to tissue-specific influence." Genome research 22(8): 1395-1406.

Gosadi, I. M., et al. (2014). "Investigating the potential effect of consanguinity on type 2 diabetes susceptibility in a saudi population." Hum Hered 77(1-4): 197-206.

Gouda, H. N., et al. (2010). "The association between the peroxisome proliferator-activated receptor-γ2 (PPARG2) Pro12Ala gene variant and type 2 diabetes mellitus: a HuGE review and meta- analysis." American journal of epidemiology 171(6): 645-655.

Grant, S. F., et al. (2006). "Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes." Nature genetics 38(3): 320-323.

Greeley, S. A. W., et al. (2011). "The Cost-Effectiveness of Personalized Genetic Medicine The case of genetic testing in neonatal diabetes." Diabetes Care 34(3): 622-627.

Grimsby, J., et al. (2000). "Characterization of glucokinase regulatory protein-deficient mice." Journal of Biological Chemistry 275(11): 7826-7831.

Groop, L., et al. (1996). "Metabolic consequences of a family history of NIDDM (the Botnia study): evidence for sex-specific parental effects." Diabetes 45(11): 1585-1593.

198

Groop, L. and F. Pociot (2014). "Genetics of diabetes–Are we missing the genes or the disease?" Molecular and cellular endocrinology 382(1): 726-739.

Guan, W., et al. (2008). "Meta-analysis of 23 type 2 diabetes linkage studies from the International Type 2 Diabetes Linkage Analysis Consortium." Hum Hered 66(1): 35-49.

Guarner, F. (2015). "The gut microbiome: What do we know?" Clinical Liver Disease 5(4): 86-90.

Gudmundsson, J., et al. (2007). "Two variants on chromosome 17 confer prostate cancer risk, and the one in TCF2 protects against type 2 diabetes." Nature genetics 39(8): 977-983.

Haeusler, R. A., et al. (2013). "Human insulin resistance is associated with increased plasma levels of 12α-hydroxylated bile acids." Diabetes 62(12): 4184-4191.

Hales, C. N. and D. J. Barker (1992). "Type 2 (non-insulin-dependent) diabetes mellitus: the thrifty phenotype hypothesis." Diabetologia 35(7): 595-601.

Hall, I. M., et al. (2002). "Establishment and maintenance of a heterochromatin domain." Science 297(5590): 2232-2237.

Hamamy, H. (2012). "Consanguineous marriages." Journal of community genetics 3(3): 185-192.

Hanis, C., et al. (1996). "A genome–wide search for human non–insulin–dependent (type 2) diabetes genes reveals a major susceptibility locus on chromosome 2." Nature genetics 13(2): 161-166.

Hanson, R. L., et al. (2007). "A search for variants associated with young-onset type 2 diabetes in American Indians in a 100K genotyping array." Diabetes 56(12): 3045-3052.

Hanson, R. L., et al. (2013). "Strong parent-of-origin effects in the association of KCNQ1 variants with type 2 diabetes mellitus in American Indians." Diabetes: DB_121767.

Hardy, J. and A. Singleton (2009). "Genomewide association studies and human disease." New England Journal of Medicine 360(17): 1759-1768.

Hari Kumar, K. V. and K. D. Modi (2014). "Twins and endocrinology." Indian J Endocrinol Metab 18(Suppl 1): S48-52.

Hattersley, A., et al. (2006). "ISPAD Clinical Practice Consensus Guidelines 2006-2007. The diagnosis and management of monogenic diabetes in children." Pediatr Diabetes 7(6): 352-360.

Hayes, M. G., et al. (2007). "Identification of type 2 diabetes genes in Mexican Americans through genome-wide association studies." Diabetes 56(12): 3033-3044.

199

Hehir-Kwa, J. Y., et al. (2015). "Exome sequencing and whole genome sequencing for the detection of copy number variation." Expert Rev Mol Diagn: 1-10.

Heijmans, B. T., et al. (2008). "Persistent epigenetic differences associated with prenatal exposure to famine in humans." Proceedings of the National Academy of Sciences 105(44): 17046-17049.

Helgason, A., et al. (2001). "mtDNA and the islands of the North Atlantic: estimating the proportions of Norse and Gaelic ancestry." The American Journal of Human Genetics 68(3): 723-737.

Helgason, A., et al. (2000). "Estimating Scandinavian and Gaelic ancestry in the male settlers of Iceland." The American Journal of Human Genetics 67(3): 697-717.

Holliday, R. and J. E. Pugh (1975). "DNA modification mechanisms and gene activity during development." Science 187(4173): 226-232.

Horikawa, Y., et al. (2000). "Genetic variation in the gene encoding calpain-10 is associated with type 2 diabetes mellitus." Nature genetics 26(2): 163-175.

Houseman, E. A., et al. (2012). "DNA methylation arrays as surrogate measures of cell mixture distribution." BMC Bioinformatics 13(1): 86.

Hoyo, C., et al. (2011). "Methylation variation at IGF2 differentially methylated regions and maternal folic acid use before and during pregnancy." Epigenetics 6(7): 928-936.

Huang, Q. Y., et al. (2006). "Linkage and association studies of the susceptibility genes for type 2 diabetes." Yi Chuan Xue Bao 33(7): 573-589.

I Naseer, M., et al. (2014). "Role of gut microbiota in obesity, type 2 diabetes and Alzheimer’s disease." CNS & Neurological Disorders-Drug Targets (Formerly Current Drug Targets-CNS & Neurological Disorders) 13(2): 305-311.

Iafrate, A. J., et al. (2004). "Detection of large-scale variation in the human genome." Nature genetics 36(9): 949-951.

Inoue, I., et al. (1997). "A nucleotide substitution in the promoter of human angiotensinogen is associated with essential hypertension and affects basal transcription in vitro." Journal of Clinical Investigation 99(7): 1786.

Itsara, A., et al. (2009). "Population analysis of large copy number variants and hotspots of human genetic disease." The American Journal of Human Genetics 84(2): 148-161.

Jackson, R. J. and N. Standart (2007). "How do microRNAs regulate gene expression." Sci Stke 367(re1).

200

Jaenisch, R. and A. Bird (2003). "Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals." Nature genetics 33: 245-254.

Jafar-Mohammadi, B. and M. I. McCarthy (2008). "Genetics of type 2 diabetes mellitus and obesity-a review." Annals of medicine 40(1): 2-10.

Jafar-Mohammadi, B. and M. I. Mccarthy (2008). "Genetics of type 2 diabetes mellitus and obesity - a review." Annals of Medicine 40(1): 2-10.

Jeon, J.-P., et al. (2010). "Copy number variation at leptin receptor gene locus associated with metabolic traits and the risk of type 2 diabetes mellitus." BMC genomics 11(1): 426.

Jeon, J. P., et al. (2010). "Copy number variation at leptin receptor gene locus associated with metabolic traits and the risk of type 2 diabetes mellitus." BMC genomics 11: 426.

Ji, B. and J. Nielsen (2015). "From Next-Generation Sequencing to Systematic Modeling of the Gut Microbiome." Frontiers in Genetics 6: 219.

Jirtle, R. L. and M. K. Skinner (2007). "Environmental epigenomics and disease susceptibility." Nature Reviews Genetics 8(4): 253-262.

Kadowaki, T., et al. (2006). "Adiponectin and adiponectin receptors in insulin resistance, diabetes, and the metabolic syndrome." Journal of Clinical Investigation 116(7): 1784.

Kahn, C. R. (1994). "Banting Lecture. Insulin action, diabetogenes, and the cause of type II diabetes." Diabetes 43(8): 1066-1084.

Kahn, C. R., et al. (1996). "Genetics of non-insulin-dependent (type-II) diabetes mellitus." Annu Rev Med 47: 509-531.

Kahn, S. E., et al. (2006). "Mechanisms linking obesity to insulin resistance and type 2 diabetes." Nature 444(7121): 840-846.

Kallioniemi, A., et al. (1992). "Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors." Science 258(5083): 818-821.

Kaminsky, Z. A., et al. (2009). "DNA methylation profiles in monozygotic and dizygotic twins." Nature genetics 41(2): 240-245.

Kaprio, J., et al. (1992). "Concordance for type 1 (insulin-dependent) and type 2 (non-insulin- dependent) diabetes mellitus in a population-based cohort of twins in Finland." Diabetologia 35(11): 1060-1067.

201

Karlsson, F. H., et al. (2013). "Gut metagenome in European women with normal, impaired and diabetic glucose control." Nature 498(7452): 99-103.

Kidd, J. M., et al. (2008). "Mapping and sequencing of structural variation from eight human genomes." Nature 453(7191): 56-64.

Kirchner, H., et al. (2013). "Epigenetic flexibility in metabolic regulation: disease cause and prevention?" Trends Cell Biol 23(5): 203-209.

Kirchner, H., et al. (2013). "Epigenetic flexibility in metabolic regulation: disease cause and prevention?" Trends Cell Biol 23(5): 203-209.

Kirov, G., et al. (2012). "De novo CNV analysis implicates specific abnormalities of postsynaptic signalling complexes in the pathogenesis of schizophrenia." Mol Psychiatry 17(2): 142-153.

Koch, C. M., et al. (2007). "The landscape of histone modifications across 1% of the human genome in five human cell lines." Genome research 17(6): 691-707.

Kooner, J. S., et al. (2011). "Genome-wide association study in individuals of South Asian ancestry identifies six new type 2 diabetes susceptibility loci." Nature genetics 43(10): 984-989.

Kruglyak, L., et al. (1996). "Parametric and nonparametric linkage analysis: A unified multipoint approach." American Journal of Human Genetics 58(6): 1347-1363.

Kryukov, G. V., et al. (2009). "Power of deep, all-exon resequencing for discovery of human trait genes." Proceedings of the National Academy of Sciences 106(10): 3871-3876.

Ku, C. S., et al. (2010). "The pursuit of genome-wide association studies: where are we now&quest." Journal of human genetics 55(4): 195-206.

Kudo, H., et al. (2011). "Frequent loss of genome gap region in 4p16.3 subtelomere in early-onset type 2 diabetes mellitus." Exp Diabetes Res 2011: 498460.

Kumar, P., et al. (2014). "Evaluation of SNP calling using single and multiple-sample calling algorithms by validation against array base genotyping and Mendelian inheritance." BMC research notes 7(1): 747.

Langenberg, C., et al. (2014). "Gene-lifestyle interaction and type 2 diabetes: the EPIC interact case- cohort study."

Langmead, B. and S. L. Salzberg (2012). "Fast gapped-read alignment with Bowtie 2." Nat Methods 9(4): 357-359.

202

Lee, H.-S., et al. (2014). "Genome-wide copy number variation study reveals KCNIP1 as a modulator of insulin secretion." Genomics 104(2): 113-120.

Lee, J. A., et al. (2007). "A DNA replication mechanism for generating nonrecurrent rearrangements associated with genomic disorders." Cell 131(7): 1235-1247.

Lehne, B., et al. (2015). "A coherent approach for analysis of the Illumina HumanMethylation450 BeadChip improves data quality and performance in epigenome-wide association studies." Genome biology 16(1): 37.

Lettre, G. (2014). "Rare and low-frequency variants in human common diseases and other complex traits." Journal of medical genetics: jmedgenet-2014-102437.

Levenson, J. M. and J. D. Sweatt (2005). "Epigenetic mechanisms in memory formation." Nature Reviews Neuroscience 6(2): 108-118.

Li, B., et al. (2007). "The role of chromatin during transcription." Cell 128(4): 707-719.

Li, B. and S. M. Leal (2008). "Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data." The American Journal of Human Genetics 83(3): 311-321.

Li, D., et al. (2011). "Using extreme phenotype sampling to identify the rare causal variants of quantitative traits in association studies." Genetic epidemiology 35(8): 790-799.

Li, E., et al. (1992). "Targeted mutation of the DNA methyltransferase gene results in embryonic lethality." Cell 69(6): 915-926.

Li, H., et al. (2013). "Association between type 2 diabetes and rs10811661 polymorphism upstream of CDKN2A/B: a meta-analysis." Acta diabetologica 50(5): 657-662.

Li, J., et al. (2015). "Genetic basis of type 2 diabetes–recommendations based on meta-analysis." European review for medical and pharmacological sciences 19(1): 138-148.

Ling, C., et al. (2008). "Epigenetic regulation of PPARGC1A in human type 2 diabetic islets and effect on insulin secretion." Diabetologia 51(4): 615-622.

Ling, C. and L. Groop (2009). "Epigenetics: a molecular link between environmental factors and type 2 diabetes." Diabetes 58(12): 2718-2725.

Ling, C., et al. (2004). "Multiple environmental and genetic factors influence skeletal muscle PGC-1α and PGC-1β gene expression in twins." Journal of Clinical Investigation 114(10): 1518.

203

Ling, Y., et al. (2011). "Associations of common polymorphisms in GCKR with type 2 diabetes and related traits in a Han Chinese population: a case-control study." BMC Med Genet 12(1): 66.

Lister, R., et al. (2009). "Human DNA methylomes at base resolution show widespread epigenomic differences." Nature 462(7271): 315-322.

Liu, L., et al. (2008). "Gene-environment interactions and epigenetic basis of human diseases." Current issues in molecular biology 10(1-2): 25.

Liu, Z., et al. (2012). "Methylation status of CpG sites in the MCP-1 promoter is correlated to serum MCP-1 in Type 2 diabetes." Journal of endocrinological investigation 35(6): 585-589.

Lucas, A. (1994). "Role of nutritional programming in determining adult morbidity." Archives of disease in childhood 71(4): 288.

Luft, R. and B. R. Landau (1995). "Mitochondrial medicine." J Intern Med 238(5): 405-421.

Lupski, J. R. (2007). "Structural variation in the human genome." New England Journal of Medicine 356(11): 1169.

Lyssenko, V., et al. (2007). "Mechanisms by which common variants in the TCF7L2 gene increase risk of type 2 diabetes." Journal of Clinical Investigation 117(8): 2155.

Lyssenko, V., et al. (2009). "Common variant in MTNR1B associated with increased risk of type 2 diabetes and impaired early insulin secretion." Nature genetics 41(1): 82-88.

Mackay, I. M., et al. (2002). "Real-time PCR in virology." Nucleic Acids Res 30(6): 1292-1305.

Maher, B. (2008). "Personal genomes: The case of the missing heritability." Nature News 456(7218): 18-21.

Maier, S. and A. Olek (2002). "Diabetes: a candidate disease for efficient DNA methylation profiling." J Nutr 132(8 Suppl): 2440s-2443s.

Manolio, T. A., et al. (2009). "Finding the missing heritability of complex diseases." Nature 461(7265): 747-753.

Marabita, F., et al. (2013). "An evaluation of analysis pipelines for DNA methylation profiling using the Illumina HumanMethylation450 BeadChip platform." Epigenetics 8(3): 333-346.

Marabita, F., et al. (2013). "An evaluation of analysis pipelines for DNA methylation profiling using the Illumina HumanMethylation450 BeadChip platform." Epigenetics 8(3): 333-346.

204

Mardis, E. R. (2008). "The impact of next-generation sequencing technology on genetics." TRENDS in Genetics 24(3): 133-141.

Marjoram, P. and D. C. Thomas (2014). "Next-Generation Sequencing Studies: Optimal Design and Analysis, Missing Heritability and Rare Variants." Current Epidemiology Reports 1(4): 213-219.

Martin, B. C., et al. (1992). "Role of glucose and insulin resistance in development of type 2 diabetes mellitus: results of a 25-year follow-up study." Lancet 340(8825): 925-929.

Matschinsky, F. M. (1996). "A lesson in metabolic regulation inspired by the glucokinase glucose sensor paradigm." Diabetes 45(2): 223-241.

Matsunami, N., et al. (2013). "Identification of rare recurrent copy number variants in high-risk autism families and their prevalence in a large ASD population." PloS one 8(1): e52239.

McCarroll, S. A. (2008). "Extending genome-wide association studies to copy-number variation." Human molecular genetics 17(R2): R135-R142.

McCarroll, S. A., et al. (2006). "Common deletion polymorphisms in the human genome." Nature genetics 38(1): 86-92.

McCarroll, S. A., et al. (2008). "Integrated detection and population-genetic analysis of SNPs and copy number variation." Nature genetics 40(10): 1166-1174.

McCarthy, M. I., et al. (2008). "Genome-wide association studies for complex traits: consensus, uncertainty and challenges." Nature Reviews Genetics 9(5): 356-369.

McQuillan, R., et al. (2008). "Runs of homozygosity in European populations." The American Journal of Human Genetics 83(3): 359-372.

Meigs, J. B., et al. (2000). "Parental transmission of type 2 diabetes: the Framingham Offspring Study." Diabetes 49(12): 2201-2207.

Menni, C., et al. (2013). "Biomarkers for type 2 diabetes and impaired fasting glucose using a non- targeted metabolomics approach." Diabetes: DB_130570.

Metzker, M. L. (2010). "Sequencing technologies—the next generation." Nature Reviews Genetics 11(1): 31-46.

Meyer, P., et al. (1994). "Evidence for cytosine methylation of non-symmetrical sequences in transgenic Petunia hybrida." Embo j 13(9): 2084-2088.

205

Meyre, D., et al. (2005). "Variants of ENPP1 are associated with childhood and adult obesity and increase the risk of glucose intolerance and type 2 diabetes." Nature genetics 37(8): 863-867.

Mills, R. E., et al. (2011). "Mapping copy number variation by population-scale genome sequencing." Nature 470(7332): 59-65.

Mohlke, K. L. and M. Boehnke (2015). "Recent advances in understanding the genetic architecture of type 2 diabetes." Human molecular genetics.

Montgomery, M. K., et al. (1998). "RNA as a target of double-stranded RNA-mediated genetic interference in Caenorhabditis elegans." Proceedings of the National Academy of Sciences 95(26): 15502-15507.

Mook-Kanamori, M. J., et al. (2014). "Elevated HbA1c levels in individuals not diagnosed with type 2 diabetes in Qatar: a pilot study." Qatar medical journal 2014(2): 106.

Moran, A., et al. (1999). "Insulin resistance during puberty: results from clamp studies in 357 children." Diabetes 48(10): 2039-2044.

Morris, A. P., et al. (2012). "Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes." Nature genetics 44(9): 981.

Murphy, R., et al. (2008). "Clinical implications of a molecular genetic classification of monogenic beta-cell diabetes." Nat Clin Pract Endocrinol Metab 4(4): 200-213.

Murray, S. S., et al. (2004). "A highly informative SNP linkage panel for human genetic studies." Nature methods 1(2): 113-117.

Murray, S. S., et al. (2004). "A highly informative SNP linkage panel for human genetic studies." Nat Methods 1(2): 113-117.

Neale, B. M. and S. Purcell (2008). "The positives, protocols, and perils of genome-wide association." Am J Med Genet B Neuropsychiatr Genet 147b(7): 1288-1294.

Newman, B., et al. (1987). "Concordance for type 2 (non-insulin-dependent) diabetes mellitus in male twins." Diabetologia 30(10): 763-768.

Ng, P. C. and S. Henikoff (2003). "SIFT: Predicting amino acid changes that affect protein function." Nucleic Acids Res 31(13): 3812-3814.

Nguyen, C. T., et al. (2001). "Altered chromatin structure associated with methylation-induced gene silencing in cancer cells: correlation of accessibility, methylation, MeCP2 binding and acetylation." Nucleic Acids Res 29(22): 4598-4606.

206

Niemitz, E. (2014). "Parent-of-origin effects." Nat Genet 46(3): 219-219.

Noble, W. S. (2009). "How does multiple testing correction work?" Nature biotechnology 27(12): 1135-1137.

O'Rahilly, S., et al. (2005). "Genetic factors in type 2 diabetes: the end of the beginning?" Science 307(5708): 370-373.

Okano, M., et al. (1999). "DNA methyltransferases Dnmt3a and Dnmt3b are essential for de novo methylation and mammalian development." Cell 99(3): 247-257.

Omori, S., et al. (2009). "Replication study for the association of new meta-analysis-derived risk loci with susceptibility to type 2 diabetes in 6,244 Japanese individuals." Diabetologia 52(8): 1554-1560.

Ooi, S. K., et al. (2007). "DNMT3L connects unmethylated lysine 4 of histone H3 to de novo methylation of DNA." Nature 448(7154): 714-717.

Pankratz, N., et al. (2011). "Copy number variation in familial Parkinson disease." PloS one 6(8): e20988.

Parikh, H., et al. (2007). "TXNIP regulates peripheral glucose metabolism in humans." PLoS Med 4(5): e158.

Parikh, H., et al. (2009). "Prioritizing genes for follow-up from genome wide association studies using information on gene expression in tissues relevant for type 2 diabetes mellitus." BMC Medical Genomics 2(1): 72.

Pearson, T. A. and T. A. Manolio (2008). "How to interpret a genome-wide association study." Jama 299(11): 1335-1344.

Petersen, A.-K., et al. (2014). "Epigenetics meets metabolomics: an epigenome-wide association study with blood serum metabolic traits." Human molecular genetics 23(2): 534-545.

Pettitt, D. J., et al. (1988). "Congenital susceptibility to NIDDM: role of intrauterine environment." Diabetes 37(5): 622-628.

Pettitt, D. J., et al. (1983). "Excessive obesity in offspring of Pima Indian women with diabetes during pregnancy." New England Journal of Medicine 308(5): 242-245.

Phillips, T. (2008). "The role of methylation in gene expression." Nature Education 1(1): 116.

207

Pinto, D., et al. (2010). "Functional impact of global rare copy number variation in autism spectrum disorders." Nature 466(7304): 368-372.

Plengvidhya, N., et al. (2012). "Identification of copy number variation of CAPN10 in Thais with type 2 diabetes by multiplex PCR and denaturing high performance liquid chromatography (DHPLC)." Gene 506(2): 383-386.

Plengvidhya, N., et al. (2015). "Detection of CAPN10 copy number variation in Thai patients with type 2 diabetes by denaturing high performance liquid chromatography and real‐time quantitative polymerase chain reaction." Journal of Diabetes Investigation.

Plomin, R., et al. (2009). "Common disorders are quantitative traits." Nature Reviews Genetics 10(12): 872-878.

Porzio, O., et al. (2007). "Missense mutations in the TGM2 gene encoding transglutaminase 2 are found in patients with early‐onset type 2 diabetes." Human mutation 28(11): 1150-1150.

Poulsen, P., et al. (1999). "Heritability of type II (non-insulin-dependent) diabetes mellitus and abnormal glucose tolerance--a population-based twin study." Diabetologia 42(2): 139-145.

Prasad, R. B. and L. Groop (2015). "Genetics of Type 2 Diabetes—Pitfalls and Possibilities." Genes 6(1): 87-123.

Pritchard, J. K. (2001). "Are rare variants responsible for susceptibility to complex diseases?" The American Journal of Human Genetics 69(1): 124-137.

Pritchard, J. K. and N. J. Cox (2002). "The allelic architecture of human disease genes: common disease–common variant… or not?" Human molecular genetics 11(20): 2417-2423.

Prokopenko, I., et al. (2009). "Variants in MTNR1B influence fasting glucose levels." Nature genetics 41(1): 77-81.

Prokopenko, I., et al. (2008). "Type 2 diabetes: new genes, new understanding." Trends in Genetics 24(12): 613-621.

Purcell, S., et al. (2007). "PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses." The American Journal of Human Genetics 81(3): 559-575.

Qian, C. and M.-M. Zhou (2006). "SET domain protein lysine methyltransferases: Structure, specificity and catalysis." Cellular and molecular life sciences CMLS 63(23): 2755-2763.

Rajpathak, S. N., et al. (2009). "The role of insulin‐like growth factor‐I and its binding proteins in glucose homeostasis and type 2 diabetes." Diabetes/metabolism research and reviews 25(1): 3-12.

208

Rakyan, V. K., et al. (2011). "Epigenome-wide association studies for common human diseases." Nature Reviews Genetics 12(8): 529-541.

Rampersaud, E., et al. (2007). "Identification of novel candidate genes for type 2 diabetes from a genome-wide association scan in the Old Order Amish: evidence for replication from diabetes- related quantitative traits and from independent populations." Diabetes 56(12): 3053-3062.

Rampersaud, E., et al. (2008). "Investigating parent of origin effects in studies of type 2 diabetes and obesity." Current diabetes reviews 4(4): 329.

Reaven, G. M. (1988). "Role of insulin resistance in human disease." Diabetes 37(12): 1595-1607.

Redon, R., et al. (2006). "Global variation in copy number in the human genome." Nature 444(7118): 444-454.

Reich, D. E. and E. S. Lander (2001). "On the allelic spectrum of human disease." TRENDS in Genetics 17(9): 502-510.

Reynisdottir, I., et al. (2003). "Localization of a susceptibility gene for type 2 diabetes to chromosome 5q34–q35. 2." The American Journal of Human Genetics 73(2): 323-335.

Rich, S. S. (1990). "Mapping genes in diabetes. Genetic epidemiological perspective." Diabetes 39(11): 1315-1319.

Rich, S. S., et al. (2008). "Genes Associated With Risk of Type 2 Diabetes Identified by a Candidate- Wide Association Scan As a Trickle Becomes a Flood." Diabetes 57(11): 2915-2917.

Ridderstråle, M. and L. Groop (2009). "Genetic dissection of type 2 diabetes." Molecular and cellular endocrinology 297(1): 10-17.

Rideout, W. M., 3rd, et al. (1990). "5-Methylcytosine as an endogenous mutagen in the human LDL receptor and p53 genes." Science 249(4974): 1288-1290.

Riggs, A. D. (1975). "X inactivation, differentiation, and DNA methylation." Cytogenetic and genome research 14(1): 9-25.

Riggs, A. D. (1975). "X inactivation, differentiation, and DNA methylation." Cytogenet Cell Genet 14(1): 9-25.

Risch, N. and K. Merikangas (1996). "The future of genetic studies of complex human diseases." Science 273(5281): 1516-1517.

209

Roden, M., et al. (1996). "Mechanism of free fatty acid-induced insulin resistance in humans." Journal of Clinical Investigation 97(12): 2859.

Rosmond, R., et al. (2003). "The Pro12Ala PPARgamma2 gene missense mutation is associated with obesity and insulin resistance in Swedish middle-aged men." Diabetes Metab Res Rev 19(2): 159- 163.

Rotimi, C., et al. (1997). "Hypertension, serum angiotensinogen, and molecular variants of the angiotensinogen gene among Nigerians." Circulation 95(10): 2348-2350.

Rudd, D. S., et al. (2014). "A genome‐wide CNV analysis of schizophrenia reveals a potential role for a multiple‐hit model." American Journal of Medical Genetics Part B: Neuropsychiatric Genetics 165(8): 619-626.

Rung, J., et al. (2009). "Genetic variant near IRS1 is associated with type 2 diabetes, insulin resistance and hyperinsulinemia." Nature genetics 41(10): 1110-1115.

Rung, J., et al. (2009). "Genetic variant near IRS1 is associated with type 2 diabetes, insulin resistance and hyperinsulinemia." Nat Genet 41(10): 1110-1115.

Salonen, J. T., et al. (2007). "Type 2 diabetes whole-genome association study in four populations: the DiaGen consortium." Am J Hum Genet 81(2): 338-345.

Sánchez, L. (2010). "Sciara as an experimental model for studies on the evolutionary relationships between the zygotic, maternal and environmental primary signals for sexual development." Journal of genetics 89(3): 325-331.

Sandhu, M. S., et al. (2007). "Common variants in WFS1 confer risk of type 2 diabetes." Nature genetics 39(8): 951-953.

Sasaki, H. and Y. Matsui (2008). "Epigenetic events in mammalian germ-cell development: reprogramming and beyond." Nature Reviews Genetics 9(2): 129-140.

Sauman, I. and S. M. Reppert (1996). "Circadian clock neurons in the silkmoth Antheraea pernyi: novel mechanisms of period protein regulation." Neuron 17(5): 889-900.

Saxena, R., et al. (2007). "Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels." Science 316(5829): 1331-1336.

Saxena, R., et al. (2007). "Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels." Science 316(5829): 1331-1336.

210

SCHAFTINGEN, E. (1989). "A protein from rat liver confers to glucokinase the property of being antagonistically regulated by fructose 6‐phosphate and fructose 1‐phosphate." European Journal of Biochemistry 179(1): 179-184.

Scherer, P. E. (2006). "Adipose tissue from lipid storage compartment to endocrine organ." Diabetes 55(6): 1537-1545.

Scherer, S. W., et al. (2007). "Challenges and standards in integrating surveys of structural variation." Nature genetics 39: S7-S15.

Schneider, E., et al. (2010). "Spatial, temporal and interindividual epigenetic variation of functionally important DNA methylation patterns." Nucleic Acids Res 38(12): 3880-3890.

Schork, N. J., et al. (2009). "Common vs. rare allele hypotheses for complex diseases." Current opinion in genetics & development 19(3): 212-219.

Schork, N. J., et al. (2008). "DNA sequence-based phenotypic association analysis." Adv Genet 60: 195-217.

Schouten, J. P., et al. (2002). "Relative quantification of 40 nucleic acid sequences by multiplex ligation-dependent probe amplification." Nucleic Acids Res 30(12): e57-e57.

Scott, L. J., et al. (2007). "A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants." Science 316(5829): 1341-1345.

Sebat, J., et al. (2007). "Strong association of de novo copy number mutations with autism." Science 316(5823): 445-449.

Sebat, J., et al. (2004). "Large-scale copy number polymorphism in the human genome." Science 305(5683): 525-528.

Seki, Y., et al. (2012). "Minireview: epigenetic programming of diabetes and obesity: animal models." Endocrinology 153(3): 1031-1038.

Sharp, A. J., et al. (2005). "Segmental duplications and copy-number variation in the human genome." The American Journal of Human Genetics 77(1): 78-88.

Shi, G. and D. Rao (2011). "Optimum designs for next‐generation sequencing to discover rare variants for common complex disease." Genetic epidemiology 35(6): 572-579.

Shi, Y., et al. (2004). "Histone demethylation mediated by the nuclear amine oxidase homolog LSD1." Cell 119(7): 941-953.

211

Shizuya, H., et al. (1992). "Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector." Proceedings of the National Academy of Sciences 89(18): 8794-8797.

Shoelson, S. E., et al. (2006). "Inflammation and insulin resistance." Journal of Clinical Investigation 116(7): 1793.

Shtir, C., et al. (2009). Copy number variation in the Framingham Heart Study. BMC proceedings, BioMed Central Ltd.

Shu, X. O., et al. (2010). "Identification of new genetic risk variants for type 2 diabetes." PLoS genetics 6(9): e1001127.

Silander, K., et al. (2004). "Genetic variation near the hepatocyte nuclear factor-4α gene predicts susceptibility to type 2 diabetes." Diabetes 53(4): 1141-1149.

Singh, S. (2015). "Genetics of Type 2 Diabetes: Advances and Future Prospect." J Diabetes Metab 6(518): 2.

Skinner, M. K. (2011). "Environmental epigenetic transgenerational inheritance and somatic epigenetic mitotic stability." Epigenetics 6(7): 838-842.

Sladek, R., et al. (2007). "A genome-wide association study identifies novel risk loci for type 2 diabetes." Nature 445(7130): 881-885.

Slosberg, E. D., et al. (2001). "Treatment of type 2 diabetes by adenoviral-mediated overexpression of the glucokinase regulatory protein." Diabetes 50(8): 1813-1820.

Stankiewicz, P. and J. R. Lupski (2002). "Genome architecture, rearrangements and genomic disorders." TRENDS in Genetics 18(2): 74-82.

Stankiewicz, P. and J. R. Lupski (2010). "Structural variation in the human genome and its role in disease." Annual review of medicine 61: 437-455.

Steegers-Theunissen, R. P., et al. (2009). "Periconceptional maternal folic acid use of 400 μg per day is related to increased methylation of the IGF2 gene in the very young child." PloS one 4(11): e7845.

Steinthorsdottir, V., et al. (2007). "A variant in CDKAL1 influences insulin response and risk of type 2 diabetes." Nature genetics 39(6): 770-775.

Storey, J. D. (2002). "A direct approach to false discovery rates." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64(3): 479-498.

212

Stranger, B. E., et al. (2007). "Relative impact of nucleotide and copy number variation on gene expression phenotypes." Science 315(5813): 848-853.

Strauch, K., et al. (2000). "Parametric and nonparametric multipoint linkage analysis with imprinting and two-locus–trait models: application to mite sensitization." The American Journal of Human Genetics 66(6): 1945-1957.

Suhre, K., et al. (2010). "Metabolic footprint of diabetes: a multiplatform metabolomics study in an epidemiological setting." PloS one 5(11): e13953.

Suzuki, M. M. and A. Bird (2008). "DNA methylation landscapes: provocative insights from epigenomics." Nat Rev Genet 9(6): 465-476.

Tabassum, R., et al. (2012). "Genetic variant of AMD1 is associated with obesity in Urban Indian Children."

Tadmouri, G. O., et al. (2009). "Consanguinity and reproductive health among Arabs." Reprod Health 6(17): 1-9.

Thomas, A., et al. (2008). "Shared genomic segment analysis. Mapping disease predisposition genes in extended pedigrees using SNP genotype assays." Annals of human genetics 72(2): 279-287.

Thompson, R. F., et al. (2010). "Experimental approaches to the study of epigenomic dysregulation in ageing." Experimental gerontology 45(4): 255-268.

Tong, Y., et al. (2009). "Association between TCF7L2 gene polymorphisms and susceptibility to type 2 diabetes mellitus: a large Human Genome Epidemiology (HuGE) review and meta-analysis." BMC Med Genet 10(1): 15.

Tönjes, A., et al. (2006). "Association of Pro12Ala Polymorphism in Peroxisome Proliferator– Activated Receptor γ With Pre-Diabetic Phenotypes Meta-analysis of 57 studies on nondiabetic individuals." Diabetes Care 29(11): 2489-2497.

Toperoff, G., et al. (2012). "Genome-wide survey reveals predisposing diabetes type 2-related DNA methylation variations in human peripheral blood." Human molecular genetics 21(2): 371-383.

Vaag, A., et al. (2014). "Genetic, nongenetic and epigenetic risk determinants in developmental programming of type 2 diabetes." Acta Obstet Gynecol Scand 93(11): 1099-1108.

Van Dam, R., et al. (2004). "The insulin receptor substrate‐1 Gly972Arg polymorphism is not associated with Type 2 diabetes mellitus in two population‐based studies." Diabetic medicine 21(7): 752-758.

213

van Ommen, G.-J. B. (2005). "Frequency of new copy number variation in humans." Nature genetics 37(4): 333-334.

Vaxillaire, M., et al. (2009). "Breakthroughs in monogenic diabetes genetics: from pediatric forms to young adulthood diabetes." Pediatr Endocrinol Rev 6(3): 405-417.

Vaxillaire, M. and P. Froguel (2008). "Monogenic diabetes in the young, pharmacogenetics and relevance to multifactorial forms of type 2 diabetes." Endocr Rev 29(3): 254-264.

Vaxillaire, M. and P. Froguel (2009). "Monogenic forms of diabetes mellitus: an update." Endocrinología y Nutrición 56: 26-29.

Vaxillaire, M. and P. Froguel (2013). Monogenic diabetes in the young, pharmacogenetics and relevance to multifactorial forms of type 2 diabetes, Endocrine Society.

Verdel, A., et al. (2004). "RNAi-mediated targeting of heterochromatin by the RITS complex." Science 303(5658): 672-676.

Vionnet, N., et al. (2000). "Genomewide search for type 2 diabetes-susceptibility genes in French whites: evidence for a novel susceptibility locus for early-onset diabetes on chromosome 3q27-qter and independent replication of a type 2-diabetes locus on chromosome 1q21-q24." Am J Hum Genet 67(6): 1470-1480.

Visscher, P. M., et al. (2012). "Five years of GWAS discovery." The American Journal of Human Genetics 90(1): 7-24.

Voight, B. F., et al. (2010). "Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis." Nature genetics 42(7): 579-589.

Volkmar, M., et al. (2012). "DNA methylation profiling identifies epigenetic dysregulation in pancreatic islets from type 2 diabetic patients." Embo j 31(6): 1405-1426.

Volpe, T. A., et al. (2002). "Regulation of heterochromatic silencing and histone H3 lysine-9 methylation by RNAi." Science 297(5588): 1833-1837.

Walsh, T., et al. (2008). "Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia." Science 320(5875): 539-543.

Walters, R. G., et al. (2013). "Rare genomic structural variants in complex disease: lessons from the replication of associations with obesity." PloS one 8(3): e58048.

Wang, H., et al. (2014). "Copy number variation detection using next generation sequencing read counts." BMC Bioinformatics 15(1): 109.

214

Wang, K., et al. (2007). "PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data." Genome research 17(11): 1665-1674.

Wei, W., et al. (2013). "A calibrated human Y-chromosomal phylogeny based on resequencing." Genome research 23(2): 388-395.

Weijnen, C. F., et al. (2002). "Risk of diabetes in siblings of index cases with Type 2 diabetes: implications for genetic studies." Diabet Med 19(1): 41-50.

Wellen, K. E. and G. S. Hotamisligil (2005). "Inflammation, stress, and diabetes." Journal of Clinical Investigation 115(5): 1111.

Wheeler, E. and I. Barroso (2011). "Genome-wide association studies and type 2 diabetes." Brief Funct Genomics 10(2): 52-60.

Wigginton, J. E. and G. R. Abecasis (2005). "PEDSTATS: descriptive statistics, graphics and quality assessment for gene mapping data." Bioinformatics 21(16): 3445-3447.

Winchester, L., et al. (2009). "Comparing CNV detection methods for SNP arrays." Briefings in functional genomics & proteomics 8(5): 353-366.

Winckler, W., et al. (2005). "Association testing of variants in the hepatocyte nuclear factor 4α gene with risk of type 2 diabetes in 7,883 people." Diabetes 54(3): 886-892.

Wong, K. K., et al. (2007). "A comprehensive analysis of common copy-number variations in the human genome." Am J Hum Genet 80(1): 91-104.

Xie, C. and M. T. Tammi (2009). "CNV-seq, a new method to detect copy number variation using high-throughput sequencing." BMC Bioinformatics 10(1): 80.

Xu, B., et al. (2008). "Strong association of de novo copy number mutations with sporadic schizophrenia." Nature genetics 40(7): 880-885.

Yan, J., et al. (2011). "Evidence for non-CpG methylation in mammals." Exp Cell Res 317(18): 2555- 2561.

Yan, S.-T., et al. (2014). "Association of calpain-10 rs2975760 polymorphism with type 2 diabetes mellitus: a meta-analysis." International journal of clinical and experimental medicine 7(10): 3800.

Yang, B. T., et al. (2012). "Increased DNA methylation and decreased expression of PDX-1 in pancreatic islets from patients with type 2 diabetes." Molecular endocrinology 26(7): 1203-1212.

215

Yang, Q., et al. (2005). "Serum retinol binding protein 4 contributes to insulin resistance in obesity and type 2 diabetes." Nature 436(7049): 356-362.

Yang, T. L., et al. (2008). "Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis." Am J Hum Genet 83(6): 663-674.

Ye, Z. Q., et al. (2010). "Analyses of copy number variation of GK rat reveal new putative type 2 diabetes susceptibility loci." PloS one 5(11): e14077.

Zaghlool, S. B., et al. (2015). "Association of DNA methylation with age, gender, and smoking in an Arab population." Clinical Epigenetics 7(1): 6.

Zarrei, M., et al. (2015). "A copy number variation map of the human genome." Nat Rev Genet 16(3): 172-183.

Zeggini, E., et al. (2008). "Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes." Nature genetics 40(5): 638-645.

Zeggini, E., et al. (2007). "Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes." Science 316(5829): 1336-1341.

Zeggini, E., et al. (2007). "Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes." Science 316(5829): 1336-1341.

Zhang, C., et al. (2006). "Variant of transcription factor 7-like 2 (TCF7L2) gene and the risk of type 2 diabetes in large cohorts of US women and men." Diabetes 55(9): 2645-2648.

Zhang, D., et al. (2015). "Interactions between Obesity-Related Copy Number Variants and Dietary Behaviors in Childhood Obesity." Nutrients 7(4): 3054-3066.

Zhang, F., et al. (2009). "The DNA replication FoSTeS/MMBIR mechanism can generate genomic, genic and exonic complex rearrangements in humans." Nature genetics 41(7): 849-853.

Zhang, X., et al. (2012). "Lysine methylation: beyond histones." Acta Biochim Biophys Sin (Shanghai) 44(1): 14-27.

Zimmet, P. (1982). "Type 2 (non-insulin-dependent) diabetes—an epidemiological overview." Diabetologia 22(6): 399-411.

216

APPENDICES

217

Appendix A

Supplementary Tables

218

Appendix A1. Results of DNA quality using both NanoDrop and Qubit

NanoDrop Qubit con Qubit con Subject con 260/280 DNA Buffer (ug/mL) (ng/ul) (ng/ul) DM_F1S1 98 1.86 0.466 93.2 53.64807 46.35193 DM_F1S2 113.7 1.86 0.449 89.8 55.67929 44.32071 DM_F1S6 71 1.86 0.438 87.6 57.07763 42.92237 DM_F1S9 97.2 1.88 1.12 224 22.32143 77.67857 DM_F1S10 115 1.87 0.825 165 30.30303 69.69697 DM_F1S11 123.9 1.88 0.403 80.6 62.03474 37.96526 DM_F2S1 154.6 1.89 0.547 109 45.87156 54.12844 DM_F2S2 206 1.85 0.556 111 45.04505 54.95495 DM_F2S4 173.7 1.82 0.412 82.4 60.67961 39.32039 DM_F2S5 128 1.88 0.553 111 45.04505 54.95495 DM_F3S1 130.2 1.87 0.402 80.4 62.18905 37.81095 DM_F3S2 74.4 1.84 0.277 55.4 90.25271 9.747292 DM_F3S3 140.6 1.83 0.532 106 47.16981 52.83019 DM_F3S4 105.2 1.83 0.418 83.6 59.80861 40.19139 DM_F3S5 133.7 1.82 0.301 60.2 83.05648 16.94352 DM_F5S1 135.2 1.87 0.508 102 49.01961 50.98039 DM_F5S2 123.7 1.85 0.353 70.6 70.82153 29.17847 DM_F5S3 149.3 1.8 0.689 138 36.23188 63.76812 DM_F5S4 101.5 1.82 0.413 82.6 60.53269 39.46731 DM_F5S6 154.7 1.86 0.368 73.6 67.93478 32.06522 DM_F7S1 119.5 1.88 0.335 67 74.62687 25.37313 DM_F7S2 120.7 1.9 0.361 72.2 69.25208 30.74792 DM_F7S3 172 1.86 0.675 135 37.03704 62.96296 DM_F7S4 141.3 1.88 0.481 96.2 51.97505 48.02495 DM_F7S5 95 1.81 0.4 80 62.5 37.5 DM_F7S6 134.9 1.87 0.524 105 47.61905 52.38095 DM_F8S1 215.2 1.82 0.708 142 35.21127 64.78873 DM_F8S2 246.6 1.88 0.767 153 32.67974 67.32026 DM_F8S3 193 1.86 0.753 151 33.11258 66.88742 DM_F8S4 253.9 1.87 0.951 190 26.31579 73.68421 DM_F8S5 194.1 1.86 0.633 127 39.37008 60.62992 DM_F9S1 186.9 1.84 1.55 310 16.12903 83.87097 DM_F9S2 193.1 1.86 0.866 173 28.90173 71.09827 DM_F9S3 207 1.84 1.29 258 19.37984 80.62016 DM_F9S4 143 1.83 0.763 153 32.67974 67.32026 DM_F9S5 156.2 1.86 0.706 141 35.46099 64.53901 DM_F9S6 194.3 1.83 1.03 206 24.27184 75.72816

219

Appendix A2. Summary of overlapping homozygous regions with linkage regions. chromosome trait Linkage Regions total No Homozygous Regions cases controls cases controls Enrichment start end of start end with with without without Position Position subjects Position Position ROH ROH ROH ROH (bp) (bp) with (bp) (bp) overlap 1 T2D 230863396 241878353 3 235148098 241878353 3 1 21 12 1.7 1 T2D 230863396 241878353 3 233422186 241878353 3 1 21 12 1.7 1 T2D 230863396 241878353 3 230863396 241878353 3 1 21 12 1.7 1 T2D 230863396 241878353 1 230863396 241878353 2 1 22 12 1.1 2 T2D 69305747 125362837 2 69305747 72206006 4 1 20 12 2.4 2 T2D 69305747 125362837 2 69305747 72206006 4 1 20 12 2.4 2 T2D 69305747 125362837 3 73368662 79018187 3 1 21 12 1.7 2 T2D 69305747 125362837 3 73368662 90237441 3 1 21 12 1.7 2 T2D 69305747 125362837 3 76610800 77827848 3 1 21 12 1.7 2 T2D 69305747 125362837 2 69305747 72206006 3 1 21 12 1.7 2 T2D 69305747 125362837 2 69305747 72206006 3 1 21 12 1.7 2 T2D 69305747 125362837 2 73368662 90237441 2 1 22 12 1.1 2 T2D 69305747 125362837 2 79458752 81668770 2 1 22 12 1.1 6 T2D 142679572 165963574 2 160560845 165963574 2 1 22 12 1.1 6 T2D 142679572 165963574 2 163371020 165963574 2 1 22 12 1.1

220

Appendix A3. Homozygous regions identified among all chromosomes chrom chromStart chromEnd empty score 1 235148098 249210707 1 1 233422186 249210707 1 1 212804965 245121226 1 1 202063456 203334839 2 1 201262432 205002356 2 1 196144843 202146978 2 1 212804965 245121226 6 1 217025120 222761771 6 1 20501974 37212617 10 1 24530316 25932186 10 1 12685681 14198401 11 1 12668321 14224999 11 1 2286408 4906358 12 1 3177480 4744206 12 2 56997660 58965211 1 2 56997660 58965211 1 2 54562343 72206006 1 2 56596007 72206006 1 2 73368662 79018187 2 2 73368662 90237441 2 2 76610800 77827848 2 2 54562343 72206006 3 2 56596007 72206006 3 2 62738883 64702832 3 2 130410719 135911422 5 2 134688815 137374734 5 2 73368662 90237441 7 2 79458752 81668770 7 2 43386134 44900846 10 2 43386134 44900846 10 2 31630846 33741470 11 2 31606670 33150389 11 2 26942156 28328311 12 2 27393030 29231980 12 2 3905982 4946157 13 2 4569789 7460260 13 3 183786678 188508345 1 3 181491628 187942700 1 3 182236744 184724719 1 3 174768342 176933061 2 3 171899309 180268361 2

221

3 175668834 176988490 2 3 101613557 111631274 4 3 101613557 103455560 4 3 171899309 180268361 7 3 177061567 179249599 7 3 171899309 180268361 10 3 172121443 173418466 10 3 146234269 152987456 11 3 147682334 151319088 11 3 146234269 152987456 12 3 131431961 146830477 12 3 52011879 53602745 18 3 52011879 53602745 18 4 111600882 114033802 2 4 107418496 116028808 2 4 111607315 114806208 2 4 67524800 69245482 3 4 67524800 69245482 3 4 74983 7384439 21 4 297932 6416437 21 5 28195854 29594583 1 5 28195854 29594583 1 5 17386955 41560079 1 5 24794619 46399093 1 5 24430108 26854137 2 5 24430108 26854137 2 5 24430108 26854137 2 5 24430108 26854137 2 5 167615217 172879282 4 5 170163908 171867327 4 5 127248485 137876920 5 5 127248485 137876920 5 5 128469491 131436486 5 5 64401565 65578788 6 5 64026367 65428707 6 5 49447784 68828296 6 5 21699126 23260501 9 5 17386955 41560079 9 5 16231273 26854137 9 5 153760704 155576931 12 5 153880459 157758162 12 5 70307464 84621404 14 5 79543299 83660322 14 5 70307464 84621404 15

222

5 75902770 79151010 15 5 49447784 68828296 16 5 61396019 62604244 16 5 49447784 68828296 19 5 60256056 61366133 19 5 49447784 68828296 20 5 55407542 57647874 20 5 49447784 68828296 21 5 52052230 53707175 21 5 13589212 17122819 24 5 16231273 26854137 24 6 27728827 29518665 1 6 27475100 29356587 1 6 27475100 29356587 1 6 27296364 29536685 1 6 27289776 29529754 1 6 27090404 29356587 1 6 26706544 29533870 1 6 26706544 28757656 1 6 26367218 29356587 1 6 26114372 27329267 4 6 27296364 29536685 4 6 27289776 29529754 4 6 27090404 29356587 4 6 26706544 29533870 4 6 26706544 28757656 4 6 26367218 29356587 4 6 44549333 45701998 16 6 44549333 45701998 16 6 44503967 45701998 16 6 20203455 21249977 19 6 20203455 21249977 19 6 20203455 21249977 19 6 20250537 21688142 19 6 160560845 166755480 28 6 163371020 166755480 28 6 23199025 25277419 7 6 23490117 25060346 7 7 111554398 114642928 1 7 96572376 132281079 1 7 114513288 118394290 1 7 86066709 88075888 2 7 85177938 87073594 2 7 85177938 87073594 2

223

7 96572376 132281079 3 7 121098085 125309522 3 7 96572376 132281079 6 7 97904505 101229969 6 7 75163169 78858362 8 7 78095582 79157384 8 7 4402953 13190921 12 7 12745968 13900784 12 8 51631199 52680282 1 8 51519246 53049602 1 8 51519246 52614373 1 8 123443180 143374261 2 8 126856247 128834403 2 8 123443180 143374261 5 8 136764962 141143227 5 8 18668608 21176834 10 8 18668608 21176834 10 9 71954858 74321381 1 9 71014600 81111931 1 9 73183281 74299442 1 9 76198452 77825937 3 9 71014600 81111931 3 9 71014600 81111931 3 9 126918033 129087114 5 9 126926107 129293900 5 10 114914665 129761520 1 10 122116629 131682860 1 10 123217600 125383033 1 10 51785542 60761957 6 10 52910580 54525678 6 11 92316799 94321986 1 11 66690454 107514320 1 11 92868270 99620792 1 11 80017466 82066388 2 11 80017466 82066388 2 11 66690454 107514320 2 11 81679838 83580633 2 11 66690454 107514320 3 11 92868270 99620792 3 11 66690454 107514320 6 11 103576791 106543882 6 11 66690454 107514320 8 11 90365458 91974552 8 11 66690454 107514320 9

224

11 84918461 86453700 9 11 39269748 40961066 12 11 40143676 42656681 12 12 85696565 88014161 1 12 85696565 88014161 1 12 85696565 92850038 1 12 85696565 98881228 1 12 127502469 133584643 2 12 118181289 131586886 2 12 128066314 132157473 2 12 51190692 79713434 3 12 72683229 79713434 3 12 77635414 78934816 3 12 85696565 92850038 8 12 85696565 98881228 8 12 92177784 93507069 8 12 80994495 82625012 10 12 80994495 83546071 10 12 37858073 40264547 15 12 37858073 40264547 15 13 40984601 55668296 2 13 41656159 43329817 2 14 79945162 81945020 1 14 79945162 81945020 1 14 56920900 88317375 1 14 77003815 80874587 1 14 56920900 88317375 3 14 59083515 63336953 3 14 59109685 65789077 3 14 91749595 98121716 4 14 91749595 98121716 4 14 92814389 95233824 4 14 71326681 75722496 5 14 56920900 88317375 5 14 71168940 72219146 5 14 56920900 88317375 8 14 84509670 90229409 8 14 55864130 57032961 14 14 56920900 88317375 14 14 46610740 48925919 15 14 47236994 48647168 15 15 95538232 98159737 1 15 95538232 98159737 1 15 96671173 98072046 1

225

15 42492689 44457653 2 15 42444610 44365946 2 15 42432355 44485276 2 15 41106900 42925825 2 15 41106900 42925825 2 15 84267587 85498209 4 15 82244299 93731304 4 15 67174607 93731304 4 15 82244299 93731304 5 15 81133977 82256969 5 15 67174607 93731304 5 15 45995757 47975966 6 15 45995757 47975966 6 15 46747377 48539026 6 15 46860288 47990266 6 15 67174607 93731304 10 15 82244299 93731304 10 15 67174607 93731304 13 15 72070067 74080021 13 15 67174607 93731304 14 15 66441275 67909473 14 15 66441275 67909473 14 15 46747377 48539026 23 15 48387530 51684411 23 16 53545118 65856443 1 16 53540295 65856443 1 16 81820323 82949155 2 16 82027399 86642680 2 17 72611712 74097331 1 17 71831784 77369333 1 17 71831784 77369333 2 17 75192907 79415652 2 17 43364914 45879377 4 17 45340421 47196790 4 17 37036554 40307168 5 17 37036554 40307168 5 17 4448756 6021541 7 17 4493126 5948604 7 18 50448761 51578568 1 18 38343028 51291819 1 18 38343028 51291819 2 18 30894482 44354242 2 18 18969765 22898482 7 18 20781212 21850617 7

226

19 46118127 48511888 1 19 46118127 48511888 1 19 39297624 41697945 2 19 39636847 42107792 2 20 5223014 7294846 1 20 5223014 7294846 1 20 2491209 5243042 1 20 252757 5243042 1 20 252757 6255188 1 20 40481388 62365866 2 20 48579348 60240753 2 20 55294330 57362524 2 20 43786336 44967742 3 20 29507776 45881388 3 20 29507776 44004777 3 20 40481388 62365866 3 20 29507776 45881388 4 20 29507776 44004777 4 20 38484460 41411340 4 20 40481388 62365866 4 20 40481388 62365866 8 20 48579348 60240753 8 20 52820317 55276800 8 20 40481388 62365866 9 20 48579348 60240753 9 20 48678378 52492228 9 20 29507776 45881388 13 20 29507776 44004777 13 20 29507776 36557237 13 20 22076189 26278250 14 20 22076189 26278250 14 20 22076189 26278250 14 20 17479730 20889934 15 20 13069774 20889934 15 20 15843015 20889934 15 21 32505836 36203846 1 21 32505836 36203846 1 21 32762015 42381055 1 21 34358720 36673040 1 21 43335112 47979176 3 21 43335112 47747579 3 21 45992287 47979176 3 21 43335112 47979176 4 21 43335112 47747579 4

227

21 36235739 41353003 5 21 32762015 42381055 5 21 34358720 36673040 5 21 22393029 24810710 11 21 22393029 24810710 11 22 22597173 48705291 1 22 38793267 44699533 1 22 40083730 42873942 1 22 22597173 48705291 2 22 28181399 35795413 2 22 34005507 35592615 2 22 30597810 32325460 3 22 28181399 35795413 3 22 22597173 48705291 3 22 23871042 25393476 4 22 23871042 25392832 4 22 22597173 48705291 4

228

Appendix B

Supplementary Figures

229

Appendix B1. Information content plots of my custom marker map from Illumina 2.5M (left) panel and Illumina panel V (right)

230

231

232

233

234

235

236

237

Appendix B2. NPL and variance component linkage peaks using custom Illumina 2.5M map

238

239

240

241

242

243

244

245

Appindex B3. Linkage peaks differences using custom Illumina 2.5M markers and Illumina panel V markers. The highest linkage peaks obtained from Illumina 2.5M map were compared with Illumina panel V map

246

247

Appendix C

Forms

248

موافقة مستبينة للمشاركة بدراسة بحث طبى GENERIC SIGNED CONSENT FORM

رقم السجل: :HC NO إسم المريض: :PATIENT NAME تاريخ الميالد: :DOB الجنس )ذكر | أنثى( : :GENDER الجنسية: :NATIONALITY

كمشارك فى هذا البحث العلمى لك مطلق الحرية في طرح أى سؤال أو You are free to ask as many questions as you إستفسار عن هذا البحث وذلك قبل , أثناء إجراء, أُو بعد إكمال إجراء ,like before, during or after in this research البحث إذا قررت إعطاء الموافقة على المشاركة في هذا البحث. الهدف you decide to give consent to participate in الرئيسى من المعلومات الواردة في هذا النموذج هو أن نقدم لكم الشرح this research study. The information in this الوافى والمستفيض عن كل األخطار والفوائد التى يمكن أن تتمخض عن form is only meant to better inform you of all إجراء هذا البحث. المشاركة فى هذا البحث عمل طوعى خالص وبالتالى possible risks or benefits. Your participation in لكم مطلق الحرية بعدم المشاركة. قراركم بعدم المشاركة فى هذا البحث this study is voluntary. You do not have to العلمى ال يترتب عليه اى تبعات او حرمان من حقوقكم المستحقة. أيضا take part in this study, and your refusal to يمكنكم االنسحاب وعدم مواصلة المشاركة فى هذا البحث فى أى وقت أو participate will involve no penalty or loss of مرحلة دون أن يؤثر ذلك فى حقوقكم أو فوائدكم المستحقة والمشرعة. ألعضاء فريق البحث العلمى الخاص بهذه الدراسة الحق في إيقاف أو rights to which you are entitled. You may إلغاء مشاركتكم في هذه الدراسة إذا رأو مصلحة لكم فى هذا اإليقاف أو withdraw from this study at any time without اإللغاء أو فى حالة عدم التزامكم بخطة البحث الموضوعة أو إذا تبين لهم penalty or loss of rights or other benefits to ضرر أو إصابة نتيجة إجراء الدراسة وذلك دون أخذ موفقتك (which you are entitled. The investigator(s may stop your participation in this study without your consent for reasons such as: it will be in your best interest; you do not follow the study plan; or you experience a study- related injury.

عنوان المشروع: :Project Title

التحليل الجيني لمرض السكري من النوع الثاني بين العوائل القطرية Genetic Analysis of Type 2 Diabetes Among Qatari Families

اسم الباحث الرئيسي : :Name of Principal Investigator

الدكتور/ كارستن زورى Dr. Karsten Suhre

249

موقع إجراء البحث وأرقام الهواتف )أثناء أوقات الدوام, بعد الدوام و في Location and phone numbers: [provide العطالت(: appropriate daytime contact information and after-hours or on weekends] البرفسور الدكتور/ كارستن زورى مدير مختبر المعلوماتيه الحيويه Prof. Dr. Karsten Suhre كليه طب وايل كورنيل بقطر Director Bioinformatics Core مؤسسه قطر – المدينه التعليميه Weill Cornell Medical College in Qatar صندوق بريد Qatar Foundation-Education City 24144 الدوحه، دوله قطر P. O. Box 24144 هاتف: 44928482-974+ Doha, State of Qatar جوال: 33541843-974+ Phone: +974 4492 8482 بريد الكترونى:[email protected] Mobile: +974 3354 1843 [email protected]

يجب ملء كل البنود أدناه وفي حال عدم توافر اإلجابة الرجاء كتابة غير Each item given below has to be filled. Please متوفر. write NA, if not applicable

1. مقدمة عن البحث الطبى :Introduction to the research

سيتم اجراء هذه الدراسة في كلية طب وايل كورنيل بالتعاون مع الجمعية This study will be conducted Weill Cornell القطرية لمرضى السكري وسيقوم الباحثون في هذه الدراسة على تجربة Medical College Qatar (WCMC-Q) in تقنيات تحليلية جديدة قد تساعد على تقديم مساعدة مستقبليه أفضل collaboration with the Qatar Diabetes لتشخيص مبكر لمرض السكري. باالضافه الى ذلك، سيتم تحديد صفات Association (QDA). In this study, the وراثية جديدة من المحتمل أن تكون لها أهمية في االصابة بمرض السكري researchers will investigate new analytical من النوع الثاني بين العائالت في قطر. لقد تمت دعوتك وأفراد أسرتك techniques that may in the future provide help للمشاركة في هذه الدراسة وذلك الصابتك واصابة أفراد أسرتك بمرض a better and earlier diagnosis of diabetes. Also, السكري من النوع الثاني مما قد يزيد من خطر اصابة االجيال القادمة من the study will identify new potential genetic بين أفراد أسرتك بمرض السكري من النوع الثاني. variants that might be of etiological significant with familial type 2 diabetes. We have been invited because you and your family member are affected with T2D, which may increase the قبل أن تقرر، نشجعكم على التحدث الى أي شخص كنت تشعر بالراحة .risk of T2D among your other family members معه حول هذا البحث. قد يكون هناك بعض الكلمات المستخدمة غير مفهومة. من فضلك أوقفنى في أي وقت، ونحن نمضي خالل المعلومات Before you decide, you are encouraged to talk وأسألنى لشرح مزيد مما ال تفهمه حول هذا البحث. إذا كان ال يزال لديك to anyone you feel comfortable with about .أسئلة الحقا، يمكنك أن تسأل في أي وقت this research. There may be some words I use that you may not understand. Please stop me at any time, as we go through the information and ask me to further explain what you do not understand about this research. If you still في أي وقت قبل أو أثناء البحث ، يمكنك أن تسأل عن رأي ثان حول .have questions later, you can ask at any time مشاركتكم من طبيب آخر ليس جزءا من هذه الدراسة. At any time before or during the research, you may ask for a second opinion about your الرجاء التفكير جيداً عند اتخاذ قرار بالمشاركة , علماً أن مشاركتك في participation from another doctor who is not هذه الدراسة يعد أمراً طوعياً بحتاً . ولك مطلق الحرية بعدم المشاركة .part of this study 250

كما يمكنك التوقف عن المشاركة في الدراسة في أي وقت تشاء دون أن . يؤثر ذلك على المزايا التي تستحقها .Please take your time to make your decision Taking part in the study is entirely voluntary. You may decide not to participate in the study or you may decide to stop participating in the study at any time without loss of any benefits to which you are entitled.

2. الغرض من إجراء دراسة البحث :Purpose of the research

اليوم قد يأخذ طبيبك عينه دم من وريدك لتحديد محتواها من السكر Today your doctor may take blood from your والدهون ، الكولسترول، والمكونات األخرى التي تستخدم لمنشأة التقنيات ,veins to determine its content of sugar, fat الطبية. قد تشير معايير الدم الغير أعتياديه إلى الطبيب بمشكلة طبيه cholesterol, and other ingredients using محتمله established medical techniques. Unusual blood measures may then indicate to the doctor a potential medical problem.

Recent technical developments have أدخلت التطورات التقنية الحديثة تقنيات تحليلية جديدة تسمح لقياس ما introduced new analytical techniques that يصل الى 500 من مختلف مكونات الدم الجديده. هذه التقنيات والذى allow to measure up to 500 different new يطلق عليها االيض )التمثيل الغذائى(، ال تزال تحت التجربة وتحتاج إلى ingredients of the blood. These techniques, إثبات قيمتها قبل استخدامها في مقاييس الرعاية الطبي. called metabolomics, are still experimental and need to prove their value before being used in standard medical care.

باالضافة الى ذلك سيتم االستفادة من أحدث أنواع تكنولوجيا التحليل In addition, We will take the benefit from the الجيني )الجين هو عبارة عن الوحدة األساسية لنقل المعلومات الوراثية advances of genetic analysis technologies بين الكائنات الحية، وقد تم التعرف على العديد من األمراض الوراثية عند genes are molecular units that pass hereditary) حدوث طفرات جينية( لتحديد عوامل جينية نادرة من المحتمل أن يكون لها information in living organisms; genetic عالقة في االصابة بمرض السكر من النوع الثاني بين العوائل القطرية. mutations have led to the diagnosis of different inherited diseases) to identify rare genetic factors that might be associated with type 2 diabetes.

والغرض من هذا البحث هو معرفة ما إذا كانت هذه البيانات الجديدة هى The purpose of this research is to find out حقا مفيده لطبيبك لتشخيص حالتك ،الخاصه باالضافة الى التعرف على whether this new data really is helpful to your صفات وراثية جديدة ذات أهمية في حدوث االصابة بمرض السكري من .doctor in the diagnosis of your particular case النوع الثاني. As well as, identifying new potential genetic factors associated with type 2 diabetes.

251

3. إختيار المشاركين بالدراسة :Selection of research subjects

لقد ُطلب منك المشاركة في هذه الدراسة حيث أننا نعتقد أن حالتك قد تسهم You have been asked to participate in this في تقييم قيمةهذه التقنية الجديدة في تشخيص مرض السكر. study since we believes that your case may contribute to evaluate the value of these new genetic and metabolomics technique in the diagnosis of diabetes.

نحن نتوقع دراسه 12 عائلة من قطر. .We expect to study a total of 12 families

4. ع ٌرف الفرق بين خدمة الرعاية اإلعتياد ّية واألنشطة البحث ّية Distinction between routine care and research activities:

هذا البحث ال يغيرمن رعايتك الروتينية. نحن ندعوكم للتبرع بكمية This research does not change your routine صغيرة من الدم مقدار 10 مل )2 ملعقتي طعام(. care. We are inviting you to donate small amount of blood (2 tablespoons)

من المهم أن تعلم بأنه لن يتم اعالمك بنتائج الدراسة الن جميع التحاليل It is important to know that results of the ستكون تجريبية وليس لها أي عالقة مباشرة بالرعاية الطبية التي تتلقاها. study will not be shared with you because all the tests are experimental and not relevant to the clinical care you receive.

عدم موافقتك أو رفضك التبرع بالدم الجراء البحث لن يؤثر على نوعية الرعاية الطبية المقدمة لك في المستقبل. Neither consenting to donate nor refusing to donate your blood for research will affect the quality of any of your future medical care. أنت لن تكون مسؤوالً عن أى تكاليف للقيام بإجراءات هذا البحث. You will not be responsible for the costs of any procedures done for the research.

5.وصف اإلجراءات : :Description of the procedures

سوف يطلب منك التبرع بكمية صغيرة من الدم )حوالي ملعقتي طعام(، You will be invited to donate a small amount واللعاب )عن طريق مضغ قطعة من القطن(.كما سوف نقوم بقياس وزنك of blood (about 2 tablespoons) and saliva (by و طولك. chewing a cotton bud). We will also measure your weight and height.

6. وصف للمخاطر واإلزعاج الناجمة عنه Description of the risks and discomfort involved:

قد تشعر بعدم االرتياح جراء سحب الدم )مثل األلم، التورم أو الحكة( Withdrawing your blood may cause discomfort بسبب وخز اإلبرة. وقد ي ٌنشأ عن ذلك كدمات تحت موضع الجلد بالقرب من such as pain, swelling or itchiness) from the) مكان الوخز. أما بالنسبة بقياس الوزن والطول واألسئلة، فال يوجد مخاطر needle stick. Also you may develop a bruise متعلقة بها. under the skin near the puncture site. The questionnaires, weight and height measurements, and arm scan will not pose any risk.

252

7. وصف إجراءات و إحتياطات السالمة Description of safety precautions in this research:

سوف نقوم باستخدام تقنيات حديثة تعتمد على المعايير تتعلق بعملية جمع We will use standard techniques for collecting الدم ، كما سوف نبين لك في هذا النموذج كيف نعمل على حماية your blood. Later in this form, we will describe خصوصيتك المرضية والحفاظ على سرية المعلومات من اجل تجنب انتهاك how we will protect your private medical خصوصيتك .information to avoid risks to your privacy

8. وصف لفوائد المشاركة بالدراسة إن وجدت :Descriptions of the benefits of the study

يجب أن تعلم بأن نتائج الدراسة ستساعد على التعرف وعالج مرضى You should know that the results of the study السكر فالمستقبل و أنه من المحتمل ان ال توجد أية فوائد مباشرة لكم ، may help identify, treat and manage diabetes ولكن مشاركتكم ذات قيمة بالنسبه لنا. يجب أن تكون على بينة من الحقيقة for future patients. Also, your routine care can أننا ال نستطيع ضمان ان يكون لكم في نهاية المطاف مساعدة طبية في be fully assured without your participation in تشخيص الحالة الخاصة بك. this study, and that there is no direct benefit to your personal treatment.

9. وصف اإلجراءات أوالعالج البديل لهذه الدراسة Description of the alternative procedures or treatments for this research: غير مطابق NA

10. تفاصيل عن خيارات البقاء على العالج المتبع أثناء فترة البحث حتى Details of the options to remain on the بعد انتهاء البحث research treatment after termination of the research غير مطابق NA

11. تفاصيل عن الشخص الممكن اإلتصال به في حالة وجود استفسار Details of the person to contact in case of .11 أو حدوث إصابة خالل فترة البحث Injury or enquiry during the research

If you are injured as a direct result of research إذا حدث لك إصابه كنتيجة مباشرة من إجراءات البحث، يرجى األتصال procedures, please contact Dr. Karsten Suhre بالدكتور كارستون زوري في كلية طب وايل كورنيل على under 44928482 or 33541843. 44928482 أو 33541843 If you are injured as a direct result of research procedures, we will arrange appropriate care إذا حدث لك إصابه كنتيجة مباشرة من إجراءات البحث ، وسوف نرتب at HMC. If you seek care outside HMC, such الرعاية المناسبة في مؤسسة حمد الطبية. إذا كنت تسعى لتلك الرعاية care will be at your expense. Compensation is .خارج مؤسسة حمد الطبية، ستكون على نفقتك not normally available in case of injury. .عادة ال يتوفرالتعويض فى حالة اإلصابة

If you have questions about your rights as a اذا كان لديك استفسار حول حقوقك كمشارك بالبحث ، الرجاء االتصال research subject, please contact the WCMC-Q بكلية طب ويل كورنيل : هاتف 44928434 Office of Research at 4492-8434 or irb@qatar- [email protected]. med.cornell.edu.

253

12. تفاصيل عن التعويضات المال ّية أو غيرها المحتمل إعطائها Details of the financial or other compensation للمشاركين في البحث which might be provided to the research participants if any:

ليس هناك تعويضات مالية للمشاركة في هذه الدراسة. وكرمز لتقديرنا There is no financial compensation for the سوف نقدم لك كهدية جهاز قياس نسبه السكرى فى الدم. participation in this study. As a token of gratitude we shall provide you with a free glucometer.

هذا البحث قد يؤدي إلى تقديم منتجات تجارية جديدة، مثل عالجات جديدة This research may result in new commercial أو اختبارات تشخيصية. عادة المشاركين ال يحصلون على تعويضات مالية products, such as new treatments or عن المنتجات التجارية الذى تم تطويرها كنتيجة للبحث. diagnostic tests. Subjects normally do not receive financial compensation for commercial products developed as a result of research.

سنقوم تعويضك عن أي نفقات السفر التي قمت بها )الحد األقصىWe will reimburse you for any travel expenses َ 50 لاير قطري(. .(you have made (up to QR 50

13. مدة إجراء البحث Duration of the research

يستغرق هذا البحث حوالي ساعة ، والمشاركة مرة واحدة فقط. اإلجراء The research will take 1 hour and you will only الوحيد هو إعطاء اإلذن. لن يتم اجراء أي اختبارات أخرى كما لن يطلب .participate one time منك تقديم أي التزامات آخرى.

14. أسماء مصادر تمويل :Names of the sponsors of the research

غير مطابق NA

15. السرية حول النتائج، العينةالمختبرية أو أي بيانات أخرى :Assurance of anonymity and confidentiality = You will not be identified in any publications لن يتم التعرف عليك من خالل أي منشورات أو محاضرات حول هذه .or presentations about this study الدراسة Your anonymous coded sample may be sent . قد يتم إرسال عيناتك إلى خارج البالد للقيام بتحاليل معينة، لكن ليكن في overseas for specific analyses, but all data will علمك بأن جميع المعلومات الخاصة بك سوف يكون محافظ عليها بأمان always be stored securely at either WCMC-Q. في كلية طب وايل كورنيل قطر.

Information that identifies you will only be

قد نستخدم المعلومات التي تكشف عن هويتك ولكن بعد أخد أذن مكتوب revealed with your written permission or منك،أو بسب حمايتك ، أو عندما يتطلب القانون ذلك، أو عند احتياج الناس ,when it would protect you, the law requires it لها or when the following people need it. These people are bound by rules of confidentiality: هؤالء الناس ُملتزمون بقواعد السريه: • Members of the research team from WCMC-Q staff who help run the study.

254

- أعضاء فريق البحث وآخرون من الموظفين بطب وايل كورنيل الذين Representatives of WCMC-Q and Qatar • يساعدون إلجراء الدراسة Diabetes Association who make sure the study - ممثلين من كليه طب وايل كورنيل بقطر و الجمعية القطرية لمرضى is done properly and that your rights and السكري سيتأكدون من أتمام الدراسة بشكل صحيح وحمايه حقوقك و .safety are protected سالمتك.

16. تـنويه بعدم القسرية :Non-coercive disclaimer

المشاركه آختياريه. وفي حال الرفض بالمشاركه لن يترتب عنها أي Participation is voluntary. If you refuse to عقوبة، ولن تفقد اية مزايا او حقوق لك. أنت غير ملزم للمشاركة في هذا participate, there will not be any penalty and البحث إذا كنت ال ترغب في القيام بذلك. وان عدم رغبتك بالمشاركة لن you will not lose any benefits to which you are يؤثر على عالقتك مع كلية طب وايل كورنيل او الجمعية القطرية لمرض entitled. You do not have to take part in this السكري او اي نوع من الرعاية الطبية في المستقبل research if you do not wish to do so. Refusing to participate will not affect any relationship . you have with WCMC-Q or QDA.

17. إمكانية إنسحاب المشارك من الدراسة أو البحث دون عواقب Option to withdraw from the study without penalty

يمكنك التوقف عن المشاركة في البحث في أي وقت. إذا تركت الدراسة، You may stop participating in the research at لن يكون هناك أي عقوبة، ولن تخسر أي فوائد تحق لك. االنسحاب لن any time. If you leave the study, there will not يؤثر على عالقتك مع كليه طب وايل كورنيل. be any penalty and you will not lose any benefits to which you are entitled. Withdrawal will not affect any relationship you have with WCMC-Q.

يمكنك تغير رأيك وسحب العينه الخاصه بك من الدراسة عن طريق You can change your mind and withdraw your االتصال بالباحثين وسيتم ازالة عيناتك من مركز التخزين. samples from the study by contacting the investigators and we can remove your samples

في حالة الموافقة على المشاركة في أي أبحاث مستقبلية سوف يتم تخزين .from storage العينات بإستخدام رموز سرية ال تشير الى صاحب العينة، ونتيجة لذلك لن If you agree to use your samples for future نتمكن من ازالة عيناتك من مركز التخزين اذا قمت بتغيير رأيك. ,research, they will be stored and de-identified and cannot be traced back to you in any way. Then we will no longer be able to remove your . samples from the storage.

255

18. تفاصيل عن إنهاء الدراسة أو البحث :Details about termination of the study

ونود الحصول على إذنك لالتصال بك للمشاركة في دراسات مستقبلية. ال We would like your permission to contact you يزال يمكنك المشاركة في هذه الدراسة حتى لو كنت ال تعطي إذن لالتصال about participating in future studies. You may بك في المستقبل. يرجى اإلشاره الختيارك بوضع باألحرف األولى من still participate in this study even if you do not أسمك في الفضاء المناسب أدناه give permission for future contact. You may also change your mind at any time. Please indicate your choice by initialing the appropriate space below: ____نعم، يمكنك األتصال بى عن البحوث والدراسات المستقبلية

_____YES, you may contact me about future ____ال، يمكنك األتصال بى عن البحوث والدراسات المستقبلية. research studies.

_____NO, you may NOT contact me about future research studies.

19. تفاصيل حول البحوث المستقبلية األخرى: :Details about other future research

أثناء الدراسة، سيتم اإلحتفاظ وإستخدام العينات الخاصة بك، نود أن During the study, your samples will be kept نحتفظ بما تبقى من العينات إلستعمالها في أبحاث مستقبلية. قد يكون هذا and used. We would like to keep your leftover البحث ليس لديه أي عالقة مع العالمات البيولوجية للسكري . samples for other future research. This research may relate to topics outside of diabetes biomarker research.

هذا غير إجباري. يمكنك أن تشارك في هذه الدراسة حتى لو كنت ال تسمح This is optional. You may participate in this بأن العينات الخاصة بك أن تستخدم في المستقبل. study even if you do not allow your samples إذا كنت ال تسمح بتخزين العينات الخاصة بك للبحث في المستقبل، سوف نقوم بتدميرها بعد إنتهاد هذه الدراسة. for future use. If you do not allow storage of your samples for future research, we will destroy the samples after the end of this study. سيتم تخزين العينات الخاصه بك في مكان آمن و ستعطى العينات أرقام خاصة دون االشارة الى اسم صاحب العينة. ,Your samples will be stored and de-identified and cannot be traced back to you in any way.

_____YES, I allow storage of my samples for future research. _____نعم، أسمح بتخزين عيناتي ألبحاث أخري في المستقبل. _____NO, I do NOT allow storage of my samples for future research. _____ال، أنا ال أسمح بتخزين عيناتي ألبحاث أخرى في المستقبل.

256

20. تفاصيل عن الحاالت التي قد يكون فيها افشاء المعلومات غير Details of the instances in which there might مكتمل be incomplete disclosure of information

غير مطابق NA

الموافقة المستبينة للمشاركة فى البحث Signed Consent for Study Participation

*الموافقه المستبينة الخطية: عزيزي المشارك في هذا البحث قد قمت Consent: You (the participant) have read or بقراءة أو قد قرىء عليك المرفق أعاله. د. كارستن زورى أو من ينوب .have had read to you all of the above. Dr عنه قد قام بإعطائك شرح للدراسة متضمنا اسباب القيام بهذه الدراسة و Karsten Suhre or his/her authorized اإلجراءات المتضمنة في هذه الدراسة. كما تم إعطائك شرح عن اإلزعاج representative has provided you with a و المخاطر و الفوائد المحتملة من وراء هذه الدراسة و إذا ما كان لك description of the study including an خيارات بديلة للعالج. لك حق االستفسار و طرح أي سؤال متعلق بهذه explanation of what this study is about, why it الدراسة أو مشاركتك فيها. تم القيام بتقديم شرح عن حقوقك كمشارك في is being done, and the procedures involved. هذه الدراسة . The risks, discomforts, and possible benefits of this research study, as well as alternative treatment choices, have been explained to you. You have the right to ask questions related to this study or your participation in و ان موافقتك على المشاركة في هذا البحث هي عمل تطوعي. بالتوقيع this study at any time. Your rights as a على هذه الورقة, إقرار منك بالموافقة على المشاركة في هذا البحث. ,research subject have been explained to you سوف تعطى لك نسخة من هذا اإلقرار بالموافقة. and you voluntarily consent to participate in this research study. By signing this form, you willingly agree to participate in the research توقيعك على هذه الوثيقة يعتبر صالح في حال تم تجديد هذا البحث طوال study described to you. You will receive a copy فترة الدراسة البحثية. سوف يتم إبالغك في حال حصول أي تغيير في of this signed consent form. As long as the الدراسة مما قد يؤثر على موافقتك على المشاركة. study is renewed as required by the IRB, your signature on this document is valid for the duration of the entire research study. Should any changes occur during the course of the study that may affect your willingness to participate, you will be notified

257

إسم: المشارك ، الوالد التوقيع وتاريخه Participant / Parent(s)/ Signature & Date )الوالدين( أو الوصي Guardian’s Nam

إسم الطفل المشارك Child's Name Signature & Date

إسم الشاهد التوقيع وتاريخه Witness Name Signature & Date

إسم الباحث الرئيسي التوقيع وتاريخه Principal Investigator’s Signature & Date Name

إستخدام مركز األبحاث الطبية فقط For use of Medical Research Center only

258

Diabetes Interview 1. Patient/Study ID (please double check) *

2. What is the total number of previous married years?

3. What type of diabetes do you have?

 Type 1 diabetes  Type 2 diabetes  Gestational diabetes  Other

4. How did your doctor diagnose you with diabetes?

 Fasting plasma glucose ≥ 7 mM  hr post-prandial glucose ≥ 11.1 mM  75g oral glucose tolerance test  Diabetes symptoms (thirst, blurred vision, etc) AND random plasma glucose ≥ 11.1 mM  I don't know/ I don't remember

5. Approximately how often do you visit a doctor for your diabetes?

 5 or more times a year  to 4 times a year  1 or 2 times a year  Once every 2 or 3 year

6. How often do you check your blood sugar yourself?

 or more times a day  Once a day  About 2-6 times a week  About 2-3 times a month  Once a month or less  Never

7. How many months ago was the last time your HbA1c is measured by your doctor?

8. How high was your last HbA1c level (%)?

9. Do you use insulin?  Yes  No

10. What kind of insulin do you use?

 Novorapid  Novomix  Levemir  Lantus

259

 Apidra  Humalog  Humalog Mix  Actrapid/Humuline Regular/Insuman Rapid  Actraphane/Humline/Mixtard  Insulatard/Humuline NPH  Other

11. How many units of insulin do you use, on average, per day?

12. Are you taking any oral anti-diabetic drugs?

 Yes  No

13. What kind of oral anti-diabetic medication do you take?  Acarbose/Glucobay  Exenatide/Byetta  Glibenclamide  Gliclazide/Diamicron  Glimepiride/Amaryl  Liraglutide/Victoza  Metformin  Pioglitazon/Actos  Repaglinide/Novonorm  Rosiglitazon/Avandia  Saxagliptin/Onglyza  Sitagliptin/Januvia  Tolbutamide  Vildagliptin/Galvus  Other

14. What daily dosage do you use (mg/day)? Please state per medication.  Medication:  Medication:  Medication:

15. How many times do you have a hypoglycaemic episode?

 or more times a day  Once a day  About 2-6 times a week  About 2-3 times a month

16. Have you ever had the following complications? Yes No  Diabetic coma  Insulin shock

260

 Heart disease  Kidney disease  Eye trouble (retinopathy)  High blood pressure  Nerve disorder  Slow healing wounds  Skin infection  Other complication

Thank You!

261

ELSEVIER LICENSE TERMS AND CONDITIONS

Jul 08, 2015

This is a License Agreement between Wadha Al Muftah ("You") and Elsevier ("Elsevier") provided by Copyright Clearance Center ("CCC"). The license consists of your order details, the terms and conditions provided by Elsevier, and the payment terms and conditions.

INTRODUCTION

1. The publisher for this copyrighted material is Elsevier. By clicking "accept" in connection with completing this licensing transaction, you agree that the following terms and conditions apply to this transaction (along with the Billing and Payment terms and conditions established by Copyright Clearance Center, Inc. ("CCC"), at the time that you opened your Rightslink account and that are available at any time at http://myaccount.copyright.com).

GENERAL TERMS

2. Elsevier hereby grants you permission to reproduce the aforementioned material subject to the terms and conditions indicated.

3. Acknowledgement: If any part of the material to be used (for example, figures) has appeared in our publication with credit or acknowledgement to another source, permission must also be sought from that source. If such permission is not obtained then that material may not be included in your publication/copies. Suitable acknowledgement to the source must be made, either as a footnote or in a reference list at the end of your publication, as follows:

"Reprinted from Publication title, Vol /edition number, Author(s), Title of article / title of chapter, Pages No., Copyright (Year), with permission from Elsevier [OR APPLICABLE SOCIETY COPYRIGHT OWNER]." Also Lancet special credit - "Reprinted from The Lancet, Vol. number, Author(s), Title of article, Pages No., Copyright (Year), with permission from Elsevier."

4. Reproduction of this material is confined to the purpose and/or media for which permission is hereby given.

5. Altering/Modifying Material: Not Permitted. However figures and illustrations may be altered/adapted minimally to serve your work. Any other abbreviations, additions, deletions and/or any other alterations shall be made only with prior written authorization of Elsevier Ltd. (Please contact Elsevier at [email protected])

6. If the permission fee for the requested use of our material is waived in this instance, please be advised that your future requests for Elsevier materials may attract a fee.

7. Reservation of Rights: Publisher reserves all rights not specifically granted in the combination of (i) the license details provided by you and accepted in the course of this licensing transaction, (ii)

262

these terms and conditions and (iii) CCC's Billing and Payment terms and conditions.

8. License Contingent Upon Payment: While you may exercise the rights licensed immediately upon issuance of the license at the end of the licensing process for the transaction, provided that you have disclosed complete and accurate details of your proposed use, no license is finally effective unless and until full payment is received from you (either by publisher or by CCC) as provided in CCC's Billing and Payment terms and conditions. If full payment is not received on a timely basis, then any license preliminarily granted shall be deemed automatically revoked and shall be void as if never granted. Further, in the event that you breach any of these terms and conditions or any of CCC's Billing and Payment terms and conditions, the license is automatically revoked and shall be void as if never granted. Use of materials as described in a revoked license, as well as any use of the materials beyond the scope of an unrevoked license, may constitute copyright infringement and publisher reserves the right to take any and all action to protect its copyright in the materials.

9. Warranties: Publisher makes no representations or warranties with respect to the licensed material.

10. Indemnity: You hereby indemnify and agree to hold harmless publisher and CCC, and their respective officers, directors, employees and agents, from and against any and all claims arising out of your use of the licensed material other than as specifically authorized pursuant to this license.

11. No Transfer of License: This license is personal to you and may not be sublicensed, assigned, or transferred by you to any other person without publisher's written permission.

12. No Amendment Except in Writing: This license may not be amended except in a writing signed by both parties (or, in the case of publisher, by CCC on publisher's behalf).

13. Objection to Contrary Terms: Publisher hereby objects to any terms contained in any purchase order, acknowledgment, check endorsement or other writing prepared by you, which terms are inconsistent with these terms and conditions or CCC's Billing and Payment terms and conditions. These terms and conditions, together with CCC's Billing and Payment terms and conditions (which are incorporated herein), comprise the entire agreement between you and publisher (and CCC) concerning this licensing transaction. In the event of any conflict between your obligations established by these terms and conditions and those established by CCC's Billing and Payment terms and conditions, these terms and conditions shall control.

14. Revocation: Elsevier or Copyright Clearance Center may deny the permissions described in this License at their sole discretion, for any reason or no reason, with a full refund payable to you. Notice of such denial will be made using the contact information provided by you. Failure to receive such notice will not alter or invalidate the denial. In no event will Elsevier or Copyright Clearance Center be responsible or liable for any costs, expenses or damage incurred by you as a result of a denial of your permission request, other than a refund of the amount(s) paid by you to Elsevier and/or Copyright Clearance Center for denied permissions.

LIMITED LICENSE

The following terms and conditions apply only to specific license types:

15. Translation: This permission is granted for non-exclusive world English rights only unless your license was granted for translation rights. If you licensed translation rights you may only translate

263

this content into the languages you requested. A professional translator must perform all translations and reproduce the content word for word preserving the integrity of the article. If this license is to re-use 1 or 2 figures then permission is granted for non-exclusive world rights in all languages.

16. Posting licensed content on any Website: The following terms and conditions apply as follows: Licensing material from an Elsevier journal: All content posted to the web site must maintain the copyright information line on the bottom of each image; A hyper-text must be included to the Homepage of the journal from which you are licensing athttp://www.sciencedirect.com/science/journal/xxxxx or the Elsevier homepage for books athttp://www.elsevier.com; Central Storage: This license does not include permission for a scanned version of the material to be stored in a central repository such as that provided by Heron/XanEdu.

Licensing material from an Elsevier book: A hyper-text link must be included to the Elsevier homepage at http://www.elsevier.com . All content posted to the web site must maintain the copyright information line on the bottom of each image.

Posting licensed content on Electronic reserve: In addition to the above the following clauses are applicable: The web site must be password-protected and made available only to bona fide students registered on a relevant course. This permission is granted for 1 year only. You may obtain a new license for future website posting.

17. For journal authors: the following clauses are applicable in addition to the above:

Preprints:

A preprint is an author's own write-up of research results and analysis, it has not been peer- reviewed, nor has it had any other value added to it by a publisher (such as formatting, copyright, technical enhancement etc.).

Authors can share their preprints anywhere at any time. Preprints should not be added to or enhanced in any way in order to appear more like, or to substitute for, the final versions of articles however authors can update their preprints on arXiv or RePEc with their Accepted Author Manuscript (see below).

If accepted for publication, we encourage authors to link from the preprint to their formal publication via its DOI. Millions of researchers have access to the formal publications on ScienceDirect, and so links will help users to find, access, cite and use the best available version. Please note that Cell Press, The Lancet and some society-owned have different preprint policies. Information on these policies is available on the journal homepage.

Accepted Author Manuscripts: An accepted author manuscript is the manuscript of an article that has been accepted for publication and which typically includes author-incorporated changes suggested during submission, peer review and editor-author communications.

Authors can share their accepted author manuscript:

 - immediately

o via their non-commercial person homepage or blog

264

o by updating a preprint in arXiv or RePEc with the accepted manuscript

o via their research institute or institutional repository for internal institutional uses or as part of an invitation-only research collaboration work-group

o directly by providing copies to their students or to research collaborators for their personal use

o for private scholarly sharing as part of an invitation-only work group on commercial sites with which Elsevier has an agreement

 - after the embargo period

o via non-commercial hosting platforms such as their institutional repository

o via commercial sites with which Elsevier has an agreement

In all cases accepted manuscripts should:

 - link to the formal publication via its DOI

 - bear a CC-BY-NC-ND license - this is easy to do

 - if aggregated with other manuscripts, for example in a repository or other site, be shared in alignment with our hosting policy not be added to or enhanced in any way to appear more like, or to substitute for, the published journal article.

Published journal article (JPA): A published journal article (PJA) is the definitive final record of published research that appears or will appear in the journal and embodies all value-adding publishing activities including peer review co-ordination, copy-editing, formatting, (if relevant) pagination and online enrichment.

Policies for sharing publishing journal articles differ for subscription and gold open access articles:

Subscription Articles: If you are an author, please share a link to your article rather than the full- text. Millions of researchers have access to the formal publications on ScienceDirect, and so links will help your users to find, access, cite, and use the best available version.

Theses and dissertations which contain embedded PJAs as part of the formal submission can be posted publicly by the awarding institution with DOI links back to the formal publications on ScienceDirect.

If you are affiliated with a library that subscribes to ScienceDirect you have additional private sharing rights for others' research accessed under that agreement. This includes use for classroom teaching and internal training at the institution (including use in course packs and courseware programs), and inclusion of the article for grant funding purposes.

Gold Open Access Articles: May be shared according to the author-selected end-user license and should contain a CrossMark logo, the end user license, and a DOI link to the formal publication on ScienceDirect.

Please refer to Elsevier's posting policy for further information.

18. For book authors the following clauses are applicable in addition to the above: Authors are permitted to place a brief summary of their work online only. You are not allowed to download and post the published electronic version of your chapter, nor may you scan the printed edition to

265

create an electronic version. Posting to a repository: Authors are permitted to post a summary of their chapter only in their institution's repository.

19. Thesis/Dissertation: If your license is for use in a thesis/dissertation your thesis may be submitted to your institution in either print or electronic form. Should your thesis be published commercially, please reapply for permission. These requirements include permission for the Library and Archives of Canada to supply single copies, on demand, of the complete thesis and include permission for Proquest/UMI to supply single copies, on demand, of the complete thesis. Should your thesis be published commercially, please reapply for permission. Theses and dissertations which contain embedded PJAs as part of the formal submission can be posted publicly by the awarding institution with DOI links back to the formal publications on ScienceDirect.

Elsevier Open Access Terms and Conditions

You can publish open access with Elsevier in hundreds of open access journals or in nearly 2000 established subscription journals that support open access publishing. Permitted third party re- use of these open access articles is defined by the author's choice of Creative Commons user license. See our open access license policy for more information.

Terms & Conditions applicable to all Open Access articles published with Elsevier:

Any reuse of the article must not represent the author as endorsing the adaptation of the article nor should the article be modified in such a way as to damage the author's honour or reputation. If any changes have been made, such changes must be clearly indicated.

The author(s) must be appropriately credited and we ask that you include the end user license and a DOI link to the formal publication on ScienceDirect.

If any part of the material to be used (for example, figures) has appeared in our publication with credit or acknowledgement to another source it is the responsibility of the user to ensure their reuse complies with the terms and conditions determined by the rights holder.

Additional Terms & Conditions applicable to each Creative Commons user license:

CC BY: The CC-BY license allows users to copy, to create extracts, abstracts and new works from the Article, to alter and revise the Article and to make commercial use of the Article (including reuse and/or resale of the Article by commercial entities), provided the user gives appropriate credit (with a link to the formal publication through the relevant DOI), provides a link to the license, indicates if changes were made and the licensor is not represented as endorsing the use made of the work. The full details of the license are available at http://creativecommons.org/licenses/by/4.0.

CC BY NC SA: The CC BY-NC-SA license allows users to copy, to create extracts, abstracts and new works from the Article, to alter and revise the Article, provided this is not done for commercial purposes, and that the user gives appropriate credit (with a link to the formal publication through the relevant DOI), provides a link to the license, indicates if changes were made and the licensor is not represented as endorsing the use made of the work. Further, any new works must be made available on the same conditions. The full details of the license are available at http://creativecommons.org/licenses/by-nc-sa/4.0.

266

CC BY NC ND: The CC BY-NC-ND license allows users to copy and distribute the Article, provided this is not done for commercial purposes and further does not permit distribution of the Article if it is changed or edited in any way, and provided the user gives appropriate credit (with a link to the formal publication through the relevant DOI), provides a link to the license, and that the licensor is not represented as endorsing the use made of the work. The full details of the license are available at http://creativecommons.org/licenses/by-nc-nd/4.0. Any commercial reuse of Open Access articles published with a CC BY NC SA or CC BY NC ND license requires permission from Elsevier and will be subject to a fee.

Commercial reuse includes:

 - Associating advertising with the full text of the Article

 - Charging fees for document delivery or access

 - Article aggregation

 - Systematic distribution via e-mail lists or share buttons

Posting or linking by commercial companies for use by customers of those companies.

20. Other Conditions:

v1.7

Questions? [email protected] or +1-855-239-3415 (toll free in the US) or +1-978- 646-2777.

267

268

269

Open Access License

No Permission Required

The OMICS Publishing Group applies the Creative Commons Attribution License (CCAL) to all works we publish (read the human-readable summary or the full license legal code). Under the CCAL, authors retain ownership of the copyright for their article, but authors allow anyone to download, reuse, reprint, modify, distribute, and/or copy articles in OMICS journals, so long as the original authors and source are cited. No permission is required from the authors or the publishers.

The broad license was developed to facilitate open access to, and free use of, original works of all types. Applying this standard license to your own work will ensure your right to make your work freely and openly available. Learn more about open access. For queries about the license, please contact us.

All Published content, except where otherwise noted, is licensed under a Creative Commons Attribution License.

Science for Society � Technology for Tactics

270