GENETIC ASSOCIATION TESTS FOR BINARY TRAITS

WITH AN APPLICATION

by

SULGI KIM

Submitted in partial fulfillment of the requirements

For the degree of Doctor of Philosophy

Dissertation Advisor: Dr. Robert C. Elston

Department of Epidemiology and Biostatistics

CASE WESTERN RESERVE UNIVERSITY

August, 2009

CASE WESTERN RESERVE UNIVERSITY SCHOOL OF GRADUATE STUDIES

We hereby approve the dissertation of Sulgi Kim, candidate for the Ph. D. degree*. (Signed) Robert C. Elston, Ph.D Department of Epidemiology and Biostatistics Chair of the committee

Xiaofeng Zhu, Ph.D Department of Epidemiology and Biostatistics

Courtney Gray-McGuire, Ph.D Department of Epidemiology and Biostatistics

Jill S. Barnholtz-Sloan, Ph.D Case Comprehensive Cancer Center

June 8, 2009

*We also certify that written approval has been obtained for any proprietary material therein.


Table of Contents

List of Tables ...... iv

List of Figures ...... v

Acknowledgements ...... vi

Abstract ...... 1

Chapter 1. Introduction...... 3

Chapter 2. Association Tests for a Binary Trait with Unrelated Individuals ...... 6

Chapter 3. An Application to Diabetic Nephropathy Data ...... 41

Chapter 4. Association Tests for a Binary Trait Measured on Related Individuals ...... 61

Chapter 5. Conclusions and Areas for Future Study ...... 73

APPENDIX ...... 76

BIBLIOGRAPHY ...... 88


List of Tables

Table 2-I Genetic Association Tests for Case-Control Data ...... 7

Table 2-II Probabilities for unphased genotypes ...... 10

Table 2-III Single-marker and two-marker association tests with corresponding models and hypotheses ...... 16

Table 2-IV Constraints for disease models ...... 18

Table 2-V Comparisons of Theoretical and Empirical Power of Test 1-2 ...... 25

Table 2-VI Empirical Type I Error of Test 1-2 ...... 25

Table 2-VII Power comparisons of two-marker tests ( rare allele frequency ) ...... 34

Table 2-VIII Power comparisons of two-marker tests ( common allele frequency ) ...... 35

Table 2-IX Mean of Power over Chromosome 11 of CEU HapMap Data ...... 37

Table 3-I Details of Tag SNP at the ...... 44

Table 3-II Categorical Covariates used in our analysis ...... 47

Table 3-III Logistic regression model identifying significant covariates ...... 48

Table 3-IV The Distribution of rs2146098 and rs6659783 in Cases and Controls ...... 53

Table 4-I Empirical Type I errors with Random Samples ...... 54

Table 4-II Empirical Power with Random Samples ...... 54

Table 4-III Empirical Type I errors with Ascertained Samples ...... 54

Table 4-IV Empirical Power with Ascertained Samples ...... 54


List of Figures

Figure 2-1 Power of single-marker test vs LD in four disease models (K=0.05) ...... 28

Figure 2-2 Power of single-marker test vs disease allele frequency in four disease models (K=0.05) ...... 29

Figure 2-3 Power of single-marker test vs LD in four disease models (K=0.20) ...... 30

Figure 2-4 Power of single-marker test vs disease allele frequency in four disease models (K=0.20) ...... 31

Figure 2-5 Power of single-marker test vs LD in four disease models (K=0.005) ...... 32

Figure 2-6 Power of single-marker test vs disease allele frequency in four disease models (K=0.005) ...... 33

Figure 3-1 LD Plot of CNDP1 and tag SNPs ...... 45

Figure 3-2 LD Plot of ELMO1 and tag SNPs ...... 46

Figure 3-3 Results for CNDP1 and ELMO1 with ARB use covariate (B) ...... 51

Figure 3-4 Results for the eight other candidate genes ...... 52

Figure 3-5 Association test results for HMCN1 with covariate configuration (B) (including ARB use as a covariate) ...... 53

Figure 3-6 LD Plot of HMCN1 and tag SNPs ...... 54


Acknowledgements

Most of all, I wish to thank my advisor, Dr. Robert Elston, for his time and guidance in making this dissertation a reality. I also thank Dr. Xiaofeng Zhu, Dr. Courtney Gray-McGuire and Dr. Jill Barnholtz-Sloan for serving on the committee and providing valuable advice. I am grateful to Nathan Morris, who has offered useful suggestions. My work on this dissertation was supported by a U.S. Public Health Service research grant (R01-DK069844) from the NIDDK (P.I.: Dr. Sharon Adler). Computational support was provided by a U.S. Public Health Service resource grant (RR03655) from the National Center for Research Resources (P.I.: Dr. Robert Elston). I am grateful to Dr. Sudha Iyengar and Dr. Sharon Adler for the opportunity to receive this funding and also for their helpful advice. This dissertation includes some work of other FIND collaborators from Perlegen in APPENDIX E. I also wish to express my appreciation to the other faculty members at Case Western Reserve University for their excellent teaching. Sincere thanks go to my friends Sungho Won, Sung-Gon Yi, Gyungah Jun, Robert Goodloe and Yali Li for their willingness to share what they had learned. Finally, I thank my family for their love and prayers.


Genetic Association Tests for Binary Traits

with an Application

Abstract

by

SULGI KIM

Genetic association studies aim to map causal variants for a trait by performing many association tests between each marker along a chromosomal region and a trait of interest. Therefore, valid and powerful association tests are essential for a genetic association study. In this dissertation, association tests are considered for binary traits using both unrelated and related individuals.

For unrelated data, this dissertation shows that prospective models may be developed that correspond conceptually to retrospective tests. Two single-marker tests and four two-marker tests are discussed. The true association models are derived, allowing us to understand the effects of marker association patterns. The power of the association tests was investigated by simulation using HapMap data. Among the single-marker tests, the allelic test has on average the most power in the case of an additive disease; but, for non-additive diseases, the genotypic test has the most power. Among the four two-marker tests, the Allelic-LD contrast test provides the most reliable power overall for the cases studied.

The proposed methods were applied to Diabetic Nephropathy (DN) data. Two genes, Carnosine Dipeptidase 1 (CNDP1) and Engulfment and Cell Motility 1 (ELMO1), have previously shown association with DN. These two genes, along with eight other genes (HMCN1, CFH, AHSG, CASP3, HSPA1A, HSPB1, CASP12, and HMOX1), were examined in a new study of Mexican-Americans. There was no replication of the associations with either CNDP1 or ELMO1. Of the other eight candidate genes, association with DN was found with a SNP pair, rs2146098 and rs6659783, in HMCN1. Association with a rare haplotype in this region was subsequently identified.

Lastly, association tests for related individuals were considered, particularly with genome-wide data. Two versions of the quasi-likelihood score test using a generalized linear mixed model (GLMM-QLS) were proposed. For 100 nuclear families of the same structure, it was shown that the proposed methods maintain nominal Type I error and have power comparable to previously published methods. The main strength of the GLMM-QLSs is their computational efficiency when applied to genome-wide association studies. Because they are based on the prospective model, they can easily incorporate other covariates.


Chapter 1 INTRODUCTION

1.1. GENETIC ASSOCIATION STUDIES AND GENETIC ASSOCIATION TESTS

Genetic variants can occur in multiple formats. First, they can be coded as unphased genotypes. The single-marker genotypic variant, $G = X$, can be coded with three equally spaced values such as (-1, 0 and 1) or (0, 1 and 2). A non-linear variate, $X^2$, can be added, so that the genetic variant becomes multivariate: $G = (X, X^2)$. Multi-marker genotypic variants are coded by simply adding variables for multiple markers, such as $G = (X_1, X_2, \ldots, X_m)$. They can also be coded as phased haplotypes. For a given region, a variant with $q$ haplotypes can be set up as $G = (H_1, H_2, \ldots, H_q)$, where $H_i$ is the number of copies of the $i$-th haplotype a person carries. There are also other types of genetic variants such as copy number variants (CNVs). Interactions involving the variants introduced above can be set up in the form genetic × genetic or genetic × nongenetic as components of the regression model. This is discussed in Chapter 2.
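The codings described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, genotypes are assumed to arrive as minor-allele counts (0, 1 or 2), and phased data are assumed to arrive as a pair of haplotype indices per person.

```python
# Illustrative sketch only; function names and input formats are assumptions,
# not part of the dissertation.

def code_genotype(minor_allele_count):
    """Map a minor-allele count (0, 1, 2) to the equally spaced code (-1, 0, 1)."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor allele count must be 0, 1 or 2")
    return minor_allele_count - 1


def genotypic_variant(minor_allele_counts, with_square=False):
    """Build the unphased variant vector for one person.

    Returns the coded genotypes of all markers, optionally followed by
    their squares (the non-linear variates).
    """
    xs = [code_genotype(c) for c in minor_allele_counts]
    if with_square:
        xs = xs + [x * x for x in xs]
    return xs


def haplotype_variant(haplotype_pair, n_haplotypes):
    """Count the copies of each of the q haplotypes carried by one person."""
    counts = [0] * n_haplotypes
    for h in haplotype_pair:
        counts[h] += 1
    return counts
```

For example, a person with two minor alleles at the first marker and one at the second gets the unphased vector `[1, 0]`, and `[1, 0, 1, 0]` once the squared terms are appended.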

When a genetic variant has a causal relationship with a trait, the two are associated. Variants of other markers in linkage disequilibrium (LD) with it are also associated with the trait. Therefore, a significant association between a trait and a genetic marker variant implies that it, or a neighboring variant, may cause the trait.

Genetic association studies generally focus on two questions: (1) Is there any genetic variant associated with the trait? (2) Are the genetic variants in the candidate region associated with the trait? The first question is often investigated with genome-wide data and the second with candidate gene data. However, both studies use the same analysis process: a series of single genetic association tests is performed with each marker variant and “significant” results are considered to be strong evidence of association. Therefore, genetic association tests are necessary for a genetic association study, and developing valid and powerful tests would improve the identification of any region containing a causal variant.

The genetic variants are the crucial components in a genetic association test. Their form is sometimes determined from those listed above by the research topic and the data available, but sometimes they must be selected from a list of possible genetic variants.

For example, one may have to choose between two competing tests: a haplotype-based test and a genotype-based test. Selecting the appropriate genetic variants to test requires comprehensive knowledge about the genome structure and plausible penetrance model, but this knowledge has not been fully acquired. Even for given conditions, which genetic variants would be appropriate for the research aim may not have been fully studied, as is also discussed in Chapter 2.

1.2. GENETIC ASSOCIATION TESTS FOR A BINARY TRAIT

The trait in a genetic association study is usually either quantitative or binary. It is generally more difficult to set up an association test for a binary trait than for a quantitative trait. This dissertation focuses on binary traits.


Genetic data can also be classified according to the relationships among subjects: these can be unrelated or related individuals. Association tests for unrelated individuals are relatively simple compared to those for related individuals. Many genetic association studies are concerned with the unrelated data obtained by case-control sampling. Chapters 2 and 3 will focus on case-control data. In contrast, related individuals are collected by various sampling designs depending on the characteristics of the disease studied. As a result, a greater variety of methods have been proposed for the analysis of related individuals. However, each method has its own limitations in terms of its applicability to data type, flexibility of the method or computational time. Chapter 4 will focus on analyses with related individuals.

1.3. STRUCTURE OF THE DISSERTATION

This dissertation includes the development of a method and its application for a genetic association study with a binary trait. In Chapter 2, various association tests with a binary trait will be reviewed for unrelated data, and the power of these different methods will be investigated. In Chapter 3, the findings from the previous chapter will be applied to unrelated case-control Diabetic Nephropathy data. In Chapter 4, new association tests for related individuals will be developed. Finally, in Chapter 5, a summary of the findings and areas for future study will be presented.


Chapter 2 Association Tests for a Binary Trait with Unrelated Individuals

2.1. INTRODUCTION

A genome-wide association study with case-control data aims to localize disease susceptibility regions in the genome. Single Nucleotide Polymorphism (SNP) markers, which are usually diallelic, have been used to cover the whole genome. Two categories of tests have been applied to these data: single-marker association tests, which examine association between affection status and the marker data one SNP at a time, and multi-marker association tests, which examine association between affection status and multiple SNP data simultaneously.

For single-marker association tests, we can consider the allelic frequency contrast test (allelic test) [Sasieni 1997], and the Hardy-Weinberg Disequilibrium (HWD) contrast test [Song and Elston 2006]. In genome-wide association studies, the allelic test has been predominantly used (e.g. The Wellcome Trust Case Control Consortium [2007]). However, this test often fails on account of a relatively strict correction for multiple comparisons, because it does not take advantage of marker linkage disequilibrium (LD) structure efficiently, and/or it is not suitable for detecting rare disease variants. For these reasons, multi-marker association tests may have more power than single-marker association tests. Such tests include the haplotype-based test [Schaid 2004; Schaid, et al. 2002], Hotelling’s $T^2$ test [Chapman, et al. 2003; Clayton, et al. 2004; Xiong, et al. 2002] and the LD contrast test [Nielsen, et al. 2004; Wang, et al. 2007; Zaykin, et al. 2006] (Table 2-I).

Table 2-I Genetic Association Tests for Case-Control Data

                Retrospective                          Prospective
Single-Marker   • Allele frequency contrast test       • Tests regarding regression
                  (Cochran-Armitage trend test)          coefficient(s) in a logistic
                • HWD contrast test                      model (LRT or Score Test)
Multi-Marker    • Hotelling’s $T^2$ test                   o Genotype-Based model
                • LD contrast test                        o Haplotype-Based model
                • Haplotype-frequency contrast test

The original LD-contrast test requires phased genotype data, but Zaykin et al. [2006] proposed the composite-LD contrast test that does not require phased genotype data.

From now on in this paper, when it applies to unphased data, we use the term “LD contrast test” to denote the composite-LD contrast test.

Several authors have proposed that either the HWD or the LD contrast be jointly tested with the allele frequency contrast [Song and Elston 2006; Zheng, et al. 2008; Zheng, et al. 2007]. Recently, Won and Elston [2008] described the allele frequency, HWD and LD contrasts as three distinct sources of information about case-control differences and suggested performing these tests in a joint or multi-stage manner. While these three sources of information are often close to being independent, they are only strictly independent under limiting conditions [Won and Elston 2008]. This fact has restricted a systematic use of the three tests, because extra work is required to adjust for their correlations.

The allele frequency, HWD and LD contrast tests are typically developed in what has been termed a retrospective context; i.e. case-control status is considered fixed and the genotypes are considered random. However, for case-control data, epidemiologists typically take advantage of the properties of the odds ratio and use the prospective logistic regression model, making the case-control status the random variable dependent on the predictors (i.e. genotypes and covariates), which are considered fixed [Prentice and Pyke 1979]. This prospective modeling tends to allow for greater flexibility, especially when adjusting for covariates. It also provides a natural way to adjust for any correlations between the tests or other covariates and can be extended to quantitative traits. The allele frequency contrast test has been performed in a prospective logistic regression model [Longmate 2001]. However, there is little discussion in the literature concerning the prospective modeling of the other two tests.

Here, for unphased case-control data, we discuss how the allelic, HWD and LD contrast tests may be combined, either pairwise or all three together, in the retrospective context. We then show that these joint tests correspond very closely to analogous tests based on certain prospective models. Using these general models, various specific models and their corresponding tests are presented.

The disease penetrance model will be written as a polynomial that allows a general penetrance model, including models appropriate for additive, dominant and recessive diseases. Given the penetrance model, together with the LD structure, the true single-marker association model will be derived. Similarly, the true two-marker association models will be considered. Deriving the true association models allows us to understand many of the test properties in genetic association studies. Then, we look for the best test in terms of power. Last, under the assumption that LD among two markers and the disease locus is similar to that among three markers, we compare the power of each test using the SNP data on chromosome 11 of the HapMap CEU (Utah residents with ancestry from northern and western Europe) population data.

2.2. SINGLE-MARKER AND TWO-MARKER ASSOCIATION MODELS AND TESTS

2.2.1. Notation and Assumptions

We assume that there is a single diallelic disease SNP in the genomic region being considered, but we allow multiple disease SNPs to exist that are not in LD with any SNP in the region. We suppose there are two diallelic SNP markers, A and B, having alleles $A_1, A_2$ and $B_1, B_2$, respectively, where $A_1$ and $B_1$ are the minor alleles. Let $X$ and $Y$ denote the random variables for the genotypes of the markers, coded as follows:

\[
X = \begin{cases} 1 & \text{for } A_1A_1 \\ 0 & \text{for } A_1A_2 \\ -1 & \text{for } A_2A_2 \end{cases},
\qquad
Y = \begin{cases} 1 & \text{for } B_1B_1 \\ 0 & \text{for } B_1B_2 \\ -1 & \text{for } B_2B_2 \end{cases}.
\]

The random variables $X$ and $Y$ for the $i$-th individual are denoted by $X_i$ and $Y_i$. $I_{\mathrm{case}}$ and $I_{\mathrm{ctrl}}$ denote the sets of cases and controls. We assume a multinomial distribution for unphased genotype data in the general population and denote their probabilities as in Table 2-II.

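Table 2-II Probabilities for unphased genotypes

            $B_1B_1$      $B_1B_2$      $B_2B_2$
$A_1A_1$    $P_{1,1}$     $P_{1,0}$     $P_{1,-1}$
$A_1A_2$    $P_{0,1}$     $P_{0,0}$     $P_{0,-1}$
$A_2A_2$    $P_{-1,1}$    $P_{-1,0}$    $P_{-1,-1}$
Total       1

Here $P_{x,y} = P(X = x, Y = y)$, and $P_{A_1A_1}$, $P_{A_1A_2}$, $P_{B_1B_1}$ and $P_{B_1B_2}$ denote the corresponding marginal genotype probabilities. Note that we make minimal assumptions about the general population sampled; in particular, we do not assume HWE in the population. The allele frequencies of $A_1$ and $B_1$ are given by $p_A = P_{A_1A_1} + \frac{1}{2}P_{A_1A_2}$ and $p_B = P_{B_1B_1} + \frac{1}{2}P_{B_1B_2}$. We use $\mu_X$, $\sigma_X^2$ and $\sigma_{X,Y}$ to denote the expected value of $X$, the variance of $X$ and the covariance of $X$ and $Y$, respectively. Note that $\mu_X = 2p_A - 1$ and $\mu_Y = 2p_B - 1$. We similarly assume a multinomial distribution for cases and controls, denoting any parameters associated with these populations respectively by the subscripts “case” and “ctrl”. The HWD parameter for marker A and the composite LD parameter for alleles $A_1$ and $B_1$ of markers A and B are respectively given by [Weir 1996; Zaykin 2004] as

\[ d_A = P_{A_1A_1} - p_A^2 \]

and

\[ \Delta_{AB} = 2P_{1,1} + P_{1,0} + P_{0,1} + \tfrac{1}{2}P_{0,0} - 2p_A p_B = \tfrac{1}{2}\sigma_{X,Y}. \]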


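The HWD parameter $d_A$ can also be expressed in a different form. It can be shown that $\sigma_X^2 = E(X^2) - \mu_X^2 = 2p_A + 2P_{A_1A_1} - 4p_A^2$. Under HWE (i.e. $p_A^2 = P_{A_1A_1}$), this becomes $\sigma^2_{X|\mathrm{HWE}} = 2p_A(1 - p_A)$. Thus the HWD parameter can be expressed as

\[ d_A = \tfrac{1}{2}\left(\sigma_X^2 - \sigma^2_{X|\mathrm{HWE}}\right). \]

This means that the HWD parameter, $d_A$, is half the deviation of the variance from the variance expected under HWE.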
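As a quick numerical illustration of this identity, the following Python sketch computes the HWD parameter in two ways from a set of genotype probabilities: directly, as the minor-homozygote probability minus the squared minor allele frequency, and as half the difference between the genotype variance and its HWE value. The function name and argument names are hypothetical, and the genotype is assumed to be coded 1, 0, -1 as in Section 2.2.1.

```python
# Illustrative check; names are assumptions made for this sketch.

def hwd_two_ways(p_hom_minor, p_het):
    """Return (direct HWD, half the excess variance) for one diallelic marker.

    p_hom_minor: probability of the minor-allele homozygote (coded X = 1)
    p_het:       probability of the heterozygote (coded X = 0)
    """
    p_hom_major = 1.0 - p_hom_minor - p_het        # coded X = -1
    p_minor = p_hom_minor + 0.5 * p_het            # minor allele frequency
    direct = p_hom_minor - p_minor ** 2            # HWD parameter, direct form

    mean_x = p_hom_minor - p_hom_major             # E(X)
    var_x = (p_hom_minor + p_hom_major) - mean_x ** 2   # E(X^2) - E(X)^2
    var_hwe = 2.0 * p_minor * (1.0 - p_minor)      # variance of X under HWE
    return direct, 0.5 * (var_x - var_hwe)
```

With genotype probabilities 0.10, 0.30 and 0.60, for instance, both expressions give 0.0375.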

2.2.2. Association Tests in the Retrospective Context

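First, we will review association tests in the retrospective context. The allele frequency and HWD contrast tests for marker A and the LD contrast test for markers A and B test the equalities, between cases and controls, of the parameters $p_A$, $d_A$ and $\Delta_{AB}$, respectively. Suppose we have $n_{\mathrm{case}}$ cases and $n_{\mathrm{ctrl}}$ controls. Since the allele frequency estimate is $\hat{p}_A = \frac{1}{2}(\bar{X} + 1)$, where $\bar{X}$ is the sample mean of the $X_i$, the test statistic for the allele frequency contrast can be written

\[ T^2_{AF} = \frac{(\bar{X}_{\mathrm{case}} - \bar{X}_{\mathrm{ctrl}})^2}{S^2_{\bar{X}|\mathrm{case}} + S^2_{\bar{X}|\mathrm{ctrl}}}, \]

where $S^2_{\bar{X}|\mathrm{case}} = \frac{1}{n_{\mathrm{case}}(n_{\mathrm{case}}-1)}\sum_{i \in I_{\mathrm{case}}} (X_i - \bar{X}_{\mathrm{case}})^2$ and $S^2_{\bar{X}|\mathrm{ctrl}} = \frac{1}{n_{\mathrm{ctrl}}(n_{\mathrm{ctrl}}-1)}\sum_{i \in I_{\mathrm{ctrl}}} (X_i - \bar{X}_{\mathrm{ctrl}})^2$. This corresponds to the univariate version of Hotelling’s [1931] $T^2$ test. Estimates for the HWD and LD parameters $d_A$ and $\Delta_{AB}$ in a population from a sample with size $n$ are obtained by

\[ \hat{d}_A = \frac{1}{2n}\sum_{i=1}^{n} (X_i - \hat{\mu}_X)^2 - \frac{1}{4}\left(1 - \hat{\mu}_X^2\right) \]

and

\[ \hat{\Delta}_{AB} = \frac{1}{2n}\sum_{i=1}^{n} (X_i - \hat{\mu}_X)(Y_i - \hat{\mu}_Y). \]

Then the $T^2$ statistics for the HWD and LD contrast tests are given by

\[ T^2_{HWD} = \frac{(\hat{d}_{A|\mathrm{case}} - \hat{d}_{A|\mathrm{ctrl}})^2}{\hat{V}(\hat{d}_{A|\mathrm{case}}) + \hat{V}(\hat{d}_{A|\mathrm{ctrl}})} \]

and

\[ T^2_{LD} = \frac{(\hat{\Delta}_{AB|\mathrm{case}} - \hat{\Delta}_{AB|\mathrm{ctrl}})^2}{\hat{V}(\hat{\Delta}_{AB|\mathrm{case}}) + \hat{V}(\hat{\Delta}_{AB|\mathrm{ctrl}})}. \]

It can be easily verified that, under the assumption of known $\mu_X$ and $\mu_Y$,

\[ \mathrm{Var}(\hat{d}_A) = \frac{1}{4n}\,\sigma^2_{(X-\mu_X)^2} \quad \text{and} \quad \mathrm{Var}(\hat{\Delta}_{AB}) = \frac{1}{4n}\,\sigma^2_{XY - \mu_X Y - \mu_Y X}. \tag{2-1} \]

In practice, when $\mu_X$ and $\mu_Y$ are unknown, we may replace them by the consistent estimates $\hat{\mu}_X$ and $\hat{\mu}_Y$. While other tests exist that utilize allele frequency, HWD and LD contrasts, these tests are similar in form to the ones presented here and perform similarly.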
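The single-contrast quantities reviewed in this section can be sketched in plain Python as follows. This is a minimal illustration, not the dissertation's software: the function names are hypothetical, genotypes are assumed to be coded 1, 0, -1, and the variance of each group mean is assumed to be estimated with the usual unbiased divisor n(n - 1), which may differ from the normalization used in the original text.

```python
# Illustrative sketch; names and the n(n - 1) divisor are assumptions.

def mean(values):
    return sum(values) / len(values)


def var_of_mean(values):
    """Estimated variance of the sample mean (unbiased divisor n(n - 1))."""
    n = len(values)
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (n * (n - 1))


def t2_allele_freq(x_case, x_ctrl):
    """Allele frequency contrast statistic from coded genotypes of cases and controls."""
    diff = mean(x_case) - mean(x_ctrl)
    return diff ** 2 / (var_of_mean(x_case) + var_of_mean(x_ctrl))


def hwd_estimate(x):
    """Plug-in HWD estimate: half of (sample variance minus HWE variance)."""
    m = mean(x)
    half_var = 0.5 * mean([(xi - m) ** 2 for xi in x])
    half_var_hwe = 0.25 * (1.0 - m * m)   # half of 2p(1 - p) with p = (m + 1) / 2
    return half_var - half_var_hwe


def composite_ld_estimate(x, y):
    """Plug-in composite LD estimate: half the sample covariance of X and Y."""
    mx, my = mean(x), mean(y)
    return 0.5 * mean([(xi - mx) * (yi - my) for xi, yi in zip(x, y)])
```

The HWD estimate computed this way agrees with the more familiar form, the observed minor-homozygote proportion minus the squared estimated allele frequency, which gives a simple sanity check on any implementation.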

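The joint test of allele frequency and HWD contrasts between cases and controls tests the null hypothesis $H_0: (p_{A|\mathrm{case}}, d_{A|\mathrm{case}}) = (p_{A|\mathrm{ctrl}}, d_{A|\mathrm{ctrl}})$. Note the important point that the parameter vector $(p_A, d_A)'$ determines the genotype distribution and therefore this test is equivalent to the genotypic test. We denote this test the Allelic-HWD contrast test. In the following, $M'$ denotes the transpose of a column vector $M$. Letting $Z = (X, X^2)'$, the sample mean $\bar{Z}$ is a sufficient statistic for $(p_A, d_A)'$. Thus the Allelic-HWD contrast test can be performed by comparing $\bar{Z}_{\mathrm{case}}$ and $\bar{Z}_{\mathrm{ctrl}}$. The $T^2$ statistic for this test is given by

\[ T^2 = (\bar{Z}_{\mathrm{case}} - \bar{Z}_{\mathrm{ctrl}})'\, S_T^{-}\, (\bar{Z}_{\mathrm{case}} - \bar{Z}_{\mathrm{ctrl}}), \tag{2-2} \]

where

\[ S_T = \frac{1}{n_{\mathrm{case}}(n_{\mathrm{case}}-1)}\sum_{i \in I_{\mathrm{case}}} (Z_i - \bar{Z}_{\mathrm{case}})(Z_i - \bar{Z}_{\mathrm{case}})' + \frac{1}{n_{\mathrm{ctrl}}(n_{\mathrm{ctrl}}-1)}\sum_{i \in I_{\mathrm{ctrl}}} (Z_i - \bar{Z}_{\mathrm{ctrl}})(Z_i - \bar{Z}_{\mathrm{ctrl}})' \]

and $S_T^{-}$ denotes a generalized inverse. Under the null hypothesis, $T^2$ asymptotically follows the chi-square distribution with degrees of freedom equal to the rank of $S_T$.

Similarly, we can build a joint test of the allele frequency contrasts of two markers and their LD contrast, which tests the null hypothesis $H_0: (p_{A|\mathrm{case}}, p_{B|\mathrm{case}}, \Delta_{AB|\mathrm{case}}) = (p_{A|\mathrm{ctrl}}, p_{B|\mathrm{ctrl}}, \Delta_{AB|\mathrm{ctrl}})$. We denote this test the Allelic-LD contrast test. Letting $Z = (X, Y, XY)'$, $\bar{Z}$ is a sufficient statistic for $(p_A, p_B, \Delta_{AB})'$. Thus, the Allelic-LD contrast test can be performed using a version of Hotelling’s $T^2$.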

Therefore, it can be seen that the additional case-control differences that can be captured by the HWD and LD contrast tests, given the allele frequency contrast(s), are equivalent to the differences in quadratic and interaction terms, respectively. The joint test for the case-control difference in allele frequency and HWD of two markers and their LD (the Allelic-HWD-LD contrast test) can be constructed, in a similar manner, by contrasting the mean vector of $Z = (X, Y, XY, X^2, Y^2)'$ between cases and controls.
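To make the joint construction concrete, here is a small Python sketch of the Allelic-HWD contrast statistic, built by contrasting the mean of the vector formed from the coded genotype and its square between cases and controls. It is a sketch under assumptions: the function names are hypothetical, the covariance of each group mean uses the divisor n(n - 1), and an ordinary 2 × 2 inverse is used in place of a generalized inverse, so the function refuses singular input rather than handling it.

```python
# Illustrative sketch; names, the n(n - 1) divisor and the plain 2x2 inverse
# are assumptions made for this example.

def mean_vec(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]


def cov_of_mean(rows):
    """Estimated covariance matrix of the sample mean vector."""
    n = len(rows)
    m = mean_vec(rows)
    p = len(m)
    return [[sum((r[j] - m[j]) * (r[k] - m[k]) for r in rows) / (n * (n - 1))
             for k in range(p)] for j in range(p)]


def t2_allelic_hwd(x_case, x_ctrl):
    """Joint contrast of (X, X^2) between cases and controls."""
    z_case = [(x, x * x) for x in x_case]
    z_ctrl = [(x, x * x) for x in x_ctrl]
    d = [a - b for a, b in zip(mean_vec(z_case), mean_vec(z_ctrl))]
    s_case, s_ctrl = cov_of_mean(z_case), cov_of_mean(z_ctrl)
    s = [[s_case[j][k] + s_ctrl[j][k] for k in range(2)] for j in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    if abs(det) < 1e-12:
        # A generalized inverse would be needed here; not covered by this sketch.
        raise ValueError("covariance matrix is singular")
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    return sum(d[j] * inv[j][k] * d[k] for j in range(2) for k in range(2))
```

The same pattern extends to the Allelic-LD and Allelic-HWD-LD contrasts by changing the per-person vector, at the cost of needing a general matrix (pseudo)inverse routine.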

2.2.3. Marker Association Model and Tests

It is well known in epidemiological theory that the prospective model is still valid with data sampled in a retrospective way [Prentice and Pyke 1979]. Moreover, when the data consist of $n$ individuals in total, it has been shown that the Cochran-Armitage test statistic, that is, the statistic calculated in the retrospective context, is a multiple of the score statistic from a prospective logistic model, with a multiplier that depends only on $n$ [Schaid, et al. 2002]. This shows that the test developed in a retrospective context can be transformed into the asymptotically equivalent test in a prospective model. This can also be shown for the $T^2$ and score tests. The score test statistic is of the same form as (2-2), but with $S_T$ replaced by the similar matrix

\[ \tilde{S}_T = \frac{1}{n_{\mathrm{case}}}\cdot\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})(Z_i - \bar{Z})' + \frac{1}{n_{\mathrm{ctrl}}}\cdot\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})(Z_i - \bar{Z})', \]

where $\bar{Z}$ is the mean of the $Z_i$ over all $n = n_{\mathrm{case}} + n_{\mathrm{ctrl}}$ individuals. This implies that the $T^2$ and score test statistics are asymptotically equivalent as the null and the alternative hypotheses approach each other (i.e. they become a Pitman sequence [Davidson and MacKinnon 1987]).

Therefore, a single marker model for the Allelic-HWD contrast test incorporating covariates can be written as

\[ \log\frac{\pi}{1-\pi} = \beta_0 + \beta_X X + \beta_{X^2} X^2 + \text{covariates}, \tag{2-3} \]

where we suppress the index $i$ and $\pi$ denotes the probability that an individual is affected. Similarly, the two marker model for the Allelic-HWD-LD contrast test incorporating covariates is

\[ \log\frac{\pi}{1-\pi} = \beta_0 + \beta_X X + \beta_Y Y + \beta_{XY} XY + \beta_{X^2} X^2 + \beta_{Y^2} Y^2 + \text{covariates}. \tag{2-4} \]


Based on these two models, or a reduced form of them, we can set up various types of association tests that examine the significance of all or a subset of the regression coefficients. Because the likelihood ratio test (LRT) for a logistic model is also very close to the score test for the same model, we expect all three tests, the $T^2$ test, score test and LRT, to behave similarly. Six models and their global hypotheses that may be tested in a single stage analysis are numbered and presented in Table 2-III and, for each of these hypotheses, we can set up the corresponding score test or LRT. For genome-wide association analysis, we can perform a single marker test with every single SNP and/or a two-marker test with every consecutive pair of SNPs.

Multi-stage analysis, which utilizes the allele frequency, HWD and LD contrasts in sequential stages of the analysis, can be created by using a sequence of the tests in Table 2-III in a prospective model. Suppose, for example, that SNPs have been selected for genotyping by an allele frequency contrast test based on pooled DNA samples. Then the HWD contrast test adjusted for the information that is used in the first stage can be performed by the test of $H_0: \beta_{X^2|X} = 0$ in the model of Test 1-2, whether the second test is to be applied to the same or a new sample of persons. Therefore, based on these marker association models, using this framework in a multi-stage study we can perform the allele frequency, HWD and LD contrast tests and their joint tests in a more systematic and convenient way.


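Table 2-III Single-marker and two-marker association tests with corresponding models and hypotheses

Test | Model | Null hypothesis | Test Description

Single-marker association

Test 1-2 | $\beta_0 + \beta_X X + \beta_{X^2} X^2$ | $\beta_X = \beta_{X^2} = 0$ | Allelic-HWD contrast test (Genotypic test)

Test 1-1 | $\beta_0 + \beta_X X$ | $\beta_X = 0$ | Allele frequency contrast test (Allelic test)

Two-marker association

Test 2-5 | $\beta_0 + \beta_X X + \beta_Y Y + \beta_{X^2} X^2 + \beta_{Y^2} Y^2 + \beta_{XY} XY$ | $\beta_X = \beta_Y = \beta_{XY} = \beta_{X^2} = \beta_{Y^2} = 0$ | Joint Allelic-HWD-LD contrast test

Test 2-4 | $\beta_0 + \beta_X X + \beta_Y Y + \beta_{X^2} X^2 + \beta_{Y^2} Y^2$ | $\beta_X = \beta_Y = \beta_{X^2} = \beta_{Y^2} = 0$ | Joint Allelic-HWD contrast test

Test 2-3 | $\beta_0 + \beta_X X + \beta_Y Y + \beta_{XY} XY$ | $\beta_X = \beta_Y = \beta_{XY} = 0$ | Joint Allelic-LD contrast test

Test 2-2 | $\beta_0 + \beta_X X + \beta_Y Y$ | $\beta_X = \beta_Y = 0$ | Joint Allelic contrast test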

2.2.4. Haplotype-Based Test

The association models in Table 2-III are specific genotype-based models and it may also be helpful to review the relationship between these genotype-based models and haplotype-based models. Let $H_{jk}$ be the number of $A_jB_k$ haplotypes an individual has. Then the haplotype-based model for two markers, taking $H_{22}$ as a baseline, is written as

\[ \log\frac{\pi}{1-\pi} = \beta_0 + \beta_{11} H_{11} + \beta_{12} H_{12} + \beta_{21} H_{21} + \text{covariates}, \tag{2-5} \]

and the haplotype-based test is set up as a global test of their effects, that is, a test of $H_0: \beta_{11} = \beta_{12} = \beta_{21} = 0$. Ignoring covariates, the saturated haplotype-based model has four parameters for main effects (i.e. we rewrite model (2-5) to include $H_{22}$ instead of $\beta_0$) and six parameters for their first-order interaction effects. This model, with a total of ten parameters, is equivalent to the genotype-based model that includes the six terms in (2-4), together with the higher order terms $X^2Y$, $XY^2$, $X^2Y^2$ and an extra term for phase [Schaid 2004]. Therefore, each test in Table 2-III and the haplotype-based test examine the case-control differences summarized in a different way. In particular, for two-marker data, Test 2-3 is comparable to the haplotype-based test because both of them are generally 3-degree-of-freedom tests [Conti and Gauderman 2004].

2.2.5. True Association Models

We can obtain true marker association models utilizing known information about the genotype penetrances and the LD structure among the disease and marker alleles. The tests in Table 2-III do not require HWE in the general population, but we will consider true marker association models under the assumption of HWE in the general population. Let the disease SNP have alleles $D_1, D_2$, where $D_1$ is the minor allele. Let $D$ denote the disease genotype variable coded as

\[ D = \begin{cases} 1 & \text{for } D_1D_1 \\ 0 & \text{for } D_1D_2 \\ -1 & \text{for } D_2D_2 \end{cases}. \]

We write the penetrance model as:

\[ P(\text{affected} \mid D) = \gamma_0 + \gamma_D D + \gamma_{D^2} D^2, \tag{2-6} \]

which is specified by parameters $\gamma_0$, $\gamma_D$ and $\gamma_{D^2}$. Note that for simplicity of exposition this is not written on a logit scale. Consider the four disease models: additive, dominant, recessive and heterozygote (dis)advantage. These can be obtained by constraining the coefficients of the penetrance model (2-6) as indicated in Table 2-IV.

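Table 2-IV Constraints for disease models

Disease Model | Constraint

Additive | $\gamma_{D^2} = 0$

Dominant or Recessive | $\gamma_{D^2} = -\gamma_D$ (dominant) or $\gamma_{D^2} = \gamma_D$ (recessive)

Heterozygote (Dis)advantage | $\gamma_D = 0$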


In the following, we assume that the homozygote with the major allele $D_2$ has the lowest risk. This implies that the minor allele is the disease susceptibility allele for the additive, dominant and recessive diseases, and that we are considering a heterozygote disadvantage disease. Nevertheless, the same test statistics are appropriate, and their power will be similar for the diseases in which the homozygote with the major allele $D_2$ has the highest risk.

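Now consider the LD structure between a single marker locus and a disease locus in the general population. Let $p_D$ denote the disease allele frequency and $\delta_{AD}$ denote the LD of the marker allele $A_1$ and the disease allele. The LD structure implies that

\[ E(D \mid X) = aX + b \]

and

\[ E(D^2 \mid X) = a^2X^2 + abX + c, \]

where

\[ a = \frac{\delta_{AD}}{p_A(1-p_A)}, \]

\[ b = \frac{\delta_{AD}(1-2p_A)}{p_A(1-p_A)} - (1-2p_D) \]

and

\[ c = 1 - 2p_D(1-p_D) - \frac{\delta_{AD}(1-2p_A)(1-2p_D)}{p_A(1-p_A)} - \frac{2\delta_{AD}^2}{p_A(1-p_A)} \]

(see APPENDIX A).

Given the true disease model and the LD structure, we can set up the true single-marker association model between the phenotype and single-marker data, as follows:

\[ P(\text{affected} \mid X) = \sum_{D = -1, 0, 1} P(\text{affected} \mid D)\,P(D \mid X) = \gamma_0 + \gamma_D E(D \mid X) + \gamma_{D^2} E(D^2 \mid X) \]

\[ = (ab\,\gamma_{D^2} + a\,\gamma_D)X + a^2\gamma_{D^2}X^2 + \gamma_0 + \gamma_{D^2}c + \gamma_D b. \tag{2-7} \]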
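The linear and quadratic forms of the conditional moments of the disease genotype given the marker genotype can be checked numerically. The Python sketch below is an illustration under stated assumptions: it assumes HWE (so that an individual's two haplotypes are independent), takes the marker minor allele frequency `p_a`, the disease allele frequency `p_d` and the gametic LD coefficient `delta` as inputs, and uses closed-form coefficients `a`, `b`, `c` that were re-derived for this sketch from those assumptions rather than copied from APPENDIX A. All names are hypothetical.

```python
# Illustrative check under HWE; the closed forms for a, b, c are re-derived
# for this sketch and all names are assumptions.

def ld_coefficients(p_a, p_d, delta):
    """Coefficients in E(D|X) = a*X + b and E(D^2|X) = a^2*X^2 + a*b*X + c."""
    a = delta / (p_a * (1 - p_a))
    b = a * (1 - 2 * p_a) - (1 - 2 * p_d)
    c = 1 - 2 * p_d * (1 - p_d) - a * (1 - 2 * p_a) * (1 - 2 * p_d) - 2 * a * delta
    return a, b, c


def conditional_moments(p_a, p_d, delta, x):
    """E(D|X=x) and E(D^2|X=x) by direct enumeration of the two haplotypes."""
    q1 = p_d + delta / p_a          # P(disease allele | marker minor allele)
    q0 = p_d - delta / (1 - p_a)    # P(disease allele | marker major allele)
    q_first, q_second = {1: (q1, q1), 0: (q1, q0), -1: (q0, q0)}[x]
    e_d = e_d2 = 0.0
    for d1 in (0, 1):
        for d2 in (0, 1):
            prob = (q_first if d1 else 1 - q_first) * (q_second if d2 else 1 - q_second)
            d = d1 + d2 - 1         # disease genotype coded -1, 0, 1
            e_d += prob * d
            e_d2 += prob * d * d
    return e_d, e_d2
```

Comparing `conditional_moments` with the polynomial expressions at the three marker genotypes gives a direct check of the coefficients for any admissible choice of allele frequencies and LD.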


This true association model, which has the same form as the penetrance model (2-6), allows us to understand how the disease model and the LD structure affect the SNP association pattern. It clearly implies that a marker allele that is in LD with the disease allele would be observed in the cases and controls as if it were a disease SNP with a particular disease model. Now, in the model (2-7), as $\delta_{AD}$ approaches 0, the coefficients of both the linear and quadratic terms go to 0. However, the ratio of the coefficient of the linear term to that of the quadratic term in model (2-7) is

\[ (1-2p_A) + \left[\frac{\gamma_D}{\gamma_{D^2}} - (1-2p_D)\right]\frac{p_A(1-p_A)}{\delta_{AD}}. \]

When $\frac{\gamma_D}{\gamma_{D^2}} - (1-2p_D) \neq 0$, as $\delta_{AD}$ approaches 0, the absolute value of the ratio approaches infinity. Thus as the LD decreases, the coefficient of the quadratic term generally approaches 0 faster than does that of the linear term, and the association model becomes similar to that of an additive disease. However, if $\frac{\gamma_D}{\gamma_{D^2}} - (1-2p_D) = 0$ holds for the disease model, it can be shown that $\mu_{X|\mathrm{case}} = \mu_{X|\mathrm{ctrl}} = \mu_X$ (APPENDIX B). This implies that a test based on a model with only a linear term, such as the allelic test, cannot identify the association at all when it is applied to any SNP correlated with the disease SNP, even the disease SNP itself. This condition holds only in over- or under-dominant disease models.

Given LD parameters and allele frequencies, the true two-marker association model can be obtained in a similar way to the single-marker case as follows:

\[ P(\text{affected} \mid X, Y) = \sum_{D = -1, 0, 1} P(\text{affected} \mid D)\,P(D \mid X, Y) = \gamma_0 + \gamma_D E(D \mid X, Y) + \gamma_{D^2} E(D^2 \mid X, Y), \tag{2-8} \]

which is a full model with 9 polynomial terms in $X$ and $Y$, or a reduced form of this model. While the regression coefficients cannot be expressed simply in general, we may easily write out $E(D \mid X, Y)$ and $E(D^2 \mid X, Y)$ for computational purposes (see APPENDIX A).

Consider a disease allele on a multi-dimensional LD structure, by which we mean that $E(D \mid X, Y)$ deviates from a linear combination of the variables $X$ and $Y$. This often happens when there is high three-locus LD. The case $E(D \mid X, Y) = \alpha_0 + \alpha_X X + \alpha_Y Y + \alpha_{XY} XY$ with a relatively large absolute value of $\alpha_{XY}$ is a “simple” multi-dimensional LD structure. In this case, the true marker association model for an additive disease is written as

\[ P(\text{affected} \mid X, Y) = \gamma_0 + \gamma_D(\alpha_0 + \alpha_X X + \alpha_Y Y + \alpha_{XY} XY). \]

This implies that tests that include the contrast of an interaction term (i.e. Test 2-5, Test 2-3 and the LD contrast test) may gain power by taking into account the multi-dimensional LD structure. On the other hand, tests that do not take into account the multi-dimensional LD structure (any single marker tests, Test 2-2 and Test 2-4) might have less power. The haplotype-based test also takes multi-dimensional LD structure into account, and its gain in power in a region of high multi-dimensional LD structure has been observed by Nielsen et al. [2004].

We can consider the models in Table 2-III as reduced, full, or extended in com-

parison to the true model. However, the models in Table 2-III and the true association

models are written with different link functions (the logit and identity functions, respec-

tively); whereas the identity link function simplifies exposition for the relationship be-

tween the disease model and the association model, the logit link function is more conve-

nient for data analysis. For small effect sizes the two link functions should yield similar

models. Therefore, our true marker association models are sufficient to provide an intui-

tion about which predictors will be important components of the logistic model.

Finding the most powerful test among the tests in Table 2-III is not straightforward because the test under the full true model, which examines contrasts of all the predictor variables in the full model, is not always the most powerful. When a reduced model explains the data parsimoniously, its corresponding test becomes more powerful than the corresponding test under the full model because of the smaller number of degrees of freedom. Therefore, we compared various association tests with different penetrance models and LD structures, which together determine the true association model.

2.3. POWER OF THE ASSOCIATION TESTS

2.3.1. Power Computation

As mentioned earlier, the T test in a retrospective model and the score test and

LRT in a prospective logistic model are expected to perform similarly. We first derive the

theoretical power calculated from the noncentrality parameter of the T test and compare

this with the empirical power of the T test, score test and LRT. The noncentrality parameter of the T test for Test 2-5 is

Σ

where X| Y| XY| X| Y| ,

X| Y| XY| X| Y| ,

Σ Σ Σ,

X,Y|X,XY| X,X| X,Y | X| Y,XY| Y,X| Y,Y| Y| XY,X|XY,Y| Σ XY| and X,Y| X| Y|

X,Y|X,XY| X,X| X,Y | X| Y,XY| Y,X| Y,Y| Y| XY,X|XY,Y| Σ XY| . X,Y| X| Y|

Here, the two matrices displayed above are symmetric, so we only indicate the diagonal and above-diagonal elements. The number of degrees of freedom, $r$, is the rank of Σ. Given the penetrance model and LD structure, we can derive the entries of the two vectors and the two matrices defined above (see APPENDIX C). The noncentrality parameters for the other tests in Table 2-III can be obtained by using the corresponding sub-vectors and sub-matrices. Then the power of the test at significance level $\alpha$ with noncentrality parameter $\lambda$ is given by

$$\text{POWER} = 1 - F_{\lambda, r}\left(\chi^2_{1-\alpha,\, r}\right),$$

where $F_{\lambda, r}$ is the cumulative distribution function of a chi-square distribution with noncentrality parameter $\lambda$ and $r$ degrees of freedom, and $\chi^2_{1-\alpha,\, r}$ is the $1-\alpha$ quantile of a central chi-square distribution with $r$ degrees of freedom.
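The power formula above can be evaluated numerically once the noncentrality parameter and the degrees of freedom are known. The following Python sketch is one way to do so using only the standard library: it represents the noncentral chi-square distribution as a Poisson mixture of central chi-square distributions and finds the critical value by bisection. The function names and the example noncentrality parameter are assumptions made for illustration; a statistical library that provides the noncentral chi-square distribution directly (SciPy, for example) could be used instead.

```python
import math


def chi2_sf(x, df):
    """Upper tail P(chi-square_df > x) for a positive integer df and x > 0."""
    h = x / 2.0
    if df % 2 == 0:
        q, a = math.exp(-h), 1.0             # closed form for df = 2
    else:
        q, a = math.erfc(math.sqrt(h)), 0.5  # closed form for df = 1
    # Recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q


def chi2_critical(alpha, df):
    """Value c with P(chi-square_df > c) = alpha, found by bisection."""
    lo, hi = 1e-12, 1.0
    while chi2_sf(hi, df) > alpha:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def power(ncp, alpha, df, terms=400):
    """POWER = 1 - F_{ncp, df}(critical value), using the Poisson-mixture
    representation of the noncentral chi-square distribution."""
    c = chi2_critical(alpha, df)
    if ncp <= 0:
        return chi2_sf(c, df)
    m = ncp / 2.0
    total = 0.0
    for j in range(terms):
        log_w = -m + j * math.log(m) - math.lgamma(j + 1.0)  # Poisson weight
        total += math.exp(log_w) * chi2_sf(c, df + 2 * j)
    return total


# Hypothetical noncentrality parameter for a 2-df test at the
# genome-wide significance level used in the text.
print(power(35.0, 0.05 / 500_000, 2))
```

The default of 400 mixture terms is adequate only for moderate noncentrality parameters (roughly up to a few hundred); larger values would need more terms.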

We compared this theoretical power of the T test with the empirical power of the T test, score test and LRT. For each of the four disease models, we generated 100,000 replicate datasets, performed Test 1-2 on each dataset with each of the three test statistics and obtained their empirical power from the 100,000 replicate datasets. Specifically, the parameters were set as follows: disease allele frequency 0.2, marker allele frequency 0.3, LD between the marker and disease alleles 0.048 (Lewontin's D′ = 0.8), disease prevalence K = 0.05 (5%), and baseline risk 0.04 (4%). By fixing K and the baseline risk instead of the effect size, for each region and each disease model we can condition on a constant attributable risk, calculated from K and the baseline risk.

The coefficients of the penetrance model are determined by the disease model, the prevalence K and the baseline risk, using the constraints shown in Table 2-IV and the following equations:

KDD 1D DD 1D

D D .

The significance level was set to 0.05/500,000 for a genome-wide association

study with 500K independent SNPs. Each dataset consists of 2,000 cases and 2,000 controls for the additive, dominant and heterozygote disadvantage diseases, and 500 cases and 500 controls for a recessive disease so that its power would not be too high. Empirical power was computed as the ratio of the number of rejected replicates to the total number of replicates.
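The empirical power calculation described above can be sketched as a small simulation. The Python code below is a minimal illustration and not the simulation program used for Table 2-V: it draws genotype counts for cases and controls from assumed genotype frequencies, applies a Pearson chi-square test on the 2 x 3 genotype table as a stand-in for Test 1-2, and reports the fraction of replicates that reject. The genotype frequencies, sample size, number of replicates and function names are all hypothetical.

```python
import math
import random


def pearson_2x3(case_counts, control_counts):
    """Pearson chi-square statistic (2 df) for a 2 x 3 genotype table."""
    stat = 0.0
    n_case, n_ctrl = sum(case_counts), sum(control_counts)
    total = n_case + n_ctrl
    for g in range(3):
        col = case_counts[g] + control_counts[g]
        if col == 0:
            continue  # an empty genotype column contributes nothing
        for obs, n_row in ((case_counts[g], n_case), (control_counts[g], n_ctrl)):
            expected = n_row * col / total
            stat += (obs - expected) ** 2 / expected
    return stat


def draw_counts(rng, genotype_freq, n):
    """Multinomial genotype counts for n individuals."""
    counts = [0, 0, 0]
    for g in rng.choices(range(3), weights=genotype_freq, k=n):
        counts[g] += 1
    return counts


def empirical_power(case_freq, control_freq, n, alpha, replicates, seed=1):
    """Fraction of replicates in which the 2-df test rejects at level alpha."""
    rng = random.Random(seed)
    critical = -2.0 * math.log(alpha)  # exact chi-square quantile for 2 df
    rejected = 0
    for _ in range(replicates):
        cases = draw_counts(rng, case_freq, n)
        controls = draw_counts(rng, control_freq, n)
        if pearson_2x3(cases, controls) > critical:
            rejected += 1
    return rejected / replicates


# Hypothetical genotype frequencies (0, 1, 2 copies of the marker allele).
print(empirical_power([0.42, 0.45, 0.13], [0.49, 0.42, 0.09],
                      n=500, alpha=0.05, replicates=500))
```

Passing the same genotype frequencies for cases and controls gives an estimate of the empirical Type I error instead of the power.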

The theoretical power of the T test was close to the empirical power of the score, LRT and T tests (Table 2-V), while the three test statistics showed almost identical (but very small) departures from the nominal Type I error (Table 2-VI). The T test is slightly more powerful than the other two, while the LRT is slightly more powerful than the score test. (The power under a recessive disease showed relatively greater inconsistencies because of the smaller sample size.) Therefore the theoretical power of the T test can be a good estimate for any of the three tests. For the purpose of comparing the power of the tests in Table 2-III and any other association tests, it is sufficient to compare the theoretical power of the corresponding T test.

                              Theoretical power   Empirical power
Disease model                 T test              T test   LRT     Score test
Additive                      0.532               0.533    0.527   0.523
Dominant                      0.366               0.366    0.361   0.359
Recessive                     0.734               0.741    0.736   0.708
Heterozygote disadvantage     0.284               0.283    0.277   0.275

Table 2-V Comparisons of Theoretical and Empirical Power of Test 1-2
For each of the four disease models, parameters are set as follows: disease allele frequency 0.2, marker allele frequency 0.3, K = 0.05, LD between the marker and disease alleles 0.048 (Lewontin's D′ = 0.8), 2,000 cases and 2,000 controls (500 each for recessive), and significance level 0.05/500,000. Empirical power is obtained as the ratio of the number of rejected replicates to the total of 100,000 replicates.

α       T test   LRT      Score test
0.05    0.0501   0.0501   0.0501
0.01    0.0103   0.0103   0.0103
0.001   0.0011   0.0010   0.0010

Table 2-VI Empirical Type I Error of Test 1-2

2.3.2. Power Comparisons

The theoretical power of the HWD contrast test, LD contrast test and haplotype-

based test can be computed from their noncentrality parameters. The noncentrality para-

meters for the HWD contrast test and the LD contrast test are given by

A|A| λHWD and VA|VA|

AB|AB| λLD , VAB|VAB|

where the denominators are given in (2-1). The noncentrality parameter of the haplotype-based test, λHAP, can also be computed (see APPENDIX D). These three noncentrality parameters are given under the assumption that the true minor allele frequencies of markers A and B are known, or that the haplotype frequencies are determined with certainty, in both cases and controls. The theoretical power from these noncentrality parameters somewhat overestimates the power in a real situation because the variances in the denominators of the noncentrality parameters would be greater. Therefore, our theoretical power comparisons of these three tests and the tests in Table 2-III may give results that are slightly favorable to these three tests.

We compared power among the single-marker tests and among the two-marker tests. For the single-marker tests, we present power as a function of the LD (Lewontin’s

D) between marker and disease allele (Fig. 2-1). Test 1-1 (the allelic test) always had

more power than Test 1-2 (the genotypic test) or the HWD contrast test in the case of an

additive disease. But in the other disease models, Test 1-2 had more power than the allel-

ic test when LD is high. The HWD contrast test had less power for all four disease mod-

els. However, although the HWD contrast test performed poorly by itself, when it was

combined with the allele frequency test to be Test 1-2, power was maximized. The power

is also presented as a function of the disease allele frequency (Fig. 2-2). For dominant

and heterozygote disadvantage diseases, the power of Test 1-2 is larger than that of Test

1-1 when the disease allele frequency is close to a marker allele frequency. We present

these two plots for the larger prevalence K=0.20 (Fig. 2-3 and 2-4) and the smaller prevalence K=0.005 (Fig. 2-5 and 2-6), and the findings were not materially different. This

prevalence range (0.005~0.200) covers the prevalences of many diseases such as diabetic

nephropathy (in the age group of older than 75 years, about 0.005) and hypertension

(about 0.200).

For the two-marker tests, two cases of LD structure were considered for each dis-

ease model: a case with low multi-dimensional LD (LD structure 1) and one with high

multi-dimensional LD (LD structure 2), as defined in the legend to Table 2-VII. In LD

structure 1, Test 2-2, which examines the allele frequencies of the two markers, had the most power except in the case of a recessive disease (Table 2-VII). However, in LD

structure 2, the haplotype-based test, Test 2-5 and Test 2-3 had more power than Test 2-2

by taking into account the multi-dimensional LD. The LD contrast test, like the HWD

contrast test, performed poorly alone but, when it was combined with the allele frequency

contrast test, power was maximized. Power of the two-marker tests is also presented for a

common disease allele (Table 2-VIII).

Figure 2-1 Power of single-marker test vs LD in four disease models (K=0.05)

Parameters other than LD are set as follows: disease allele frequency 0.2, marker allele frequency 0.3, K = 0.05, baseline risk 0.04, significance level 0.05/500,000, and the sample size for each of cases and controls is set to 2,000 for additive, dominant and heterozygote disadvantage diseases and 500 for recessive disease so that its power is not too high.

Figure 2-2 Power of single-marker test vs disease allele frequency in four disease models (K=0.05)

Parameters other than disease allele frequency are set as follows: LD between the marker and disease alleles D′ = 0.8, marker allele frequency 0.3, K = 0.05, baseline risk 0.04, significance level 0.05/500,000, and the sample size for each of cases and controls is set to 2,000 for additive, dominant and heterozygote disadvantage diseases, and 500 for recessive disease so that its power is not too high.

Figure 2-3 Power of single-marker test vs LD in four disease models (K=0.20)

Parameters other than LD are set as follows: disease allele frequency 0.2, marker allele frequency 0.3, K = 0.20, baseline risk 0.16, significance level 0.05/500,000, and the sample size for each of cases and controls is set to 4,000 for additive, dominant and heterozygote disadvantage diseases, and 1,000 for recessive disease so that its power is not too high.

Figure 2-4 Power of single-marker test vs disease allele frequency in four disease models (K=0.20)

Parameters other than disease allele frequency are set as follows: LD between the marker and disease alleles D′ = 0.8, marker allele frequency 0.3, K = 0.20, baseline risk 0.16, significance level 0.05/500,000, and the sample size for each of cases and controls is set to 4,000 for additive, dominant and heterozygote disadvantage diseases, and 1,000 for recessive disease so that its power is not too high.

Figure 2-5 Power of single-marker test vs LD in four disease models (K=0.005)

Parameters other than LD are set as follows: disease allele frequency 0.2, marker allele frequency 0.3, K = 0.005, baseline risk 0.16, significance level 0.05/500,000, and the sample size for each of cases and controls is set to 4,000 for additive, dominant and heterozygote disadvantage diseases, and 1,000 for recessive disease so that its power is not too high.

Figure 2-6 Power of single-marker test vs disease allele frequency in four disease models (K=0.005)

Parameters other than disease allele frequency are set as follows: LD between the marker and disease alleles D′ = 0.8, marker allele frequency 0.3, K = 0.005, baseline risk 0.16, significance level 0.05/500,000, and the sample size for each of cases and controls is set to 4,000 for additive, dominant and heterozygote disadvantage diseases, and 1,000 for recessive disease so that its power is not too high.

Disease model                 Test 2-5   Test 2-4   Test 2-3   Haplotype-based   Test 2-2   LD contrast
(LD structure 1)
Additive                      0.775      0.813      0.851      0.842             0.890      0.000
Dominant                      0.695      0.736      0.774      0.749             0.819      0.000
Recessive                     0.823      0.845      0.746      0.784             0.717      0.001
Heterozygote disadvantage     0.617      0.653      0.673      0.621             0.711      0.000
(LD structure 2)
Additive                      0.962      0.758      0.970      0.948             0.850      0.007
Dominant                      0.921      0.673      0.926      0.887             0.769      0.003
Recessive                     0.851      0.647      0.910      0.945             0.618      0.206
Heterozygote disadvantage     0.845      0.584      0.831      0.773             0.656      0.001

Table 2-VII Power comparisons of two-marker tests (rare allele frequency)
For both LD structure 1 and LD structure 2 the parameters are set as follows: number of cases = number of controls = 2,000 (500 for recessive), K = 0.05, baseline risk 0.04, significance level 0.05/500,000; allele frequencies 0.2 at marker A, 0.1 at the disease locus D and 0.2 at marker B; pairwise LD 0.045 (D′ = 0.56) between A and D, 0.047 (D′ = 0.59) between B and D, and 0.028 (D′ = 0.18) between A and B. For LD structure 1, the three-locus LD among A, D and B was 0.01, and for LD structure 2 it was 0.035. Bold numbers in each row indicate the maximum power over all two-marker tests in that row.

Disease model                 Test 2-5   Test 2-4   Test 2-3   Haplotype-based   Test 2-2   LD contrast
(LD structure 1)
Additive                      0.865      0.834      0.920      0.889             0.904      0.000
Dominant                      0.326      0.274      0.421      0.339             0.361      0.000
Recessive                     0.800      0.783      0.869      0.842             0.871      0.000
Heterozygote disadvantage     0.006      0.003      0.004      0.001             0.000      0.000
(LD structure 2)
Additive                      0.819      0.787      0.873      0.864             0.871      0.000
Dominant                      0.247      0.238      0.295      0.290             0.324      0.000
Recessive                     0.697      0.681      0.785      0.793             0.794      0.000
Heterozygote disadvantage     0.002      0.003      0.000      0.000             0.000      0.000

Table 2-VIII Power comparisons of two-marker tests (common allele frequency)
For both LD structure 1 and LD structure 2 the parameters are set as follows: number of cases = number of controls = 4,000 (1,000 for recessive), K = 0.05, baseline risk 0.04, significance level 0.05/500,000; allele frequencies 0.3 at marker A, 0.4 at the disease locus D and 0.3 at marker B; pairwise LD 0.130 (D′ = 0.72) between A and D, 0.120 (D′ = 0.67) between B and D, and 0.008 (D′ = 0.04) between A and B. For LD structure 1, the three-locus LD among A, D and B was -0.016, and for LD structure 2 it was -0.024. Bold numbers in each row indicate the maximum power over all two-marker tests in that row.

For a given disease model, none of the tests were found to be the most powerful in both cases of LD structure. Therefore, the best test for a particular association study may be determined by specifying the joint LD structure of the markers and a real disease

SNP, were it known. If we assume that the LD structure among the marker and disease

SNPs is similar to that among the marker SNPs, we can compute and compare the power of each test given a known LD structure by estimating the necessary parameters from the marker LD structure.

2.3.3. Power Comparisons Based on Real Data

We estimated LD parameters and marker allele frequencies from the HapMap

CEU population data [The International HapMap Project 2005]. These data consist of

120 haplotypes on chromosome 11 estimated from 30 parent-offspring trios. We split

chromosome 11 into mutually exclusive consecutive regions containing 3 SNPs each. For

each region we estimated the LD and allele frequency parameters. We excluded regions

where the minor allele frequencies of three consecutive markers were less than 0.1, leav-

ing 4,648 regions. Following the method in Nielsen et al.’s [2004] simulation, we chose

the disease SNP to be the one with the smallest allele frequency, assuming the disease

allele would have a smaller allele frequency than the ascertained marker SNPs. Parameters other than the LD parameters were set to be the same as before, i.e. n=2,000 (500 for recessive), K=0.05, baseline risk 0.04, and significance level 0.05/500,000. For each region, we computed the

power of the single marker tests based on the first of the non-disease SNPs, and we com-

puted the power of the two marker tests based on the two non-disease SNPs. Then the

mean power over all regions was computed (Table 2-IX).
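The region construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the input list of minor allele frequencies are hypothetical, and the exclusion rule is read here as dropping a region when all three of its SNPs have a minor allele frequency below 0.1 (a stricter reading would drop a region when any one SNP falls below the threshold).

```python
def build_regions(maf, region_size=3, maf_threshold=0.1):
    """Split consecutive SNPs into non-overlapping regions and pick a disease SNP.

    maf -- minor allele frequencies in chromosomal order (hypothetical input).
    Returns a list of (disease_index, marker_indices) tuples.
    """
    regions = []
    for start in range(0, len(maf) - region_size + 1, region_size):
        idx = list(range(start, start + region_size))
        # Assumed reading of the exclusion rule: drop a region when all of
        # its SNPs have a minor allele frequency below the threshold.
        if all(maf[i] < maf_threshold for i in idx):
            continue
        disease = min(idx, key=lambda i: maf[i])    # smallest allele frequency
        markers = [i for i in idx if i != disease]  # the two remaining SNPs
        regions.append((disease, markers))
    return regions


# Hypothetical minor allele frequencies for ten consecutive SNPs.
maf = [0.25, 0.12, 0.40, 0.05, 0.08, 0.02, 0.30, 0.45, 0.15, 0.20]
for disease, markers in build_regions(maf):
    # Single-marker tests would use markers[0]; two-marker tests use both.
    print(f"disease SNP {disease}, marker SNPs {markers}")
```

SNPs left over at the end of the chromosome that do not fill a complete region are ignored in this sketch.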

                              Single-marker tests                  Two-marker tests
Disease model                 Test 1-2   Test 1-1   HWD contrast   Test 2-5   Test 2-4   Test 2-3   Test 2-2   Haplotype-based   LD contrast
Additive                      0.423      0.457      0.000          0.575      0.586      0.604      0.632      0.625             0.019
Dominant                      0.361      0.347      0.001          0.505      0.513      0.518      0.505      0.488             0.003
Recessive                     0.519      0.415      0.255          0.687      0.677      0.672      0.572      0.624             0.278
Heterozygote disadvantage     0.423      0.241      0.163          0.587      0.580      0.546      0.367      0.344             0.058

Table 2-IX Mean of Power over Chromosome 11 of CEU HapMap Data
For both single-marker and two-marker tests, the parameters are set as follows: n = 2,000 (500 for recessive), K = 0.05, baseline risk 0.04, significance level 0.05/500,000. Bold numbers in each row indicate the maximum power in that row.

In a comparison of the single-marker tests, while Test 1-1 was the most powerful

on average for an additive disease, Test 1-2 was the most powerful on average for the

other three disease models. Comparing among the two-marker tests, the most powerful

tests on average were: Test 2-2 for an additive disease, Test 2-3 for a dominant disease,

and Test 2-5 for a recessive or heterozygote disadvantage disease. However, among these

tests, Test 2-3 had relatively consistent power. It had greater overall average power than

the haplotype-based test except for an additive disease, but it nevertheless had compara-

ble power to the haplotype-based test. As seen before, the purely HWD contrast test and

LD contrast test had the lowest power.

2.4. DISCUSSION

The single-marker and two-marker association models were constructed and the

various association tests based on these models were investigated. By elucidating the re-

lationships between retrospective tests and prospective tests, and between genotype-based

tests and the haplotype-based test, one can understand more comprehensively the proper-

ties of the various association tests using case-control data.

The case-control HWD and LD contrasts, given the allele frequency contrast(s), were represented as a quadratic term and an interaction term, respectively. Accordingly,

we have written single-marker and two-marker association models with the predictors

expressed as simple polynomials. Therefore, the joint tests of the allele frequency, HWD

and LD contrasts may be performed by testing the regression coefficients in a prospective

marker association model. We gain some useful information by understanding this rela-

tionship. For example, the observation that the HWD contrast test has power only in the

case of a non-additive disease is obvious in a prospective model because the HWD con-

trast corresponds to a quadratic term.

True association models of any marker SNP in LD with the disease SNP can be written from the penetrance model of the disease SNP and the LD between alleles at the two loci. From the single-marker association model, we could learn many things. Except

D in the special case that 12D 0, any non-additive effect will diminish and D the association pattern will approach the additive model. This may be the unrecognized underlying reason why testing the additive effect is popular. However, as 12D

D becomes smaller, this approach becomes slower and there may still remain a large D

non-additive effect. In this case, testing only an additive effect will not have as much

power.

If the disease model is given, the LD determines the appropriate association model and power of the tests. To have an idea of the LD structure expected in current genome-wide studies, we have utilized the HapMap data. The estimated LD was high enough for the genotypic test to have more power than the allelic test in all three non-additive diseases. For an additive disease, the genotypic test lost little power compared to the allelic test. Therefore, in a genome-wide study with current levels of marker density, the genotypic test may be preferable.

For the two-marker association tests, we concluded that Test 2-3, which has an extra interaction term as well as linear terms, provides reliable power in any disease

model. This suggests that, in tests involving more than two markers, it may be reasonable

to include interaction terms as well as linear terms. In fact, multi-marker models with in-

teraction terms have been presented by several authors [Conti and Gauderman 2004;

Cordell and Clayton 2002; Devlin, et al. 2003]. Conti and Gauderman [2004] compared

various two-marker tests, some of which are equivalent to the tests in Table 2-III. They

introduced a modified interaction term and gained power comparable to that of the haplo-

type-based test. Here, we compared a different set of two-marker tests under a more

comprehensive set of disease models. We also showed how the interaction term may be

interpreted as an LD contrast. Therefore, our results provide further support for the inclu-

sion of interaction terms in multi-marker tests.

The power of the HWD and LD contrast tests by themselves was very low. Therefore we conclude that it is not advisable for HWD or LD contrast tests to be used alone when conducting a genome-wide association study, but rather they should be used in conjunction with the allele frequency contrast test.

The results showed that in most cases the genotype-based tests have greater power than a haplotype-based test. The practical benefit of genotype-based tests is that they do not require phase inference. Estimating haplotypes for a genome-wide association study not only introduces another source of variation in the test, but it also entails significantly more analysis time. However, it should be noted that we only considered two-locus haplotypes and our power comparisons using HapMap data are only valid under the assumption that a disease SNP would be part of the same LD pattern as neighboring marker SNPs. Specifying a more realistic distribution for a disease allele and its LD structure with marker alleles (perhaps, for example, using coalescent theory [Zöllner and Pritchard 2005]) could provide a fairer comparison of the tests.

By considering the prospective models, the allele frequency, HWD and LD contrasts can be utilized and understood better as follows. First, we can test the three contrasts more systematically in a multi-stage analysis by adopting a sequential test or any other testing procedures developed in regression modeling. Second, these contrasts can easily be tested together with other covariates. Therefore, under this framework population stratification can be modeled by any method that uses covariates for ancestry, as was developed in the context of the allele frequency contrast test [Price, et al. 2006; Pritchard, et al. 2000; Zhu, et al. 2002]. Last, the tests can be easily extended to quantitative traits.

Chapter 3 An Application to Diabetic Nephropathy Data

3.1. INTRODUCTION

Diabetic nephropathy (DN) is the main cause of end stage renal disease (ESRD)

in the US. The disease burden in people of Mexican-American descent is particularly

high, but there are only a limited number of studies that have characterized genes for DN

in this ethnic group. Recently, two genes, Carnosine dipeptidase 1 (CNDP1) and Engulfment and Cell Motility 1 (ELMO1), were reported to be associated with DN [Freedman, et al. 2007; Janssen, et al. 2005; Leak, et al. 2009; Shimazaki, et al. 2005]. Janssen et al. [2005] reported an association of DN and a microsatellite marker, D18S880, in CNDP1 in Type 1 and Type 2 diabetic patients from four different countries, and Freedman et al. [2007] reported its replication in Type 2 diabetic Caucasian patients. Shimazaki et al. [2005] reported an association in the Japanese population of DN and ELMO1, in which the most significant SNP was rs741301. Leak et al. [2009] reported a replication of the association of the ELMO1 gene in an African-American population.

Here we study ten candidate genes for their association with diabetic nephropathy in the Mexican-American population. We attempt to replicate the previous associations of CNDP1 and ELMO1 and, in addition, we study the following eight genes, which are good biological candidates but have not been studied extensively: Hemicentin 1 (HMCN1), complement factor H (CFH), alpha-2-Heremans-Schmid-glycoprotein (AHSG), caspase 3 (CASP3), heat shock 70kDa protein 1A (HSPA1A), heat shock 27kDa protein 1 (HSPB1), caspase 12 (CASP12), and heme oxygenase (decycling) 1 (HMOX1).

HMCN1 was shown to be associated with change in calculated glomerular filtration rate [Thompson, et al. 2007] but its role in DN has never been examined. CFH has long been known to play a role in atypical hemolytic uremic syndrome (aHUS) and membranoproliferative glomerulonephritis (MPGN), but its involvement in DN has not been evaluated. AHSG is reported to be associated with type 2 diabetes and dyslipidemia, and AHSG inhibits insulin-induced tyrosine phosphorylation of IRS-1 [Andersen, et al. 2008]; it has been identified as a marker of acute kidney injury [Zhou, et al. 2006]. Its serum concentration is increased in non-dialyzed patients with diabetic nephropathy [Mehrotra, et al. 2005] and low in patients with ESRD [Ketteler, et al. 2003]. High serum levels are associated with insulin resistance [Stefan, et al. 2006]. HSPB1, also known as HSP27, is involved in the regulation of cell adhesion and invasion [Lee, et al. 2008], regulates actin cytoskeleton turnover, and has anti-apoptotic and antioxidant properties in a wide variety of cells and tissues [Garrido 2002]. A mutation in HSPB1 causing a variant of Charcot-Marie-Tooth disease is associated with the development of focal and segmental glomerulosclerosis [Gherardi, et al. 1985]. HMOX1, also known as HO-1, provides anti-oxidant adaptive functions in response to renal injury [Agarwal and Nick 2000] and is associated with the degree of renal failure in DN [Calabrese, et al. 2007]. CASP3 and CASP12 mediate apoptotic cell death and were chosen as candidate genes because of their relevance to DN [Isermann, et al. 2007; Kumar, et al. 2004]. Finally, HSPA1A was chosen because of its cellular protectant role in the unfolded protein response [Kaufman 2002].

Our study aimed to replicate the previous association of the two genes and/or dis-

cover new associations at the other eight genes of biologic importance by contrasting the

genotype frequencies of SNPs in these ten genes between cases and controls after allow-

ing for relevant covariates.

3.2. RESULTS

We sampled 455 diabetic patients with nephropathy (cases) and 437 patients

without nephropathy (controls) (N=892). The detailed inclusion criteria are described in

section 3.4. Subjects were recruited through three centers: the Harbor-UCLA clinical cen-

ter, the UC Irvine clinical center and the University of Texas Health Science Center at

San Antonio. To assess the effects of the three recruitment centers, two dummy cova-

riates were formed: an indicator variable for UC Irvine samples and an indicator variable

for UT samples. The significances of these covariates, which represent UC Irvine sam-

ples versus Harbor-UCLA samples (UCIrvine) and UT samples versus Harbor-UCLA

samples (UT), respectively, were tested together with other covariates, as detailed below.
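The two site covariates amount to dummy coding of a three-level factor with Harbor-UCLA as the reference level. A minimal Python sketch of that coding follows; the center labels and the function name are illustrative assumptions and are not taken from the study database.

```python
CENTERS = ("Harbor-UCLA", "UC Irvine", "UT San Antonio")  # illustrative labels


def site_dummies(center):
    """Return the (UCIrvine, UT) indicators; Harbor-UCLA is the reference level."""
    if center not in CENTERS:
        raise ValueError(f"unknown recruitment center: {center!r}")
    return (int(center == "UC Irvine"), int(center == "UT San Antonio"))


for c in CENTERS:
    print(c, site_dummies(c))
```

With this coding, the coefficient of each indicator in the logistic model is the log odds ratio of that center relative to Harbor-UCLA.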

We chose tag SNPs to cover LD bins crossing exons of the ten genes. Details of the tag SNPs are given in Table 3-I. The LD structure at these genes in Mexican-Americans was obtained from the HapMap Phase 3 data. The LD plots of CNDP1 and ELMO1 obtained from HapMap Phase 3, along with our tag SNPs, are given in Figures 3-1 and 3-2. Multiple markers in our SNP set for CNDP1 are in high LD with the microsatellite marker D18S880 that showed association in Janssen et al. [2005]. The SNP rs741301 in ELMO1, for which association has been reported by Shimazaki et al. [2005], is included in our SNP set for that gene.

Gene      Chromosome   Size in Kbp   Number of tagSNPs
CNDP1     18           70.6          9
ELMO1     7            614.6         46
HMCN1     1            476.4         21
CFH       1            115.5         18
AHSG      3            28.2          8
CASP3     4            41.8          9
HSPA1A    6            22.4          8
HSPB1     7            21.7          4
CASP12    11           32.0          7
HMOX1     22           33.1          10

Table 3-I Details of tag SNPs at the candidate genes

The SNP genotype data were analyzed using the likelihood ratio test (LRT) in five logistic regression models: two single-marker tests (i.e. an allelic test and a genotypic test) and three two-marker tests (i.e. a 2-DF test, a 3-DF test and a 5-DF test) (see section 3.4 and Table 2-III). This strategy is appropriate for detecting non-additive effects (e.g. dominant, recessive and heterozygote (dis)advantage effects) as well as additive effects, and takes the multi-dimensional LD structure into account, similar to a haplotype-based analysis.

Figure 3-1 LD Plot of CNDP1 and tag SNPs

Figure 3-2 LD Plot of ELMO1 and tag SNPs

Table 3-II Categorical Covariates used in our analysis

Variable    Code
Phenotype   0: Control, 1: Case
UCIrvine    0: Not UC Irvine, 1: UC Irvine
UT          0: Not Texas, 1: Texas
Gender      1: Male, 2: Female
ACEi        0: not on ACEi medication, 1: on medication
ARB         0: not on ARB medication, 1: on medication

The covariates, other than recruitment center (Site), were gender (Gender), age at diabetes diagnosis (Age), duration of diabetes (DM_Duration), use of Angiotensin Converting Enzyme inhibitors (ACEi) and use of Angiotensin II Receptor Blockers (ARB).

Categorical covariates are coded as shown in Table 3-II. Population stratification among the samples was adjusted for by using the first four principal components [Price, et al.

2006] (PC1-PC4) computed from genome-wide null SNP markers that were available in

this sample (see section 3.4. and APPENDIX E for details).

The covariates were tested under a null logistic model that does not include candidate SNPs (Table 3-III). The following covariates were significant at α=0.1 and so were incorporated as covariates in the genetic association analysis: gender (Gender), the two sites (UCIrvine and UT), diabetes duration (DM_Duration), age (Age), the first, second and fourth principal components (PC1, PC2 and PC4) and ARB use (ARB). The third principal component (PC3) and ACEi use (ACEi) were not significant.

Covariate      Estimate of the effect   P value
(Intercept)    -1.16E+00                0.060
Gender         -1.06E+00                < 0.001
UCIrvine       7.45E-01                 0.002
UT             -4.79E+00                < 0.001
DM_Duration    6.83E-02                 < 0.001
Age            2.55E-02                 0.005
PC1            5.64E-05                 0.067
PC2            -3.87E-04                0.009
PC3            -3.96E-05                0.789
PC4            -3.04E-04                0.050
ACEi           -9.20E-02                0.596
ARB            1.19E+00                 < 0.001

Table 3-III Logistic regression model identifying significant covariates

Because medication use (ARB) records were unavailable on a portion of the samples (11.0%) despite careful medical record review, we performed the analyses with two different covariate sets, one that included the covariate for ARB use and the other excluding it. The two sets of covariates are thus:

(A) : Gender, Age, DM_Duration, UCIrvine, UT, PC1, PC2, PC4

(B) : (A) + ARB use.

For quality control, 296 additional SNPs over the whole genome were typed and a QQ plot of their chi-square statistics was examined (APPENDIX E). There was no significant inflation of the statistic (mean: 1.038 and not significantly different from 1). Therefore, our association test results using the logistic regression model will not be invalidated by spurious association as a result of unexplained population stratification or differential bias in genotyping.

Because CNDP1 and ELMO1 are candidates that have previously shown evidence

of association, we used a sequential procedure to adjust for the number of tests. Thus, we

first conducted our association testing on CNDP1, followed by ELMO1, and then tested the eight other candidate genes. In each study we set a significance level adjusted for multiple tests according to the number of tests performed in that study, using a conservative Bonferroni correction.

The association test results for CNDP1 and ELMO1 adjusting for ARB use are shown in Figure 3-3. For the CNDP1 gene, we considered 9 SNPs and 8 SNP pairs. Two and three tests were performed for each SNP and each SNP pair, respectively; therefore, forty two (9x2 + 8x3 = 42) tests were performed for each covariate configuration (A) and (B). In the same way, two hundred and twenty seven (46x2 + 45x3 = 227) tests were performed for the ELMO1 gene. Hence, Bonferroni corrected significance levels corresponding to α=0.05 for a single test are α=0.05/42 and α=0.05/227 for CNDP1 and ELMO1, respectively. These are shown as horizontal lines in Figure 3-3. The figure shows that our study did not replicate the associations of either CNDP1 or ELMO1 with DN, including at the SNP marker rs741301 in ELMO1 (uncorrected p values are 0.804 and 0.274 for the allelic and genotypic test, respectively). Therefore, the common variants we tested in these genes do not play a strong role in diabetic nephropathy in Mexican-Americans.

Four hundred and one (85x2 + 77x3 = 401) tests were performed on the eight other candidate genes and the results are shown in Figure 3-4. Among them, two tests performed with one SNP pair in the HMCN1 gene (rs2146098, rs6659783) provided significant results at the α=0.05/401 level after adjusting for all the significant covariates including ARB use (Figure 3-4) (p values of the 2-DF-test and the 3-DF-test were 6.1 × 10⁻⁵ and 1.2 × 10⁻⁴, respectively). This SNP pair also had strong signals in the tests without ARB use (e.g. p value of the 3-DF-test was 6.8 × 10⁻⁴), but in those tests failed to reach the same significance level.
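The Bonferroni thresholds used in the three sequential studies can be checked with a few lines of arithmetic. The sketch below is illustrative only: it takes the test totals (42, 227 and 401) and the p values exactly as reported in the text, and the function and variable names are hypothetical.

```python
import math


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level under a Bonferroni correction."""
    return alpha / n_tests


# Test totals as reported in the text for each sequential study.
reported_totals = {"CNDP1": 42, "ELMO1": 227, "eight other genes": 401}

thresholds = {
    study: bonferroni_threshold(0.05, n) for study, n in reported_totals.items()
}
for study, level in thresholds.items():
    # -log10(level) is the height of the horizontal line in a -log p plot,
    # assuming the plots use base-10 logarithms.
    print(f"{study}: alpha = {level:.3e}, -log10(alpha) = {-math.log10(level):.2f}")

# Reported HMCN1 p values with ARB use: 2-DF test and 3-DF test.
with_arb = [6.1e-5, 1.2e-4]
# Reported 3-DF p value without ARB use.
without_arb = 6.8e-4

threshold_other = thresholds["eight other genes"]
significant_with_arb = [p < threshold_other for p in with_arb]
significant_without_arb = without_arb < threshold_other
```

Under these inputs both tests that adjust for ARB use fall below 0.05/401, while the 3-DF test without ARB use does not, which matches the description above.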

Because the two SNPs rs2146098 and rs6659783 are in high LD with each other (r² = 0.96), their haplotypes could be determined almost unambiguously. Haplotype distributions in cases and controls are given in Table 3-IV. In particular, the table shows that 13 G-A haplotypes are found in the cases, but only one G-A haplotype is present in the controls. A subsequent haplotype-based test showed evidence of association for the rare haplotype G-A (p values are 2.1 × 10⁻⁴ and 2.1 × 10⁻⁵ for covariate sets (A) and (B), respectively). We also assessed this finding using Fisher's exact test applied to a 2 x 2 table unadjusted for covariates. A less significant result was obtained by the exact test (p value = 0.00175). This may be due to the exclusion of covariates, because the same reduction in significance was also observed in the LRT analysis without covariates (p value = 0.00057). Lastly, we assessed the interaction between the haplotype and ARB use (including a category for its use status being missing) using a log-linear model. The three-way interaction term of haplotype, phenotype and ARB use was not significant (p value = 0.966), which indicates that the association of haplotype and phenotype is not significantly modified by ARB use or its status being missing.
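As an illustration of the unadjusted exact test, the following sketch computes a two-sided Fisher's exact p value from the hypergeometric distribution using only the standard library. The 2 x 2 table is an assumption: it is built at the individual level (carrier versus non-carrier of G-A) by halving the haplotype totals of Table 3-IV, which is possible because each individual carries at most one copy of G-A. Whether this is the exact table the analysis used is not stated in the text.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        return comb(row1, k) * comb(row2, col1 - k) / total_tables

    k_min, k_max = max(0, col1 - row2), min(col1, row1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(k) for k in range(k_min, k_max + 1) if prob(k) <= p_observed * (1 + 1e-9)
    )


# Assumed individual-level table derived from Table 3-IV:
# 910 / 2 = 455 cases and 874 / 2 = 437 controls.
case_carriers, case_noncarriers = 13, 455 - 13
control_carriers, control_noncarriers = 1, 437 - 1

p_value = fisher_exact_two_sided(
    case_carriers, case_noncarriers, control_carriers, control_noncarriers
)
print(f"two-sided Fisher exact p = {p_value:.5f}")
```

With this assumed table the result should be of the same order as the reported exact-test p value; a different table definition (for example, counting haplotypes rather than individuals) gives a slightly different value.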

[Figure 3-3: two panels, CNDP1 and ELMO1. x-axis: position (Mbp); y-axis: -log p. Plotted series: allelic test, genotypic test, 2 DF test, 3 DF test, 5 DF test.]

Figure 3-3 Results for CNDP1 and ELMO1 with ARB use covariate (B). Results for the two-marker tests (2, 3 and 5 DF tests) are plotted at the midpoint of the two markers. Bonferroni corrected significance levels of α=0.05/42 and α=0.05/227 are shown as horizontal lines.

[Figure 3-4: eight panels, one per gene: HMCN1, CFH, AHSG, CASP3, HSPA1A, HSPB1, CASP12, HMOX1. x-axis: position (Mbp); y-axis: -log p.]

Figure 3-4 Results for the eight other candidate genes. The significance level, α=0.05/401, is shown as a horizontal line in each panel.

[Figure 3-5: single panel, HMCN1. x-axis: position (Mbp); y-axis: -log p. Plotted series: allelic test, genotypic test, 2 DF test, 3 DF test, 5 DF test.]

Figure 3-5 Association test results for HMCN1 with covariate configuration (B) (including ARB use as a covariate)

Haplotype   Cases          Controls
G-T         533 (58.6%)    511 (58.5%)
G-A          13 (1.4%)       1 (0.1%)
A-A         364 (40.0%)    362 (41.4%)
Total       910            874

Table 3-IV The Haplotype Distribution of rs2146098 and rs6659783 in Cases and Controls

The LD plot of HapMap Mexican-Americans in HMCN1, together with our tagged SNPs, is given in Figure 3-6. The 21 tagged SNPs covered the exonic regions of this gene fairly thoroughly. Although we have only reported the best SNP and the markers within a rare haplotype that tagged disease risk, many other markers that were in moderate LD with these SNPs also showed evidence of association (Figures 3-4 and 3-5).

Figure 3-6 LD Plot of HMCN1 and tag SNPs


3.3. DISCUSSION

In a cohort of 892 Mexican-American probands with DN and long-term diabetic controls without nephropathy, we found a significant association with DN of tag SNPs in the HMCN1 gene. We tried to replicate the association between DN and CNDP1 previously reported in individuals of European ancestry, but failed to do so. Wanic et al. [2008] also failed to confirm an association between DN and CNDP1 in patients with Type 1 diabetes, but found moderate association in the CNDP2 promoter region with DN in Type 1 diabetes. CNDP1 may not have an association in Mexican-Americans, or the previous associations may have been caused by CNDP2, which could be in moderate LD with CNDP1.

In our Mexican-American subjects, we also could not replicate the association with ELMO1 previously reported in Japanese and African-American cohorts [Leak, et al. 2009; Shimazaki, et al. 2005]. Mexican-Americans are an admixed population with substantial American Indian ancestry. The latter, in turn, have Asian ancestral relatedness. Therefore, our negative results may be explained by genetic heterogeneity, population structure, Type I error in the original paper, and/or Type II error in our cohort analysis. Interestingly, many of the SNPs in the ELMO1 gene reported by Shimazaki et al. [2005], including the most significant SNP, rs741301, show severe deviation from Hardy-Weinberg proportions (HWP) in their controls (e.g. p value for rs741301 is 7.7 × 10⁻⁵ for their largest control group). We noticed that the minor allele homozygote genotype frequencies in their three control groups (3.3% to 6.3%) are quite different from the estimates for HapMap Japanese data (12.8% for merged Phase II and III data), while their cases and our cases and controls conform to HWP quite well (p value = 0.79 (their largest case group), 0.39 and 0.77, respectively). We also noted that the association in African-Americans reported by Leak et al. [2009] was not adjusted for multiple tests. Although there could be other reasons for the difference between our results and previous results, this discrepancy would strongly suggest that previous findings were due to a systematic differential bias in genotyping, population stratification within their samples or Type I error.
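For readers who want to reproduce a check of Hardy-Weinberg proportions such as the one discussed above, a minimal sketch of the 1-degree-of-freedom Pearson chi-square test is given below. The genotype counts are hypothetical and serve only to exercise the function; the genotype counts from the cited studies are not given in this text, and the test actually used to obtain the quoted p values is not specified here.

```python
import math


def hwp_chi_square(n_aa: int, n_ab: int, n_bb: int) -> tuple[float, float]:
    """Pearson chi-square test (1 df) of Hardy-Weinberg proportions.

    Returns the chi-square statistic and its p value.
    """
    n = n_aa + n_ab + n_bb
    q = (2 * n_bb + n_ab) / (2 * n)  # frequency of allele b
    p = 1.0 - q
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts that sit exactly on Hardy-Weinberg proportions (q = 0.4).
chi2_ok, p_ok = hwp_chi_square(360, 480, 160)

# Hypothetical counts with the same allele frequency but too few homozygotes.
chi2_dev, p_dev = hwp_chi_square(300, 600, 100)

print(chi2_ok, p_ok)
print(chi2_dev, p_dev)
```

A deficit of minor allele homozygotes relative to the square of the minor allele frequency, as described for the published control groups, is the kind of pattern the second call illustrates.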

Among the other eight candidate genes, our study found a new association of HMCN1 with DN in the Mexican-American population. A series of subsequent analyses on this region identified the strong association of the rare haplotype G-A in the SNPs rs2146098 and rs6659783. HMCN1, which is located in 1q25.3-q31.1, codes for an extracellular protein in the immunoglobulin superfamily. It has been suspected to contain mutations associated with age-related macular degeneration (AMD) [Schultz, et al. 2005; Thompson, et al. 2007]. Here we found a strong association between HMCN1 and DN. Our result provides a connection between DN and retinal disease. Further study controlling for both DN and AMD would allow better understanding of the underlying biological pathway. Additional work at this locus and other loci will enable us to refine the genetic hypotheses regarding DN in Mexican-Americans.

3.4. CONCISE METHODS

The Family Investigation of Nephropathy and Diabetes (FIND) study employed two study designs: genome-wide linkage analysis and mapping by admixture linkage disequilibrium (MALD) [Knowler, et al. 2005]. The MALD study described here uses a case-control design and enrolls individuals with African-American and Mexican-American ancestry to ascertain diabetic nephropathy association based on the ancestral inheritance of risk loci. This study was performed as an ancillary study to the Mexican-American MALD study in FIND. We used the same phenotypic definitions as previously reported in other FIND papers [Iyengar, et al. 2007; Knowler, et al. 2005] with the exception that for this study all participants were required to have both parents and all four grandparents of Mexican-American ancestry. In addition, whereas in the FIND cohort, by definition, controls had siblings with DN, in the MALD study, and in the subjects reported herein, all controls had long-term diabetes but no known first-degree relative with DN or other kidney disease.

Assessment of Diabetes

The definition of diabetes utilized in this study was identical to the definition employed by the FIND consortium [Knowler, et al. 2005]. Diabetic subjects were, at the time of recruitment, either currently or previously treated with insulin, oral hypoglycemic drugs, or both. For diabetic subjects not treated with drugs or subjects with no history of diabetes, eligibility was determined by plasma hemoglobin A1c (HbA1c) levels or, if possible, fasting plasma glucose concentrations at the time of study entry. An HbA1c ≥ 6.0%, suggestive of diabetes, was followed by plasma glucose testing to confirm diabetes before a subject was classified as diabetic for FIND. American Diabetes Association (1997) criteria were used for diagnosis in persons not previously known to have diabetes. For previously diagnosed persons, the date of diabetes diagnosis was obtained from the medical history, with confirmatory medical record review if possible. Persons with any type of diabetes were eligible for the study, but almost all had type 2 diabetes.

Assessment of Diabetic Nephropathy

Diabetic nephropathy was defined by any one of 3 criteria. The greatest number of subjects with DN who entered the study had ESRD. For subjects with ESRD, eligibility for probands was based on 1) onset of diabetes ≥5 years before renal replacement therapy with diabetic retinopathy, 2) onset of diabetes ≥5 years before renal replacement therapy with historic 24-h urine protein ≥3 g (protein-to-creatinine ratio ≥3.0 g/g), or 3) diabetic retinopathy with historic 24-h urine protein >3 g (protein-to-creatinine ratio >3.0 g/g). The second largest group of subjects was those with chronic kidney disease resulting from DN. For this group, eligibility was based on either diabetic retinopathy with historic 24-h urine protein ≥1 g (protein-to-creatinine ratio ≥1.0 g/g), or 24-h urine protein excretion ≥3 g (protein-to-creatinine ratio ≥3.0 g/g) after diabetes duration ≥10 years. A few subjects were entered with biopsy confirmation of diabetic nephropathy in association with proteinuria. DN was confirmed by biopsy if the following histological criteria were present: 1) nodular and/or diffuse increase in mesangial matrix accumulation; 2) thickened glomerular basement membranes and/or arteriolar hyalinization; and 3) absence of mesangial immunoglobulin or paraprotein deposits by immunofluorescence microscopy, absence of amyloid deposits by Congo Red staining or electron microscopy, and absence of electron-dense deposits along the glomerular capillary wall.

Cases and Controls Inclusion Criteria


Unrelated control samples were recruited with diabetes duration ≥10 years but without diabetic nephropathy, as defined by normoalbuminuria, and no known prior history of microalbuminuria or overt nephropathy. Cases and controls were both recruited through three centers: Harbor-UCLA Medical Center, UC Irvine, and the University of Texas Health Science Center at San Antonio.

Adjusting for Population Stratification (Principal Component Analysis)

Principal components were computed from the control samples and then applied to the case and HapMap samples. Only HapMap SNPs that had minor allele frequency > 0.01 in the HapMap populations and were not in linkage disequilibrium with any other SNP (r-squared < 0.2 with all SNPs in a 20 SNP window in each of the 3 HapMap populations) were used for this computation. SNPs in known areas of long LD [Fellay, et al. 2007] were also excluded. We initially included the first four principal components (PC1-PC4) in the logistic regression model as covariates, based on their significance with case-control status. When applied to HapMap data, they discriminated well among Caucasian (CEU), Japanese/Chinese (JPT/CHB), and West African (YRI) samples (APPENDIX E).

Association Tests in a Logistic Regression Model

Two single-marker tests and three two-marker tests were performed in a logistic regression model. They are given and described in Table 2-III. Consecutive SNP markers were paired and the two-marker tests performed with these (overlapping) pairs. The likelihood ratio test p-value of the genotype term is reported for each SNP and each SNP pair.
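The pairing scheme and the resulting number of tests can be sketched as follows. The SNP labels are placeholders and the function names are hypothetical; the only facts taken from the text are that consecutive markers form overlapping pairs, that two tests are run per SNP and three per SNP pair, and the reported SNP and pair counts.

```python
def consecutive_pairs(snps: list[str]) -> list[tuple[str, str]]:
    """Overlapping pairs of adjacent markers: (s1, s2), (s2, s3), ..."""
    return list(zip(snps, snps[1:]))


def number_of_tests(n_snps: int, n_pairs: int,
                    tests_per_snp: int = 2, tests_per_pair: int = 3) -> int:
    """Total tests for one covariate configuration."""
    return tests_per_snp * n_snps + tests_per_pair * n_pairs


# Placeholder labels for the 9 CNDP1 SNPs (real rs numbers are not listed here).
cndp1_snps = [f"snp{i}" for i in range(1, 10)]
cndp1_pairs = consecutive_pairs(cndp1_snps)

cndp1_tests = number_of_tests(len(cndp1_snps), len(cndp1_pairs))

# Eight other genes: 85 SNPs and 77 pairs in total, as reported. Pairs are
# formed within each gene, so there are 85 - 8 = 77 of them.
other_genes_tests = number_of_tests(85, 77)

print(cndp1_pairs[:2], cndp1_tests, other_genes_tests)
```

Because pairs never span two genes, a set of k genes with n SNPs in total yields n - k pairs, which is why 85 SNPs in eight genes give 77 pairs.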


Haplotype-based Analysis

The same logistic regression model was constructed with a haplotype G-A count variable instead of a genotype variable. Then we performed an LRT with this model to test the effect of the haplotype. It is important to note that, because the haplotype G-A is observed as either one copy or no copies in each member of the sample, there is no difference between the additive model and the dominant model.


Chapter 4 Association Tests for a Binary Trait Measured on Related Individuals

4.1. INTRODUCTION

An association test with data from unrelated individuals is relatively simple compared to that with data on related individuals because it assumes independence of the data, ignoring any residual genetic effects that may exist. However, ignoring these effects can invalidate the test when it is applied to data on related individuals. Therefore, appropriate methods are required for these data.

Association tests for data on related individuals, whether they are for quantitative or binary traits, can be categorized into two classes: Transmission Disequilibrium (TD)-based methods and other methods. TD-based methods examine only transmission disequilibrium within each pedigree [Spielman, et al. 1993]. Because gamete transmissions are independent of the specification of the founders' genotypes, TD-based methods are robust to spurious association in the founders' genotypes, such as occurs in the case of population stratification. These methods have been extended to be nonparametric, family-based association tests (FBAT) [Laird, et al. 2000].

Other methods combine the association information from the founders with that from the transmission [Aulchenko, et al. 2007; Bourgain, et al. 2003; Chen and Abecasis 2007; George and Elston 1987; Purcell, et al. 2005; Slager and Schaid 2001; Slager, et al. 2003; Thornton and McPeek 2007; Yu, et al. 2005]. One can expect that these methods would have more power by using more information than the TD-based methods, and this has been observed by Purcell et al. [2005] and Aulchenko et al. [2007].

For quantitative traits, most of these methods are based on the mixed model [Pinheiro and Bates 2000]. The mixed model generally incorporates a polygenic effect, which specifies additive background genetic correlation as a random effect [Chen and Abecasis 2007; Elston, et al. 1992; Rabe-Hesketh, et al. 2008; Yu, et al. 2005]. Generally, association tests of the SNP marker are done as a test of the fixed effect in the mixed model.

The Generalized Linear Mixed Model (GLMM) is the natural extension of the mixed model for a binary trait [Breslow and Clayton 1993]. This model corresponds to the threshold model with a normally distributed latent variable, which is a classic biological model for binary traits. GLMMs have been used to test genetic association in pedigree data with binary traits [Burton, et al. 1999; Noh, et al. 2006; Pawitan, et al. 2004; Rabe-Hesketh, et al. 2008; Zhao, et al. 2007].

Fitting a mixed model or GLMM to pedigree data takes much more time than fitting a model to data on unrelated subjects, making it more difficult to extend these methods to genome-wide studies. However, for a quantitative trait, Chen and Abecasis [2007] proposed the efficient score test [Rao 1973] for genome-wide data. When the tests share the same null model, the efficient score test requires fitting only the null model for all genome-wide testing. This can dramatically reduce the computational time.

Here, two versions of a quasi-likelihood score (QLS) test in a GLMM, which inherit the computational efficiency of the score test, will be proposed. To evaluate the relative power of the proposed tests, two other tests will be considered: Bourgain et al.'s [2003] test and Thornton and McPeek's [2007] test. They are named WQLS and MQLS, respectively, following their notation. First, WQLS and MQLS will be briefly reviewed. Then the GLMM will be introduced and two versions of a QLS test for a GLMM will be proposed. The validity and power of the proposed tests will be assessed together with that of WQLS and MQLS.

4.2. ASSOCIATION TESTS IN THE RETROSPECTIVE CONTEXT

WQLS and MQLS were developed in the retrospective context. The genotype score, G, was considered as a random variable and its means in the cases and controls were contrasted to find association. Thornton and McPeek [2007] expressed the two test statistics in the form

$$\frac{(\hat{p}_1 - \hat{p}_2)^2}{\mathrm{Var}_0(\hat{p}_1 - \hat{p}_2)},$$

where $\hat{p}_2$ is an estimator of the allele frequency calculated under the assumption of no association and $\hat{p}_1$ is a contrasting estimator of the allele frequency that should have a different expectation from $\hat{p}_2$ when there is association. $\mathrm{Var}_0$ denotes the variance calculated under the assumption that the null hypothesis is true. MQLS had an improved mean model $\hat{p}_1$, which takes into account the fact that affected individuals who have affected relatives have a higher expected frequency of the alleles that increase susceptibility for a genetic trait than do individuals who do not have affected relatives. Thornton and McPeek [2007] have shown that MQLS generally has more power than WQLS. WQLS and MQLS are robust to the sampling design because they are based on the retrospective model. However, as discussed in Chapter 2, the retrospective models cannot incorporate covariates which may affect the binary trait.

4.3. GENERALIZED LINEAR MIXED MODEL FOR A BINARY TRAIT

Let $Y$ be the binary trait vector of $n$ individuals from $m$ pedigrees and $g$ be the genetic score of the SNP that we want to test. Denote by $X$ a $(q+1)$-variate covariate matrix including a unit vector for the mean parameter. The GLMM for testing marker association is given by

$$h(\mu) = g\beta_g + X\beta_x + Zb, \tag{4-1}$$

where $\mu = E(Y)$, $b \sim N(0, \sigma^2 I)$ and $h$ is a link function. For the binary trait, we can consider a logit or a probit function as a link function.

The correlation structure between the individuals is parameterized with the random effect, $Zb$. Let

$$K = \begin{pmatrix} 1+h_1 & 2\varphi_{12} & \cdots & 2\varphi_{1n} \\ 2\varphi_{21} & 1+h_2 & \cdots & 2\varphi_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 2\varphi_{n1} & 2\varphi_{n2} & \cdots & 1+h_n \end{pmatrix},$$

where $h_i$ is the inbreeding coefficient of individual $i$ and $\varphi_{ij}$ is the kinship coefficient between individuals $i$ and $j$. $K$ will be a block diagonal matrix, blocked by pedigree. The polygenic effect can be modeled by setting $Z$ as the $n \times n$ matrix satisfying $ZZ^T = K$, where $Z^T$ denotes the transpose of $Z$. $Z$ can be obtained by Cholesky decomposition. A looped pedigree can be modeled with appropriate values for $h_i$ and $\varphi_{ij}$. For pedigrees without any loops from an outbred population, all $h_i$ are 0, and so all the diagonal terms of $K$ become 1. Elston et al. [1992] described additional variance components other than one for a polygenic effect and these can also be incorporated in a GLMM. See APPENDIX F for further details.
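To make the construction of Z concrete, the sketch below builds K for one outbred nuclear family (father, mother, two offspring) and obtains Z by a textbook Cholesky factorization written with the standard library only. The family layout and the helper name `cholesky` are illustrative assumptions; the kinship coefficients are the standard values (1/4 for parent-offspring and full-sib pairs, 0 for unrelated spouses), so the off-diagonal entries of K are twice those values.

```python
import math


def cholesky(a: list[list[float]]) -> list[list[float]]:
    """Lower-triangular L with L L^T = a, for a symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - partial)
            else:
                low[i][j] = (a[i][j] - partial) / low[j][j]
    return low


# Order: father, mother, offspring 1, offspring 2.
# Diagonal: 1 + h_i with h_i = 0 (no inbreeding).
# Off-diagonal: 2 * kinship coefficient.
K = [
    [1.0, 0.0, 0.5, 0.5],
    [0.0, 1.0, 0.5, 0.5],
    [0.5, 0.5, 1.0, 0.5],
    [0.5, 0.5, 0.5, 1.0],
]

Z = cholesky(K)

# Verify Z Z^T reproduces K.
ZZt = [
    [sum(Z[i][k] * Z[j][k] for k in range(4)) for j in range(4)]
    for i in range(4)
]
for row in Z:
    print([round(x, 4) for x in row])
```

Because K is block diagonal by pedigree, the factorization can be carried out one pedigree at a time and the blocks of Z assembled afterwards.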

4.4. METHOD

The likelihood for the GLMM does not have a closed form. Therefore, the usual estimation procedure for the mixed model cannot be used and other methods, such as the penalized quasi-likelihood (PQL) method [Breslow and Clayton 1993], have been proposed. For hypothesis testing the Wald test was proposed [Pinheiro and Bates 2000]. However, because this requires fitting the alternative model for each test marker, the analysis time will be very long in the case of genome-wide SNP data.

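The aim is to develop the association test and investigate its performance, together with other tests. Denote $\theta = (\beta_g, \beta_x^T, \sigma^2)^T$. Let $\mu_i(\theta)$ be the conditional mean, given the random effect $b$, of an individual $i$, so that $h(\mu_i(\theta)) = g_i\beta_g + x_i^T\beta_x + Z_i b$. Then the QLS with a GLMM is written as

$$U(\theta) = g^T V(\theta)^{-1/2}\,(Y - \mu(\theta)),$$

where $V(\theta)$ is the variance matrix of $Y - \mu(\theta)$ and is given by a diagonal matrix with diagonal terms $\mu_i(1-\mu_i)$. Let $\tilde{\theta}$ denote the estimate under the constraints of $H_0$, that is, $\beta_g = 0$. Let $r = (r_i)$ be a vector of Pearson residuals of the GLMM, given by

$$r_i = \frac{Y_i - \tilde{\mu}_i}{\sqrt{\tilde{\mu}_i(1-\tilde{\mu}_i)}},$$

where $\tilde{\mu}_i = \mu_i(\tilde{\theta})$. Then, it can be shown that $U(\tilde{\theta}) = \sum_i r_i g_i$.

Assuming the $r_i$ are independent and identically distributed, the variance of $U(\tilde{\theta})$ can be written as

$$V(\tilde{\theta}) = \widehat{\mathrm{Var}}(r) \sum_i (g_i - \bar{g})^2, \qquad \widehat{\mathrm{Var}}(r) = (n-1)^{-1}\sum_i (r_i - \bar{r})^2 .$$

Now, the quasi-likelihood score test statistic with this variance estimate is given by

$$S = \frac{U(\tilde{\theta})^2}{V(\tilde{\theta})}. \tag{4-2}$$

Interestingly, it can be shown that this can be written as $(n-1)R^2$, where $R$ is the sample correlation coefficient between the residuals of the null model and the explanatory variable, $g$, that we test. Therefore, we can also view this test as a correlation test between Pearson residuals and test variables. We name this test GLMM-QLS1.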
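The correlation form of the statistic can be illustrated numerically. The sketch below is not the fitting procedure itself: it assumes the null-model fitted means are already available (the values used here are hypothetical), takes the genotype score to be centered at its sample mean, and uses the sample variance of the Pearson residuals as the variance estimate. Under those assumptions the score statistic and (n-1)R² coincide.

```python
import math


def pearson_residuals(y: list[int], mu: list[float]) -> list[float]:
    """Pearson residuals (y - mu) / sqrt(mu (1 - mu)) from null-model means."""
    return [(yi - mi) / math.sqrt(mi * (1.0 - mi)) for yi, mi in zip(y, mu)]


def qls1_statistic(r: list[float], g: list[float]) -> float:
    """Score statistic U^2 / V with centered genotype scores."""
    n = len(r)
    g_bar, r_bar = sum(g) / n, sum(r) / n
    g_centered = [gi - g_bar for gi in g]
    u = sum(ri * gi for ri, gi in zip(r, g_centered))
    var_r = sum((ri - r_bar) ** 2 for ri in r) / (n - 1)
    v = var_r * sum(gi * gi for gi in g_centered)
    return u * u / v


def sample_correlation(a: list[float], b: list[float]) -> float:
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    sab = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    saa = sum((x - a_bar) ** 2 for x in a)
    sbb = sum((y - b_bar) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


# Hypothetical toy data: binary traits, null-model fitted means, genotype scores.
y = [1, 0, 1, 1, 0, 0, 1, 0]
mu_null = [0.6, 0.4, 0.5, 0.7, 0.3, 0.5, 0.6, 0.4]
g = [2.0, 0.0, 1.0, 2.0, 0.0, 1.0, 1.0, 0.0]

r = pearson_residuals(y, mu_null)
s_score = qls1_statistic(r, g)
s_corr = (len(y) - 1) * sample_correlation(r, g) ** 2
print(s_score, s_corr)
```

The practical point is that only the residuals from the single null fit are needed; each additional marker costs one correlation computation.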

However, GLMM-QLS1 may be invalidated for many reasons. Among them, two major reasons can be considered. The first is bias in estimation. Estimation in a GLMM is challenging and biased estimation is common in GLMMs. The second is a failure to model an appropriate correlation structure. Even though we can allow a more complex correlation structure, as in Elston et al. [1992] (APPENDIX F), there could be a more complex familial correlation structure. To make matters worse, ascertainment sampling results in truncated data, which have a correlation structure different from that in the general population.

When data consist of pedigrees all of the same structure, a robust variance may be considered to adjust for the problems mentioned above. Let $G_k$ and $R_k$ be the column vectors of centered genotype variables and Pearson residuals for the $k$-th pedigree, respectively. Then the variance estimate is given by

$$V_{\mathrm{robust}} = \sum_{k=1}^{m} G_k^T R_k R_k^T G_k .$$

A new test statistic can be constructed by replacing $V$ in (4-2) by $V_{\mathrm{robust}}$. We name this test GLMM-QLS2.
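A minimal sketch of the robust-variance version follows. Each pedigree contributes one scalar score (the inner product of its centered genotypes and its Pearson residuals), and the variance estimate is the sum of the squared pedigree scores. Centering the genotypes at the overall sample mean is an assumption of this sketch, and the residuals and genotypes are hypothetical toy values.

```python
def qls2_statistic(residuals_by_pedigree: list[list[float]],
                   genotypes_by_pedigree: list[list[float]]) -> float:
    """Score statistic with a pedigree-level empirical variance estimate."""
    all_genotypes = [g for ped in genotypes_by_pedigree for g in ped]
    g_bar = sum(all_genotypes) / len(all_genotypes)

    # One scalar score per pedigree: centered genotypes times Pearson residuals.
    pedigree_scores = [
        sum(r * (g - g_bar) for r, g in zip(r_k, g_k))
        for r_k, g_k in zip(residuals_by_pedigree, genotypes_by_pedigree)
    ]

    u = sum(pedigree_scores)
    # Each term G_k^T R_k R_k^T G_k is the square of the pedigree score.
    v_robust = sum(s * s for s in pedigree_scores)
    return u * u / v_robust


# Hypothetical toy input: three pedigrees of two members each.
toy_residuals = [[1.0, -1.0], [1.0, 1.0], [-1.0, 1.0]]
toy_genotypes = [[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

s_robust = qls2_statistic(toy_residuals, toy_genotypes)
print(s_robust)
```

Since the variance is estimated from between-pedigree variation in the scores, it does not rely on the within-pedigree correlation model being correct, which is the motivation given above.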

4.5. SIMULATION RESULTS

Validity and power of GLMM-QLS1 and GLMM-QLS2 were assessed and compared with WQLS and MQLS in two data frameworks. The first framework considered sampling without ascertainment. Data of a test marker were generated according to Mendelian inheritance. The minor allele frequencies of the test markers were set to 0.4 and 0.1 to represent common and rare variants. Then, binary traits were generated according to a GLMM with a probit link. Note that this link function is different from the one used below for the GLMM-QLS tests. One hundred nuclear families were generated for each replication, each consisting of a father, a mother and two offspring. Random nuclear families were sampled regardless of their affection status. A total of 10,000 replicates were simulated for validity and power assessment. The size of the variance component for the polygenic effect, $\sigma^2$, was set to 1, which makes the heritability of this latent variable 0.5. No covariate was considered except the one for the mean parameter, and the mean parameter was set to -0.5, which makes the baseline risk about 30.9%. Note that random sampling is reasonable only for a trait with high prevalence. For the GLMM-QLS1 and GLMM-QLS2 tests, a logit link was used and the restricted penalized quasi-likelihood method (REPQL) [Breslow and Clayton 1993] was utilized for fitting the model. The required parameter k of MQLS was set to the sample mean of the trait for each set of simulated data.

Marker frequency   α        GLMM-QLS1   GLMM-QLS2   MQLS    WQLS
0.4                0.05     0.054       0.047       0.047   0.048
0.4                0.01     0.013       0.010       0.010   0.010
0.1                0.05     0.055       0.048       0.048   0.050
0.1                0.01     0.012       0.008       0.008   0.010

Table 4-I Empirical Type I errors with Random Samples

Marker frequency   α        GLMM-QLS1   GLMM-QLS2   MQLS    WQLS
0.4                0.05     0.949       0.942       0.935   0.850
0.4                0.01     0.858       0.839       0.824   0.661
0.4                0.0001   0.411       0.352       0.338   0.174
0.1                0.05     0.628       0.602       0.589   0.463
0.1                0.01     0.395       0.364       0.356   0.237
0.1                0.0001   0.059       0.046       0.045   0.022

Table 4-II Empirical Power with Random Samples
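The simulation settings can be checked, and the genotype-generation step sketched, with the standard library alone. Several assumptions are made here and should be read as such: the baseline risk is taken to be the probit-model risk for an individual with random effect b = 0 and no risk allele; the mean parameter for the random-sample framework is taken as -0.5, the value that reproduces the stated 30.9%; the residual variance on the latent scale is 1, as implied by the probit link; and founders' alleles are drawn independently (Hardy-Weinberg proportions). Function names are hypothetical.

```python
import math
import random


def std_normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def baseline_risk(mean_parameter: float) -> float:
    """Risk under a probit link when b = 0 and the genotype score is 0."""
    return std_normal_cdf(mean_parameter)


def latent_heritability(sigma2: float, residual_variance: float = 1.0) -> float:
    """Share of latent-scale variance due to the polygenic component."""
    return sigma2 / (sigma2 + residual_variance)


def founder_genotype(maf: float, rng: random.Random) -> int:
    """Minor allele count of a founder: two independent allele draws."""
    return int(rng.random() < maf) + int(rng.random() < maf)


def offspring_genotype(father: int, mother: int, rng: random.Random) -> int:
    """Mendelian transmission: each parent passes the minor allele
    with probability (parental minor allele count) / 2."""
    return int(rng.random() < father / 2) + int(rng.random() < mother / 2)


risk_random = baseline_risk(-0.5)       # random-sample framework (assumed -0.5)
risk_ascertained = baseline_risk(-1.0)  # ascertained-sample framework
h2 = latent_heritability(1.0)

rng = random.Random(2009)
father, mother = founder_genotype(0.4, rng), founder_genotype(0.4, rng)
children = [offspring_genotype(father, mother, rng) for _ in range(2)]
print(round(risk_random, 3), round(risk_ascertained, 3), h2, father, mother, children)
```

The two risks agree with the values quoted for the two frameworks, which supports reading the baseline risk as conditional on a zero random effect rather than as the population prevalence.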

The results for validity assessment with the random samples are shown in Table 4-I. While none of the four tests showed severe deviation from the nominal significance level, GLMM-QLS1 was slightly anti-conservative. To compare power, a test marker was set to be an associated SNP with additive effect $\beta_g = 0.5$, with all other parameters set as before. The results for power assessment with the random samples are shown in Table 4-II. It shows that GLMM-QLS1, GLMM-QLS2 and MQLS had similar power, while WQLS had the smallest power. GLMM-QLS1 had the most power for both allele frequencies, but it should be taken into account that it was anti-conservative compared to the other three tests. GLMM-QLS2, which used the robust variance, did not lose much power compared to GLMM-QLS1 and still had more power than MQLS and WQLS.

Next, we considered a framework with ascertainment, for which nuclear families with at least one affected individual are collected. Among the random nuclear families, only those that have at least one affected individual were kept in the sample. We generated data until we had 100 ascertained families. The other parameters were set up in the same way as in the previous simulation except for the mean parameter, which was set to -1, making the baseline risk about 15.9%.

The results for the ascertained samples are shown in Tables 4-III and 4-IV. While the empirical Type I error of GLMM-QLS1 was conservative for both allele frequencies, those of GLMM-QLS2, MQLS and WQLS were not significantly different from the nominal significance levels. Because GLMM-QLS1 did not show valid empirical Type I error, it was excluded from the power study. The power of GLMM-QLS2 was almost the same as that of MQLS and greater than that of WQLS, for both allele frequencies.
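One simple way to judge whether an empirical Type I error differs from the nominal level is to compare it with the range expected from Monte Carlo error alone. The sketch below uses a normal approximation to the binomial distribution with 10,000 replicates; this particular check is an illustration and is not necessarily the criterion used in the study.

```python
import math


def monte_carlo_interval(alpha: float, n_replicates: int,
                         z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% range for an empirical rejection rate when the
    true rejection probability equals the nominal level alpha."""
    se = math.sqrt(alpha * (1.0 - alpha) / n_replicates)
    return alpha - z * se, alpha + z * se


lower, upper = monte_carlo_interval(0.05, 10_000)

# Ascertained samples, marker frequency 0.4, nominal level 0.05 (Table 4-III).
glmm_qls1_rate = 0.038
glmm_qls2_rate = 0.050

qls1_conservative = glmm_qls1_rate < lower
qls2_within_range = lower <= glmm_qls2_rate <= upper
print(f"expected range: [{lower:.4f}, {upper:.4f}]")
```

Under this approximation a rate of 0.038 lies well below the expected range, consistent with describing GLMM-QLS1 as conservative in the ascertained samples.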

Marker frequency   α        GLMM-QLS1   GLMM-QLS2   MQLS    WQLS
0.4                0.05     0.038       0.050       0.050   0.050
0.4                0.01     0.006       0.009       0.010   0.011
0.1                0.05     0.038       0.051       0.053   0.053
0.1                0.01     0.007       0.010       0.010   0.011

Table 4-III Empirical Type I errors with Ascertained Samples

Marker frequency   α        GLMM-QLS2   MQLS    WQLS
0.4                0.05     0.934       0.932   0.849
0.4                0.01     0.803       0.801   0.665
0.4                0.0001   0.317       0.316   0.180
0.1                0.05     0.602       0.605   0.485
0.1                0.01     0.358       0.362   0.247
0.1                0.0001   0.048       0.050   0.018

Table 4-IV Empirical Power with Ascertained Samples

The analysis time for a genome-wide association study can be estimated from the running time of the simulations. The analysis times of most methods are mainly determined by the number of SNPs and increase linearly with it. On a computer with an Intel Xeon 2.8 GHz CPU and 1 GB RAM, GLMM-QLS1 or GLMM-QLS2 with 500K SNPs was expected to take less than one and a half hours. In contrast, the other methods, which require fitting the alternative GLMM for each marker, are expected to take more than 15 days. The expected analysis time of GLMM-QLS1 or GLMM-QLS2 is also much shorter than the estimated running time of MQLS and WQLS based on the report of Thornton and McPeek [2007] (about 30 minutes for 10K SNPs; therefore about 25 hours are expected for 500K SNPs).

4.6. DISCUSSION

Fast association tests for related data with binary traits were developed using GLMM. GLMM-QLS1 can be used with randomly sampled pedigree data, where each pedigree can have any structure. However, GLMM-QLS1 with REPQL did not strictly maintain the nominal Type I error. It is not clear how this can be remedied, but better estimation methods in GLMM may be considered. In fact, REPQL is known to introduce a severe bias in parameter estimates. Therefore, to reduce the effect of biased estimation, more accurate estimation methods such as Laplacian or adaptive Gaussian quadrature methods [Pinheiro and Bates 2000] can be considered.

For data that consisted of pedigrees with the same structure, GLMM-QLS1 was modified to GLMM-QLS2 using a variance estimate appropriate for pedigree data. GLMM-QLS2 maintained valid Type I error rates in both random and ascertained pedigree samples and had power comparable to that of MQLS.

GLMM-QLS2 may be valid only for the two sampling designs that have been used in the simulations. More serious ascertainment sampling may be possible in association studies with related data; therefore, an additional evaluation and modification may be required for those types of data.

Based on the prospective model, the GLMM-QLS1 or GLMM-QLS2 test can incorporate any covariates, including ones for population stratification. It is a standard procedure to test the significance of the potential covariates and select only the significant ones. GLMM-QLS1 or GLMM-QLS2 can be used for this significance test using the same test procedure. It can also be extended to other types of genetic variants such as multi-marker genotype or haplotype variants. Even though it has been investigated here only for binary data, the test can be extended to other traits with non-normal distributions using other corresponding link functions.


It is common in genome-wide data to have differentially missing SNP data for each individual. There are two possible methods for handling this. The first method is simply to exclude those individuals who have a missing genotype at the marker being tested. This method keeps the analysis as fast as with complete data. The second method uses the expected genotype values for the individuals, derived from the pedigree information, as introduced in Chen and Abecasis [2007], so that all individuals can be kept for analysis.

The score test or QLS test with related data has been shown to reduce the analysis time dramatically. However, other estimation methods can utilize the computational efficiency of the QLS as well. For example, the explicit likelihood corrected for ascertainment is written in a form involving integrals. This likelihood can be maximized using the likelihood computation method shown in Pawitan et al. [2004] and followed by a score test procedure.


Chapter 5 Conclusions and Areas for Future Study

In this dissertation, genetic association tests for both unrelated data and related data have been studied. In this chapter, the findings are summarized and areas for future study are proposed for both unrelated and related data.

5.1. GENETIC ASSOCIATION TESTS FOR UNRELATED DATA

In Chapter 2, genotype association models for case-control data were developed. Many properties of these case-control analyses could be explained using these association models. Various association tests were proposed based on the association model. These tests are appropriate for detecting more general disease models located on more complex genome structures.

First, the general association tests proposed in Chapter 2 can be easily applied to any type of unrelated case-control data. There are multiple publicly available databases with case-control data such as the Wellcome Trust Case-Control Consortium [2007] data, or other case-control data in dbGaP (http://www.ncbi.nlm.nih.gov/gap). If tests such as the genotypic test and Test 2-3 are applied to these data, the results would provide additional information regarding disease susceptibility loci in these data.

Investigating the true disease model can also be an important area for future studies. Most genetic association studies have assumed the additive disease model and, at most, the dominant and recessive disease models. In other words, they have excluded the plausibility of the heterozygote (dis)advantage model without its assessment. Application of the genotypic test to various data with subsequent detailed analysis of the results would help to understand the true disease model.

The term genotype-based method has often referred to a method that utilizes only linear terms, and only this method was discussed and compared with the haplotype-based method [Chapman, et al. 2003]. However, this dissertation showed that genotype-based methods with an interaction term can be comparable to a haplotype-based method for a two-marker region. For a multi-marker region, new genotype-based methods could be set up in the same way with linear terms and the interaction terms of consecutive markers. This would provide the same ease of analysis. However, the appropriateness of this method for a multi-marker region needs to be investigated.

For a given set of data, different strategies can be adopted for an association test: how to partition each region, and what terms (linear, quadratic and interaction terms) should be included for an analysis of that given region. There may exist an almost unlimited number of possible analysis strategies. The analysis strategy should be determined with consideration of the following factors: the resolution aimed for in the study, the number of variants in a given region, and the LD structure. Developing a general guide for the analysis strategy would be a good research topic.


5.2. GENETIC ASSOCIATION TESTS FOR RELATED DATA

In Chapter 4, new association tests for a genome-wide association study, GLMM-QLS, were proposed. Based on the prospective model, the GLMM method is flexible enough that it can be extended to genetic variants other than SNPs, such as multi-marker variants or copy number variants. These types of genetic variants have been shown to be more appropriate for some complex diseases. The prospective model also allows incorporating covariates, including ones for population stratification. Controlling for the known confounders in each disease, the method can decrease Type I error and increase power.

In addition, the method is computationally fast enough for genome-wide data. As genotyping costs decrease, it will be common in the near future to have data with millions of SNPs. Even from a limited number of markers, much greater numbers of test variants can be considered by combining them together in different ways. Therefore, a genetic association test needs to work fast, even with genotype data of huge dimensions, which is achieved by the proposed method.

The proposed method was shown to be valid for data without ascertainment, or with only the mild ascertainment sampling that was used for the simulations. However, various ascertainment sampling designs are used for binary traits, depending on the characteristics of the disease. For example, for diseases with an early or late age of onset, it is common to sample such that all probands are at a specific position in their pedigrees, such as an offspring or a parent. Therefore, developing a method that can handle this type of data while maintaining flexibility and efficiency could be the next step of this study.


APPENDIX

APPENDIX A

Computation of $E(D \mid X, Y)$, $E(D^2 \mid X, Y)$, $E(D \mid X)$ and $E(D^2 \mid X)$

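There could be eight haplotypes $A_i D_j B_k$ ($i = 1, 2$; $j = 1, 2$; $k = 1, 2$) in a three-locus region. Let $h_{ijk}$ denote the frequency of the haplotype $A_i D_j B_k$. Given pairwise LD, $\Delta_{AD}$, $\Delta_{BD}$, $\Delta_{AB}$, three-locus LD, $\Delta_{ADB}$, and allele frequencies $p_A$, $p_D$, $p_B$, $q_A = 1 - p_A$, $q_D = 1 - p_D$, $q_B = 1 - p_B$, the haplotype frequencies are given by

$$
\begin{aligned}
h_{111} &= p_A p_D p_B + p_A \Delta_{BD} + p_D \Delta_{AB} + p_B \Delta_{AD} + \Delta_{ADB} \\
h_{112} &= p_A p_D q_B - p_A \Delta_{BD} - p_D \Delta_{AB} + q_B \Delta_{AD} - \Delta_{ADB} \\
h_{121} &= p_A q_D p_B - p_A \Delta_{BD} + q_D \Delta_{AB} - p_B \Delta_{AD} - \Delta_{ADB} \\
h_{122} &= p_A q_D q_B + p_A \Delta_{BD} - q_D \Delta_{AB} - q_B \Delta_{AD} + \Delta_{ADB} \\
h_{211} &= q_A p_D p_B + q_A \Delta_{BD} - p_D \Delta_{AB} - p_B \Delta_{AD} - \Delta_{ADB} \\
h_{212} &= q_A p_D q_B - q_A \Delta_{BD} + p_D \Delta_{AB} - q_B \Delta_{AD} + \Delta_{ADB} \\
h_{221} &= q_A q_D p_B - q_A \Delta_{BD} - q_D \Delta_{AB} + p_B \Delta_{AD} + \Delta_{ADB} \\
h_{222} &= q_A q_D q_B + q_A \Delta_{BD} + q_D \Delta_{AB} + q_B \Delta_{AD} - \Delta_{ADB}.
\end{aligned}
$$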

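Let $g_{i,j,k}$ denote the frequency of the genotype coded as $X = i$, $D = j$, $Y = k$. Then the genotype frequencies of the three loci when in HWE can be computed as

$$g_{i,j,k} = \sum h_H h_{H'},$$

where $H$ and $H'$ are two haplotypes with frequencies $h_H$ and $h_{H'}$, and the summation is over all haplotype pairs consistent with the genotype $X = i$, $D = j$, $Y = k$. Two-marker genotype frequencies are obtained by $g_{i,k} = \sum_j g_{i,j,k}$. With this notation,

$$E(D \mid X = i, Y = k) = \frac{\sum_j j\, g_{i,j,k}}{g_{i,k}} \quad \text{and} \quad E(D^2 \mid X = i, Y = k) = \frac{\sum_j j^2 g_{i,j,k}}{g_{i,k}}.$$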

For single marker data, ED|X ∑Y ED|X, Y, and ED |X

∑Y ED |X, Y, . It follows that ED|X and ED |X a X abXc,

where ,b 12 and

AD AD 12 AD 2 .
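The computations of this appendix can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used for the dissertation: the function names are hypothetical, the LD coefficients are taken to be defined with respect to the haplotype carrying allele 1 at every locus, genotypes are coded as the number of copies of allele 1 minus one (-1, 0, 1), and HWE is assumed.

```python
from itertools import product


def sign(*alleles):
    """+1 if an even number of the given alleles is allele 2, otherwise -1."""
    return -1 if sum(a == 2 for a in alleles) % 2 else 1


def haplotype_freqs(p_a, p_d, p_b, d_ad, d_bd, d_ab, d_adb):
    """Frequencies of the eight haplotypes A_i D_j B_k, keyed by (i, j, k)."""
    h = {}
    for i, j, k in product((1, 2), repeat=3):
        fa = p_a if i == 1 else 1 - p_a
        fd = p_d if j == 1 else 1 - p_d
        fb = p_b if k == 1 else 1 - p_b
        h[(i, j, k)] = (fa * fd * fb
                        + sign(j, k) * fa * d_bd
                        + sign(i, k) * fd * d_ab
                        + sign(i, j) * fb * d_ad
                        + sign(i, j, k) * d_adb)
    return h


def genotype_freqs(h):
    """Three-locus genotype frequencies under HWE, keyed by (x, d, y)."""
    g = {}
    # Ordered haplotype pairs, so each heterozygous pair is counted twice.
    for (hap1, f1), (hap2, f2) in product(h.items(), repeat=2):
        code = tuple((a == 1) + (b == 1) - 1 for a, b in zip(hap1, hap2))
        g[code] = g.get(code, 0.0) + f1 * f2
    return g


def cond_moment(g, power, x, y=None):
    """E(D**power | X = x, Y = y), or E(D**power | X = x) when y is None."""
    num = den = 0.0
    for (gx, gd, gy), f in g.items():
        if gx == x and (y is None or gy == y):
            num += gd ** power * f
            den += f
    return num / den


h = haplotype_freqs(0.3, 0.2, 0.4, d_ad=0.02, d_bd=0.01, d_ab=0.03, d_adb=0.001)
g = genotype_freqs(h)
slope = cond_moment(g, 1, 1) - cond_moment(g, 1, 0)
```

Under these assumptions the single-marker conditional mean is linear in X, with slope equal to the pairwise LD between A and D divided by the product of the two allele frequencies at A, which the example values can be used to check.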

APPENDIX B

Proof that if , then | |

If 12 0, the ratio of the coefficient of the linear term to that of the quadratic term of the true association model (2-7) becomes 1 2. Therefore, the true

association model (2-7) can be then rewritten as µX cX 12X c

where c and c are constants. Note that KPaffected µ1 µ021

µ11 21c c.

Then,

µX| EX|affected PX1|affected PX1|affected


µ1 µ11 1 1 2 12 1 c c 2 1K Paffected K K

2 1 µX

Therefore, $\mu_{X \mid \text{affected}} = \mu_{X \mid \text{unaffected}} = \mu_X$.
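The result of this appendix can be checked numerically with exact arithmetic. The sketch below rests on assumptions that are not taken from the text: X is coded -1, 0, 1 under HWE, p is the frequency of the allele whose homozygote is coded 1, and the association model is written as c1 (X^2 + (1 - 2p) X) + c0, so that the ratio of the linear to the quadratic coefficient is 1 - 2p. The function name and the parameter values are hypothetical.

```python
from fractions import Fraction


def mean_x_by_status(p, c1, c0):
    """Return the mean of X in the population, in cases and in controls."""
    geno = {1: p * p, 0: 2 * p * (1 - p), -1: (1 - p) ** 2}   # HWE frequencies
    # Penetrance with linear-to-quadratic coefficient ratio equal to 1 - 2p.
    mu = {x: c1 * (x * x + (1 - 2 * p) * x) + c0 for x in geno}
    k = sum(mu[x] * geno[x] for x in geno)                    # prevalence
    mean_all = sum(x * geno[x] for x in geno)
    mean_aff = sum(x * mu[x] * geno[x] for x in geno) / k
    mean_unaff = sum(x * (1 - mu[x]) * geno[x] for x in geno) / (1 - k)
    return mean_all, mean_aff, mean_unaff


means = mean_x_by_status(Fraction(3, 10), Fraction(1, 10), Fraction(1, 20))
```

For these values the three means coincide, so a test that compares only the mean of X between cases and controls has no power under this condition, even though the penetrances differ by genotype.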

APPENDIX C

Parameters in Cases and Controls

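Genotype frequencies in cases and controls are derived by

$$P(x, y \mid \text{affected}) = \frac{P(\text{affected} \mid x, y)\, g_{x,y}}{P(\text{affected})} = \frac{g_{x,y}\left[\gamma + \gamma_D E(D \mid x, y) + \gamma_{D^2} E(D^2 \mid x, y)\right]}{K},$$

$$P(x, y \mid \text{unaffected}) = \frac{P(\text{unaffected} \mid x, y)\, g_{x,y}}{P(\text{unaffected})} = \frac{g_{x,y}\left[1 - \gamma - \gamma_D E(D \mid x, y) - \gamma_{D^2} E(D^2 \mid x, y)\right]}{1 - K},$$

in order to specify the multinomial distributions in cases and controls. Based on these parameters, mean (µ), variance (σ²) and covariance (σ) of each variable can be computed by the following relations: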

, , ,, , ,,

, , ,, , ,,


, , ,, , ,,

, , ,, , ,,

, , , ,,

, , , ,,

, , , ,,

, , , ,,

, ,

, ,

,

, , , ,

, , , ,

, , , ,

, , , ,

, , , .
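The step from population genotype frequencies to the case and control distributions, and then to their moments, can be sketched as below. The function names, the example genotype frequencies and the example penetrances are hypothetical and are chosen only so that the arithmetic is exact; they are not values from the dissertation.

```python
from fractions import Fraction


def split_by_status(g, penetrance):
    """Apply Bayes' rule to obtain genotype distributions by disease status.

    g maps (x, y) to the population genotype frequency and penetrance maps
    (x, y) to P(affected | x, y).
    """
    k = sum(penetrance[xy] * f for xy, f in g.items())        # prevalence K
    cases = {xy: penetrance[xy] * f / k for xy, f in g.items()}
    controls = {xy: (1 - penetrance[xy]) * f / (1 - k) for xy, f in g.items()}
    return k, cases, controls


def moments(dist):
    """Means, variances and covariance of X and Y under one distribution."""
    mean_x = sum(x * f for (x, y), f in dist.items())
    mean_y = sum(y * f for (x, y), f in dist.items())
    var_x = sum(x * x * f for (x, y), f in dist.items()) - mean_x ** 2
    var_y = sum(y * y * f for (x, y), f in dist.items()) - mean_y ** 2
    cov_xy = sum(x * y * f for (x, y), f in dist.items()) - mean_x * mean_y
    return {"mean_x": mean_x, "mean_y": mean_y,
            "var_x": var_x, "var_y": var_y, "cov_xy": cov_xy}


# Two unlinked markers in HWE with allele frequency 1/2, coded -1, 0, 1.
hwe = {-1: Fraction(1, 4), 0: Fraction(1, 2), 1: Fraction(1, 4)}
g = {(x, y): hwe[x] * hwe[y] for x in hwe for y in hwe}
# Penetrance depends on the first marker only (illustrative choice).
pen = {(x, y): Fraction(1, 10) + Fraction(1, 20) * (x + 1) for (x, y) in g}
k, cases, controls = split_by_status(g, pen)
case_moments = moments(cases)
```

The same moments function applies to products or squares of the coded genotypes if those are added as further variables.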


APPENDIX D

Noncentrality Parameter of the Haplotype-Based Test

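For two-marker data, there are four haplotypes, $A_1B_1$, $A_1B_2$, $A_2B_1$ and $A_2B_2$, and we denote their frequencies in the general population respectively by $h_{11}$, $h_{12}$, $h_{21}$ and $h_{22}$. Given LD parameters and allele frequencies, three-locus haplotype frequencies are given under HWE as in APPENDIX A. By summing the frequencies of the two three-locus haplotypes that are consistent with each two-locus haplotype, $h_{11}$, $h_{12}$, $h_{21}$ and $h_{22}$ can be computed (e.g. the frequency of $A_1B_1$ is the sum of the frequencies of $A_1D_1B_1$ and $A_1D_2B_1$).

The haplotype frequencies in cases are given by

$$h_{ij \mid \text{aff}} = \frac{\sum_{i',j'} P(\text{affected} \mid H = A_iB_j, H' = A_{i'}B_{j'})\, h_{ij}\, h_{i'j'}}{K},$$

where $P(\text{affected} \mid H = A_iB_j, H' = A_{i'}B_{j'})$ can be obtained by (2-8) and the summation is over all four haplotypes $A_{i'}B_{j'}$. Similarly, the haplotype frequencies in controls are given by

$$h_{ij \mid \text{unaff}} = \frac{\sum_{i',j'} \left[1 - P(\text{affected} \mid H = A_iB_j, H' = A_{i'}B_{j'})\right] h_{ij}\, h_{i'j'}}{1 - K}.$$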

The noncentrality parameter for the haplotype-based test is

λHAP 2 Σ| Σ| ,

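where $h_{\mid \text{aff}} = (h_{11 \mid \text{aff}}, h_{12 \mid \text{aff}}, h_{21 \mid \text{aff}})^T$ and $h_{\mid \text{unaff}} = (h_{11 \mid \text{unaff}}, h_{12 \mid \text{unaff}}, h_{21 \mid \text{unaff}})^T$ are the vectors of the frequencies of the first three haplotypes in cases and controls,

$$
\Sigma_{\mid \text{aff}} =
\begin{pmatrix}
h_{11 \mid \text{aff}}(1 - h_{11 \mid \text{aff}}) & -h_{11 \mid \text{aff}} h_{12 \mid \text{aff}} & -h_{11 \mid \text{aff}} h_{21 \mid \text{aff}} \\
-h_{11 \mid \text{aff}} h_{12 \mid \text{aff}} & h_{12 \mid \text{aff}}(1 - h_{12 \mid \text{aff}}) & -h_{12 \mid \text{aff}} h_{21 \mid \text{aff}} \\
-h_{11 \mid \text{aff}} h_{21 \mid \text{aff}} & -h_{12 \mid \text{aff}} h_{21 \mid \text{aff}} & h_{21 \mid \text{aff}}(1 - h_{21 \mid \text{aff}})
\end{pmatrix}
$$

and

$$
\Sigma_{\mid \text{unaff}} =
\begin{pmatrix}
h_{11 \mid \text{unaff}}(1 - h_{11 \mid \text{unaff}}) & -h_{11 \mid \text{unaff}} h_{12 \mid \text{unaff}} & -h_{11 \mid \text{unaff}} h_{21 \mid \text{unaff}} \\
-h_{11 \mid \text{unaff}} h_{12 \mid \text{unaff}} & h_{12 \mid \text{unaff}}(1 - h_{12 \mid \text{unaff}}) & -h_{12 \mid \text{unaff}} h_{21 \mid \text{unaff}} \\
-h_{11 \mid \text{unaff}} h_{21 \mid \text{unaff}} & -h_{12 \mid \text{unaff}} h_{21 \mid \text{unaff}} & h_{21 \mid \text{unaff}}(1 - h_{21 \mid \text{unaff}})
\end{pmatrix}.
$$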
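The ingredients of the noncentrality parameter, namely the haplotype frequencies in cases and controls and their multinomial covariance matrices, can be sketched as follows. The function names and the example penetrances are hypothetical, and the sketch stops short of the noncentrality parameter itself, whose sample-size scaling is not reproduced here.

```python
from fractions import Fraction


def haplotype_freqs_by_status(h, penetrance):
    """Haplotype frequencies in cases and controls.

    h[i] is the population frequency of haplotype i and penetrance[i][j] is
    P(affected | haplotype pair i, j); random pairing (HWE) is assumed.
    """
    n = len(h)
    k = sum(penetrance[i][j] * h[i] * h[j] for i in range(n) for j in range(n))
    cases = [sum(penetrance[i][j] * h[i] * h[j] for j in range(n)) / k
             for i in range(n)]
    controls = [sum((1 - penetrance[i][j]) * h[i] * h[j] for j in range(n)) / (1 - k)
                for i in range(n)]
    return k, cases, controls


def multinomial_cov(freqs):
    """Covariance matrix of a single multinomial draw, last category dropped."""
    m = len(freqs) - 1
    return [[freqs[r] * (1 - freqs[r]) if r == c else -freqs[r] * freqs[c]
             for c in range(m)] for r in range(m)]


# Four equally frequent haplotypes; the first one raises risk additively.
h = [Fraction(1, 4)] * 4
pen = [[Fraction(1, 10) + Fraction(1, 10) * ((i == 0) + (j == 0))
        for j in range(4)] for i in range(4)]
k, h_cases, h_controls = haplotype_freqs_by_status(h, pen)
cov_cases = multinomial_cov(h_cases)
```

Dropping the last haplotype keeps the covariance matrix invertible, since the four frequencies sum to one.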


APPENDIX E

Principal Components Analysis for Population Stratification

The following figure shows the fraction of genetic variance explained by the first 12 principal components.


The following figure shows the Normal Q-Q plots of the principal components in controls. Components that capture only random noise are expected to be Normally distributed, which would produce a straight line in the Q-Q plot. As can be seen, the first and possibly the second component are picking up non-random variation. These two components also separate the CEU, J+C and YRI HapMap populations well (see next figure).


The following plot shows a grid of the scatter plots between the first 5 components. The principal components of case samples and HapMap samples are included in this plot and the 3 HapMap populations are depicted in different colors.


The following table shows the correlations between each of the first 10 principal components and the case-control status. The p-value depicts the significance of adding the principal component to the model that already contains the covariates Gender, UT and DM_Duration (UCLA samples and UCIrvine samples are merged).

principal component    r-squared with case-control    p-value
 1                     0.002515                       0.392
 2                     0.011434                       0.002
 3                     0.000386                       0.881
 4                     0.008698                       0.032
 5                     0.002737                       0.235
 6                     0.000719                       0.522
 7                     0.000007                       0.702
 8                     0.000834                       0.987
 9                     0.000128                       0.841
10                     0.000206                       0.727


APPENDIX F

Mixed Model with Additional Variance Components

Elston et al. [1992] considered additional variance components for a common nuclear family effect (F), a common marital effect (M) and a common sibship effect (S), as well as a polygenic effect (G). This can also be written as a mixed model without loss of generality (Chapters 13 and 26 of Lynch and Walsh [1998]).

In the same way, a GLMM with these additional variance components can be written as

$$h(\mu) = g\beta_g + X\beta_X + Z_G b_G + Z_F b_F + Z_M b_M + Z_S b_S,$$

where $b_k$ ($k = G, F, M, S$) is an n-dimensional random effect following $N(0, \sigma_k^2 I)$ and $Z_k$ is its design matrix. For a quantitative trait, the summed random effect variance is $\sum_k Q_k \sigma_k^2$, where $Q_k = Z_k Z_k^T$ forms a coefficient variance-covariance matrix for the random effect $b_k$. A correlation structure with a constant marginal variance for each individual can be constructed by specifying $Q_k$ or $Z_k$. We have seen that the coefficient matrix for the polygenic component, $Q_G$, is given by K as defined in section 4.3. We will describe how to construct any $Q_k$ other than $Q_G$ using, as an example, $Q_F$, as follows:

Prepare an initial $n \times n$ working matrix $Q_F$ by assigning "1" to the $(i, j)$-th element of $Q_F$, $Q_{F,ij}$, if individuals $i$ and $j$ are in the same cluster. Let $m_F$ denote the maximum number of these F clusters in which any sample individual is included. For example, if every individual is included in at most three nuclear families, then $m_F = 3$. $m_F$ can be alternatively obtained by taking the maximum among the sums of each of the rows of the working matrix $Q_F$. Now, assign "$m_F$" to every diagonal entry and fill in 0 for the remaining entries. As a result, $Q_F$ may take the form

$$
Q_F =
\begin{pmatrix}
m_F & 0 & 1 & 0 & \cdots & 0 \\
0 & m_F & 0 & 0 & \cdots & 0 \\
1 & 0 & m_F & 0 & \cdots & 0 \\
\vdots & & & & \ddots & \vdots
\end{pmatrix},
$$

where the number of unity entries in each row is at most $m_F$.

$Q_M$ and $Q_S$ can also be constructed in the same way. This algorithm will also work for overlapping clusters, including clusters for multiple mates. Z can be obtained by orthogonal-triangular (QR) decomposition of Q.
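The construction described above can be sketched in Python as follows. The function name, the 0-based indexing and the representation of clusters as lists of individuals are assumptions made for illustration; the sketch takes m as the largest number of clusters to which any one individual belongs, and it does not cover the decomposition step that yields Z.

```python
def build_coefficient_matrix(n, clusters):
    """Build the n x n coefficient matrix for one clustering effect.

    clusters is a list of lists of 0-based individual indices; clusters may
    overlap. Returns the matrix (as nested lists) and the value m that is
    placed on the diagonal.
    """
    q = [[0] * n for _ in range(n)]
    membership = [0] * n                  # clusters each individual is in
    for members in clusters:
        for i in members:
            membership[i] += 1
            for j in members:
                if i != j:
                    q[i][j] = 1           # i and j share a cluster
    m = max(membership)
    for i in range(n):
        q[i][i] = m                       # constant marginal coefficient
    return q, m


# Five individuals; individual 3 belongs to two overlapping clusters.
q_f, m_f = build_coefficient_matrix(5, [[0, 2, 3], [3, 4]])
```

When no pair of individuals shares more than one cluster, the matrix built this way is a sum of cluster indicator outer products plus a non-negative diagonal, which is one way to see why a common diagonal value m keeps it positive semidefinite.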


BIBLIOGRAPHY

Agarwal A, Nick HS. 2000. Renal response to tissue injury lessons from heme oxygenase-1 gene ablation and expression. Am Soc Nephrol. p 965-973.

Andersen G, Burgdorf KS, Sparso T, Borch-Johnsen K, Jorgensen T, Hansen T, Pedersen O. 2008. AHSG tag single nucleotide polymorphisms associate with type 2 diabetes and dyslipidemia: studies of metabolic traits in 7,683 white Danish subjects. Diabetes 57(5):1427-32.

Aulchenko YS, de Koning DJ, Haley C. 2007. Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis. Genetics 177(1):577.

Bourgain C, Hoffjan S, Nicolae R, Newman D, Steiner L, Walker K, Reynolds R, Ober C, McPeek MS. 2003. Novel case-control test in a founder population identifies P-selectin as an atopy-susceptibility locus. The American Journal of Human Genetics 73(3):612-626.

Breslow NE, Clayton DG. 1993. Approximate inference in generalized linear mixed models. Journal of the American Statistical Association:9-25.

Burton PR, Tiller KJ, Gurrin LC, Cookson W, Musk AW, Palmer LJ. 1999. Genetic variance components analysis for binary phenotypes using generalized linear mixed models (GLMMs) and Gibbs sampling. Genetic epidemiology 17(2).

Calabrese V, Mancuso C, Sapienza M, Puleo E, Calafato S, Cornelius C, Finocchiaro M, Mangiameli A, Di Mauro M, Stella AM and others. 2007. Oxidative stress and cellular stress response in diabetic nephropathy. Cell Stress Chaperones 12(4):299-306.

Chapman JM, Cooper JD, Todd JA, Clayton DG. 2003. Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power. Hum Hered 56(1-3):18-31.

Chen WM, Abecasis GR. 2007. Family-based association tests for genomewide association scans. The American Journal of Human Genetics 81(5):913-926.

Clayton D, Chapman J, Cooper J. 2004. Use of unphased multilocus genotype data in indirect association studies. Genet Epidemiol 27(4):415-28.

Conti DV, Gauderman WJ. 2004. SNPs, haplotypes, and model selection in a candidate gene region: the SIMPle analysis for multilocus data. Genet Epidemiol 27(4):429-41.

Cordell HJ, Clayton DG. 2002. A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. Am J Hum Genet 70(1):124-41.

Davidson R, MacKinnon JG. 1987. Implicit Alternatives and the Local Power of Test Statistics. Econometrica 55(6):1305-29.

Devlin B, Roeder K, Wasserman L. 2003. Analysis of multilocus models of association. Genet Epidemiol 25(1):36-47.

Elston RC, George VT, Severtson F. 1992. The Elston-Stewart Algorithm for Continuous Genotypes and Environmental Factors. Hum Hered 42:16-27.

Fellay J, Shianna KV, Ge D, Colombo S, Ledergerber B, Weale M, Zhang K, Gumbs C, Castagna A, Cossarizza A and others. 2007. A Whole-Genome Association Study of Major Determinants for Host Control of HIV-1. Science 317(5840):944-947.

Freedman BI, Hicks PJ, Sale MM, Pierson ED, Langefeld CD, Rich SS, Xu J, McDonough C, Janssen B, Yard BA and others. 2007. A leucine repeat in the carnosinase gene CNDP1 is associated with diabetic end-stage renal disease in European Americans. Nephrol Dial Transplant 22(4):1131-5.

Garrido C. 2002. Size matters: of the small HSP27 and its large oligomers. Cell Death Differ 9(5):483-5.

George VT, Elston RC. 1987. Testing the association between polymorphic markers and quantitative traits in pedigrees. Genet Epidemiol 4(3):193-201.

Gherardi R, Belghiti-Deprez D, Hirbec G, Bouche P, Weil B, Lagrue G. 1985. Focal glomerulosclerosis associated with Charcot-Marie-Tooth disease. Nephron 40:357-361.

Hotelling H. 1931. The Generalization of Student's Ratio. The Annals of Mathematical Statistics:360-378.

Isermann B, Vinnikov IA, Madhusudhan T, Herzog S, Kashif M, Blautzik J, Corat MAF, Zeier M, Blessing E, Oh J. 2007. Activated protein C protects against diabetic nephropathy by inhibiting endothelial and podocyte apoptosis. Nature Medicine 13(11):1349-1358.

Iyengar SK, Abboud HE, Goddard KA, Saad MF, Adler SG, Arar NH, Bowden DW, Duggirala R, Elston RC, Hanson RL and others. 2007. Genome-wide scans for diabetic nephropathy and albuminuria in multiethnic populations: the family investigation of nephropathy and diabetes (FIND). Diabetes 56(6):1577-85.

Janssen B, Hohenadel D, Brinkkoetter P, Peters V, Rind N, Fischer C, Rychlik I, Cerna M, Romzova M, de Heer E and others. 2005. Carnosine as a protective factor in diabetic nephropathy: association with a leucine repeat of the carnosinase gene CNDP1. Diabetes 54(8):2320-7.

Kaufman RJ. 2002. Orchestrating the unfolded protein response in health and disease. Am Soc Clin Investig. p 1389-1398.

Ketteler M, Bongartz P, Westenfeld R, Wildberger JE, Mahnken AH, Bohm R, Metzger T, Wanner C, Jahnen-Dechent W, Floege J. 2003. Association of low fetuin-A (AHSG) concentrations in serum with cardiovascular mortality in patients on dialysis: a cross-sectional study. Lancet 361(9360):827-33.

Knowler WC, Coresh J, Elston RC, Freedman BI, Iyengar SK, Kimmel PL, Olson JM, Plaetke R, Sedor JR, Seldin MF. 2005. The Family Investigation of Nephropathy and Diabetes (FIND): design and methods. J Diabetes Complications 19(1):1-9.

Kumar D, Robertson S, Burns KD. 2004. Evidence of apoptosis in human diabetic kidney. Molecular and cellular biochemistry 259(1):67-70.

Laird NM, Horvath S, Xu X. 2000. Implementing a unified approach to family-based tests of association. Genet Epidemiol 19 Suppl 1:S36-42.

Leak TS, Perlegas PS, Smith SG, Keene KL, Hicks PJ, Langefeld CD, Mychaleckyj JC, Rich SS, Kirk JK, Freedman BI and others. 2009. Variants in intron 13 of the ELMO1 gene are associated with diabetic nephropathy in African Americans. Ann Hum Genet 73(2):152-9.

Lee JW, Kwak HJ, Lee JJ, Kim YN, Lee JW, Park MJ, Jung SE, Hong SI, Lee JH, Lee JS. 2008. HSP27 regulates cell adhesion and invasion via modulation of focal adhesion kinase and MMP-2 expression. Eur J Cell Biol 87(6):377-87.

Longmate JA. 2001. Complexity and power in case-control association studies. Am J Hum Genet 68(5):1229-37.

Lynch M, Walsh B. 1998. Genetics and Analysis of Quantitative Traits: Sinauer Associates.

Mehrotra R, Westenfeld R, Christenson P, Budoff M, Ipp E, Takasu J, Gupta A, Norris K, Ketteler M, Adler S. 2005. Serum fetuin-A in nondialyzed patients with diabetic nephropathy: Relationship with coronary artery calcification. Kidney Int 67(3):1070-1077.

Nielsen DM, Ehm MG, Zaykin DV, Weir BS. 2004. Effect of two- and three-locus linkage disequilibrium on the power to detect marker/phenotype associations. Genetics 168(2):1029-40.

Noh M, Yip B, Lee Y, Pawitan Y. 2006. Multicomponent variance estimation for binary traits in family-based studies. Genetic epidemiology 30(1).

Pawitan Y, Reilly M, Nilsson E, Cnattingius S, Lichtenstein P. 2004. Estimation of genetic and environmental factors for binary traits using family data. Statistics in medicine 23(3).

Pinheiro JC, Bates DM. 2000. Mixed-effects models in S and S-PLUS: Springer.

Prentice RL, Pyke R. 1979. Logistic disease incidence models and case-control studies. Biometrika 66:403-411.

Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38(8):904-9.

Pritchard JK, Stephens M, Rosenberg NA, Donnelly P. 2000. Association mapping in structured populations. Am J Hum Genet 67(1):170-81.

Purcell S, Sham P, Daly MJ. 2005. Parental phenotypes in family-based association analysis. The American Journal of Human Genetics 76(2):249-259.

Rabe-Hesketh S, Skrondal A, Gjessing HK. 2008. Biometrical modeling of twin and family data using standard mixed model software. Biometrics 64(1):280-8.

Rao CR. 1973. Linear statistical inference and its applications.

Sasieni PD. 1997. From genotypes to genes: doubling the sample size. Biometrics 53(4):1253-61.

Schaid DJ. 2004. Evaluating associations of haplotypes with traits. Genet Epidemiol 27(4):348-64.

Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA. 2002. Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet 70(2):425-34.

Schultz DW, Weleber RG, Lawrence G, Barral S, Majewski J, Acott TS, Klein ML. 2005. HEMICENTIN-1 (FIBULIN-6) and the 1q31 AMD locus in the context of complex disease: review and perspective. Ophthalmic Genet 26(2):101-5.

Shimazaki A, Kawamura Y, Kanazawa A, Sekine A, Saito S, Tsunoda T, Koya D, Babazono T, Tanaka Y, Matsuda M and others. 2005. Genetic variations in the gene encoding ELMO1 are associated with susceptibility to diabetic nephropathy. Diabetes 54(4):1171-8.

Slager SL, Schaid DJ. 2001. Evaluation of candidate genes in case-control studies: a statistical method to account for related subjects. Am J Hum Genet 68(6):1457-62.

Slager SL, Schaid DJ, Wang L, Thibodeau SN. 2003. Candidate-gene association studies with pedigree data: controlling for environmental covariates. Genet Epidemiol 24(4):273-83.

Song K, Elston RC. 2006. A powerful method of combining measures of association and Hardy-Weinberg disequilibrium for fine-mapping in case-control studies. Stat Med 25(1):105-26.

Spielman RS, McGinnis RE, Ewens WJ. 1993. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet 52(3):506-16.

Stefan N, Hennige AM, Staiger H, Machann J, Schick F, Kröber SM, Machicao F, Fritsche A, Häring HU. 2006. α2-Heremans-Schmid glycoprotein/fetuin-A is associated with insulin resistance and fat accumulation in the liver in humans. Diabetes Care 29(4):853-857.

The International HapMap Project. 2005. A haplotype map of the human genome. Nature 437(7063):1299-1320.

The Wellcome Trust Case Control Consortium. 2007. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447(7145):661-78.

Thompson CL, Klein BE, Klein R, Xu Z, Capriotti J, Joshi T, Leontiev D, Lee KE, Elston RC, Iyengar SK. 2007. Complement factor H and hemicentin-1 in age-related macular degeneration and renal phenotypes. Hum Mol Genet 16(17):2135-48.

Thornton T, McPeek MS. 2007. Case-control association testing with related individuals: a more powerful quasi-likelihood score test. The American Journal of Human Genetics 81(2):321-337.

Wang T, Zhu X, Elston RC. 2007. Improving power in contrasting linkage-disequilibrium patterns between cases and controls. Am J Hum Genet 80(5):911-20.

Wanic K, Placha G, Dunn J, Smiles A, Warram JH, Krolewski AS. 2008. Exclusion of polymorphisms in carnosinase genes (CNDP1 and CNDP2) as a cause of diabetic nephropathy in type 1 diabetes: results of large case-control and follow-up studies. Diabetes 57(9):2547-51.

Weir BS. 1996. Genetic data analysis II: methods for discrete population genetic data. Sunderland, Mass.: Sinauer Associates, Inc., Publishers.

Won S, Elston RC. 2008. The power of independent types of genetic information to detect association in a case-control study design. Genet Epidemiol 32(8):731-56.

Xiong M, Zhao J, Boerwinkle E. 2002. Generalized T2 test for genome association studies. Am J Hum Genet 70(5):1257-68.

Yu J, Pressoir G, Briggs WH, Bi IV, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB. 2005. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature genetics 38(2):203-208.

Zaykin DV. 2004. Bounds and normalization of the composite linkage disequilibrium coefficient. Genet Epidemiol 27(3):252-7.

Zaykin DV, Meng Z, Ehm MG. 2006. Contrasting linkage-disequilibrium patterns between cases and controls as a novel association-mapping method. Am J Hum Genet 78(5):737-46.

Zhao K, Nordborg M, Marjoram P. Genome-wide association mapping using mixed-models: application to GAW15 Problem 3; 2007. BioMed Central Ltd. p S164.

Zheng G, Meyer M, Li W, Yang Y. 2008. Comparison of two-phase analyses for case-control genetic association studies. Stat Med 27(24):5054-75.

Zheng G, Song K, Elston RC. 2007. Adaptive two-stage analysis of genetic association in case-control designs. Hum Hered 63(3-4):175-86.

Zhou H, Pisitkun T, Aponte A, Yuen PST, Hoffert JD, Yasuda H, Hu X, Chawla L, Shen RF, Knepper MA and others. 2006. Exosomal Fetuin-A identified by proteomics: A novel urinary biomarker for detecting acute kidney injury. Kidney Int 70(10):1847-1857.

Zhu X, Zhang S, Zhao H, Cooper RS. 2002. Association mapping, using a mixture model for complex traits. Genet Epidemiol 23(2):181-96.

Zöllner S, Pritchard JK. 2005. Coalescent-based association mapping and fine mapping of complex trait loci. Genetics 169(2):1071-92.