ASCERTAINMENT IN TWO-PHASE SAMPLING DESIGNS

FOR SEGREGATION AND LINKAGE ANALYSIS

by

GUOHUA ZHU

Submitted in partial fulfillment of the requirements

For the degree of Doctor of Philosophy

Dissertation Advisor: Dr. Robert C. Elston

Department of Epidemiology and Biostatistics

CASE WESTERN RESERVE UNIVERSITY

May, 2005

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the dissertation of

______

candidate for the Ph.D. degree *.

(signed)______(chair of the committee)

______

______

______

______

______

(date) ______

*We also certify that written approval has been obtained for any proprietary material contained therein.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ACKNOWLEDGEMENTS

ABSTRACT

CHAPTER I. LITERATURE REVIEW AND STATEMENT OF THE PROBLEM

1.1 Introduction

1.2 The Ascertainment Problem in Segregation Analysis

1.2.1 Segregation Analysis

1.2.2 The Ascertainment Problem in Segregation Analysis

1.2.2.1 The Ascertainment Problem in the Segregation Analysis of Sibships

1.2.2.2 Correcting for Ascertainment Bias in the Analysis of Sibships

1.2.2.3 Correcting for Ascertainment in Pedigree Analysis

1.3 The Ascertainment Problem in Linkage Analysis

1.3.1 Linkage Analysis

1.3.1.1 Model-Based Linkage Analysis

1.3.1.2 Model-Free Linkage Analysis

1.3.2 The Ascertainment Problem in Linkage Analysis

1.4 The Ascertainment Problem in Joint Segregation and Linkage Analysis

1.4.1 Joint Segregation and Linkage Analysis

1.4.2 The Ascertainment Problem in Joint Segregation and Linkage Analysis

1.5 Optimum Study Design and Cost Effectiveness

1.6 Weighted Distributions and Two-Phase (Sampling) Designs

1.6.1 Weighted Distributions

1.6.2 Two-Phase (Sampling) Designs

1.7 Statement of the Problem

CHAPTER II. TWO-PHASE DESIGN LIKELIHOODS FOR SEGREGATION ANALYSIS IN NUCLEAR FAMILIES

2.1 The Model and Its Assumptions

2.2 Simple Segregation Analysis

2.3 Complex Segregation Analysis

2.3.1 Estimation of Allele Frequencies from Segregation Analysis for a Recessive Model with Incomplete Heterozygote Penetrance

2.3.2 Estimation of Allele Frequencies from Segregation Analysis for a Dominant Model with Incomplete Heterozygote Penetrance

2.4 Summary

CHAPTER III. TWO-PHASE DESIGNS FOR SEGREGATION ANALYSIS IN PEDIGREES

3.1 Introduction

3.2 Simulation Procedure

3.3 Two-Phase (Sampling) Designs

3.4 Results

3.4.1 The Effects under Dominant Models

3.4.1.1 The Estimates of the Allele Frequency

3.4.1.2 The Estimates of the Penetrances

3.4.2 The Effects under Recessive Models

3.4.2.1 The Estimates of the Allele Frequency

3.4.2.2 The Estimates of the Penetrances

3.5 Summary and Discussion

CHAPTER IV. TWO-PHASE DESIGNS FOR LINKAGE ANALYSIS IN PEDIGREES

4.1 Introduction

4.2 Simulations

4.3 Method of Analysis

4.4 Results

4.4.1 The Effects under Dominant Models

4.4.1.1 Estimates of the Recombination Fraction

4.4.1.2 Maximum Lod Scores

4.4.2 The Effects under Recessive Models

4.4.2.1 Estimates of the Recombination Fraction

4.4.2.2 Maximum Lod Score

4.5 Summary

CHAPTER V. THE COST EFFECTIVENESS OF LINKAGE ANALYSIS IN TWO-PHASE DESIGNS

5.1 Introduction

5.2 A Cost Function for Two-Phase Sampling Designs

5.3 Results

5.4 Summary

CHAPTER VI. CONCLUSIONS AND TOPICS FOR FURTHER STUDY

6.1 Conclusions

6.2 Topics for Further Study

BIBLIOGRAPHY

LIST OF TABLES

Table 1.1 Genetic Transition Matrix for Two Alleles at One Autosomal Locus. Each Entry is a Genotypic Distribution [p_st1 p_st2 p_st3] Conditional on Mating Type s × t (Elston and Stewart 1971)

Table 2.1 Segregation Models for Recessive Inheritance with Incomplete Heterozygote Penetrance

Table 2.2 Segregation Models for Dominant Inheritance with Incomplete Heterozygote Penetrance

Table 3.1 Genetic Parameters for Simulation of Pedigrees

Table 3.2 Sample Pedigrees Simulated in Order to Produce Nm = 50 and 100, Respectively

Table 3.3 Mean (q̂), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Dominant Model for Segregation Analysis: Sample with 50 Multiplex Families

Table 3.4 Mean (q̂), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Dominant Model for Segregation Analysis: Sample with 100 Multiplex Families

Table 3.5 Mean (q̂), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Incompletely Dominant Model for Segregation Analysis: Sample with 50 Multiplex Families

Table 3.6 Mean (q̂), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Incompletely Dominant Model for Segregation Analysis: Sample with 100 Multiplex Families

Table 3.7 Mean (q̂), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Incompletely Dominant Model for Segregation Analysis: Sample with 50 Multiplex Families

Table 3.8 Mean (q̂), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Incompletely Dominant Model for Segregation Analysis: Sample with 100 Multiplex Families

Table 3.9 Mean (f̂), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Dominant Model for Segregation Analysis: Sample with 50 Multiplex Families

Table 3.10 Mean (f̂), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Dominant Model for Segregation Analysis: Sample with 100 Multiplex Families

Table 3.11 Mean (f̂), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Incompletely Dominant Model for Segregation Analysis: Sample with 50 Multiplex Families

Table 3.12 Mean (f̂), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Incompletely Dominant Model for Segregation Analysis: Sample with 100 Multiplex Families

Table 3.13 Mean (f̂), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Incompletely Dominant Model for Segregation Analysis: Sample with 50 Multiplex Families

Table 3.14 Mean (f̂), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Incompletely Dominant Model for Segregation Analysis: Sample with 100 Multiplex Families

Table 3.15 Mean (q̂), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Recessive Model for Segregation Analysis: Sample with 50 Multiplex Families

Table 3.16 Mean (q̂), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Recessive Model for Segregation Analysis: Sample with 100 Multiplex Families

Table 3.17 Mean (q̂), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Incompletely Recessive Model for Segregation Analysis: Sample with 50 Multiplex Families

Table 3.18 Mean (q̂), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Incompletely Recessive Model for Segregation Analysis: Sample with 100 Multiplex Families

Table 3.19 Mean (q̂), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Incompletely Recessive Model for Segregation Analysis: Sample with 50 Multiplex Families

Table 3.20 Mean (q̂), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Incompletely Recessive Model for Segregation Analysis: Sample with 100 Multiplex Families

Table 3.21 Mean (f̂), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Recessive Model for Segregation Analysis: Sample with 50 Multiplex Families

Table 3.22 Mean (f̂), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Recessive Model for Segregation Analysis: Sample with 100 Multiplex Families

Table 3.23 Mean (f̂), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Incompletely Recessive Model for Segregation Analysis: Sample with 50 Multiplex Families

Table 3.24 Mean (f̂), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Incompletely Recessive Model for Segregation Analysis: Sample with 100 Multiplex Families

Table 3.25 Mean (f̂), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Incompletely Recessive Model for Segregation Analysis: Sample with 50 Multiplex Families

Table 3.26 Mean (f̂), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Incompletely Recessive Model for Segregation Analysis: Sample with 100 Multiplex Families

Table 4.1 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Lod Score in the Dominant Model for Linkage Analysis: Sample with 50 Multiplex Families

Table 4.2 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Lod Score in the Dominant Model for Linkage Analysis: Sample with 100 Multiplex Families

Table 4.3 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Lod Score in the Incompletely Dominant Model for Linkage Analysis: Sample with 50 Multiplex Families

Table 4.4 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Lod Score in the Incompletely Dominant Model for Linkage Analysis: Sample with 100 Multiplex Families

Table 4.5 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Lod Score in the Incompletely Dominant Model for Linkage Analysis: Sample with 50 Multiplex Families

Table 4.6 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Lod Score in the Incompletely Dominant Model for Linkage Analysis: Sample with 100 Multiplex Families

Table 4.7 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Lod Score in the Recessive Model for Linkage Analysis: Sample with 50 Multiplex Families

Table 4.8 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Lod Score in the Recessive Model for Linkage Analysis: Sample with 100 Multiplex Families

Table 4.9 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Lod Score in the Incompletely Recessive Model for Linkage Analysis: Sample with 50 Multiplex Families

Table 4.10 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Lod Score in the Incompletely Recessive Model for Linkage Analysis: Sample with 100 Multiplex Families

Table 4.11 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Lod Score in the Incompletely Recessive Model for Linkage Analysis: Sample with 50 Multiplex Families

Table 4.12 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Lod Score in the Incompletely Recessive Model for Linkage Analysis: Sample with 100 Multiplex Families

Table 5.1 Estimated Numbers of Simplex and Multiplex Families to Achieve a Lod Score of [3 + 1.96 Se(lod)]: Dominant Models

Table 5.2 Estimated Numbers of Simplex and Multiplex Families to Achieve a Lod Score of [3 + 1.96 Se(lod)]: Recessive Models

Table 5.3 Estimated Cost to Achieve a Lod Score of [3 + 1.96 Se(lod)] for the Cost Function (5.2): Fully Penetrant Dominant Model

Table 5.4 Estimated Cost to Achieve a Lod Score of [3 + 1.96 Se(lod)] for the Cost Function: Incompletely Penetrant Dominant Model, Penetrance (f) = 0.8

Table 5.5 Estimated Cost to Achieve a Lod Score of [3 + 1.96 Se(lod)] for the Cost Function: Incompletely Penetrant Dominant Model, Penetrance (f) = 0.5

Table 5.6 Estimated Cost to Achieve a Lod Score of [3 + 1.96 Se(lod)] for the Cost Function: Fully Penetrant Recessive Model, Penetrance (f) = 1.0

Table 5.7 Estimated Cost to Achieve a Lod Score of [3 + 1.96 Se(lod)] for the Cost Function: Incompletely Penetrant Recessive Model, Penetrance (f) = 0.8

Table 5.8 Estimated Cost to Achieve a Lod Score of [3 + 1.96 Se(lod)] for the Cost Function: Incompletely Penetrant Recessive Model, Penetrance (f) = 0.5

LIST OF FIGURES

Figure 3.1 The Simulated Pedigree Structure

ACKNOWLEDGEMENTS

I am very fortunate to be Dr. Robert C. Elston’s graduate student. In every respect, he is my great role model. I have been truly inspired by his marvelous academic achievements and full devotion to science and touched by his persistent diligence.

Whenever I have questions and difficulties, I am always able to get his prompt help and enthusiastic encouragement. I am very grateful to him for his consistent support, intelligent advice, and careful correction of my dissertation. His guidance is priceless.

I also sincerely thank my other advisory committee members: Dr. J. Sunil Rao, Dr. Yuqun Luo, and Dr. Nidhan Choudhuri. I truly appreciate that they kindly agreed to be my committee members and spent their precious time on my research. Their wise suggestions and comments were very helpful.

Thanks also go to numerous other professors, staff members, and graduate students in the Department of Epidemiology and Biostatistics. I have benefited from their teaching, help, and scientific discussions.

Finally, I would like to thank my wife, Chun Gu, for her unconditional love, full support and confidence in me.


ASCERTAINMENT IN TWO-PHASE SAMPLING DESIGNS FOR SEGREGATION AND LINKAGE ANALYSIS

Abstract

By

Guohua Zhu

The optimization of genetic study designs is critical to detecting and locating genes responsible for complex diseases. It is affected by many factors, including the choice of sampling units, sampling procedures, phenotyping, genotyping, and analysis schemes (segregation analysis, linkage analysis, etc.).

In this dissertation, first we define two-phase sampling designs based on weighted distributions and then compare a variety of sampling schemes using numerous combinations with different proportions of multiplex (M) and simplex (S) families.

Subsequently, we establish two-phase designs for segregation analysis and linkage analysis of general pedigrees by simulations and propose a cost function for linkage analysis in two-phase designs. Our new likelihood proposed for two-phase designs can provide consistent estimates of the parameters and optimal designs. Especially when there is incomplete penetrance, having more simplex families in the samples is necessary to obtain better parameter estimates. Under the fully penetrant dominant model, a sampling design with 9% simplex families provides good estimates of the allele frequency (q) and recombination fraction (θ). Under the dominant model with penetrance = 0.8, a sampling design with 18% simplex families yields better estimates of q and θ and is more cost effective. Under the dominant model with penetrance = 0.5, samples with 23% to 28% simplex families yield better estimates of q and θ and are more cost effective. Under the recessive model with full penetrance, samples with 29% to 38% simplex families produce better estimates of q and θ and are more cost effective. Under the recessive models corresponding to a penetrance of 0.8, the effects of simplex families obviously increase; samples with 38% simplex families yield better estimates of q and θ and are most cost effective. Under the recessive models corresponding to a penetrance of 0.5, samples with 41% simplex families have a better capacity for producing consistent estimates of the parameters as well as a higher efficacy regarding cost for testing genetic models.

CHAPTER I

LITERATURE REVIEW AND STATEMENT OF THE PROBLEM

1.1 Introduction

In genetic epidemiology, when a study is designed and conducted, the collected data need to be analyzed to answer the questions of interest. Fisher (1934) wrote in his classical paper on the effect of methods of ascertainment: “It is a statistical commonplace that the interpretation of a body of data requires knowledge of how it was obtained”. The method of analysis should be intimately related to the method of sampling if one wants to make appropriate inferences.

It would be inefficient to study a random sample of families from the population in order to investigate the familial aggregation of a very rare disease. Because most of the families sampled would be disease-free, such a sample of families would be virtually uninformative (Elston and Bonney 1984). Therefore, early geneticists studied only the families of affected individuals and developed methods of segregation analysis (e.g., Weinberg 1912) for this situation. Studies aimed at understanding the genetic basis of a disease may produce biased parameter estimates if subjects are selected according to the presence or absence of the disease in at least one family member. Persons through whom the rest of the pedigree is independently ascertained are often called probands. Sampling through affected persons considerably alters the expected ratio of affected to unaffected persons; moreover, affected persons ascertained in this manner may not be representative of affected cases in the general population, or even of the segment of the population from which they were selected. This “ascertainment bias” is due to nonrandom sampling. To the human geneticist, ascertainment may cause bias of parameter estimates in segregation or linkage analysis (Elston 1995; Vieland and Hodge 1996). Ascertainment bias is also relevant to the estimation of familial correlations in sibships, nuclear families, and larger pedigree structures. Thus the problem of ascertainment should not be readily dismissed or ignored (Keen and Elston 2001).

In section 1.2 we will review the ascertainment problem in segregation analysis. In section 1.3 we will review the ascertainment problem in linkage analysis, including model-free and model-based methods for qualitative and quantitative traits. In section 1.4 we will review the ascertainment problem in joint segregation and linkage analysis. In section 1.5 we will review optimum study design and cost effectiveness, and in section 1.6 weighted distributions and two-phase sampling designs. Finally, the proposed research will be described in section 1.7.

1.2 The Ascertainment Problem in Segregation Analysis

1.2.1 Segregation Analysis

Segregation analysis is the statistical methodology used to determine the mode of inheritance of a particular phenotype from family and pedigree data, especially with a view to elucidating the variation resulting from the genotypes at a single locus (Elston 1981).

An important feature of segregation analysis is that it may involve the estimation of many parameters, typically accomplished through maximum likelihood estimation (MLE) procedures (Elston and Stewart 1971; Cannings et al. 1976; Lange and Boehnke 1983). Most algorithms limit the major gene component to a diallelic system and assume Hardy-Weinberg equilibrium. Three general classes of parameters are required to construct the likelihood of a general model of inheritance. The first two classes specify the frequencies with which the different types of individuals may occur. In pedigree data, there are two kinds of individuals: founders, whose parents are unknown, and nonfounders, who are the offspring or descendants of founders. Frequency parameters specify the probability of any founder having each of the essential types, called ousiotypes by Cannings et al. (1978). These frequency parameters may be simple products of binomial proportions, or more generally multinomial frequencies. Transmission parameters, relevant for nonfounders, are defined as the probability of each parental type transmitting an underlying allele to an offspring. The third class of parameters, called penetrance parameters, specifies the conditional probability of observing the phenotype of an individual given the individual’s type.

There have been two main segregation models from which other models have been developed.

(1) The transmission probability model: The transition probabilities, i.e. the probabilities of genotypes conditional on parental genotypes, are expressed in terms of transmission probabilities, the probabilities that a person of given genotype transmits a particular allele; and these are allowed to vary in the interval [0, 1] instead of being set equal to the Mendelian values of 0, ½ or 1 (Elston and Stewart 1971). The genetic transition matrix for segregation of two alleles at one autosomal locus is listed in Table 1.1. We order the genotypes 1 = AA, 2 = Aa, and 3 = aa, so that the sth row corresponds to genotype s for one of the parents and the tth column corresponds to genotype t for the other parent. The (s, t)th entry in the matrix is the vector [p_st1 p_st2 p_st3], i.e. the genotypic distribution of offspring from the mating s × t. Now define the transmission probability τ_t as the probability that an individual with genotype t transmits A to offspring, so that 1 − τ_t is the probability that the individual transmits a. Then the entries in Table 1.1 can be generated simply from

\[ [\, p_{st1} \;\; p_{st2} \;\; p_{st3} \,] = [\, \tau_s \tau_t \;\;\; \tau_s (1 - \tau_t) + \tau_t (1 - \tau_s) \;\;\; (1 - \tau_s)(1 - \tau_t) \,] \qquad (1.1) \]

by substituting τ_AA = 1, τ_Aa = 1/2 and τ_aa = 0, which are the appropriate Mendelian values if there is no mutation or meiotic drive.

Table 1.1 Genetic Transition Matrix for Two Alleles at One Autosomal Locus. Each Entry is a Genotypic Distribution [p_st1 p_st2 p_st3] Conditional on Mating Type s × t (Elston and Stewart 1971)

                                 t
s           1 = AA          2 = Aa            3 = aa
1 = AA      [1 0 0]         [1/2 1/2 0]       [0 1 0]
2 = Aa      [1/2 1/2 0]     [1/4 1/2 1/4]     [0 1/2 1/2]
3 = aa      [0 1 0]         [0 1/2 1/2]       [0 0 1]
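
As an illustration of equation (1.1), the following minimal Python sketch generates the entries of Table 1.1 from the transmission probabilities. The names `offspring_distribution`, `transition_matrix` and `MENDELIAN_TAU` are chosen here for illustration only; they are not taken from the original analysis or from any software package.

```python
from fractions import Fraction

# Mendelian transmission probabilities: the chance that a parent of the
# given genotype transmits allele A (no mutation or meiotic drive).
MENDELIAN_TAU = {"AA": Fraction(1), "Aa": Fraction(1, 2), "aa": Fraction(0)}


def offspring_distribution(tau_s, tau_t):
    """Equation (1.1): probabilities of offspring genotypes (AA, Aa, aa)
    for parents with transmission probabilities tau_s and tau_t."""
    return (
        tau_s * tau_t,
        tau_s * (1 - tau_t) + tau_t * (1 - tau_s),
        (1 - tau_s) * (1 - tau_t),
    )


def transition_matrix(tau=MENDELIAN_TAU):
    """Genotypic distribution of offspring for every mating type s x t."""
    return {(s, t): offspring_distribution(tau[s], tau[t])
            for s in tau for t in tau}


if __name__ == "__main__":
    for (s, t), dist in transition_matrix().items():
        print(f"{s} x {t}: [{' '.join(str(x) for x in dist)}]")
```

Passing values of τ other than the Mendelian ones to `transition_matrix` gives the more general transition probabilities that the transmission probability model allows.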

(2) The mixed model: In the mixed model, the phenotype may be the result of a combination of a single major locus and a large number of loci whose alleles act additively (Elston and Stewart 1971; Morton and Maclean 1974; Lalouel and Morton 1981).

Lalouel et al. (1983) proposed a unified model that combines the above two models. Bonney (1984, 1986, 1992) developed regressive models, which permit simultaneous adjustment for familial correlations and covariates while estimating the parameters of the various models in a segregation analysis. Many of the regressive models have been implemented in the software package S.A.G.E. (Statistical Analysis for Genetic Epidemiology, 2004).

It is important to point out that designing and performing a complex segregation analysis is often a long and expensive process. It is critical that care be taken to clearly define the ascertainment of each pedigree, and that appropriate corrections for the ascertainment scheme be used.

1.2.2 The Ascertainment Problem in Segregation Analysis

The term “ascertainment” refers to a mode of sampling that depends on the outcome that we wish to analyze as a dependent variable (Burton et al. 2001).

Ascertainment may be relevant not only in segregation analyses, but also in the estimation of familial correlations and linkage analyses (Elston 1995). There are three sampling considerations that need to be addressed in the design of any epidemiological study: the definition of the sampling unit, how the units are to be sampled from the population, and the number of units that will be sampled (Elston and Bonney 1984).

Bateson (1909) first recognized the possibility of ascertainment bias, and Weinberg (1912) considered more fully alternative sampling schemes and their effects on the estimation of the frequency of, and testing the model of, a recessive trait. Hogben (1931), Fisher (1934), and Haldane (1938) developed the theory of ascertainment adjustment in the context of rare recessive disorders, while Smith (1959) and Morton (1959, 1969) extended this theory to cases of complex genetic models. These authors were concerned with segregation at a single locus and developed methods for the analysis of independent sibships ascertained through probands, by conditioning the likelihood for each family on each family containing at least one proband. The sampling unit was the sibship, if the selection was through an affected child, or the nuclear family, if selection was through an affected parent. Unless one allows for the ascertainment, inference pertains to either a population of sibships each with at least one affected person, or a population of nuclear families each with at least one affected parent. Because the sample is not randomly selected, an appropriate ascertainment correction must be derived if we are to make inferences to the general population.

1.2.2.1 The Ascertainment Problem in the Segregation Analysis of Sibships

In the classical case of a rare recessive disease, when analyzing a simple discrete phenotype (e.g., affected versus unaffected) in sibships from a single mating type, the probability (p) of any given offspring being affected, i.e., the segregation parameter or segregation ratio, is unknown, and the goal of the analysis is to estimate p and to test for departure from Mendelian expectation. Assuming children within a sibship are independent observations, the binomial distribution describes the probability of r affected offspring in a sibship of size s and is given by:

\[ P(r; s, p) = \binom{s}{r} p^{r} (1 - p)^{s - r}, \quad r = 1, 2, \ldots, s. \qquad (1.2) \]

Fisher (1934) assumed that an affected person becomes a “proband” (i.e., an affected person who, independently of all others in a sibship, causes the sibship to enter the sample) with probability π (this is the classical definition of the term “proband”).

With ascertainment through affected offspring, three possibilities may be described. Let A denote the event that the family is ascertained:

1. “Complete ascertainment” (or “truncate selection”) corresponds to the case where π = 1. In this situation, the probability that a sibship enters the sample is independent of the number of affected members it contains, provided it contains at least one affected member. Then,

\[ P(r \mid s, p, A) = \frac{\binom{s}{r} p^{r} (1 - p)^{s - r}}{1 - (1 - p)^{s}}, \quad r = 1, 2, \ldots, s. \qquad (1.3) \]

A simple estimator for the segregation proportion in the situation of sampling from a truncated binomial distribution was proposed by Li and Mantel (1969). Here, the estimator for the segregation proportion is computed as

\[ \hat{p} = \frac{R - J_1}{S - J_1}, \qquad (1.4) \]

where R is the total number of affected offspring in a sample of S offspring, and J_1 represents the number of simplex sibships, that is, the number of sibships with only one affected member. The variance of this approximate estimator is generally greater than that of the regular maximum likelihood estimator, but less than that of Weinberg’s original estimator. For large samples, the variance of this estimator is

\[ \mathrm{Var}(\hat{p}) = \frac{(R - J_1)(S - R)}{(S - J_1)^{3}} + \frac{2 J_2 (S - R)^{2}}{(S - J_1)^{4}}, \qquad (1.5) \]

where J_2 is the number of sibships with exactly two affected individuals (Davie 1979). A sibship with two or more affected members is called multiplex.

2. “Single ascertainment” occurs in the limit as π → 0; it is named to reflect the fact that, as π → 0, there is a very small probability that any family has more than a single proband. As π → 0, the probability that the sibship enters the sample becomes proportional to the number of affected members it contains. Then there is one proband per family. Under this model,

\[ P(r \mid s, p, A) = \binom{s - 1}{r - 1} p^{r - 1} (1 - p)^{s - r}, \quad r = 1, 2, \ldots, s. \qquad (1.6) \]

3. “Multiple ascertainment”, corresponding to the case where 0 < π < 1, is often assumed when one does not know the probability of ascertainment. Then, the segregation model distribution of an ascertained family is

\[ P(r \mid s, p, A) = \frac{\binom{s}{r} p^{r} (1 - p)^{s - r} \left[ 1 - (1 - \pi)^{r} \right]}{1 - (1 - \pi p)^{s}}, \quad r = 1, 2, \ldots, s. \qquad (1.7) \]

It can be shown that the two other ascertainment possibilities represent limiting cases of this one.
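
The three ascertainment models and the estimator of Li and Mantel can be checked numerically. The Python sketch below codes equations (1.3), (1.6) and (1.7) directly, together with (1.4) and (1.5). All function names, and the values s = 4 and p = 0.25, are illustrative assumptions introduced here. The sketch is meant only to show that (1.7) reduces to (1.3) at π = 1 and approaches (1.6) as π → 0, and that the estimator (1.4) returns p when expected counts under complete ascertainment are supplied.

```python
from math import comb


def binomial_pmf(r, s, p):
    """Binomial probability of r affected offspring in a sibship of size s."""
    return comb(s, r) * p**r * (1 - p) ** (s - r)


def complete_pmf(r, s, p):
    """Equation (1.3): complete ascertainment (pi = 1), r = 1, ..., s."""
    return binomial_pmf(r, s, p) / (1 - (1 - p) ** s)


def single_pmf(r, s, p):
    """Equation (1.6): single ascertainment (limit pi -> 0), r = 1, ..., s."""
    return comb(s - 1, r - 1) * p ** (r - 1) * (1 - p) ** (s - r)


def multiple_pmf(r, s, p, pi):
    """Equation (1.7): multiple ascertainment, 0 < pi <= 1, r = 1, ..., s."""
    numerator = binomial_pmf(r, s, p) * (1 - (1 - pi) ** r)
    return numerator / (1 - (1 - pi * p) ** s)


def li_mantel_estimate(R, S, J1):
    """Equation (1.4): estimator of the segregation proportion."""
    return (R - J1) / (S - J1)


def li_mantel_variance(R, S, J1, J2):
    """Equation (1.5): large-sample variance of the estimator in (1.4)."""
    return ((R - J1) * (S - R) / (S - J1) ** 3
            + 2 * J2 * (S - R) ** 2 / (S - J1) ** 4)


if __name__ == "__main__":
    s, p = 4, 0.25  # hypothetical sibship size and segregation ratio
    for r in range(1, s + 1):
        # Columns: r, (1.3), (1.7) at pi = 1, (1.6), (1.7) at pi near 0.
        print(r, complete_pmf(r, s, p), multiple_pmf(r, s, p, 1.0),
              single_pmf(r, s, p), multiple_pmf(r, s, p, 1e-6))
    # Expected counts per sibship under complete ascertainment.
    expected_R = sum(r * complete_pmf(r, s, p) for r in range(1, s + 1))
    expected_J1 = complete_pmf(1, s, p)
    print(li_mantel_estimate(expected_R, s, expected_J1))
```

With observed data, R, S, J_1 and J_2 would be totals over all sampled sibships; the expected counts are used here only so that the check is deterministic.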

1.2.2.2 Correcting for Ascertainment Bias in the Analysis of Sibships

There are two assumptions essential for applying the binomial distribution to describe the distribution of probands among affected individuals. The first is that there exists some fixed probability of an affected individual being a proband; the second is that probands are ascertained independently of one another. However, these assumptions are neither testable nor necessarily supported in family studies. Different affected individuals may have different probabilities of being ascertained, depending on their access to medical care, the severity and age of onset of their disease, and so on. In addition, there are a number of sources of ascertainment available to any one study: clinic referrals, disease registries, vital records, population surveys, etc. Each source may yield a different probability of identifying affected individuals as probands, and the probabilities among all groups may not be constant (Khoury et al. 1993). Information about the probability of ascertaining an affected individual as a proband (π) can be obtained by noting the number of overlapping ascertainments for each proband, that is, how many probands were ascertained by only one source, by two sources, and so on (Morton 1982). However, frequently the sources of ascertainment are not mutually exclusive, resulting in an underestimate of π. In addition, it is difficult to prove the second assumption of complete independence (conditional on being affected) among probands, especially within a sibship or pedigree. In fact, in a family, a second affected child is usually brought to attention more easily than the first one because the parents are more experienced about the disease. This dependence may also occur beyond the sibship. For instance, a cousin of a proband may have a greater chance of being ascertained by a single clinic because the extended family is more informed about the disease. On the other hand, if a disease or trait is regarded as “normal” by the family, the chance for a second affected child to enter the study becomes smaller. Greenberg (1986) showed that ignoring such dependence among potential probands can cause severe biases in the estimated segregation ratio and compromise tests of genetic mechanisms.

Under the binomial assumption, π approaching 0 leads to the probability of a sibship entering the study being proportional to the number of affected offspring (r) in the sibship. Relative to their frequencies in the population, sibships with two affected offspring are represented in the sample twice as frequently as those with one affected offspring; sibships with three affected children are represented three times as frequently, and so on. However, distortions beyond the proportional pattern may also appear. For example, the probability of ascertaining a family may increase exponentially with the number of affected individuals. Thus, multiplex sibships can be very greatly overrepresented in the sample. Ewens and Shute (1986) proposed that some sets of family data appear to fit a quadratic function in which the probability of the family being ascertained rises with the square of the number of affected individuals. This quadratic pattern could easily be derived from two layers of single ascertainment. The first layer might occur at the clinic level, while the second could do so at the level of a collaborative study among clinics.
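
The proportional and quadratic patterns follow from simple arithmetic on 1 − (1 − π)^r, the probability that a sibship with r affected members contains at least one proband when each affected member independently becomes a proband with probability π. The Python sketch below is a hedged illustration: the function names and the value π = 0.0001 are hypothetical, and treating two layers of ascertainment as the product of two single-layer weights is an assumption made here, not a construction taken from Ewens and Shute.

```python
def ascertainment_probability(r, pi):
    """Probability that a sibship with r affected members has at least
    one proband, each affected member being a proband with probability pi."""
    return 1.0 - (1.0 - pi) ** r


def relative_representation(r, pi):
    """Over-representation of sibships with r affected members relative to
    sibships with one affected member."""
    return ascertainment_probability(r, pi) / ascertainment_probability(1, pi)


def two_layer_weight(r, pi_clinic, pi_study):
    """Assumed model of two layers of ascertainment (clinic, then a
    collaborative study): the two relative weights are multiplied."""
    return (relative_representation(r, pi_clinic)
            * relative_representation(r, pi_study))


if __name__ == "__main__":
    pi = 1e-4  # hypothetical small ascertainment probability
    for r in (1, 2, 3, 4):
        # Columns: r, weight near r, weight near r squared, weight at pi = 1.
        print(r, relative_representation(r, pi),
              two_layer_weight(r, pi, pi), relative_representation(r, 1.0))
```

For small π the first weight is close to r and the two-layer weight is close to r squared, while at π = 1 every sibship with at least one affected member has the same weight, as under complete ascertainment.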

Ewens and Shute (1986) established a nonparametric approach to adjust for ascertainment. Instead of assuming any specific form of the probability of ascertaining a family, the likelihood function of the genetic model is conditioned on the subset of the data relating to ascertainment. In terms of nuclear families ascertained through affected children, this subset would constitute the number of affected children and possibly the status of their parents; with respect to larger pedigrees, it might be sibships or families ascertained through a single source. It has been demonstrated that misspecifying the ascertainment function for family data can result in biases in estimates of genetic parameters (Shute and Ewens 1988a, 1988b). Conditioning on subsets of the data causes loss of precision in the final estimators, because the conditioning process removes the information contained in part of the data. The standard errors of the final estimators will therefore be greater than those obtained by an unconditional procedure (Shute and Ewens 1988a).

Guo (1998) provided an example showing that incorrect modeling of ascertainment can increase the probability of falsely concluding that familial aggregation exists, even if no genetic or environmental correlation is present. He pointed out that failure to correct for ascertainment can bias estimates of sibling recurrence risk but did

not provide a means of adjusting for ascertainment bias. Olson and Cordell (2000)

explored the relationship between ascertainment of sibships and estimation and

interpretation of genetic risk parameters. They showed that in the absence of

ascertainment correction, unbiased estimation of the sibling recurrence risk and overall sibling relative risk requires single ascertainment, while unbiased estimation of locus-specific relative risk requires complete ascertainment. Cordell and Olson (2000) showed

that relative risk estimates obtained using affected-sib-pair data are asymptotically

unbiased when each pair is given a weight inversely proportional to the sibship

ascertainment probability.

It is commonly assumed that the parameter estimates of a statistical genetic model

that has been adjusted for ascertainment will estimate parameters in the general

population from which the ascertained subpopulation was originally drawn. Burton et al.

(2001) recently demonstrated that, when there is unmodeled heterogeneity, the usual ascertainment adjustment leads to parameter estimates that do not reflect those in the

original population. This conclusion is certainly true and is a useful warning to avoid performing genetic analyses uncritically. Burton et al. (2001) went on to state the important finding that, given unmodeled heterogeneity, ascertainment-adjusted parameter estimates reflect parameter values in the ascertained subpopulation, and they supported

their claim with two examples. By revisiting these examples, Epstein et al. (2002)

demonstrated that: (1) if the genetic mechanism (including heterogeneity) and

ascertainment scheme are appropriately modeled, the genetic analysis should yield

estimates consistent with the parameter values in the original population; and (2) if not, the estimates using the conventional method cannot be expected to equal the parameters

in either the original population or the ascertained subpopulation. Burton et al. (2002) countered that when heterogeneity is ignored, the resultant ascertainment-adjusted estimates reflect parameters in the ascertained subpopulation rather than those in the original population. Nie (2003) demonstrated that the conclusions of both Burton et al. (2001) and Epstein et al. (2002) are correct, but with respect to different sample spaces: the ascertained subpopulation and the general population, respectively. When the sample size is moderate or small, the conditional approach of Burton et al. (2001) should work better than the marginal approach in Epstein et al. (2002), because the latter approach usually has a flatter integrated likelihood (Nie 2003).

1.2.2.3 Correcting for Ascertainment in Pedigree Analysis

Just as sampling sibships through a proband changes the expected distribution of affected and unaffected offspring in the final sample, so too does sampling extended pedigrees through an affected proband (or through a proband with an extreme value of a quantitative phenotype). Unfortunately, it is not possible to write the expected likelihood function for a general model of inheritance in any simple fashion for a “general” pedigree because the likelihood function depends on both the model and the particular structure of the individual pedigree. Therefore, unlike in the analysis of sibship data, it is difficult to write any simple form for the joint likelihood function for the proband (or the subset involved in the ascertainment) and the remainder of the pedigree (Khoury et al. 1993).

Elston and Sobel (1979) proposed conditioning on the pedigree containing at least

one proband among persons who could be probands, regardless of their phenotypes.

This, in effect, suggests that the classic method can be extended to pedigrees, without

regard to their sizes or structure, although it is not known what the actual sampling unit

may be. On the other hand, Lalouel and Morton (1981) treated the sibships within

pedigrees as the sampling units, in keeping with the classic approach. They noted, however, that some sibships in pedigrees may not include probands; they therefore based the ascertainment correction for such nuclear families on pointers: relatives (of extreme

phenotype) outside the nuclear family who may have caused the selection of those nuclear

families. In their likelihood formulation, each pedigree is broken into nuclear families,

and the latter handled as if they were independently selected from the population after

conditioning on parents’ and pointers’ phenotypes.

Cannings and Thompson (1977) showed that, under certain sequential sampling schemes, one only needs to condition on the initial ascertainment event, and that it is not necessary to break up the pedigree. They showed that independently sampled pedigrees that join up remain problematic. The ascertainment correction of the likelihood they derived depends critically on the implicit assumption of single ascertainment. Although the method was not presented in those terms, well-defined sampling units can be decided on and drawn sequentially, as they suggested.

Ewens and Shute (1986) and Shute and Ewens (1988a) showed, without

consideration of pedigree structure, that conditioning on parts of the data relevant to

ascertainment can lead to results robust to certain ascertainment models. There is loss of

efficiency in some important cases. Thompson (1988) proposed that their procedure can

be more appropriately viewed as an example of Cox's (1972, 1975) partial likelihood.

The methods of ascertainment correction, especially in pedigrees, have generated

controversy. Vieland and Hodge (1995) reported that the problem is intractable, largely because of difficulties in specifying the probability distribution for the structure of a pedigree. In view of the problems that they described, Vieland and Hodge recommended

that future research efforts should focus on the development of robust approximate

approaches to ascertainment using simulation studies, for use when exact approaches are

impractical or intractable. Robustness in ascertainment corrections is taken to mean that

Fisher's (1934) π model is not used but that, instead, conditioning on data relevant for ascertainment is used (Risch 1984; Ewens and Shute 1986; Shute and Ewens 1988a).

Elston (1995), in his invited editorial on the work of Vieland and Hodge, argued that the problem is not pedigree structure per se but, rather, the fact that independently sampled

branches can join up, as Cannings and Thompson (1977) had already noted in the context

of sequential sampling.

Another unsettling point that does not appear to have been made in the discussion

of ascertainment correction is the irony that the criticisms and the proposed robust

methods are both based on the work of Fisher (1934), who considered only the simple

genetic model of complete penetrance with a known mating type, so that the segregation

ratio is the only unknown genetic parameter needing to be statistically estimated (Bonney

1998). Most modern methods for segregation analysis (e.g., see Elston and Stewart 1971;

Morton and MacLean 1974; Lalouel et al. 1983; Bonney 1986) and linkage analysis

requiring ascertainment correction (Bonney et al. 1988) use models that are far more

complex than that used by Fisher. It is not known for certain that the so-called robust

approaches apply to the more complex models and pedigree structures.

Despite the conclusion of Vieland and Hodge (1995) that the ascertainment problem for pedigree data is inherently intractable, Rabinowitz (1996) suggested that the Cannings and Thompson approach to correcting for ascertainment bias in family studies can be extended to settings with multiple ascertainment based on maximizing a pseudolikelihood. He described two approaches to computing standard errors for the maximum pseudolikelihood estimates by simulation.

Bonney (1998) undertook a systematic study of ascertainment corrections in

likelihood models for family data that can be made in terms of smaller distinct units, such

as sibships, nuclear families, or small pedigrees. He presented three principal results.

The first is that ascertainment corrections in likelihood models for family data can be

made in terms of smaller units, without breaking up the pedigree. The second is that the

appropriate correction for single ascertainment in a unit is the reciprocal of the sum of the marginal probabilities of all the persons relevant to its ascertainment, as if affected. The third is a generalization of the single ascertainment-correction formula to k-plex ascertainment, in which each unit has k or more affected members.

Ginsburg et al. (2003) investigated the general theoretical possibilities of correcting the likelihood for the sampling procedures defined by a preplanned sampling design. The sampling itself includes three different procedures: initial pedigree ascertainment, pedigree extension and pedigree censoring (exclusion of pedigree members from the sample analyzed). They showed that an adequate correction can be made by conditioning the pedigree likelihood on the structure of the pedigree data

relevant to sampling (RS) in either an ascertainment-model-based (AMB) or an

ascertainment-model-free (AMF) approach.

1.3 The Ascertainment Problem in Linkage Analysis

1.3.1 Linkage Analysis

When two loci are sufficiently close together on the same chromosome, their

alleles cosegregate when passed on to the next generation and we say that the two loci are

linked. In its simplest form, linkage analysis consists of counting recombinants and

nonrecombinants, estimating the recombination fraction (θ), and testing whether this

fraction is significantly < ½ (Elston 1998a). Linkage analysis can be model-based or

model-free. Model-based analyses are often described in the literature as lod score

methods, maximum likelihood methods or parametric analysis, and model-free analyses

have been called nonparametric and robust, even though they involve parameters and can

share with model-based methods many robust properties, especially with respect to test

validity (Elston 1998b).

1.3.1.1 Model-Based Linkage Analysis

The classical measure of genetic linkage is the recombination fraction (θ).

Recombination occurs between two loci on a pair of homologous chromosomes when the chromosomes cross over an odd number of times between them. A nonrecombinant

offspring is one in which the parental type remains intact. Multiple crossovers can occur between two loci. If an even number of crossovers has occurred between the two loci, the resulting offspring is a nonrecombinant between the two loci. However, multiple crossovers between closely linked genes are very rare. The recombination fraction θ ranges from 0 for loci that are completely linked to 0.5 for unlinked loci on the same chromosome or on different chromosomes. The recombination fraction may depend on the sex, race, age, or some other characteristic of the transmitting person, and allowing for such a dependency can increase the power of linkage analysis (e.g., Cleves and Elston 1997).

The unit of measurement of genetic linkage is the centiMorgan (cM) (1/100 of a

Morgan). One map unit corresponds to 1 cM, which is close to 1% recombination. For small values, θ is approximately equal to the actual map distance in Morgans. Between loci, recombination fractions are additive over small distances. For larger distances, recombination fractions are not additive because multiple crossovers are more likely.

When this is the case, a map function is used to transform θ values into actual map distance. Some commonly used map functions are those of Kosambi (1944) and Haldane

(1919). Haldane’s map function assumes no interference. The Kosambi map function takes into account interference, at least to some degree, and is the map function most often used in human genetics.
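The two map functions named above can be written out explicitly. The following is a brief sketch using the standard closed forms, with map distance d in Morgans: Haldane's function is θ = ½(1 − e^(−2d)) and Kosambi's is θ = ½ tanh(2d). The function names are illustrative choices, not part of the original text.

```python
import math


def haldane_distance(theta):
    """Map distance in Morgans under Haldane's function (no interference)."""
    return -0.5 * math.log(1.0 - 2.0 * theta)


def haldane_theta(distance):
    """Recombination fraction for a map distance in Morgans (Haldane)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance))


def kosambi_distance(theta):
    """Map distance in Morgans under Kosambi's function (some interference)."""
    return 0.25 * math.log((1.0 + 2.0 * theta) / (1.0 - 2.0 * theta))


def kosambi_theta(distance):
    """Recombination fraction for a map distance in Morgans (Kosambi)."""
    return 0.5 * math.tanh(2.0 * distance)


if __name__ == "__main__":
    # Valid for 0 <= theta < 0.5; both distances diverge as theta approaches 0.5.
    for theta in (0.01, 0.10, 0.30):
        print(theta,
              round(100 * haldane_distance(theta), 2),   # centiMorgans
              round(100 * kosambi_distance(theta), 2))   # centiMorgans
```

For small θ both functions return a distance close to θ itself, which is the near-additivity noted above; at larger θ the Haldane distance exceeds the Kosambi distance because the absence of interference allows more multiple crossovers.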

Morton (1955) introduced the concept of a lod score, Z(θ), to linkage analysis.

This is the common logarithm of the ratio of the likelihood of the data at a given value of θ to the likelihood at θ = 0.5. When evaluated at $\hat{\theta}$, the maximum likelihood estimate of θ, the maximized lod score, $Z(\hat{\theta})$, is a measure of the strength of evidence for linkage.

Typically, a maximum lod > 3 implies, in a large sample, a P-value < 10⁻⁴ (Chotai 1984), which has been used as proof of linkage for two-point analysis. A lod score of -2 or less has traditionally been considered evidence against linkage. When a genome-wide approach is used in a linkage study, the critical value of the lod score should be raised to account for the testing of multiple markers. A lod score of 3.3, corresponding asymptotically to P = 5 × 10⁻⁵, is equivalent in a dense genome scan to the recommended

genome-wide significance level of 5% (Lander and Kruglyak 1995). Several assumptions may be made in calculating the lod score for a model-based linkage analysis. The common assumptions include: the trait phenotype is controlled by segregation at only a

single locus; there is no pleiotropic effect of the marker locus on the trait; all pedigree

founders are unrelated to each other; and all parameters other than the recombination

fraction are known without error (this includes all penetrances and allele frequencies, at

both marker and trait loci) (Elston 1998b). The lod score method proved to be an

extremely powerful method for linkage analysis after Elston and Stewart (1971) provided

an algorithm by which the likelihood for extended pedigrees could be computed. When

the lod score is maximized over both the recombination fraction θ and genetic model

parameters, it is called a mod score (Clerget-Darpoux et al. 1986). The mod score can be

used not only to detect linkage, but also to determine the appropriate genetic model. It

has been shown that using the mod score approach to infer a true genetic model is

ascertainment assumption free (Hodge and Elston 1994).
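The lod score and its large-sample P-value can be illustrated in the simplest setting. The sketch below assumes phase-known, fully informative meioses, so that the likelihood is binomial in the number of recombinants, and it uses the usual one-sided asymptotic result that 2 ln(10) Z is distributed as a 50:50 mixture of zero and a chi-square with one degree of freedom. The function names and the example counts are hypothetical.

```python
import math


def lod_score(theta, recombinants, meioses):
    """Z(theta) = log10[L(theta) / L(0.5)] for phase-known meioses.

    theta must lie in (0, 0.5], or equal 0 when there are no recombinants.
    """
    k, n = recombinants, meioses
    log_likelihood = 0.0
    if k > 0:
        log_likelihood += k * math.log10(theta)
    if n - k > 0:
        log_likelihood += (n - k) * math.log10(1.0 - theta)
    return log_likelihood - n * math.log10(0.5)


def max_lod(recombinants, meioses):
    """Maximum likelihood estimate of theta (capped at 0.5) and its lod."""
    theta_hat = min(recombinants / meioses, 0.5)
    return theta_hat, lod_score(theta_hat, recombinants, meioses)


def asymptotic_p_value(lod):
    """One-sided large-sample P-value for a maximized lod score."""
    chi_square = 2.0 * math.log(10.0) * lod
    return 0.5 * math.erfc(math.sqrt(chi_square / 2.0))


if __name__ == "__main__":
    print(max_lod(0, 10))            # lod is 10 * log10(2), about 3.01
    print(asymptotic_p_value(3.0))   # close to 1e-4
    print(asymptotic_p_value(3.3))   # close to 5e-5
```

Ten nonrecombinant meioses are thus the smallest phase-known sample that reaches the traditional threshold of 3, and the P-values returned for lod scores of 3 and 3.3 agree with the figures quoted above.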

Family Structure Designs for Model-Based Linkage Analysis

We address the question of whether there is an optimal sampling design in terms of family structure for linkage analysis. Under the rubric family structure we include not only the size of the sampling unit but also the distribution of phenotypes in the unit.

Clearly, not all families are equal with respect to the information they bring to a linkage analysis. By judiciously choosing the types of families we accept to study, we can dramatically lower or increase the probability of detecting linkage. The objective is simple. We wish to maximize the number of informative meioses at the loci that affect susceptibility. In humans, we are unable to arrange intercrosses or backcrosses at these loci experimentally, except inferentially through a “progeny test”.

It is obvious that certain sampling units, by virtue of their structure, have advantages over other sampling units. For example, the strength of three-generation families is that they provide direct information about linkage phase. Among two-generation families, larger sibships ought to be better than smaller ones, because they bring more information. Except when parents are genetically related or there is linkage disequilibrium, families with a single offspring provide no linkage information. Of course, there are many other factors in addition to family structure that will affect experimental design, not the least of which are the cost of genotyping and, importantly, the availability and expense of ascertaining the sample. For fully penetrant Mendelian traits the choice of families is straightforward. Families segregating simple dominant traits will contain at least one affected parent. Families with two affected parents are less informative. For rare recessive traits, virtually all informative families in the population will contain two unaffected parents. Segregating families containing one parent with the recessive phenotype provide only half the information available from segregating families without an affected parent, while families with both parents affected provide no information.

The classical lod score or maximum likelihood method uses all the data available and is the best method available when the true model is known. Model-free linkage methods may be less powerful, but do not require specification of the mode of trait inheritance.

1.3.1.2 Model-Free Linkage Analysis

Model-free linkage analysis methods detect linkage by testing whether the inheritance pattern deviates from expectation under independent assortment.

Qualitative Traits

Sib-pair methods have long been used as a model-free approach to linkage analysis (Penrose 1935; 1953). Much of the more recent work in assessing power and sample size for complex diseases has focused on the use of affected sib pairs for qualitative traits.

To simplify the discussion, we restrict attention to a single marker for which the identity by descent (IBD) status of siblings is unambiguously determined. Suppose that we have typed marker alleles for a total of N independent sib pairs, all of whom are

affected with a given trait. We wish to determine whether the marker is near a gene that predisposes to the trait. Let $n_i$ denote the number of sib pairs that inherit $i$ marker alleles IBD, $i = 0, 1, 2$, with $n_0 + n_1 + n_2 = N$. Also, let $z_i$ denote the probability that two affected sibs inherit $i$ marker alleles IBD, $i = 0, 1, 2$, with $z = (z_0, z_1, z_2)$, and let $\hat{z} = (\hat{z}_0, \hat{z}_1, \hat{z}_2) = (1/N)(n_0, n_1, n_2)$ represent the relative frequencies observed for the sample of N sib pairs. We want to test the null hypothesis $z = (1/4, 1/2, 1/4)$, using a test that performs well regardless of the true value of z.

There are basically two approaches to determine which statistic is optimal: (1)

derive analytical arguments (usually involving asymptotic theory) or (2) perform

simulation studies. In the simplest case, where the numbers of pairs sharing 0, 1, or 2

alleles IBD can be unambiguously determined (i.e., the parents are typed and fully

informative), several simple tests have been developed to test for linkage between a

marker locus and a disease. They are the “proportion” test statistic (Day and Simons

1976; Suarez 1978), the “mean” statistic (De Vries et al. 1976; Green and Woodrow

1977) and a chi-square goodness-of-fit test statistic (Weitkamp et al. 1981).

Blackwelder and Elston (1985) examined the power of these three different sib-pair linkage tests for several different genetic models and different sampling

schemes, assuming complete linkage to a completely informative marker locus (such as

HLA). They showed that the mean test is usually the most powerful. Schaid and Nick

(1990) showed that using the maximum of the mean and proportion statistics is a method

that is more robust than using either test individually. Knapp et al. (1994a) concluded

that the mean test is the uniformly most powerful test under a fully penetrant recessive

model without sporadic cases and is locally optimal otherwise. Whittemore and Tu

(1998) proposed a general form for the test statistics:

$$T = \frac{\hat{z}_2 + w_1 \hat{z}_1 - E(\hat{z}_2 + w_1 \hat{z}_1)}{\sqrt{V(\hat{z}_2 + w_1 \hat{z}_1)}}, \qquad (1.8)$$

where E and V denote null expected value and variance, respectively. Setting w1 = ½ gives the mean test, while setting w1 = 0 forms the proportion test. They suggested a

“minmax” test in which the weight w1 = 0.275, which is very similar to the test proposed by Feingold and Siegmund (1997) in which w1 = 0.25. On the whole, these tests perform slightly better than the mean test, unless the true genetic model is additive or,

equivalently, one of the two homozygous genotypes at the susceptibility locus is virtually

nonexistent, in which case the mean test is optimal.
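The family of statistics just described can be sketched for the simplest case of unambiguous IBD counts. The code below assumes that the N pairs are independent and that, under the null hypothesis, each pair shares 0, 1, or 2 alleles IBD with probabilities 1/4, 1/2, and 1/4, so that the null mean and variance follow from the multinomial distribution, and it standardizes by the square root of that null variance. The function name and the example counts are hypothetical.

```python
import math


def asp_statistic(n0, n1, n2, w1):
    """Standardized statistic based on z2_hat + w1 * z1_hat.

    w1 = 0.5 gives the mean test, w1 = 0 the proportion test, and
    w1 = 0.275 the "minmax" weight quoted in the text.
    """
    n = n0 + n1 + n2
    z1_hat, z2_hat = n1 / n, n2 / n
    observed = z2_hat + w1 * z1_hat
    # Per-pair score is 1, w1, or 0 with null probabilities 1/4, 1/2, 1/4.
    null_mean = 0.25 + 0.5 * w1
    null_second_moment = 0.25 + 0.5 * w1 ** 2
    null_variance = (null_second_moment - null_mean ** 2) / n
    return (observed - null_mean) / math.sqrt(null_variance)


if __name__ == "__main__":
    counts = (15, 50, 35)  # hypothetical numbers of pairs sharing 0, 1, 2 IBD
    for label, w1 in (("mean", 0.5), ("proportion", 0.0), ("minmax", 0.275)):
        print(label, round(asp_statistic(*counts, w1), 3))
```

With these weights the null variance reduces to 1/(8N) for the mean test and 3/(16N) for the proportion test, and large positive values of the statistic indicate excess sharing.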

Gastwirth and Freidlin (2000) proposed that the maximum efficiency robust test

(MERT) yielded by the theory of efficiency robustness is applicable to a wide class of

genetic models, including those considered by Schaid & Nick (1990), Sham et al. (1997)

and Whittemore & Tu (1998). The approach is based on the correlation matrix of the

optimal test statistics, because the squares of the correlations are their relative asymptotic

efficiencies under each of the possible models. For affected sib pairs, the MERT test is

the robust linear combination of the mean and proportion tests obtained by Whittemore &

Tu (1998). Their results indicated that there can be a substantial loss in power when the

mean test is used in situations where the proportion test is optimal and vice-versa, and the

MERT efficiency robust procedures provide similar protection against loss of power

relative to the optimal test when the model is uncertain.

In many situations, however, marker genotype data are unavailable on one or both

parents of the affected sib pair. For this reason, methods that can use all the available data to make inferences about the IBD status of the affected pairs are desirable. Such methods were proposed in a seminal series of papers by Risch (1990a, 1990b). This method is the likelihood ratio test (also called the “maximum lod score test”), written as

$$\frac{L(\text{data} \mid \hat{z}_0, \hat{z}_1, \hat{z}_2)}{L(\text{data} \mid \tfrac{1}{4}, \tfrac{1}{2}, \tfrac{1}{4})}, \qquad (1.9)$$

where $\hat{z}_0$, $\hat{z}_1$, $\hat{z}_2$ now denote maximum likelihood estimates. However, not all values of the $\hat{z}_i$ are compatible with genetic models. Holmans (1993) showed under specific assumptions that, for possible genetic models, the $\{\hat{z}_i\}$ must satisfy $2\hat{z}_0 \le \hat{z}_1 \le 1/2$, which he called the “genetically possible triangle”. Note that this implies that $\hat{z}_0 \le 0.25$ and that $\hat{z}_2 \ge 0.25$. Holmans (1993) also showed, again under these assumptions, that restricting the estimation of the $\{\hat{z}_i\}$ to satisfy this criterion resulted in a more powerful test of linkage than the test based on unrestricted maximization. This constrained likelihood ratio test (CLR) is the test currently implemented in the computer program package MAPMAKER/SIBS (Kruglyak and Lander 1995). Likelihood-based analysis has been shown to have greater power than the affected sib pair (ASP) mean test to detect linkage, even when the assumed genetic model is sometimes mis-specified (Tiernet et al. 1993; Dürner et al. 1999; Slager et al. 2001). The likelihood-based method shows some advantages because it can be easily extended to the case of multilocus models, that is, to situations in which a single trait is caused by multiple disease loci that may themselves be linked or unlinked (Knapp et al. 1994b; Olson 1997; Cordell et al. 2000;

Elston and Cordell 2001).
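The likelihood ratio in (1.9) and the triangle constraint can be illustrated for the special case in which IBD status is known for every pair. This is only a sketch under that assumption: the likelihood is then multinomial, the unrestricted estimates are the observed proportions, and the statistic is reported on the log10 (lod) scale. The constrained maximization itself is not implemented, and the function names and counts are hypothetical.

```python
import math

NULL_IBD = (0.25, 0.5, 0.25)  # null sharing probabilities for 0, 1, 2 alleles


def in_possible_triangle(z0, z1, tol=1e-12):
    """Check the constraints 2*z0 <= z1 <= 1/2 (with z2 = 1 - z0 - z1)."""
    return 2.0 * z0 <= z1 + tol and z1 <= 0.5 + tol


def unrestricted_max_lod(n0, n1, n2):
    """log10 of the ratio in (1.9) when IBD counts are observed directly."""
    counts = (n0, n1, n2)
    total = sum(counts)
    lod = 0.0
    for count, null_probability in zip(counts, NULL_IBD):
        if count > 0:  # a zero count contributes nothing to the likelihood
            lod += count * math.log10((count / total) / null_probability)
    return lod


if __name__ == "__main__":
    n0, n1, n2 = 15, 50, 35  # hypothetical IBD counts for 100 affected pairs
    total = n0 + n1 + n2
    print(in_possible_triangle(n0 / total, n1 / total))
    print(round(unrestricted_max_lod(n0, n1, n2), 3))
```

When the observed proportions already lie inside the triangle, the restricted and unrestricted estimates coincide; otherwise the restricted estimate lies on the boundary and a separate maximization is needed.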

Haseman-Elston Method and Extensions

Sib pairs with similar trait phenotypes will share with increased probability

marker alleles at loci linked to any locus for that trait. This idea also extends to

quantitative traits. Haseman and Elston (1972) proposed regressing the squared sib-pair

trait phenotype difference on the proportion of alleles the sibs are estimated to share IBD

at a marker locus. Let the ith sib pair have trait values $(X_{i1}, X_{i2})$. Define the squared trait difference as $Y_i^D = (X_{i1} - X_{i2})^2$. Summarize the estimated IBD sharing at the locus for the pair as $\pi_i = \tfrac{1}{2} z_{i1} + z_{i2}$. Conduct a simple linear regression of $Y_i^D$ on an estimate of $\pi_i$. Under the null hypothesis of no linkage, the regression slope is zero. Under the alternative hypothesis that this locus is linked to the trait, the regression slope should be negative, because similar trait values (and thus small squared trait differences) should be associated with high IBD sharing. Therefore, linkage can be tested with a one-sided t test of the regression slope estimate. Under a single major gene model with no dominance variance, $E(Y_i^D \mid \pi_i) = (\sigma_e^2 + 2\sigma_g^2) - 2\sigma_g^2 \pi_i$ if $\pi_i$ is at the trait locus for which the genetic variance is $\sigma_g^2$ and the environmental variance is $\sigma_e^2$. Haseman and Elston pointed out that this result is still asymptotically true for a random sample of sib pairs if the dominance variance is nonzero. If $\pi_i$ is measured at a marker that is not the trait locus, then $E(Y_i^D \mid \pi_i) = \alpha - \beta \sigma_g^2 \pi_i$, where β is a positive constant that depends on the recombination fraction between the marker and the trait locus. Therefore, the t statistic for the slope estimate can be used to test the null hypothesis that $\sigma_g^2 = 0$, with the power of the test decreasing as one moves away from the trait locus. Elston et al. (2000) adopted Drigalenko’s suggestion of using the mean-corrected trait product $Y_i^P = (X_{i1} - \mu)(X_{i2} - \mu)$ as the dependent variable (Drigalenko 1998) and further put it in the context of a generalized regression model to form the revisited Haseman-Elston method.
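The original Haseman-Elston regression can be sketched with ordinary least squares. The code below assumes independent sib pairs and an IBD-sharing estimate already available for each pair; it returns the slope, its standard error, and the t statistic, leaving the one-sided comparison with a t distribution on n − 2 degrees of freedom to the reader. The function name and the toy data are hypothetical.

```python
import math


def haseman_elston(trait_pairs, pi_hat):
    """Regress squared sib-pair trait differences on estimated IBD sharing.

    trait_pairs: list of (x1, x2) trait values, one tuple per sib pair.
    pi_hat: estimated proportion of alleles shared IBD for each pair.
    Requires some residual variation, otherwise the standard error is zero.
    """
    y = [(x1 - x2) ** 2 for x1, x2 in trait_pairs]
    n = len(y)
    mean_pi = sum(pi_hat) / n
    mean_y = sum(y) / n
    sxx = sum((p - mean_pi) ** 2 for p in pi_hat)
    sxy = sum((p - mean_pi) * (v - mean_y) for p, v in zip(pi_hat, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_pi
    residual_ss = sum((v - intercept - slope * p) ** 2
                      for p, v in zip(pi_hat, y))
    standard_error = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, standard_error, slope / standard_error


if __name__ == "__main__":
    pairs = [(2, 0), (1, 0), (0, 0), (2, 0), (1, 0), (0, 0)]  # toy data
    sharing = [0.0, 0.5, 1.0, 0.0, 0.5, 1.0]
    slope, se, t = haseman_elston(pairs, sharing)
    print(round(slope, 3), round(se, 3), round(t, 3))
```

A markedly negative t statistic is evidence for linkage, since pairs sharing more alleles IBD then tend to have smaller squared trait differences.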

Amos and Elston (1989) extended the Haseman-Elston method to other types of relative pairs. Olson and Wijsman (1993) further developed it to consider combining pairs of different types in the same analysis using generalized estimating equations. They applied their method to dependent pairs (i.e., to multiple pairs drawn from larger pedigrees) using generalized estimating equations with a working covariance matrix that assumed independence among the pairs.

1.3.2 The Ascertainment Problem in Linkage Analysis

There are few papers on the subject of ascertainment bias in linkage studies

(Fisher 1935; Edwards 1971; Gershon and Matthysse 1977). However, concern about potential bias seems always to have arisen in connection with schemes that allow for selection of individuals for study on the basis of both trait and marker locus status. When the criteria by which individuals are selected for study make reference to only the trait data, conditioning likelihoods on all the trait data will be sufficient to ensure unbiased parameter estimation, not only of the recombination fraction, but also of parameters of

the trait distribution (such as penetrances, gene frequencies, etc.), regardless of the actual ascertainment procedure (Risch 1984; Greenberg 1989; Clerget-Darpoux and Bonaïti-Pellié 1992; Hodge and Elston 1994). In practice the likelihoods used in both linkage analysis and segregation analysis are always conditioned on observed pedigree structure, and any computer program based on the Elston-Stewart algorithm (1971) will implicitly perform this conditioning. Vieland and Hodge (1996) stated that when pedigrees are ascertained through affected individuals, then under a wide range of circumstances (nonsingle and proband-dependent ascertainment) conditioning the likelihood on observed pedigree structure will produce asymptotically biased estimators of the recombination fraction. Slager and Vieland (1997) presented preliminary work aimed at quantifying the numerical magnitude of the bias introduced by proband-dependent (PD) sampling and nonsingle ascertainment in linkage analysis. They found that, in a limited initial set of simulations, asymptotic bias in the recombination fraction

(θ) appears to be trivial, while PD sampling procedures can increase the efficiency of estimating θ. These preliminary results support the view that the advantages of unsystematic ascertainment may offset any small estimation bias that may arise.

Ginsburg et al. (2004) showed that in a pure linkage analysis (i.e. only parameters related to linkage are estimated), if pedigree ascertainment through probands and the selective inclusion of collected pedigrees are independent of the markers, then no sampling correction is needed to obtain a consistent estimator of the recombination fraction.
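The argument that conditioning on the trait data removes the ascertainment function can be written out in one line. The notation here is an illustrative assumption, not taken from the cited papers: M denotes the marker data, T the trait data, A the event that the pedigree is ascertained, θ the recombination fraction, and φ the trait model parameters. If ascertainment depends on the trait data only, so that P(A | M, T) = P(A | T), then

```latex
P(M \mid T, A; \theta, \phi)
  = \frac{P(A \mid M, T)\, P(M, T; \theta, \phi)}{P(A \mid T)\, P(T; \phi)}
  = \frac{P(M, T; \theta, \phi)}{P(T; \phi)} .
```

The ascertainment probability cancels, so the likelihood conditioned on all the trait data has the same form whatever the sampling scheme, provided marker status plays no part in selection.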

Badner et al. (1998) explored optimal ascertainment strategies to detect linkage to common disease alleles using five model-free programs. They simulated three single- additive-locus models of inheritance and two two-locus models with additive or

multiplicative interactions. In all cases, only the last two generations were assumed to be

genotyped. For single-locus, additive, and multiplicative models, they found that, when

the susceptibility allele was rare (frequency 0.0025), extended pedigrees had more power to detect linkage. However, when the susceptibility allele was common (frequency 0.25), extended pedigrees were no more powerful than nuclear families in the single-locus, additive, and multiplicative two-locus models. There was also a decrease in power when the pedigrees had a greater number of affected individuals, more so for the single-locus and multiplicative models than for the additive model. They concluded that for single-locus, additive, and multiplicative models of qualitative traits with common alleles, there is no benefit to the collection of extended pedigrees, and there may be a loss of power in the collection of pedigrees with many affected individuals. However, the conclusions might be different if genotype information were available for three generations and if model-based methods were used, since more phase information would increase the power to detect linkage.

Palmer et al. (2000) pointed out that under situations of ascertainment or when strong residual familial correlations are present, the new Haseman-Elston test as described in Elston et al. (2000) may in fact be less powerful as a test of linkage than the original Haseman-Elston test. Blangero et al. (2001) compared the sample sizes for the variance component method with random sampling and ascertained sampling. Correction for ascertainment was made by conditioning the likelihood for the sibship on the phenotype of the proband (Hopper and Mathews 1982; Boehnke and Lange 1984). The result showed that over a considerable range of QTL heritability, the sample size required under ascertainment can be three to four times smaller than would be required under a

simple random sampling design, and the size of the sibship has little effect on the relative

increase in power of an ascertained sample over a random sample. Although

ascertainment on a focal phenotype can markedly reduce the number of sibships required

to detect linkage, selective sampling introduces some potential disadvantages that should

not be ignored. First, the appropriate correction for sampling bias can be difficult or

impossible to implement, but is crucial if population parameters such as the QTL effect

size are to be estimated accurately (Comuzzie and Williams 1999). Sampling only

extremely affected individuals can also markedly increase recruitment costs. Finally,

although the ascertained sample may be efficient for linkage analysis of the focal

phenotype, it will not in general be equally efficient for other phenotypes. In fact, the

ascertained sample can be seriously underpowered even for linkage analysis of a trait that

is highly correlated with the focal phenotype.

There is clearly a need to go back to basics and to derive ascertainment-correction approaches that satisfy the basic paradigms of statistical inference from a sample to a

defined population and that can be seen as broadly applicable.

1.4 The Ascertainment Problem in Joint Segregation and Linkage Analysis

1.4.1 Joint Segregation and Linkage Analysis

Segregation and linkage analyses are used to evaluate the role of genes in the

expression of traits. Segregation analysis is used to determine the mode of inheritance

and to estimate trait model parameters such as allele frequency and genotype-specific

means. Model-based linkage analysis is used to estimate the recombination fraction

between the trait locus and one or more marker loci. Traditionally, the trait model

parameters in a model-based linkage analysis are fixed to specific values, obtained either

from a previous study on another data set or from an earlier segregation analysis on the current data set. Clerget-Darpoux et al. (1986) showed that misspecification of the trait model parameters in a linkage analysis leads to biased estimates of the recombination fraction and changes the lod score distribution, in some cases leading to reduced power to detect linkage. Misspecification is possible if parameter estimates are derived from previously published estimates using a different data set. Model-free linkage methods were developed to avoid the problem of a misspecified trait model (Haseman and Elston

1972; Weeks and Lange 1988), but these can be less powerful than the model-based method for detecting linkage (Kruglyak et al. 1996) and they do not provide an estimate of the recombination fraction free of other model parameters.

Estimating both the trait and linkage model parameters from the same data set should be an improvement over using previously published trait model estimates.

However, the analyst still has the option of conducting segregation analysis first followed by linkage analysis (separate analyses), or joint segregation and linkage analysis. In the usual progression of study, segregation analysis is first used to determine the mode of

inheritance at the trait locus and the corresponding penetrances and allele frequencies.

Linkage analysis is then used to localize the trait gene, fixing the trait model parameters to

their maximum-likelihood estimates from the segregation analysis. Several authors showed that increased power to detect linkage derives from joint estimation of the trait and linkage model parameters (Tiret et al. 1992; Martinez et al. 1995; Craig et al. 1996;

Gauderman et al. 1997). Joint segregation and linkage analysis can be used from the start,

providing a unified and more powerful framework for estimating the effects of trait genes

and for finding their locations.

Risch (1984) suggested conditioning on affected individuals when simultaneously

estimating the recombination fractions between marker/disease loci and the parameters at

the disease locus. However, there appears to be no systematic method for deciding on the

data relevant for ascertainment.

Bonney et al. (1988) extended regressive models to include multivariate data and

an arbitrary number of linked marker loci to allow an effective combination of

segregation and linkage analysis to study multifactorial disease. The approach they have

adopted to simultaneously estimate recombination, parameters of the major gene

segregation, and residual correlations is likely to be important for studies of diseases in

which polygenic and environmental sources of familial correlation could be significant.

However, the regressive model has difficulty in modeling data when there are missing

data, and in allowing for ascertainment correction. Guo and Thompson (1992)

introduced a Monte Carlo approach to combined segregation and linkage analysis of a

quantitative trait observed in an extended pedigree. The greatest attraction of this

approach is its ability to handle complex genetic models and large pedigrees.

Blangero (1995), in his review of analyses of a single simulated data set from

Genetic Analysis Workshop 9 (GAW9), indicated that joint analysis is a powerful

strategy for localizing a gene, even when the underlying penetrance model is

misspecified. Asymptotically, joint analysis will produce more efficient estimates than

separate analyses.

Gauderman et al. (1997) investigated the quantitative trait Q2, which was simulated with effects due to a measured environmental factor, a diallelic major gene, and a polygenic component. Their results verified that joint segregation and linkage analysis produces less biased and more efficient estimates of the recombination fraction, and increases power for detecting linkage, compared to separate segregation analysis followed by linkage analysis. The degree of improvement depends on how tightly the trait and marker loci are linked, and whether the data set includes nuclear families or extended pedigrees. In extended pedigrees with complete marker data, joint analysis was

16% more efficient than separate analyses for estimating the recombination fraction between the trait locus and a tightly linked marker, and 6% more efficient for a loosely linked marker. In nuclear families with some missing marker data, joint analysis was 6% more efficient for a tightly linked marker, and no more efficient for a loosely linked marker. These relative efficiencies translated into modest, but consistent, gains in power to detect linkage using joint analysis.

Although their results were based on the analysis of a quantitative trait phenotype, joint segregation and linkage analysis should also be advantageous for a qualitative trait

(e.g., breast cancer, multiple sclerosis, etc.). If pedigrees are sampled at random from the population, as is often the case for a continuous phenotype, there is no need for an ascertainment correction to the likelihood. However, for disease outcomes, pedigrees are often chosen based on the status of one or more of their members (e.g., bilateral breast cancer prior to age 50), so requiring an ascertainment correction to make the resulting trait model parameter estimates generalizable to the population of interest. If primary interest lies in the recombination fraction, investigators may focus on pedigrees that are

heavily loaded with diseased individuals, making proper ascertainment correction

difficult. In this case, estimates of the trait model parameters from prior analyses

(probably from another data set) will be necessary. The potential biases that are

introduced using this strategy have to be considered in light of the increased information

for linkage that comes with heavily loaded families (Gauderman et al. 1997).

Gauderman and Faucett (1997) compared approaches for the analysis of gene-environment (G × E) interaction, using segregation and joint segregation and linkage analyses of a quantitative trait. Several simulation studies indicated that joint segregation and linkage analysis leads to less-biased and more-efficient estimates of a G × E interaction effect, compared with segregation analysis alone. Depending on the heterozygosity of the marker locus and its proximity to the trait locus, they found joint

analysis to be as much as 70% more efficient than segregation analysis for estimation of a

G × E interaction effect. Over a variety of parameter combinations, joint analysis also led to moderate (5%–10%) increases in power to detect interaction. On the basis of these

results, they suggested the use of combined segregation and linkage analysis for improved

estimation of G × E interaction effects when the underlying trait gene is unmeasured.

Their simulation findings were based on the analysis of a quantitative trait and, although it

is likely that the results are generalizable to disease and other qualitative outcomes, additional research is required. If pedigrees are sampled at random from the population, as is often the case when quantitative traits are of primary interest, both trait and linkage

model parameters can be jointly estimated without ascertainment correction. However,

for disease traits, families are typically sampled on the basis of the status of one or more of their members (probands), thus requiring an ascertainment correction if one is to obtain

consistent estimates of the disease allele-frequency and penetrance parameters. Zhao et

al. (1997) have proposed, for disease outcomes, a class of population-based study designs

that facilitate ascertainment correction in the context of joint segregation and linkage

analysis. In some cases, though, when linkage analysis is the primary goal of the study, heavily disease-loaded families are collected, making ascertainment correction impossible

and thus precluding the use of joint analysis.

Martinez et al. (2001) compared two joint likelihood approaches, with complete

(L1) or without (L2) linkage disequilibrium, under different ascertainment schemes, for

the genetic analysis of the disease trait and marker gene 1. Joint likelihoods were

computed without a correction for the selection scheme. For the different sampling

schemes they explored, their results suggested that L1 is a more powerful approach than

L2 to detect major gene and covariate effects as well as to identify accurately gene ×

covariate interaction effects in a common and complex disease such as the GAW 12 MG6

simulated trait.

Joint analysis is computationally more demanding than separate analysis,

especially for extended pedigrees with missing marker data. For example, in the analysis

of these simulated nuclear and extended pedigree data sets, joint analysis required

approximately six times more computation time than separate analysis. The current

availability of fast computers should reduce the importance of computation time in

deciding which type of analysis to perform, allowing the analyst to put more emphasis on

issues of statistical efficiency when deciding on an analytic plan. However, if several

markers will be analyzed, it may be more feasible to use separate analyses on all markers,

followed by joint analysis in promising regions. In the event that computation of the

likelihood is infeasible due to model or pedigree complexity (e.g., analysis of inbred pedigrees), Monte Carlo techniques for joint segregation and linkage analysis can be utilized (Thomas and Cortessis 1992; Guo and Thompson 1992; Faucett et al. 1993;

Gauderman et al. 1995).

1.4.2 The Ascertainment Problem in Joint Segregation and Linkage Analysis

Although several researchers have emphasized the existence of biased estimation in joint segregation and linkage analyses, no ascertainment corrections have been conducted (Bonney et al. 1988; Guo and Thompson 1992; Gauderman et al. 1997;

Martinez et al. 2001).

1.5 Optimum Study Design and Cost Effectiveness

The optimization of genetic study designs is critical in order to detect and locate genes for complex diseases. Power and precision, as two important standards of a study design, are implicitly related to each other by the principle of uncertainty. Overemphasis on either one may result in suboptimal designs. Practical measures such as cost effectiveness must be used to balance power and precision in order to obtain optimality for an actual design. Many factors affect the optimality of a design. They include sampling units and sampling procedure, phenotyping and genotyping (e.g., definition and refinement of phenotypes, quality and density of markers), and the analytic scheme

(segregation and linkage analysis, genomic scanning, etc.). The above process determines

the information content of a study design, which in turn regulates the maximum power a

study can achieve and the best precision the mapping can achieve.

A main aim of a study design is to find a pivot point by which uncertainty

between power and precision can be balanced to reach optimality. Although optimality

can be measured in several ways, such as relative statistical efficiency, cost effectiveness

and statistical power, we will focus on the cost effectiveness of a design as a common

scale for segregation and linkage analysis. Elston (1992) proposed a two-stage strategy as a cost-effective way of designing genomic scans: a relatively sparse marker map is used in the first stage to detect linkage signals, followed by a second stage with a denser marker map around the signals detected in the first stage. Later Elston et al. (1996)

showed that a two-stage procedure could be conducted for half the cost of a one-stage

procedure. Elston’s work on the cost consideration of a two-stage design for a global

genome search using a model-free method of analysis inspired many investigations of the

cost effectiveness of extreme sib-pair methods (Gu et al. 1996; Gu and Rao 1997; Zhao et

al. 1997; Liang et al. 2000). Only simple and unsystematic results regarding the cost of

model-based linkage analysis have been reported so far (Ginsburg and Axenovich 1997;

Goldgar and Easton 1997).

1.6 Weighted Distributions and Two-Phase (Sampling) Designs

The concept of weighted distributions can be traced to the study of the effects of

methods of ascertainment upon the estimation of parameters by Fisher in 1934, and it was

formulated in general terms by Rao (1965).

1.6.1 Weighted Distributions

The role of statistical methodology is to extract the relevant information from a given sample to answer specific questions about the parent population. For this purpose, it is necessary to identify all possible samples that can be observed from a population

(sample space) and to provide a stochastic model for attaching probabilities to different sets of samples. There does not always exist a suitable sampling frame for observing events and applying classical sampling theory. In practice, it is not always possible to observe and record all events that might occur. For example, an event may be observable only with a certain probability depending on the characteristics of the event, such as its conspicuousness and the procedure employed to observe it (unequal probability sampling).

In a classical paper, Fisher (1934) demonstrated the need for an adjustment in specification depending on the way data are ascertained. In extending the basic ideas of Fisher, Rao (1965) introduced the concept of a weighted distribution as a method of ascertainment adjustment applicable to many situations. Let X be a random variable (rv) with p(x, ϕ) as the probability density function (pdf), where ϕ denotes unknown parameters. Suppose that when X = x occurs, the probability of recording it is w(x, α), depending on the observed value x and possibly also on an unknown parameter α. Then the pdf of the recorded rv X^w is

$$p^{w}(x; \varphi, \alpha) = \frac{w(x, \alpha)\, p(x, \varphi)}{E[w(X, \alpha)]}, \tag{1.10}$$

where E[w(X, α)] = ∑ w(x, α) p(x, ϕ), the sum being taken over x. Although w(x, α) is chosen such that 0 ≤ w(x, α) ≤ 1, we can define the above formula for any arbitrary nonnegative weight function w(x, α) for which E[w(X, α)] exists. Weighted distributions have become a useful tool in the selection of appropriate models for observed data drawn without a proper sampling frame. In many situations the model given above is appropriate, and the statistical problems that arise are the determination of a suitable weight function, w(x, α), and drawing inferences on ϕ (Patil 1997).
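As an illustration of (1.10), the following minimal sketch computes a weighted distribution for a discrete variable. It assumes, purely for the example, that X is the number of affected children in a sibship of size s, binomially distributed with parameter p, and that the weight is w(x, α) = 1 − (1 − α)^x, the probability that at least one of x affected children leads to the sibship being recorded. The function names and numerical values are hypothetical.

```python
from math import comb

def binomial_pmf(x, s, p):
    """p(x, phi): probability of x affected children in a sibship of size s."""
    return comb(s, x) * p**x * (1 - p)**(s - x)

def weight(x, alpha):
    """w(x, alpha): assumed probability that a sibship with x affected children is recorded."""
    return 1 - (1 - alpha)**x

def weighted_pmf(s, p, alpha):
    """Return [p^w(0), ..., p^w(s)] following (1.10)."""
    unnormalized = [weight(x, alpha) * binomial_pmf(x, s, p) for x in range(s + 1)]
    expected_weight = sum(unnormalized)  # E[w(X, alpha)]
    return [u / expected_weight for u in unnormalized]

# Hypothetical values chosen only for illustration.
pw = weighted_pmf(s=4, p=0.25, alpha=0.5)
for x, prob in enumerate(pw):
    print(x, round(prob, 4))
```

Under this assumed weight, sibships with no affected children receive probability zero, and the normalizing constant E[w(X, α)] equals 1 − (1 − pα)^s.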

1.6.2 Two-Phase (Sampling) Designs

Here, we give the definition of two-phase (sampling) designs that we shall use in

this dissertation. We suppose the first phase of the design yields through sampling via

probands a list of families and the number of affected and unaffected members in each

family. We envision two strata in the total population: one stratum of families which

reportedly have one affected member (simplex); and another of families which reportedly

have two or more affected members (multiplex). The numbers of families in the sample

with reportedly one affected member and two or more affected members are Ns(s) and

Nm(s), respectively; s in parentheses denotes the number of offspring in a family. In the second phase of the design, a random sample is taken from the stratum of families which

reportedly have one affected member with sampling fraction f1 , and all the families with

two or more affected members are used from the other stratum. Let Li(ϕ; Di) be the

likelihood of the parameter (vector) ϕ for the data Di on the i-th family. Using the

general weighted distribution (Rao 1965), the appropriate likelihood for this dataset is

$$\frac{\displaystyle \prod_{i=1}^{f_1 N_s} f_1 L_i(\varphi; D_i) \prod_{j=1}^{N_m} L_j(\varphi; D_j)}{\displaystyle \prod E(w)}, \tag{1.11}$$

where in the numerator the first product is over all the simplex families retained in the sample and the second is over all the multiplex families, the product in the denominator is over all f1Ns + Nm families, and E(w) is the expected value of a weight function that is f1 for simplex families and 1 for multiplex families. Thus

$$E(w) = f_1\, P(\text{family is simplex}) + P(\text{family is multiplex}). \tag{1.12}$$

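We propose estimating these two probabilities from the sample itself, using the fact that the two products in the numerator are over the simplex and multiplex families, respectively. Thus the probabilities are taken to be proportional to the geometric means of the corresponding family likelihoods:

$$P(\text{family is simplex}) \propto \left[\prod_{i=1}^{f_1 N_s} L_i(\varphi; D_i)\right]^{\frac{1}{f_1 N_s}} \quad \text{and} \quad P(\text{family is multiplex}) \propto \left[\prod_{j=1}^{N_m} L_j(\varphi; D_j)\right]^{\frac{1}{N_m}}.$$

Using these expressions, we can formulate E(w) in the denominator of the appropriate likelihood (1.11). By maximum likelihood estimation (MLE), we can then obtain an asymptotically unbiased estimate of ϕ. With this substitution we obtain the complete likelihood for two-phase sampling designs as follows:

$$
\frac{\displaystyle \prod_{i=1}^{f_1 N_s} f_1 L_i(\varphi; D_i) \prod_{j=1}^{N_m} L_j(\varphi; D_j)}
{\displaystyle \prod \left\{ f_1\, \frac{\left[\prod_{i=1}^{f_1 N_s} L_i(\varphi; D_i)\right]^{\frac{1}{f_1 N_s}}}{\left[\prod_{i=1}^{f_1 N_s} L_i(\varphi; D_i)\right]^{\frac{1}{f_1 N_s}} + \left[\prod_{j=1}^{N_m} L_j(\varphi; D_j)\right]^{\frac{1}{N_m}}} + \frac{\left[\prod_{j=1}^{N_m} L_j(\varphi; D_j)\right]^{\frac{1}{N_m}}}{\left[\prod_{i=1}^{f_1 N_s} L_i(\varphi; D_i)\right]^{\frac{1}{f_1 N_s}} + \left[\prod_{j=1}^{N_m} L_j(\varphi; D_j)\right]^{\frac{1}{N_m}}} \right\}}. \tag{1.13}
$$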
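The structure of (1.13) may be easier to follow in code. The sketch below evaluates the logarithm of (1.13) from per-family likelihood values that are assumed to have been computed already for a given ϕ. The function name, argument layout, and numerical values are hypothetical, and the code is only a transcription of the formula under those assumptions, not the software used in this work.

```python
from math import exp, log

def two_phase_loglik(simplex_liks, multiplex_liks, f1):
    """Log of (1.13), given the family likelihoods L_i (simplex) and L_j (multiplex)."""
    n_s = len(simplex_liks)    # f1 * Ns simplex families retained in the sample
    n_m = len(multiplex_liks)  # Nm multiplex families
    # Logs of the geometric means of the family likelihoods in each stratum.
    log_gs = sum(log(L) for L in simplex_liks) / n_s
    log_gm = sum(log(L) for L in multiplex_liks) / n_m
    gs, gm = exp(log_gs), exp(log_gm)
    # Normalized estimates of P(simplex) and P(multiplex).
    p_simplex = gs / (gs + gm)
    p_multiplex = gm / (gs + gm)
    expected_w = f1 * p_simplex + p_multiplex  # E(w) as in (1.12)
    log_numerator = n_s * log(f1) + n_s * log_gs + n_m * log_gm
    # The denominator is a product of E(w) over all n_s + n_m families.
    return log_numerator - (n_s + n_m) * log(expected_w)

# Hypothetical per-family likelihood values, for illustration only.
simplex = [0.20, 0.25, 0.22]
multiplex = [0.05, 0.08]
print(two_phase_loglik(simplex, multiplex, f1=0.5))
```

When f1 = 1 the weight is 1 for every family, E(w) = 1, and the expression reduces to the ordinary sum of log-likelihoods.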

The multiplex family ascertainment scheme that results in families heavily loaded with affected members, used for many linkage analyses to increase linkage information, must be allowed for; but such a sampling scheme, containing no simplex families, usually results in little useful information for estimating segregation analysis parameters such as allele frequencies and penetrances (Xu et al. 1998). Especially for rarer traits, the information that results from multiplex families alone is not enough to obtain good parameter estimates for some populations, and collecting more multiplex families for segregation analysis is very costly. Aitken et al. (1998) stated that their sample was weighted by self-reported positive-history families in a ratio of 5:1, and that negative-history families were weighted by a factor of five in the segregation analysis in order to estimate the population gene frequency correctly. Such a weighted likelihood has no theoretical basis and may yield incorrect parameter estimates. However, our two-phase designs, with an appropriate likelihood (1.13) for segregation analysis, can solve these problems.

Current linkage analysis models, for both pure linkage analysis and simultaneous joint segregation and linkage analysis, have some limitations. On the one hand, although pure linkage analysis using assumed values of genetic parameters can obviate some theoretical problems, it cannot solve the practical problems, because the assumed parameter values may be wrong and hence lead to wrong estimates of the recombination fraction and the corresponding lod score. On the other hand, genotyping the simplex families to perform simultaneous joint segregation and linkage analysis will increase the cost, because of the genotyping required for linkage analysis. However, our two-phase designs for linkage analysis can avoid these problems. When only multiplex families enter the sample, we will carry out the linkage analysis using the genetic parameters estimated from the likelihood (1.13).

For this general procedure, only a simple example has been examined so far.

Keen and Elston (2001) examined the effects of different two-phase sampling designs on estimation in the classical segregation model. They found by Monte Carlo simulations that the approximation described above for a two-phase design produces results close to those of the exact likelihood function in the case of the classical segregation likelihood model. This has implications for more complex models in which the computation of the exact likelihood is prohibitive, such as for the enhancement of a typical survey sampling plan designed initially for linkage analysis but then used retroactively for a combined segregation and linkage analysis. They suggested that it is difficult to compute the exact likelihood for data from multiplex pedigrees alone, in order to estimate parameters of a segregation and linkage model by joint analysis, except in the case that the sampling plan is one of single ascertainment. However, if a random sample from the stratum of simplex pedigrees is available, a joint segregation and linkage analysis can be carried out on the whole sample using the above approximate formula for a weighted likelihood function.

This may yield better consistent estimates of all the parameters, of both segregation and linkage, than would be feasible without any information from the simplex pedigrees. The appropriate design of studies to be informative for linkage is different from that for segregation analysis. It is not easy to combine linkage analysis with segregation analysis efficiently to obtain unbiased estimates of parameters. Therefore, many problems in this area need to be investigated further.

1.7 Statement of the Problem

This dissertation will present our studies of ascertainment in two-phase

(sampling) designs for segregation analysis and linkage analysis in various genetic models. Chapter II introduces a two-phase design likelihood for segregation analysis of nuclear families. In this respect, we first derive the likelihood for simple segregation analysis in two-phase designs; we then extend it to complex segregation analysis.

Chapter III examines the effects of two-phase designs on segregation analysis of general pedigrees. Simulations are used to determine whether use of the newly developed two-phase likelihood provides consistent estimates of the segregation analysis parameters. Furthermore, the roles of both simplex and multiplex families are evaluated under various genetic models.

Chapter IV presents an investigation of ascertainment in two-phase designs for linkage analysis in general pedigrees. We construct two-phase designs that yield consistent estimates of the recombination fractions for linkage analysis; we also investigate the influence of different two-phase sampling designs on the lod score under various genetic models.

Chapter V presents our study of the cost effectiveness of linkage analysis in two-phase designs. Based on the lod scores obtained in Chapter IV, we first obtain the sample size necessary for reaching a significant level of linkage. Subsequently, we calculate the total cost of family phenotyping and genotyping for various sampling plans. Finally, we optimize the designs for linkage analysis.

Chapter VI gives a summary of our results as well as related discussion. We also make suggestions for further study in this Chapter.

CHAPTER II

TWO-PHASE DESIGN LIKELIHOODS FOR SEGREGATION ANALYSIS OF

NUCLEAR FAMILIES

2.1 The Model and Its Assumptions

In order to simplify the discussion we confine ourselves to data in which there are two types of individuals, unaffected and affected. The probability for an offspring to be affected given mating type is p (the segregation ratio). The number r of affected individuals in a sibship of size s is binomially distributed. Then the assumptions for the incomplete, multiple ascertainment model are the following. For each nuclear family:

1. Ascertainments take place through affected individuals.

2. Conditional on being affected, the ascertainments of the different probands take

place independently.

3. Given the phenotype, the ascertainment probability π is the same for all

affected children, and 0 for all unaffected children.

From these assumptions, the probability that a sibship with s children is ascertained from the sampling frame through c different probands and has r affected children is

$$f(p, \pi; c, r \mid s) = \binom{s}{r} p^{r} (1-p)^{s-r} \binom{r}{c} \pi^{c} (1-\pi)^{r-c}, \tag{2.1}$$

where 0 ≤ c ≤ r ≤ s.
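The three assumptions above can be checked against (2.1) by simulation. In the following minimal sketch, each child is affected independently with probability p and each affected child independently becomes a proband with probability π; the empirical frequencies of the pairs (c, r) are then compared with (2.1). The function names, the seed, and the parameter values are hypothetical choices made for the illustration.

```python
import random
from math import comb

def f_prob(p, pi, c, r, s):
    """f(p, pi; c, r | s) from (2.1)."""
    return (comb(s, r) * p**r * (1 - p)**(s - r)
            * comb(r, c) * pi**c * (1 - pi)**(r - c))

def simulate_sibship(s, p, pi, rng):
    """Draw (c, r): r affected among s children, c probands among the r affected."""
    r = sum(rng.random() < p for _ in range(s))
    c = sum(rng.random() < pi for _ in range(r))
    return c, r

rng = random.Random(0)  # fixed seed so the run is reproducible
s, p, pi, n = 3, 0.25, 0.4, 100_000
counts = {}
for _ in range(n):
    key = simulate_sibship(s, p, pi, rng)
    counts[key] = counts.get(key, 0) + 1

# Largest gap between empirical frequency and (2.1) over all 0 <= c <= r <= s.
max_error = max(abs(counts.get((c, r), 0) / n - f_prob(p, pi, c, r, s))
                for r in range(s + 1) for c in range(r + 1))
print(max_error)
```

Sibships with c = 0 are included here because (2.1) is the untruncated distribution; in practice such sibships are never observed.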

2.2 Simple Segregation Analysis

Given a phenotypic mating consistent with a single genotypic mating, the segregation distribution includes only one segregation parameter, p, that needs to be estimated. This is called simple segregation analysis. We need the conditional probability given that the distributions are truncated, since c ≥1, and equation (2.1) becomes

$$
\begin{aligned}
g(p, \pi; c, r \mid s) &= \frac{\binom{s}{r} p^{r} (1-p)^{s-r} \binom{r}{c} \pi^{c} (1-\pi)^{r-c}}{\sum_{r=1}^{s} \sum_{c=1}^{r} \binom{s}{r} p^{r} (1-p)^{s-r} \binom{r}{c} \pi^{c} (1-\pi)^{r-c}} \\
&= \frac{\binom{s}{r} p^{r} (1-p)^{s-r} \binom{r}{c} \pi^{c} (1-\pi)^{r-c}}{1 - (1 - p\pi)^{s}},
\end{aligned} \tag{2.2}
$$

where 1 ≤ c ≤ r ≤ s.
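The closed-form denominator in (2.2) follows because summing the proband term over c = 1, ..., r gives 1 − (1 − π)^r, after which the binomial theorem gives 1 − (1 − pπ)^s. The short sketch below checks this identity numerically and confirms that g sums to one over 1 ≤ c ≤ r ≤ s. The function names and parameter values are hypothetical.

```python
from math import comb

def joint_prob(p, pi, c, r, s):
    """f(p, pi; c, r | s) from (2.1)."""
    return (comb(s, r) * p**r * (1 - p)**(s - r)
            * comb(r, c) * pi**c * (1 - pi)**(r - c))

def truncated_prob(p, pi, c, r, s):
    """g(p, pi; c, r | s) from (2.2), using the closed-form denominator."""
    return joint_prob(p, pi, c, r, s) / (1 - (1 - p * pi)**s)

s, p, pi = 5, 0.25, 0.4  # hypothetical values
double_sum = sum(joint_prob(p, pi, c, r, s)
                 for r in range(1, s + 1) for c in range(1, r + 1))
closed_form = 1 - (1 - p * pi)**s
total_g = sum(truncated_prob(p, pi, c, r, s)
              for r in range(1, s + 1) for c in range(1, r + 1))
print(double_sum, closed_form, total_g)
```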

It sometimes happens that there is no record of the number of probands (c) in each family. A “reduced” model has been applied for this situation. This model has the likelihood

$$
h(p, \pi; r \mid s) = \frac{\binom{s}{r} p^{r} (1-p)^{s-r} \left[1 - (1-\pi)^{r}\right]}{\sum_{r=1}^{s} \binom{s}{r} p^{r} (1-p)^{s-r} \left[1 - (1-\pi)^{r}\right]} = \frac{\binom{s}{r} p^{r} (1-p)^{s-r} \left[1 - (1-\pi)^{r}\right]}{1 - (1 - p\pi)^{s}}, \tag{2.3}
$$

where 1 ≤ r ≤ s. The omission of the number of probands represents a heavy loss of information as regards the parameter π. This loss of information has serious effects on the estimation of the parameters p and π.
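As a consistency check between (2.2) and (2.3), summing g over the possible numbers of probands c = 1, ..., r should give h for every r. The following sketch verifies this numerically; the function names and parameter values are hypothetical.

```python
from math import comb

def g(p, pi, c, r, s):
    """Full model (2.2): probability of c probands and r affected, given c >= 1."""
    num = (comb(s, r) * p**r * (1 - p)**(s - r)
           * comb(r, c) * pi**c * (1 - pi)**(r - c))
    return num / (1 - (1 - p * pi)**s)

def h(p, pi, r, s):
    """Reduced model (2.3): probability of r affected, number of probands unknown."""
    num = comb(s, r) * p**r * (1 - p)**(s - r) * (1 - (1 - pi)**r)
    return num / (1 - (1 - p * pi)**s)

s, p, pi = 4, 0.25, 0.3  # hypothetical values
max_gap = max(abs(h(p, pi, r, s) - sum(g(p, pi, c, r, s) for c in range(1, r + 1)))
              for r in range(1, s + 1))
print(max_gap)
```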

Here, we derive the likelihood functions for two-phase designs based on (2.2) for simple segregation analysis. The likelihood for a sibship in a random sample drawn from the stratum with exactly one affected member per sibship (simplex) is

$$
L_1(p, \pi; c, r \mid s) = \frac{\binom{s}{1} p^{1} (1-p)^{s-1} \pi}{\sum_{r=1}^{s} \sum_{c=1}^{r} \binom{s}{r} p^{r} (1-p)^{s-r} \binom{r}{c} \pi^{c} (1-\pi)^{r-c}} = \frac{s \pi p (1-p)^{s-1}}{1 - (1 - p\pi)^{s}}, \tag{2.4}
$$

where 1 = c = r ≤ s.

The likelihood for a sibship in a random sample drawn from the stratum with two or more affected members per sibship (multiplex) is

$$
\begin{aligned}
L_2(p, \pi; c, r \mid s) &= \frac{\binom{s}{r} p^{r} (1-p)^{s-r} \binom{r}{c} \pi^{c} (1-\pi)^{r-c}}{\sum_{r=2}^{s} \sum_{c=1}^{r} \binom{s}{r} p^{r} (1-p)^{s-r} \binom{r}{c} \pi^{c} (1-\pi)^{r-c}} \\
&= \frac{\binom{s}{r} p^{r} (1-p)^{s-r} \binom{r}{c} \pi^{c} (1-\pi)^{r-c}}{1 - (1 - p\pi)^{s} - s \pi p (1-p)^{s-1}},
\end{aligned} \tag{2.5}
$$

where 1 ≤ c ≤ r and 2 ≤ r ≤ s.

The numerator of the likelihood for a two-phase design is

$$\prod_{i=1}^{f_1 N_s} f_1 L_1(p, \pi; c, r \mid s) \prod_{j=1}^{f_2 N_m} f_2 L_2(p, \pi; c, r \mid s). \tag{2.6}$$

The denominator of the likelihood for a two-phase design is

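$$
\begin{aligned}
\prod E(w) &= \prod \left[ f_1 P(\text{family is simplex}) + f_2 P(\text{family is multiplex}) \right] \\
&= \prod \left[ f_1 \binom{s}{1} p (1-p)^{s-1} \pi + f_2 \sum_{r=2}^{s} \sum_{c=1}^{r} \binom{s}{r} p^{r} (1-p)^{s-r} \binom{r}{c} \pi^{c} (1-\pi)^{r-c} \right] \\
&= \prod \left[ f_1 s \pi p (1-p)^{s-1} + f_2 \left( 1 - (1 - p\pi)^{s} - s \pi p (1-p)^{s-1} \right) \right].
\end{aligned} \tag{2.7}
$$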
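Formulas (2.4) to (2.7) can be combined into a single log-likelihood, sketched below as a direct transcription of the expressions as written. It is assumed here that f2 denotes the sampling fraction for multiplex families (f2 = 1 when all multiplex families are used, as in Section 1.6.2) and that the data are supplied as a list of sibship sizes for simplex families and a list of (c, r, s) triples for multiplex families; these conventions, the function names, and the numerical values are hypothetical.

```python
from math import comb, log

def simplex_lik(p, pi, s):
    """L1 in (2.4): sibship of size s with exactly one affected child (c = r = 1)."""
    return s * pi * p * (1 - p)**(s - 1) / (1 - (1 - p * pi)**s)

def multiplex_lik(p, pi, c, r, s):
    """L2 in (2.5): sibship with r >= 2 affected children and c probands."""
    num = (comb(s, r) * p**r * (1 - p)**(s - r)
           * comb(r, c) * pi**c * (1 - pi)**(r - c))
    den = 1 - (1 - p * pi)**s - s * pi * p * (1 - p)**(s - 1)
    return num / den

def expected_weight(p, pi, s, f1, f2):
    """E(w) for one sibship of size s: the bracketed term in (2.7)."""
    p_simplex = s * pi * p * (1 - p)**(s - 1)
    p_multiplex = 1 - (1 - p * pi)**s - p_simplex
    return f1 * p_simplex + f2 * p_multiplex

def two_phase_loglik(p, pi, simplex_sizes, multiplex_data, f1, f2=1.0):
    """Log of (2.6) divided by (2.7)."""
    ll = 0.0
    for s in simplex_sizes:
        ll += log(f1 * simplex_lik(p, pi, s)) - log(expected_weight(p, pi, s, f1, f2))
    for c, r, s in multiplex_data:
        ll += log(f2 * multiplex_lik(p, pi, c, r, s)) - log(expected_weight(p, pi, s, f1, f2))
    return ll

# Hypothetical data: two simplex sibships and two multiplex sibships.
ll = two_phase_loglik(0.25, 0.5, [3, 4], [(1, 2, 4), (2, 3, 5)], f1=0.5)
print(ll)
```

Maximizing this function over p and π with any general-purpose optimizer would give the estimates discussed in the text.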

If we do not know the number of probands, we use the reduced model likelihood

(2.3) to form the likelihood for two-phase designs. The likelihood for simplex families is

$$
L'_1 = \frac{\binom{s}{1} p^{1} (1-p)^{s-1} \left[1 - (1-\pi)^{1}\right]}{\sum_{r=1}^{s} \binom{s}{r} p^{r} (1-p)^{s-r} \left[1 - (1-\pi)^{r}\right]} = \frac{s \pi p (1-p)^{s-1}}{1 - (1 - p\pi)^{s}}, \tag{2.8}
$$

where 1 = r ≤ s.

The likelihood for multiplex families is

$$
L'_2 = \frac{\binom{s}{r} p^{r} (1-p)^{s-r} \left[1 - (1-\pi)^{r}\right]}{\sum_{r=2}^{s} \binom{s}{r} p^{r} (1-p)^{s-r} \left[1 - (1-\pi)^{r}\right]} = \frac{\binom{s}{r} p^{r} (1-p)^{s-r} \left[1 - (1-\pi)^{r}\right]}{1 - (1 - p\pi)^{s} - s \pi p (1-p)^{s-1}}, \tag{2.9}
$$

where 2 ≤ r ≤ s.

The numerator of the likelihood for a two-phase design is then

$$\prod_{i=1}^{f_1 N_s} f_1 L'_1(p, \pi; r \mid s) \prod_{j=1}^{f_2 N_m} f_2 L'_2(p, \pi; r \mid s), \tag{2.10}$$

and the denominator is

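$$
\begin{aligned}
\prod E(w) &= \prod \left[ f_1 P(\text{family is simplex}) + f_2 P(\text{family is multiplex}) \right] \\
&= \prod \left[ f_1 \binom{s}{1} p (1-p)^{s-1} \left[1 - (1-\pi)^{1}\right] + f_2 \sum_{r=2}^{s} \binom{s}{r} p^{r} (1-p)^{s-r} \left[1 - (1-\pi)^{r}\right] \right] \\
&= \prod \left[ f_1 s \pi p (1-p)^{s-1} + f_2 \left( 1 - (1 - p\pi)^{s} - s \pi p (1-p)^{s-1} \right) \right].
\end{aligned} \tag{2.11}
$$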

2.3 Complex Segregation Analysis

Methods of segregation analysis that take into account different genotypic mating types within a given phenotypic mating type are called complex segregation analysis.

Suppose that for a given model there are m distinct segregation patterns. Let φt be the expected proportion of families in the population with the t-th segregation pattern. We have

$$\sum_{t=1}^{m} \phi_t = 1. \tag{2.12}$$

Let π be the ascertainment probability, assumed to be constant and the same for each genotypic mating type, and let φt be the probability that a family of size s has the t-th segregation pattern. For complex segregation analysis, the conditional probability k, given that the distributions are truncated since c ≥ 1, is, based on (2.2),

$$
k(p, \pi; c, r, t \mid s) = \frac{\sum_{t=1}^{m} \phi_t \binom{s}{r} p_t^{r} (1-p_t)^{s-r} \binom{r}{c} \pi^{c} (1-\pi)^{r-c}}{\sum_{t=1}^{m} \phi_t \sum_{r=1}^{s} \sum_{c=1}^{r} \binom{s}{r} p_t^{r} (1-p_t)^{s-r} \binom{r}{c} \pi^{c} (1-\pi)^{r-c}}, \tag{2.13}
$$

where 1 ≤ t ≤ m and 1 ≤ c ≤ r ≤ s.

If we do not know the number of probands, based on the reduced model likelihood (2.3) we obtain the conditional probability, given that the distributions are truncated since r ≥ 1, as

$$
\begin{aligned}
l(p, \pi; r, t \mid s) &= \frac{\sum_{t=1}^{m} \phi_t \binom{s}{r} p_t^{r} (1-p_t)^{s-r} \left[1 - (1-\pi)^{r}\right]}{\sum_{t=1}^{m} \phi_t \sum_{r=1}^{s} \binom{s}{r} p_t^{r} (1-p_t)^{s-r} \left[1 - (1-\pi)^{r}\right]} \\
&= \frac{\binom{s}{r} \left[1 - (1-\pi)^{r}\right] \sum_{t=1}^{m} \phi_t\, p_t^{r} (1-p_t)^{s-r}}{1 - \sum_{t=1}^{m} \phi_t (1 - p_t \pi)^{s}},
\end{aligned} \tag{2.14}
$$

where 1 ≤ t ≤ m and 1 ≤ r ≤ s.

Here, we derive the likelihood for a two-phase design in the case of complex segregation analysis. Based on (2.13), we obtain the likelihood for simplex families as follows

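$$
\begin{aligned}
L''_1(p, \pi; c, r, t \mid s) &= \frac{\sum_{t=1}^{m} \phi_t \binom{s}{1} p_t^{1} (1-p_t)^{s-1} \pi}{\sum_{t=1}^{m} \phi_t \sum_{r=1}^{s} \sum_{c=1}^{r} \binom{s}{r} p_t^{r} (1-p_t)^{s-r} \binom{r}{c} \pi^{c} (1-\pi)^{r-c}} \\
&= \frac{s \pi \sum_{t=1}^{m} \phi_t\, p_t (1-p_t)^{s-1}}{1 - \sum_{t=1}^{m} \phi_t (1 - p_t \pi)^{s}},
\end{aligned} \tag{2.15}
$$

where 1 ≤ t ≤ m and 1 = c = r ≤ s.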

The likelihood for multiplex families is

$$
\begin{aligned}
L''_2(p, \pi; c, r, t \mid s) &= \frac{\sum_{t=1}^{m} \phi_t \binom{s}{r} p_t^{r} (1-p_t)^{s-r} \binom{r}{c} \pi^{c} (1-\pi)^{r-c}}{\sum_{t=1}^{m} \phi_t \sum_{r=2}^{s} \sum_{c=1}^{r} \binom{s}{r} p_t^{r} (1-p_t)^{s-r} \binom{r}{c} \pi^{c} (1-\pi)^{r-c}} \\
&= \frac{\binom{s}{r} \binom{r}{c} \pi^{c} (1-\pi)^{r-c} \sum_{t=1}^{m} \phi_t\, p_t^{r} (1-p_t)^{s-r}}{1 - \sum_{t=1}^{m} \phi_t \left[ (1 - p_t \pi)^{s} + s \pi p_t (1-p_t)^{s-1} \right]},
\end{aligned} \tag{2.16}
$$

where 1 ≤ t ≤ m, 1 ≤ c ≤ r and 2 ≤ r ≤ s.

The numerator of the likelihood for a two-phase design is then

$$\prod_{i=1}^{f_1 N_s} f_1 L''_1(p, \pi; c, r, t \mid s) \prod_{j=1}^{f_2 N_m} f_2 L''_2(p, \pi; c, r, t \mid s), \tag{2.17}$$

and the denominator of the likelihood for a two-phase design is

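$$
\begin{aligned}
\prod E(w) &= \prod \left[ f_1 P(\text{family is simplex}) + f_2 P(\text{family is multiplex}) \right] \\
&= \prod \left[ f_1 \sum_{t=1}^{m} \phi_t \binom{s}{1} p_t^{1} (1-p_t)^{s-1} \pi + f_2 \sum_{t=1}^{m} \phi_t \sum_{r=2}^{s} \sum_{c=1}^{r} \binom{s}{r} p_t^{r} (1-p_t)^{s-r} \binom{r}{c} \pi^{c} (1-\pi)^{r-c} \right] \\
&= \prod \left[ f_1 s \pi \sum_{t=1}^{m} \phi_t\, p_t (1-p_t)^{s-1} + f_2 \left( 1 - \sum_{t=1}^{m} \phi_t \left[ (1 - p_t \pi)^{s} + s \pi p_t (1-p_t)^{s-1} \right] \right) \right].
\end{aligned} \tag{2.18}
$$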

If the number of probands is not available, we derive the likelihood functions based on the reduced model likelihood (2.14).

The likelihood for simplex families is

$$
L'''_1(p, \pi; r, t \mid s) = \frac{\binom{s}{1} \left[1 - (1-\pi)^{1}\right] \sum_{t=1}^{m} \phi_t\, p_t^{1} (1-p_t)^{s-1}}{1 - \sum_{t=1}^{m} \phi_t (1 - p_t \pi)^{s}} = \frac{s \pi \sum_{t=1}^{m} \phi_t\, p_t (1-p_t)^{s-1}}{1 - \sum_{t=1}^{m} \phi_t (1 - p_t \pi)^{s}}, \tag{2.19}
$$

where 1 ≤ t ≤ m and 1 = r ≤ s.

The likelihood for multiplex families is

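$$
\begin{aligned}
L'''_2(p, \pi; r, t \mid s) &= \frac{\sum_{t=1}^{m} \phi_t \binom{s}{r} p_t^{r} (1-p_t)^{s-r} \left[1 - (1-\pi)^{r}\right]}{\sum_{t=1}^{m} \phi_t \sum_{r=2}^{s} \binom{s}{r} p_t^{r} (1-p_t)^{s-r} \left[1 - (1-\pi)^{r}\right]} \\
&= \frac{\binom{s}{r} \left[1 - (1-\pi)^{r}\right] \sum_{t=1}^{m} \phi_t\, p_t^{r} (1-p_t)^{s-r}}{1 - \sum_{t=1}^{m} \phi_t \left[ (1 - p_t \pi)^{s} + s \pi p_t (1-p_t)^{s-1} \right]},
\end{aligned} \tag{2.20}
$$

where 1 ≤ t ≤ m and 2 ≤ r ≤ s.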

Then the numerator of the likelihood for a two-phase design is

$$\prod_{i=1}^{f_1 N_s} f_1 L'''_1(p, \pi; r, t \mid s) \prod_{j=1}^{f_2 N_m} f_2 L'''_2(p, \pi; r, t \mid s), \tag{2.21}$$

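and the denominator of the likelihood for a two-phase design is

$$
\begin{aligned}
\prod E(w) &= \prod \left[ f_1 P(\text{family is simplex}) + f_2 P(\text{family is multiplex}) \right] \\
&= \prod \left[ f_1 \sum_{t=1}^{m} \phi_t \binom{s}{1} p_t^{1} (1-p_t)^{s-1} \left[1 - (1-\pi)^{1}\right] + f_2 \sum_{t=1}^{m} \phi_t \sum_{r=2}^{s} \binom{s}{r} p_t^{r} (1-p_t)^{s-r} \left[1 - (1-\pi)^{r}\right] \right] \\
&= \prod \left[ f_1 s \pi \sum_{t=1}^{m} \phi_t\, p_t (1-p_t)^{s-1} + f_2 \left( 1 - \sum_{t=1}^{m} \phi_t \left[ (1 - p_t \pi)^{s} + s \pi p_t (1-p_t)^{s-1} \right] \right) \right].
\end{aligned} \tag{2.22}
$$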

Based on the likelihood for a two-phase design, by MLE we can obtain consistent estimates of the segregation ratio (p) and its variance from a set of collected family data.
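The mixture forms used when the number of probands is unknown, (2.19), (2.20), and the bracketed term of (2.22), can be transcribed as follows. The sketch represents the m segregation patterns by two equal-length lists, phi (the proportions φt) and p (the segregation ratios pt); this data layout, the function names, and the numerical values are hypothetical.

```python
from math import comb

def complex_simplex_lik(phi, p, pi, s):
    """L'''_1 in (2.19): one affected child, number of probands unknown."""
    num = s * pi * sum(ph * pt * (1 - pt)**(s - 1) for ph, pt in zip(phi, p))
    den = 1 - sum(ph * (1 - pt * pi)**s for ph, pt in zip(phi, p))
    return num / den

def complex_multiplex_lik(phi, p, pi, r, s):
    """L'''_2 in (2.20): r >= 2 affected children, number of probands unknown."""
    num = (comb(s, r) * (1 - (1 - pi)**r)
           * sum(ph * pt**r * (1 - pt)**(s - r) for ph, pt in zip(phi, p)))
    den = 1 - sum(ph * ((1 - pt * pi)**s + s * pi * pt * (1 - pt)**(s - 1))
                  for ph, pt in zip(phi, p))
    return num / den

def complex_expected_weight(phi, p, pi, s, f1, f2):
    """E(w) for one family of size s: the bracketed term in (2.22)."""
    p_simplex = s * pi * sum(ph * pt * (1 - pt)**(s - 1) for ph, pt in zip(phi, p))
    p_multiplex = 1 - sum(ph * ((1 - pt * pi)**s + s * pi * pt * (1 - pt)**(s - 1))
                          for ph, pt in zip(phi, p))
    return f1 * p_simplex + f2 * p_multiplex

# Hypothetical two-pattern mixture.
phi, p, pi, s = [0.3, 0.7], [0.5, 0.25], 0.4, 4
total = sum(complex_multiplex_lik(phi, p, pi, r, s) for r in range(2, s + 1))
print(total, complex_simplex_lik(phi, p, pi, s),
      complex_expected_weight(phi, p, pi, s, f1=0.5, f2=1.0))
```

With a single segregation pattern (m = 1, φ1 = 1) these functions reduce to (2.8), (2.9), and the bracketed term of (2.11).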

2.3.1 Estimation of Allele Frequencies from Segregation Analysis for a Recessive

Model with Incomplete Heterozygote Penetrance

When selecting families, we usually observe the phenotypes of the parents. We can distinguish three phenotypic mating types: (a) Affected × Affected, (b) Unaffected ×

Affected and (c) Unaffected × Unaffected. Under a recessive model, we assume the penetrances P(affected | AA) = 0, P(affected | Aa) = 0 and P(affected | aa) = g. We estimate the incomplete penetrance (g) and allele frequency (q) in a complex segregation analysis.

Table 2.1 shows segregation models for recessive inheritance with incomplete penetrance. For the parental phenotypic mating unaffected × affected,

$$\phi_1 = \frac{(1-q)(1-g)}{2q+(1-q)(1-g)}, \quad \phi_2 = \frac{2q}{2q+(1-q)(1-g)}, \quad p_1 = g \quad \text{and} \quad p_2 = \tfrac{1}{2}g.$$

We substitute these expressions into formulas (2.21) and (2.22), and obtain the estimates of the allele frequency (q) and penetrance (g). For the parental phenotypic mating unaffected × unaffected,

$$\phi_1 = \frac{(1-q)^2(1-g)^2}{[2q+(1-q)(1-g)]^2}, \quad \phi_2 = \frac{4q(1-q)(1-g)}{[2q+(1-q)(1-g)]^2}, \quad \phi_3 = \frac{4q^2}{[2q+(1-q)(1-g)]^2},$$

$$p_1 = g, \quad p_2 = \tfrac{1}{2}g \quad \text{and} \quad p_3 = \tfrac{1}{4}g.$$

Placing these into formulas (2.21) and (2.22), we obtain consistent estimates of q and g.

Table 2.1 Segregation Models for Recessive Inheritance with Incomplete Heterozygote Penetrance

| Parental mating | Expected proportion in the population | Expected proportion in a given parental mating type ($\phi_t$) | Segregation ratio ($p_t$) | Segregation distribution |
|---|---|---|---|---|
| Affected × Unaffected | | | | |
| aa × aa | $2(1-q)^4 g(1-g)$ | $\frac{(1-q)(1-g)}{2q+(1-q)(1-g)}$ | $g$ | $\binom{s}{r} g^r (1-g)^{s-r}$ |
| aa × Aa | $4q(1-q)^3 g$ | $\frac{2q}{2q+(1-q)(1-g)}$ | $\tfrac{1}{2}g$ | $\binom{s}{r} (\tfrac{1}{2}g)^r (1-\tfrac{1}{2}g)^{s-r}$ |
| Total | $2(1-q)^3 g[2q+(1-q)(1-g)]$ | 1 | | |
| Unaffected × Unaffected | | | | |
| aa × aa | $(1-q)^4 (1-g)^2$ | $\frac{(1-q)^2(1-g)^2}{[2q+(1-q)(1-g)]^2}$ | $g$ | $\binom{s}{r} g^r (1-g)^{s-r}$ |
| aa × Aa | $4q(1-q)^3 (1-g)$ | $\frac{4q(1-q)(1-g)}{[2q+(1-q)(1-g)]^2}$ | $\tfrac{1}{2}g$ | $\binom{s}{r} (\tfrac{1}{2}g)^r (1-\tfrac{1}{2}g)^{s-r}$ |
| Aa × Aa | $4q^2(1-q)^2$ | $\frac{4q^2}{[2q+(1-q)(1-g)]^2}$ | $\tfrac{1}{4}g$ | $\binom{s}{r} (\tfrac{1}{4}g)^r (1-\tfrac{1}{4}g)^{s-r}$ |
| Total | $(1-q)^2[2q+(1-q)(1-g)]^2$ | 1 | | |

Note: The expected proportions are those of mating types producing affected children.
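As an illustration of how the recessive-model quantities enter the likelihood, the following Python sketch evaluates the mating-type proportions φ_t and segregation ratios p_t for given q and g. The function name and the string labels for the parental phenotypes are hypothetical, introduced here only for the example.

```python
def recessive_mating_types(q, g, parents):
    """Return (phi, p) for the recessive model with penetrance g for genotype aa.

    q is the allele frequency used in the text, so the frequency of allele a
    is 1 - q. `parents` is a hypothetical label for the parental phenotypes.
    """
    d = 2 * q + (1 - q) * (1 - g)  # common denominator term
    if parents == "affected x unaffected":
        phi = [(1 - q) * (1 - g) / d, 2 * q / d]            # aa x aa, aa x Aa
        p = [g, g / 2]
    elif parents == "unaffected x unaffected":
        phi = [
            ((1 - q) * (1 - g)) ** 2 / d ** 2,              # aa x aa
            4 * q * (1 - q) * (1 - g) / d ** 2,             # aa x Aa
            4 * q ** 2 / d ** 2,                            # Aa x Aa
        ]
        p = [g, g / 2, g / 4]
    else:
        raise ValueError("unknown parental mating type")
    return phi, p

phi, p = recessive_mating_types(q=0.2, g=0.8, parents="unaffected x unaffected")
print(phi, p)
```

Within each parental phenotypic mating type the proportions φ_t sum to 1, which is the condition needed for the closed form of the denominator (2.22).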

2.3.2 Estimation of Allele Frequencies from Segregation Analysis for a Dominant Model with Incomplete Heterozygote Penetrance

Under a dominant model, we assume the penetrances P(affected | AA) = 1, P(affected | Aa) = f and P(affected | aa) = 0. We estimate the incomplete penetrance (f) and allele frequency (q) in complex segregation analysis.

Table 2.2 shows segregation models for dominant inheritance with incomplete penetrance. In the parental phenotypic mating unaffected × affected,

$$\phi_1 = \frac{4q^3(1-q)(1-f)}{T_1}, \quad \phi_2 = \frac{2q^2(1-q)^2}{T_1}, \quad \phi_3 = \frac{4q^2(1-q)^2 f(1-f)}{T_1}, \quad \phi_4 = \frac{4q(1-q)^3 f}{T_1},$$

$$p_1 = \tfrac{1}{2}(1+f), \quad p_2 = f, \quad p_3 = \tfrac{1}{4}(1+2f) \quad \text{and} \quad p_4 = \tfrac{1}{2}f,$$

where

$$T_1 = 4q^3(1-q)(1-f) + 2q^2(1-q)^2 + 4q^2(1-q)^2 f(1-f) + 4q(1-q)^3 f.$$

We substitute these into formulas (2.21) and (2.22), and obtain the estimates of the allele frequency (q) and penetrance (f). For the parental phenotypic mating unaffected × unaffected,

$$\phi_1 = \frac{4q^2(1-q)^2(1-f)^2}{T_2}, \quad \phi_2 = \frac{4q(1-q)^3(1-f)}{T_2}, \quad p_1 = \tfrac{1}{4}(1+2f) \quad \text{and} \quad p_2 = \tfrac{1}{2}f,$$

where

$$T_2 = 4q^2(1-q)^2(1-f)^2 + 4q(1-q)^3(1-f).$$

Placing these into formulas (2.21) and (2.22), we obtain consistent estimates of q and f.

Table 2.2 Segregation Models for Dominant Inheritance with Incomplete Heterozygote Penetrance

| Parental mating | Expected proportion in the population | Expected proportion in a given parental mating type ($\phi_t$) | Segregation ratio ($p_t$) | Segregation distribution |
|---|---|---|---|---|
| Affected × Unaffected | | | | |
| AA × Aa | $4q^3(1-q)(1-f)$ | $\frac{4q^3(1-q)(1-f)}{T_1}$ | $\tfrac{1}{2}(1+f)$ | $\binom{s}{r} [\tfrac{1}{2}(1+f)]^r [1-\tfrac{1}{2}(1+f)]^{s-r}$ |
| AA × aa | $2q^2(1-q)^2$ | $\frac{2q^2(1-q)^2}{T_1}$ | $f$ | $\binom{s}{r} f^r (1-f)^{s-r}$ |
| Aa × Aa | $4q^2(1-q)^2 f(1-f)$ | $\frac{4q^2(1-q)^2 f(1-f)}{T_1}$ | $\tfrac{1}{4}(1+2f)$ | $\binom{s}{r} [\tfrac{1}{4}(1+2f)]^r [1-\tfrac{1}{4}(1+2f)]^{s-r}$ |
| Aa × aa | $4q(1-q)^3 f$ | $\frac{4q(1-q)^3 f}{T_1}$ | $\tfrac{1}{2}f$ | $\binom{s}{r} (\tfrac{1}{2}f)^r (1-\tfrac{1}{2}f)^{s-r}$ |
| Total | $T_1$ | 1 | | |
| Unaffected × Unaffected | | | | |
| Aa × Aa | $4q^2(1-q)^2(1-f)^2$ | $\frac{4q^2(1-q)^2(1-f)^2}{T_2}$ | $\tfrac{1}{4}(1+2f)$ | $\binom{s}{r} [\tfrac{1}{4}(1+2f)]^r [1-\tfrac{1}{4}(1+2f)]^{s-r}$ |
| Aa × aa | $4q(1-q)^3(1-f)$ | $\frac{4q(1-q)^3(1-f)}{T_2}$ | $\tfrac{1}{2}f$ | $\binom{s}{r} (\tfrac{1}{2}f)^r (1-\tfrac{1}{2}f)^{s-r}$ |
| Total | $T_2$ | 1 | | |

Note: The expected proportions are those of mating types producing affected children.

$T_1 = 4q^3(1-q)(1-f) + 2q^2(1-q)^2 + 4q^2(1-q)^2 f(1-f) + 4q(1-q)^3 f$

$T_2 = 4q^2(1-q)^2(1-f)^2 + 4q(1-q)^3(1-f)$
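The dominant-model quantities of Table 2.2 can be evaluated in the same way. The Python sketch below is a hypothetical helper, not the program used in this work; it computes the population proportions as printed in the table, normalizes them by T1 or T2, and returns the segregation ratios. It assumes f < 1 for unaffected × unaffected parents, since T2 = 0 when f = 1.

```python
def dominant_mating_types(q, f, parents):
    """Return (phi, p, total) for the dominant model with heterozygote penetrance f.

    `total` is T1 or T2 from Table 2.2. `parents` is a hypothetical label.
    """
    a = 1 - q  # frequency of allele a
    if parents == "affected x unaffected":
        weights = [
            4 * q ** 3 * a * (1 - f),            # AA x Aa
            2 * q ** 2 * a ** 2,                 # AA x aa
            4 * q ** 2 * a ** 2 * f * (1 - f),   # Aa x Aa
            4 * q * a ** 3 * f,                  # Aa x aa
        ]
        p = [(1 + f) / 2, f, (1 + 2 * f) / 4, f / 2]
    elif parents == "unaffected x unaffected":
        weights = [
            4 * q ** 2 * a ** 2 * (1 - f) ** 2,  # Aa x Aa
            4 * q * a ** 3 * (1 - f),            # Aa x aa
        ]
        p = [(1 + 2 * f) / 4, f / 2]
    else:
        raise ValueError("unknown parental mating type")
    total = sum(weights)
    phi = [w / total for w in weights]
    return phi, p, total

phi, p, t1 = dominant_mating_types(q=0.1, f=0.8, parents="affected x unaffected")
print(phi, p, t1)
```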

2.4 Summary

In this chapter, we introduced the likelihoods for two-phase sampling designs, as defined in Chapter I, for segregation analysis of sibships and nuclear families. First, we derived the likelihoods for simple segregation analysis with known probands and unknown probands, respectively. Then, we extended this simple classical model to complex segregation analysis in two-phase designs. A method of parameter estimation for segregation analysis was proposed. These results provide a basis for the segregation analysis of general pedigrees in two-phase designs in the next two chapters.

CHAPTER III

TWO-PHASE SAMPLING DESIGNS FOR SEGREGATION ANALYSIS IN PEDIGREES

3.1 Introduction

In Chapter II, we derived how to estimate the parameters of segregation analysis in two-phase designs for nuclear families. Since large pedigrees contain more information about the genetic parameters of interest than small nuclear families (Go et al. 1978), it is important to have the corresponding designs available for large pedigrees. The general method of pedigree analysis introduced by Elston and Stewart (1971) has been applied to the study of a variety of situations (Elston and Yelverton 1975; Lange and Elston 1975), allowing the analysis of multigenerational pedigree data. Elston and Sobel (1979) established the theory of the proband sampling frame (PSF) and built the basis for ascertainment sampling correction for extended pedigrees. Based on the theory of the PSF, Ginsburg et al. (2003) showed that the ascertainment problem in pedigree analysis is tractable. Also, sequential sampling provides a flexible design for segregation and linkage analysis (Boehnke et al. 1988; Kramer et al. 1989). Are there other, better sampling designs?

In this chapter, in order to derive appropriate sampling designs yielding consistent estimates, we simulate samples consisting of three-generation pedigrees and conduct segregation analyses of these pedigrees in two-phase designs. We attempt to achieve two objectives. The first is to assess the effects of different two-phase sampling designs on the parameter estimates from segregation analysis; the second is to determine the proportions of simplex and multiplex pedigrees that yield better, consistent estimates of the parameters of the trait locus under different genetic models. At the same time we wish to determine how quickly (in terms of sample size) the likelihoods developed for extended pedigrees in Chapter II yield good estimates.

3.2 Simulation Procedure

We simulated data for each family from a single pair of founders. Each simulation replicate was started by using a random-number generator to assign genotypes at a disease trait locus to all founders (i.e., those individuals with only descendants), using the input allele frequencies and assuming Hardy-Weinberg proportions. The genotypes of all other individuals in the pedigree were assigned according to Mendelian segregation of the alleles from the founders through subsequent generations, again using a random-number generator. Phenotypes were assigned according to the model penetrance functions, each defined as the probability that an individual of a given genotype exhibits the phenotype. The disease is dichotomous (affected/unaffected). Each sample is composed of three-generation families with a fixed family size of 11. The pedigree structure is shown in Figure 3.1.
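The simulation steps described above can be sketched in Python as follows. This is a minimal illustration, not the program used in this study: the 11-member layout (two grandparents, three children, two married-in spouses and four grandchildren) is a hypothetical stand-in for the structure in Figure 3.1, and all function and member names are assumptions made for the example. Genotypes are coded as the number of A alleles, and `penetrance` is indexed by that count, i.e. (f_aa, f_Aa, f_AA).

```python
import random

def founder_genotype(q, rng):
    """Number of A alleles (0, 1 or 2) under Hardy-Weinberg proportions."""
    return int(rng.random() < q) + int(rng.random() < q)

def child_genotype(father, mother, rng):
    """Mendelian segregation: a parent with k copies of A transmits A with probability k/2."""
    return int(rng.random() < father / 2) + int(rng.random() < mother / 2)

def simulate_pedigree(q, penetrance, rng):
    """Simulate genotypes and affection status for a hypothetical 11-member pedigree."""
    geno = {"gf": founder_genotype(q, rng), "gm": founder_genotype(q, rng)}
    for child in ("c1", "c2", "c3"):
        geno[child] = child_genotype(geno["gf"], geno["gm"], rng)
    # Two of the three children marry; each spouse is a founder drawn from the population.
    for spouse, partner, kids in (("s1", "c1", ("g1", "g2")), ("s2", "c2", ("g3", "g4"))):
        geno[spouse] = founder_genotype(q, rng)
        for kid in kids:
            geno[kid] = child_genotype(geno[partner], geno[spouse], rng)
    # Phenotype: affected with probability equal to the penetrance of the genotype.
    affected = {who: rng.random() < penetrance[g] for who, g in geno.items()}
    return geno, affected

rng = random.Random(2005)  # fixed seed so the example is reproducible
geno, affected = simulate_pedigree(q=0.1, penetrance=(0.0, 0.8, 0.8), rng=rng)
n_affected_offspring = sum(affected[k] for k in geno if k not in ("gf", "gm", "s1", "s2"))
print(geno, n_affected_offspring)
```

A pedigree could then be classified by its number of affected offspring (none, one, or at least two), which is the information needed to sort simulated pedigrees into the three groups counted in Table 3.2.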

We simulated pedigrees until we had 50 or 100 multiplex families for each situation; the average numbers of pedigrees simulated are given in Table 3.2.


Figure 3.1 The Simulated Pedigree Structure

Table 3.1 Genetic Parameters for Simulation of Pedigrees

| Model | q | f_AA | f_Aa | f_aa |
|---|---|---|---|---|
| Dominant Model: | | | | |
| Fully Penetrant Model | 0.1 | 1 | 1 | 0 |
| Incomplete Penetrance Model I | 0.1 | 0.8 | 0.8 | 0 |
| Incomplete Penetrance Model II | 0.1 | 0.5 | 0.5 | 0 |
| Recessive Model: | | | | |
| Fully Penetrant Model | 0.2 | 1 | 0 | 0 |
| Incomplete Penetrance Model I | 0.2 | 0.8 | 0 | 0 |
| Incomplete Penetrance Model II | 0.2 | 0.5 | 0 | 0 |

Note: q = Allele Frequency; f = Penetrance.

Table 3.2 Sample Pedigrees Simulated in Order to Produce Nm = 50 and 100, Respectively

| Model | q | f | N (Nm, Ns, No) for Nm = 50 | N (Nm, Ns, No) for Nm = 100 |
|---|---|---|---|---|
| Dominant Model: | | | | |
| Fully Penetrant Model | 0.1 | 1 | 108 (50, 6, 52) | 216 (100, 12, 104) |
| Incomplete Penetrance Model I | 0.1 | 0.8 | 128 (50, 13, 65) | 257 (100, 26, 131) |
| Incomplete Penetrance Model II | 0.1 | 0.5 | 204 (50, 36, 118) | 408 (100, 72, 236) |
| Recessive Model: | | | | |
| Fully Penetrant Model | 0.2 | 1 | 620 (50, 89, 481) | 1241 (100, 178, 963) |
| Incomplete Penetrance Model I | 0.2 | 0.8 | 833 (50, 113, 670) | 1667 (100, 227, 1340) |
| Incomplete Penetrance Model II | 0.2 | 0.5 | 1725 (50, 178, 1497) | 3450 (100, 356, 2994) |

Note: N = Total Sample Size Simulated; Nm = Multiplex pedigrees; Ns = Simplex pedigrees; No = Pedigrees without affected offspring; q = Allele Frequency; f = Penetrance.

3.3 Two-Phase (Sampling) Designs

The total samples simulated consist of 50 or 100 multiplex pedigrees, together with simplex pedigrees and pedigrees without affected offspring. In the first phase, we collected the multiplex and simplex pedigrees. Then, for the second phase, single ascertainment was carried out by selecting families based on the proportion of affected offspring. Each two-phase design consisted of a combination of 50 or 100 multiplex (M) pedigrees and a different number of simplex (S) pedigrees, determined as a percentage of the number of multiplex pedigrees and defined as follows (Table 3.2):

In the fully penetrant dominant model:

(1a) 5 S + 50 M; (1b) 10 S + 100 M

In the incomplete penetrance dominant model with penetrance=0.8:

(1a) 5 S + 50 M; (1b) 10 S + 100 M

(2a) 10 S + 50 M; (2b) 20 S + 100 M

In the incomplete penetrance dominant model with penetrance=0.5:

(1a) 5 S + 50 M; (1b) 10 S + 100 M

(2a) 10 S + 50 M; (2b) 20 S + 100 M

(3a) 15 S + 50 M; (3b) 30 S + 100 M

(4a) 20 S + 50 M; (4b) 40 S + 100 M

(5a) 25 S + 50 M; (5b) 50 S + 100 M

(6a) 30 S + 50 M; (6b) 60 S + 100 M

(7a) 35 S + 50 M; (7b) 70 S + 100 M

In the recessive model with penetrance = 1, 0.8, 0.5, respectively:

(1a) 5 S + 50 M; (1b) 10 S + 100 M

(2a) 10 S + 50 M; (2b) 20 S + 100 M

(3a) 15 S + 50 M; (3b) 30 S + 100 M

(4a) 20 S + 50 M; (4b) 40 S + 100 M

(5a) 25 S + 50 M; (5b) 50 S + 100 M

(6a) 30 S + 50 M; (6b) 60 S + 100 M

(7a) 35 S + 50 M; (7b) 70 S + 100 M

(8a) 40 S + 50 M; (8b) 80 S + 100 M

(9a) 45 S + 50 M; (9b) 90 S + 100 M

(10a) 50 S + 50 M; (10b) 100 S + 100 M

(Note: a represents samples with 50 multiplex pedigrees; b represents samples with 100 multiplex pedigrees).

The above two-phase sampling designs were produced by SAS version 8.2 (2001) and then put into the likelihood function for a two-phase design (1.13 in chapter I) to carry out estimation for two-phase sampling procedures.
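The design grid listed above follows a simple pattern: design k adds simplex pedigrees amounting to 10k% of the number of multiplex pedigrees. The Python sketch below enumerates the grid and the resulting fraction of simplex pedigrees in each sample. It is only an illustration (the designs in this study were produced in SAS), and the function name is hypothetical.

```python
def two_phase_designs(n_multiplex, n_designs):
    """List (design, n_simplex, n_multiplex, n_total, simplex_fraction) tuples.

    Design k uses k * 10% of n_multiplex simplex pedigrees, as in the list above.
    """
    step = n_multiplex // 10
    designs = []
    for k in range(1, n_designs + 1):
        n_simplex = k * step
        n_total = n_simplex + n_multiplex
        designs.append((k, n_simplex, n_multiplex, n_total, n_simplex / n_total))
    return designs

# Recessive models use ten designs; the dominant model with f = 0.5 uses seven.
for design in two_phase_designs(50, 10):
    print(design)
```

For example, design 4 with 50 multiplex pedigrees is 20 S + 50 M, a sample of 70 pedigrees of which 20/70 (about 29%) are simplex.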

We took affected offspring as comprising the proband sampling frame (PSF). We simultaneously obtained the maximum likelihood estimates of the allele frequency (q) and penetrance (f) in each of the sampling designs by using the two-phase sampling likelihood implemented in a special version of the SEGREG program of S.A.G.E. version 4.5 (2003), making the proband the only member of the PSF, as is appropriate for single ascertainment. One hundred replicates were performed for each situation. Here, we report the average of $\hat{q}$ (an estimate of $E(\hat{q})$) and of $\hat{f}$ (an estimate of $E(\hat{f})$), their standard errors (Se), the estimated bias ($\text{bias}(\hat{q}) = E(\hat{q}) - q$), and the root mean square error ($\text{RMSE} = \sqrt{\text{Bias}^2 + \text{Se}^2}$) over the 100 replicates.
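The summary statistics can be written out explicitly. The Python sketch below is illustrative: `summarize` assumes that Se is the sample standard deviation of the replicate estimates, which is an assumption made for the example rather than a statement of how Se was computed here, and both function names are hypothetical. The `rmse` helper applies the RMSE definition directly to a reported bias and Se.

```python
from math import sqrt
from statistics import mean, stdev

def rmse(bias, se):
    """Root mean square error from bias and standard error: sqrt(Bias^2 + Se^2)."""
    return sqrt(bias ** 2 + se ** 2)

def summarize(estimates, true_value):
    """Mean, Se, bias and RMSE of replicate estimates.

    Assumption: Se is taken as the sample standard deviation over replicates.
    """
    est_mean = mean(estimates)
    se = stdev(estimates)
    bias = est_mean - true_value
    return est_mean, se, bias, rmse(bias, se)

# Values reported for design 1 (5 S + 50 M), fully penetrant dominant model (Table 3.3).
print(round(rmse(bias=-0.03349, se=0.01423), 5))
```

Applied to the bias (-0.03349) and Se (0.01423) reported in Table 3.3, this reproduces the tabulated RMSE of 0.03639 to the precision shown.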

3.4 Results

3.4.1 The Effects under Dominant Models

3.4.1.1 The Estimates of the Allele Frequency

Table 3.3 and Table 3.4 show the parameter estimates of the fully penetrant dominant model for segregation analysis in two-phase designs from samples with 50 and 100 multiplex families, respectively. The statistics, including mean estimates and Se estimates, were obtained over the 100 replicates. The allele frequency (q) was underestimated in sampling design 1, for the true q of 0.1 in Table 3.3. The results in Table 3.4 suggest that increasing the total sample size makes the bias and RMSE smaller and brings the estimated allele frequency closer to the true parameter value of 0.1.

Table 3.3 Mean ($\hat{q}$), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Dominant Model for Segregation Analysis: Sample with 50 Multiplex Families

| Sampling design | n | $E(\hat{q})$ | $Se(\hat{q})$ | Bias | RMSE |
|---|---|---|---|---|---|
| 1 (5 S + 50 M) | 55 | 0.06651 | 0.01423 | -0.03349 | 0.03639 |

Note: The allele frequency simulated (q) = 0.1. The penetrance (f) = 1.0. n = sample size. S = simplex family; M = multiplex family.

Table 3.4 Mean ($\hat{q}$), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Dominant Model for Segregation Analysis: Sample with 100 Multiplex Families

| Sampling design | n | $E(\hat{q})$ | $Se(\hat{q})$ | Bias | RMSE |
|---|---|---|---|---|---|
| 1 (10 S + 100 M) | 110 | 0.08604 | 0.01299 | -0.01396 | 0.01907 |

Note: The allele frequency simulated (q) = 0.1. The penetrance (f) = 1.0. n = sample size. S = simplex family; M = multiplex family.

The parameter estimates for incomplete penetrance dominant models are displayed in Tables 3.5 - 3.8. With a penetrance of 0.8 in the dominant model, the allele frequency (q) was underestimated in both sampling designs, for a true q of 0.1 (Table 3.5). A smaller root mean square error (RMSE) indicates better performance. Sampling design 2 (10 S + 50 M) provides a better estimate of q than sampling design 1 (5 S + 50 M). Enlarging the total sample size decreases the bias, the RMSE, and the difference between the parameter estimate ($\hat{q}$) and the true value (Table 3.6). With a penetrance of 0.5 in the incompletely penetrant dominant model (Tables 3.7 and 3.8), sampling designs 3 (15 S + 50 M) and 4 (20 S + 50 M) yield better estimates of q, based on a comparison of the bias and RMSE of $\hat{q}$ in the seven sampling designs. In contrast, sampling designs 7 (35 S + 50 M) and 6 (30 S + 50 M) produce the poorest estimates of q.

Table 3.5 Mean ($\hat{q}$), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Incomplete Dominant Model for Segregation Analysis: Sample with 50 Multiplex Families

| Sampling design | n | $E(\hat{q})$ | $Se(\hat{q})$ | Bias | RMSE |
|---|---|---|---|---|---|
| 1 (5 S + 50 M) | 55 | 0.05509 | 0.00909 | -0.04491 | 0.04582 |
| 2 (10 S + 50 M) | 60 | 0.06103 | 0.00728 | -0.03897 | 0.03964 |

Note: The allele frequency simulated (q) = 0.1. The penetrance (f) = 0.8. n = sample size. S = simplex family; M = multiplex family.

Table 3.6 Mean ($\hat{q}$), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Incomplete Dominant Model for Segregation Analysis: Sample with 100 Multiplex Families

| Sampling design | n | $E(\hat{q})$ | $Se(\hat{q})$ | Bias | RMSE |
|---|---|---|---|---|---|
| 1 (10 S + 100 M) | 110 | 0.07035 | 0.00796 | -0.02965 | 0.03070 |
| 2 (20 S + 100 M) | 120 | 0.08082 | 0.00710 | -0.01918 | 0.02045 |

Note: The allele frequency simulated (q) = 0.1. The penetrance (f) = 0.8. n = sample size. S = simplex family; M = multiplex family.

Table 3.7 Mean ($\hat{q}$), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Incomplete Dominant Model for Segregation Analysis: Sample with 50 Multiplex Families

| Sampling design | n | $E(\hat{q})$ | $Se(\hat{q})$ | Bias | RMSE |
|---|---|---|---|---|---|
| 1 (5 S + 50 M) | 55 | 0.06276 | 0.01226 | -0.03724 | 0.03921 |
| 2 (10 S + 50 M) | 60 | 0.08064 | 0.00824 | -0.01936 | 0.02104 |
| 3 (15 S + 50 M) | 65 | 0.09419 | 0.01470 | -0.00581 | 0.01581 |
| 4 (20 S + 50 M) | 70 | 0.11467 | 0.01078 | 0.01467 | 0.01821 |
| 5 (25 S + 50 M) | 75 | 0.13575 | 0.01437 | 0.03575 | 0.03853 |
| 6 (30 S + 50 M) | 80 | 0.15351 | 0.00951 | 0.05351 | 0.05435 |
| 7 (35 S + 50 M) | 85 | 0.16576 | 0.01173 | 0.06576 | 0.06680 |

Note: The allele frequency simulated (q) = 0.1. The penetrance (f) = 0.5. n = sample size. S = simplex family; M = multiplex family.
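The comparison of designs by RMSE can be made explicit with a short ranking. The Python sketch below uses the bias and Se values reported in Table 3.7; the variable and function names are hypothetical and serve only to illustrate the comparison.

```python
from math import sqrt

# design number -> (bias, Se), as reported in Table 3.7 (q = 0.1, f = 0.5, 50 multiplex families)
table_3_7 = {
    1: (-0.03724, 0.01226),
    2: (-0.01936, 0.00824),
    3: (-0.00581, 0.01470),
    4: (0.01467, 0.01078),
    5: (0.03575, 0.01437),
    6: (0.05351, 0.00951),
    7: (0.06576, 0.01173),
}

def rank_by_rmse(rows):
    """Return design numbers ordered from smallest to largest RMSE."""
    return sorted(rows, key=lambda d: sqrt(rows[d][0] ** 2 + rows[d][1] ** 2))

ranking = rank_by_rmse(table_3_7)
print(ranking)
```

Designs 3 and 4 head the ranking and designs 6 and 7 close it, in agreement with the comparison of bias and RMSE described for Table 3.7.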

Table 3.8 Mean ($\hat{q}$), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Incomplete Dominant Model for Segregation Analysis: Sample with 100 Multiplex Families

| Sampling design | n | $E(\hat{q})$ | $Se(\hat{q})$ | Bias | RMSE |
|---|---|---|---|---|---|
| 1 (10 S + 100 M) | 110 | 0.06872 | 0.00862 | -0.03128 | 0.03245 |
| 2 (20 S + 100 M) | 120 | 0.08791 | 0.00766 | -0.01209 | 0.01431 |
| 3 (30 S + 100 M) | 130 | 0.09729 | 0.01111 | -0.00271 | 0.01144 |
| 4 (40 S + 100 M) | 140 | 0.11087 | 0.00988 | 0.01087 | 0.01469 |
| 5 (50 S + 100 M) | 150 | 0.12817 | 0.01288 | 0.02817 | 0.03098 |
| 6 (60 S + 100 M) | 160 | 0.14972 | 0.00736 | 0.04972 | 0.05026 |
| 7 (70 S + 100 M) | 170 | 0.16183 | 0.00877 | 0.06183 | 0.06245 |

Note: The allele frequency simulated (q) = 0.1. The penetrance (f) = 0.5. n = sample size. S = simplex family; M = multiplex family.

3.4.1.2 The Estimates of the Penetrances

Tables 3.9 and 3.10 display the penetrance estimates of the fully penetrant dominant model for segregation analysis in two-phase designs corresponding to the samples with 50 and 100 multiplex families, respectively. We obtained the mean estimates and Se estimates over the 100 replicates. The penetrances (f) were underestimated for genotypes AA and Aa but overestimated for genotype aa in the sampling design in Table 3.9. With a larger total sample size, the bias declines and the penetrance estimates are closer to the true parameter values (Table 3.10).

Table 3.9 Mean ($\hat{f}$), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Dominant Model for Segregation Analysis: Sample with 50 Multiplex Families

| Sampling design | n | $E(\hat{f}_{AA} = \hat{f}_{Aa})$ | $Se(\hat{f}_{AA} = \hat{f}_{Aa})$ | $E(\hat{f}_{aa})$ | $Se(\hat{f}_{aa})$ |
|---|---|---|---|---|---|
| 1 (5 S + 50 M) | 55 | 0.74630 | 0.03285 | 0.06734 | 0.01351 |

Note: The allele frequency simulated (q) = 0.1. The penetrance (f) = 1.0. n = sample size. S = simplex family; M = multiplex family.

Table 3.10 Mean ($\hat{f}$), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Dominant Model for Segregation Analysis: Sample with 100 Multiplex Families

| Sampling design | n | $E(\hat{f}_{AA} = \hat{f}_{Aa})$ | $Se(\hat{f}_{AA} = \hat{f}_{Aa})$ | $E(\hat{f}_{aa})$ | $Se(\hat{f}_{aa})$ |
|---|---|---|---|---|---|
| 1 (10 S + 100 M) | 110 | 0.8115 | 0.03041 | 0.05928 | 0.01116 |

Note: The allele frequency simulated (q) = 0.1. The penetrance (f) = 1.0. n = sample size. S = simplex family; M = multiplex family.

The penetrance estimates ($\hat{f}$) for incompletely penetrant dominant models are shown in Tables 3.11 - 3.14. When the penetrance of the dominant model decreases to 0.8, sampling design 1 (5 S + 50 M) yields the better estimates of f for AA and Aa, but sampling design 2 (10 S + 50 M) results in the better estimate of f for aa (Table 3.11). When the total sample size increases, the bias correspondingly decreases and the parameter estimate ($\hat{f}$) is closer to the true value (Table 3.12). With a penetrance of 0.5 in the incompletely penetrant dominant model (Table 3.13), sampling designs 3 (15 S + 50 M) and 4 (20 S + 50 M) yield the best estimates of f for AA and Aa; in contrast, sampling design 7 (35 S + 50 M) provides the poorest estimate of f for these genotypes. However, sampling design 7 provides the best estimate of f for aa, and sampling design 1 (5 S + 50 M) gives the poorest estimate of f for aa. When the total sample size increases, the bias correspondingly decreases, and the parameter estimate ($\hat{f}$) is closer to the true value (Table 3.14).

Table 3.11 Mean ($\hat{f}$), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Incomplete Dominant Model for Segregation Analysis: Sample with 50 Multiplex Families

| Sampling design | n | $E(\hat{f}_{AA} = \hat{f}_{Aa})$ | $Se(\hat{f}_{AA} = \hat{f}_{Aa})$ | $E(\hat{f}_{aa})$ | $Se(\hat{f}_{aa})$ |
|---|---|---|---|---|---|
| 1 (5 S + 50 M) | 55 | 0.59039 | 0.03022 | 0.07354 | 0.01064 |
| 2 (10 S + 50 M) | 60 | 0.56374 | 0.03118 | 0.06551 | 0.00964 |

Note: The allele frequency simulated (q) = 0.1. The penetrance (f) = 0.8. n = sample size. S = simplex family; M = multiplex family.

Table 3.12 Mean ($\hat{f}$), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Incomplete Dominant Model for Segregation Analysis: Sample with 100 Multiplex Families

| Sampling design | n | $E(\hat{f}_{AA} = \hat{f}_{Aa})$ | $Se(\hat{f}_{AA} = \hat{f}_{Aa})$ | $E(\hat{f}_{aa})$ | $Se(\hat{f}_{aa})$ |
|---|---|---|---|---|---|
| 1 (10 S + 100 M) | 110 | 0.63833 | 0.02914 | 0.06813 | 0.00887 |
| 2 (20 S + 100 M) | 120 | 0.58776 | 0.02605 | 0.05706 | 0.00753 |

Note: The allele frequency simulated (q) = 0.1. The penetrance (f) = 0.8. n = sample size. S = simplex family; M = multiplex family.

Table 3.13 Mean ($\hat{f}$), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Incomplete Dominant Model for Segregation Analysis: Sample with 50 Multiplex Families

| Sampling design | n | $E(\hat{f}_{AA} = \hat{f}_{Aa})$ | $Se(\hat{f}_{AA} = \hat{f}_{Aa})$ | $E(\hat{f}_{aa})$ | $Se(\hat{f}_{aa})$ |
|---|---|---|---|---|---|
| 1 (5 S + 50 M) | 55 | 0.53728 | 0.02822 | 0.07112 | 0.01032 |
| 2 (10 S + 50 M) | 60 | 0.50716 | 0.02609 | 0.06442 | 0.00685 |
| 3 (15 S + 50 M) | 65 | 0.50201 | 0.02758 | 0.05605 | 0.00941 |
| 4 (20 S + 50 M) | 70 | 0.49351 | 0.02666 | 0.04861 | 0.00795 |
| 5 (25 S + 50 M) | 75 | 0.44544 | 0.02806 | 0.04336 | 0.00882 |
| 6 (30 S + 50 M) | 80 | 0.43350 | 0.02715 | 0.03758 | 0.00819 |
| 7 (35 S + 50 M) | 85 | 0.40678 | 0.02737 | 0.03083 | 0.00994 |

Note: The allele frequency simulated (q) = 0.1. The penetrance (f) = 0.5. n = sample size. S = simplex family; M = multiplex family.

Table 3.14 Mean ($\hat{f}$), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Incomplete Dominant Model for Segregation Analysis: Sample with 100 Multiplex Families

| Sampling design | n | $E(\hat{f}_{AA} = \hat{f}_{Aa})$ | $Se(\hat{f}_{AA} = \hat{f}_{Aa})$ | $E(\hat{f}_{aa})$ | $Se(\hat{f}_{aa})$ |
|---|---|---|---|---|---|
| 1 (10 S + 100 M) | 110 | 0.52081 | 0.02645 | 0.06545 | 0.00817 |
| 2 (20 S + 100 M) | 120 | 0.50114 | 0.02435 | 0.05813 | 0.00521 |
| 3 (30 S + 100 M) | 130 | 0.50102 | 0.02072 | 0.04995 | 0.00747 |
| 4 (40 S + 100 M) | 140 | 0.50108 | 0.02276 | 0.04019 | 0.00515 |
| 5 (50 S + 100 M) | 150 | 0.46188 | 0.02023 | 0.03527 | 0.00622 |
| 6 (60 S + 100 M) | 160 | 0.46457 | 0.02373 | 0.03055 | 0.00631 |
| 7 (70 S + 100 M) | 170 | 0.41860 | 0.02054 | 0.02457 | 0.00695 |

Note: The allele frequency simulated (q) = 0.1. The penetrance (f) = 0.5. n = sample size. S = simplex family; M = multiplex family.

3.4.2 The Effects under Recessive Models

3.4.2.1 The Estimates of the Allele Frequency

Table 3.15 lists the average parameter estimate ($\hat{q}$), its standard error, bias, and RMSE under the fully penetrant recessive model for segregation analysis in two-phase designs with 50 multiplex pedigrees. By comparing the biases and RMSE of $\hat{q}$ in the ten two-phase sampling designs, we found that sampling designs 4 (20 S + 50 M) and 5 (25 S + 50 M) result in better estimates of the allele frequency (q) than the others, which implies that samples consisting of about 29% to 33% simplex families lead to good estimates of q. On the other hand, sampling designs 9 (45 S + 50 M) and 10 (50 S + 50 M) yield poorer parameter estimates of q. Table 3.16 exhibits the corresponding results from the analogous study using a sample with 100 multiplex pedigrees. With this larger size, the bias and RMSE of $\hat{q}$ fall further, as we expected.

Table 3.15 Mean ($\hat{q}$), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Recessive Model for Segregation Analysis: Sample with 50 Multiplex Families

| Sampling design | n | $E(\hat{q})$ | $Se(\hat{q})$ | Bias | RMSE |
|---|---|---|---|---|---|
| 1 (5 S + 50 M) | 55 | 0.16625 | 0.01598 | -0.03375 | 0.03734 |
| 2 (10 S + 50 M) | 60 | 0.17932 | 0.02155 | -0.02068 | 0.02987 |
| 3 (15 S + 50 M) | 65 | 0.17564 | 0.01768 | -0.02436 | 0.03010 |
| 4 (20 S + 50 M) | 70 | 0.19011 | 0.01693 | -0.00989 | 0.01961 |
| 5 (25 S + 50 M) | 75 | 0.21017 | 0.01675 | 0.01017 | 0.01959 |
| 6 (30 S + 50 M) | 80 | 0.22082 | 0.01701 | 0.02082 | 0.02689 |
| 7 (35 S + 50 M) | 85 | 0.24270 | 0.01265 | 0.04270 | 0.04453 |
| 8 (40 S + 50 M) | 90 | 0.28401 | 0.02247 | 0.08401 | 0.08696 |
| 9 (45 S + 50 M) | 95 | 0.27940 | 0.01566 | 0.07940 | 0.08093 |
| 10 (50 S + 50 M) | 100 | 0.30604 | 0.01414 | 0.10604 | 0.10698 |

Note: The allele frequency simulated (q) = 0.2. The penetrance (f) = 1.0. n = sample size. S = simplex family; M = multiplex family.

Table 3.16 Mean ($\hat{q}$), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Recessive Model for Segregation Analysis: Sample with 100 Multiplex Families

| Sampling design | n | $E(\hat{q})$ | $Se(\hat{q})$ | Bias | RMSE |
|---|---|---|---|---|---|
| 1 (10 S + 100 M) | 110 | 0.16866 | 0.01223 | -0.03134 | 0.03364 |
| 2 (20 S + 100 M) | 120 | 0.18011 | 0.01797 | -0.01989 | 0.02681 |
| 3 (30 S + 100 M) | 130 | 0.18072 | 0.01636 | -0.01928 | 0.02529 |
| 4 (40 S + 100 M) | 140 | 0.20564 | 0.01484 | 0.00564 | 0.01588 |
| 5 (50 S + 100 M) | 150 | 0.21003 | 0.01309 | 0.01003 | 0.01649 |
| 6 (60 S + 100 M) | 160 | 0.21883 | 0.01423 | 0.01883 | 0.02360 |
| 7 (70 S + 100 M) | 170 | 0.23709 | 0.01067 | 0.03709 | 0.03859 |
| 8 (80 S + 100 M) | 180 | 0.27659 | 0.01941 | 0.07659 | 0.07901 |
| 9 (90 S + 100 M) | 190 | 0.27484 | 0.01094 | 0.07484 | 0.07564 |
| 10 (100 S + 100 M) | 200 | 0.29813 | 0.01285 | 0.09813 | 0.09897 |

Note: The allele frequency simulated (q) = 0.2. The penetrance (f) = 1.0. n = sample size. S = simplex family; M = multiplex family.

As the penetrance decreases to 0.8 (Table 3.17), the allele frequency (q) is underestimated in sampling designs 1 to 5 and overestimated in sampling designs 6 to 10. Sampling designs 5 (25 S + 50 M) and 6 (30 S + 50 M) yield the best estimates of q, which suggests that with a penetrance of 0.8, having more simplex families in the sample is necessary for producing consistent estimates of q. Compared with the sample of 50 multiplex families, the sample of 100 multiplex families brings the estimates of q closer to the true value (Table 3.18). In Tables 3.19 and 3.20, we show the average parameter estimate ($\hat{q}$), its standard error, bias, and RMSE for the incompletely penetrant recessive model with a penetrance of 0.5. Here q is underestimated in sampling designs 1 to 7 and overestimated in sampling designs 8 to 10 (Table 3.19). Sampling designs 7 (35 S + 50 M) and 6 (30 S + 50 M) are the best, which implies that the effect of the simplex families is striking.

Table 3.17 Mean ($\hat{q}$), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Incomplete Recessive Model for Segregation Analysis: Sample with 50 Multiplex Families

| Sampling design | n | $E(\hat{q})$ | $Se(\hat{q})$ | Bias | RMSE |
|---|---|---|---|---|---|
| 1 (5 S + 50 M) | 55 | 0.13891 | 0.01615 | -0.06109 | 0.06319 |
| 2 (10 S + 50 M) | 60 | 0.14331 | 0.01911 | -0.05669 | 0.05982 |
| 3 (15 S + 50 M) | 65 | 0.16745 | 0.01695 | -0.03255 | 0.03670 |
| 4 (20 S + 50 M) | 70 | 0.17672 | 0.01643 | -0.02328 | 0.02849 |
| 5 (25 S + 50 M) | 75 | 0.19673 | 0.01836 | -0.00327 | 0.01865 |
| 6 (30 S + 50 M) | 80 | 0.22221 | 0.01494 | 0.02221 | 0.02677 |
| 7 (35 S + 50 M) | 85 | 0.26932 | 0.01456 | 0.06932 | 0.07083 |
| 8 (40 S + 50 M) | 90 | 0.26026 | 0.02021 | 0.06026 | 0.06356 |
| 9 (45 S + 50 M) | 95 | 0.30585 | 0.02142 | 0.10585 | 0.10800 |
| 10 (50 S + 50 M) | 100 | 0.31699 | 0.01737 | 0.11699 | 0.11827 |

Note: The allele frequency simulated (q) = 0.2. The penetrance (f) = 0.8. n = sample size. S = simplex family; M = multiplex family.

Table 3.18 Mean ($\hat{q}$), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Incomplete Recessive Model for Segregation Analysis: Sample with 100 Multiplex Families

| Sampling design | n | $E(\hat{q})$ | $Se(\hat{q})$ | Bias | RMSE |
|---|---|---|---|---|---|
| 1 (10 S + 100 M) | 110 | 0.13978 | 0.01148 | -0.06022 | 0.06130 |
| 2 (20 S + 100 M) | 120 | 0.17736 | 0.01087 | -0.02264 | 0.02511 |
| 3 (30 S + 100 M) | 130 | 0.17433 | 0.01114 | -0.02567 | 0.02798 |
| 4 (40 S + 100 M) | 140 | 0.18104 | 0.01205 | -0.01896 | 0.02246 |
| 5 (50 S + 100 M) | 150 | 0.19729 | 0.01012 | -0.00271 | 0.01048 |
| 6 (60 S + 100 M) | 160 | 0.21612 | 0.01295 | 0.01612 | 0.02068 |
| 7 (70 S + 100 M) | 170 | 0.24849 | 0.00915 | 0.04849 | 0.04935 |
| 8 (80 S + 100 M) | 180 | 0.25994 | 0.00971 | 0.05994 | 0.06072 |
| 9 (90 S + 100 M) | 190 | 0.28516 | 0.01272 | 0.08516 | 0.08611 |
| 10 (100 S + 100 M) | 200 | 0.30219 | 0.01363 | 0.10219 | 0.10310 |

Note: The allele frequency simulated (q) = 0.2. The penetrance (f) = 0.8. n = sample size. S = simplex family; M = multiplex family.

Table 3.19 Mean ($\hat{q}$), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Incomplete Recessive Model for Segregation Analysis: Sample with 50 Multiplex Families

| Sampling design | n | $E(\hat{q})$ | $Se(\hat{q})$ | Bias | RMSE |
|---|---|---|---|---|---|
| 1 (5 S + 50 M) | 55 | 0.09168 | 0.01755 | -0.10832 | 0.10973 |
| 2 (10 S + 50 M) | 60 | 0.11971 | 0.01862 | -0.08029 | 0.08242 |
| 3 (15 S + 50 M) | 65 | 0.15895 | 0.01802 | -0.04105 | 0.04483 |
| 4 (20 S + 50 M) | 70 | 0.14576 | 0.01823 | -0.05424 | 0.05722 |
| 5 (25 S + 50 M) | 75 | 0.16518 | 0.01646 | -0.03482 | 0.03851 |
| 6 (30 S + 50 M) | 80 | 0.17078 | 0.01505 | -0.02922 | 0.03287 |
| 7 (35 S + 50 M) | 85 | 0.19292 | 0.01777 | -0.00708 | 0.01913 |
| 8 (40 S + 50 M) | 90 | 0.26729 | 0.01999 | 0.06729 | 0.07020 |
| 9 (45 S + 50 M) | 95 | 0.28254 | 0.02061 | 0.08254 | 0.08507 |
| 10 (50 S + 50 M) | 100 | 0.31919 | 0.01976 | 0.11919 | 0.12082 |

Note: The allele frequency simulated (q) = 0.2. The penetrance (f) = 0.5. n = sample size. S = simplex family; M = multiplex family.

Table 3.20 Mean ($\hat{q}$), Standard Error (Se), Bias and RMSE of the Estimates of the Allele Frequency (q) in the Incomplete Recessive Model for Segregation Analysis: Sample with 100 Multiplex Families

| Sampling design | n | $E(\hat{q})$ | $Se(\hat{q})$ | Bias | RMSE |
|---|---|---|---|---|---|
| 1 (10 S + 100 M) | 110 | 0.10129 | 0.01362 | -0.09871 | 0.09965 |
| 2 (20 S + 100 M) | 120 | 0.12195 | 0.01424 | -0.07805 | 0.07934 |
| 3 (30 S + 100 M) | 130 | 0.16044 | 0.01385 | -0.03956 | 0.04191 |
| 4 (40 S + 100 M) | 140 | 0.15358 | 0.01580 | -0.04642 | 0.04904 |
| 5 (50 S + 100 M) | 150 | 0.17266 | 0.01320 | -0.02734 | 0.03036 |
| 6 (60 S + 100 M) | 160 | 0.18252 | 0.01201 | -0.01748 | 0.02121 |
| 7 (70 S + 100 M) | 170 | 0.20253 | 0.01197 | 0.00253 | 0.01223 |
| 8 (80 S + 100 M) | 180 | 0.25121 | 0.01284 | 0.05121 | 0.05280 |
| 9 (90 S + 100 M) | 190 | 0.27507 | 0.01255 | 0.07507 | 0.07611 |
| 10 (100 S + 100 M) | 200 | 0.30357 | 0.01223 | 0.10357 | 0.10429 |

Note: The allele frequency simulated (q) = 0.2. The penetrance (f) = 0.5. n = sample size. S = simplex family; M = multiplex family.

3.4.2.2 The Estimates of the Penetrances

Table 3.21 lists the average parameter estimate ($\hat{f}$) and its standard error under the fully penetrant recessive model for segregation analysis in two-phase designs with 50 multiplex pedigrees. Based on a comparison of the biases of $\hat{f}$ in the ten two-phase sampling designs, we found that sampling design 1 (5 S + 50 M) provides a better estimate of the penetrance (f) for AA than the other designs, and that design 10 (50 S + 50 M) results in a poorer estimate of f for AA than the others. These data imply that samples with a larger proportion of multiplex families lead to better estimates of f for AA. On the other hand, sampling design 10 (50 S + 50 M) yields the best parameter estimates ($\hat{f}$) for Aa and aa; in contrast, sampling design 1 provides the poorest ones for Aa and aa. Table 3.22 exhibits the corresponding results from the analogous study using a sample with 100 multiplex pedigrees. This larger sample size lowers the bias of $\hat{f}$ further.

Table 3.21 Mean ($\hat{f}$), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Recessive Model for Segregation Analysis: Sample with 50 Multiplex Families

| Sampling design | n | $E(\hat{f}_{AA})$ | $Se(\hat{f}_{AA})$ | $E(\hat{f}_{Aa} = \hat{f}_{aa})$ | $Se(\hat{f}_{Aa} = \hat{f}_{aa})$ |
|---|---|---|---|---|---|
| 1 (5 S + 50 M) | 55 | 0.75996 | 0.03077 | 0.05554 | 0.00740 |
| 2 (10 S + 50 M) | 60 | 0.72614 | 0.03011 | 0.05145 | 0.01188 |
| 3 (15 S + 50 M) | 65 | 0.73186 | 0.02874 | 0.04491 | 0.01049 |
| 4 (20 S + 50 M) | 70 | 0.71245 | 0.03189 | 0.04017 | 0.00946 |
| 5 (25 S + 50 M) | 75 | 0.71194 | 0.03118 | 0.03635 | 0.00611 |
| 6 (30 S + 50 M) | 80 | 0.71344 | 0.03170 | 0.03147 | 0.00646 |
| 7 (35 S + 50 M) | 85 | 0.70921 | 0.03032 | 0.02811 | 0.00456 |
| 8 (40 S + 50 M) | 90 | 0.68722 | 0.03213 | 0.02342 | 0.00789 |
| 9 (45 S + 50 M) | 95 | 0.65731 | 0.03236 | 0.02014 | 0.00548 |
| 10 (50 S + 50 M) | 100 | 0.61258 | 0.03182 | 0.01764 | 0.00501 |

Note: The allele frequency simulated (q) = 0.2. The penetrance (f) = 1.0. n = sample size. S = simplex family; M = multiplex family.

Table 3.22 Mean ($\hat{f}$), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Recessive Model for Segregation Analysis: Sample with 100 Multiplex Families

| Sampling design | n | $E(\hat{f}_{AA})$ | $Se(\hat{f}_{AA})$ | $E(\hat{f}_{Aa} = \hat{f}_{aa})$ | $Se(\hat{f}_{Aa} = \hat{f}_{aa})$ |
|---|---|---|---|---|---|
| 1 (10 S + 100 M) | 110 | 0.81332 | 0.02542 | 0.05518 | 0.00663 |
| 2 (20 S + 100 M) | 120 | 0.81094 | 0.02907 | 0.04809 | 0.00906 |
| 3 (30 S + 100 M) | 130 | 0.79708 | 0.02839 | 0.04314 | 0.00713 |
| 4 (40 S + 100 M) | 140 | 0.77281 | 0.02759 | 0.03840 | 0.00717 |
| 5 (50 S + 100 M) | 150 | 0.78808 | 0.02642 | 0.03397 | 0.00538 |
| 6 (60 S + 100 M) | 160 | 0.76097 | 0.02968 | 0.03021 | 0.00476 |
| 7 (70 S + 100 M) | 170 | 0.76904 | 0.02988 | 0.02643 | 0.00393 |
| 8 (80 S + 100 M) | 180 | 0.73398 | 0.03068 | 0.02218 | 0.00508 |
| 9 (90 S + 100 M) | 190 | 0.68416 | 0.02631 | 0.02005 | 0.00457 |
| 10 (100 S + 100 M) | 200 | 0.63188 | 0.02646 | 0.01706 | 0.00383 |

Note: The allele frequency simulated (q) = 0.2. The penetrance (f) = 1.0. n = sample size. S = simplex family; M = multiplex family.

As the penetrance decreases to 0.8 (Table 3.23), the penetrance (f) for AA is underestimated in all of the sampling designs. Sampling design 1 (5 S + 50 M) offers the best estimate of f for AA, whereas sampling design 10 (50 S + 50 M) gives the poorest, which suggests that when the penetrance is 0.8, multiplex families are very influential for generating consistent estimates of f for AA. For Aa and aa, sampling design 10 provides the best estimate of f and sampling design 1 gives the poorest. As the total sample size increases (Table 3.24), the estimates of f come closer to the true values. Tables 3.25 and 3.26 display the average parameter estimate ($\hat{f}$) and its standard error resulting from the incompletely penetrant recessive model with a penetrance of 0.5. Better estimates of f for AA result from sampling designs 6 (30 S + 50 M) and 7 (35 S + 50 M), but the best estimate of f for aa comes from sampling design 10 (50 S + 50 M). The results suggest that both multiplex and simplex families are necessary to yield better estimates of the penetrance of AA.

Table 3.23 Mean (f̂), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Incomplete Recessive Model for Segregation Analysis: Sample with 50 Multiplex Families

Sampling design       n     E(f̂_AA)   Se(f̂_AA)   E(f̂_Aa = f̂_aa)   Se(f̂_Aa = f̂_aa)
1 (5 S + 50 M)        55    0.69985   0.03368    0.06548           0.00702
2 (10 S + 50 M)       60    0.68781   0.03462    0.06249           0.00641
3 (15 S + 50 M)       65    0.67678   0.03170    0.05746           0.00662
4 (20 S + 50 M)       70    0.65958   0.03546    0.05084           0.00452
5 (25 S + 50 M)       75    0.66879   0.03366    0.04628           0.00964
6 (30 S + 50 M)       80    0.65214   0.02919    0.04252           0.00472
7 (35 S + 50 M)       85    0.63604   0.03265    0.03894           0.00547
8 (40 S + 50 M)       90    0.59332   0.03412    0.03587           0.01272
9 (45 S + 50 M)       95    0.53437   0.03313    0.03308           0.01308
10 (50 S + 50 M)      100   0.50977   0.03122    0.03097           0.00399

Note: The allele frequency simulated (q) = 0.2. The penetrance (f) = 0.8. n = sample size. S = simplex family; M = multiplex family.

Table 3.24 Mean (f̂), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Incomplete Recessive Model for Segregation Analysis: Sample with 100 Multiplex Families

Sampling design       n     E(f̂_AA)   Se(f̂_AA)   E(f̂_Aa = f̂_aa)   Se(f̂_Aa = f̂_aa)
1 (10 S + 100 M)      110   0.76791   0.02965    0.06261           0.00509
2 (20 S + 100 M)      120   0.72998   0.02828    0.05814           0.00422
3 (30 S + 100 M)      130   0.72175   0.02938    0.05596           0.00532
4 (40 S + 100 M)      140   0.72511   0.03022    0.04892           0.00383
5 (50 S + 100 M)      150   0.70175   0.02768    0.04466           0.00403
6 (60 S + 100 M)      160   0.68322   0.02854    0.04020           0.00221
7 (70 S + 100 M)      170   0.68378   0.02536    0.03647           0.00377
8 (80 S + 100 M)      180   0.65686   0.02506    0.03418           0.00550
9 (90 S + 100 M)      190   0.60408   0.02945    0.03127           0.00691
10 (100 S + 100 M)    200   0.54542   0.02891    0.02871           0.00203

Note: The allele frequency simulated (q) = 0.2. The penetrance (f) = 0.8. n = sample size. S = simplex family; M = multiplex family.

Table 3.25 Mean (f̂), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Incomplete Recessive Model for Segregation Analysis: Sample with 50 Multiplex Families

Sampling design       n     E(f̂_AA)   Se(f̂_AA)   E(f̂_Aa = f̂_aa)   Se(f̂_Aa = f̂_aa)
1 (5 S + 50 M)        55    0.57459   0.03392    0.05576           0.00503
2 (10 S + 50 M)       60    0.54085   0.03479    0.04913           0.00314
3 (15 S + 50 M)       65    0.54570   0.03669    0.04453           0.00306
4 (20 S + 50 M)       70    0.54006   0.03317    0.04155           0.00345
5 (25 S + 50 M)       75    0.52016   0.03168    0.03874           0.00248
6 (30 S + 50 M)       80    0.51424   0.02756    0.03464           0.00286
7 (35 S + 50 M)       85    0.48413   0.03293    0.03035           0.00421
8 (40 S + 50 M)       90    0.46485   0.03488    0.02712           0.00513
9 (45 S + 50 M)       95    0.40597   0.03663    0.02495           0.00481
10 (50 S + 50 M)      100   0.36941   0.03310    0.02198           0.00526

Note: The allele frequency simulated (q) = 0.2. The penetrance (f) = 0.5. n = sample size. S = simplex family; M = multiplex family.

Table 3.26 Mean (f̂), Standard Error (Se), Bias and RMSE of the Estimates of the Penetrance (f) in the Incomplete Recessive Model for Segregation Analysis: Sample with 100 Multiplex Families

Sampling design       n     E(f̂_AA)   Se(f̂_AA)   E(f̂_Aa = f̂_aa)   Se(f̂_Aa = f̂_aa)
1 (10 S + 100 M)      110   0.55064   0.03195    0.05496           0.00308
2 (20 S + 100 M)      120   0.53798   0.03261    0.04739           0.00265
3 (30 S + 100 M)      130   0.53946   0.03104    0.04230           0.00229
4 (40 S + 100 M)      140   0.52721   0.03055    0.03996           0.00226
5 (50 S + 100 M)      150   0.51876   0.03089    0.03787           0.00207
6 (60 S + 100 M)      160   0.51184   0.02579    0.03315           0.00216
7 (70 S + 100 M)      170   0.48837   0.03019    0.02926           0.00249
8 (80 S + 100 M)      180   0.46718   0.03178    0.02641           0.00310
9 (90 S + 100 M)      190   0.41371   0.03231    0.02389           0.00268
10 (100 S + 100 M)    200   0.37118   0.03149    0.02014           0.00223

Note: The allele frequency simulated (q) = 0.2. The penetrance (f) = 0.5. n = sample size. S = simplex family; M = multiplex family.

3.5 Summary

We extended the application of two-phase sampling designs for segregation analysis of sibships and nuclear families to general pedigrees. We have proposed two-phase sampling designs and a feasible likelihood to estimate the allele frequency (q) and the penetrance (f) for segregation analysis and compared different sampling designs.

Since the sample sizes of the different designs are not identical, we only draw preliminary conclusions by comparing the results of these designs. Our results show that the proposed likelihood for two-phase samples appears to give consistent estimates, i.e. estimates that become closer to the true parameter values as the sample size increases.

Under the fully penetrant dominant models, multiplex families play a major role in yielding consistent estimates of q and f for AA and Aa. This role might result from the richer genetic information of multiplex families or their higher proportions in the samples. However, as the penetrance of the dominant allele decreases to 0.8 or even 0.5, a relatively higher percentage of simplex families in the sample is necessary to yield good estimates of q; more multiplex families are important for obtaining good estimates of f for AA and Aa; and more simplex families are necessary for obtaining good estimates of f for aa. In this respect, simplex families might function through their specific genetic nature or their increasing proportions under the incompletely dominant models. Although some biases of q consistently exist, they decrease as the sample sizes increase. Therefore, it is practicable to use these two-phase designs and method of analysis, as shown by the fact that sampling designs 3 (15 S + 50 M) and 4 (20 S + 50 M) yield consistent estimates of q and f for AA and Aa from general pedigrees at f = 0.5.

Under the recessive models, both multiplex and simplex families appear to be indispensable for producing good estimates of q; multiplex families play a major role in yielding good estimates of f for AA; simplex families are important for yielding estimates of the penetrance (f) for Aa and aa in two-phase designs. Under the fully penetrant recessive model, sampling designs 4 (20 S + 50 M) and 5 (25 S + 50 M) are the best choices of two-phase designs. When the penetrance decreases to 0.8, the two-phase designs with both multiplex and simplex families result in good estimates of both q and f. Sampling designs 5 (25 S + 50 M) and 6 (30 S + 50 M) offer the best estimates of q and good estimates of f for AA, Aa and aa. However, when the penetrance decreases to 0.5, simplex families show an obvious role in obtaining good estimates of q with most of the designs, and sampling designs 7 (35 S + 50 M) and 6 (30 S + 50 M) are the best ones for yielding good estimates of q and f. In conclusion, samples consisting of both simplex and multiplex families result in good estimates of the allele frequency and penetrances. Therefore, appropriate two-phase designs can function as a sampling method to efficiently improve the estimates of the allele frequency and penetrance in segregation analysis.

Why would one ever eliminate from a sample for segregation analysis simplex pedigrees that have already been collected? One possibility is to note that the method should still work if we replace “simplex” and “multiplex” families by families “reported to be simplex” and “reported to be multiplex”. It may be expensive to determine exactly who is affected and who is unaffected (which the family members themselves know only approximately), so we want to minimize the total number of family members who need to be tested for an exact diagnosis. Moreover, realistically, for research using larger pedigrees, only the available family members, including probands, can be exactly diagnosed, while the assessment of the disease status of a proband’s second-degree and more distant relatives may be based mainly on the family history record. For instance, Seuchter et al. (2000) used such an approach to determine the phenotypes of Tourette Syndrome for complex segregation analysis. Therefore, to exactly diagnose all members of a pedigree is almost impossible. We also meet this problem in the phenotypic assessment of all individuals of a multigenerational pedigree when investigating a disorder with a late age of onset. Thus, in practice it can be very difficult to diagnose exactly who is affected and who is unaffected. However, our two-phase sampling designs can apply to these practical situations, yielding accurate estimates of parameters.

In addition, Jarvik (1998) suggested that the amount of necessary data is expected to be proportional to the number of parameters estimated, limiting our ability to evaluate more complicated models. For example, for rarer traits, the model estimates for use in model-based linkage analysis must often be drawn from complex segregation analysis performed on samples other than the sample of densely affected families collected for linkage analysis (Jarvik 1998). This means that simplex families must be used for the complex segregation analysis of rarer traits, because few multiplex families exist. Two-phase sampling can provide good estimates of a genetic model in this situation. Some researchers use both simplex and multiplex families for ascertainment adjustment, but they break pedigrees into nuclear families, which leads to a loss of information for segregation analysis (Blanco et al. 1998; Vanita et al. 1999). Our two-phase designs avoid this problem.

CHAPTER IV

TWO-PHASE SAMPLING DESIGNS FOR LINKAGE ANALYSIS IN PEDIGREES

4.1 Introduction

Ascertainment is relevant not only to segregation analysis, but also to the estimation of familial correlations and to linkage analysis (Elston 1995). Traditionally, the trait model parameters in a model-based linkage analysis are fixed to specific values obtained either from a previous study on another data set, or from an earlier segregation analysis on the current data set. Clerget-Darpoux et al. (1986) showed that misspecification of the trait model parameters in a linkage analysis leads to biased estimates of the recombination fraction and changes the lod score distribution, in some cases leading to reduced power for detecting linkage. Some researchers have overemphasized the role of multiplex families for linkage analysis (Cox et al. 1988; Slager and Vieland 1997). Should the role of simplex families be disregarded? In Chapter III, we proposed two-phase sampling designs for segregation analysis in general pedigrees. From segregation analysis, our two-phase designs can provide consistent estimates of the parameters, which form a good basis for linkage analysis. Because simplex families supply little information for the lod score, we only genotype a set of markers on all the members of multiplex pedigrees for linkage analysis, in order to minimize the cost.

In this chapter, we simulate samples that consist of three-generation pedigrees and carry out linkage analysis in two-phase sampling designs. The purpose of this chapter is, using likelihood (1.13), to examine (1) the accuracy of estimates of the recombination fraction (θ) for linkage analysis when combining information from segregation analysis in two-phase sampling designs, and (2) the effects on lod scores obtained from multiplex families under different genetic models.

4.2 Simulations

The characteristics of the six genetic models used to generate the sample data for this study have been described in Chapter III. Only the procedure for simulating linked markers is described in this chapter. Genotypes were assigned to founders for a disease trait locus and for a marker locus that does not affect expression of the trait. This marker locus has four alleles with equal frequencies, allowing virtually all transmission patterns to be identified unambiguously. The recombination fraction (θ) for the marker was simulated with a value of 0.1. The six genetic models simulated are presented in Table 3.1.
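The simulation program itself is not reproduced in this chapter. The following Python sketch only illustrates one step of such a simulation under stated assumptions: founder haplotypes are drawn with the trait and marker alleles independent (linkage equilibrium is assumed here), and a parent transmits a recombinant gamete with probability θ = 0.1. The function names and the random seed are hypothetical and serve only as an illustration.

```python
import random

MARKER_ALLELES = [1, 2, 3, 4]   # four equally frequent marker alleles
THETA = 0.1                     # simulated recombination fraction


def draw_founder_haplotype(q, rng):
    """Draw a (trait allele, marker allele) haplotype for a founder.

    q is taken to be the frequency of trait allele 'A' (an assumption of
    this sketch); the marker allele is drawn independently of the trait
    allele.
    """
    trait = "A" if rng.random() < q else "a"
    marker = rng.choice(MARKER_ALLELES)
    return (trait, marker)


def transmit_haplotype(parent, theta, rng):
    """Transmit one gamete from a parent given as (haplotype1, haplotype2).

    With probability 1 - theta the gamete is one of the parental
    haplotypes; with probability theta the trait and marker alleles come
    from different parental haplotypes (a recombinant gamete).
    """
    first = rng.randrange(2)             # haplotype supplying the trait allele
    recombinant = rng.random() < theta
    second = 1 - first if recombinant else first
    return (parent[first][0], parent[second][1]), recombinant


rng = random.Random(2005)
founder = (draw_founder_haplotype(0.2, rng), draw_founder_haplotype(0.2, rng))

parent = (("A", 1), ("a", 2))            # a doubly heterozygous parent
gametes = [transmit_haplotype(parent, THETA, rng) for _ in range(10000)]
observed_fraction = sum(rec for _, rec in gametes) / len(gametes)
print(f"observed recombination fraction: {observed_fraction:.4f}")
```

With many simulated meioses from a doubly heterozygous parent, the observed proportion of recombinant gametes should lie close to the simulated value of 0.1.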

4.3 Method of Analysis

The two-phase sampling designs were generated as described in Chapter III. Only multiplex families entered the samples for linkage analysis, which was conducted with the program LODLINK of S.A.G.E., version 4.5 (2003), using the estimated allele frequency and penetrance files produced by the program SEGREG. We used the whole multiplex sample, with the modified likelihood (1.13), to estimate the linkage parameter. The analyses used maximum likelihood to estimate the recombination fraction. We also computed maximum lod scores for the different designs under the six genetic models.
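LODLINK evaluates the likelihood on whole pedigrees, and that computation is not reproduced here. As a simplified illustration only, assume n fully informative, phase-known meioses of which k are recombinant. The lod score is then log10[θ^k (1 − θ)^(n − k) / 0.5^n], and the maximum likelihood estimate of θ, restricted to [0, 0.5], is min(k/n, 0.5). The function names below are hypothetical.

```python
import math


def lod_score(theta, n_recombinant, n_meioses):
    """Lod score log10[L(theta) / L(0.5)] for phase-known, fully informative meioses."""
    k, n = n_recombinant, n_meioses
    log_lik = 0.0
    if k > 0:
        log_lik += k * math.log10(theta)
    if n - k > 0:
        log_lik += (n - k) * math.log10(1.0 - theta)
    return log_lik - n * math.log10(0.5)


def max_lod(n_recombinant, n_meioses):
    """Return (theta_hat, maximum lod score), with theta restricted to [0, 0.5]."""
    theta_hat = min(n_recombinant / n_meioses, 0.5)
    return theta_hat, lod_score(theta_hat, n_recombinant, n_meioses)


# Hypothetical example: 2 recombinants observed among 20 informative meioses.
theta_hat, z_max = max_lod(n_recombinant=2, n_meioses=20)
print(f"theta_hat = {theta_hat:.2f}, maximum lod = {z_max:.2f}")
```

In a pedigree analysis the trait genotypes are not observed directly, so the likelihood also depends on the allele frequency and penetrances. This is why those estimates from the segregation analysis can influence both the estimate of θ and the lod score.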

Thus, we report the average θ̂ (an estimate of E(θ̂)), its standard error (Se), bias (bias = E(θ̂) - θ), root mean square error (RMSE = √(Bias² + Se²)), the average maximum lod score (E(lod)) and Se(lod) over 100 replicates.
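The bias and RMSE can be recomputed from the reported mean and standard error. The short Python sketch below (the function name is illustrative) reads the RMSE as the square root of the squared bias plus the squared standard error, and applies it to the rows for sampling design 1 in Tables 4.1 and 4.2:

```python
import math

TRUE_THETA = 0.1  # simulated recombination fraction


def bias_and_rmse(mean_estimate, se, true_value=TRUE_THETA):
    """Return (bias, RMSE), with bias = mean - true value and RMSE = sqrt(bias^2 + Se^2)."""
    bias = mean_estimate - true_value
    return bias, math.sqrt(bias ** 2 + se ** 2)


# Sampling design 1 in Table 4.1 (5 S + 50 M) and in Table 4.2 (10 S + 100 M).
bias_50, rmse_50 = bias_and_rmse(0.07861, 0.01020)
bias_100, rmse_100 = bias_and_rmse(0.10519, 0.00988)

print(f"50 M:  bias = {bias_50:.5f}, RMSE = {rmse_50:.5f}")
print(f"100 M: bias = {bias_100:.5f}, RMSE = {rmse_100:.5f}")
```

Rounded to five decimal places, these values agree with the tabulated biases (-0.02139 and 0.00519) and RMSEs (0.02370 and 0.01116).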

4.4 Results

4.4.1 The Effects under Dominant Models

4.4.1.1 Estimates of the Recombination Fraction

Table 4.1 displays the results for the parameter estimate (θ), its standard error, bias, and RMSE under the fully penetrant dominant model for linkage analysis in two-phase sampling designs. The recombination fraction (θ) is slightly underestimated under sampling design 1 (5 S + 50 M), which nevertheless provides a good estimate of θ. When the total sample size increases, the bias and RMSE are reduced and the estimated recombination fraction is closer to the true value of 0.1 (Table 4.2).

Table 4.1 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Maximum Lod Scores in the Dominant Model for Linkage Analysis: Sample with 50 Multiplex Families

Sampling design       n     E(θ̂)      Se(θ̂)     Bias(θ̂)    RMSE      E(lod)   Se(lod)
1 (5 S + 50 M)        55    0.07861   0.01020   -0.02139   0.02370   6.76     0.91

Note: The recombination fraction simulated (θ) = 0.1. The penetrance (f) = 1.0. n = sample size. S = simplex family; M = multiplex family.

Table 4.2 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Maximum Lod Scores in the Dominant Model for Linkage Analysis: Sample with 100 Multiplex Families

Sampling design       n     E(θ̂)      Se(θ̂)     Bias(θ̂)    RMSE      E(lod)   Se(lod)
1 (10 S + 100 M)      110   0.10519   0.00988    0.00519   0.01116   15.13    1.90

Note: The recombination fraction simulated (θ) = 0.1. The penetrance (f) = 1.0. n = sample size. S = simplex family; M = multiplex family.

Tables 4.3 and 4.4 present the results of parameter estimates under the incompletely dominant model with a penetrance of 0.8. The data suggest that the recombination fraction is underestimated in all the sampling designs. By analyzing the biases and RMSEs of the different sampling designs, we found that sampling design 2 (10 S + 50 M) provides a better estimate of θ and that increasing the total sample size brings the parameter estimate (θ) closer to the true value. Tables 4.5 and 4.6 show the parameter estimates (θ) in the incompletely dominant model with a penetrance of 0.5. On the one hand, biases of the estimated θ always exist; on the other hand, sampling design 4 (20 S + 50 M) yields the best estimate of θ, which suggests that with lower penetrance, the simplex families become more influential.

Table 4.3 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Maximum Lod Scores in the Incomplete Dominant Model for Linkage Analysis: Sample with 50 Multiplex Families

Sampling design       n     E(θ̂)      Se(θ̂)     Bias(θ̂)    RMSE      E(lod)   Se(lod)
1 (5 S + 50 M)        55    0.07691   0.01153   -0.02309   0.02581   3.42     0.45
2 (10 S + 50 M)       60    0.08324   0.01154   -0.01676   0.02035   3.84     0.47

Note: The recombination fraction simulated (θ) = 0.1. The penetrance (f) = 0.8. n = sample size. S = simplex family; M = multiplex family.

Table 4.4 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Maximum Lod Scores in the Incomplete Dominant Model for Linkage Analysis: Sample with 100 Multiplex Families

Sampling design       n     E(θ̂)      Se(θ̂)     Bias(θ̂)    RMSE      E(lod)   Se(lod)
1 (10 S + 100 M)      110   0.08151   0.01110   -0.01849   0.02157   7.96     0.86
2 (20 S + 100 M)      120   0.09234   0.01124   -0.00766   0.01360   8.94     0.88

Note: The recombination fraction simulated (θ) = 0.1. The penetrance (f) = 0.8. n = sample size. S = simplex family; M = multiplex family.

Table 4.5 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Maximum Lod Scores in the Incomplete Dominant Model for Linkage Analysis: Sample with 50 Multiplex Families

Sampling design       n     E(θ̂)      Se(θ̂)     Bias(θ̂)    RMSE      E(lod)   Se(lod)
1 (5 S + 50 M)        55    0.12815   0.01389    0.02815   0.03139   1.36     0.21
2 (10 S + 50 M)       60    0.11761   0.01385    0.01761   0.02240   1.41     0.22
3 (15 S + 50 M)       65    0.10903   0.01279    0.00903   0.01566   1.52     0.23
4 (20 S + 50 M)       70    0.10411   0.01309    0.00411   0.01372   1.60     0.25
5 (25 S + 50 M)       75    0.08906   0.01312   -0.01094   0.01708   1.55     0.24
6 (30 S + 50 M)       80    0.07675   0.01275   -0.02325   0.02652   1.58     0.25
7 (35 S + 50 M)       85    0.05696   0.01291   -0.04304   0.04494   1.59     0.25

Note: The recombination fraction simulated (θ) = 0.1. The penetrance (f) = 0.5. n = sample size. S = simplex family; M = multiplex family.

Table 4.6 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Maximum Lod Scores in the Incomplete Dominant Model for Linkage Analysis: Sample with 100 Multiplex Families

Sampling design       n     E(θ̂)      Se(θ̂)     Bias(θ̂)    RMSE      E(lod)   Se(lod)
1 (10 S + 100 M)      110   0.11981   0.01274    0.01981   0.02355   3.41     0.38
2 (20 S + 100 M)      120   0.11442   0.01166    0.01442   0.01854   3.60     0.42
3 (30 S + 100 M)      130   0.10652   0.01072    0.00652   0.01255   3.71     0.43
4 (40 S + 100 M)      140   0.10143   0.01101    0.00143   0.01110   4.12     0.53
5 (50 S + 100 M)      150   0.09108   0.01125   -0.00892   0.01436   3.98     0.46
6 (60 S + 100 M)      160   0.08163   0.01034   -0.01837   0.02108   4.06     0.49
7 (70 S + 100 M)      170   0.06658   0.01075   -0.03342   0.03511   4.08     0.50

Note: The recombination fraction simulated (θ) = 0.1. The penetrance (f) = 0.5. n = sample size. S = simplex family; M = multiplex family.

4.4.1.2 Maximum Lod Scores

Table 4.1 displays the average maximum lod scores under the fully penetrant dominant model in two-phase designs. From sampling design 1 (5 S + 50 M), we found that 50 multiplex pedigrees lead to a lod score value of 6.76, so each multiplex pedigree contributes a lod score value of 0.1352 (E(lod)/50 = 6.76/50). As shown in Table 4.2, in sampling design 1 a sample of 100 multiplex families contributes a lod score value of 15.13, which means that each multiplex pedigree leads to a lod score of 0.1513.

The lod score estimates for the incompletely dominant models are listed in Tables 4.3 to 4.6. As the penetrance of the dominant model declines to 0.8, compared to sampling design 1 (5 S + 50 M), in which each multiplex pedigree produces on average a maximum lod score of 0.0684 (3.42/50) (Table 4.3), sampling design 2 (10 S + 50 M) provides a higher lod score estimate. However, compared to the results of the fully penetrant dominant model above, a penetrance of 0.8 causes a loss of about half of the information of each multiplex pedigree (0.0684 vs. 0.1352). With a penetrance of 0.5, sampling design 4 (20 S + 50 M) yields the highest maximum lod score value of 1.60 (Table 4.5), which implies that simplex pedigrees become more important as the penetrance decreases because they affect the estimates of allele frequency and penetrances, and these impact the estimates of the recombination fraction and the lod score. The results from the corresponding analyses using a sample with 100 multiplex families suggest that sampling designs 4 (40 S + 100 M) and 7 (70 S + 100 M) yield higher estimates of lod scores (Table 4.6).
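The per-pedigree lod scores quoted in this subsection are the average maximum lod scores divided by the number of multiplex pedigrees. The following Python sketch (the dictionary labels are illustrative) reproduces that arithmetic for sampling design 1 from the values in Tables 4.1 to 4.3, together with the ratio behind the statement that about half of the information is lost at a penetrance of 0.8:

```python
# Average maximum lod score and number of multiplex pedigrees for
# sampling design 1, taken from Tables 4.1, 4.2 and 4.3.
designs = {
    "f = 1.0, 50 M": (6.76, 50),
    "f = 1.0, 100 M": (15.13, 100),
    "f = 0.8, 50 M": (3.42, 50),
}

# Average lod score contributed by each multiplex pedigree.
per_pedigree = {
    label: e_lod / n_multiplex
    for label, (e_lod, n_multiplex) in designs.items()
}

# Share of the per-pedigree lod score retained when f drops from 1.0 to 0.8.
retained = per_pedigree["f = 0.8, 50 M"] / per_pedigree["f = 1.0, 50 M"]

for label, value in per_pedigree.items():
    print(f"{label}: {value:.4f} per multiplex pedigree")
print(f"retained at f = 0.8: {retained:.3f}")
```

The retained share is slightly above one half, which matches the comparison of 0.0684 with 0.1352 in the text.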

4.4.2 The Effects under Recessive Models

4.4.2.1 Estimates of the Recombination Fraction

Table 4.7 summarizes the parameter estimates under the fully penetrant recessive model for linkage analysis in two-phase designs corresponding to a sample with 50 multiplex families. By comparing the biases and root mean square errors (RMSE) of the estimates of the recombination fraction in the ten sampling designs, we discovered that biases in the estimated recombination fractions (θ) always exist in these sampling designs. Among all of the designs, sampling design 1 (5 S + 50 M) yields the poorest estimate of θ, so samples with few simplex families cannot result in good estimates of θ, which suggests that simplex families, even if not typed for a linkage marker, are relevant to the estimation of θ. With respect to improving the estimates of θ, sampling designs 5 (25 S + 50 M) and 4 (20 S + 50 M) are the best, so both simplex and multiplex families appear to be necessary. As the total sample size gets larger, the bias and RMSE decline and the estimated recombination fraction is closer to the true parameter value of 0.1 (Table 4.8).

Table 4.7 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Maximum Lod Scores in the Recessive Model for Linkage Analysis: Sample with 50 Multiplex Families

Sampling design       n     E(θ̂)      Se(θ̂)     Bias(θ̂)    RMSE      E(lod)   Se(lod)
1 (5 S + 50 M)        55    0.05481   0.00798   -0.04519   0.04589   4.86     0.48
2 (10 S + 50 M)       60    0.06293   0.00702   -0.03707   0.03773   5.07     0.49
3 (15 S + 50 M)       65    0.06977   0.00662   -0.03023   0.03095   5.42     0.50
4 (20 S + 50 M)       70    0.07146   0.00729   -0.02854   0.02946   5.74     0.53
5 (25 S + 50 M)       75    0.07248   0.00754   -0.02752   0.02853   6.11     0.57
6 (30 S + 50 M)       80    0.07051   0.00521   -0.02949   0.02995   6.78     0.62
7 (35 S + 50 M)       85    0.06982   0.00671   -0.03018   0.03092   6.81     0.64
8 (40 S + 50 M)       90    0.06805   0.00809   -0.03195   0.03296   6.67     0.58
9 (45 S + 50 M)       95    0.06713   0.00810   -0.03287   0.03385   6.60     0.66
10 (50 S + 50 M)      100   0.06025   0.00718   -0.03975   0.04039   6.49     0.61

Note: The recombination fraction simulated (θ) = 0.1. The penetrance (f) = 1.0. n = sample size. S = simplex family; M = multiplex family.

Table 4.8 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Maximum Lod Scores in the Recessive Model for Linkage Analysis: Sample with 100 Multiplex Families

Sampling design       n     E(θ̂)      Se(θ̂)     Bias(θ̂)    RMSE      E(lod)   Se(lod)
1 (10 S + 100 M)      110   0.05934   0.00733   -0.04066   0.04132   11.01    1.10
2 (20 S + 100 M)      120   0.06501   0.00646   -0.03499   0.03558   11.92    1.14
3 (30 S + 100 M)      130   0.07122   0.00595   -0.02878   0.02939   12.23    1.12
4 (40 S + 100 M)      140   0.08165   0.00665   -0.01835   0.01952   12.98    1.20
5 (50 S + 100 M)      150   0.08364   0.00637   -0.01636   0.01756   13.57    1.17
6 (60 S + 100 M)      160   0.08049   0.00425   -0.01951   0.01997   14.63    1.27
7 (70 S + 100 M)      170   0.07377   0.00598   -0.02623   0.02690   14.92    1.26
8 (80 S + 100 M)      180   0.07143   0.00722   -0.02857   0.02947   14.55    1.21
9 (90 S + 100 M)      190   0.06744   0.00695   -0.03256   0.03329   14.45    1.20
10 (100 S + 100 M)    200   0.06362   0.00574   -0.03638   0.03683   14.31    1.21

Note: The recombination fraction simulated (θ) = 0.1. The penetrance (f) = 1.0. n = sample size. S = simplex family; M = multiplex family.

The estimates of θ for the incompletely recessive models with a penetrance of 0.8 are presented in Tables 4.9 and 4.10. Compared to the other designs, sampling design 10 (50 S + 50 M) produces a poorer estimate of θ, but designs 6 (30 S + 50 M) and 5 (25 S + 50 M) yield the best estimates of θ. The data indicate that more simplex families are required in the samples to produce good estimates of θ than under the fully penetrant model. When the total sample size goes up, the biases and RMSEs of the estimates of θ are further lowered in most of the sampling designs.

Table 4.9 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Maximum Lod Scores in the Incomplete Recessive Model for Linkage Analysis: Sample with 50 Multiplex Families

Sampling design       n     E(θ̂)      Se(θ̂)     Bias(θ̂)    RMSE      E(lod)   Se(lod)
1 (5 S + 50 M)        55    0.06119   0.00961   -0.03881   0.03998   2.85     0.34
2 (10 S + 50 M)       60    0.06352   0.00804   -0.03648   0.03736   3.03     0.36
3 (15 S + 50 M)       65    0.06409   0.00856   -0.03591   0.03692   3.82     0.37
4 (20 S + 50 M)       70    0.06942   0.00891   -0.03058   0.03185   4.07     0.37
5 (25 S + 50 M)       75    0.07291   0.00923   -0.02709   0.02862   4.56     0.39
6 (30 S + 50 M)       80    0.07473   0.00964   -0.02527   0.02705   5.02     0.42
7 (35 S + 50 M)       85    0.06814   0.00992   -0.03186   0.03337   5.12     0.42
8 (40 S + 50 M)       90    0.06318   0.01166   -0.03682   0.03862   4.90     0.39
9 (45 S + 50 M)       95    0.05962   0.01138   -0.04038   0.04195   4.95     0.40
10 (50 S + 50 M)      100   0.05137   0.00999   -0.04863   0.04965   4.92     0.40

Note: The recombination fraction simulated (θ) = 0.1. The penetrance (f) = 0.8. n = sample size. S = simplex family; M = multiplex family.

Table 4.10 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Maximum Lod Scores in the Incomplete Recessive Model for Linkage Analysis: Sample with 100 Multiplex Families

Sampling design       n     E(θ̂)      Se(θ̂)     Bias(θ̂)    RMSE      E(lod)   Se(lod)
1 (10 S + 100 M)      110   0.06312   0.00751   -0.03688   0.03764   6.30     0.75
2 (20 S + 100 M)      120   0.06586   0.00710   -0.03414   0.03487   7.39     0.77
3 (30 S + 100 M)      130   0.06771   0.00824   -0.03229   0.03332   7.41     0.79
4 (40 S + 100 M)      140   0.08192   0.00797   -0.01808   0.01976   8.11     0.81
5 (50 S + 100 M)      150   0.08353   0.00774   -0.01647   0.01820   8.53     0.81
6 (60 S + 100 M)      160   0.08924   0.00788   -0.01076   0.01334   10.10    0.85
7 (70 S + 100 M)      170   0.08193   0.00732   -0.01807   0.01950   10.20    0.86
8 (80 S + 100 M)      180   0.07372   0.00817   -0.02628   0.02752   10.01    0.85
9 (90 S + 100 M)      190   0.06657   0.00727   -0.03343   0.03421   9.99     0.84
10 (100 S + 100 M)    200   0.05417   0.00724   -0.04583   0.04640   9.82     0.83

Note: The recombination fraction simulated (θ) = 0.1. The penetrance (f) = 0.8. n = sample size. S = simplex family; M = multiplex family.

When the penetrance falls to 0.5, sampling designs 7 (35 S + 50 M) and 6 (30 S + 50 M) provide the best estimates of θ (Tables 4.11 and 4.12). More simplex families in the samples are necessary to obtain good estimates of θ, reflecting the importance of simplex families.

Table 4.11 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Maximum Lod Scores in the Incomplete Recessive Model for Linkage Analysis: Sample with 50 Multiplex Families

Sampling design       n     E(θ̂)      Se(θ̂)     Bias(θ̂)    RMSE      E(lod)   Se(lod)
1 (5 S + 50 M)        55    0.05539   0.01010   -0.04461   0.04574   1.25     0.21
2 (10 S + 50 M)       60    0.06681   0.01058   -0.03319   0.03484   1.50     0.23
3 (15 S + 50 M)       65    0.07542   0.01035   -0.02458   0.02667   1.62     0.23
4 (20 S + 50 M)       70    0.07965   0.00999   -0.02035   0.02267   1.78     0.24
5 (25 S + 50 M)       75    0.08003   0.01123   -0.01997   0.02291   1.79     0.24
6 (30 S + 50 M)       80    0.08581   0.01018   -0.01419   0.01746   1.98     0.25
7 (35 S + 50 M)       85    0.08593   0.01077   -0.01407   0.01772   2.29     0.25
8 (40 S + 50 M)       90    0.07894   0.01084   -0.02106   0.02369   2.34     0.26
9 (45 S + 50 M)       95    0.06452   0.00969   -0.03548   0.03678   2.23     0.25
10 (50 S + 50 M)      100   0.05896   0.01045   -0.04104   0.04235   2.10     0.25

Note: The recombination fraction simulated (θ) = 0.1. The penetrance (f) = 0.5. n = sample size. S = simplex family; M = multiplex family.

Table 4.12 Mean (θ̂), Standard Error (Se), Bias, RMSE of the Estimates of the Recombination Fraction (θ) and Mean and Se of the Maximum Lod Scores in the Incomplete Recessive Model for Linkage Analysis: Sample with 100 Multiplex Families

Sampling design       n     E(θ̂)      Se(θ̂)     Bias(θ̂)    RMSE      E(lod)   Se(lod)
1 (10 S + 100 M)      110   0.05801   0.00946   -0.04199   0.04304   2.58     0.40
2 (20 S + 100 M)      120   0.06972   0.01002   -0.03028   0.03190   3.17     0.41
3 (30 S + 100 M)      130   0.08263   0.00996   -0.01737   0.02002   4.06     0.43
4 (40 S + 100 M)      140   0.08314   0.00917   -0.01686   0.01919   4.11     0.45
5 (50 S + 100 M)      150   0.08475   0.00941   -0.01525   0.01792   4.44     0.45
6 (60 S + 100 M)      160   0.09021   0.00930   -0.00979   0.01350   4.82     0.46
7 (70 S + 100 M)      170   0.09117   0.00983   -0.00883   0.01321   5.36     0.49
8 (80 S + 100 M)      180   0.08425   0.00969   -0.01575   0.01849   5.43     0.52
9 (90 S + 100 M)      190   0.07711   0.00902   -0.02289   0.02460   5.30     0.49
10 (100 S + 100 M)    200   0.06785   0.00973   -0.03215   0.03359   5.12     0.47

Note: The recombination fraction simulated (θ) = 0.1. The penetrance (f) = 0.5. n = sample size. S = simplex family; M = multiplex family.

4.4.2.2 Maximum Lod Scores

Table 4.7 lists the average maximum lod scores and their standard errors under the fully penetrant recessive model in two-phase designs with 50 multiplex pedigrees. Sampling designs 7 (35 S + 50 M) and 6 (30 S + 50 M) contribute the highest lod score values. Sampling design 1 (5 S + 50 M) leads to the lowest lod score. As the total sample size rises (Table 4.8), the results of the corresponding analyses are similar to those obtained with 50 multiplex pedigrees. In addition, for samples with 100 multiplex pedigrees, E(lod) roughly doubles.

Table 4.9 shows E(lod) and its standard error (Se(lod)) corresponding to a penetrance of 0.8. Under this condition, sampling designs 7 (35 S + 50 M) and 6 (30 S + 50 M) contribute the highest lod score values of 5.12 and 5.02, respectively. Sampling design 1 (5 S + 50 M) leads to the lowest lod score, 2.85. We observed similar results in the corresponding analyses of samples with 100 multiplex pedigrees (Table 4.10). As the penetrance decreases to 0.5 (Table 4.11), sampling designs 8 (40 S + 50 M) and 7 (35 S + 50 M) yield the highest lod score values of 2.34 and 2.29, respectively, and sampling design 1 (5 S + 50 M) provides the lowest lod score value of 1.25. This suggests that untyped simplex families can play an important role in the lod score. As the total sample size increases (Table 4.12), the corresponding analyses lead to results similar to those for the samples with 50 multiplex pedigrees.

4.5 Summary

We extended the application of two-phase sampling designs from segregation analysis to linkage analysis. We estimated the recombination fraction (θ) for linkage analysis under six genetic models and identified effective sampling designs to produce good estimates of θ. Under the fully penetrant dominant models, sampling design 1 (5 S + 50 M) provides a good estimate of the recombination fraction (θ). We speculate that the richer genetic information in the multiplex families, as well as their larger proportions in the samples, might be responsible for good estimation of θ. But simplex families also reveal some effect on obtaining better estimates of θ. When the penetrance of dominant models decreases to 0.8, sampling design 2 (10 S + 50 M) offers a better estimate of θ and a higher lod score. As the penetrance declines further to 0.5, sampling design 4 (20 S + 50 M) provides the best estimate of θ and the highest lod score value. Thus, untyped simplex families show obvious effects. The increasing percentage of simplex families in the sample contributes to their effect on estimating θ and the lod score because they affect the estimates of allele frequencies and penetrances. Slight biases of the estimates of θ always occur, but they become smaller as the sample size becomes larger.

Under the fully penetrant recessive models, sampling designs 5 (25 S + 50 M) and 7 (35 S + 50 M) provide the best estimate of θ and the highest lod score, respectively. This suggests that both multiplex and simplex families have an impact on yielding good estimates of θ. As the penetrance falls to 0.8, the role of simplex families appears to be more remarkable for obtaining better estimates of θ. Sampling design 6 (30 S + 50 M) provides the best estimate of θ and design 7 (35 S + 50 M) provides the largest lod score value. When the penetrance decreases to 0.5, sampling designs 7 and 6 provide better estimates of θ, and sampling designs 8 (40 S + 50 M) and 7 result in higher lod score values. These results imply that untyped simplex families, which are relatively cheap to collect, can be beneficial for linkage analysis.

In summary, our results suggest that appropriate two-phase sampling designs may provide asymptotically good estimates for linkage analysis. In addition, multiplex families play a direct role, but simplex ones are also needed, especially in reduced penetrance models. Note that simplex families play only an indirect role, by providing better estimates of allele frequencies and penetrances, because they do not enter the samples for linkage analysis. Some researchers only used multiplex or highly dense families as samples to conduct pure linkage analysis (Slager and Vieland 1997; Hodge et al. 1997). No matter how they assumed the values of parameters needed for linkage analysis, these studies only solved some theoretical problems if there was no information about the genetic model. Our two-phase sampling designs incorporate the information from simplex families, which exert an indirect role, to provide consistent estimates of the recombination fraction and reduce possible false linkages. Therefore, these designs can solve practical problems, especially for rarer traits. In addition, simultaneous joint segregation and linkage analysis provides a powerful method to detect linkage for quantitative traits, but it is not easy to make an ascertainment correction for multiplex families when analyzing a qualitative trait (Gauderman et al. 1997); however, our two-phase designs can overcome this difficulty for linkage analysis.

CHAPTER V

THE COST EFFECTIVENESS OF LINKAGE ANALYSIS IN TWO-PHASE DESIGNS

5.1 Introduction

Segregation and linkage analyses of pedigree samples are commonly conducted to characterize and locate genes, but collecting, phenotyping and genotyping pedigree samples is quite costly. Therefore, to evaluate the feasibility of a sampling scheme, a prior assessment of pedigree informativity is indispensable. Different patterns of phenotypes within a pedigree are differently informative, which supplies a basis for some qualitative recommendations regarding an optimal sampling strategy. The connection between the size of a sample and its capacity to provide information can be established by examining a particular method of making a linkage decision (i.e., its type I and type II errors). Alternatively, Elston and Bonney (1984) proposed choosing the best sampling strategy on the basis of the lod score of the sample collected. We showed in Chapter IV that, using likelihood (1.13), two-phase designs can provide consistent estimates of the recombination fraction from linkage analysis. We are interested in the practical value of these designs with regard to the cost of a study. Specifically, it is of interest to discover the combinations of multiplex and simplex families in these sampling designs that lead to the best cost efficacy. In this chapter, we formalize the cost of the sample and show how different two-phase sampling designs influence the cost of a linkage analysis.

5.2 A Cost Function for Two-Phase Sampling Designs

In Chapter III, we used sampling designs comprising a fixed number of multiplex families combined with a fraction of the available simplex families to estimate the parameters of segregation analysis, such as the allele frequency and the penetrance. Then, in Chapter IV, only the multiplex families were genotyped, and the information from them was combined with the information from segregation analysis of a larger sample that contains simplex families to conduct linkage analysis. We set a fixed family size of 11 in the samples so that we could conveniently compare the costs of different sampling plans. Let $C_p$ be the cost of phenotyping and ascertaining one family. In order to allow for statistical variation, we set the lod score at [3 + 1.96 Se(lod)] rather than 3 as the level required to detect significant linkage. Assume that $n_m$ is the number of multiplex families that need to be genotyped to achieve a lod score of [3 + 1.96 Se(lod)] in linkage analysis; $n_m$ is also the number of multiplex families to be phenotyped and ascertained for segregation analysis. Let $n_s$ be the expected number of simplex families phenotyped and ascertained for inclusion in a segregation analysis. Then the total cost of phenotyping and ascertaining the sample is

$$C_{TP} = (n_s + n_m)\,C_p. \tag{5.1}$$

Assume that $C_g$ is the cost of genotyping one family, and $C_p = l\,C_g$ ($l > 0$), where $l$ is the cost ratio of phenotyping and ascertaining vs. genotyping. Depending on the nature of the traits to be phenotyped, the phenotyping cost varies from relatively inexpensive (e.g., height and weight or simple blood tests) through moderately expensive (multiple blood tests) and expensive (provocative testing) to very expensive (detailed physiological tests and imaging studies) (Zhao et al. 1997). Here, we set $l$ = 0.05, 0.5, 1, 5 and 50 in turn. Because simplex families do not enter the samples to be genotyped, the total cost of a linkage study using a two-phase design is

$$C_T = (n_s + n_m)\,C_p + n_m C_g = (n_s + n_m)\,l\,C_g + n_m C_g = [(n_s + n_m)\,l + n_m]\,C_g. \tag{5.2}$$

By (5.2), we are able to estimate the cost of segregation and linkage analysis. We also seek the number of simplex families in a two-phase design that makes the sample most cost effective.
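The cost function (5.2) can be written as a small Python sketch. The function name `total_cost` and its argument names are hypothetical, chosen only for illustration. The default genotyping cost of $200 per family and the family counts (4 simplex, 36 multiplex) are the values used later in this chapter for sampling design 1 under the fully penetrant dominant model.

```python
def total_cost(n_s, n_m, cost_ratio, genotyping_cost=200.0):
    """Total cost C_T = [(n_s + n_m) * l + n_m] * C_g, as in equation (5.2).

    n_s: number of simplex families (phenotyped and ascertained only)
    n_m: number of multiplex families (phenotyped, ascertained and genotyped)
    cost_ratio: l = C_p / C_g, must be positive
    genotyping_cost: C_g, cost of genotyping one family (assumed default)
    """
    if cost_ratio <= 0:
        raise ValueError("the cost ratio l must be positive")
    return ((n_s + n_m) * cost_ratio + n_m) * genotyping_cost


# Costs in $1000s for the five cost ratios considered in this chapter.
for l in (0.05, 0.5, 1, 5, 50):
    print(f"l = {l:>5}: {total_cost(4, 36, l) / 1000:.2f}")
```

Under these inputs the sketch reproduces the row reported for sampling design 1 in Table 5.3.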

5.3 Results

Based on the lod scores described in the last chapter, we estimated the approximate number of multiplex families ($n_m$) needed to achieve a lod score of [3 + 1.96 Se(lod)]. We first calculated the value of $n_m$ from each two-phase sampling design with 50 multiplex families in each sample. Based on this value, we then calculated the approximate value of $n_s$ corresponding to the proportion of simplex families in each sample. For example, for sampling design 1 (5 S + 50 M) under the fully penetrant dominant model, $n_m = [(3 + 1.96 \times 0.91)/6.76] \times 50 \approx 36$ and $n_s = (5/50)\,n_m = (5/50) \times 36 \approx 4$. We then compared the costs of the different sampling designs. Tables 5.1 and 5.2 display the estimated numbers of simplex and multiplex families required to achieve a lod score of [3 + 1.96 Se(lod)].
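The worked example for sampling design 1 can be checked with a short Python sketch. Two points are assumptions rather than statements from the text: the reading of 6.76 as the lod score obtained with 50 multiplex families and of 0.91 as Se(lod), which is inferred from the form of the expression, and the rule of rounding both counts up to the next whole family, which reproduces the values 36 and 4 in the example but may not match how every table entry was rounded. The function name `required_families` is hypothetical.

```python
import math


def required_families(lod_observed, se_lod, n_simplex, n_multiplex=50):
    """Scale a design with n_multiplex genotyped families to the target lod.

    The target is 3 + 1.96 * Se(lod). Rounding up is an assumption.
    Returns (n_s, n_m).
    """
    target = 3.0 + 1.96 * se_lod
    n_m = math.ceil(target / lod_observed * n_multiplex)
    # Integer ceiling division keeps the simplex:multiplex ratio of the design
    # without floating-point surprises when the product is a whole number.
    n_s = -(-(n_simplex * n_m) // n_multiplex)
    return n_s, n_m


# Sampling design 1 (5 S + 50 M), fully penetrant dominant model.
print(required_families(lod_observed=6.76, se_lod=0.91, n_simplex=5))
```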

Table 5.1 Estimated Numbers of Simplex and Multiplex Families to Achieve a Lod Score of [3 + 1.96 Se(lod)]: Dominant Models

                                            Model
                    ------------------------------------------------------
Sampling Design     D (f = 1)         ID (f = 0.8)      ID (f = 0.5)
                    ($n_s$, $n_m$)    ($n_s$, $n_m$)    ($n_s$, $n_m$)
--------------------------------------------------------------------------
1 (5 S + 50 M)      (4, 36)           (6, 57)           (13, 126)
2 (10 S + 50 M)     -                 (11, 52)          (31, 122)
3 (15 S + 50 M)     -                 -                 (35, 114)
4 (20 S + 50 M)     -                 -                 (44, 110)
5 (25 S + 50 M)     -                 -                 (56, 112)
6 (30 S + 50 M)     -                 -                 (67, 111)
7 (35 S + 50 M)     -                 -                 (77, 110)
--------------------------------------------------------------------------
Note: D = Dominant Model; ID = Incomplete Penetrance Dominant Model; f = penetrance; $n_s$ = number of simplex families; $n_m$ = number of multiplex families; - = no entry reported.

Table 5.2 Estimated Numbers of Simplex and Multiplex Families to Achieve a Lod Score of [3 + 1.96 Se(lod)]: Recessive Models

                                            Model
                    ------------------------------------------------------
Sampling Design     R (f = 1)         ID (f = 0.8)      ID (f = 0.5)
                    ($n_s$, $n_m$)    ($n_s$, $n_m$)    ($n_s$, $n_m$)
--------------------------------------------------------------------------
1 (5 S + 50 M)      (5, 41)           (7, 65)           (14, 137)
2 (10 S + 50 M)     (8, 40)           (13, 62)          (24, 116)
3 (15 S + 50 M)     (12, 37)          (15, 49)          (32, 107)
4 (20 S + 50 M)     (15, 36)          (19, 46)          (39, 98)
5 (25 S + 50 M)     (17, 34)          (21, 42)          (49, 97)
6 (30 S + 50 M)     (20, 32)          (24, 39)          (53, 89)
7 (35 S + 50 M)     (23, 32)          (27, 38)          (54, 77)
8 (40 S + 50 M)     (26, 32)          (31, 39)          (60, 75)
9 (45 S + 50 M)     (29, 32)          (35, 39)          (71, 79)
10 (50 S + 50 M)    (32, 32)          (39, 39)          (84, 84)
--------------------------------------------------------------------------
Note: R = Recessive Model; ID = Incomplete Penetrance Recessive Model; f = penetrance; $n_s$ = number of simplex families; $n_m$ = number of multiplex families.

Using the results in Tables 5.1 and 5.2 and the cost function (5.2), and assuming $C_g = \$200.00$, we calculated the cost for each of the two-phase sampling designs, for each of the six genetic models, and for each of the five values of $l$, the cost ratio of phenotyping and ascertaining vs. genotyping. The results are summarized below.

Under the fully penetrant dominant model, we calculated the costs of sampling design 1 for the five cost ratios ($l$) (Table 5.3). As $l$ becomes larger, the cost increases. Under the dominant model with penetrance = 0.8 (Table 5.4), sampling design 2 (10 S + 50 M) is more cost effective than sampling design 1 (5 S + 50 M) for all the cost ratios. Under the dominant model with penetrance = 0.5 (Table 5.5), sampling design 4 (20 S + 50 M) is the most economical for $l$ = 0.05 to 0.5; sampling designs 3 (15 S + 50 M) and 4 are the most economical for $l$ = 1; and sampling designs 1 (5 S + 50 M) and 3 are the most economical for $l$ = 5 to 50.

Table 5.3 Estimated Cost to Achieve a Lod Score of [3 + 1.96 Se(lod)] for the Cost Function (5.2): Fully Penetrant Dominant Model

                                  Expected Cost ($1000s)
Sampling Design       l = 0.05   l = 0.5     l = 1     l = 5    l = 50
----------------------------------------------------------------------
1 (5 S + 50 M)            7.60     11.20     15.20     47.20    407.20
----------------------------------------------------------------------
Note: l is the cost ratio of phenotyping and ascertaining vs. genotyping. $C_g = \$200.00$.

Table 5.4 Estimated Cost to Achieve a Lod Score of [3 + 1.96 Se(lod)] for the Cost Function (5.2): Incompletely Penetrant Dominant Model, Penetrance (f) = 0.8

                                  Expected Cost ($1000s)
Sampling Design       l = 0.05   l = 0.5     l = 1     l = 5    l = 50
----------------------------------------------------------------------
1 (5 S + 50 M)           12.03     17.70     24.00     74.40    641.40
2 (10 S + 50 M)          11.03     16.70     23.00     73.40    640.40
----------------------------------------------------------------------
Note: l is the cost ratio of phenotyping and ascertaining vs. genotyping. $C_g = \$200.00$.

Table 5.5 Estimated Cost to Achieve a Lod Score of [3 + 1.96 Se(lod)] for the Cost Function (5.2): Incompletely Penetrant Dominant Model, Penetrance (f) = 0.5

                                  Expected Cost ($1000s)
Sampling Design       l = 0.05   l = 0.5     l = 1     l = 5    l = 50
----------------------------------------------------------------------
1 (5 S + 50 M)           26.59     39.10     53.00    164.20   1415.20
2 (10 S + 50 M)          25.93     39.70     55.00    177.40   1554.40
3 (15 S + 50 M)          24.29     37.70     52.60    171.80   1512.80
4 (20 S + 50 M)          23.54     37.40     52.80    176.00   1562.00
5 (25 S + 50 M)          24.08     39.20     56.00    190.40   1702.40
6 (30 S + 50 M)          23.98     40.00     57.80    200.20   1802.20
7 (35 S + 50 M)          23.87     40.70     59.40    209.00   1892.00
----------------------------------------------------------------------
Note: l is the cost ratio of phenotyping and ascertaining vs. genotyping. $C_g = \$200.00$.

Under the fully penetrant recessive model (Table 5.6), when $l$ = 0.05, sampling design 6 (30 S + 50 M) is the most cost effective and sampling design 7 is the second most cost effective; when $l$ is 0.5 to 1, sampling design 6 is the least costly and sampling design 5 is the second least costly; when $l$ is 5 to 50, sampling design 1 is the most cost effective. Under the incompletely penetrant recessive model with penetrance = 0.8 (Table 5.7), sampling design 7 is the most economical and sampling design 6 is the second for $l$ = 0.05; sampling designs 7 and 6 are the most cost effective for $l$ = 0.5 to 1; sampling design 6 is the most cost effective for $l$ = 5 to 50. Under the incompletely penetrant recessive model with penetrance = 0.5 (Table 5.8), sampling design 8 is the most cost effective and sampling design 7 is the second for $l$ = 0.05; sampling designs 8 and 7 are the most cost effective for $l$ = 0.5; sampling design 7 is the most cost effective and sampling design 8 is the second for $l$ = 1 to 50.
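The comparison for the fully penetrant recessive model can be illustrated with a short Python sketch that applies the cost function (5.2) to the ($n_s$, $n_m$) pairs of Table 5.2 and picks the cheapest design for each cost ratio. The names `RECESSIVE_F1`, `total_cost` and `cheapest_design` are hypothetical, and the genotyping cost of $200 per family is the value assumed in this chapter.

```python
# (n_s, n_m) for each sampling design, fully penetrant recessive model (Table 5.2).
RECESSIVE_F1 = {
    1: (5, 41), 2: (8, 40), 3: (12, 37), 4: (15, 36), 5: (17, 34),
    6: (20, 32), 7: (23, 32), 8: (26, 32), 9: (29, 32), 10: (32, 32),
}
COST_RATIOS = (0.05, 0.5, 1, 5, 50)


def total_cost(n_s, n_m, cost_ratio, genotyping_cost=200.0):
    """Cost function (5.2): [(n_s + n_m) * l + n_m] * C_g."""
    return ((n_s + n_m) * cost_ratio + n_m) * genotyping_cost


def cheapest_design(designs, cost_ratio):
    """Return the design number with the lowest total cost for this cost ratio."""
    return min(designs, key=lambda d: total_cost(*designs[d], cost_ratio))


for l in COST_RATIOS:
    best = cheapest_design(RECESSIVE_F1, l)
    cost = total_cost(*RECESSIVE_F1[best], l) / 1000
    print(f"l = {l:>5}: design {best} at {cost:.2f} ($1000s)")
```

With these inputs the sketch selects design 6 for $l$ = 0.05, 0.5 and 1 and design 1 for $l$ = 5 and 50, in agreement with the description above.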

Table 5.6 Estimated Cost to Achieve a Lod Score of [3 + 1.96 Se(lod)] for the Cost Function (5.2): Fully Penetrant Recessive Model, Penetrance (f) = 1.0

                                  Expected Cost ($1000s)
Sampling Design       l = 0.05   l = 0.5     l = 1     l = 5    l = 50
----------------------------------------------------------------------
1 (5 S + 50 M)            8.66     12.80     17.40     54.20    468.20
2 (10 S + 50 M)           8.48     12.80     17.60     56.00    488.00
3 (15 S + 50 M)           7.89     12.30     17.20     56.40    497.40
4 (20 S + 50 M)           7.71     12.30     17.40     58.20    517.20
5 (25 S + 50 M)           7.31     11.90     17.00     57.80    516.80
6 (30 S + 50 M)           6.92     11.60     16.80     58.40    526.40
7 (35 S + 50 M)           6.95     11.90     17.40     61.40    556.40
8 (40 S + 50 M)           6.98     12.20     18.00     64.40    586.40
9 (45 S + 50 M)           7.10     12.50     18.60     67.40    616.40
10 (50 S + 50 M)          7.40     12.80     19.20     70.40    646.40
----------------------------------------------------------------------
Note: l is the cost ratio of phenotyping and ascertaining vs. genotyping. $C_g = \$200.00$.

Table 5.7 Estimated Cost to Achieve a Lod Score of [3 + 1.96 Se(lod)] for the Cost Function (5.2): Incompletely Penetrant Recessive Model, Penetrance (f) = 0.8

                                  Expected Cost ($1000s)
Sampling Design       l = 0.05   l = 0.5     l = 1     l = 5    l = 50
----------------------------------------------------------------------
1 (5 S + 50 M)           13.72     20.20     27.40     85.00    733.00
2 (10 S + 50 M)          13.15     19.90     27.40     87.40    762.40
3 (15 S + 50 M)          10.44     16.20     22.60     73.80    649.80
4 (20 S + 50 M)           9.85     15.70     22.20     74.20    659.20
5 (25 S + 50 M)           9.03     14.70     21.00     71.40    658.40
6 (30 S + 50 M)           8.43     14.10     20.40     70.80    637.80
7 (35 S + 50 M)           8.25     14.10     20.60     72.60    657.60
8 (40 S + 50 M)           8.50     14.80     21.80     77.80    707.80
9 (45 S + 50 M)           8.54     15.20     22.60     81.80    747.80
10 (50 S + 50 M)          8.58     15.60     23.40     85.80    787.80
----------------------------------------------------------------------
Note: l is the cost ratio of phenotyping and ascertaining vs. genotyping. $C_g = \$200.00$.

Table 5.8 Estimated Cost to Achieve a Lod Score of [3 + 1.96 Se(lod)] for the Cost Function (5.2): Incompletely Penetrant Recessive Model, Penetrance (f) = 0.5

                                  Expected Cost ($1000s)
Sampling Design       l = 0.05   l = 0.5     l = 1     l = 5    l = 50
----------------------------------------------------------------------
1 (5 S + 50 M)           28.91     42.50     57.60    178.40   1537.40
2 (10 S + 50 M)          24.60     37.20     51.20    163.20   1423.20
3 (15 S + 50 M)          22.79     35.20     49.20    160.40   1411.40
4 (20 S + 50 M)          20.97     33.30     47.00    156.60   1389.60
5 (25 S + 50 M)          20.86     34.00     48.60    165.40   1479.40
6 (30 S + 50 M)          19.22     32.00     46.20    159.80   1437.80
7 (35 S + 50 M)          16.71     28.50     41.60    146.40   1325.40
8 (40 S + 50 M)          16.35     28.50     42.00    150.00   1365.00
9 (45 S + 50 M)          17.30     30.80     45.80    165.80   1515.80
10 (50 S + 50 M)         18.48     33.60     50.40    184.80   1696.80
----------------------------------------------------------------------
Note: l is the cost ratio of phenotyping and ascertaining vs. genotyping. $C_g = \$200.00$.

5.4 Summary

In order to discover the optimum sampling strategy, we explored the impact of the cost ratio of phenotyping and ascertaining vs. genotyping on the cost efficacy of two-phase designs used for mapping a disease trait locus. The relevant results based on the cost function (5.2) are described below.

Today, great progress in the techniques of molecular biology has made genotyping much less costly, so we rarely see the case $l$ = 0.05 in practice. Under the fully penetrant dominant model, only sampling design 1 (5 S + 50 M) is feasible, and its cost has been given above. Under the dominant model with penetrance = 0.8, sampling design 2 (10 S + 50 M) is more cost effective for all cost ratios. Under the dominant model with penetrance = 0.5, sampling design 4 (20 S + 50 M) is the best for $l$ = 0.05 to 0.5, and sampling design 3 (15 S + 50 M) is the best for $l$ = 1 to 50. Thus, including more simplex families increases the cost of phenotyping but decreases the total cost.

Under the fully penetrant recessive model, sampling design 6 (30 S + 50 M) is the best and sampling design 7 (35 S + 50 M) is the second when $l$ = 0.05 to 1. Under the incompletely penetrant recessive model with penetrance = 0.8, sampling designs 7 and 6 are the most economical. Under the incompletely penetrant recessive model with penetrance = 0.5, sampling designs 8 and 7 are the best ones. Compared with the dominant models, the recessive models need more simplex families to achieve the greatest cost efficiency.

More simplex families increase the cost of phenotyping but in the end reduce the total cost. Simplex families play a clear role in achieving economical costs in the case of reduced penetrance.

Here, we set the lod score at [3 + 1.96 Se(lod)] rather than 3 as the level required to detect significant linkage. Therefore, the costs calculated in this study are somewhat conservative, in that it may be possible to decrease the sample size and thus reduce costs.

Our results were derived by assuming that the recombination fraction between the trait locus and the marker is 0.1 and that a marker with four alleles is fully polymorphic. It is not difficult to adjust the sample size for different situations. This will change the costs, but will not affect our general conclusions regarding the sampling designs.

Only a few results about the cost of pure linkage analysis have been reported (Ginsburg and Axenovich 1997; Goldgar and Easton 1997). Goldgar and Easton acknowledged that, in order to estimate costs, they assumed the true genetic model underlying disease susceptibility to be known for the lod score calculations, a situation that is unlikely to occur in practice. We have proposed two-phase designs that provide more realistic and more economical cost estimates. In addition, Goldgar and Easton (1997) stated that if either a wrong model was used for model-based linkage analysis or a model-free approach was taken, the cost was likely to be higher than the real cost. The maximum lod score is usually underestimated in linkage analysis using wrong genetic models, and more families are then needed to reach the level required for linkage (Xu et al. 1998), so the cost is higher. In our two-phase designs, simplex families play an important role in obtaining consistent estimates of the parameters. Although appropriately increasing the number of simplex families in the samples increases the cost of phenotyping, it decreases the total cost of segregation and linkage analysis, especially in the case of reduced penetrance models. On the other hand, in simultaneous joint segregation and linkage analysis, both simplex and multiplex families enter the samples for parameter estimation, so its cost would be higher than that of our two-phase designs.

CHAPTER VI

CONCLUSIONS AND TOPICS FOR FURTHER STUDY

6.1 Conclusions

In this dissertation, we have developed two-phase sampling designs for segregation analysis and linkage analysis. We compared a variety of sampling schemes using numerous combinations of multiplex and simplex families in different proportions.

The results of the likelihoods for segregation analysis derived from sibships and nuclear families suggest that two-phase designs are computationally practical for the classical model. Furthermore, our new likelihood proposed for two-phase designs can be used to carry out segregation analysis and linkage analysis of general pedigrees; it can provide comprehensive information for yielding consistent estimates of the parameters and optimal designs.

Our two-phase designs incorporating the information from simplex families are useful for overcoming the following significant difficulties caused by relying exclusively on the information in multiplex families. First, for rarer traits, multiplex families are found more rarely in the population, so the information from these families alone can be too limited to estimate the parameters. Second, even if there are enough multiplex families in some samples, there are no good methods of ascertainment adjustment for segregation analysis. Third, the cost of diagnosing affected and unaffected individuals exactly is very high for some diseases.

Under a dominant model with penetrance = 1, sampling design 1 (5 S + 50 M) or (10 S + 100 M) provides good estimates of the allele frequency (q) and recombination fraction (θ), which suggests that about 9% simplex families in each sample (5/(5 + 50) ≈ 9%) is enough to yield good parameter estimates. Under a dominant model with penetrance = 0.8, the sampling designs with higher proportions of multiplex families combined with somewhat lower proportions of simplex families, such as sampling design 2 (10 S + 50 M), provide better estimates of q and θ, and design 2 is also more cost-effective, which suggests about 17% (10/60) as the percentage of simplex families necessary for a better design. Under a dominant model with penetrance = 0.5, sampling designs with higher proportions of simplex families, such as sampling designs 3 (15 S + 50 M) and 4 (20 S + 50 M), yield consistent estimates of q and θ and, in addition, sampling design 4 is more cost-effective than the other designs for most cost ratios, so that samples with 23% (15/65) to 29% (20/70) simplex families are good choices.

Under a recessive model with full penetrance, sampling designs 4 (20 S + 50 M) and 5 (25 S + 50 M) produce the best parameter estimates, and sampling design 6 (30 S + 50 M) is the most cost-effective for most cost ratios, so that samples containing about 29% (20/70) to 38% (30/80) simplex families are good designs. When the penetrance decreases to 0.8, sampling designs 5 and 6 provide better estimates of q and θ, and sampling designs 7 (35 S + 50 M) and 6 are more cost-effective than the other designs; therefore, samples with 38% (30/80) simplex families, as in design 6, are the best choices. When the penetrance decreases to 0.5, simplex families become indispensable for achieving good parameter estimates and lod scores, while multiplex families are more informative for yielding lod scores. For example, sampling designs 7 and 6 provide better parameter estimates, and sampling designs 8 (40 S + 50 M) and 7 are more cost-effective than the other designs; thus, samples having about 41% (35/85) simplex families, as in design 7, are good designs.
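The percentages of simplex families quoted in these conclusions follow from the design composition, S/(S + M). A minimal Python sketch of that arithmetic is given here; the function name `simplex_percent` is hypothetical, and the loop assumes, as in Tables 5.1 and 5.2, that design k contains 5k simplex and 50 multiplex families.

```python
def simplex_percent(n_simplex, n_multiplex=50):
    """Percentage of simplex families in a sample of S simplex + M multiplex."""
    return 100.0 * n_simplex / (n_simplex + n_multiplex)


# Designs 1 to 10 contain 5, 10, ..., 50 simplex families with 50 multiplex families.
for design, n_simplex in enumerate(range(5, 55, 5), start=1):
    pct = simplex_percent(n_simplex)
    print(f"design {design:>2} ({n_simplex} S + 50 M): {pct:.1f}% simplex")
```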

Therefore, the proposed two-phase designs, which combine a high proportion of multiplex families with a smaller proportion of simplex families, are better for producing consistent estimates of the parameters and achieve higher cost efficacy for testing genetic models. In particular, when there is incomplete penetrance, having more simplex families in the samples is necessary to obtain good parameter estimates. Guo and Elston (2000) found that, for model-free methods of analysis, optimal two-stage designs including both affected and discordant relative pairs are usually more cost effective than the optimal designs having either affected or discordant relative pairs alone. This suggests that samples with a combination of two kinds of families usually provide good designs.

The optimal two-phase designs were established here for qualitative traits. Special versions of the programs SEGREG and LODLINK from S.A.G.E. version 4.5 (2003) were used to study these two-phase designs. We used the program SEGREG to estimate parameters such as the allele frequencies and penetrances. Linkage analysis was then conducted using the program LODLINK, which used the estimated allele frequency and penetrance files automatically produced by SEGREG.

6.2 Topics for Further Study

The sampling scheme is an important factor affecting the choice of optimum study designs and certainly deserves more investigation. In this dissertation, we have investigated two-phase designs for linkage analysis in combination with segregation analysis using either nuclear families or three-generation pedigrees. It is always desirable to have large pedigrees for linkage analysis, although they require more computation time. Therefore, as a first topic for further study, it would be necessary to incorporate large pedigrees in the sampling designs. The efficiency of two-phase designs using large pedigrees could be evaluated by simulation.

We are interested in increasing the power of linkage analysis. We have used two-point linkage analysis for two-phase designs. However, multipoint linkage analysis is usually more powerful than two-point linkage analysis, because it makes better use of the information available in the markers (Kruglyak and Lander 1995). As a second topic for further study, we would therefore consider extending two-phase designs to multipoint linkage analysis. This could be achieved by combining the use of the SEGREG and MLOD programs in S.A.G.E.

We assumed that only one trait locus is involved in the etiology of the disease. However, genetic heterogeneity, which means that two or more loci can cause the same disease, has long been recognized as a factor affecting linkage analysis (Morton 1956; Badner et al. 1998). Thus, two-phase designs for locating two linked or unlinked disease loci could be a third topic for further investigation.

We have established the optimal two-phase designs for qualitative traits. For quantitative traits, combining extremely concordant and extremely discordant relative pairs is more cost effective than using extremely discordant relative pairs alone in model-free linkage analysis (Gu et al. 1996). Therefore, finally, it would be worthwhile to investigate two-phase designs for quantitative traits.

BIBLIOGRAPHY

Aitken JF, Bailey-Wilson J, Green AC, MacLennan R, Martin NG (1998) Segregation analysis of cutaneous melanoma in Queensland. Genet Epidemiol 15:391–401.

Amos CI, Elston RC (1989) Robust methods for the detection of genetic linkage for quantitative data from pedigrees. Genet Epidemiol 6:349–360.

Badner JA, Gershon ES, Goldin LR (1998) Optimal ascertainment strategies to detect linkage to common disease alleles. Am J Hum Genet 63:880-888.

Bailey NTJ (1951) A classification of methods of ascertainment and analysis in estimating the frequencies of recessives in man. Ann Eugenics 16:223-225.

Bateson W (1909) Mendel’s principles of heredity. 2nd ed, Cambridge University Press.

Blackwelder WC, Elston RC (1985) A comparison of sib-pair linkage tests for disease susceptibility loci. Genet Epidemiol 2:85-98.

Blanco R, Arcos-Burgos M, Paredes M, Palomino H, Jara L, Carreno H, Obreque V, Munoz MA (1998) Complex segregation analysis of nonsyndromic cleft lip/palate in a Chilean population. Genet Mol Biol 21:1-10.

Blangero J (1995) Genetic analysis of a common oligogenic trait with quantitative correlates: summary of GAW9 results. Genet Epidemiol 12:689-706.

Blangero J, Williams JT, Almasy L (2001) Variance component methods for detecting complex trait loci. Advances in Genetics 42:151-181.

Boehnke M, Lange K (1984) Ascertainment and goodness of fit of variance components models for pedigree data. In Rao DC, Elston RC, Kuller LH, Feinleib M, Carter C, Havlik R (eds). Genetic Epidemiology of Coronary Heart Disease: Past, Present and Future. New York, Alan R Liss, pp 173-192.

Boehnke M, Young MR, Moll PP (1988) Comparison of sequential and fixed-structure sampling of pedigrees in complex segregation analysis of a quantitative trait. Am J Hum Genet 43:336-343.

Bonney GE (1984) On the statistical determination of major gene mechanisms in continuous human traits: Regression models. Am J Med Genet 18:731-749.

Bonney GE (1986) Regressive logistic models for familial disease and other binary traits. Biometrics 42:611-625.

Bonney GE (1992) Compound regressive models for family data. Hum Heredity 42:28-41.

Bonney GE (1998) Ascertainment corrections based on smaller family units. Am J Hum Genet 63:1202-1215.

Bonney GE, Lathrop GM, Lalouel J-M (1988) Combined linkage and segregation analysis using regressive models. Am J Hum Genet 43:29-37.

Burton PR, Palmer LJ, Jacobs K, Keen KJ, Olson JM, Elston RC (2001) Ascertainment adjustment: where does it take us? Am J Hum Genet 67:1505-1514.

Burton PR, Palmer LJ, Keen KJ, Olson JM, Elston RC (2002) Response to Epstein et al. Am J Hum Genet 71:441-442.

Cannings C, Thompson EA, Skolnick M (1976) The recursive derivation of likelihoods on complex pedigrees. Adv Appl Prob 8:622-625.

Cannings C, Thompson EA, Skolnick M (1978) Probability functions on complex pedigrees. Adv Appl Prob 10:26-91.

Cannings C, Thompson EA (1977) Ascertainment in the sequential sampling of pedigrees. Clin Genet 12:208-212.

Chotai J (1984) On the lod score method in linkage analysis. Ann Hum Genet 48:359-378.

Clerget-Darpoux F, Bonaïti-Pellié C, Hochez J (1986) Effects of misspecifying genetic parameters in lod score analysis. Biometrics 42:393-399.

Clerget-Darpoux F, Bonaïti-Pellié C (1992) Strategies on marker information for the study of human disease. Ann Hum Genet 46:145-153.

Cleves MA, Elston RC (1997) Alternative test for linkage between two loci. Genet Epidemiol 14:117-131.

Comuzzie AG, Williams JT (1999) Correcting for ascertainment bias in the COGA data set. Genet Epidemiol 17 (suppl 1):S109-S114.

Cordell HJ, Olson JM (2000) Correcting for ascertainment bias of relative-risk estimates obtained using affected-sib-pair linkage data. Genet Epidemiol 18:307-321.

Cordell HJ, Wedig GC, Jacobs KB, Elston RC (2000) Multilocus linkage tests based on affected relative pairs. Am J Hum Genet 66:1273-1286.

Cox DR (1972) Regression models and life-tables (with discussion). J R Stat Soc Ser B 34:187-220.

Cox DR (1975) Partial likelihood. Biometrika 62:269-276.

Cox NJ, Hodge SE, Marazita ML, Spence MA, Kidd KK (1988) Some effects of selection strategies on linkage analysis. Genet Epidemiol 5:289-297.

Craig JE, Rochette J, Fisher CA, Weatherall DJ, Marc S, Lathrop GM, Demenais F, et al. (1996) Dissecting the loci controlling fetal haemoglobin production on chromosomes 11p and 6q by the regressive approach. Nat Genet 12:58-64.

Davie AM (1979) The ‘single’ method for segregation analysis under incomplete ascertainment. Ann Hum Genet 42:507-512.

Day NE, Simon MJ (1976) Disease susceptibility genes - their identification by multiple case family studies. Tissue Antigens 8:109-119.

DeVries RRP, Fat RFMLA, Nijenhuis LE, Van Rood JJ (1976) HLA-linked genetic control of host response of Mycobacterium leprae. Lancet ii:1328-1330.

Drigalenko E (1998) How sib pairs reveal linkage. Am J Hum Genet 63:1242-1245.

Dürner M, Vieland VJ, Greenberg DA (1999) Further evidence for the increased power of LOD scores compared with nonparametric methods. Am J Hum Genet 64:281-289.

Edwards JH (1971) The analysis of X-linkage. Ann Hum Genet 34:229-250.

Elston RC (1981) Segregation analysis. Advances in Human Genetics 11:63-120.

Elston RC (1992) Design for the global search of the human genome by linkage analysis. In: Proceedings of the XVIth International Biometric Conference. Hamilton, New Zealand, pp 39-51.

Elston RC (1995) 'Twixt cup and lip: How intractable is the ascertainment problem? Am J Hum Genet 56:15-17.

Elston RC (1998a) Methods of linkage analysis and the assumptions underlying them. Am J Hum Genet 63:931-934.

Elston RC (1998b) Linkage and Association. Genet Epidemiol 15:565-576.

Elston RC, Bonney GE (1984) Sampling considerations in the design and analysis of family studies. In Rao DC, Elston RC, Kuller LH, Feinleib M, Carter C, Havlik R (eds). Genetic Epidemiology of Coronary Heart Disease: Past, Present, and Future. New York, Alan R Liss: 349-371.

Elston RC, Buxbaum S, Jacobs KB, Olson JM (2000) Haseman and Elston revisited. Genet Epidemiol 19:1–17.

Elston RC, Cordell HJ (2001) Overview of model-free methods for linkage analysis. Advances in Genetics 42:135-150.

Elston RC, Guo X, Williams LV (1996) Two-stage global search designs for linkage analysis using pairs of affected relatives. Genet Epidemiol 13:535-558.

Elston RC, Sobel E (1979) Sampling considerations in the gathering and analysis of pedigree data. Am J Hum Genet 31:62-69.

Elston RC, Stewart J (1971) A general model for the genetic analysis of pedigree data. Hum Hered 21:523-545.

Elston RC, Yelverton KC (1975) General models for segregation analysis. Am J Hum Genet 27:31-45.

Epstein MP, Lin X, Boehnke M (2002) Ascertainment-adjusted parameters revisited. Am J Hum Genet 70:886-895.

Ewens WJ, Shute NC (1986) A resolution of the ascertainment sampling problem. I. Theory. Theor Popul Biol 30:388-412.

Faucett CL, Gauderman WJ, Thomas DC, Ziogas A, Sobel E (1993) Combined segregation and linkage analysis of late-onset Alzheimer's disease in Duke families using Gibbs sampling. Genet Epidemiol 10:489-494.

Feingold E, Siegmund DO (1997) Strategies for mapping heterogeneous recessive traits by allele-sharing methods. Am J Hum Genet 60:965-978.

Fisher RA (1934) The effect of methods of ascertainment upon the estimation of frequencies. Ann Eugenics 6:13-25.

Fisher RA (1935) The detection of linkage with “dominant” abnormalities. Ann Eugenics 6:187-201.

Gastwirth JL, Freidlin B (2000) On power and efficiency robust linkage tests for affected sibs. Ann Hum Genet 64:443-453.

Gauderman WJ, Faucett CL (1997) Detection of gene-environment interactions in joint segregation and linkage analysis. Am J Hum Genet 61:1189-1199.

Gauderman WJ, Faucett CL, Morrison JL, Carpenter CL (1997) Joint segregation and linkage analysis of a quantitative trait compared to separate analyses. Genet Epidemiol 14:993-998.

Gauderman WJ, Morrison JL, Carpenter CL, Thomas DC (1997) Analysis of gene-smoking interaction in lung cancer. Genet Epidemiol 14:199-214.

Gauderman WJ, Witte JS, Faucett CL, Morrison JL, Thomas DC (1995) Genetic epidemiologic analysis of quantitative phenotypes using Gibbs sampling. Genet Epidemiol 12:747-752.

George VT, Elston RC (1991) Ascertainment: an overview of the classical segregation analysis model for independent sibships. Biometrical J 33:741-753.

Gershon ES, Matthysse S (1977) X-linkage: Ascertainment through doubly ill probands. J Psychiatr Res 13:161-168.

Ginsburg E, Axenovich TI (1997) Sample size required for predefined linkage decision quality. Genet Epidemiol 14:479-491.

Ginsburg E, Malkin I, Elston RC (2003) Sampling correction in pedigree analysis. Stat Appl Genet Mol Biol 2, http://www.bepress.com/sagmb/vol2/iss1/art2, 1–21.

Ginsburg E, Malkin I, Elston RC (2004) Sampling correction in linkage analysis. Genet Epidemiol 27:87-96.

Go RCP, Elston RC, Kaplan EB (1978) Efficiency and robustness of pedigree segregation analysis. Am J Hum Genet 30:28-37.

Goldgar DE, Easton DF (1997) Optimal strategies for mapping complex diseases in the presence of multiple loci. Am J Hum Genet 60:1222-1232.

Goldstein DR, Dudoit S, Speed TP (2001) Power and robustness of a score test for linkage analysis of quantitative traits using identity by descent data on sib pairs. Genet Epidemiol 20:415–431.

Green JR, Woodrow JC (1977) Sibling method for detecting HLA-linked genes in disease. Tissue Antigens 9:31-35.

Greenberg DA (1986) The effect of proband designation on segregation analysis. Am J Hum Genet 39:329-339.

Greenberg DA (1989) Inferring mode of inheritance by comparison of lod score. Am J Med Genet 35:480-486.

Gu C, Rao DC (1997) A linkage strategy for detection of human quantitative-trait loci. II. Optimization of study designs based on extreme sib pairs and generalized relative risk ratios. Am J Hum Genet 61:211-222.

Gu C, Todorov A, Rao DC (1996) Combining extremely concordant sibpairs with extremely discordant sibpairs provides a cost effective way to linkage analysis of quantitative trait loci. Genet Epidemiol 13:513-533.

Guo SW (1998) Inflation of sibling recurrence-risk ratio, due to ascertainment bias and/or overreporting. Am J Hum Genet 63:252-258.

Guo SW, Thompson EA (1992) A Monte Carlo method for combined segregation and linkage analysis. Am J Hum Genet 51:1111-1126.

Guo X, Elston RC (2000) Two-stage global search designs for linkage analysis. II. Including discordant relative pairs in the study. Genet Epidemiol 18:111-127.

Haldane JBS (1919) The combination of linkage values and calculation of distance between the loci of linked factors. J Genet 8:299-309.

Haldane JBS (1938) The estimation of the frequencies of recessive conditions in man. Ann Eugenics 8:255-262.

Hanis CL, Chakraborty R (1984) Nonrandom sampling in human genetics: Familial correlations. IMA J Math Appl Med Biol 1:193-213.

Haseman JK, Elston RC (1972) The investigation of linkage between a quantitative trait and a marker locus. Behav Genet 2:3–19.

Hodge SE (1988) Conditioning on subsets of the data: Application to ascertainment and other genetic problems. Am J Hum Genet 43:364-373.

Hodge SE, Elston RC (1994) Lods, wrods, and mods: The interpretation of lod scores calculated under different models. Genet Epidemiol 11:329-342.

Hodge SE, Abreu PC, Greenberg DA (1997) Magnitude of type I error when single-locus linkage analysis is maximized over models: A simulation study. Am J Hum Genet 60:217-227.

Hogben L (1931) The genetic analysis of familial traits. I. Single gene substitutions. J Genet 25:97.

Holmans P (1993) Asymptotic properties of affected-sib-pair linkage analysis. Am J Hum Genet 52:362-374.

Hopper JL, Mathews JD (1982) Extension to multivariate normal models for pedigree analysis. Ann Hum Genet 39:485-491.

Jarvik GP (1998) Complex segregation analysis: uses and limitations. Am J Hum Genet 63:942-946.

Keen KJ, Elston RC (2001) A problem in ascertainment. Commun Statist-Theory Meth 30:1615-1631.

Khoury MJ, Beaty TH, Cohen BH (1993) Fundamentals of genetic epidemiology. New York, NY: Oxford University Press.

Knapp M, Seuchter S, Baur M (1994a) Linkage analysis in nuclear families. 1. Optimality criteria for affected sib-pair tests. Hum Hered 44:37-43.

Knapp M, Seuchter S, Baur M (1994b) Two-locus disease models with two marker loci: The power of affected sib-pair tests. Am J Hum Genet 55:1030-1041.

Kosambi DD (1944) The estimation of map distances from recombination values. Ann Eugenics 12:172-175.

Kramer PL, Pauls DL, Price RA, Kidd KK (1989) Estimation of segregation and linkage parameters in simulated data. I. Segregation analyses with different ascertainment schemes. Am J Hum Genet 45:83-94.

Kruglyak L, Lander ES (1995) Complete multipoint sib-pair analysis of qualitative and quantitative traits. Am J Hum Genet 57:439–454.

Lalouel JM, Morton NE (1981) Complex segregation analysis with pointers. Hum Hered 31:312-321.

Lalouel JM, Rao DC, Morton NE, Elston RC (1983) A unified model for complex segregation analysis. Am J Hum Genet 35:816-826.

Lander ES, Kruglyak L (1995) Genetic dissection of complex traits: Guidelines for interpreting and reporting linkage results. Nat Genet 11:241-247.

Lange K, Boehnke M (1983) Extensions to pedigree analysis. IV. Covariance components models for multivariate traits. Am J Med Genet 14:513-524.

Lange K, Elston RC (1975) Extensions to pedigree analysis. I. Likelihood calculation for simple and complex pedigrees. Hum Hered 25:95-105.

Li CC, Mantel N (1968) A simple method of estimating the segregation ratio under complete ascertainment. Am J Hum Genet 20:61-81.

Martinez M, Abel L, Demenais F (1995) How can maximum likelihood methods reveal candidate gene effects on a quantitative trait? Genet Epidemiol 12:789-794.

Martinez M, Goldstein AM, O’Connell JR (2001) Comparison of likelihood approaches for combined segregation and linkage analysis of a complex disease and a candidate gene marker under different ascertainment schemes. Genet Epidemiol 21 (Suppl 1):S760-S765.

Morton NE (1955) The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood type. Am J Hum Genet 8:80-96.

Morton NE (1956) Sequential tests for the detection of linkage. Am J Hum Genet 7:277-318.

Morton NE (1959) Genetic tests under incomplete ascertainment. Am J Hum Genet 11:1-16.

Morton NE (1969) Segregation analysis. In Morton NE (ed): Computer applications in genetics. Honolulu: University of Hawaii Press, pp 129.

Morton NE (1982) Outline of genetic epidemiology. Basel: S Karger.

Morton NE, Maclean C (1974) Analysis of family resemblance. III. Complex segregation of quantitative traits. Am J Hum Genet 26:489-503.

Nie L (2003) Parameter estimation in ascertainment adjustment in complex diseases. Genet Epidemiol 25:388-391.

Olson JM (1997) Likelihood-based models for genetic linkage analysis using affected sib pairs. Hum Hered 47:110-120.

Olson JM, Cordell HJ (2000) Ascertainment bias in the estimation of sibling genetic risk parameters. Genet Epidemiol 18:217-235.

Olson JM, Wijsman EM (1993) Linkage between quantitative trait and marker loci: Methods using all relative pairs. Genet Epidemiol 10:87–102.

Palmer LJ, Jacobs KB, Elston RC (2000) Haseman and Elston revisited: The effects of ascertainment and residual familial correlations on power to detect linkage. Genet Epidemiol 19:456–460.

Patil GP (1997) Weighted distributions. In Encyclopedia of Biostatistics, vol. 6, P. Armitage & T. Colton, eds, Wiley, Chichester, pp. 4735-4738.

Penrose LS (1935) The detection of autosomal linkage in data which consist of pairs of brothers and sisters of unspecified parentage. Ann Eugenics 6:133-138.

Penrose LS (1953) The general purpose sib-pair linkage test. Ann Eugenics 18:120-124.

Rabinowitz D (1996) A pseudolikelihood approach to correcting for ascertainment bias in family studies. Am J Hum Genet 59:726-730.

Rao CR (1965) On discrete distributions arising out of methods of ascertainment. In G.P. Patil (ed.), Classical and Contagious Discrete Distributions, Statistical Publication Society: Calcutta, pp. 320-332.

Risch N (1984) Segregation analysis incorporating linkage markers. I. Single-locus models with an application to type I diabetes. Am J Hum Genet 36:363-386.

Risch N (1990a) Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Am J Hum Genet 46:229-241.

Risch N (1990b) Linkage strategies for genetically complex traits. III. The effect of marker polymorphism on analysis of affected relative pairs. Am J Hum Genet 46:242-253.

S.A.G.E. (2003) Statistical analysis for genetic epidemiology, http://darwin.cwru.edu/sage/

Schaid DJ, Nick TG (1990) Sib-pair linkage tests for disease susceptibility loci: common test vs. the asymptotically most powerful test. Genet Epidemiol 18:33-47.

Seuchter SA, Hebebrand J, Klug B, Knapp M, Lehmkuhl G, Poustka F, Schmidt M, Remschmidt H, Baur MP (2000) Complex segregation analysis of families ascertained through Gilles de la Tourette Syndrome. Genet Epidemiol 19:456–460.

Sham PC, Zhao JH, Curtis D (1997) Optimal weighting scheme for affected sib-pair analysis of sibship data. Ann Hum Genet 61:1661-1668.

Shute NCE, Ewens WJ (1988a) A resolution of the ascertainment sampling problem. II. Generalizations and numerical results. Am J Hum Genet 43:374-386.

Shute NCE, Ewens WJ (1988b) A resolution of the ascertainment sampling problem. III. Pedigrees. Am J Hum Genet 43:387-395.

Slager SL, Vieland VJ (1997) Investigating the numerical effects of ascertainment bias in linkage analysis: development of methods and preliminary results. Genet Epidemiol 14:1119-1124.

Slager SL, Huang J, Vieland VJ (2001) Power comparisons between the TDT and two likelihood-based methods. Genet Epidemiol 20:192-209.

Smith CAB (1959) A note on the effects of method of ascertainment on segregation ratios. Ann Hum Genet 23:311.

Suarez BK, Rice J, Reich T (1978) The generalized sib pair IBD distribution: its use in the detection of linkage. Ann Hum Genet 42:87-94.

Thomas DC, Cortessis V (1992) A Monte Carlo Bayesian method for genetic linkage analysis. Hum Hered 42:63-76.

Thompson EA (1988) Partial and conditional likelihoods in pedigree analysis. Tech rep 141, Department of Statistics, University of Washington, Seattle.

Tierney C, McKnight B (1993) Power of affected sibling method tests for linkage. Hum Hered 43:276-287.

Tiret L, Rigat B, Visvikis S, Breda C, Corvol P, Cambien F, Soubrier F (1992) Evidence, from combined segregation and linkage analysis, that a variant of the angiotensin I converting enzyme (ACE) gene controls plasma ACE levels. Am J Hum Genet 51:197-205.

Vanita, Singh JR, Singh D (1999) Genetic and segregation analysis of congenital cataract in the Indian population. Clin Genet 56:389-393.

Vieland VJ, Hodge SE (1995) Inherent intractability of the ascertainment problem for pedigree data: A general likelihood framework. Am J Hum Genet 56:33-43.

Vieland VJ, Hodge SE (1996) The problem of ascertainment for linkage analysis. Am J Hum Genet 58:1072-1084.

Weeks DE, Lange K (1988) The affected-pedigree-member method of linkage analysis. Am J Hum Genet 42:315-326.

Weinberg W (1912) Weitere Beiträge zur Theorie der Vererbung. IV. Über Methode und Fehlerquellen der Untersuchung auf Mendelsche Zahlen beim Menschen. Arch Rassen Gesellschaftsbiol 9:165-174.

Weitkamp LR, Stancer HC, Persad E, Flood C, Guttormsen S (1981) Depressive disorders and HLA: A gene on chromosome 6 that can affect behavior. N Engl J Med 305:1301-1306.

Whittemore AS, Tu IP (1998) Simple, robust linkage tests for affected sibs. Am J Hum Genet 62:1228-1242.

Xu J, Meyers DA, Pericak-Vance MA (1998) Lod score analysis. In: Approaches to gene mapping in complex human diseases, edited by Haines JL and Pericak-Vance MA.

Zhao H, Zhang H, Rotter JI (1997) Cost-effective sib pair designs in the mapping of quantitative-trait loci. Am J Hum Genet 60:1211-1221.

Zhao LP, Hsu L, Davidov O, Potter J, Elston RC, Prentice RL (1997) Population-based family study designs: An interdisciplinary research framework for genetic epidemiology. Genet Epidemiol 14:368-388.
