BC-NAM Analysis

Statistical models and processes

1. Phenotype data analysis

Phenotypic data from 20 field trials planted over the four years 2005-2008 were used to ensure that the flowering data covered all 24 BC-NAM families. Table 1 (in the paper) details the number of observations within each BC-NAM family within each year; within a year all genotypes are in common. The data were analysed with a multi-environment trial (MET) linear mixed model of the form:

DTF ~ Site + Site:Genotype + Site:Spatial + Site:Error

where Site is a fixed effect with 20 levels allowing for a different mean DTF at each site, and Site:Error is a set of random effects allowing for error at each site. Site:Spatial is a set of terms unique to each site allowing for possible spatial field variation, including column and row random effects and natural variation due to genotype placement in the field. The Site:Genotype term is a random effect which is given a structure to allow for GxE. In this case we used a factor analytic (FA) structure (Smith et al. 2001).

An FA model of order 1 explained 75.5% of the variance and an FA model of order 2 explained 81.7%, so the FA2 model was chosen as the most appropriate. Table 2 (in the paper) details the genetic variance and heritability for each site along with their respective first order loadings (correlations with the average site). Some GxE was observed for flowering, but fitting the FA2 model allows satisfactory comparison of results across the sites, which in turn allows comparisons across all BC-NAM families.
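As an illustration only, one common summary of the variance explained by an FA model is the ratio of the variance captured by the loadings to the total genetic variance, summed over sites. The sketch below assumes that definition; the function name, argument layout and toy values are hypothetical and are not taken from the analysis reported here.

```python
from typing import List


def fa_percent_variance(loadings: List[List[float]], specific_var: List[float]) -> float:
    """Percentage of genetic variance explained by the FA loadings.

    loadings[i][k] is the loading of site i on factor k (assumed layout);
    specific_var[i] is the site-specific variance for site i.
    Computes 100 * tr(LL') / tr(LL' + Psi).
    """
    if len(loadings) != len(specific_var):
        raise ValueError("one specific variance is needed per site")
    # tr(LL') equals the sum of all squared loadings.
    explained = sum(value ** 2 for row in loadings for value in row)
    total = explained + sum(specific_var)
    return 100.0 * explained / total


# Toy example with two sites and one factor (values are made up).
toy_pct = fa_percent_variance([[1.0], [2.0]], [1.0, 1.0])
```

Comparing this percentage between an order 1 and an order 2 fit is one way to judge whether the extra factor is worthwhile.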

2. Single Marker Analysis

Each BC-NAM family was genotyped separately, so when the families were joined together there were large amounts of missing data, and many markers that were polymorphic in only one BC-NAM family were at the same location as other markers.

The populations are unbalanced across the sites (Table 1 in the paper) and the markers are unbalanced across the populations (Table 3 in the paper). There were a total of 932 individual markers, with between 146 and 489 markers in each BC-NAM family (average 373). The concurrence between markers was also unbalanced, with between 56 and 389 concurrent markers between each pair of BC-NAM families (average 207). These imbalances cause computational difficulties when attempting to fit a statistical model containing all BC-NAM families and all sites. To overcome these difficulties we fitted a linear mixed model with the aim of finding significant markers within each individual BC-NAM family.

To obtain values that indicate significant marker effects within a single BC-NAM family, we subsetted the marker data for each BC-NAM family by selecting only the markers that were between 5% and 95% polymorphic. For example, for the BC-NAM family Ai4 we selected a subset of 489 markers from the original 932 (see Table 3 in the paper). The phenotypic data, however, still contained information from all 24 BC-NAM families, so we needed to include a family effect in our statistical model. We fitted a linear mixed model of the form:

DTF ~ Fam + Fam:Marker + Site + Site:Genotype + Site:Spatial + Site:Error

where Fam is a fixed effect allowing for mean differences between BC-NAM families and Fam:Marker is a fixed effect interaction of the single marker with all the BC-NAM families. Site is a fixed mean flowering effect for each site, Site:Genotype is a random variance component term, Site:Spatial are the spatial effects for each site obtained from the MET in section 1, and Site:Error are the error effects for each site, also obtained from the MET in section 1. The model is fitted for each single marker, and we then extract the effect, standard error and Z ratio for the particular BC-NAM family of interest.
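The marker subsetting step described in this section can be sketched as follows. This is a minimal sketch under stated assumptions: markers are assumed to be scored 0/1 with None for missing calls, and "between 5% and 95% polymorphic" is interpreted as the frequency of the 1 score among non-missing calls within the family. The function name and data layout are hypothetical.

```python
from typing import Dict, List, Optional


def polymorphic_markers(
    scores: Dict[str, List[Optional[int]]],
    lower: float = 0.05,
    upper: float = 0.95,
) -> List[str]:
    """Return the markers whose allele frequency lies in [lower, upper].

    scores maps a marker name to its 0/1 calls within ONE family,
    with None marking a missing call (assumed coding).
    """
    keep = []
    for marker, calls in scores.items():
        observed = [c for c in calls if c is not None]
        if not observed:
            continue  # marker not scored in this family
        freq = sum(observed) / len(observed)
        if lower <= freq <= upper:
            keep.append(marker)
    return keep


# Toy data: m1 is polymorphic, m2 is monomorphic, m3 is entirely missing.
toy_scores = {"m1": [1, 0, 1, None], "m2": [1, 1, 1, 1], "m3": [None, None]}
kept = polymorphic_markers(toy_scores)
```

The single marker model would then be fitted once per retained marker; that fitting step needs mixed model software and is not sketched here.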

3. Concatenating the single marker array

Our aim was to identify genomic regions of interest and possible QTL within each BC-NAM family. Our 932 × 24 marker by BC-NAM family matrix of Z ratios was converted into P values using a normal distribution so that we could apply our multiple comparisons measure. These P values tested for significant differences in flowering between the marker alleles.
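The conversion from Z ratios to P values can be sketched as below. The text states only that a normal distribution was used; the two-sided form of the test is an assumption made here, as are the function names and the use of None for marker by family cells with no estimate.

```python
import math
from typing import List, Optional


def z_to_p(z: float) -> float:
    """Two-sided P value for a Z ratio under a standard normal (assumed two-sided)."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def z_matrix_to_p(z_matrix: List[List[Optional[float]]]) -> List[List[Optional[float]]]:
    """Apply z_to_p to a marker by family matrix, keeping None for missing cells."""
    return [[None if z is None else z_to_p(z) for z in row] for row in z_matrix]


# Toy matrix: two markers by two families, one missing cell.
toy_p = z_matrix_to_p([[0.0, None], [1.959964, -1.959964]])
```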

Since each BC-NAM family was genotyped separately, there were many markers that were at the same cM location within a LG but appeared in different BC-NAM families. A concatenating process was carried out to create a unique marker at each location containing as many BC-NAM families as possible. We selected the marker with the lowest P value for each BC-NAM family within 5cM windows, moving along each LG in 1cM steps. For example, we selected the marker with the lowest P value for each BC-NAM family in LG 1 from 0cM to 5cM, then repeated this to find the lowest P value from 1cM to 6cM, and so on. This process generated a marker by BC-NAM family matrix of 1001 × 24 P values. This matrix was still not a complete table of values: each cM location has between 1 and 24 values, with an average cM location containing between 15 and 18 values.
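The sliding window selection can be sketched as follows for a single LG. This is an illustration under assumptions, not the code used for the analysis: windows are treated as half-open intervals [start, start + 5), windows containing no markers are dropped, and the function name and tuple layout are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def concatenate_windows(
    markers: List[Tuple[float, str, float]],
    width: float = 5.0,
    step: float = 1.0,
) -> Dict[float, Dict[str, float]]:
    """Lowest P value per family within each sliding window on one LG.

    markers holds (position in cM, family name, P value) tuples.
    Returns {window start: {family: lowest P value in that window}}.
    """
    if not markers:
        return {}
    max_pos = max(pos for pos, _, _ in markers)
    n_steps = int(math.floor(max_pos / step)) + 1
    result: Dict[float, Dict[str, float]] = {}
    for k in range(n_steps):
        start = k * step
        best: Dict[str, float] = {}
        for pos, family, p in markers:
            if start <= pos < start + width:
                if family not in best or p < best[family]:
                    best[family] = p
        if best:  # skip windows that contain no markers
            result[start] = best
    return result


# Toy LG with two families (positions and P values are made up).
toy_windows = concatenate_windows(
    [(0.5, "A", 0.04), (2.0, "A", 0.01), (2.0, "B", 0.2), (7.0, "B", 0.03)]
)
```

Each window start then acts as one concatenated marker location, with one value per family that has at least one marker inside the window.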

The table below details the number of markers for each LG, along with their respective length in cM, for our new set of concatenated marker locations. The discrepancy between our new set and the original set arises because the original set did not have equally spaced markers along the LG and had many markers at the same location. The new set has marker locations spaced at 1cM intervals, but in some places there may be a gap where no markers were present, while in other places our 5cM window caused a spread of values that filled previously empty gaps.

Table FileS1. Details of DArT markers used genome-wide in BC-NAM data analysis

LG / Original set: # markers / Original set: length (cM) / Concatenated set: # markers / Concatenated set: length (cM)
SBI-01 / 99 / 181.7 / 123 / 181
SBI-02 / 135 / 223.32 / 135 / 223
SBI-03 / 91 / 153.7 / 116 / 153
SBI-04 / 98 / 166.7 / 117 / 166
SBI-05 / 111 / 118.5 / 98 / 118
SBI-06 / 68 / 164.1 / 106 / 164
SBI-07 / 72 / 132.8 / 69 / 132
SBI-08 / 128 / 131.9 / 89 / 131
SBI-09 / 62 / 135.6 / 83 / 135
SBI-10 / 68 / 112.2 / 65 / 112
Totals / 932 / 1520.52 / 1001 / 1515

In order to identify genomic areas of interest across BC-NAM families we needed a value that is representative of the significance level at each cM location. We conducted Fisher's combined probability test by calculating a Fisher test statistic at each location from the n BC-NAM families with a value at that location. The Fisher statistic can be calculated as follows:

$$X^2 = -2 \sum_{i=1}^{n} \ln(p_i)$$

where $p_i$ is the P value for the i-th BC-NAM family at that cM location and n is a number between 1 and 24 representing the number of BC-NAM families at that cM location. We then obtained a P value for the significance by treating the Fisher statistic as following a chi-squared distribution with 2n degrees of freedom ($\chi^2_{2n}$). This generated a value at each cM location that may be used to examine cM locations of interest across all 24 BC-NAM families.
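Fisher's combined probability test at a single cM location can be sketched as below. The statistic is minus twice the sum of the natural logs of the n P values, referred to a chi-squared distribution with 2n degrees of freedom. Because the degrees of freedom are always even, the upper tail probability has a closed form and no statistics library is needed. The function name is hypothetical, and P values are assumed to be strictly greater than zero.

```python
import math
from typing import List, Tuple


def fisher_combined(p_values: List[float]) -> Tuple[float, float]:
    """Return (Fisher statistic, combined P value) for n P values.

    The combined P value is the upper tail of a chi-squared
    distribution with 2n degrees of freedom.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("at least one P value is required")
    stat = -2.0 * sum(math.log(p) for p in p_values)
    # For even d.f. 2n: P(X > x) = exp(-x/2) * sum_{k=0}^{n-1} (x/2)^k / k!
    half = stat / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return stat, math.exp(-half) * total


# Toy location where two families each have P = 0.05.
toy_stat, toy_combined = fisher_combined([0.05, 0.05])
```

With a single family (n = 1) the combined P value equals the input P value, which matches the statement that no multiple comparisons are made at such locations.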

The false discovery rate varies for each cM location and is dependent on the number of observations. At cM locations where n = 1 no multiple comparisons are made, so the significance level can be kept the same as the criterion used for the individual BC-NAM family analyses (in our case $\alpha$ = 0.001, or 0.1%). As n increases we need to adjust $\alpha$ according to the formula $\alpha_n = \alpha (n + 1) / (2n)$; for example, for n = 18 our significance threshold becomes 0.001 × 19/36 = 0.000528. To maintain a 0.1% significance level we adjusted each Fisher P value according to the size of n at each cM location.
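The threshold adjustment can be sketched as below. The rule alpha × (n + 1) / (2n) is inferred from the worked example for n = 18 and from the n = 1 case, so it should be read as an assumption about the formula rather than a quotation of it. The function names are hypothetical.

```python
def adjusted_alpha(alpha: float, n: int) -> float:
    """Significance threshold for a location with n families.

    Assumed rule: alpha * (n + 1) / (2 * n), which leaves alpha
    unchanged when n = 1.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return alpha * (n + 1) / (2 * n)


def is_significant(fisher_p: float, n: int, alpha: float = 0.001) -> bool:
    """Compare a Fisher combined P value with the threshold for its n."""
    return fisher_p <= adjusted_alpha(alpha, n)


# Threshold for the worked example in the text (n = 18).
threshold_18 = adjusted_alpha(0.001, 18)
```

Comparing each Fisher P value with the threshold for its own n is equivalent to rescaling the P value by 2n / (n + 1) and comparing it with the unadjusted 0.001.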