Quality Control and Quality Assurance (QA/QC) Report

Supplementary Materials

QA/QC Methods

Computer Software

SAS 9.3 (SAS Institute, Cary, NC, USA) and R (R Core Team, 2013) were used for initial genotype data processing and preparation prior to the QA/QC steps. Unless otherwise noted, PLINK (Purcell et al., 2007) was used to generate the QA/QC metrics, and R was used for analysis and graphical display of QA/QC information.

Targeted Sample for Genotyping

In all, 2,020 DNA samples were genotyped, including duplicate DNA samples from the same individual to be used for quality assurance purposes. The final number of non-duplicate individuals with acceptable genotype quality is presented below. The targeted sample for genotyping was the National Longitudinal Study of Adolescent Health (Add Health) sibling pairs who provided a saliva sample and consented to participate in a genome-wide association project during the Wave IV data collection (Harris et al., 2013). Note that this sample includes not only sibling pairs but also other types of putative relationship pairs who share a common Add Health family identifier. Known MZ twin pairs were not part of the targeted genotyping sample.

Sample Preparation and Genotyping

Saliva was collected during Wave IV using the Oragene collection methods (Oragene, DNA Genotek, Ottawa, Ontario, Canada). Genomic DNA was isolated from the Oragene solutions at the Institute for Behavioral Genetics (IBG) at the University of Colorado Boulder using ZymoResearch (Irvine, CA, USA) Silicon-A™ plates according to protocols supplied by the manufacturer. Extracted DNA was normalized to 50 ng/µl using Picogreen® fluorescence and sent to Expression Analysis, Inc. (Durham, NC, USA) for genotyping. The genome-wide platform used for this study is the Illumina HumanOmni1-Quad v1 (Illumina Inc., San Diego, CA, USA), which includes a total of 1,134,514 genetic and structural variants. Clustering, calling and scoring of genotypes were performed using Illumina’s GenCall software with version 1.0 (H) product files. The initial genotypes were called using Illumina’s “TOP” strand designation. For further details on strand designation as it relates to “TOP”, “(fwd)” and “(+)”, see Nelson et al. (2012). Each 96-well plate contained one randomly assigned inter-plate duplicate, one CEPH control and one non-template control.

SNP Marker Set

We removed markers with no reliable map location based upon the Human Genome Reference Build 37.1 (9,837 markers). The genotyping platform includes structural variation such as copy-number variants (CNVs) and insertion/deletion variants (INDELs). While potentially useful and interesting for future projects, the initial focus was on biallelic single-nucleotide polymorphisms (SNPs) to establish sample quality and quality control parameters as well as to conduct the initial genome-wide association analysis. There are 126,969 “markers of interest” identified in Illumina’s product files, with the majority of these being intensity-only probes (123,295). There are an additional 128 markers identified as INDELs, which were also removed given the focus of the present study on SNP markers. Removal of these markers of interest and INDELs resulted in 1,000,970 SNP markers. Further, for the purposes of this study, we chose to remove the Y chromosome SNPs (Y; 1,209 markers), mitochondrial SNPs (MT; 25 markers) as well as SNPs in the pseudo-autosomal region of X (XY; 872 markers). This step results in a total of 998,864 SNPs (2,106 markers removed).

Additionally, we selected markers that map to the 1000 Genomes Project (phase 1, version 3) reference set. This was done not only to ensure that only the most validated SNPs are used for quality checks and analyses, but also to support strand alignment and to update Illumina marker names with a corresponding dbSNP identifier. Out of the 998,864 markers, 959,527 markers were found in the 1000 Genomes reference database (phase 1, version 3). Illumina markers without a valid dbSNP identifier (i.e. markers with the prefix “kgp” or “GA”) were linked to a dbSNP identifier by merging them based on map location (chromosome and base-pair location). The final set of genotypes used for the QC/QA steps uses SNP markers from chromosomes 1-22 and the X chromosome. This set of markers allows us to reliably perform various genome-wide quality assessments in addition to checks of biological sex. In Supplemental Table 1, we provide a breakdown of the number of markers per chromosome used for this study. Further pruning of markers was conducted at various stages for QC/QA purposes and is noted accordingly throughout the text.
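The linking of markers to dbSNP identifiers by map location can be illustrated with a short sketch. This is a minimal Python illustration, not the SAS/R/PLINK workflow used for this report; the function name, the record layout, and the toy marker names, identifiers and positions are all hypothetical.

```python
def link_by_position(markers, reference):
    """Attach a dbSNP ID to each marker whose (chrom, pos) is in the reference."""
    # Index the reference set by map location for direct lookups.
    ref_index = {(r["chrom"], r["pos"]): r["rsid"] for r in reference}
    linked, unmatched = [], []
    for m in markers:
        rsid = ref_index.get((m["chrom"], m["pos"]))
        if rsid is None:
            unmatched.append(m["name"])  # no reference entry at this location
        else:
            linked.append({**m, "rsid": rsid})
    return linked, unmatched


# Toy records (hypothetical names and positions, for illustration only).
markers = [
    {"name": "kgp_demo_1", "chrom": "1", "pos": 1000},
    {"name": "GA_demo_2", "chrom": "X", "pos": 2500},
    {"name": "kgp_demo_3", "chrom": "2", "pos": 42},
]
reference = [
    {"rsid": "rs_demo_a", "chrom": "1", "pos": 1000},
    {"rsid": "rs_demo_b", "chrom": "X", "pos": 2500},
]

linked, unmatched = link_by_position(markers, reference)
print([m["rsid"] for m in linked])
print(unmatched)
```

In this sketch, markers with no reference entry at their location are returned separately so that they can be counted and dropped, mirroring the restriction to markers found in the reference set.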

QA/QC Results

Missing Data Rate and Sample Quality

To generate missing data rates for individual samples, we focused on the 959,527 SNP markers across chromosomes 1-22 and X. The average missing data rate for the 2,020 samples is 0.0126 (SD = 0.0481) with a range of 0.0009 to 0.6294. The missing data thresholds that are often used for GWAS range from 3-5%. Further, data are often inspected for the distribution of mean heterozygosity across the autosomes. In general, excess heterozygosity may be an indicator of sample contamination, while less than expected heterozygosity is thought to be an indicator of inbreeding. Mean heterozygosity in this context is defined as (N-O)/N, where N is the number of non-missing genotypes and O is the observed number of homozygous genotypes. Supplemental Figure 1 displays the proportion of missing genotypes (log-scale on the x-axis) versus heterozygosity rate (y-axis). The two horizontal dotted lines mark 2 standard deviations above and below the mean heterozygosity in this sample. Based upon the distribution of missing data, it could be argued that a missing data rate threshold of anywhere from 0.03 to 0.05 could reasonably be adopted. A threshold of 0.03 (vertical dotted line) would remove 101 samples while a threshold of 0.05 would remove 76 samples (~5% and 4% of the total sample, respectively), which is well within the range of acceptable sample loss among traditional genome-wide association studies. However, rather than removing respondents at this stage, we opted for an iterative procedure that first removes low-quality SNP markers and then arrives at a final set of samples to be removed.
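The two sample-level metrics discussed above can be sketched in a few lines of Python. The sketch assumes genotypes coded as 0/1/2 with None for a no-call, and takes the heterozygosity rate to be the share of non-missing genotypes that are heterozygous; the function names and the coding scheme are illustrative assumptions, not the PLINK implementation used for this report.

```python
from statistics import mean, stdev


def sample_metrics(genotypes):
    """Return (missing_rate, het_rate) for one sample.

    Genotypes are coded 0/1/2 (copies of one allele); None marks a no-call.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)                            # N: non-missing genotypes
    o = sum(1 for g in called if g in (0, 2))  # O: homozygous genotypes
    missing_rate = 1 - n / len(genotypes)
    het_rate = (n - o) / n if n else float("nan")
    return missing_rate, het_rate


def het_outliers(het_rates, k=2.0):
    """Indices of samples whose rate lies more than k SDs from the sample mean."""
    m, s = mean(het_rates), stdev(het_rates)
    return [i for i, h in enumerate(het_rates) if abs(h - m) > k * s]


# Toy sample: six markers, one no-call, two heterozygous calls.
print(sample_metrics([0, 1, 2, None, 1, 0]))
```

A sample would then be a candidate for removal if its missing rate exceeds the chosen cutoff or if its index is returned by `het_outliers`.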

Final QA/QC Marker Set

We initially assessed marker quality by calculating the missing data rate for each SNP using only samples that had a genotyping call rate of at least 90%. This step generated a list of SNP markers that could be considered of low quality using a SNP marker call rate threshold of 95%. A total of 18,665 SNP markers were removed using the 95% call rate threshold, leaving 940,862 SNP markers across chromosomes 1-22 and X (see Supplemental Table 1 for a breakdown by chromosome).
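The marker filter described above (SNP call rates computed only among samples with a call rate of at least 90%, then a 95% SNP call rate cutoff) can be sketched as follows. The genotype matrix layout (rows are samples, columns are SNPs, None is a no-call) and the function name are assumptions made for illustration; the report itself relied on PLINK for these metrics.

```python
def low_call_snps(matrix, sample_min=0.90, snp_min=0.95):
    """Return column indices of SNPs whose call rate falls below snp_min.

    SNP call rates are computed only over samples (rows) whose own call
    rate is at least sample_min.
    """
    n_snps = len(matrix[0])
    good = [
        row for row in matrix
        if sum(g is not None for g in row) / n_snps >= sample_min
    ]
    if not good:
        raise ValueError("no samples pass the sample call rate threshold")
    flagged = []
    for j in range(n_snps):
        call_rate = sum(row[j] is not None for row in good) / len(good)
        if call_rate < snp_min:
            flagged.append(j)
    return flagged


# Toy data: 20 well-genotyped samples, one failed sample, 10 SNPs.
n_snps = 10
matrix = [[0] * n_snps for _ in range(20)]
matrix[0][3] = None              # SNP 3 is missing in two good samples
matrix[1][3] = None
matrix.append([None] * n_snps)   # failed sample, excluded from SNP rates

print(low_call_snps(matrix))  # [3]
```

Excluding poorly genotyped samples first keeps a handful of failed samples from dragging down the call rate of otherwise reliable markers.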

Final QA/QC Sample

We then recalculated the missing data rate for samples after removal of the low-quality markers to get a more accurate estimate of the sample missing data rates. Using a marker set with low-call markers removed, a 0.03 threshold would remove 98 samples while a 0.05 threshold would remove 74 samples. Two different criteria were used to flag individual samples for potential removal from the analysis data set: first, a missing data rate > 0.05; second, an individual mean heterozygosity rate that falls outside ± 2 SD of the mean heterozygosity rate of the entire sample. No samples were removed because of excess heterozygosity. However, based upon the missing data rate of > 0.05, we removed 74 genotyped samples. As noted above, this results in a sample loss of approximately 3-4%. Therefore, the sample used for subsequent QC/QA steps is N=1,946. Note that this number includes duplicate samples, as part of the QC/QA is to assess duplicate concordance. The number of individuals (excluding duplicates) that meet the thresholds above is 1,888 (note that this includes one known MZ twin pair). In situations where there were duplicate samples from the same respondent, we used the set of genotypes (DNA sample) that yielded the higher genotyping call rate.
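The sample selection rules above (drop samples with a missing data rate above 0.05, then keep the better-genotyped sample when a respondent has duplicates) can be sketched as follows. The record fields, identifiers and function name are hypothetical and chosen only for illustration.

```python
def select_analysis_samples(samples, max_missing=0.05):
    """Drop samples above the missing-rate cutoff, then keep one per respondent."""
    best = {}
    for s in samples:
        if s["missing_rate"] > max_missing:
            continue  # fails the sample-level missing data criterion
        current = best.get(s["respondent"])
        # A lower missing rate is the same as a higher genotyping call rate.
        if current is None or s["missing_rate"] < current["missing_rate"]:
            best[s["respondent"]] = s
    return best


# Toy records (hypothetical identifiers and rates).
samples = [
    {"respondent": "R1", "sample": "S1", "missing_rate": 0.010},
    {"respondent": "R1", "sample": "S2", "missing_rate": 0.002},  # duplicate of R1
    {"respondent": "R2", "sample": "S3", "missing_rate": 0.200},  # fails cutoff
    {"respondent": "R3", "sample": "S4", "missing_rate": 0.030},
]

kept = select_analysis_samples(samples)
print({r: s["sample"] for r, s in kept.items()})
```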

Sex Checks

Using the X chromosome to check for biological sex can be useful to identify problems related to sample mix-up and/or misaligned coding files. After removal of 74 samples based on missing data rates, 138 samples were “flagged” by PLINK using an inbreeding (homozygosity; F) estimate for the X chromosome. PLINK conservatively expects a homozygosity estimate > 0.80 for males and < 0.20 for females; essentially, any homozygosity estimates between 0.20 and 0.80 are flagged. Supplemental Figure 2 displays the X chromosome homozygosity for Add Health respondents coded as male (Supplemental Figure 2A) and female (Supplemental Figure 2B), respectively. Out of the 1,888 individuals, 138 individuals were flagged for further inspection. The first conclusion to be drawn from this is that there is no evidence of a widespread issue linking samples to coded information such as biological sex. Of the 138 flagged individuals, there were only 4 self-reported males. Of these 4 males, 3 would be considered female based upon their X chromosome (F = 0.16, 0.10 and 0.01). The remaining male respondent exhibits a relatively ambiguous sex via genotype (F = 0.36). Of the 134 self-reported females, nearly all fall within the expected null distribution and are therefore likely to have been flagged unnecessarily. However, there are 2 females who have homozygosity estimates (F = 0.58 and 0.38) similar to that of the ambiguous male noted previously. Collectively, these three individuals (coded as one male and two female) exhibit heterozygosity estimates that are consistent with chromosomal abnormalities. No samples were removed at this step, as there is no evidence of widespread sample mix-ups or linking file issues.
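The flagging rule described above can be written out as a small sketch. It uses the thresholds stated in the text (F < 0.20 read as female, F > 0.80 read as male, anything in between ambiguous); the function name and return format are hypothetical, and this is not PLINK's own code.

```python
def sex_check(reported_sex, f_x, female_max=0.20, male_min=0.80):
    """Return (inferred_sex, flagged) from an X chromosome homozygosity estimate."""
    if f_x < female_max:
        inferred = "female"
    elif f_x > male_min:
        inferred = "male"
    else:
        inferred = "ambiguous"
    # Flag any sample whose genotype-based call does not match the coded sex.
    return inferred, inferred != reported_sex


# F values reported in the text for flagged respondents.
print(sex_check("male", 0.16))    # ('female', True)
print(sex_check("male", 0.36))    # ('ambiguous', True)
print(sex_check("female", 0.58))  # ('ambiguous', True)
```

Under this rule, a self-reported female with F only slightly above 0.20 is flagged even though she may sit in the upper tail of the female distribution, which is why flagged samples were inspected rather than removed automatically.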

Duplicate Concordance

There were 53 duplicate pairs that passed initial QA/QC included in the Add Health sibling pairs file. Pairs with pairwise mean IBD (PI_HAT) values that exceed 0.90 are thought to be either duplicate samples or MZ twins (PI_HAT values have a maximum of 1.0, indicating perfect concordance). Equivalently, pairs with pairwise Kinship estimates above 0.354 are considered duplicate samples or MZ twins (Manichaikul et al., 2010) (Kinship estimates have a maximum of 0.5, indicating perfect concordance). The average mean IBD (PI_HAT) for the 53 duplicate pairs is 0.9995 (minimum 0.9987, maximum 1.0). The average Kinship estimate for the same 53 duplicate pairs is 0.4998 (minimum 0.4996, maximum 0.5). Overall, the concordance for these duplicate pairs is quite high using both estimation methods.
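The two cutoffs quoted above can be combined into a simple check. This is a minimal sketch using the thresholds stated in the text (PI_HAT > 0.90, Kinship > 0.354); the function name and its interface are hypothetical.

```python
def looks_like_duplicate_or_mz(pi_hat=None, kinship=None):
    """True if either relatedness estimate passes its duplicate/MZ cutoff."""
    if pi_hat is None and kinship is None:
        raise ValueError("need at least one of pi_hat or kinship")
    by_pi_hat = pi_hat is not None and pi_hat > 0.90
    by_kinship = kinship is not None and kinship > 0.354
    return by_pi_hat or by_kinship


# Minimum values reported for the 53 duplicate pairs.
print(looks_like_duplicate_or_mz(pi_hat=0.9987))   # True
print(looks_like_duplicate_or_mz(kinship=0.4996))  # True
```

Both reported minima clear their cutoffs by a wide margin, which is consistent with the conclusion that the duplicate pairs are highly concordant.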

Supplemental Table 1: Number of SNP markers per chromosome for the QA/QC Report

Chromosome         All        Call Rate > 95%
1                  79,267     77,907
2                  74,584     73,218
3                  60,733     59,576
4                  56,854     55,756
5                  55,042     54,014
6                  71,959     70,424
7                  50,291     49,240
8                  50,192     49,295
9                  44,306     43,535
10                 50,524     49,570
11                 47,380     46,475
12                 46,078     45,254
13                 33,572     32,933
14                 29,098     28,613
15                 28,416     27,962
16                 30,304     29,753
17                 26,876     26,391
18                 26,841     26,388
19                 21,317     20,803
20                 26,195     25,749
21                 13,503     13,257
22                 13,646     13,396
Total (1-22)       936,978    919,509
X                  22,549     21,353
Total (1-22, X)    959,527    940,862

Supplemental Figure 1: Proportion of missing genotypes versus heterozygosity rates among the 2,020 samples. Horizontal lines correspond to 2x the standard deviation of heterozygosity rate while the vertical line corresponds to a missing data rate of 0.03 (97% genotyping call rate).

[Figure: scatter plot. x-axis: Proportion of missing genotypes (log scale, 0.001 to 1); y-axis: Heterozygosity rate.]

Supplemental Figure 2: X chromosome homozygosity for individuals who are coded in Add Health as male (A) and female (B).

[Figure: two histograms titled "All Male Samples" and "All Female Samples". x-axis: X Chromosome Homozygosity Estimate; y-axis: Frequency.]

Supplemental Figure 3: MDS principal coordinate (PC) estimates generated by KING.

A: PC1 vs PC2

B: PC2 vs PC3

C: PC3 vs PC4

D: PC4 vs PC5

Supplemental Figure 4: Genetic ancestry by self-identified ethnic group where CEU = Europe, CHB = China, JPT = Japan, YRI = Africa and AMR = America.

4A: Self-Identified White

4B: Self-Identified Black

4C: Self-Identified Hispanic

4D: Self-Identified Native American

4E: Self-Identified Asian

Supplemental Figure 5: Quantile-Quantile Plot of the unweighted GWAS p-values
