Example 1: A friend sends you a manuscript from a student and wants you to comment on it. The paper discusses results from a genome-wide association study (GWAS) for milk yield in three different cattle breeds, Swedish Red, Holstein Friesian and Gir/Gyr (number of animals: 200, 150 and 50, respectively). The paper states that all cattle were genotyped using the same marker panel, a cattle SNP array with 60,000 SNPs. There is no further information on the markers/genotypes. The graph (a so-called Manhattan plot) shows three significant peaks. The first is a single SNP (p-value 0.000001). At the second peak a total of 10 SNPs exceeded the significance threshold, of which five were highly significant. The third peak included 20 significant as well as five highly significant SNPs. How could your friend test the reliability of the results?

a) Which steps are required when analysing results from a SNP array? How can the reliability of SNPs be evaluated (name at least four and add detailed arguments of their relevance for at least two)?

Quality control (QC): call rate, minor allele frequency (MAF), Hardy-Weinberg equilibrium (HWE), inheritance checks (if possible); double controls on the platform. The call rate (CR) might be a problem because of the genotype calling done by the platform algorithms.
Do a validation study using data from another population.
Make sure the p-values were corrected for false positives (e.g. by permutation).

b) There were different populations and numbers of individuals included in the analysis. What could be solutions to check for the impact of the data structure on the results?

There might be a bias: the breeds differ and have different numbers of observations.
The population structure should be checked and included in the analysis; alternatively, the analysis could be performed for each population separately.

c) Do you think information on the linkage disequilibrium (LD) is required in the study? Which result suggests that this might be needed?
Explain briefly what LD means.

One peak included only a single SNP, which could indicate a problem with that SNP. At the other peaks multiple SNPs were significant; it might be useful to plot the LD and check whether those with the same p-values are in complete LD.
LD usually exists between two close markers (two linked loci); it is an association between variants. If a marker is in LD with a disease locus, the trait is associated with the marker. The mutation is shared by affected individuals through common descent. Markers in LD form a haplotype: shared alleles at nearby loci represent the haplotype of the ancestral chromosome on which the mutation first occurred. LD is measured as r² or D'.

d) Make a final comment on how you would suggest the study to be performed, including information on which steps are relevant when doing a genome-wide association study.

Run the analysis for each population separately and compare with the combined results.
1. Plan the study: choose the appropriate design to target your hypothesis
2. Collect data (based on the design choose the data, e.g. number of markers)
3. Remove 'problematic' data from the dataset (prepare genotypes but also phenotypes)
4. Identify other pedigree-related problems, e.g. population stratification
5. Association analysis
6. Correction of results to minimize false positives

Example 2: You have received the cow data during the exercise. Please run the GWAS using PLINK.

a) How many cows were included in the analysis?

100 males and 400 females. Open PLINK, go to the right directory and run:
    plink --file cow --no-pheno --pheno cow.phe --noweb

b) Were any individuals removed because of poor quality? How many and what is the reason?

You could add quality-control criteria, for example:
    plink --file cow --no-pheno --pheno cow.phe --noweb --geno 0.25 --maf 0.05 --mind 0.25
This would give you: 0 animals were removed.

c) How many SNPs were included in the analysis and how many passed the quality filter?
You can also use the information from b): 852 SNPs failed the test for minor allele frequency; none failed the other tests.

d) How many SNPs are significantly associated with the different phenotypes before and after adjustment?

    plink --file cow --no-pheno --pheno cow.phe --noweb --geno 0.25 --maf 0.05 --mind 0.25 --all-pheno --assoc --qt-means --mperm 1000
Alternatively, you can run the adjustment statement:
    plink --file cow --no-pheno --pheno cow.phe --noweb --geno 0.25 --maf 0.05 --mind 0.25 --all-pheno --assoc --qt-means --adjust
Transfer the results to Excel and check how many SNPs have a p-value below 0.05.

e) Choose one trait and check if the 10 most significant markers are in LD.

E.g. for trait 1 (using the non-adjusted data): SNP9309, SNP9295, SNP9308, SNP7145, SNP9497, SNP1683, SNP2946, SNP9321, SNP8151, SNP8599
    plink --file cow --no-pheno --pheno cow.phe --noweb --ld SNP9309 SNP9295
Run this for each pair of SNPs.

f) Let’s assume that trait number 3 is related to fat content in the milk. One significantly associated SNP, SNP1169, would be marker rs110075862. Please identify 5 candidate genes in the region of the marker and discuss the most interesting of these in relation to fat content in the milk.

Check the position of the marker on Ensembl and then identify genes in close proximity.
Examples of genes: DGAT1, HSF1, BOP1, SCRT1, TMEM249
Check, for example, publications on NCBI PubMed ('DGAT1, cattle, milk fat'), or just the function of the gene in GeneCards or OMIM.
DGAT1 is a strong candidate for fat content in the milk.

Example 3: Over many years you ran a number of studies to identify genetic regions for an inherited disease (BLAD) and for milk yield in cattle. You have found that both traits have a genetic background, thus selection is possible. Now the cattle industry wants to apply your results in their breeding program. Which method(s) would you suggest as being useful for the selection of cattle for trait 1 (inherited disease) and for trait 2 (milk yield)?
Describe briefly the methods and the information needed in order to use the information for selection in breeding populations.

For the inherited disease, the suggested method would be marker-assisted selection, i.e. the use of a marker test to select breeding stock and avoid the inherited disease. BLAD (bovine leukocyte adhesion deficiency) is a recessive disorder and a gene test is available. Depending on the test and the mutation, such a gene test can be used only within or also across populations. It can be a direct or an indirect (LD or linkage) test.

The other trait, milk yield, is likely a complex trait. Having found that there is a genetic background, selection is possible. However, a single marker test is unlikely to work for such a complex trait, so the use of genomic selection is suggested. For genomic selection we need a training (or reference) population and the selection candidates. Animals in the training population need to have phenotypes and genotypes, while the selection candidates only need to be genotyped. Genomic breeding values (GEBV) will be estimated for all selection candidates based on the marker effects derived from the training population. There are certain assumptions which need to be considered. Further information required includes pedigree information, good genotype data (to cover the LD), reliable phenotypes…
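The last step of genomic selection described above (GEBVs for the candidates from marker effects derived in the training population) can be illustrated with a minimal Python sketch. It assumes the marker effects have already been estimated in the training population; that estimation is not shown. The function name `gebv`, the animal names, the genotypes and the effect sizes are all hypothetical values invented for illustration, not results from the exercise data.

```python
# Minimal sketch: genomic breeding values (GEBV) for selection candidates.
# Assumption: marker effects were already estimated in a training population
# with phenotypes and genotypes. All numbers below are made up for illustration.

def gebv(genotype, marker_effects):
    """Sum of allele counts (0, 1 or 2 copies per SNP) times the marker effects."""
    if len(genotype) != len(marker_effects):
        raise ValueError("genotype and marker_effects must have the same length")
    return sum(count * effect for count, effect in zip(genotype, marker_effects))

# Hypothetical effects per allele copy (e.g. kg milk) for five SNPs.
marker_effects = [12.0, -5.0, 3.5, 0.0, 8.0]

# Hypothetical selection candidates: genotyped, but without own phenotypes.
candidates = {
    "bull_A": [2, 0, 1, 1, 0],
    "bull_B": [0, 2, 2, 0, 1],
    "bull_C": [1, 1, 0, 2, 2],
}

# Estimate a GEBV for every candidate and rank them from best to worst.
gebvs = {name: gebv(genotype, marker_effects) for name, genotype in candidates.items()}
ranking = sorted(gebvs, key=gebvs.get, reverse=True)

for name in ranking:
    print(f"{name}: GEBV = {gebvs[name]:+.1f}")
```

In practice the number of markers is in the tens of thousands and the effects come from a statistical model fitted to the training population, but the ranking logic stays the same: candidates are ordered by the sum of their marker effects.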