Biomathematics 207B / Biostatistics 237 / Human Genetics 207B
PROJECT, Winter 2004
Hand in your paper by Wednesday noon of finals week (3/24/04). Drop off your paper at the Biostatistics Office (5th floor) or at room 5357C Gonda.
Remember that I will apply a penalty if your paper is late and you didn't clear it with me before 3/22/04. Maximum points are 60. Penalty: 1 point off per hour up to 6 pm; 12 points off if handed in after 10 am on Thursday; 18 points off if handed in after 10 am on Friday.
The purpose of this project is to locate genes associated with quantitative trait X or qualitative trait Y relative to 10 co-dominant markers.
You may use any of the gene mapping methods available in the software MENDEL (obtain a free copy from www.genetics.ucla.edu/software or run it from your genetics account), SAGE or GAP that you feel are appropriate to the data. To run either SAGE or GAP you will have to reformat the files and, for SAGE 3.1, make your own par files. So if you decide to use SAGE or GAP, you will need to read the manuals carefully and work out the correct formats. You can use the program mega2 (available on calypso.genetics.ucla.edu) to convert formats. You will be given a small amount of extra credit if you master the reformatting needed for SAGE or GAP (if these programs are appropriate) or if you appropriately use MENDEL options that we did not run in lab.
Your assignment is to (1) choose a method that will give you accurate results based on power, model assumptions, robustness, etc., (2) determine the most likely locations of putative trait gene(s), and (3) write up your results in a clear and convincing manner. You should choose your method before running the linkage analysis and it is extremely important to include a justification for your choices in your paper.
NOTES:
(1) It is not acceptable to:
(a) Run all possible methods and then choose the method that gave the most significant results. The most appropriate method may not always give the most significant p-value or LOD score.
(b) Include results from several methods without providing compelling justification for using multiple methods. Running more than one method merely for confirmation is not an appropriate justification; it is bad policy. If, on the other hand, the methods provide different information, then running more than one method can be useful.
(2) There is more than one "right" approach for analyzing these data. You will be graded on (a) how well you demonstrate that you understand the methods you used, (b) how well you argue for your approach and (c) how well you point out its strengths and weaknesses, rather than on whether you find the true location of the gene(s).
(3) Your data set may differ from your classmates', so don't be alarmed if you get different results from another person. If you lose, destroy or can't read a file, DON'T USE YOUR CLASSMATES' DATA FILES. Ask me for another copy of the data if needed.
(4) You may find no significant evidence for linkage anywhere along the chromosome. This is not an error; some data sets contain no evidence of linkage. This is still a "publishable result" (at least as far as this class is concerned).
The data come from approximately 600 randomly ascertained pedigrees in which quantitative trait X has been measured. For your convenience, I have also provided a dichotomized trait Y that is highly correlated with X. Don't use Y without examining its relationship to X. There are 10 co-dominant markers, all from the same chromosome, spaced ~5-20 cM apart. Again, don't be alarmed if you get different answers from your classmates, as your data sets differ.
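The marker spacing above is quoted in centimorgans, whereas the map file described later stores recombination fractions between adjacent markers. As a rough guide to how the two scales relate, the sketch below converts between them with Haldane's map function, which assumes no crossover interference. This handout does not say which map function was used to build the map file, so the choice of Haldane here is an assumption, and the function names are illustrative only.

```python
import math


def haldane_theta(distance_cm: float) -> float:
    """Recombination fraction for a map distance in cM (Haldane, no interference)."""
    morgans = distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * morgans))


def haldane_cm(theta: float) -> float:
    """Inverse conversion: map distance in cM for a recombination fraction."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must satisfy 0 <= theta < 0.5")
    return -50.0 * math.log(1.0 - 2.0 * theta)


if __name__ == "__main__":
    # Distances spanning the approximate marker spacing quoted in the handout.
    for d in (5.0, 10.0, 20.0):
        print(f"{d:5.1f} cM -> theta = {haldane_theta(d):.4f}")
```

For short intervals the recombination fraction is close to the distance in Morgans (5 cM gives roughly 0.048), and the two diverge as the interval lengthens.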
Take into account the following background information:
For your projects, suppose the following scenario is true: You are a graduate student and part of your project is to test whether trait X or trait Y is controlled by any genes on chromosome Z. (Please substitute your favorite quantitative trait for X, an appropriate corresponding qualitative trait for Y and an appropriate chromosome - see below). You have collected around 600 nuclear families.
Several years ago, another research group published segregation analysis results using trait X and families collected from a highly homogeneous population (an isolate). They conclude that there is evidence that the trait is transmitted in a Mendelian manner. 2 They find for an additive gene with aa = 23.5, =13.5 and =16 and allele frequency pA = 0.30. Based on these results, your advisor's research group conducted a genome-wide linkage analysis for trait X. They used model-based linkage analysis on pedigrees from a different isolate but from the same geographic region. They found 2 chromosome locations with significant evidence of linkage (LOD > 3). Recently a third group conducted a genome-wide linkage analysis using trait Y. They used a model-free approach and found no evidence of linkage (using LOD > 3), but they had only 100 nuclear families from Los Angeles (a highly genetically heterogeneous population). You want to test whether one of the chromosomal locations originally found by your research group can be confirmed in 600 families from Scotland.
You must decide whether to use the quantitative trait X or the qualitative trait Y. Many physicians use trait X as an indicator of risk for disease Y, choosing to treat patients with high trait X values to prevent Y. Trait Y is a complex trait with low penetrance, however, so any individual who is not clinically affected is usually given a missing or unknown phenotype designation.
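Before choosing between X and Y, it can help to tabulate how the X values are distributed among affected individuals versus those with unknown status. The sketch below is one minimal way to do that; the function name, the (x, y) record layout and the toy values are all hypothetical and are not taken from the course data.

```python
from statistics import mean


def summarize_x_by_y(records):
    """Summarize trait X separately for affected (y == 1) and unknown (y is None).

    `records` is an iterable of (x, y) pairs, where x is a float or None
    (missing) and y is 1 (disease) or None (unknown), following the coding
    described in this handout.
    """
    affected = [x for x, y in records if x is not None and y == 1]
    unknown = [x for x, y in records if x is not None and y is None]
    return {
        "n_affected": len(affected),
        "n_unknown": len(unknown),
        "mean_x_affected": mean(affected) if affected else None,
        "mean_x_unknown": mean(unknown) if unknown else None,
        "min_x_affected": min(affected) if affected else None,
        "max_x_unknown": max(unknown) if unknown else None,
    }


# Toy records for illustration only (not real data).
toy = [(25.1, 1), (22.8, 1), (14.0, None), (16.5, None), (None, 1), (19.2, None)]
summary = summarize_x_by_y(toy)

if __name__ == "__main__":
    for key, value in summary.items():
        print(key, value)
```

If the smallest X among affected individuals sits above the largest X among the unknowns, Y may be a simple threshold on X. Overlap between the two groups would suggest a looser relationship that deserves a closer look, for example with a histogram.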
File Structure:
For your convenience, I have included data and example input files needed to analyze the data using MENDEL 5.0. Examine all the files carefully, particularly the example control file. You need to alter these files so that they run the exact analysis you wish to run. The genotype data contain errors, so you will need to run the mistyping option of MENDEL or some other error detection program.
(1) MENDEL example command file that runs mistyping analysis for a single marker (option 5 model 1), controlmis.in
(2) Marker locus file. The adjacent markers are ~5-20 cM apart. locus.in contains marker names. You can use MENDEL to estimate the allele frequencies.
(3) Marker Map file. This file, map.in, provides the marker order and the recombination fractions between adjacent markers.
(4) A variable file. This file, variable.in, provides the quantitative trait designation and allowable range for values.
(5) Phenotype and genotype data are in pedigree.in. The order of the variables is:
Variable        Comments
pedigree id
person id
father id       " " = founder
mother id       " " = founder
sex             1 = male, 2 = female
marker 1        missing = " "; codominant, phase unknown
marker 2        missing = " "; codominant, phase unknown
. . .
marker 10       missing = " "; codominant, phase unknown
trait value     F7.3 (trait X); missing = " "
aff. status     1 = disease, unknown = " "
Because the file is comma delimited, I added the command "pedigree_list_read=true" to the control file. I also changed the allele separator from a "/" to a "-".
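To make the record layout concrete, here is a minimal sketch of reading one comma-delimited record with "-" as the allele separator. It is only my reading of the field list above, not MENDEL's own parser, and the real pedigree.in may contain details this does not handle. The class name, function name and example records are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

N_MARKERS = 10  # the handout describes 10 co-dominant markers


@dataclass
class Person:
    pedigree: str
    person: str
    father: Optional[str]          # None for a founder (blank field)
    mother: Optional[str]          # None for a founder (blank field)
    sex: str                       # "1" = male, "2" = female
    genotypes: List[Optional[Tuple[str, str]]]  # None where the marker is missing
    trait_x: Optional[float]       # None where the trait value is missing
    affected: Optional[bool]       # True if coded 1, None if unknown


def parse_record(line: str) -> Person:
    """Parse one record, assuming the field order listed in the handout."""
    fields = [f.strip() for f in line.rstrip("\n").split(",")]
    expected = 5 + N_MARKERS + 2
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    genotypes: List[Optional[Tuple[str, str]]] = []
    for g in fields[5:5 + N_MARKERS]:
        if not g:
            genotypes.append(None)
        else:
            a, b = g.split("-")    # phase unknown, so allele order carries no meaning
            genotypes.append((a, b))
    trait_field = fields[5 + N_MARKERS]
    status_field = fields[5 + N_MARKERS + 1]
    return Person(
        pedigree=fields[0],
        person=fields[1],
        father=fields[2] or None,
        mother=fields[3] or None,
        sex=fields[4],
        genotypes=genotypes,
        trait_x=float(trait_field) if trait_field else None,
        affected=True if status_field == "1" else None,
    )


# Hypothetical example records (not taken from the course data).
child = parse_record("1,3,1,2,2,1-2,2-2,,1-1,1-2,2-2,1-2,1-1,2-2,1-2, 21.350,1")
founder = parse_record("1,1,,,1,1-2,1-2,1-1,1-1,1-2,2-2,1-2,1-1,2-2,1-2,,")
```

Blank fields are mapped to None rather than to zero or to an "unaffected" code, which keeps the distinction between missing and observed values that the handout's coding relies on.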
PROJECT REPORT:
Write your paper in the style of a human genetics journal article. Please remember that the maximum length is 7 pages of text (excluding figures or tables), double spaced with 12 point font and 1" margins. Do not include computer printouts. If you were actually publishing a paper, your readers would not be familiar with these programs and would not bother to decipher the printouts. Instead, present the important information from these printouts in graphs and tables. To demonstrate that you understand the biological implications of your findings and to make your paper more readable, assign a real or hypothetical trait to X and a real or hypothetical disorder to Y. As an example, one could "study" a trait like LDL cholesterol and a disorder like familial combined hyperlipidemia. Don't be concerned if the average value of your trait isn't the same as the real trait's average value. After all, you may be measuring your trait in nonstandard units. Amusing traits are acceptable (even encouraged). The only requirement for the phenotype is that it fit the observed data. For example, do not pick a disease rarely expressed in males, like breast cancer, if both males and females are affected in your data set.
Specifically include in your paper:
(1) Introduction: Briefly describe the data and your goals. Include a very brief background that describes any previous findings concerning your trait and the related disorder. If you choose a real trait, include some real references. DO NOT PLAGIARIZE PUBLISHED PAPERS OR BOOKS. See below for an explanation of plagiarism.
(2) Methods: Summarize the approach and the methods you used to analyze the data. Include equations when needed. Describe any assumptions you made concerning the data and any modeling assumptions implicit in the methods. Briefly mention why you chose your method of analysis over the other possible analysis methods.
(3) Results: Present the results of your analysis. Use figures and tables to help you present your results (see below). Point out any possible inconsistencies or artifacts in your results.
(4) Discussion: Summarize your findings. Briefly justify your choice of analysis method. Discuss the limitations (weaknesses) of the data or of your analyses. IMPORTANT: Present possible additional studies or analyses that you would do next (directions for future research). Discuss improvements to the study design or analysis methods if any exist.
(5) Figures and Tables: Besides summarizing linkage results (e.g. LOD score curves, tables of p-values), you may want to include other figures and tables. Good examples include: (a) a table describing demographic information, (b) a typical pedigree with symbol coding to show the typical familial aggregation of the trait/covariate values, (c) a histogram of the trait distribution to justify any data transformations made.
Plagiarism and How to Avoid It
Plagiarism is the intentional use of a person's ideas or words without giving them credit. It is not acceptable in scientific writing, as it is stealing. To avoid being accused of plagiarism you must cite all references you use to write your paper. In addition, you should not put a phrase (more than a couple of words), sentence, or paragraph from any book or journal in your paper unless you put the information in quotes.
As an example, suppose I am writing a paper about linkage analysis and I read in K. Lange's book (Lange, 1996): "Any Mendelian model revolves around the three crucial notions of priors, penetrances and transmission probabilities (Elston and Stewart, 1971)". How can I use this information in my paper?
(1) INCORRECT: I just use it word for word without citing Lange or Elston and Stewart. This is stealing both the idea and the words. EXAMPLE: Any Mendelian model revolves around the three crucial notions of priors, penetrances and transmission probabilities. It's also wrong to paraphrase the idea and not cite Lange or Elston and Stewart.
(2) INCORRECT: I use the sentence exactly and cite Lange and Elston and Stewart. Although I am not stealing the idea, I am stealing the words. EXAMPLE: Any Mendelian model revolves around the three crucial notions of priors, penetrances and transmission probabilities (Elston and Stewart, 1971; Lange, 1996).
(3) CORRECT: I put the words in quotes and attribute the quote. EXAMPLE: As Lange (1996) states, "Any Mendelian model revolves around the three crucial notions of priors, penetrances and transmission probabilities" (Elston and Stewart, 1971).
Approach (3) is acceptable, but it can be boring to read and doesn't demonstrate that I understood what I read.
(4) THE BEST WAY: I put the concept in my own words and cite Lange and Elston and Stewart for the idea. EXAMPLE: When constructing a Mendelian inheritance model we must define the population frequencies of the genotypes, the priors. We must also specify the probability of the phenotype given a genotype, the penetrance, and the transmission probability, the probability that a genotype is transmitted from parents with given genotypes (Elston and Stewart, 1971; Lange, 1996).
I apologize if this discussion is something that you have heard many times in your academic career, but it's easy to forget and there are gray areas. Plagiarism really is a big deal: recall last year's press coverage concerning accusations against historians Stephen Ambrose and Doris Kearns Goodwin.
References:
Elston, R.C., and Stewart, J. (1971) A general model for the genetic analysis of pedigree data. Human Heredity 21:523-542.
Lange, K. (1996) Mathematical and Statistical Methods for Genetic Analysis. Springer-Verlag, New York, New York.