Using the MHC-Program

MHC Manual
Bert Klei
Computational Genetics, WPIC-UPMC
kleil@upmc.edu

To cite this program:

Klei L, Roeder K. Testing for association based on excess allele sharing in a sample of related cases and controls. Human Genetics 2007; 121:549-557.

Warnings:

There are some hard-coded limits in the program. First, at most 23 chromosomes are possible. Second, the allele frequencies used for the matching algorithm are based on controls only (individuals with a diagnosis code of 1; individuals with code 0 are ignored).

Acknowledgements:

Parts of the program rely on methods developed by others.

BLUE Frequencies: McPeek MS, Wu X, Ober C. 2004. Best linear unbiased allele-frequency estimation in complex pedigrees. Biometrics 60:359-367.

Case Control Quasi Likelihood Score: Bourgain C, Hoffjan S, Nicolae R, Newman D, Steiner L, Walker K, Reynolds R, Ober C, McPeek MS. 2003. Novel case-control test in a founder population identifies P-selectin as an atopy-susceptibility locus. Am J Hum Genet 73:612-626.

To run MHC in Windows:

1) Create a copy of the executable mhc_v2.exe in a directory close to your data.
2) Open a DOS window.
3) Cd to the directory in which the executable mhc_v2.exe is stored.
4) Issue the command mhc_v2.exe.

The input file has a number of lines that have to be supplied.

line 1: A title that you want to give to the run.
line 2: Directory and name of the .loc files. You only give the root name; the program assumes that this is followed by the chromosome number and .loc. I.e., if you supply klei, the program assumes that the file names are klei1.loc, klei2.loc, …, klei22.loc, kleiX.loc. It is not necessary to have a file for each chromosome, only for the ones that you want to use. The layout is a modified linkage format. It is important that the file names end in .loc (see below for details).
line 3: Directory and name of the .pre files. Similar to the previous line; these are the genotype files in the pre-linkage format. These files only need to contain the individuals with genotypes. They need to have the extension .pre.
line 4: Name and location of the pedigree file. The pedigree file does not have a field for family.
line 5: Name and location of a map file.
line 6: Name and location of the file with IBD probabilities. The program checks whether this file exists. If it does, the program uses the values contained in this file; if not, it calculates the IBD probabilities and then stores them. Make sure you delete this file when your project changes.
line 7: Name and location of the back-up file. For some projects it might not be possible to get results in one run. The program can be brought to a gentle halt, and the results at that point can then be used to continue later on. The program looks for this file when you start. If it exists, the program continues from the stopping point; if not, it starts from the beginning.
line 8: Name and location of the log file. This file contains pertinent information collected during the run.
line 9: yes if you want to calculate the matching statistic, no if you don’t (lower case is important).
line 10: Location and file name to which to write the matching results for the cases.
line 11: Location and file name to which to write the matching results for the controls.
line 12: yes if you want to calculate the Hellinger distance test statistics, no if you don’t (lower case is important).
line 13: Location and file name to which to write the Hellinger distance results.
line 14: yes if you want to calculate the Case-Control Quasi Likelihood Score (CCQLS) test statistics (Bourgain et al. 2003), no if you don’t (lower case is important).
line 15: Location and file name to which to write the CCQLS results.
line 16: Location and name of the stop-and-go file, used to bring the program to a gentle stop even though it might not have finished all calculations (see comments).
line 17: Genetic model (specify additive, dominant, or recessive).
line 18: First and last chromosome to analyze (see comments).
line 19: First and last marker to analyze (see comments).
line 20: First and last marker on X that behave as true sex-linked markers. If you do not have markers on X, enter 0 0 on this line (see comments).
line 21: yes or no to specify whether you want the simulations to allow for recombination between markers.
line 22: Reduce pedigree (complete, partial, no) (see comments).
line 23: Number of simulations to use for determining the significance of the test statistics (recommend 10,000).
line 24: Number of simulations to use to determine the IBD probabilities for each pair (recommend 100,000).
line 25: Approximate number of different IBD probabilities (see comments).
line 26: p-value at which to stop iterations (recommend 0.10) (see comments).
line 27: Allele frequency estimation method (mcpeek, recommended, or naïve) (see comments).
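As an illustration of the line order above, the following Python sketch assembles the 27 lines of an input file and checks the entries that the manual says are case sensitive. It is not part of the MHC distribution: the function names, the dictionary keys, and all paths and values are made up for this example, and quoting every file name (including the root names on lines 2 and 3) is an assumption based on the general recommendation about file names.

```python
# Hypothetical helper, not part of MHC: build the 27 input lines in order.

def quote(path):
    """Wrap a file name in double quotes, as the manual recommends."""
    return '"' + path + '"'

def build_input_lines(p):
    """Return the 27 input lines for the settings in dictionary p."""
    for key in ("matching", "hellinger", "ccqls", "recombination"):
        if p[key] not in ("yes", "no"):
            raise ValueError(key + " must be lower-case 'yes' or 'no'")
    if p["model"] not in ("additive", "dominant", "recessive"):
        raise ValueError("unknown genetic model: " + p["model"])
    if p["reduce"] not in ("complete", "partial", "no"):
        raise ValueError("unknown pedigree reduction: " + p["reduce"])
    return [
        p["title"],                    # line 1: title of the run
        quote(p["loc_root"]),          # line 2: root name of the .loc files
        quote(p["pre_root"]),          # line 3: root name of the .pre files
        quote(p["pedigree"]),          # line 4: pedigree file
        quote(p["map"]),               # line 5: map file
        quote(p["ibd"]),               # line 6: IBD probability file
        quote(p["backup"]),            # line 7: back-up file
        quote(p["log"]),               # line 8: log file
        p["matching"],                 # line 9: matching statistic yes/no
        quote(p["match_cases"]),       # line 10: matching results, cases
        quote(p["match_controls"]),    # line 11: matching results, controls
        p["hellinger"],                # line 12: Hellinger distance yes/no
        quote(p["hellinger_out"]),     # line 13: Hellinger results
        p["ccqls"],                    # line 14: CCQLS yes/no
        quote(p["ccqls_out"]),         # line 15: CCQLS results
        quote(p["stop_go"]),           # line 16: stop-and-go file
        p["model"],                    # line 17: genetic model
        "%d %d" % p["chromosomes"],    # line 18: first and last chromosome
        "%d %d" % p["markers"],        # line 19: first and last marker
        "%d %d" % p["x_markers"],      # line 20: markers on X
        p["recombination"],            # line 21: recombination yes/no
        p["reduce"],                   # line 22: reduce pedigree
        str(p["n_sim"]),               # line 23: simulations for significance
        str(p["n_ibd_sim"]),           # line 24: simulations for IBD
        str(p["n_ibd_classes"]),       # line 25: number of IBD probabilities
        str(p["p_stop"]),              # line 26: p-value at which to stop
        p["freq_method"],              # line 27: allele frequency method
    ]

# Placeholder settings; every path and number here is invented.
example = {
    "title": "Example run", "loc_root": "data/klei", "pre_root": "data/klei",
    "pedigree": "data/pedigree.txt", "map": "data/map.txt",
    "ibd": "work/ibd.txt", "backup": "work/backup.txt", "log": "work/run.log",
    "matching": "yes", "match_cases": "out/match_cases.txt",
    "match_controls": "out/match_controls.txt",
    "hellinger": "no", "hellinger_out": "out/hellinger.txt",
    "ccqls": "no", "ccqls_out": "out/ccqls.txt", "stop_go": "work/stopgo.txt",
    "model": "additive", "chromosomes": (1, 22), "markers": (1, 10000),
    "x_markers": (0, 0), "recombination": "yes", "reduce": "complete",
    "n_sim": 10000, "n_ibd_sim": 100000, "n_ibd_classes": 5000,
    "p_stop": 0.10, "freq_method": "mcpeek",
}
```

Writing "\n".join(build_input_lines(example)) to a text file would give one entry per line in the order listed above.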

General comment about file names

I highly recommend putting file names between double quotes. If a name contains blank spaces or other special characters, the program might otherwise think it is reading two separate entries. The convention of ../ for one directory up works as well. File names should have the extensions as specified below. No headings or variable names should appear at the top of any of the files.
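To see which marker files the program will expect for a given root name, the naming rule (root name, then chromosome label, then extension) can be sketched in Python. The function name is hypothetical, and the labels 1 to 22 plus X are taken from the klei example in the input file description.

```python
import os

def marker_file_names(root, extension, labels=None):
    """List the file names implied by a root name, e.g. klei -> klei1.loc."""
    if labels is None:
        # Chromosomes 1..22 plus X, as in the klei example in this manual.
        labels = [str(c) for c in range(1, 23)] + ["X"]
    return [root + label + extension for label in labels]

def existing_files(names):
    """Keep only the names that are present on disk."""
    return [name for name in names if os.path.exists(name)]
```

For example, marker_file_names("klei", ".loc") starts with klei1.loc and ends with kleiX.loc; existing_files then shows which of those chromosomes actually have a file.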

.loc files (marker information file)

One file is needed for each chromosome. The layout of this file:

line 1: number of markers
line 2: number of alleles for marker 1, ‘#’, followed by marker name
line 3: allele frequencies for marker 1
line 4: number of alleles for marker 2, ‘#’, followed by marker name
line 5: allele frequencies for marker 2
etc.
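A quick consistency check of a .loc file can catch layout mistakes before a long run. The sketch below is only an illustration under stated assumptions: fields are taken to be separated by white space, the number of alleles is taken to be the first field on the marker line, and the requirement that frequencies sum to 1 is a sanity check added here, not something the manual states. The function name is hypothetical.

```python
def parse_loc(text, tolerance=1e-3):
    """Parse .loc content into a list of (marker name, frequencies)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    n_markers = int(lines[0])                 # line 1: number of markers
    markers = []
    for i in range(n_markers):
        header = lines[1 + 2 * i]             # e.g. "3 # D1S123"
        freq_line = lines[2 + 2 * i]          # e.g. "0.5 0.3 0.2"
        count_part, _, name = header.partition("#")
        n_alleles = int(count_part.split()[0])
        freqs = [float(x) for x in freq_line.split()]
        if len(freqs) != n_alleles:
            raise ValueError("%s: expected %d frequencies, found %d"
                             % (name.strip(), n_alleles, len(freqs)))
        if abs(sum(freqs) - 1.0) > tolerance:
            raise ValueError("%s: frequencies do not sum to 1" % name.strip())
        markers.append((name.strip(), freqs))
    return markers
```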

Pedigree file

For this file it is necessary that parents appear before their descendants. You have to make sure that this is the case. If not, the program will come to a halt. Individuals need to be uniquely coded across families. Layout:

column 1: individual
column 2: father
column 3: mother
column 4: sex
column 5: dx
column 6: genotype indicator (1 if the individual is genotyped, 0 if not genotyped)
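Because the program halts when a parent is listed after its child, it can be worth checking the order beforehand. The following Python sketch does this for rows in the six-column layout above. It assumes that an unknown parent is coded as 0 (the usual linkage convention, which the manual does not state explicitly); the function name is hypothetical.

```python
def check_parent_order(rows, missing="0"):
    """Return a list of (individual, problem) for a pedigree given as rows.

    Each row is a sequence starting with individual, father, mother.
    """
    seen = set()
    problems = []
    for row in rows:
        ind, father, mother = row[0], row[1], row[2]
        if ind in seen:
            problems.append((ind, "duplicate individual code"))
        for parent in (father, mother):
            # A known parent must already have appeared higher up in the file.
            if parent != missing and parent not in seen:
                problems.append((ind, "parent %s not listed earlier" % parent))
        seen.add(ind)
    return problems
```

An empty list means every parent precedes its descendants and every individual code is unique.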

.pre files (genotype files)

For these files, parents do not need to appear before their descendants. It is important to note that alleles need to be coded in linkage format, i.e., if there are 7 alleles for a marker, alleles should be numbered 1-7. You can use MEGA2 to recode alleles. The layout is:

column 1: family
column 2: individual
column 3: father
column 4: mother
column 5: sex
column 6: dx
column 7: marker 1, allele 1
column 8: marker 1, allele 2
column 9: marker 2, allele 1
column 10: marker 2, allele 2
etc.
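The recoding of alleles to consecutive numbers can be illustrated with a short Python sketch. This is not a replacement for MEGA2. It assumes that alleles are given as integers (for example fragment sizes) and that a missing allele is coded as 0, which is the usual linkage convention but is not stated in this manual. The function name is hypothetical.

```python
def recode_alleles(genotypes, missing=0):
    """Recode the alleles of one marker to 1..n in sorted order.

    genotypes is a list of (allele 1, allele 2) pairs.
    Returns the recoded pairs and the old-to-new code table.
    """
    observed = sorted({a for pair in genotypes for a in pair if a != missing})
    code = {allele: i + 1 for i, allele in enumerate(observed)}
    code[missing] = missing                 # missing alleles stay missing
    recoded = [(code[a], code[b]) for a, b in genotypes]
    return recoded, code
```

The allele frequencies in the matching .loc file have to be listed in the same order as the new codes.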

Map file

The information in this file is used for output purposes. You can just make up some information if you need to. It needs to have 4 columns. Only 1 map file is needed. It is important to have all information complete, even if you have to make up alternative names for the markers. For example, the markers on chromosome 1 can be given the alternative names CH1M1, CH1M2, CH1M3, etc.

column 1: marker name
column 2: alternative name for the marker (can be the same as column 1; this is the name that is used in the output files)
column 3: chromosome on which this marker appears
column 4: location (genetic or physical distance)
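If no real map information is available, placeholder rows in the four-column layout can be generated along the lines of the CH1M1 example. The Python sketch below is an illustration only: the function name, the space-separated output, and the evenly spaced made-up positions are all assumptions.

```python
def placeholder_map_rows(marker_names, chromosome, spacing=1.0):
    """Build map rows: marker name, alternative name, chromosome, location."""
    rows = []
    for index, name in enumerate(marker_names, start=1):
        alt_name = "CH%sM%d" % (chromosome, index)   # e.g. CH1M1, CH1M2
        position = (index - 1) * spacing             # invented location
        rows.append("%s %s %s %g" % (name, alt_name, chromosome, position))
    return rows
```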

File with matching statistics (cases and controls)

Headers in this file give the information you need. Matching statistics are calculated for ALL pairs, MALE pairs only, and FEMALE pairs only.

File with Hellinger distance test statistics

Again the headers describe the column contents. In this case the results are not based on pairs and therefore the values are filled with 0. Test statistics are again calculated for ALL individuals, MALES only, and FEMALES only.

File with Case-Control Quasi Likelihood test statistics

Again the headers describe the column contents. In this case the results are not based on pairs and therefore the values are filled with 0. Test statistics are again calculated for ALL individuals, MALES only, and FEMALES only. The difference from the implementation of Bourgain et al. (2003) is that here gene dropping is used to determine the significance, whereas Bourgain et al. used asymptotic properties of the test statistic.

Stop and go file

When the program starts, it writes a small file with the name specified in line 16 that contains the word “go”. If you want the program to come to a gentle stop, replace the word “go” in this file with “stop” and save the file. The program should then stop in a nice fashion so that you can pick up where you stopped and finish the calculations later.
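Instead of editing the stop and go file by hand, the word can be replaced from a script. The Python sketch below assumes only what is described above (the file exists once the program is running and contains the word go); the function name is hypothetical, and any other text or line ending in the file is left untouched.

```python
from pathlib import Path

def request_stop(stop_go_file):
    """Replace the word go with stop in the stop and go file."""
    path = Path(stop_go_file)
    if not path.exists():
        raise FileNotFoundError("no stop and go file at " + str(path))
    text = path.read_text()
    # Only the word itself is changed; surrounding white space is kept.
    path.write_text(text.replace("go", "stop"))
```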

First and Last Chromosome

The program was initially written to deal with a genome-wide scan. If you only have a limited number of markers, you can put them all on a fictitious chromosome 1 and then enter 1 1 on this line.

First and Last Marker

If you want to analyze specific markers you can give a specific range. If you enter -1 and 10000 it will analyze all markers on the chromosomes.

True sex-linked markers

This is most easily explained by an example. Assume that on X you have 30 markers and the first 5 and last 2 act as pseudo-autosomes. The values to enter on this line are 5 2. If only the first 5 act as pseudo-autosomes, enter 5 0. If there are no pseudo-autosomal markers on X, enter 0 0.

Reduce pedigree

In many cases you can greatly reduce the computational burden of the program by including only the individuals that matter for calculating the test statistic. These are referred to as essential individuals: every individual that is genotyped, plus all individuals that are on a pedigree path connecting individuals with genotypes. If you specify complete (recommended), the program reduces the pedigree to only these essential individuals. If you specify partial, it will also make sure that all individuals have 2 known parents (similar to --trim in Merlin). If you specify no, it will use the pedigree as is. The results for the 3 options are the same except for variations due to the random process of the gene drop used to calculate the significance.
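The idea of dropping non-essential individuals can be illustrated with a deliberately simplified Python sketch. It is not the reduction that MHC performs: it only removes ungenotyped individuals that have no remaining children, repeating until nothing changes, so it keeps every ancestor of a genotyped individual and may therefore keep more than the essential individuals. The function name, the dictionary layout, and the use of 0 for an unknown parent are assumptions.

```python
def prune_ungenotyped_leaves(pedigree):
    """Remove ungenotyped individuals without children, repeatedly.

    pedigree maps individual -> (father, mother, genotyped flag).
    Returns a new dictionary with the remaining individuals.
    """
    kept = dict(pedigree)
    while True:
        # Everyone who is still listed as a parent of a kept individual.
        parents = {p for father, mother, _ in kept.values()
                   for p in (father, mother)}
        removable = [ind for ind, (_, _, genotyped) in kept.items()
                     if not genotyped and ind not in parents]
        if not removable:
            return kept
        for ind in removable:
            del kept[ind]
```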

Number of different IBD probabilities

The most intensive part of the program deals with determining the expected matching. Calculations are greatly reduced if they are limited to the unique IBD probabilities. It is hard to say how many there are before you start. The number to use depends on the complexity of the pedigrees and the number of individuals. We have usually started the program with 5000.
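Why limiting the work to unique IBD probabilities saves time can be shown with a small Python sketch: pairs with the same probabilities are grouped, so an expensive calculation is needed once per group instead of once per pair. This only illustrates the idea and is not the program's internal method. Representing the IBD probabilities of a pair as a tuple and rounding them before comparison are assumptions, and the function name is hypothetical.

```python
from collections import defaultdict

def group_pairs_by_ibd(pair_ibd, decimals=4):
    """Group pairs that share the same (rounded) IBD probabilities.

    pair_ibd maps a pair of individuals to a tuple of IBD probabilities.
    """
    groups = defaultdict(list)
    for pair, probs in pair_ibd.items():
        key = tuple(round(p, decimals) for p in probs)
        groups[key].append(pair)
    return dict(groups)
```

The number of keys in the result is the number of different IBD probabilities for that data set, which is the quantity that line 25 of the input file approximates.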

P-value at which to stop

To save computing time you can specify that you want to stop iterations if all of the test statistics that you are interested in can no longer achieve this preset value. In other words, if it becomes apparent during the iterations that a p-value smaller than the one specified on this line can no longer be reached for any of the test statistics, the program starts processing the next marker.
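The stopping rule can be sketched in Python for a single test statistic. The sketch assumes that the p-value is estimated as the fraction of simulated statistics that are at least as large as the observed one; the manual does not give the exact estimator, so this is an assumption, and the function and parameter names are hypothetical. Once the number of exceedances alone guarantees that the final fraction cannot drop below the threshold, further simulations are skipped.

```python
def simulate_p_value(observed, draw_statistic, n_total, p_stop):
    """Estimate a simulation p-value with early stopping.

    draw_statistic is a function returning one simulated test statistic.
    Returns (estimated p-value, number of simulations actually run).
    """
    exceed = 0
    for done in range(1, n_total + 1):
        if draw_statistic() >= observed:
            exceed += 1
        # Even if no later draw exceeds, the final p-value is exceed / n_total.
        if exceed / n_total >= p_stop:
            return exceed / done, done
    return exceed / n_total, n_total
```

With 10,000 planned simulations and a threshold of 0.10, a marker is abandoned as soon as 1,000 simulated statistics have reached the observed value.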

Allele frequency estimation method

Here you have two options: mcpeek or naïve. The naïve method does a simple allele count on all individuals of interest. The mcpeek method uses best linear unbiased estimation to determine the allele frequencies in the founders. It takes into account relationships among individuals (McPeek et al., 2004). In the case of a simple population sample of unrelated individuals, the two methods give the same results. In all other cases the mcpeek method is more accurate.
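The naïve method amounts to counting alleles, which can be sketched in a few lines of Python. The function name is hypothetical and the coding of a missing allele as 0 is an assumption. The mcpeek method is not shown, because it additionally needs the relationships among the individuals.

```python
from collections import Counter

def naive_allele_frequencies(genotypes, missing=0):
    """Estimate allele frequencies for one marker by simple counting.

    genotypes is a list of (allele 1, allele 2) pairs.
    """
    counts = Counter(a for pair in genotypes for a in pair if a != missing)
    total = sum(counts.values())
    if total == 0:
        return {}                      # no observed alleles for this marker
    return {allele: n / total for allele, n in sorted(counts.items())}
```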