Paper PR02

Pharmacogenomics: JMP® right in!

Jason Baucom, PhD, Raleigh, NC

ABSTRACT

The understanding of pharmacogenomics has greatly accelerated in the past decade. With the unraveling of the human genome, the ongoing HapMap project and the development of microarray single nucleotide polymorphism (SNP, pronounced “snip”) chips by Affymetrix® and Illumina®, the chore of associating genetic markers with drug reaction states has become a feasible and important project. With the vast amounts of data being stored and sifted, it is useful to have a powerful statistical and graphical tool for analysis. SAS® has become entrenched in the pharma world, and provides much of the statistical horsepower needed for such a task. Powerful graphical analysis tools like JMP® have not been able to successfully enter the pharmacogenomic market due to design constraints. JMP stores data sets in memory, so the size of the workable data sets is limited by the memory of your system. With the large data sets typical of pharmacogenomic data, this constraint is debilitating. JMP Genomics® circumvents this problem by allowing JMP to work directly with SAS data sets on disk. This enables the graphical power of JMP to interact directly with the large data set capacity and statistical horsepower inherent in SAS, making for a powerful combination capable of effectively analyzing large pharmacogenomic data sets. JMP Genomics also presents a comprehensive array of genomics tools. In this paper we discuss the unique features JMP Genomics offers and demonstrate its effectiveness in analyzing large pharmacogenomic data sets, specifically single nucleotide polymorphisms.

INTRODUCTION

In recent years pharmacogenomics has been effectively employed to highlight the underlying genetic causes for patient specific drug reactions. Recent successes, such as the development of a genetic test for hypersensitivity to abacavir (Mallal, 2008) and a genetic screen for sensitivity to warfarin (Rettie, 2006), have inspired the increasing demand for pharmacogenomics. Such genetic tests offer the promise of personalized medicine based on the individual’s genetic makeup. Effective mapping of an individual’s genetic markers or a sequenced genome is required to identify the genetic component of variant drug responses. The advent of the Affymetrix and Illumina single nucleotide polymorphism (SNP, pronounced “snip”) chips has facilitated this effort. These chips provide a detailed map of a subset of an individual’s SNPs. These single point mutations can be either causative or can be used as markers linked to causative elements due to physical proximity on the chromosomes. As the density of marker coverage increases, the amount of data gathered increases dramatically, and tools effective in dealing with these large amounts of data are required. SAS is an attractive candidate for dealing with the underlying statistics and large data sets, but it lacks the graphical component needed to support a judicious decision making process. JMP fills this gap but has design constraints that limit its usefulness with large data sets: JMP deals with data sets stored directly in memory, not on disk. JMP Genomics bridges this gap, providing the graphical analysis power of JMP while drawing on the statistical power of SAS and its ability to handle large data sets on disk. JMP Genomics also provides a wide array of tools designed to effectively analyze genomic data.

PREPARING THE DATA

For the sake of brevity we will focus on SNP analysis of Affymetrix data, though JMP Genomics can process Illumina chip data as well and handles a wide array of genomic analysis techniques. We will step through the import and analysis of an Affymetrix SNP chip, working with the 1M SNP chip sample data provided by Affymetrix. This data is publicly available and can be obtained at the Affymetrix web site (Affymetrix, 1). Upon request, Affymetrix will send three DVDs containing the chip data. This data contains information gathered from 270 individuals from the HapMap project (International HapMap Consortium, 2003). Unzip and untar all three DVDs into a single directory. Be sure to run MD5 checksums for the copied and extracted files. If needed, download an MD5 checksum tool at http://www.fastsum.com/. If errors are discovered, contact customer support and Affymetrix can provide specific file downloads. Also download the associated library files for the SNP chips (Affymetrix, 2).
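For readers who would rather script the checksum step than install a separate tool, the following is a minimal Python sketch. It assumes the expected digests have already been gathered into a dictionary keyed by relative path; the function names and that dictionary layout are our own illustration and are not part of the Affymetrix distribution.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large chip files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected, root):
    """Compare expected {relative_path: md5} entries against files under root.

    Returns the relative paths that are missing or whose digest differs.
    The dictionary layout is an assumption made for this sketch.
    """
    bad = []
    for rel_path, want in expected.items():
        target = Path(root) / rel_path
        if not target.is_file() or md5_of_file(target) != want.lower():
            bad.append(rel_path)
    return bad
```

Any path returned by find_mismatches is a candidate for a replacement download.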

IMPORTING CHIP DATA

Start the import process by designing the Experimental Design File (EDF). This file will inform JMP Genomics of the .CHP file locations and assign a unique identifier to each file. Select Genomics > Experimental Design File > Create a New Design File Template. Select the “Choose” button next to “Folder of Raw Data Files” and navigate to the folder where the extracted Affymetrix data is located. From this location navigate to affymetrix_public_data\calls\GenomeWideSNP_6\hapmap\cc-chp. Set “File Filter Expression” to “.chp” from the pull down menu. Enter the Output File Name as “edf”. Choose or create an appropriate output folder and select “Run” (Figure 1).

Figure 1: Experimental Design File Builder.

After completion the EDF file will be presented. The ColumnName column is used as a unique identifier for each row, so this must be populated. The file name in the File column will be used as this identifier. This can be done using the data step. Select Genomics > Data Set Utilities > Data Step. Choose the newly created edf data set as your “Input Data Set”. In the “Pre-Set Data Set Statements” text block enter:

Attrib ColumnName length=$7;

In the “Data Step Statements” text block enter:

ColumnName = substrn(File, 1, 7);

For an “Output Folder” select the directory where the EDF file is located. Select “Run” (Figure 2). A data set named edf_dsp will be generated that contains the correctly populated ColumnName column.
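The effect of the data step above can be sketched outside SAS. The Python below is only an illustration of the idea: it keeps the first seven characters of each file name and, as an extra check that is not part of the JMP Genomics workflow, refuses to continue if two files would share an identifier. The function names are our own.

```python
def column_name(file_name, width=7):
    """Mimic substrn(File, 1, 7): keep the first `width` characters."""
    return file_name[:width]


def build_identifiers(file_names, width=7):
    """Map each file name to its identifier, refusing duplicates."""
    seen = {}  # identifier -> file name that first produced it
    for name in file_names:
        key = column_name(name, width)
        if key in seen:
            raise ValueError(
                f"{name!r} and {seen[key]!r} share identifier {key!r}"
            )
        seen[key] = name
    return {name: key for key, name in seen.items()}
```

If the first seven characters were not unique across the .chp files, a longer prefix would be needed both here and in the data step.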

Begin the import of Affymetrix data by selecting Genomics > Import > Affymetrix > SNP CHP. For the “Experimental Design File” choose the edf_dsp data set. Choose affymetrix_public_data\calls\GenomeWideSNP_6\hapmap\cc-chp as the location for the “Folder Containing Raw Data Files”. The folder containing the CDF file can be found at affymetrix_public_data\CD_GenomeWideSNP_6_rev3\Full\GenomeWideSNP_6\LibFiles. The “Annotation Data Set” can be located at affymetrix_public_data\ref\genomewide_6_na27_annot.sas7bdat (Figure 3). Select the “Options” tab. Change the “Number of Rows to Scan” to 20. This will decrease the amount of disk space required during the import process. Select “Run”.

Figure 2: Data Step to define ColumnName.


The import process is quite time consuming and requires a large amount of disk space. JMP will not display a running status update, but progress can be monitored by opening the folder used for writing temp files. Open this directory by clicking on Genomics > Documentation and Help > Open SAS Temporary Folder. Recall that this data set has 270 individuals, so import progress can be monitored based on the numbering of the files in the temporary folder. When the import is finished, attempting to open the temporary folder will fail. The hapmap_chp_edf_data data set will be created in the output folder once the import is finished.

PREPARING DATA FOR ANALYSIS

Figure 3: SNP CHP Input Engine.

The data from Affymetrix contains no archived disease or reaction state, so in order to have a meaningful association test a reaction state must be generated. This can be accomplished by adding a binary column to the imported data set in the data step. Click on Genomics > Data Set Utilities > Data Step. Select hapmap_chp_edf_data. Ignore the warning about the number of columns. In the “Pre-SET Data Step Statement” text box enter:

length disease $ 3;

In the “Data Step Statements” text box enter:

disease = ranbin(234567, 1, .5);

Choose the output folder where the hapmap_chp_edf_data.sas7bdat file exists and select “Run” (Figure 4). The data set hapmap_chp_edf_data_dsp.sas7bdat will be generated and will contain the disease variable.

Since this is sample data, a random assortment of reaction states is created. This Data Step statement will generate a new column with random values of 1 or 0. When using clinical data, true reaction states may be merged in by using Genomics > Data Set Utilities > Merge.
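The idea of a seeded, random binary reaction state can be illustrated in a few lines of Python. This is a sketch only: it reuses the seed value from the data step for readability, but Python's generator will not reproduce the stream that SAS's ranbin produces, and the function name is our own.

```python
import random


def simulate_reaction_states(sample_ids, seed=234567, p=0.5):
    """Assign each sample a reproducible pseudo-random state of 0 or 1."""
    rng = random.Random(seed)  # private generator, so results repeat per seed
    return {sample: int(rng.random() < p) for sample in sample_ids}
```

Because the states are independent of the genotypes, any association found between them and a SNP is due to chance, which is what makes this a useful dry run of the workflow.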

ANALYSIS

Now that genomic information and a reaction state are consolidated into one file, statistical analysis can begin. JMP Genomics has several methods for testing associations under the Genomics > Association Testing menu. The relationship between reaction state and SNPs will be studied using Case Control Association, which employs a chi-squared association test.
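To make the underlying statistic concrete, here is a minimal Python sketch of a Pearson chi-squared test on a 2x2 table of allele counts (cases versus controls, allele A versus allele B). It is a textbook allelic test written for illustration and is not claimed to match the exact set of tests JMP Genomics runs; the function name and input layout are assumptions.

```python
import math


def allelic_chi_squared(case_counts, control_counts):
    """Pearson chi-squared test on a 2x2 table of allele counts.

    case_counts and control_counts are (allele_A, allele_B) tallies.
    Returns (statistic, p_value) with one degree of freedom.
    """
    table = [list(case_counts), list(control_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if total == 0 or 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a nonzero total")
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of allele and case status.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (table[i][j] - expected) ** 2 / expected
    # For one degree of freedom the chi-squared tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```

Running one such test per marker is what produces the long list of p-values examined in the results.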

Figure 4: Data Step adding a reaction state to our SNP CHP data.

Navigate to Genomics > Association Testing > Case Control Association. For the “Input Data Set” choose hapmap_chp_edf_data_dsp.sas7bdat, which was generated in our last data step. Ignore the warning about column limits. Once the Available Variables are displayed, select “disease” and click on the arrow beside “Trait Variable”. In the “List-Style Specification of Marker Variables” box enter “SNP:”. This tells JMP that every column beginning with SNP will be used as a marker variable. Choose an “Output Folder” to store results (Figure 5). Click on the “Options” tab and change the “Format of Marker Variables” radio button to “Alleles”. Click on the “Annotation” tab and select “Probe_Set_ID”. Select the right arrow to the left of “Annotation Label Variable”. Select “Run”.

Figure 5: Case Control Association.

Figure 6: Result of Case Control Association Tests.

Figure 7: A subset of our results. Note the sliding scale to change limits for this subset.

RESULTS

When the analysis is completed JMP will present a table containing the case control association test results and a graphic of the -log10(p) score for each SNP tested (Figure 6). The classical significance threshold of .05 for a chi-squared test is of little use when this many tests are run, so care needs to be taken in analysis. With about 1 million SNPs, a meaningful -log10(p) score for a data set should probably be in the vicinity of 8 (p < 10^-8). Since disease states were randomly assigned, exceptionally good scores are not expected. Click on the outliers or click and drag to focus on specific SNPs. Clicking on a specific test will present the SNP label.
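One common way to arrive at a cutoff of this order is a Bonferroni correction, which divides the desired family-wise error rate by the number of tests. The sketch below works through that arithmetic; it is offered as one possible justification and is not necessarily the reasoning behind the figure quoted above.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def neg_log10(p):
    """Convert a p-value to the -log10(p) score shown in the results plot."""
    return -math.log10(p)


# 0.05 spread over one million tests is about 5e-8 per test,
# which corresponds to a -log10(p) score of roughly 7.3.
cutoff = bonferroni_threshold(0.05, 1_000_000)
score = neg_log10(cutoff)
```

A score near 7.3 is the same order of magnitude as the stricter cutoff of 8 suggested in the text.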

Subsetting the data set helps focus on score ranges and save statistically relevant results. The bottom left corner of the association results data set shows a synopsis of the rows present in the data set. Click on the red arrow to the left of the word “Rows”. Select “Data Filter”. Highlight “ProbTrend” and select “Add”. On the next screen there will be a bar detailing the range of ProbTrend values to view. Adjust this setting by dragging either side of the bar (Figure 7). This will highlight the rows that fall within the chosen range limits. Data can be subsetted and saved by clicking on the red arrow beside “Data Filter” at the top of the window.
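The range selection performed by the Data Filter amounts to keeping rows whose value falls between two limits. A small Python sketch of that operation follows; the rows are assumed to be dictionaries keyed by column name, which is a layout chosen for this illustration.

```python
def filter_by_range(rows, column, low, high):
    """Keep rows whose value in `column` lies within [low, high]."""
    return [row for row in rows if low <= row[column] <= high]
```

For example, calling filter_by_range(rows, "ProbTrend", 0.0, 1e-8) would keep only the tests that pass the cutoff discussed above.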

If a strong association is discovered, further investigation into the relationship can ensue. If the relationship is strong, a genetic screening test to predict potential reaction states can be developed. The FDA will require a refocused clinical trial, but a potentially viable and profitable drug could be rescued.

CONCLUSION

JMP Genomics is a powerful graphical tool, capable of handling the large data sets requisite for meaningful analysis when dealing with pharmacogenomic scale decision processes. The dynamic linking capabilities of JMP help highlight outliers, and the graphical power of JMP facilitates the design process requisite for pharmacogenomic analysis. SNP analysis was presented, but JMP Genomics can also perform copy number analysis, principal component analysis, haplotype analysis, QTL mapping, predictive modeling, pedigree based associations and other common tasks required for genome and proteome analysis. JMP Genomics offers a comprehensive suite of powerful tools capable of meeting current pharmacogenomic needs.

REFERENCES

Affymetrix, 1. Sample Data, Genome-Wide Human SNP Array 6.0. http://www.affymetrix.com/support/technical/sample_data/genomewide_snp6_data.affx

Affymetrix, 2. Support by Product for Genome-Wide Human SNP Array 6.0. http://www.affymetrix.com/support/technical/byproduct.affx?product=genomewidesnp_6

International HapMap Consortium. 2003. “The International HapMap Project”. Nature 426:789-796.

Mallal, S, Phillips, E, Carosi, G, et al. 2008. “HLA-B*5701 screening for hypersensitivity to abacavir”. N. Engl. J. Med. 358:568-579.

Rettie, AE, Tai, G. 2006. “The pharmacogenomics of warfarin: closing in on personalized medicine”. Mol. Interv. 6:223-227.

CONTACT INFORMATION

Jason Baucom, PhD 2103 Trailridge Ct Raleigh, NC 27603 919-539-3715 [email protected]
