Identifying and Validating Alternative Splicing Events: An Introduction to Managing Data Provided by GeneChip® Exon Arrays


AFFYMETRIX® PRODUCT FAMILY > RNA ARRAYS AND REAGENTS >

Technical Note


GeneChip® Exon Arrays are powerful tools for the discovery and study of mRNA transcript diversity. For the first time, researchers will obtain approximately 1.4 million data points from each sample in a single experiment. This increased data density also poses a number of data analysis challenges, including management of a higher number of potential false positives from the much larger data set. In addition, exon arrays provide a new dimension of genomic information beyond classical gene expression results from microarrays: alternative splicing. For the analysis of alternative splicing, new algorithms will need to be developed and tested in various data sets. This is an active area of research and we anticipate that new developments and methods will continue to emerge with the increasing availability of sample data sets on exon arrays.

In this Technical Note, we present practical recommendations for managing some of these challenges based on our experience at Affymetrix. A sample workflow operating mainly within Affymetrix® Expression Console™ Software and other GeneChip®-compatible™ software programs for exon array analysis is described in detail. Our experience shows that systematic filtering of the raw array results, as detailed here, is critical to the improvement of validation rate in the subsequent RT-PCR validation experiments. Most of the analysis strategies discussed in this Technical Note will be applicable to any statistical method developed for identifying alternative splicing events.

Figure 1: Different types of alternative splicing events. [Panels: Cassette exon, Alt 5'ss, Mutually exclusive, Alt 3'ss, Intron retention, Alternative polyadenylation, Alternative promoters.]

INTRODUCTION

Alternative splicing is a major source of protein diversity for higher eukaryotic organisms, and is frequently regulated in a developmental stage-specific or tissue-specific manner. Current estimates suggest that 50 to 75 percent (or more) of human genes have multiple isoforms. Splice variants from the same gene can produce proteins with distinct properties and different (even antagonistic) functions. In addition, a number of genetic mutations involved in human disease have been mapped to changes in splicing signals or sequences that regulate splicing. Thus, an understanding of changes in splicing patterns is critical to a comprehensive understanding of biological regulation and disease mechanisms.

This Technical Note provides detailed guidelines for those using exon arrays for alternative splicing analysis, to help researchers generate meaningful interpretation of exon array data more quickly. These guidelines include the following:
- An introduction to alternative splicing prediction algorithms when comparing changes that have occurred between two groups of samples
- Description of an analysis workflow
- Practical considerations in filtering data
- Experimental verification of alternative splicing events

A list of technical support materials is included for convenient reference. It is highly recommended that users review these reference

documents to become familiar with the array design and basic algorithms associated with exon arrays prior to use of this Technical Note to walk through the actual analysis workflow. For detailed information about using this workflow to generate biologically significant results comparing colon cancer and normal samples, see Gardina, et al.

BACKGROUND

EXON ARRAY DESIGN

GeneChip® Exon 1.0 ST Arrays are a powerful tool for the study of alternative splicing. The ability to treat individual exons (or parts of exons) as independent objects makes it possible to observe differential skipping or inclusion of exons. This is not possible (or at least, certainly not optimal) on more classical expression array designs that focus on transcription activities at the 3' end of a gene.

The exon array was designed to be as inclusive as possible at the exon level, drawn from annotations ranging from empirically determined, highly curated mRNA sequences to ab-initio computational predictions (for more information, see Technical Note, "GeneChip® Exon Array Design"), enabling the discovery of new alternative splicing events. This is an advantage over exon-exon junction arrays, which are typically designed against only observed or annotated junctions.

The GeneChip Human Exon 1.0 ST Array contains approximately 5.4 million probes ("features") grouped into 1.4 million probe sets, interrogating over 1 million exon clusters, which are exon annotations from various sources that overlap by genomic location. A Probe Selection Region (PSR) represents a region of the genome that is predicted to act as an integral, coherent unit of transcriptional behavior. A PSR is the target sequence from which probes are designed. In many cases, each PSR is an exon; in other cases, due to potentially overlapping exon structures (or alternative splice site utilization), several PSRs may form contiguous, non-overlapping subsets of a true biological exon.

The median size of PSRs is 123 bp with a minimum size of 25 bp. About 90 percent of the PSRs are represented by four Perfect Match (PM) probes (a "probe set"). Such redundancy allows robust statistical algorithms to be used in estimating presence of signal, relative expression and existence of alternative splicing.

The exon arrays do not include a paired Mismatch probe for each PM probe. Instead, surrogate background intensities are derived from approximately 1,000 pooled probes with the same GC content as each PM probe. One commonly used set of background probes is called the "antigenomic" background probe set, which contains sequences that are not present in the human genome (or a few other genomes) and are not expected to cross-hybridize with human transcripts.

In addition, exon arrays provide robust gene-level expression analysis. The median number of probes for each RefSeq gene is 30 to 40, distributed along the entire length of the transcript, as compared to probes selected only at the 3' end in classical gene expression microarrays.

PROBE SET ANNOTATIONS AND TRANSCRIPT CLUSTERS

The plethora of exon architectures (as shown schematically in Figure 1, e.g., cassette exons, mutually exclusive exons, alternative splice sites, alternative transcriptional starts and stops), the variations in quality of transcript annotations and the necessity of rapidly incorporating new genomic knowledge have led to a dynamic design for reconstituting exons into genes.

A set of rules was created for virtual assembly of the probe sets (exon-level) into transcript clusters (gene-level) based on the confidence level of the supporting evidence and the juxtapositions of the exon borders (White Paper, "Exon Probe set Annotations and Transcript Cluster Groupings v1.0").

The mapping between probe sets and transcript clusters is defined by meta-probe set lists as described below (in order of decreasing confidence). The number of clusters cited below reflects the version of mapping files provided by Affymetrix as of November 2006. Updated mapping files incorporating the latest information may be downloaded directly from the Affymetrix web site.
- Core: RefSeq transcripts and full-length mRNAs (17,800 transcript clusters)
- Extended: Core + cDNA-based annotations (129,000 clusters)
- Full: Extended + ab-initio gene predictions (262,000 clusters)

SIGNAL ESTIMATION ALGORITHMS

Several statistical methods may be used to combine information from probes belonging to the same gene, or exon, to generate expression signal values of the gene or exon. For example, RMA and PLIER are two of the most commonly used algorithms. Although this Technical Note focuses only on the workflow with PLIER, the basic principles apply to RMA and other signal estimation analysis methods as well.

Relative expression can be determined using the PLIER algorithm (White Paper, "Guide to Probe Logarithmic Intensity Error (PLIER) Estimation"), a robust M-estimator that uses a multi-chip analysis to fit a model for feature response and target response for each sample. The target response is the PLIER estimate of the signal of a probe set (exon-level).

Gene-level PLIER estimates are derived by combining all probe sets predicted to map into the same transcript cluster (according to the meta-probe set list). Since PLIER is a model-based algorithm, exons that are alternatively spliced in the samples, and that therefore exhibit different expression patterns compared to the constitutive exons, will have a down-weighted effect on overall gene-level target response values (White Paper, "Gene Signal Estimates from Exon Arrays").

IterPLIER is a variation that iteratively discards features (probes) that do not correlate well with the overall gene-level signal and then recalculates the signal estimate to derive a robust estimation of the gene expression value primarily based on the expression levels of the constitutive exons.

Presence/absence of exons is determined using "Detection Above Background" (DABG), as documented in the Detection Call section of the Technical Note, "Statistical Algorithms Reference Guide," using surrogate background intensities as described above.
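The idea behind a detection call against GC-matched background probes can be illustrated with a short sketch. This is not the APT implementation of DABG: the add-one smoothing, the use of Fisher's method to combine probe-level p-values, and all intensities below are assumptions made for illustration only.

```python
import math

def probe_p_value(pm_intensity, background):
    """Empirical p-value for one PM probe: the share of GC-matched
    background probes that are at least as bright as the PM probe.
    Add-one smoothing (an assumption) keeps the p-value above zero."""
    exceed = sum(1 for b in background if b >= pm_intensity)
    return (exceed + 1) / (len(background) + 1)

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.
    X = -2 * sum(ln p) follows a chi-square distribution with 2k degrees
    of freedom, whose survival function has a closed form for even d.o.f."""
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # X / 2
    return math.exp(-half_x) * sum(half_x ** i / math.factorial(i) for i in range(k))

# Hypothetical background pool for one GC-content bin (intensities 0..99).
background_pool = list(range(100))

# A four-probe probe set that is clearly above background...
bright = fisher_combine([probe_p_value(200, background_pool) for _ in range(4)])
# ...and one that sits in the middle of the background distribution.
dim = fisher_combine([probe_p_value(50, background_pool) for _ in range(4)])
```

In this toy setting the bright probe set receives a very small combined p-value and would be called present, while the dim one would not.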

ALTERNATIVE SPLICING ALGORITHMS

Alternative splicing by definition is differential exon usage. It is not sufficient, however, to simply identify exons with differential expression patterns; we must also account for differential transcription of the gene itself. For example, if a gene is expressed two-fold higher in sample "A" than in sample "B," then all of the constitutively expressed exons in that gene are also likely to have two-fold higher signal values. Thus, in order to focus on the differential inclusion of individual exons, we need to normalize to the transcription rate of the gene, as shown in Figure 2. This concept led to the development of the "Splicing Index" (Srinivasan K., et al.) and was the motivation for other methods (MiDAS, Robust PAC, etc.). Different analytical methods for performing splicing analysis may deal with this issue in different ways, but ultimately, the total transcription activity and splicing change must be included in the analysis for the method to be successful.

Figure 2: Exon-level signal consists of the gene-level expression and the inclusion of that exon as a part of the gene (splicing rate). The level of a given exon is a result of the transcription rate of the gene and the inclusion rate of the exon.

\[ \text{Exon Level} = \text{Transcription} \times \text{Splicing} \]

The Splicing Index is a conceptually simple algorithm that aims to identify exons (actually, PSRs) that have different inclusion rates (relative to the gene level) between two sample groups. The normalized exon intensity (NI), as described in Figure 3, is the ratio of the probe set intensity to the gene intensity as estimated by PLIER (or IterPLIER).

Figure 3: Gene-level normalized intensity (NI).

\[ \text{NI} = \frac{\text{Probe set intensity}}{\text{Expression level of the ``gene''}} \]

The Splicing Index Value (SI) is calculated by taking the log ratio (base 2) of the NI in Sample 1 and the NI in Sample 2. A hypothetical example of this is shown in Figure 4, where the inclusion rate of the measured exon is 10 times lower in Sample 2, despite the fact that the actual intensity is higher.

Figure 4: Splicing Index Value (SI).

                        Sample 1    Sample 2
  Probe set intensity      500         600
  Gene level               500        6000
  NI                       1.0         0.1

Sample 1 has a 10x higher inclusion level.

\[ \text{SI} = \log_2\!\left(\frac{\text{Sample 1 NI}}{\text{Sample 2 NI}}\right) = \log_2\!\left(\frac{1.0}{0.1}\right) = +3.32 \]

An SI of 0 indicates equal inclusion rates of the exon in both samples, positive values indicate enrichment of that exon in Sample 1, and negative values indicate repression or exon skipping in Sample 1 relative to Sample 2.

To identify exons that have statistically significant changes in inclusion rates between two groups, a Student's t-test can be performed on the gene-level normalized exon intensities. The absolute value of the Splicing Index represents the magnitude of the difference for exon inclusion between the two samples (or groups of samples). It is important to consider both the p-value from the t-test and the magnitude of the change, since the best candidates for validation are likely to have very small p-values and very large SIs (either positive or negative).

MiDAS, which is implemented as a command-line program of the Affymetrix Power Tools (APT), is conceptually similar to the Splicing Index but allows simultaneous comparisons between multiple sample groups. MiDAS employs the gene-level normalized exon intensities in an ANOVA model to test the hypothesis that no alternative splicing occurs for a particular exon. In the case of only two sample groups, ANOVA reduces to a t-test.
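The Figure 4 arithmetic can be reproduced in a few lines. This is a minimal sketch, not Affymetrix code: the function names are hypothetical, and the group-level variant (difference of mean log2 NI between groups) is one plausible convention that this note does not itself specify.

```python
import math

def normalized_intensity(probe_set_signal, gene_signal):
    """Gene-level normalized intensity (NI): exon signal relative to gene signal."""
    return probe_set_signal / gene_signal

def splicing_index(ni_sample1, ni_sample2):
    """Splicing Index: log2 ratio of the NI in Sample 1 to the NI in Sample 2."""
    return math.log2(ni_sample1 / ni_sample2)

def group_splicing_index(ni_group1, ni_group2):
    """Assumed group-level form: difference of the mean log2 NI of each group."""
    mean_log = lambda values: sum(math.log2(v) for v in values) / len(values)
    return mean_log(ni_group1) - mean_log(ni_group2)

# Values from the hypothetical example in Figure 4.
ni_1 = normalized_intensity(500, 500)    # Sample 1
ni_2 = normalized_intensity(600, 6000)   # Sample 2
si = splicing_index(ni_1, ni_2)          # positive: exon enriched in Sample 1
```

Although the raw probe set intensity is higher in Sample 2 (600 versus 500), normalizing to the gene level reverses the picture, which is the point of the Splicing Index.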


For a more detailed description of algorithms from Affymetrix, refer to the White Paper "Alternative Transcript Analysis Methods for Exon Arrays," available at www.affymetrix.com. Various GeneChip®-compatible™ software products for the exon application also introduce additional options and algorithms for alternative splicing analysis. A complete list can be found at http://www.affymetrix.com/products/software/compatible/exon_expression.affx.

CAVEATS OF MIDAS AND THE SPLICING INDEX

There are several caveats of the currently implemented, first-generation splicing algorithms that are important to consider when interpreting exon array results.

First, the alternative splicing results are highly dependent on the accurate annotation of the transcript clusters and precise quantitation of gene-level estimates. Mis-annotated transcript clusters or otherwise incorrect gene-level predictions may produce less reliable results. For example, genes with many alternatively spliced exons (a large percentage relative to the total number of exons) or instances where only a subset of the gene is expressed (due to alternative transcriptional starts/stops) are likely to have some inaccuracy in gene-level estimates. One possible improvement that may increase the robustness of gene-level estimates is to only use probe sets interrogating constitutive exons. However, this can be extremely dependent on specific samples and difficult to predict. The Splicing Index works best when the gene has a large number of constitutive exons and a small number of alternative exons.

Furthermore, these algorithms assume that the splicing pattern is consistent among all of the samples within a group. This is probably not always the case, e.g., tissue-map experiments that combine multiple tissue types into a single group. Utilizing groups consisting of multiple tissue types makes it impossible to discover splicing differences between two tissues within the same group. Large variations of splicing pattern (and gene expression, to a lesser extent) within a group will reduce the significance of the t-test, resulting in a larger p-value. One potential solution for a tissue-map experiment with many different tissue types is to treat each tissue as a different group and do many pair-wise comparisons of a single tissue to all other tissues. MiDAS is capable of including more than two groups using ANOVA.

The third limitation is the requirement for replicates in the statistical t-test (minimum of three samples per group). While it is actually possible to run the algorithms with fewer than three samples per group, the t-test needs the replicates to estimate the intra-group variability to calculate meaningful p-values. In cases where there are insufficient replicates within a sample group, it may be possible to create logical groups that might be expected to have similar splicing patterns. As with many algorithms, the result is likely to be more robust with increased numbers of replicates, larger group sizes and increased consistency within each group.

CONTEXT FOR INTERPRETING ALTERNATIVE SPLICING DATA

Alternative splicing introduces a new level of complexity relative to the analysis of gene expression. Several things are critical to consider for planning, evaluating and interpreting microarray data when screening for altered splicing events:

1. Alternative splicing is often a subtle event: there is rarely an all-or-none change in exon inclusion between one tissue and another. A more likely scenario may be a shift from 50 percent inclusion of an exon in one tissue to 80 percent inclusion in a different tissue. A small change in the ratio of isoforms may be biologically significant.

2. Alternative splicing is a very common phenomenon: approximately 75 percent (or more) of all genes appear to be alternatively spliced. Between any two tissue types there may be thousands of probe sets that vary in their inclusion into transcripts. The splicing pattern can also be inherited and therefore differ from individual to individual, resulting in heterogeneity in the population examined. This adds an additional factor in interpreting results or identifying disease-specific alternative splicing events in the mixed background.

3. We are typically dealing with a mixed population of related transcripts from a locus rather than the singular "gene" that we usually associate with other expression arrays. This is a messy concept, but it more closely reflects real biology. The exon array does not detect transcripts per se; it detects exons that can be virtually reassembled according to the meta-probe sets. It has no information about which exons are actually physically linked together in transcripts, but instead treats genes as a collection of all the exons associated with that locus. There may, in fact, be a number of combinations of transcripts that would account for the observed exon probe set signals; therefore, known transcripts from public databases may be a useful guide for interpreting the results.

4. Differential gene expression may further complicate differential alternative splicing analysis. In order to calculate differential exon inclusion into gene transcripts, we always have to account for relative differences in gene expression between two sample types. Therefore, predictors of alternative splicing must accurately estimate both gene-level signals and exon-level signals.

The combination of these factors means that typical considerations in designing a good microarray study, such as sample size and sample quality, become even more critical. For example, tumors may be a mixture of different tumor stages and probably also mixed with normal tissue. Even normal tissue may be composed of various cell types exhibiting different splicing patterns; for instance, colonic sections can consist of both smooth muscle and epithelial tissue.

Due to the inherent heterogeneity of the biology, the data needs filtering by multiple methods at both the exon and gene level to reduce false positives (described in detail in the Workflow and Filtering Methods sections below). Visually inspecting the probe set intensities in a genomic context is also a useful way to improve true discovery rate.

Ultimately, experimental confirmation by a different technology (RT-PCR, for example) will provide confidence in the results obtained from statistical analysis. However, experimental validation is laborious, so

approaches to narrow down the candidate list computationally may save much effort in the lab. It should also be noted that in some cases filtering the data in an attempt to reduce false positives may involve a trade-off with the loss of true positives. The suggested filtering methods are intended to highly enrich the results with true positives, thereby maximizing the likelihood of positive validation. Depending on the specific aims of an individual experiment, a different approach may be more appropriate.

In addition, some users may be interested in using exon arrays to detect unique individuals or samples within the experimental group that exhibit different splicing patterns compared to the rest of the samples. This analysis can be used effectively to detect individual-to-individual splicing variation or exon-skipping mutations, frequently observed in cancer samples. For this type of analysis, the methods described here are certainly relevant but the details will need to be modified to meet the specific research objectives.

WORKFLOW

This workflow (as schematically shown in Figure 5) begins with CEL files and concludes with empirical validation of the results. Some introductory concepts are described here:
- Some of the workflow occurs outside of Expression Console with external applications for scripting/filtering (Perl) and statistics (like Partek and other GeneChip®-compatible™ software packages).
- There are actually two parallel analyses (gene-level and exon-level) carried out in Expression Console that ultimately converge at MiDAS (or other alternative algorithms) for alternative splicing prediction.
- The assumption for this entire approach is that there are two or more sample groups, and we intend to find differential alternative splicing between the groups. Alternative splicing can occur within one tissue, but in this scenario, we are only concerned with the way it varies in different tissues or conditions.
- There are two points at which meta-probe set options are invoked:
  1. In signal estimation, the meta-probe set defines the exons to be used to calculate gene-level signal (PLIER, IterPLIER, RMA, etc.), and therefore defines the set of genes to include in the analysis.
  2. In MiDAS, the meta-probe set defines the exons that are mapped to the input genes.
  The result is that you could, for instance, specify only Core (highly confident) genes, but use MiDAS to predict alternative splicing for the Full (speculative) set of exons associated with those Core genes.
- Command-line formats are provided for those using UNIX-based systems. The analogous input options for the Windows GUI version of Expression Console are explained in detail in the user's manual.

Figure 5: Workflow for gene expression and alternative splicing analysis with exon arrays. Tan boxes represent files or data sets and green boxes represent processes or programs. Functions within the gray boxes occur within Expression Console, or Affymetrix APT. Solid arrows are the main data flow and dashed lines are accessory flows. [Diagram boxes: CEL; Meta-probe set; IterPLIER; PLIER; DABG; Gene signal; Exon signal; Exon p-values; Remove outlier samples; Filter gene signals; Filter exon signals; Meta-probe set; MiDAS; AS predictions; Gene expression analysis; Fold change and p-value; Visualization; Validation.]

PHASE I: SIGNAL ESTIMATION

Listed below is a suggested example to conduct exon array analysis. Although there are other possible ways to perform these same functions, we describe one of them here as a starting point for new users.

A) Gene level
- Use Expression Console
- IterPLIER (PM-GCBG)
- Antigenomic.bgp as the GC background pool. Antigenomic sequences are not found in the human genome or several other genomes.
- Sketch normalization at 50,000 data points
- Meta-probe set = Core (this is the most conservative set of genes with the highest confidence)
- Set of CEL files

Do not use DABG for gene-level estimation of "Present/Absent." Detection calls for genes will be estimated separately with another method (as described below during the filtering steps).

For those using the command-line APT programs on UNIX:

    probeset-summarize -a quant-norm.sketch=50000.bioc=false,pm-gcbg,iter-plier -p HuEx-1_0-st-v2.pgf -c HuEx-1_0-st-v2.clf -b antigenomic.bgp -s meta-probeset.core.txt MyCelDirectory/*.CEL


B) Exon level
- Use Expression Console
- PLIER (PM-GCBG). Most probe sets only have four probes, which is too limited to be useful with IterPLIER at the individual exon level.
- Antigenomic.bgp
- Sketch normalization at 50,000 data points
- DABG (PM-only), which produces p-values for detection above background
- It is possible to limit the analysis to a subset of exons by providing a list file. However, in general, it may be best to begin with an inclusive set like probeset-list.main.txt to minimize any bias in analysis.
- Set of CEL files

For those using the command-line APT programs on UNIX:

    probeset-summarize -a plier-gcbg-sketch -p HuEx-1_0-st-v2.pgf -c HuEx-1_0-st-v2.clf -b antigenomic.bgp -s probeset-list.main.txt -a pm-only,dabg -x 2 MyCelDirectory/*.CEL

PHASE II: REMOVING OUTLIERS

Outlier samples should be identified and eliminated, or at a minimum, accounted for. This is particularly true in analysis of exon array data since low sample quality is likely to be highly influential in generating noise, therefore leading to high false positive rates. A typical approach is Principal Component Analysis (PCA) for identification of possible outliers, followed by ANOVA to test their effect (using a standard statistical package, such as Partek's Genomics Suite). Examples are shown in Figure 6. Samples found as extreme outliers should be removed from the analysis.

Figure 6: Principal Component Analysis (PCA) to identify outlier samples. The examples shown here are the colon and normal paired samples run on exon arrays. The CEL files are available at www.affymetrix.com. Exon-level signals from each sample are mapped by PCA, and paired tumor-normal samples are joined by lines. The circled sample data (back left) appears to be an outlier in two dimensions (1 and 3) of the PCA mapping. However, it is Patient #3 (black arrows) that behaves contrary to the majority of the samples; while most of the tumor samples tend to have a higher value on the chief PCA component (i.e., rightward on the X-axis) than the normal tissue, Patient #3 tends to the opposite direction. The aberrant behavior of Patient #3 is confirmed by ANOVA (shown in Figure 7 below). This also illustrates one of the advantages of having paired samples. [Plot axes: PC #1 (20.6%) and PC #2 (8.73%); legend: Tissue type, Normal or Tumor.]

Figure 7: ANOVA analysis of signal and noise. Exclusion of the sample pair from Patient #3 ("Del_3") in ANOVA analysis greatly improves the signal-to-noise ratio for discrimination by tissue type (normal vs. tumor). Therefore, Patient #3 was removed from the remainder of the analysis described below. No other sample pair had this magnitude of effect. Be aware, however, that removing outliers reduces sample size; therefore removal of outlier samples should be done cautiously. Furthermore, noise ratios (normalized against error) from ANOVA also showed that the gender, patient and tumor-stage categories were relevant (F Ratio > 1), and thus should be included as factors in later analysis. [Plot title: Sources of Variation.]
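The PCA screening step can be illustrated with a small, dependency-free sketch. It is not what Partek or Expression Console computes internally: it extracts only the first principal component by power iteration, and the signal matrix and the "largest absolute score" rule are hypothetical choices made for illustration.

```python
import math

def first_pc_scores(samples, iterations=500):
    """Score each sample (row) on the first principal component.
    Uses power iteration on the centered data instead of a full PCA."""
    n, m = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in samples]
    # Deterministic, non-uniform starting vector for the loadings.
    v = [float(j + 1) for j in range(m)]
    for _ in range(iterations):
        scores = [sum(r[j] * v[j] for j in range(m)) for r in centered]
        # Multiply by X^T X, then renormalize.
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return [sum(r[j] * v[j] for j in range(m)) for r in centered]

# Hypothetical log2 exon-level signals: 5 samples x 4 probe sets.
signals = [
    [8.0, 9.0, 7.0, 10.0],
    [8.1, 9.1, 6.9, 10.1],
    [7.9, 8.9, 7.1, 9.9],
    [8.0, 9.2, 7.0, 10.0],
    [11.0, 6.0, 7.0, 13.0],  # deliberately aberrant sample
]
scores = first_pc_scores(signals)
most_extreme = max(range(len(scores)), key=lambda i: abs(scores[i]))
```

A sample flagged this way is only a candidate outlier; as the text notes, its effect should still be tested (for example by ANOVA) before it is removed.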

[Figure 7 plot axes: F Ratio (0 to 6) for the factors Patient, Tissue, Gender, Stage and Ulcer; series: All, Del_3 and Ave.]

PHASE III: FILTERING SIGNAL DATA

In order to obtain meaningful splicing information and to decrease the chances of false positives (thereby increasing the verification rate), a number of filtering steps can be performed.

Within the analysis described in this Technical Note, validation of splicing events in

the laboratory (e.g., by RT-PCR) is by far the most laborious step. Therefore, additional effort made during computational analysis to reduce the false positive rate will ultimately decrease time spent in the lab. In this section, we present some suggestions for minimizing the impact of artifactual signals. In addition, several more stringent optional filtering steps are possible that may further lessen false positives and make it easier to find "low-hanging fruit" for validation. Figure 8 illustrates several examples of possible scenarios that may lead to misidentification of alternative splicing events.

Figure 8: Several scenarios that may lead to artifactual predictions of alternative splicing events. In each case, it is the combination of misleading probe set results and differential gene expression that creates a false prediction of alternative splicing. [Panel columns, left to right: Gene expressed in only one group; Cross-hybridizing probe set; Non-responsive / absent probe set. Panel rows: Intensity; Gene-level Normalized Intensity; Splicing Index Value (Blue/Red). Legend: Group 1, Gene Level, Splicing Index.]

Several filtering methods applied to signal data are suggested here and are described in detail in the Filtering Methods section. Some of these methods have been implemented in third-party software, but otherwise can be performed using simple scripts, e.g., in Perl.

PRIMARY (MANDATORY) FILTERING
1. Remove any gene (transcript cluster) that is not expressed in both sample groups, to eliminate the scenario illustrated in the left column of Figure 8.
2. Remove any exon (probe set) that is not expressed in at least one sample group, to eliminate the scenario illustrated in the rightmost column of Figure 8.

SECONDARY (SUGGESTED) FILTERING
1. Remove probe sets with high potential for cross-hybridization, to eliminate the scenario illustrated in the middle column of Figure 8.
2. Require a minimum gene signal level.
3. Remove probe sets with very large exon/gene intensity ratio, which may also implicate cross-hybridization to other gene sequences.
4. Remove probe sets with very low exon/gene intensity ratio in the group that is expected to have the higher rate of inclusion, which may also implicate non-linearity of the probe set.

TERTIARY (OPTIONAL) FILTERING
1. Remove genes that have very large differential expression.
2. Limit search to high-confidence exons.
3. Restrict the search to only highly expressed genes.
4. Focus on exons that have gene-level normalized intensities near 1.0 in the group predicted to have a higher inclusion rate and near 0 in the group predicted to have a lower inclusion rate.
5. Filter probes with unusually low variance.
6. Limit search to known alternative splicing events.

PHASE IV: MIDAS

Subsequent to filtering, both gene-level and exon-level signal intensities are input into MiDAS.

One option to carry out alternative splicing analysis through MiDAS is detailed here:
- Gene signals: Core (filtered); the Core gene-level meta-probe set is chosen at the signal-estimation stage
- Exon signals: all exons (filtered)
- Meta-probe set: Full; at this point the meta-probe set tells MiDAS which set of exons to evaluate. The combination described here means that MiDAS will look at all the exons ("Full") associated with the input Core genes. A more conservative approach may be to look only at Core exons by inputting the Core meta-probe set file.
- cel_ids.txt: This file tells MiDAS how samples are partitioned into Groups (e.g., "brain," "lung," "kidney") that should be compared for differential splicing. The ANOVA in MiDAS can compare multiple groups but does not incorporate additional factors like gender, tumor stage, etc.

For those using the command-line APT programs on UNIX:

    midas -c cel_ids.txt -g CC.genes-core.i-plier.sum.txt -e CC.exons-main.plier.sum.txt -m meta-probeset.full.txt

MiDAS outputs predictions for alternative splicing events as p-values. These should not necessarily be treated as true p-values but rather as scores that reflect relative ranking. The false positive rate is likely to be much higher than


indicated by p-values and the high-ranking candidates should be further screened/filtered and ultimately verified empirically.

PHASE V: POST-MIDAS FILTERING
1. Keep only probe sets with p-values less than a particular cut-off (e.g., p-value < 1 x 10^-3).
Probe sets with the smallest p-values are the most likely to have significant differences in inclusion rate. The cut-off value can easily be lowered to increase stringency. The optimal cut-off value depends on the number of targets you wish to see on your final list. In addition, you can apply a multiple testing correction (such as Bonferroni or the Benjamini-Hochberg False Discovery Rate) to determine the p-value at which the differences are considered statistically significant.
2. Sort results by magnitude (absolute value of Splicing Index) and keep only probe sets that have a minimum difference of 0.5.
Probe sets with larger magnitudes of predicted changes are more likely to have more dramatic splicing changes. Small magnitude changes may still be biologically relevant; however, larger magnitude changes typically make better candidates for validation. As discussed above, the cut-off value can easily be raised to increase stringency of the filter depending on the study design and needs. This step is analogous to filtering by fold change for gene expression. Whereas p-values measure the ability to statistically separate two groups, the Splicing Index (SI) gives the magnitude of the difference. Unfortunately, MiDAS does not directly output the SI and it must be calculated externally to the program (e.g., with Perl or R).

PHASE VI: VISUAL FILTERING OF RESULTS
Despite all attempts to filter the data, several classes of false positives cannot be identified purely from the values of the statistical analysis itself. A simple manual inspection of the data in genomic context can catch many of these potential pitfalls. Previous experience has shown the benefits of doing this in two ways.

First, BLAT the sequence of the PSR identified to be alternatively spliced (it can be obtained using the probe set ID on the NetAffx Data Analysis Center at www.affymetrix.com) to the UCSC Genome Browser (http://genome.ucsc.edu). BLATing the sequence can also provide information about potential cross-hybridization or whether the probe set maps to multiple genomic locations. View the probe set in the browser and zoom out slightly to get a broader view of the region around the exon or the entire gene that contains the exon of interest. The genome browser can provide important information about the location of the probe set within the gene and whether the exon is known to be alternatively spliced (based on EST and mRNA sequences). It may also be possible to observe overlapping, intervening or independent transcripts (on the same or opposite strand) and potential artifacts of transcript cluster annotations, such as multiple genes being combined into a single cluster.

Figure 9: A candidate splicing event mapped onto the UCSC Genome Browser. The highlighted region that corresponds to the targeted probe set (“YourSeq”) appears to be a cassette exon (one that is included or skipped in different known transcripts).

While not an absolute requirement for validation, candidates consistent with known examples of alternative splicing are more likely to prove to be true positives. However, this filtering approach will also tend to suppress the discovery of novel splicing events.

Second, it is worthwhile to examine the probe set intensity data of your exon of interest and surrounding exons in an integrated browser such as the Affymetrix Integrated Genome Browser (IGB) or BLIS (Biotique Systems). This is a helpful step to ensure that the data are consistent with alternative splicing. For example, if the exon of interest is predicted to be higher in group A (increased inclusion), but surrounding probe sets also appear higher, it may suggest that the region of the gene has an alternative start or stop that affects multiple exons. Also, if there are multiple probe sets for the exon, it is reassuring when the data for all of them are in agreement.

Figure 10: A view of two mutually exclusive exons (19a and 19b) in the BLIS viewer (Gardina, et al). This candidate splicing event was confirmed by RT-PCR. (BLIS normalizes the exon signal for each sample to the median exon signal for that sample across all the exons in the view.)

Visualization and careful observation of the data can often predict the likelihood of a positive result prior to verification. For example, consider an experiment where we are comparing tumor samples to normal samples to look for aberrant splicing patterns in cancer. We find an exon that is predicted by the Splicing Index (or similar algorithm) to be higher (increased inclusion) in the tumor samples. However, when we locate the position of the exon within the gene, we discover by looking at the genome browser that the targeted exon appears to be constitutive (always included) in numerous EST and mRNA sequences. This exon is very unlikely to be a simple case of alternative splicing (cassette exon), and thus is very unlikely to result in a positive validation. It is entirely possible that this seemingly constitutive exon is, in fact, alternatively spliced in cancer. However, in this case, the exon is predicted to be higher in the tumor samples, but the exon is already included in all of the transcripts of this gene. It is not possible to have an increase in inclusion when the exon is already present in 100 percent of transcripts. While the data may represent an interesting biological event, it is possible that the result is a false positive or may involve a more complex regulation (possibly involving alternative starts/stops) and as such may not be easily verified by the follow-up RT-PCR verification.

While the process of manual filtering can be time-consuming, a well-trained eye can view 100 or more potential hits in a reasonable amount of time. Our experience suggests that it is well worth the time and it can be coupled with the design of primer sequences for validation. Several commercially available GeneChip®-compatible™ software packages, including Partek’s Genomics Suite and Biotique’s X-Ray, provide easy click-through visualization of the raw signal data, exon by exon, for the probe sets that belong to the same gene, and can easily be used to conduct this step of visual inspection.

PHASE VII: RT-PCR VALIDATION
Validations are easily carried out via RT-PCR using primers in exons that flank the exon of interest. This works well for simple cassette exons and alternative 5’ and 3’ splice sites. As illustrated in Figure 11, when the PCR products are run on an agarose gel, there will be separation between the “include” product and the “skip” product based on size (the “include” product is larger by the size of the alternative exon and therefore runs slower through the gel). Primers are best designed in flanking constitutive exons and it is usually possible to calculate the expected sizes of the PCR products ahead of time.

Figure 11: Primer design for RT-PCR validation of alternative splicing events. [Diagram: typical RT-PCR validation with primers in flanking constitutive exons around the exon of interest; the pre-mRNA is spliced into mature RNA that either skips or includes the exon. Gel panel: marker lane (M, bands labeled 100 to 500), brain sample, non-brain sample.]

While carefully designed quantitative PCR (e.g., TaqMan) could provide more accurate estimates of the absolute levels of each isoform, depending on the individual requirements of the experiment, it may not be necessary in the verification of results from the exon array. In most cases the two different bands representing two alternatively spliced isoforms are produced by the same primer pair, and it is easy to observe changes in the relative intensity of the two bands between samples. Amplification efficiencies may differ between the two products, but this bias will be the same for all samples. The key observation is a change in ratio (relative intensity) of the two products. By starting with equal amounts of input cDNA, simple RT-PCR can be considered semi-quantitative.

Since each validation case is unique and different types of alternative splicing events require different strategies, automated selection of primer sequences is not recommended for best results. For example, detection of alternative starts and stops requires placement of a primer within the exon of interest since the target exon does not connect to flanking exons on both sides. In this case, two separate PCR reactions should be run. One determines presence/absence of the exon (as described above), and the other is a reference primer pair designed to measure the alternative pattern or expression level of the gene. A similar design strategy can be used for mutually exclusive exons of similar size that cannot be resolved by separation on a gel.

In addition, in order to discriminate inclusion versus skipping events on a gel, the overall size of the amplicon should be designed relative to the size of the target exon. Smaller exons should have smaller overall amplicon sizes so that the skip/include products will clearly resolve (separate) on the gel. Clearly, design of primers for validation is not a “one size fits all” situation.

There are several classes of validation targets that pose additional challenges to RT-PCR. Alternative starts/stops and mutually exclusive exons were mentioned above. Alternative transcriptional starts/stops that include multiple exons are also tricky and require validation strategies different from the standard flanking exon approach.

For cassette exons that are very large in size, it may be difficult to efficiently amplify


the “include” product using flanking exons due to the large size of the amplicon. This may also be the case for genes with multiple consecutive cassette exons.

Additionally, incorrect selection of constitutive exons can lead to misinterpretation of the RT-PCR results. Independent intervening transcripts may be impossible to validate using this method since the exon of interest is never included in transcripts for that gene. Some of these problematic cases are illustrated in Figure 12.

Nevertheless, thoughtful and careful primer design can usually find a solution for validation. This also supports the need for visualization of the data in genomic context. Many of these problematic situations can be avoided by looking at EST/mRNA sequence data for evidence of these odd types of transcript structures. In many cases, if you didn’t know the factors beforehand, RT-PCR may result in a single band and you may incorrectly identify the exon as a false positive.

Figure 12: Examples of challenging cases for RT-PCR validation where the primer pairs (marked by flat arrows) may not give anticipated validation results due to the complication of the biology. Therefore, careful selection of the primers and use of an alternative primer design strategy should be considered if negative results are obtained. [Panels: Alternative 3’ ends; Mutually exclusive exons; Internal alternative start; Multiple consecutive cassette exons; Independent transcript; Very large cassette exon.]

FILTERING METHODS
In this section, we will discuss some of the inherently challenging cases associated with this type of array-based alternative splicing analysis, and propose a number of primary and optional methods to increase the true discovery rate.

AN IDEAL SPLICING EVENT
Prior to describing the specifics of filtering, it is helpful for us to conceptualize the ideal scenario for gene and exon expression that will maximize our ability to identify differential splicing. Factors that deviate from this ideal situation tend to lower the probability of correctly identifying splicing events mathematically. Characteristics of the ideal alternative splicing event include:
1. High gene expression (and consistent across all samples)
2. Equal gene expression in both sample groups
3. Most exons (probe sets) parallel the overall expression of the gene (in other words, most of the exons belonging to that gene are constitutive so that the estimation of the gene-level signal is more reliable)
4. A single exon is alternatively spliced
5. The alternatively spliced exon is always included in one sample group and never included in the other sample group.
Figure 13 shows an example of a correctly identified alternative splicing event (later validated by RT-PCR).

Figure 13: Exon-level signals from a single gene with a validated differential splicing event. The average signal and standard error for each probe set is shown for tumor samples (blue) and normal samples (red). The log2 intensity scale is shown on the right-hand axis. In the validated splicing event (indicated by the arrow), the exon shows low expression relative to gene expression in normal tissues. Note that there are many characteristics of an ideal scenario: the gene is highly and equally expressed in both groups, and most of the exons behave uniformly. (The data here is visualized with the Partek Genomics Suite.)

ANOMALOUS PROBE SIGNALS
In some cases, technical anomalies may give a false signal for probe set intensity due to cross-hybridization, probe set saturation or inherently weak and non-linear response. Although this is not a frequent event, the exon array represents 1.4 million probe sets, giving a larger number of potential false positives. In most cases, this signal will be about the same across all samples in all groups. However, if the associated gene-level signals are different between the two groups, the false probe set

signal will be interpreted as differential inclusion of the exon because the probe set signals are normalized against the gene signals from each group.
1. Cross-hybridizing or constitutively high probe signals. In this scenario, an individual probe set signal may be stuck on “high” regardless of the actual expression of that exon. Therefore, this exon will appear to be relatively more included in the sample group that has lower expression for the cognate gene. A probable example of this phenomenon is shown in Figure 14.
2. Poorly hybridizing or non-linear probes. This is the converse of the above problem, but the probe sets will appear to be weakly expressed in all samples (see Figure 15).

Characteristics of these anomalous probe set signals are:
1. Approximately equal intensity in both sample groups
2. Low variance
3. Possibly extreme intensities relative to gene expression level.
Implementation of a systematic filtering method will be possible as more data becomes available. Fortunately, such anomalous signals are also generally easy to identify by visual inspection.

Figure 14: A probable artifact from probe sets with constitutively high signals (indicated by the arrow). Even though this probe set gives the same signal for both tissue types, it appears to have a relatively high inclusion in normal samples because the gene-level signal is lower in normal than in tumor. This probe set displays three trademark symptoms of this artifact: it is higher than the neighboring probe set signals, it is approximately the same in both tissues and it shows an extremely low variance across the replicates.

Figure 15: A probable artifact due to non-responsive or non-linear probes. The indicated probe set appears to be alternatively spliced since its signal in tumor is relatively low compared to gene expression in tumor. Consequently, the exon appears to be 100 percent included in normal but only about 15 percent in tumor (the scale is log2). This probe set displays two trademark symptoms of this artifact: it has the same intensity in both tissues, and it shows low variance across replicate samples.

PRIMARY (MANDATORY) FILTERING OF SPLICING INDEX RESULTS
There are two major criteria that must be met for genes and probe sets to be included in downstream analysis. Both relate to a common theme: a lack of expression in one group will very often be mistaken for alternative splicing when compared to actual expression (or even low expression) in another group, since the signals are so low that the arrays are simply measuring the noise of the experiments. Thus, the first two mandatory filtering steps of all splicing analyses should be:
1. Remove exons (probe sets) that are not expressed in at least one group.
(The reason for only requiring detection in one group is that if the exon is alternatively spliced it is completely reasonable that it is absent in all of the samples of one group.)
The following is a good default set of rules as starting points to remove exons that are expressed at low level:
1. If the DABG p-value of the probe set is less than 0.05, the exon is considered as “Present” in a sample.
2. If the exon is called as Present in more than 50 percent of the samples of a group, the exon is considered to be expressed in that group.
3. If the exon is expressed in either group, accept it; otherwise, filter


it out.
In a recent study of colon cancer splicing, this step removed almost two-thirds of the exons (540,000 remained out of 1.4 million probe sets). More stringent filtering options may be implemented to require that the exon is called as Present in more than 75 percent of the samples or has a DABG p-value of less than 0.01.
2. Remove genes (transcript clusters) that are not expressed in both groups.
(Alternative splicing is meaningless unless the gene is expressed in both sample groups.)
This is more complicated, because we don’t presently have a direct way of making a Present/Absent call at the gene level. Instead, one approach uses the exon-level DABG results to indirectly estimate Present/Absent for the gene. In this case, we use the Core exons as a surrogate for gene expression (but this is only valid for Core genes).
The following is a good default set of rules to remove genes expressed at low level:
1. If the DABG p-value of the probe set is less than 0.05, the exon is considered as Present in that sample.
2. If more than 50 percent of the Core exons are called as Present, the gene is considered Present.
3. If the gene is called as Present in more than 50 percent of the samples in a group, the gene is considered expressed in that group.
4. If the gene is expressed in both groups, accept it; otherwise, filter it out.
This filtering step is easily made more stringent by increasing the percentage of samples (e.g., more than 75 percent) in which the gene is expressed, and/or by lowering the cut-off DABG p-value for detection. Assuming that the gene is in fact alternatively spliced, you may not want to increase the “50 percent of Core probe sets” requirement since it is likely that some of the exons will be skipped.
In a recent study of colon cancer splicing, this step removed almost half of the Core genes (9,000 remained out of 17,800 Core transcript clusters). If Full gene sets are used for analysis, it is anticipated that a much larger proportion of the speculative content will be removed.

SECONDARY (SUGGESTED) FILTERING
1. Discard probe sets with high potential for cross-hybridization.
Probe sets that have the potential for cross-hybridization are more likely to result in false positives. This is because the signal from the probe set may originate from an entirely different gene that may have tissue-specific expression. On the exon array, each probe set has a cross-hybridization potential value in the annotation files. Probe sets with cross-hybridization values other than 1 might be candidates for removal from the analysis.
2. Require a minimum gene expression intensity of greater than 15.
Results from genes with low expression values close to the noise level may not be as reliable. Genes with higher expression values tend to have less noise. The actual optimal cut-off value is likely to be dependent on the experiment (and how the gene-level values are calculated). This value can be increased if higher stringency is desired.
In general, one can estimate the background gene signal and set an absolute threshold for calling the gene Present. In this case:
1. If the gene signal is greater than THRESHOLD, the gene is called as Present.
2. If the gene is called as Present in more than 50 percent of the samples in a group, the gene is expressed in that group.
3. If the gene is expressed in both groups, accept it; otherwise, filter it out.
3. Discard probe sets with very large gene-level normalized intensities (exon/gene > 5.0).
This filter relates to probes with potential cross-hybridization, but also includes probes with high background levels. The theory behind this filter is that exons having intensities much higher than the median intensity across the gene are likely to have a signal that originates from somewhere outside of the gene. While there should be some consideration for probes with varying affinities (probe response), intensities that are dramatically different from the rest of the exons belonging to the same gene are certainly questionable. Even given the potential variability in probe response, accepting a value much beyond 5.0 (i.e., the exon signal is five times higher than the gene intensity in an absolute comparison) is not recommended. Large values for the gene-level normalized intensities can also result in somewhat inflated magnitudes of change.
4. Discard probe sets with very low gene-level normalized intensities in the group that is predicted to have the higher rate of inclusion (exon/gene < 0.20).
The theory behind this filtering step is somewhat related to the argument above, that exon intensities that are much different from the median intensity are unusual. However, it is a bit more complicated on the low end. It is entirely feasible that an exon could be included in only, say, 10 percent of transcripts in group A, but in only 1 percent of transcripts in group B. This difference still represents a 10-fold higher inclusion rate in group A, and the exon could still be considered enriched in group A. However, if our interest is in reducing the false positive rate and finding suitable validation targets, this class is better off being filtered out. In most of these cases it is more likely that the exon is simply not expressed at all.

TERTIARY (OPTIONAL) FILTERING
1. Remove genes that have very large expression differences between the two groups.
Despite the fact that the Splicing Index algorithm corrects for gene expression level, genes with very large differentials in gene expression between the two groups have a tendency to produce false positives. Large disparities in transcription rate make it more likely that probe set intensities in the two groups are disproportionately affected by background noise or saturation. A 10-fold difference is an arbitrary

cut-off that can be decreased if more stringent filtering is desired:

|log2(gene_A / gene_B)| > 3.32

where gene_A and gene_B are the gene-level signals in the two groups (a 10-fold difference corresponds to log2 10, about 3.32). At the extreme, we might begin by looking only at genes with approximately equal expression.
2. Limit search to “Ensembl/RefSeq Supported” or “Core” exons.
Narrowing the set of analyzed exons to only well-annotated exons (“Core” exons) is a way of dramatically reducing the amount of speculative content. Because speculative exons have much lower levels of annotation/sequence support, they are more likely to result in false positives. The trade-off with this filter is that you will lose the ability to discover new splicing events involving novel exons. It is possible, however, to uncover new examples of alternative splicing that involve well-annotated exons (novel skipping of a known exon, for example).
3. Increase stringency on the gene expression level filter to restrict the search to only highly expressed genes.
Genes with higher expression levels have exon intensities that are further away from noise and thus are more reliable. A reasonable, but very conservative, threshold would be greater than 100. This filter can be used to find the lowest of the “low-hanging fruit,” but at the expense of potentially interesting alternative splicing events in low or moderately expressed genes.
4. Focus on exons that have gene-level normalized intensities near 1.0 in the group predicted to have a higher inclusion rate and gene-level normalized intensities close to 0 in the group predicted to have a lower inclusion rate.
This filter is a bit theoretical, but does have some basis in fact. For the sake of simplicity in this example, let’s assume that all probes have equivalent affinities (probe response). Let’s also assume that you are looking for a dramatic change in splicing pattern such that an exon is fully included in all transcripts in one group and fully skipped in all transcripts in the other group. In this perfect case, the exon would be expected to have a gene-level normalized intensity of 1.0 (exon intensity equivalent to gene expression level) in group A, and a gene-level normalized intensity of 0 (exon not expressed) in group B.
Real data is, of course, not this idealized. However, there is some empirical support for this theory. Figure 16 shows a histogram of gene-level normalized intensities for validated brain-enriched exons in both the brain samples (red) and non-brain samples (green). The peak for the brain samples (exon included) is between 0.8 and 1.0, and the peak for the non-brain samples (exon skipped) is between 0.2 and 0.4.

Figure 16: Histogram of gene-level normalized intensities for validated alternative splicing events. [Histogram: x-axis, median gene-level normalized intensity (Exon/Gene), 0.0 to 4.0; y-axis, frequency, 0 to 60; series: Brain Group, Non-brain Group.]

5. Filter probes with unusually low variance.
This filter seems a bit counterintuitive at first, but the idea is that probes with very low variance (relative to others within the gene) across the samples are very likely to be either absent in all samples (not expressed) or saturated in all samples. If the signal is above background it can escape the DABG filter even though the exon is “Absent.” By definition, an alternatively spliced exon will have probe set intensities that change between samples (Present in one sample, Absent in another). This will also tend to eliminate probes that are cross-hybridizing or non-responsive.
6. Limit search to known alternative splicing events.
One way of quickly screening for candidates that are most likely to result in positive validations is to limit the analysis set to exons that have prior evidence of being involved in an alternative splicing event. This filter is more difficult to implement since it relies on bioinformatic predictions of alternative splicing based on EST/mRNA sequences or annotations. The UCSC Genome Browser is one good source of transcript and alternative splicing predictions. In addition, there are a number of publicly accessible databases of alternative splicing events available on the Internet. It may be possible for this prediction to be run once to create a list of probe sets that could be used to filter all subsequent analyses. It should be pointed out that most exon-exon junction arrays are filtered in this way by default since they are typically designed to observe


junctions. Thus, it might be expected that they would have higher validation rates on the whole compared to an exon array design. Using this filter does, however, surrender the ability to discover new alternative splicing events.

OVERALL STRATEGIES
It is important to remember that after the CEL files are produced, the most laborious step in the analysis is validation in the laboratory, e.g., by RT-PCR. Therefore, it may be wise to begin with an extremely conservative analysis that minimizes false positives (i.e., using all or most of the recommended filtering steps), then relax the search for candidate splicing events until the false positive rate becomes unacceptable.
This strategy may be modified depending on time constraints since it postulates iterative cycles of analysis/validation. The final success rate will likely be heavily dependent on sample quality, sample size and the inherent biological differences between the sample groups.
As an example, a study comprising a panel of relatively pure normal tissue samples produced a validation rate of 85 percent, while a much noisier comparison of colon tumor versus normal tissue demonstrated a validation rate of 35 percent. Researchers might also have particular interests that guide their strategies, e.g., searching for targets of a particular splicing factor or alternative splicing events in a particular pathway.

SPLICING EVENTS THAT MAY BE MISSED BY THE ANALYSIS
Here is a partial list of situations where a real splicing event may be missed (false negative) by current splicing algorithms along with brief discussions of each:
- Alternative splicing outside of transcript clusters
MiDAS uses transcript cluster information to generate the gene-level call, thus any probe set outside of a transcript cluster is excluded from the analysis. Depending on how you define “transcript cluster,” the excluded probe sets represent a significant portion of the total number of probe sets on the exon array.
- No probe set for the alternatively spliced exon
There are several reasons why an exon may not be represented by a probe set on the exon array. Some of the reasons for lack of a probe set include very small exons (less than 25 bp), over-fragmented exons resulting in multiple PSRs that are all below the minimum size, and lack of evidence (or prediction) for existence of the exon. It is also possible that the sequence of the exon made it impossible to build probes, the designed probes give very weak signal, or the sequence of the exon is repeated elsewhere in the genome such that data from the probes for that exon were discarded or filtered out.
- Alternatively spliced product is rapidly degraded
It is possible that mRNA including an alternatively spliced exon is turned over at a high rate so that the signal is not detectable by the array. In many cases, inclusion of an exon (or mis-splicing) alters the protein coding reading frame or incorporates a premature termination codon (PTC). Cells have a mechanism called Nonsense-Mediated Decay (NMD) for detecting and destroying these messages. It has also been shown that in some cases this mechanism is exploited as a specific means of regulating gene expression: purposeful inclusion of a PTC so that the mRNAs are degraded by NMD to silence expression of the gene.
- Only a fraction of annotated exons from a gene are expressed
Many genes have alternative transcriptional starts or alternative 3’ ends. In some cases, these alternative starts and stops may result in only a fraction of the well-annotated exons in a transcript cluster being expressed. Our filtering approaches require that more than 50 percent of the well-annotated exons within a transcript cluster be detected above background for the gene to be considered as expressed. Thus, if expression involves fewer than half of the exons, the algorithm will incorrectly call the gene as absent.

SUMMARY
Exon arrays are powerful tools that enable researchers to monitor genome-wide gene expression and alternative splicing beyond classical microarrays. By following the basic guidelines in this Technical Note, novel splicing events may be uncovered that are critical to biology and disease studies, adding a new dimension to genome research.

FAQs
1. Why should DABG not be used for gene-level Present/Absent calls?
There is a strong assumption in DABG that all the probes are measuring the same thing (i.e., the same transcript). This is not the case at the gene level due to alternative splicing. For example, probes for a cassette exon that is skipped will contribute to a misleadingly insignificant p-value.
2. Can we determine frameshifts or the introduction of nonsense codons by alternative splicing events?
No. The resolution of the Human Exon 1.0 ST Array is not nearly sufficient to determine single nucleotide changes. This would require junction-type arrays, which focus on events that are known a priori. The Human Exon 1.0 ST Array is designed more for genome-wide discovery of large-scale alterations in transcript structure.
3. What validation rate should be expected from the analysis?
A validation rate of 80 percent would be excellent, but rates down to 30 percent might be acceptable in discovery-based research. Generally, the predicted splicing events that consistently survive different types of filtering are more likely to be true positives. The acceptable validation rate will be a balance between the researcher’s available time and

nn n n 14 GENE EXPRESSION MONITORING resources to the validation versus the desire to extend the limits of the search. As a first discovery tool, exon arrays will generate information not previously possible with other tra- ditional technologies. In some cases, the discovery of even 10 new splicing events in an exon study might be highly significant. As more data sets become available on exon arrays, it is anticipated that further algorithm development will become more sophisticated and the validation rate will improve. 4. Can MiDAS handle multiple sample groups? The ANOVA in MiDAS can compare multiple groups, e.g., brain, kidney and heart. It does not handle additional factors like gender, tumor stage, etc. More sophisticated ANOVA methods have been implemented in third-party packages.
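The multi-group comparison described in FAQ 4 can be illustrated with a minimal sketch. It assumes a splicing index defined as the log2 ratio of exon-level signal to gene-level signal and runs a plain one-way ANOVA on that index across tissues. The function names and the signal values are hypothetical, and this is not the MiDAS implementation itself; it only shows the kind of test involved. The sketch stops at the F statistic; converting it to a p-value requires the F distribution, which a statistics package would supply.

```python
import math

def splicing_index(exon_signal, gene_signal):
    """Assumed definition: log2 of exon-level signal over gene-level signal."""
    return math.log2(exon_signal / gene_signal)

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a dict of group -> list of values."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    k = len(groups)        # number of sample groups
    n = len(all_values)    # total number of samples

    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in values)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical (exon signal, gene signal) pairs for one probe set in three tissues.
samples = {
    "brain":  [(820, 1000), (790, 980), (860, 1050)],
    "kidney": [(210, 1020), (190, 950), (230, 1100)],
    "heart":  [(800, 990), (845, 1040), (770, 960)],
}

indices = {
    tissue: [splicing_index(exon, gene) for exon, gene in pairs]
    for tissue, pairs in samples.items()
}
f_stat = one_way_anova_f(indices)
print(f"F statistic across {len(indices)} tissues: {f_stat:.1f}")
```

A large F statistic here reflects the kidney samples, where the exon signal is low relative to the gene signal. Additional factors such as gender or tumor stage would need a multi-factor model, which this one-way sketch does not cover.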
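The point made in FAQ 1 about DABG at the gene level can be sketched numerically. The example assumes that probe-level detection p-values are combined with Fisher's method; that combination rule is an assumption made for illustration, and the p-values are hypothetical. It shows how probes from a skipped cassette exon, which carry uninformative p-values, weaken the combined call for a gene that is clearly expressed.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom. For even degrees of freedom the survival function
    has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# Hypothetical probe-level p-values for one gene.
constitutive_probes = [0.001] * 8   # probes in exons present in the transcript
skipped_exon_probes = [0.9] * 4     # probes in a cassette exon that is skipped

p_constitutive_only = fisher_combined_p(constitutive_probes)
p_with_skipped_exon = fisher_combined_p(constitutive_probes + skipped_exon_probes)

print(f"constitutive probes only: {p_constitutive_only:.3e}")
print(f"with skipped-exon probes: {p_with_skipped_exon:.3e}")
```

Under these assumptions the combined p-value becomes less significant once the skipped-exon probes are included, even though the gene itself is expressed.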

REFERENCES

Technical Note, GeneChip® Exon Array Design
White Paper: Guide to Probe Logarithmic Intensity Error (PLIER) Estimation
White Paper: Exon Probe Set Annotations and Transcript Cluster Groupings v1.0
White Paper: Gene-Signal Estimates from Exon Arrays
White Paper: Alternative Transcript Analysis Methods for Exon Arrays
Technical Note, Statistical Algorithms Reference Guide
Gardina, P. J., et al. Alternative splicing and differential gene expression in colon cancer detected by a whole genome exon array. BMC Genomics 7:325 (2006).
Srinivasan, K., et al. Detection and measurement of alternative splicing using splicing-sensitive microarrays. Methods 37(4):345-59 (2005).

A complete listing of GeneChip®-compatible™ software products for exon applications can be found at http://www.affymetrix.com/products/software/compatible/exon_expression.affx.

AFFYMETRIX, INC.
3420 Central Expressway, Santa Clara, CA 95051 USA
Tel: 1-888-DNA-CHIP (1-888-362-2447)
Fax: 1-408-731-5441

AFFYMETRIX, UK Ltd.
Voyager, Mercury Park, Wycombe Lane, Wooburn Green, High Wycombe HP10 0HH, United Kingdom
UK and Others Tel: +44 (0) 1628 552550
France Tel: 0800919505
Germany Tel: 01803001334
Fax: +44 (0) 1628 552585

AFFYMETRIX JAPAN K.K.
Mita NN Bldg., 16 F, 4-1-23 Shiba, Minato-ku, Tokyo 108-0014 Japan
Tel: +81-(0)3-5730-8200
Fax: +81-(0)3-5730-8201

www.affymetrix.com
Please visit our web site for international distributor contact information.

For research use only. Not for use in diagnostic procedures.

Part No. 702422 Rev. 1
©2006 Affymetrix, Inc. All rights reserved. Affymetrix®, GeneChip®, HuSNP®, GenFlex®, Flying Objective™, CustomExpress®, CustomSeq®, NetAffx™, Tools To Take You As Far As Your Vision®, The Way Ahead™, Powered by Affymetrix™, GeneChip-compatible™, and Command Console™ are trademarks of Affymetrix, Inc. Products may be covered by one or more of the following patents and/or sold under license from Oxford Gene Technology: U.S. Patent Nos. 5,445,934; 5,700,637; 5,744,305; 5,945,334; 6,054,270; 6,140,044; 6,261,776; 6,291,183; 6,346,413; 6,399,365; 6,420,169; 6,551,817; 6,610,482; 6,733,977; and EP 619 321; 373 203 and other U.S. or foreign patents.