Bioinformatic analysis of the interface between mitochondrial biogenesis and apoptotic cell death signaling pathways in Parkinson’s disease.

Robert Bentham Supervised by Dr G. Szabadkai and Dr K. Bryson

March 3, 2012

Contents

1 Introduction

2 Microarray Analysis
2.1 Data acquisition
2.2 Quality Control
2.2.1 Normalisation
2.3 LIMMA
2.3.1 Results
2.4 Gene Set Analysis
2.4.1 GSEA
2.4.2 GAGE
2.4.3 Results

3 Conclusion

References

A Tables

B R code

1 Introduction

Mitochondria are subcellular organelles present in most eukaryotic cells. They have a complex evolutionary history: according to endosymbiotic theory they evolved from free-living bacteria that became incorporated within a host cell. They have their own DNA (known as mtDNA), which is inherited from the mother only. Their primary function is to provide ATP, which the rest of the cell uses as a source of energy, and this makes them essential for healthy cell function. Cell survival depends on the maintenance of a healthy cellular mitochondrial pool, which in turn depends on two processes: the degradation of damaged mitochondria by autophagy, and mitochondrial renewal, known as mitochondrial biogenesis. This project chiefly concerns the latter. Biogenesis is simply the process by which new mitochondria are formed; the precise biological machinery controlling it, however, is highly complex. Despite this complexity the PGC-1 family of transcriptional coactivators has been identified as the master regulator of mitochondrial biogenesis [14].


Cancer, cardiovascular disease and neurodegenerative diseases such as Parkinson’s have all been associated with dysfunction of the mitochondria [14] [5]. A recent review of the overlapping pathways involved in Parkinson’s and cancer [3] stresses the role of mitochondria in both. It has previously been shown that PGC-1α is down regulated in Parkinson’s disease [19]. This could contribute to the pathogenesis of Parkinson’s disease through mitochondrial dysfunction, which would make PGC-1α a potential therapeutic target. Additionally, a previous bioinformatic analysis of the role of PGC-1 in cancer [1] found that PGC-1 also down regulates stress pathways involved in DNA damage. Interestingly, DNA damage has also been suggested to be associated with Parkinson’s disease [12]. The aim of this work is to test the hypothesis that in clinical samples of Parkinson’s disease, besides down regulation of mitochondrial pathways, there are alterations in pathways involving DNA damage. To do this, previously published microarray data will be studied, and significantly differentially expressed genes and gene pathways identified.

2 Microarray Analysis

A microarray is a device for measuring the expression levels of large numbers of genes. It does this by using the process of DNA hybridisation, which is illustrated in Figure 1. The expression level of each gene is detected by hybridising with a number of oligonucleotide fragments on the chip that act as probes. For a single gene there are 11 perfect match (PM) probes and 11 mismatch (MM) probes, in which the sequence differs by a single base. The MM probes are important for quality control: they measure the specificity of the hybridisation by giving an indication of any cross-hybridisation that has occurred. The chip is thus covered with a large number of DNA probes of both PM and MM type. The target RNA from the experimental sample is processed and fluorescently labelled, so that when hybridisation occurs with the probes a measure of gene expression is obtained from the intensity of the fluorescence at each spot on the microarray. For Affymetrix chips, microarrays from different experimental conditions, one of them a control, can then be compared and the differences arising from the experimental condition inferred.

Figure 1: Image from Affymetrix illustrating the construction and workings of an Affymetrix microarray.

There are numerous issues in the use of microarrays, as with any other high-throughput technique. Firstly, there is a huge amount of data that must be analysed in a statistically robust manner. To maintain this robustness, quality control is an essential part of any analysis; these issues and others are discussed in [20]. Unfortunately, with microarrays different statistical techniques can lead to quite different results, so one must proceed with care. There are also many things beyond our control, such as technical variability in the actual experiment. This comes from differences in temperature and pH, which affect hybridisation on the microarray. Additionally, each probe cannot be optimised for hybridisation equally. Together with the stochastic nature of biological systems, this leads to very noisy results with large systematic bias. Any statistical analysis must deal with these levels of noise and judge when to reject a microarray from the analysis because its systematic bias can no longer be tolerated.

2.1 Data acquisition

For the aims of this report four datasets involving microarrays from patients with Parkinson’s disease were identified for analysis. The first, which will be referred to as the Zheng dataset (available from GEO Series accession number GSE24378 [9]), is part of the meta study that identified PGC-1α as a potential target for Parkinson’s disease [19]. This study is made up of 17 samples, with 8 replicates for Parkinson’s disease and 9 replicates for the controls; the RNA used on the microarray is from 500 dopamine (DA) neurons from the substantia nigra pars compacta (SNc). Three further datasets were selected. The Middleton dataset (available through GEO Series accession number GSE20292 [8]), which was also used in the meta study [19] [28], has 18 control replicates and 11 replicates with Parkinson’s disease. The Mullen dataset (available through GEO Series accession number GSE7621 [6]) has 16 replicates for Parkinson’s disease and 9 replicates for the controls [17]. The final dataset will be referred to as Moran (available through GEO Series accession number GSE8397 [7] [21]). The Moran dataset has microarrays from both the Affymetrix U133A and U133B chips; of these, only the U133A microarrays taken from the substantia nigra were used, with no distinction made between its lateral and medial parts. After this selection the Moran dataset contained 24 replicates for Parkinson’s disease and 15 replicates for the controls.

All of the datasets chosen were from microarrays using Affymetrix chips. Middleton and Moran used U133A chips while Mullen used the more recent U133 Plus 2.0; the two differ in the number of genes they detect, the Plus 2.0 having probes for an additional 6500 genes. In contrast, the Zheng study uses the U133 X3P chip, whose probes are designed to examine sequences closer to the 3’ end of transcripts. This is useful in cases of bad RNA degradation, which proceeds from the 5’ end of transcripts.

2.2 Quality Control

The purpose of quality control is to identify arrays that cannot be corrected and therefore cannot be used in the analysis. Problems may include mistakes in the experimental procedure or a very low signal to noise ratio. For a comprehensive look at array quality a variety of measures should be examined. This can be quite time consuming, but the process can be automated to some extent with the R package arrayQualityMetrics [15]. A few of the main methods of quality control used will be discussed here, though there are many other techniques, many of which are generated automatically by the arrayQualityMetrics package.

The first thing to check for is array defects, by looking at a spatial plot of intensities. Areas of high intensity could indicate uneven hybridisation, while patterns in the spatial plot could indicate a loose particle in the chip scratching the surface while hybridisation occurs in a centrifuge. Figure 2a shows the spatial plots for all chips in the Middleton dataset; in this case and in all other datasets examined there were no problems with either array defects or hybridisation effects.

The next quantity to check for is RNA degradation or poor labelling. It is well known that RNA degradation starts from the 5’ end of a molecule and finishes at the 3’ end, a feature that the U133 X3P chip makes use of. For this reason, if RNA degradation has occurred the mean intensity of the probes at the 3’ end should be much higher; this can easily be checked and plotted in R. Figure 2b shows an increase in the intensities of probes at the 3’ end in the Zheng dataset. All the other datasets showed similar results. This result could also be due to inefficient labelling, as the labelling reaction used in preparing the RNA sample proceeds from the 3’ end; however, since all samples were taken post mortem it is very likely that the cause is RNA degradation. As all samples have comparable degradation in each dataset this should not affect the analysis [2].

Additional measures of quality control include checking the density histogram of the PM and MM log2 intensities. MM probes measure the non-specific hybridisation, or cross-hybridisation, that occurs. It is expected that the RNA should bind more strongly to the PM probes than to the MM probes; if this is not the case then there will be a low signal to noise ratio in the results. This graph for the Middleton dataset is shown in Figure 2c. For all studies it was found that the RNA bound more strongly to the PM probes, though the graphs suggest that the levels of noise are possibly quite high.

[Figure 2b plot: RNA degradation plot. x axis: Probe Number (5' to 3'); y axis: Mean Intensity, shifted and scaled; one line per sample, C1 to C9 and PD1 to PD8.]

(a) Spatial plot showing probe intensities of microarrays for the Middleton dataset, all microarrays here are normal. This plot was generated with the arrayQualityMetrics package.

(b) RNA degradation plot showing severe degradation for samples in the Zheng dataset despite the special U133 X3P chip here designed for cases with bad RNA degradation.

(c) PM and MM log2 intensity graph for data in the Middleton study generated with the arrayQualityMetrics package.

Figure 2: Quality Control measures used in analysis
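The RNA degradation check described in this section can be sketched in a few lines. This is a minimal illustration in Python rather than the R code used for the report. It assumes the PM intensities of each probeset are already ordered from the 5’ end to the 3’ end, and the names `degradation_profile`, `slope` and `example_array` are hypothetical. It also omits the shifting and scaling applied in the plotted figure.

```python
import math


def degradation_profile(probesets):
    """Mean log2 intensity at each probe position (index 0 is the 5' end)."""
    n_positions = len(probesets[0])
    return [
        sum(math.log2(ps[i]) for ps in probesets) / len(probesets)
        for i in range(n_positions)
    ]


def slope(values):
    """Least-squares slope of the values against their position index."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


# Hypothetical PM intensities for two probesets on one array, ordered 5' to 3'.
example_array = [[100.0, 200.0, 400.0, 800.0], [50.0, 100.0, 200.0, 400.0]]
profile = degradation_profile(example_array)
# A clearly positive slope means brighter probes towards the 3' end.
print(slope(profile))
```

Comparing the slopes of all arrays in a dataset is one simple way to judge whether degradation is comparable between samples, which is the property the analysis relies on.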


2.2.1 Normalisation

An abundance of variation exists between chips in microarray analysis, even between replicates from the same sample. Variations are caused by a combination of technical and biological factors: technical, such as the temperature and pH levels during hybridisation, and biological, such as differences between two samples coming from patients with the same condition. Therefore, for a fair comparison, all chips being analysed must be normalised with respect to each other. Checking for successful normalisation is the final step in quality control. The method of normalisation used in this report was Robust Multichip Average (RMA), which is fully described along with possible alternative methods in [11]. Successful normalisation can be checked by examining boxplots and MvA plots both pre and post normalisation. Figure 3 shows the effect of normalisation on boxplots of the intensity values on each chip. In an MvA plot, M is the log2 fold change between the intensity values of each probeset on two different arrays, while A is the average log2 intensity of each probeset on those arrays. Every probeset is plotted with M on the y axis and A on the x axis. Figure 4 shows MvA plots post normalisation; ideally an MvA plot is symmetrical about the x axis and resembles a comet shape [24]. Once all the data has been normalised the next task is to find the significantly differentially expressed genes.
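The M and A values plotted in an MvA plot follow directly from the definitions above. The sketch below is a minimal Python illustration, assuming two arrays given as lists of positive probeset intensities in the same probeset order; the function name `ma_values` is hypothetical.

```python
import math


def ma_values(array_1, array_2):
    """Return (M, A) lists for two arrays of positive probeset intensities.

    M is the log2 fold change between the arrays for each probeset,
    A is the average log2 intensity of that probeset on the two arrays.
    """
    m = [math.log2(x) - math.log2(y) for x, y in zip(array_1, array_2)]
    a = [(math.log2(x) + math.log2(y)) / 2 for x, y in zip(array_1, array_2)]
    return m, a


# Hypothetical intensities for two probesets on two replicate arrays.
m, a = ma_values([8.0, 4.0], [2.0, 4.0])
print(m, a)
```

For well normalised replicates the M values should scatter evenly around zero across the whole range of A.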

(a) Boxplot showing pre normalised data for the intensity of each chip in the Middleton study

(b) Boxplot showing post normalised data for the intensity of each chip in the Middleton study

Figure 3: The effect of normalisation on the boxplot showing intensity for each chip, graphs generated with the arrayQualityMetrics package. Normalisation is needed so all chips can be compared fairly.

2.3 LIMMA

LIMMA, or linear models for microarray data [25], is a package in R designed for finding significant genes by estimating the log fold changes in expression level between different experimental conditions. The method LIMMA uses is fully explained in [24]. The first step is to calculate the log fold change, for which LIMMA assumes a linear model:

\[ E[y_j] = D\alpha_j \tag{1} \]

Here y_j represents the expression data for gene j, and E[y_j] is a vector of the expression levels for gene j in each sample. D is the design matrix, which will be explained shortly, and α_j is the vector of coefficients, containing the differences between the experimental conditions. The design matrix and vector of coefficients can be chosen so that the comparison of interest, here the log fold change between the control and Parkinson’s cases, is built into the fitted model. To see this, suppose that there are 4 samples, 2 replicates for the control case and 2 for Parkinson’s disease. Then the design matrix and vector of coefficients can be written as:

\[ D = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 1 & 1 \end{pmatrix}, \qquad \alpha_j = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} \tag{2} \]

Figure 4: MvA plots between a sample of control replicates in the Moran study shown post normalisation. The ideal shape of an MvA plot for replicates post normalisation is symmetrical in the x axis and resembles a comet shape. Here the LOESS line is shown in red and if normalisation is done well should lie on the x axis. The replicates shown here have all been normalised fairly well.

Here θ_2 can be written as x_pd − x_c, where x_c and x_pd are the mean log expression levels of a particular gene in the control and Parkinson’s samples respectively. Written like this, θ_2 gives the difference in log expression level between the Parkinson’s and control cases, that is the log fold change, while θ_1 = x_c is the baseline log expression level in the controls. LIMMA estimates both coefficients but only the value of θ_2 is of interest. LIMMA then uses an empirical Bayes method to moderate the standard errors of these coefficients. Empirical Bayes borrows information across genes and keeps the analysis stable, which is especially needed for experiments with small numbers of arrays [24].

After this LIMMA automatically calculates FDR adjusted p values. This is needed because multiple hypotheses are being tested for significance. For example, if 1000 genes were tested at a significance level of 0.01, then 1000 × 0.01 = 10 genes would be expected to be deemed significant even if in reality no gene is differentially expressed. FDR, or false discovery rate, adjustment changes the p values so that this false discovery rate is controlled. The adjusted p value is essentially the expected proportion of false discoveries among the genes classified as differentially expressed at that threshold. If a particular gene has an FDR adjusted p value of 0.07, then among all genes with adjusted p values of 0.07 or lower an estimated 7% are false positives.
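To make the role of the design matrix in Equation 2 concrete, the sketch below fits the two coefficients by ordinary least squares for a single gene. It is a minimal Python illustration rather than LIMMA itself: it leaves out the empirical Bayes step, the expression values are invented, and the function name `fit_two_coefficients` is hypothetical. With this design the fitted θ_1 comes out as the mean of the control values and θ_2 as the difference between the two group means.

```python
def fit_two_coefficients(design, y):
    """Solve the normal equations (D^T D) theta = D^T y for a two-column design."""
    a11 = sum(row[0] * row[0] for row in design)
    a12 = sum(row[0] * row[1] for row in design)
    a22 = sum(row[1] * row[1] for row in design)
    b1 = sum(row[0] * value for row, value in zip(design, y))
    b2 = sum(row[1] * value for row, value in zip(design, y))
    det = a11 * a22 - a12 * a12
    theta_1 = (a22 * b1 - a12 * b2) / det
    theta_2 = (a11 * b2 - a12 * b1) / det
    return theta_1, theta_2


# Design matrix from Equation 2: two control rows, then two Parkinson's rows.
design = [[1, 0], [1, 0], [1, 1], [1, 1]]
# Hypothetical log2 expression values for one gene in the same sample order.
y = [7.0, 7.4, 8.1, 8.5]

theta_1, theta_2 = fit_two_coefficients(design, y)
print(theta_1, theta_2)  # control mean and log fold change
```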

2.3.1 Results

Running LIMMA on the Zheng dataset identifies no significantly differentially expressed genes after multiple hypothesis adjustment. This is surprising given the difference that is expected between patients with and without Parkinson’s. One possible reason lies in the experimental design of the Zheng dataset: samples were taken from only 500 DA neurons, which is a very small sample size, so it is not surprising that little is found in the analysis [2]. Another telling sign is that this dataset is only part of a meta study [19] and has no papers published using its results alone, suggesting that by itself it yields no significant findings. For these reasons the Zheng dataset was discarded from further analysis.

The other three datasets did contain significantly differentially expressed genes: 91 in the Middleton study, 180 in Mullen and 3360 in Moran, all with multiple hypothesis adjusted p values of less than 0.05. Clearly the Moran study found a much greater number of significant genes. This could be due to Moran having the largest number of replicates of all the studies and therefore being able to find more significant genes; in contrast Middleton, which had the fewest Parkinson’s replicates of the three remaining studies, has the smallest number of significant genes.
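The gene counts above depend on the FDR adjustment of the p values. The report does not say which adjustment was used. The sketch below assumes the Benjamini-Hochberg procedure, which is the default adjustment in LIMMA's `topTable`, and illustrates it in Python with invented p values and a hypothetical function name `bh_adjust`.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    n = len(p_values)
    # Visit the p values from largest to smallest.
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, index in enumerate(order):
        rank = n - position  # rank of this p value in ascending order
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = bh_adjust(raw)
significant = [i for i, p in enumerate(adjusted) if p < 0.05]
print(adjusted, significant)
```

Genes are then called significant when their adjusted p value falls below the chosen threshold, 0.05 in this report.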

2.4 Gene Set Analysis

Using LIMMA, a list of significant genes has been found for each study. This tells us there are differences between the controls and those with Parkinson’s disease; however, extracting biological meaning from these lists is difficult. In biology a gene does not act in isolation but in concert with many others. An improved approach is to examine differences in sets of genes, called gene sets or gene pathways, that share a common function or purpose. These gene pathways or gene sets are largely taken from major databases such as Gene Ontology (GO) or KEGG. Finding significant gene pathways involves a set of statistical techniques known as Gene Set Analysis (GSA); two of these techniques will be described here.


2.4.1 GSEA

GSEA, or gene set enrichment analysis, is one of the standard methods for GSA. It was originally developed by Subramanian et al. [26] in 2005. Since then it has found wide use in the bioinformatics community and is certainly one of the most popular methods of GSA. The original method takes a ranked gene list, such as the ones generated by a LIMMA analysis, and calculates what is referred to as an ‘Enrichment Score’ for each gene set. This enrichment score is fully described in [26] and is based on whether the genes in a gene set fall towards the top or bottom of the ranked list. Significance is then inferred by using sample permutations to derive a distribution from which the p-values can be calculated. Many varieties of GSEA can be found in the literature, as the calculation of the enrichment score has been seen as over complicated. Other approaches include using a two sample t test, as in [13] and [27]. In particular, [13] introduces the Jiang and Gentleman statistic, or J-G statistic:

\[ \tau_k = \frac{\sum_{g \in S_k} t_g}{\sqrt{|S_k|}} \tag{3} \]

Here t_g is the t statistic for a single gene g and |S_k| is the size of the gene pathway of interest, S_k. The J-G statistic is normalised by the square root of the size of the pathway, so that as |S_k| approaches infinity the distribution of the J-G statistic approaches the standard normal. As in the original GSEA paper, the significance of gene pathways is then inferred using sample permutations. This method was slightly adapted in [22], where instead of the J-G statistic, based on an aggregation of t statistics, a statistic based on the aggregation of gene level regression residuals is used. This method assumes that there is a linear relationship between the mean response variable, i.e. gene expression, and explanatory covariates such as the presence of Parkinson’s disease. If such a linear regression model holds, the regression residuals r_gi can be calculated and aggregated in a similar way to the J-G statistic:

\[ R_{ki} = \frac{\sum_{g \in S_k} r_{gi}}{\sqrt{|S_k|}} \tag{4} \]

Significance is again inferred using sample permutations. This last procedure, calculating the significant pathways with regression residuals, is implemented in the R package GSEAlm [23].
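The J-G statistic of Equation 3 and its permutation-based significance can be sketched as follows. This is a minimal Python illustration, not the GSEAlm implementation. It assumes a Welch two sample t statistic for t_g and a two-sided comparison of permuted against observed scores, and all function names and the expression values are hypothetical.

```python
import math
import random
import statistics


def t_statistic(group_a, group_b):
    """Welch two sample t statistic; returns 0.0 if both groups are constant."""
    se = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    if se == 0:
        return 0.0
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se


def jg_statistic(t_values, gene_set):
    """Equation 3: sum of gene-level t statistics over sqrt(|S_k|)."""
    return sum(t_values[g] for g in gene_set) / math.sqrt(len(gene_set))


def gene_set_score(expression, is_case, gene_set):
    """J-G statistic of a gene set for a given assignment of sample labels."""
    t_values = {}
    for gene in gene_set:
        values = expression[gene]
        cases = [v for v, c in zip(values, is_case) if c]
        controls = [v for v, c in zip(values, is_case) if not c]
        t_values[gene] = t_statistic(cases, controls)
    return jg_statistic(t_values, gene_set)


def permutation_p_value(expression, is_case, gene_set, n_perm=1000, seed=0):
    """Observed score and the share of label permutations at least as extreme."""
    rng = random.Random(seed)
    observed = gene_set_score(expression, is_case, gene_set)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # permute sample labels, keeping group sizes
        if abs(gene_set_score(expression, labels, gene_set)) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical log2 expression: three controls followed by three cases.
expression = {
    "geneA": [5.1, 5.3, 5.2, 6.4, 6.6, 6.5],
    "geneB": [7.0, 7.2, 6.9, 7.9, 8.1, 8.3],
}
is_case = [False, False, False, True, True, True]
score, p_value = permutation_p_value(expression, is_case, ["geneA", "geneB"])
print(score, p_value)
```

With only six samples there are few distinct label permutations, so the smallest attainable p value is limited; this is one reason permutation-based methods can lack sensitivity on small studies.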

2.4.2 GAGE

GAGE, or Generally Applicable Gene-set Enrichment for pathway analysis, is a method for gene set analysis developed by Luo et al. [18], in which the authors claim to improve on previous GSA methods such as GSEA and PAGE [16], another popular method for GSA. GAGE, like PAGE, determines the significance of gene sets with a parametric analysis, as opposed to the permutation of sample labels used to calculate significance in GSEA. Some claim that GSEA has low sensitivity, while the authors of GAGE claim that PAGE is overly sensitive. The procedure of GAGE is outlined in [18] and will be given in brief here. As with all GSA methods the aim is to rank gene set pathways and assign significance to them. GAGE does this by taking into account the mean fold changes of the target gene set by means of a two sample t test. PAGE in comparison uses a z-test. The two sample t test and its degrees of freedom are defined as follows:

\[ t = \frac{m - M}{\sqrt{\frac{s^2}{n} + \frac{S^2}{n}}} \tag{5} \]

\[ df = (n - 1)\,\frac{(s^2 + S^2)^2}{s^4 + S^4} \tag{6} \]

where m, s and n are the mean fold change, standard deviation and number of genes in the gene set respectively. M and S represent the average mean fold change and standard deviation of all the genes. The t test essentially compares the gene set of interest with a gene set of identical size whose mean fold change and standard deviation are derived from the background. P values can then be obtained from the t test. GAGE then combines the p values from the different replicates into a global p value. GAGE has two modes: it compares pairs of experiment and control samples one-on-one if the samples are paired, or compares the experimental samples with the average gene expression levels if they are unpaired. Since all studies used in this report were unpaired, only the latter case will be discussed. With k = 1, ..., K experimental samples and l = 1, ..., L control samples and L ≠ K, the p values need to be combined in a way where each p value is independent. If the null hypothesis is true, the p values from the two sample t test will follow a Uniform(0,1) distribution. Additionally, it is known that the negative log sum of K independent p-values follows a Gamma(K,1) distribution. Thus a global p-value is simple to calculate using the gamma distribution:

\[ P(X > x), \qquad X \sim \mathrm{Gamma}(K, 1) \tag{7} \]

The only issue therefore is constructing K independent p-values for unpaired data such as are used in this report; however, this turns out to be fairly simple. For the first experimental sample, P_1 is calculated as the average of the one-on-one comparisons of that experimental sample with all of the control samples, and likewise for the remaining samples. In this way K independent p-values are constructed, whose negative log sum follows the gamma distribution:

\[ x = -\frac{1}{L} \sum_{k,l} \log P_{kl} \tag{8} \]

Running GAGE in its R package is extremely simple; it automatically ranks the gene sets and corrects the p-values for multiple testing. The results gained from the GAGE and GSEA analyses are discussed below.
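The calculations in Equations 5 to 8 can be sketched as below. This is a minimal Python illustration, not the GAGE package. It assumes Equation 8 is read as the sum over experimental samples of the negative log p values averaged over the L controls, and it evaluates the Gamma(K, 1) tail with the closed form that holds for integer K. The function names and the numbers in the example are hypothetical.

```python
import math


def gage_t_test(m, s, n, M, S):
    """Equations 5 and 6: t statistic and degrees of freedom for one gene set.

    m, s, n: mean fold change, standard deviation and size of the gene set.
    M, S: mean fold change and standard deviation of all genes (background).
    """
    t = (m - M) / math.sqrt(s ** 2 / n + S ** 2 / n)
    df = (n - 1) * (s ** 2 + S ** 2) ** 2 / (s ** 4 + S ** 4)
    return t, df


def neg_log_sum(p_matrix):
    """Equation 8 as read here: x = -(1/L) * sum over k and l of log P_kl.

    p_matrix[k][l] is the p value of experimental sample k against control l.
    """
    n_controls = len(p_matrix[0])
    return -sum(math.log(p) for row in p_matrix for p in row) / n_controls


def global_p_value(x, k):
    """Equation 7: P(X > x) for X ~ Gamma(k, 1) with integer shape k."""
    return math.exp(-x) * sum(x ** i / math.factorial(i) for i in range(k))


# Hypothetical gene set statistics against a background of all genes.
t, df = gage_t_test(m=1.0, s=1.0, n=4, M=0.0, S=1.0)

# Hypothetical p values: K = 2 experimental samples against L = 2 controls.
p_matrix = [[0.5, 0.5], [0.25, 0.25]]
x = neg_log_sum(p_matrix)
print(t, df, global_p_value(x, k=len(p_matrix)))
```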

2.4.3 Results

Both GSEAlm and GAGE, each implemented in R, were used to analyse the data. Of these, only the results for GAGE are presented in this report, due to problems with the GSEAlm analysis. GSEAlm predicted that there were no significant pathways with p values less than 0.05 for the Middleton study. Since the LIMMA analysis showed earlier that there were significantly differentially expressed genes between the Parkinson’s cases and the controls in the Middleton study, this seems surprising and biologically unrealistic. It could be explained by the low sensitivity of GSEA that has been suggested in the literature [4]. Additionally, there seems to be a problem with this version of GSEA, GSEAlm: the output has many different pathways with exactly the same p value, and from online resources [10] this seems to be a common feature of the program. Because of this the GSEAlm output fails to give a definitive ranking of the significance of the gene pathways and fails to hit pathways which are biologically relevant, and so was judged unsuitable for use in this report.

The results from the GAGE analysis are given in Tables 1, 2 and 3. Table 1 shows the consensus pathways between all three studies that have been significantly regulated up or down. Table 2 shows significant pathways relevant to DNA damage and stress present in each individual study, and Table 3 shows the significantly differentially expressed genes in these pathways. The full implications of these results will be discussed in the conclusion.

3 Conclusion

Evidence from the bioinformatic results in this report suggest that the hypothesis given in the introduction is correct. The clearest demonstration of this is in Table 1 and 2. Table 1 shows many mitochondrial pathways down regulated as expected but also that DNA damage and stress related pathways are altered. Table 2 gives the significant pathways related to DNA damage and stress in each of the data sets analysed, this shows just how many DNA damage and stress related pathways were shown to be involved. Table

9 Robert Bentham peroxisome glycolysis Description synaptosome GDP binding fatty acid binding syntaxin-1 binding proton transport synaptic transmission phospholipase binding protein polymerization mitochondrial matrix regulation of exocytosis movement of substances response to amphetamine enzyme regulator activity during mitotic cell cycle adult locomotory behavior electron transport chain endocytic vesicle membrane synaptic vesicle endocytosis malate metabolic process protein homotetramerization glycogen biosynthetic process iron-sulfur cluster binding dopamine biosynthetic process mitochondrial inner membrane cellular carbohydrate metabolic process small conjugating protein ligase activity small GTPase mediated signal transduction hydrogen ion transmembrane transporter activity NADH dehydrogenase (ubiquinone) activity regulation of long-term neuronal synaptic plasticity ubiquitin-dependent protein catabolic process generation of precursor metabolites and energy anaphase-promoting complex-dependent proteasomal mitochondrial electron transport, NADH to ubiquinone proton-transporting ATP synthase complex, coupling factor F(o) positive regulation of ubiquitin-protein ligase activity during mitotic cell cycle 1(b) GOTerm GO:0006886 intracellular protein transport GO:0016820 hydrolase activity, acting on acid anhydrides, catalyzing transmembrane GO:0007269 neurotransmitter secretion GO:0006836GO:0051281 positive regulation of release of sequestered calcium ionGO:0030424 into cytosol neurotransmitter transport axon GO:0007612GO:0051246GO:0005883 regulation of protein metabolic process learning neurofilament GO:0006887GO:0007268 GO:0003924GO:0016192GO:0051437 GO:0030426GO:0042416 GO:0007626GO:0001975 GO:0048169 GO:0043524GO:0008344 GO:0000502GO:0043274 vesicle-mediated transport exocytosisGO:0008198 GTPase activity GO:0005777 GO:0051258 GO:0048854GO:0030666 growth cone locomotoryGO:0007264 behavior GO:0030672 negative regulation of neuron apoptosis GO:0009636GO:0005978 
GO:0016829GO:0019717 proteasome complex ferrous iron binding brain morphogenesis GO:0000226GO:0005838 synapticGO:0051289 vesicle membrane GO:0017157 GO:0019003 GO:0017075 response to toxin GO:0005504 lyaseGO:0015078 activity GO:0006413GO:0048488 microtubule cytoskeleton organization GO:0044262 GO:0030234 proteasome regulatory particle GO:0019787 GO:0004298 translational initiation threonine-type endopeptidase activity GO:0031145 GO:0006108 GO:0070469 respiratory chain GO:0005743 GO:0051436 GO:0005759 negative regulation of ubiquitin-protein ligase activity GO:0006120 GO:0005747 GO:0022900 mitochondrial respiratory chain complex I GO:0006099 tricarboxylic acid cycle GO:0046961 proton-transporting ATPase activity, rotational mechanism GO:0015992 GO:0006096 GO:0008137 GO:0030170 GO:0046933 GO:0006091 hydrogenGO:0045263 ion transporting ATP synthase activity, rotationalGO:0006800 mechanism oxygen andpyridoxal reactive phosphate oxygen binding species metabolic process GO:0009055 GO:0051536 electron carrier activity GO:0006626 protein targeting to mitochondrion GO:0042776 mitochondrial ATP synthesis coupled proton transport GO:0051287 NAD or NADH binding related pathways α − was not shown to be 1 α − P GC 1 P GC cell cycle arrest inflammatory response cellular defense response transcription repressor activity positive regulation of transforming growth factor beta receptor signaling pathway was shown to be down regulated leading to defects in mitochondrial α − 1(a) 1 GOTerm Description GO:0042326 negative regulation of phosphorylation GO:0005540GO:0006954 GO:0007507GO:0030511 hyaluronic acid binding heart development GO:0004861 cyclin-dependent protein kinase inhibitor activity GO:0016564 GO:0007050 GO:0006968 Table 1: 1(a) and (b) show all the significant gene pathways found from the clearly are significant here. statistically significant in the LIMMA analysis but function, unlike [19], a large meta study, Gene Ontology database using GAGE. 
Tablewere 1(a) shows significantly all the up pathways which were regulated, significantly while down table regulated.help (b) of The Dr shows results Gyorgy all Szabadkai, haveand into the been stress, pathways which and pathways annotated are pathways that with related relatedrepresent to to the DNA those mitochondrial damage functions. associated Pathwaysare with in the blue DNAmitochondrial damage (PGC-1regulated and pathways dependent) , stress arepathways. strongly whiledown related As those regulated to can in contain DNA be manyred damageare seen pathways and PGC-1 associated the stress dependent. up with whileP mitochondrial GC the which These results confirm the conclusions in [19] where

10 Robert Bentham 05. . 0 < DNA repair cell cycle arrest DNA replication chromatin remodeling histone H3 acetylation caspase activator activity histone H4-K16 acetylation regulation of anti-apoptosis transcription factor complex transcription repressor activity transcription corepressor activity regulation of gene-specific transcription heterogeneous nuclear ribonucleoprotein complex cyclin-dependent protein kinase inhibitor activity negative regulation of transcription factor activity positive regulation of ubiquitin-protein ligase activity during mitotic cell cycle positive regulation of gene-specific transcription from RNA polymerase II promoter Significantly down regulated GO pathways related to stress/DNA damage Significantly up regulated GO pathways related to stress/DNA damage in the GOTerm Description GOTerm Description GO:0051437 GO:0016566 specific transcriptional repressor activity GO:0010553 negative regulation of gene-specific transcription from RNA polymerase II promoter GO:0000118GO:0000084GO:0016581GO:0012501 histone deacetylase complex S phase of mitotic cell cycle NuRD complex programmed cell death GO:0016564 GO:0045892GO:0032583 GO:0003704GO:0004861 GO:0043433 GO:0005667 GO:0035257 negativeGO:0043984 regulation of transcription,GO:0005694 specific DNA-dependent RNA polymeraseGO:0043966 II transcriptionGO:0042800 factor activity GO:0030530 GO:0006950GO:0007050 GO:0003714 GO:0008656 GO:0006260 nuclear histone hormone methyltransferase receptorGO:0045941 activity binding (H3-K4 specific) GO:0006338 GO:0016605GO:0010552 GO:0006281 GO:0044428 GO:0045767 response to stress positive regulation of transcription PML body nuclear part (b) in the Middleton study (d) Mullen study DNA repair cell cycle arrest nucleic acid binding induction of apoptosis histone H3 acetylation histone deacetylase complex transcription activator activity transcription repressor activity negative regulation of apoptosis transcription corepressor activity histone acetyltransferase 
Table 2: Gene pathways related to stress/DNA damage that have been significantly up or down regulated. Panels (a) and (c): significantly up regulated GO pathways related to stress/DNA damage in the Middleton and Moran studies (columns: GOTerm, Description).

Gene Name     logFC Adjusted P value
GAS1       1.094643 2.055961E-02
BANF1      0.541628 2.219783E-02
DNAJB6     1.241308 2.219783E-02
MYST3      0.484861 2.349617E-02
HSPA1L     0.766496 2.784302E-02
PHF21A     0.503459 2.784302E-02
INSR       0.759715 2.934864E-02
HNRNPH3    0.552047 2.934864E-02
CXXC1      0.430215 2.934864E-02
CUL2      -0.672837 2.943436E-02
TRIM28     0.372411 3.553511E-02
KAT2A      0.567463 3.926393E-02
HBP1       0.491259 4.561574E-02
HNRNPH3    0.548458 4.884567E-02
PHF15      0.873731 4.901984E-02
(a) Genes in the Mullen study with P values < 0.05 in gene pathways related to DNA damage and stress

Gene Name  Moran  Middleton  Mullen
DNAJB6     ✓      ✗          ✓
HSPA1L     ✓      ✗          ✓
PHF21A     ✓      ✗          ✓
INSR       ✓      ✗          ✓
HNRNPH3    ✓      ✗          ✓
CXXC1      ✓      ✗          ✓
CUL2       ✓      ✗          ✓
KAT2A      ✓      ✗          ✓
HBP1       ✓      ✗          ✓
PHF15      ✓      ✗          ✓
MBD3       ✓      ✓          ✗
IKBKB      ✓      ✓          ✗
MAP3K11    ✗      ✓          ✗
TCIRG1     ✗      ✓          ✗
FOXO1      ✗      ✓          ✗
GAS1       ✗      ✗          ✓
BANF1      ✗      ✗          ✓
MYST3      ✗      ✗          ✓
TRIM28     ✗      ✗          ✓
(b) Genes which are significant (P value < 0.05) in multiple studies. Most significant genes in the Moran study omitted here and fully given in Appendix A

Gene Name     logFC Adjusted P value
MBD3       0.588015 2.692973E-02
MAP3K11    0.656159 4.069257E-02
IKBKB      0.211414 4.160097E-02
TCIRG1     0.509469 4.639875E-02
FOXO1      0.528559 4.903848E-02
(c) Genes in the Middleton study with P values < 0.05 in gene pathways related to DNA damage and stress

Table 3: Significant genes in pathways related to DNA damage and stress. (a) and (c) show the significant genes in the Mullen and Middleton studies respectively, while (b) shows which genes are significant in multiple studies. 381 genes related to DNA damage and stress pathways were significant in the Moran study and are listed in full in Appendix A.
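
The membership pattern in Table 3(b) can be derived with set operations on the per-study gene lists. The following is a minimal sketch in base R, assuming the gene symbols from Tables 3(a) and 3(c); the Moran vector is only a short excerpt of Appendix A chosen for illustration, and all variable names are hypothetical rather than taken from the report's own code.

```r
# Gene symbols copied from Tables 3(a) and 3(c); 'moran' is an illustrative
# excerpt of Appendix A, not the full list of 381 entries.
mullen    <- unique(c("GAS1", "BANF1", "DNAJB6", "MYST3", "HSPA1L", "PHF21A",
                      "INSR", "HNRNPH3", "CXXC1", "CUL2", "TRIM28", "KAT2A",
                      "HBP1", "HNRNPH3", "PHF15"))
middleton <- c("MBD3", "MAP3K11", "IKBKB", "TCIRG1", "FOXO1")
moran     <- c("DNAJB6", "HSPA1L", "PHF21A", "INSR", "HNRNPH3", "CXXC1",
               "CUL2", "KAT2A", "HBP1", "PHF15", "MBD3", "IKBKB")

# One row per gene, one logical column per study.
genes <- unique(c(mullen, middleton, moran))
membership <- data.frame(
  Gene      = genes,
  Moran     = genes %in% moran,
  Middleton = genes %in% middleton,
  Mullen    = genes %in% mullen
)

# Number of studies in which each gene is significant.
membership$Studies <- rowSums(membership[, c("Moran", "Middleton", "Mullen")])
print(membership[order(-membership$Studies), ], row.names = FALSE)
```

Counting studies per gene in this way avoids hard-coding how many genes fall into each overlap group.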

3 then shows significant genes involved in these DNA damage related pathways, and which genes were significant in more than one of the datasets. These significant genes could be useful in finding a new therapeutic target for Parkinson’s disease. A larger study, or a meta-analysis including more microarray data, would give a clearer picture of the genes and pathways that have been up or down regulated than the fairly noisy one presented in this study. However, despite the relatively small sample sizes and accompanying noise, the overall trend of down regulated mitochondrial pathways and altered DNA damage and stress pathways is clear. More research at the interface of these two areas, and into the role of PGC-1 in Parkinson’s disease, would improve our understanding and could help develop new approaches for the treatment of Parkinson’s disease.

References

[1] T.E. Bartlett. Bioinformatic analysis of the interface between mitochondrial biogenesis and apoptotic cell death signalling pathways in cancer. MRes Summer Project, 2011.

[2] Kevin Bryson. private communication, 2012.

[3] M.J. Devine, H. Plun-Favreau, and N.W. Wood. Parkinson’s disease and cancer: two wars, one front. Nature Reviews Cancer, 11(11):812–823, 2011.

[4] I. Dinu, J. Potter, T. Mueller, Q. Liu, A. Adewale, G. Jhangri, G. Einecke, K. Famulski, P. Halloran, and Y. Yasui. Improving gene set analysis of microarray data by SAM-GS. BMC Bioinformatics, 8(1):242, 2007.

[5] M.R. Duchen and G. Szabadkai. Roles of mitochondria in disease. Essays Biochem, 47:115–137, 2010.

[6] GEO. http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE7621, 2007. Accessed: 28/02/2012.

[7] GEO. http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE8397, 2008. Accessed: 28/02/2012.

[8] GEO. http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE20292, 2010. Accessed: 28/02/2012.

[9] GEO. http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE24378, 2011. Accessed: 28/02/2012.

[10] Daniel Gusenleitner. Gene set enrichment analysis (GSEAlm) tutorial. http://bcb.dfci.harvard.edu/~aedin/courses/cccb-introduction-to-r-and-bioconductor-may-2011/tutorial.pdf. Accessed: 28/02/2012.

[11] R.A. Irizarry, B. Hobbs, F. Collin, Y.D. Beazer-Barclay, K.J. Antonellis, U. Scherf, and T.P. Speed. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4(2):249, 2003.

[12] D.K. Jeppesen, V.A. Bohr, and T. Stevnsner. DNA repair deficiency in neurodegeneration. Progress in Neurobiology, 2011.

[13] Z. Jiang and R. Gentleman. Extensions to gene set enrichment. Bioinformatics, 23(3):306, 2007.

[14] A.W.E. Jones, Z. Yao, J.M. Vicencio, A. Karkucinska-Wieckowska, and G. Szabadkai. PGC-1 family coactivators and cell fate: roles in cancer, neurodegeneration, cardiovascular disease and retrograde mitochondria-nucleus signalling. Mitochondrion, 2011.

[15] Audrey Kauffmann, Robert Gentleman, and Wolfgang Huber. arrayQualityMetrics: a Bioconductor package for quality assessment of microarray data. Bioinformatics, 25(3):415–6, 2009.

[16] S.Y. Kim and D. Volsky. PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics, 6(1):144, 2005.

[17] T.G. Lesnick, S. Papapetropoulos, D.C. Mash, J. Ffrench-Mullen, L. Shehadeh, M. De Andrade, J.R. Henley, W.A. Rocca, J.E. Ahlskog, and D.M. Maraganore. A genomic pathway approach to a complex disease: axon guidance and Parkinson disease. PLoS Genetics, 3(6):98, 2007.

[18] W. Luo, M. Friedman, K. Shedden, K. Hankenson, and P. Woolf. GAGE: generally applicable gene set enrichment for pathway analysis. BMC Bioinformatics, 10(1):161, 2009.

[19] J.K. McGill and M.F. Beal. PGC-1α, a new therapeutic target in Huntington’s disease? Cell, 127(3):465–468, 2006.

[20] M. Miron and R. Nadon. Inferential literacy for experimental high-throughput biology. Trends in Genetics, 22(2):84–89, 2006.

[21] L.B. Moran, D.C. Duke, M. Deprez, D.T. Dexter, R.K.B. Pearce, and M.B. Graeber. Whole genome expression profiling of the medial and lateral substantia nigra in Parkinson’s disease. Neurogenetics, 7(1):1–11, 2006.

[22] A.P. Oron, Z. Jiang, and R. Gentleman. Gene set enrichment analysis using linear models and diagnostics. Bioinformatics, 24(22):2586–2591, 2008.

[23] Assaf Oron and Robert Gentleman (with contributions from S. Falcon and Z. Jiang). GSEAlm: Linear Model Toolset for Gene Set Enrichment Analysis. R package version 1.8.0.

[24] G. Smyth. Limma: linear models for microarray data. Bioinformatics and computational biology solutions using R and Bioconductor, pages 397–420, 2005.

[25] Gordon K. Smyth. Limma: linear models for microarray data. In R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, and W. Huber, editors, Bioinformatics and Computational Biology Solutions using R and Bioconductor, pages 397–420. Springer, New York, 2005.

[26] A. Subramanian, P. Tamayo, V.K. Mootha, S. Mukherjee, B.L. Ebert, M.A. Gillette, A. Paulovich, S.L. Pomeroy, T.R. Golub, E.S. Lander, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43):15545, 2005.

[27] L. Tian, S.A. Greenberg, S.W. Kong, J. Altschuler, I.S. Kohane, and P.J. Park. Discovering statistically significant pathways in expression profiling studies. Proceedings of the National Academy of Sciences of the United States of America, 102(38):13544, 2005.

[28] Y. Zhang, M. James, F.A. Middleton, and R.L. Davis. Transcriptional analysis of multiple brain regions in Parkinson’s disease supports the involvement of specific protein processing, energy metabolism, and signaling pathways, and suggests novel disease mechanisms. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics, 137(1):5–16, 2005.

Appendices

A Tables

Gene name     logFC Adjusted P value   Gene name     logFC Adjusted P value   Gene name     logFC Adjusted P value
CUX2      -0.957411 3.110158E-08       FGFR1      0.220495 3.832065E-03       MYST1      0.147059 1.765914E-02
NR4A2     -1.206677 2.511338E-07       ORC2L     -0.238690 3.850204E-03       KRT18     -0.293551 1.768079E-02
YTHDC2    -0.956063 5.114295E-07       SIRT2      0.337159 3.852516E-03       ERI3      -0.357999 1.789257E-02
PSEN2     -0.597398 6.201595E-07       DRD2      -0.323548 3.898446E-03       PFDN5      0.257103 1.796585E-02
RBM9      -0.691800 9.260810E-07       MAFF       0.312910 3.917795E-03       DLG3      -0.201248 1.810632E-02
ICMT      -0.454469 1.077351E-06       VEZF1      0.327521 4.006702E-03       APEX1     -0.229858 1.863233E-02
DRD2      -0.833074 1.736227E-06       C1D       -0.376084 4.010220E-03       WARS      -0.465903 1.888945E-02
SUB1      -1.168283 2.409076E-06       SMARCA4   -0.421016 4.029591E-03       MAB21L1   -0.242650 1.902206E-02
NR4A2     -1.305759 2.505296E-06       NDRG4     -0.964717 4.030569E-03       CUL3      -0.280591 1.915695E-02
PBX1      -1.030599 3.204438E-06       KDM5A      0.359093 4.083478E-03       CDKN1C     0.580427 1.927160E-02
ATP8A2    -0.757973 9.099608E-06       TPD52L1    0.553849 4.128617E-03       DUSP10     0.227539 1.970679E-02
MED24     -0.337476 1.521774E-05       RPS9       0.213239 4.145544E-03       RPS9       0.331523 1.970843E-02
FABP7     -0.948737 1.784250E-05       DEAF1     -0.456666 4.187411E-03       PPM1D      0.292983 1.991855E-02
PAN2       0.587286 1.862470E-05       TFEB       0.377877 4.333228E-03       SMARCC1    0.311631 1.996298E-02
BASP1     -1.101985 2.422178E-05       ZC3H7B     0.323857 4.454371E-03       ATRX      -0.252841 2.003025E-02
RNF14     -0.762596 2.657807E-05       NAB1      -0.390073 4.481002E-03       CAPN10     0.171640 2.032344E-02
MXI1       0.426898 2.777041E-05       RPS4X      0.301142 4.481002E-03       IKBKB      0.233266 2.032344E-02
OBFC1     -0.345506 3.074203E-05       INSR       0.571061 4.494150E-03       HSPA1L     0.337327 2.060335E-02
NR4A2     -0.858372 3.837342E-05       MED7      -0.229032 4.634760E-03       RNF14     -0.613698 2.060335E-02
DRD2      -0.528345 4.053527E-05       RASSF1     0.217266 4.686959E-03       FOXO4      0.261129 2.123364E-02
ZBTB16     0.804387 4.154816E-05       SAP30      0.370421 4.750235E-03       CHD3      -0.231212 2.129811E-02
CASC3      0.459059 4.227519E-05       CIAO1     -0.233149 4.828717E-03       CIZ1       0.198300 2.132111E-02
HLTF      -0.610986 4.407303E-05       CDKN1C     0.630543 4.884169E-03       RAF1       0.297875 2.213752E-02
RNF10     -0.363115 4.492968E-05       LRCH4      0.240531 4.909274E-03       MBD3       0.232923 2.216529E-02
HBP1       0.430443 6.173774E-05       PHF17      0.288786 5.008478E-03       CDKN2C     0.191727 2.217483E-02
TOB2       0.625310 7.242554E-05       RYBP       0.437083 5.008588E-03       DNAJA2    -0.565387 2.217483E-02
ODZ1      -0.774075 8.018387E-05       AGGF1     -0.357470 5.177286E-03       PTBP1      0.312594 2.278352E-02
FOXA1     -0.493710 1.082717E-04       HTATIP2    0.260162 5.269930E-03       WARS      -0.405446 2.320651E-02
SIN3B     -0.362442 1.132238E-04       YWHAB     -0.278063 5.304984E-03       TFDP2     -0.173172 2.323106E-02
SORT1      0.567520 1.433114E-04       RAN       -0.494002 5.435535E-03       ZCCHC14    0.311412 2.339757E-02
FABP7     -0.856448 1.466798E-04       ABCA2      0.597212 5.484714E-03       CREBBP     0.273296 2.360719E-02
RBM9      -0.790446 1.526903E-04       RBM9      -0.576966 5.580142E-03       ZNF274     0.205700 2.373376E-02
DNAJB6     1.030789 1.622180E-04       YWHAB     -0.526150 5.645709E-03       PRKAR1A   -0.349559 2.384159E-02
AZGP1      1.429307 1.780275E-04       SMARCA4   -0.309234 5.717543E-03       MLH1      -0.151832 2.420398E-02
R3HDM1    -0.407426 1.873502E-04       RBMS1     -0.331339 5.727718E-03       CUL2      -0.244754 2.434984E-02
PKNOX2    -0.323459 1.917057E-04       TARDBP     0.366918 5.785979E-03       DLG3      -0.186573 2.493001E-02
DNAJB2     0.566402 1.917057E-04       MAPK1     -0.465367 5.900381E-03       RTEL1      0.097418 2.494372E-02
MAPK9     -0.697986 1.950452E-04       C11orf9    0.530671 5.954850E-03       FOXA2     -0.248195 2.496548E-02
RXRA       0.486204 2.353479E-04       TPD52L1    0.708256 6.003614E-03       NBN       -0.389994 2.509963E-02
SCFD1     -0.435021 2.383114E-04       ZC3H11A    0.286946 6.123835E-03       FOXL2      0.129132 2.518111E-02
PSEN2     -0.360370 2.443693E-04       SMARCC1    0.298454 6.163572E-03       DBP       -0.215670 2.519681E-02
TRAK1     -0.334989 2.905580E-04       KDM4B      0.215654 6.163572E-03       ARNTL      0.187634 2.555248E-02
LRCH4      0.280912 3.346337E-04       PHB        0.351591 6.179972E-03       CLEC11A    0.211593 2.560748E-02
ATR       -0.487612 3.365620E-04       FTH1       0.333135 6.211837E-03       CDKN1C     0.592224 2.578780E-02
PBX1      -0.977497 3.587020E-04       FOXO3      0.414799 6.499113E-03       ENPP2      0.411463 2.582428E-02
SCG2      -1.678182 3.704575E-04       CAND1     -0.421703 6.700786E-03       ASH2L     -0.208881 2.603854E-02
SIN3B     -0.458179 3.775224E-04       RBMS1     -0.379899 6.708206E-03       EIF1       0.184540 2.616135E-02
TXNIP      0.763070 3.835557E-04       EIF1       0.227191 6.810593E-03       OLIG2      0.361263 2.620637E-02
TCF12      0.521666 3.865018E-04       TIPARP     0.423771 6.953670E-03       CTBP2      0.186639 2.696214E-02
TCF25     -0.371687 4.053739E-04       ADARB2     0.367469 7.017789E-03       TBL1X      0.182123 2.738215E-02
MYT1L     -1.093676 4.053739E-04       ING1       0.186267 7.034542E-03       PEX14     -0.219804 2.757462E-02
DRD2      -0.351110 4.269094E-04       AHSA1      0.583125 7.146545E-03       CUL4B     -0.190014 2.793711E-02
PPP2R5C   -0.556722 4.727540E-04       PTPRU     -0.249405 7.167527E-03       DDX23      0.216878 2.814519E-02
MTMR15     0.426398 4.882360E-04       CNOT8      0.349014 7.291980E-03       GAS7       0.281240 2.832397E-02
CUL2      -0.204652 5.021639E-04       SOX10      0.519145 7.291980E-03       EXOG      -0.243930 2.834072E-02
PIAS2     -0.439832 5.083587E-04       KPNB1     -0.304317 7.508054E-03       NFE2       0.171944 2.899066E-02
SMARCA4   -0.523213 5.326174E-04       HUS1      -0.151773 7.508054E-03       ZNF143     0.249454 2.919955E-02
HTR2A     -0.603349 5.429389E-04       TRAK1     -0.360190 7.714198E-03       SERTAD2    0.285793 2.950488E-02
HNRNPA0   -0.394298 5.512777E-04       C16orf5    0.480987 7.784919E-03       CAPNS1    -0.265057 2.958784E-02
NFKBIA     0.848332 5.735661E-04       SMARCD3   -0.376585 7.784919E-03       HRAS      -0.315547 3.000099E-02
AZGP1      0.876736 5.963782E-04       HR        -0.251331 7.966696E-03       ZNF862     0.168448 3.027087E-02
USP21      0.384855 6.023984E-04       FADS1     -0.339560 7.966696E-03       PRKRIR    -0.197368 3.055533E-02
ATF4       0.493282 6.129153E-04       SERP1      0.493409 8.013586E-03       PINK1     -0.331692 3.103766E-02
UCHL1     -1.130250 6.129153E-04       MTDH      -0.430631 8.035706E-03       CAT        0.316974 3.121511E-02
ATR       -0.197517 6.612463E-04       CDC7      -0.229709 8.044592E-03       SQSTM1     0.298721 3.255658E-02
PHF21A     0.360997 6.661077E-04       BCL6       0.684969 8.434040E-03       ELF2       0.232207 3.255658E-02
SGK1       0.808574 7.873610E-04       CTBP2      0.301135 8.446550E-03       CALCOCO1   0.235879 3.285726E-02
P2RX7      0.851832 7.873610E-04       TCEB1     -0.200684 8.638273E-03       NCOA6      0.167214 3.374271E-02
FOXA2     -0.297080 7.874555E-04       CA2        0.830987 8.686105E-03       HTATIP2    0.370030 3.377235E-02
SIAH2      0.263142 7.961976E-04       YBX1       0.380355 8.805353E-03       PARP4      0.253744 3.379494E-02
CXXC1      0.304683 8.022894E-04       EIF1       0.202097 9.381447E-03       PRKCZ     -0.380026 3.393437E-02
TOB2       0.803282 8.116605E-04       RPS4X      0.257787 9.594234E-03       SP1        0.133998 3.402302E-02
MXD4       0.330607 8.430294E-04       BECN1     -0.358866 9.722317E-03       IGF1R      0.663056 3.447343E-02
PRKDC     -0.419466 8.506714E-04       MKRN1     -0.419459 9.889990E-03       SERTAD2    0.319964 3.451420E-02
MUS81      0.231609 8.752821E-04       CTCF       0.281895 9.943180E-03       ADRA2A    -0.233688 3.482569E-02
PHF15      0.480905 9.156354E-04       HNRNPD    -0.251050 9.943782E-03       SF3A3     -0.182652 3.499074E-02
KIFAP3    -0.812213 1.086406E-03       NME1      -0.563889 9.944915E-03       PTPRF     -0.222122 3.531795E-02
RAD17     -0.225555 1.125597E-03       SIRT4      0.223675 1.003110E-02       PIAS2     -0.083276 3.626421E-02
KRAS      -0.511846 1.139155E-03       CCK       -0.777129 1.067137E-02       SSR1      -0.285912 3.632779E-02
MKNK2      1.029514 1.139155E-03       TMEM161A   0.169364 1.071402E-02       HTR2A     -0.210176 3.670744E-02
PNKP      -0.365547 1.233909E-03       CAT        0.447409 1.085852E-02       SMARCD2    0.190971 3.749642E-02
GADD45G    0.441461 1.254979E-03       SAP30      0.387418 1.102105E-02       USP22     -0.230039 3.773870E-02
CAND2     -0.326311 1.278468E-03       ZC3H15    -0.401399 1.114595E-02       LUC7L3     0.364510 3.776656E-02
NDN       -0.469761 1.292416E-03       TXNIP      0.783723 1.122945E-02       SMARCA4   -0.351617 3.785179E-02
ZCCHC24    0.446347 1.357363E-03       APC       -0.443112 1.123172E-02       LITAF      0.273056 3.849488E-02
TRMT11    -0.427785 1.388872E-03       BRPF1      0.233774 1.150433E-02       RPS4X      0.169256 3.864644E-02
PTN       -0.449501 1.388872E-03       FTSJ2     -0.196468 1.150433E-02       IL6ST      0.300552 3.947735E-02
GAS7       0.422373 1.401236E-03       RPS6       0.245071 1.218943E-02       FTSJ1     -0.269207 3.952213E-02
CXXC1      0.223148 1.463128E-03       NFKB1      0.235819 1.218943E-02       NOTCH1     0.235740 3.976578E-02
KDM1A     -0.226117 1.477855E-03       HNRNPH3    0.286614 1.247986E-02       NFX1       0.136595 3.977656E-02
BTG1       0.659258 1.477855E-03       REXO4      0.230182 1.258635E-02       APBB2      0.185202 4.005845E-02
TXNIP      0.769755 1.512227E-03       PRPF19    -0.214907 1.265220E-02       DBC1      -0.355175 4.005845E-02
PBX1      -0.403184 1.539236E-03       PTBP1      0.456295 1.279337E-02       PIAS4      0.265963 4.073692E-02
NOP2       0.304194 1.574186E-03       HNRNPH3    0.415568 1.296613E-02       TRAP1     -0.162923 4.076317E-02
YWHAB     -0.496887 1.577064E-03       HNRNPH3    0.401294 1.296613E-02       STK3       0.256124 4.077380E-02
TCEA2     -0.538326 1.586035E-03       BCL2       0.419075 1.307179E-02       RB1CC1    -0.238072 4.137764E-02
PHLDA3     0.233626 1.586035E-03       CTBP2      0.420036 1.307265E-02       KDM4B      0.141568 4.174237E-02
ZNF24      0.349494 1.598659E-03       DFFB       0.228604 1.325943E-02       PMS2L3    -0.191117 4.193934E-02
MXD4       0.287742 1.619508E-03       SUB1      -0.344900 1.340769E-02       HINFP      0.219310 4.205649E-02
STS       -0.526712 1.923440E-03       PTBP1      0.408555 1.360090E-02       PTBP1      0.291009 4.218885E-02
HNRNPF     0.435092 1.940266E-03       ZHX2       0.249500 1.393329E-02       KHDRBS1   -0.275841 4.314293E-02
TASP1     -0.205034 2.116698E-03       PTN       -0.406477 1.425780E-02       BARD1      0.226402 4.330154E-02
SESN1      0.317038 2.167532E-03       HDAC3     -0.152992 1.427661E-02       MXD4       0.384053 4.344766E-02
TGIF1      0.413216 2.291665E-03       RPS6       0.179358 1.459720E-02       KPNB1     -0.194875 4.373219E-02
ATP8A2    -0.947873 2.312591E-03       TBP       -0.133868 1.462089E-02       CAND1     -0.319295 4.409346E-02
CHD3      -0.319365 2.413011E-03       NARS      -0.231926 1.463586E-02       SETMAR     0.202307 4.425921E-02
HDAC1      0.443182 2.416774E-03       NFE2L2     0.447991 1.470688E-02       ATF3       0.289605 4.451829E-02
EIF1       0.305392 2.455681E-03       MAFF       1.314520 1.482125E-02       CDC123    -0.195283 4.469899E-02
HNRNPH3    0.443930 2.456575E-03       CHD1L      0.256668 1.512965E-02       TAF9      -0.341545 4.469899E-02
SIRT3     -0.315936 2.462529E-03       ASNS      -0.651617 1.520338E-02       MAPK9     -0.295685 4.526086E-02
SMARCA4   -0.366475 2.571390E-03       DDX39      0.380958 1.523832E-02       ST18       0.501593 4.539651E-02
CAND1     -0.295231 2.644765E-03       NARS2     -0.262346 1.547714E-02       NPAT      -0.197968 4.600492E-02
HTRA2     -0.269759 2.680301E-03       CIZ1       0.215167 1.547714E-02       RELA       0.228300 4.623955E-02
TGFB3      0.443781 2.806006E-03       SMARCC1    0.260351 1.553457E-02       CDKN1C     0.583804 4.636397E-02
SATB1     -0.472312 2.811070E-03       LRCH4      0.224864 1.555004E-02       BRD1       0.138380 4.648539E-02
ARID5B     0.381866 2.857923E-03       TMEM204    0.309821 1.561598E-02       HDAC4      0.204620 4.663850E-02
RING1      0.174505 2.864779E-03       L3MBTL     0.210720 1.569126E-02       CCND1     -0.441077 4.689704E-02
KAT2A      0.293493 2.902642E-03       EDNRB     -0.924036 1.570343E-02       CIZ1       0.169355 4.735165E-02
CDKN2C     0.433614 3.136708E-03       MEIS2      0.374867 1.593708E-02       BRD7       0.260610 4.755995E-02
BNIP3L     0.279732 3.194006E-03       SRCAP      0.143141 1.630483E-02       PHF16      0.205263 4.785161E-02
MCTS1     -0.452188 3.226521E-03       NCOR1      0.215121 1.654489E-02       KCNMA1     0.267207 4.785472E-02
GAS7       0.300674 3.336826E-03       PBXIP1     0.410710 1.674284E-02       PFDN5      0.190537 4.824541E-02
ZNF282     0.206052 3.484791E-03       AARSD1    -0.230421 1.677547E-02       HNRNPD    -0.328193 4.874030E-02
SMAD2     -0.327491 3.649291E-03       YTHDC2    -0.209308 1.692677E-02       RNASE4     0.209620 4.923652E-02
ZNF423    -0.464177 3.728742E-03       YBX1       0.451395 1.694194E-02       ZEB1      -0.306020 4.923652E-02
RBMS1     -0.340321 3.728742E-03       ZFP161     0.208603 1.714257E-02       PATZ1      0.186908 4.941602E-02
APC       -0.279449 3.789623E-03       BCL2L13   -0.314110 1.744508E-02       CDKN1C     0.542014 4.942412E-02

Genes in the Moran study with P values < 0.05 in gene pathways related to DNA damage and stress

B R code

Below is the R code for this report showing the major steps, as used for the Middleton dataset.

Quality Control

# Quality control for the Middleton study.
# Step 1: make sure the CEL files are in the working directory and load them into R.

library("affy")
library("arrayQualityMetrics")
library("limma")

Middleton <- ReadAffy()

# Define pheno data for the AffyBatch to include PD/C (Parkinson's/control) status
Middleton_Status <- c("C", rep("PD", 7), "C", "PD", rep("C", 7), "PD", "C", "C", "PD", "PD", rep("C", 7))
# One row per array: the sample index must have the same length as the status vector
Middleton_pheno_data <- new("AnnotatedDataFrame",
    data = data.frame(sample = seq_along(Middleton_Status), Status = Middleton_Status))
sampleNames(Middleton_pheno_data) <- list.celfiles()
phenoData(Middleton) <- Middleton_pheno_data

# Calculate and plot RNA degradation graph
Middleton_degrade <- AffyRNAdeg(Middleton, log.it = TRUE)
plotAffyRNAdeg(Middleton_degrade, transform = "shift.scale")

# MvA plots before normalisation (controls in two groups of nine, then the PD samples)
Middleton_controls <- which(Middleton_Status == "C")
Middleton_park <- which(Middleton_Status == "PD")
mva.pairs(exprs(Middleton[, Middleton_controls[1:9]]), log.it = TRUE, plot.method = "smoothScatter")
mva.pairs(exprs(Middleton[, Middleton_controls[10:18]]), log.it = TRUE, plot.method = "smoothScatter")
mva.pairs(exprs(Middleton[, Middleton_park]), log.it = TRUE, plot.method = "smoothScatter")

# Note: for a full quality analysis use the arrayQualityMetrics package:
# arrayQualityMetrics(expressionset = Middleton, outdir = "Middleton_QAraw", force = FALSE, do.logtransform = TRUE, intgroup = "Status")
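
For reference, each panel of an MvA plot compares two arrays through the quantities M (the log ratio) and A (the mean log intensity). The following is a minimal sketch in base R with made-up intensities; the numbers and variable names are illustrative assumptions and are not taken from the Middleton arrays.

```r
# Two hypothetical arrays of raw probe intensities.
x <- c(120, 450, 800, 2300, 15000)
y <- c(100, 500, 760, 2600, 14000)

# M is the log2 ratio, A the average log2 intensity of each probe.
M <- log2(x) - log2(y)
A <- (log2(x) + log2(y)) / 2

# For two comparable arrays the points scatter around M = 0.
plot(A, M, main = "MvA plot (toy data)")
abline(h = 0, lty = 2)
```

A systematic drift of M away from zero as A changes is the kind of intensity-dependent bias that normalisation is meant to remove.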

Normalisation, Quality Control and LIMMA Analysis

# Normalisation, post-normalisation quality control and LIMMA analysis

# Normalise using RMA
Middleton_normed <- rma(Middleton)

# Post-normalisation quality control
# (RMA values are already on the log2 scale, so no further log transform is applied)
mva.pairs(exprs(Middleton_normed[, Middleton_controls[1:9]]), log.it = FALSE, plot.method = "smoothScatter")
mva.pairs(exprs(Middleton_normed[, Middleton_controls[10:18]]), log.it = FALSE, plot.method = "smoothScatter")
mva.pairs(exprs(Middleton_normed[, Middleton_park]), log.it = FALSE, plot.method = "smoothScatter")

# Note: for a full quality analysis use the arrayQualityMetrics package:
# arrayQualityMetrics(expressionset = Middleton_normed, outdir = "Middleton_QAnorm", force = FALSE, do.logtransform = FALSE, intgroup = "Status")

# Continue with LIMMA analysis: create the design matrix
# (first column is the control baseline, second column is PD versus control)
Middleton_design <- model.matrix(~Middleton_normed$Status)
colnames(Middleton_design) <- c("C", "PDvsC")

# Run lmFit and eBayes
Middleton_fit <- lmFit(Middleton_normed, Middleton_design)
Middleton_fit <- eBayes(Middleton_fit)

# Do multiple hypothesis adjustment (Benjamini-Hochberg)
Middleton_top <- topTable(Middleton_fit, coef = "PDvsC", adjust.method = "BH", number = nrow(Middleton_normed))
Middleton_results <- decideTests(Middleton_fit, adjust.method = "fdr", p.value = 0.05)

# Write significant probes to file (column 2 of the results is the PDvsC coefficient)
Middleton_sig <- rownames(Middleton_results)[which(as.integer(Middleton_results[, 2]) != 0)]
write(Middleton_sig, file = "Middleton_sig_genes.txt")
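
The adjusted P values requested above with the "BH" option follow the Benjamini-Hochberg procedure. The following short sketch works through the arithmetic on made-up P values and compares the result with base R's p.adjust; the input vector is an illustrative assumption.

```r
# Hypothetical raw P values for five probes.
p <- c(0.001, 0.008, 0.039, 0.041, 0.60)
n <- length(p)

# Benjamini-Hochberg by hand: scale the i-th smallest P value by n / i,
# then enforce monotonicity working down from the largest P value.
o <- order(p, decreasing = TRUE)
bh_manual <- numeric(n)
bh_manual[o] <- pmin(1, cummin(p[o] * n / (n:1)))

# The same adjustment via the built-in helper.
bh_builtin <- p.adjust(p, method = "BH")
print(cbind(p, bh_manual, bh_builtin))
```

Note that the third and fourth values receive the same adjusted P value, which is why ties are common in the adjusted columns of the tables.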

Gene Set analysis GSEAlm

# GSEAlm method for gene set analysis of pathways from the Gene Ontology (GO) database

library(genefilter)
library(hgu133a.db)   # check with annotation() that this matches the chip; the Mullen data need hgu133plus2.db
# library(KEGG.db)
library(GO.db)
library(GSEABase)
library(GSEAlm)

# Get a GeneSetCollection from GO with all pathways
Middleton_gsc <- GeneSetCollection(Middleton_normed, setType = GOCollection())

# Create the incidence matrix (gene sets x probes) from the GeneSetCollection describing all pathways
Middleton_Am <- incidence(Middleton_gsc)

# Create an expression set with only the probes in the incidence matrix
Middleton_nsF <- Middleton_normed[colnames(Middleton_Am), ]

# Only select the pathways with more than 10 genes, as short pathways are difficult to analyse statistically
Middleton_selectedrows <- rowSums(Middleton_Am) > 10
Middleton_Am2 <- Middleton_Am[Middleton_selectedrows, ]

# Apply the GSEAlm algorithm with 2000 permutations
Middleton_perm <- gsealmPerm(Middleton_nsF, ~Status, mat = Middleton_Am2, nperm = 2000)

# Prepare the output. gsealmPerm returns a lower-tail (column 1) and an upper-tail (column 2)
# permutation P value per gene set: keep the smaller one and label the set DOWN or UP
# according to the tail it came from.
Middleton_permA <- pmin(Middleton_perm[, 1], Middleton_perm[, 2])
Middleton_permB <- ifelse(Middleton_perm[, 1] < Middleton_perm[, 2], "DOWN", "UP")
Middleton_permAdj <- p.adjust(Middleton_permA, method = "fdr")

Middleton_GOIDs <- rownames(Middleton_Am2)
Middleton_GO <- cbind(Middleton_GOIDs, as.vector(Term(Middleton_GOIDs)), Middleton_permB,
                      Middleton_permA, Middleton_permAdj)
Middleton_GO <- Middleton_GO[order(as.numeric(Middleton_GO[, 4])), ]
colnames(Middleton_GO) <- c("GOID", "GO Term", "UP/DOWN", "P value", "Adjusted P value")

# Save results to file
write.table(Middleton_GO, file = "Middleton_GO_terms.txt", sep = "\t")
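
The idea behind the gene set step can be illustrated without Bioconductor. The sketch below assumes that the set statistic is the sum of per-gene t statistics divided by the square root of the set size, with significance judged by permuting the sample labels. It is a simplified stand-in written for illustration, not the internals of gsealmPerm, and all data and names in it are simulated.

```r
set.seed(1)

# Simulated expression matrix: 50 genes x 12 samples (6 controls, 6 PD).
status <- factor(rep(c("C", "PD"), each = 6))
expr <- matrix(rnorm(50 * 12), nrow = 50,
               dimnames = list(paste0("g", 1:50), NULL))
# Shift the first ten genes upwards in the PD samples.
expr[1:10, status == "PD"] <- expr[1:10, status == "PD"] + 1.5

# Incidence matrix: rows are gene sets, columns are genes (1 = member).
Am <- rbind(setA = as.numeric(1:50 <= 10),
            setB = as.numeric(1:50 > 30))
colnames(Am) <- rownames(expr)

# Per-gene t statistic (PD minus C), then one normalised statistic per set.
gene_t <- function(lab) {
  apply(expr, 1, function(v)
    t.test(v[lab == "PD"], v[lab == "C"], var.equal = TRUE)$statistic)
}
set_stat <- function(lab) as.vector(Am %*% gene_t(lab)) / sqrt(rowSums(Am))

obs <- set_stat(status)

# Permutation null: shuffle the sample labels and recompute the statistics.
perm <- replicate(200, set_stat(sample(status)))
p_upper <- rowMeans(perm >= obs)
p_lower <- rowMeans(perm <= obs)
print(data.frame(set = rownames(Am), obs, p_lower, p_upper))
```

In this toy example the shifted set has a small upper-tail P value, which corresponds to the "UP" label in the listing above, while a small lower-tail P value would correspond to "DOWN".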

GAGE

# GAGE method for gene set analysis of pathways from the Gene Ontology (GO) database

# library(KEGG.db)
library(GO.db)
library(GSEABase)
library(gage)

# Use the GSEABase package to get the gene set collection; the format needs to be changed slightly to work with GAGE
Middleton_gsc <- GeneSetCollection(Middleton_normed, setType = GOCollection())
Middleton_geneset <- geneIds(Middleton_gsc)
Middleton_genesetnames <- names(Middleton_gsc)
# Name each gene set by its GO ID so that the rows of the GAGE output are labelled with GO IDs
names(Middleton_geneset) <- Middleton_genesetnames

# Apply the GAGE algorithm (controls as reference, PD as sample, unpaired comparison)
Middleton_gage <- gage(exprs(Middleton_normed), gsets = Middleton_geneset, ref = Middleton_controls,
                       samp = Middleton_park, compare = "unpaired")

# GO IDs of the gene sets, in the order returned by GAGE
Middleton_lessterms <- rownames(Middleton_gage$less)
Middleton_greaterterms <- rownames(Middleton_gage$greater)

# Prepare tables for saving (column 3 is the P value, column 4 the adjusted P value)
Middleton_GOgageless <- cbind(Middleton_lessterms, as.vector(Term(Middleton_lessterms)),
                              as.vector(Middleton_gage$less[, 3]), as.vector(Middleton_gage$less[, 4]))
Middleton_GOgagegreater <- cbind(Middleton_greaterterms, as.vector(Term(Middleton_greaterterms)),
                                 as.vector(Middleton_gage$greater[, 3]), as.vector(Middleton_gage$greater[, 4]))
colnames(Middleton_GOgageless) <- c("GOID", "GO Term", "P value", "Adjusted P value")
colnames(Middleton_GOgagegreater) <- c("GOID", "GO Term", "P value", "Adjusted P value")

write.table(Middleton_GOgagegreater, file = "Middleton_GO_gage_greater.txt", sep = "\t")
write.table(Middleton_GOgageless, file = "Middleton_GO_gage_less.txt", sep = "\t")

Find GO Pathways which are significant

# Find GO pathways that are significantly over expressed in all studies.
# The Moran_ and Mullen_ tables are assumed to have been produced by the same GAGE steps as the Middleton one.

# The tables are character matrices, so convert the adjusted P value column to numeric before comparing
Middleton_GOgagegreatersig <- Middleton_GOgagegreater[which(as.numeric(Middleton_GOgagegreater[, 4]) < 0.05), , drop = FALSE]
Moran_GOgagegreatersig <- Moran_GOgagegreater[which(as.numeric(Moran_GOgagegreater[, 4]) < 0.05), , drop = FALSE]
Mullen_GOgagegreatersig <- Mullen_GOgagegreater[which(as.numeric(Mullen_GOgagegreater[, 4]) < 0.05), , drop = FALSE]

# GO IDs that are significant in all three studies
sig_terms <- intersect(Mullen_GOgagegreatersig[, 1],
                       intersect(Moran_GOgagegreatersig[, 1], Middleton_GOgagegreatersig[, 1]))

GO_gagegreatersig <- cbind(sig_terms, Term(sig_terms))

write.table(Middleton_GOgagegreatersig, file = "Middleton_GO_gage_greater_sig.txt", sep = "\t")
write.table(Moran_GOgagegreatersig, file = "Moran_GO_gage_greater_sig.txt", sep = "\t")
write.table(Mullen_GOgagegreatersig, file = "Mullen_GO_gage_greater_sig.txt", sep = "\t")
write.table(GO_gagegreatersig, file = "GO_gage_greater_sig.txt", sep = "\t")
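
One detail is worth checking when filtering these result tables: cbind on mixed text and numbers produces a character matrix, so a threshold comparison on the P value column is carried out on strings unless the column is converted back to numeric. The following small base R illustration uses made-up values; the GO IDs and P values are placeholders.

```r
# A result table built with cbind: every cell becomes a character string.
res <- cbind(GOID = c("GO:0000001", "GO:0000002", "GO:0000003"),
             Padj = c(1e-05, 0.03, 0.2))
print(is.character(res))

# String comparison: "1e-05" sorts after "0.05", so the most significant
# row is silently dropped.
wrong <- res[res[, "Padj"] < 0.05, , drop = FALSE]

# Numeric comparison keeps both rows below the threshold.
right <- res[as.numeric(res[, "Padj"]) < 0.05, , drop = FALSE]
print(right)
```

The effect only appears for values printed in scientific notation, which are exactly the smallest P values.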

Table for Significant genes in pathways related to DNA damage and stress

# Create table of significant genes in pathways related to DNA damage in the Middleton study

library(hgu133a.db)

# Create mapping between Affy probes and gene names
a <- hgu133aSYMBOL
mapped_probes <- mappedkeys(a)
xx <- as.list(a[mapped_probes])

# Import relevant GO pathways (one GO ID per line) from a premade file
Middleton_GO_DNA <- read.table("~/CP2/NewStudies/Middleton_GO_DNA", quote = "\"", stringsAsFactors = FALSE)[, 1]

# Find all probes involved in a DNA damage related pathway
# (Middleton_geneset and Middleton_genesetnames come from the GAGE step)
A <- match(Middleton_GO_DNA, Middleton_genesetnames)
C1 <- unique(unlist(Middleton_geneset[A]))

# Select probes of interest (recent limma versions keep the probe IDs as row names of the topTable output)
Middleton_probes <- rownames(Middleton_top)
D1 <- which(Middleton_probes %in% C1)

# Map Affy probe ID to gene symbol (NA where no symbol is mapped)
C1a <- sapply(Middleton_probes[D1], function(p) if (is.null(xx[[p]])) NA_character_ else xx[[p]])

# Select only significant genes and save them as LaTeX table rows
Middleton_DNA_genes <- data.frame(Gene = C1a, Middleton_top[D1, c("logFC", "adj.P.Val")], stringsAsFactors = FALSE)
Middleton_DNA_genes_sig <- Middleton_DNA_genes[which(Middleton_DNA_genes[, 3] < 0.05), ]
for (i in seq_len(nrow(Middleton_DNA_genes_sig))) {
  write.table(sprintf("%s&%f&%E \\\\\\hline", Middleton_DNA_genes_sig[i, 1], Middleton_DNA_genes_sig[i, 2], Middleton_DNA_genes_sig[i, 3]),
              file = "Middletontable.txt", append = TRUE, quote = FALSE, row.names = FALSE, col.names = FALSE)
}

# Create table rows for significant genes across datasets (columns: Moran, Middleton, Mullen).
# Mullen_DNA_genes_sig and Moran_DNA_genes_sig are assumed to have been built in the same way as above.
for (g in intersect(Mullen_DNA_genes_sig[, 1], Moran_DNA_genes_sig[, 1])) {
  cat(sprintf("%s&\\Checkmark&\\XSolidBrush&\\Checkmark \\\\\\hline\n", g))
}
for (g in intersect(Middleton_DNA_genes_sig[, 1], Moran_DNA_genes_sig[, 1])) {
  cat(sprintf("%s&\\Checkmark&\\Checkmark&\\XSolidBrush\\\\\\hline\n", g))
}
for (g in setdiff(Middleton_DNA_genes_sig[, 1], Moran_DNA_genes_sig[, 1])) {
  cat(sprintf("%s&\\XSolidBrush&\\Checkmark&\\XSolidBrush\\\\\\hline\n", g))
}
for (g in setdiff(Mullen_DNA_genes_sig[, 1], Moran_DNA_genes_sig[, 1])) {
  cat(sprintf("%s&\\XSolidBrush&\\XSolidBrush&\\Checkmark \\\\\\hline\n", g))
}
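
The table rows above are emitted as LaTeX source via sprintf. The following brief sketch shows the formatting of one row, using the MBD3 values from Table 3(c); the helper name latex_row is hypothetical and does not appear in the report's code.

```r
# Format one gene as a LaTeX table row: symbol & logFC & adjusted P value.
# In an R string literal, "\\\\" produces the LaTeX row break (\\) and
# "\\hline" produces \hline.
latex_row <- function(gene, logFC, adjP) {
  sprintf("%s & %f & %E \\\\ \\hline", gene, logFC, adjP)
}

tex <- latex_row("MBD3", 0.588015, 2.692973e-02)
cat(tex, "\n")
```

The %f and %E conversions give the six-decimal fold change and the scientific-notation P value format seen in Table 3 and Appendix A.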
