Materials and Methods
Supplementary Protocols
Functional annotations
Probe sequences and their annotations for the custom microarray platform employed in the analysis (Blanchette et al. 2005) were downloaded from the NCBI GEO database through accession GPL7508. Because of the turnover of the FlyBase CG identifiers that these annotations use to assign probes to the transcripts containing them, we re-annotated the transcript and gene assignments of all 42,034 probes as follows. First, we downloaded from the UCSC Genome Browser all the FlyBase and RefSeq annotations on the Dm1 version of the Drosophila melanogaster genome, which correspond, respectively, to 18,746 and 21,617 transcripts whose sequences were obtained from their genome annotation. Second, we built a non-redundant set of 25,965 FlyBase and RefSeq transcript sequences and annotations. Third, we annotated probes to those transcripts in which we found a perfect match of the probe sequence. Through the transcript annotations in the genome we also annotated the genomic position of each probe within each matched transcript, the genomic position of the functional element associated with the probe (an intron for a junction probe and an exon for an exon probe) and the genomic positions of the upstream and downstream exons for junction probes. Using this information we also annotated which probes map to multiple genomic locations. Fourth, using org.Dm.eg.db, the Bioconductor organism-level annotation package for Drosophila melanogaster, we fetched the Entrez Gene (EG) identifiers that correspond to each of the transcripts and annotated probes to the corresponding EG identifiers through the transcript(s) they probe. Using this information we also annotated which probes map to multiple Entrez Gene identifiers.
Fifth, using the pairs of probe and EG identifier and the AnnotationDbi package from Bioconductor, we built a custom annotation package that is employed throughout the rest of the analysis. It stores all the retrieved information together with other functional annotations that the Bioconductor infrastructure pulls from other databases through the EG identifiers.
We also included in this custom annotation package probe annotations of alternative splicing events, generated as follows. Using software from the UCSC Genome Browser (Kent et al. 2002) source code (http://genome.ucsc.edu/admin/cvs.html), in particular the tools txBedToGraph and txgAnalyze, and the non-redundant set of 25,965 FlyBase and RefSeq transcripts, we first generated a collection of 15,280 genome-wide alternative splicing (AS) event annotations. Next, we annotated junction probes to all those AS events whose genomic coordinates intersected those of the corresponding intron. For alternative transcription start sites and alternative polyadenylation sites, the intersection was also considered with respect to the upstream and downstream exons, since such AS event annotations overlap the 5' and 3' ends of exons, respectively.
Processing of the microarray data and non-specific filtering
Microarray data were processed using the limma package (Smyth 2004) from the Bioconductor software project (Gentleman et al. 2004). More concretely, intensities were background-corrected using the "normexp" model with maximum likelihood estimation, and an offset of k = 50 was added before taking logarithms (i.e., $M = \log_2[(R+k)/(G+k)]$) in order to shift the intensities away from 0 and stabilize the variance of the log-ratios at low intensities (Silver et al. 2009). We normalized expression values within each array using loess normalization and then normalized their scale between arrays.
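The effect of the offset on low-intensity log-ratios can be illustrated with a minimal sketch. It covers only the offset step $M = \log_2[(R+k)/(G+k)]$. The normexp background correction and the loess and scale normalizations performed by limma are not reproduced. The function name `offset_log_ratio` and the toy intensities are hypothetical and chosen for illustration only.

```python
import math


def offset_log_ratio(red, green, offset=50.0):
    """Return M = log2((R + k) / (G + k)) for background-corrected intensities.

    `red` and `green` stand for the Cy5 and Cy3 intensities of one spot,
    and `offset` is the constant k added to both channels (hypothetical names).
    """
    if red + offset <= 0 or green + offset <= 0:
        raise ValueError("intensity plus offset must be positive")
    return math.log2((red + offset) / (green + offset))


# A low-intensity spot: a 5-unit difference looks like a twofold change
# when no offset is used ...
print(offset_log_ratio(10.0, 5.0, offset=0.0))
# ... but is shrunk towards 0 once k = 50 is added to both channels.
print(offset_log_ratio(10.0, 5.0))
# At high intensity the offset has very little influence on the log-ratio.
print(offset_log_ratio(20000.0, 10000.0))
```

In this toy case the low-intensity log-ratio drops from 1 to roughly 0.13, while the high-intensity log-ratio stays close to 1. This is the damping of noisy low-intensity ratios that the offset is intended to provide.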
The initial set of 42,034 non-control probes was filtered to discard probes mapping to multiple genomic positions and probes that either were associated with more than one EG identifier or had no EG identifier, leaving a set of 41,022 probes. This set was further filtered to discard junction probes whose associated gene had no exon probes, and to discard junction and exon probes belonging to genes with fewer than 2 exon probes, leaving a final set of 40,898 probes, of which 19,611 corresponded to splice (exon-exon) junction probes and 21,287 to exon probes.
Gene-level log2 expression ratios were obtained by, first, summarizing the log2 expression values of exon probes with a common EG identifier, separately for male and female samples. This summarization was calculated using the median polish algorithm, which is also employed in the summarization step of the RMA procedure for Affymetrix expression arrays (Irizarry et al. 2003) and aims to protect against outlier probe values. Second, the summarized log2 expression values from the female samples were subtracted from those of the male samples in order to obtain a final set of 2,664 gene-level log2 expression ratios.
Finding differentially expressed genes
Let $x^{M}_{gk}$ represent the log2 expression value in replicate k from a male sample hybridized to the Cy5 channel for gene g, obtained by summarizing the background-corrected and normalized expression values of the exon probes occurring within the gene, as previously described. Let $x^{F}_{gk}$ represent the analogous value from a female sample hybridized to the Cy3 channel. The log2 ratio of expression between a male and a female fly for gene g in replicate k is thus defined as
$$y_{gk} = x^{M}_{gk} - x^{F}_{gk}.$$
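The summarization and subtraction steps described above can be sketched as follows. This is a plain implementation of Tukey's median polish on a probes-by-replicates table, written for illustration and not taken from the authors' pipeline. The function names `median_polish` and `summarize_gene`, the iteration settings and the toy values are all assumptions.

```python
from statistics import median


def median_polish(table, max_iter=10, tol=1e-6):
    """Tukey's median polish on a probes-by-replicates table of log2 values.

    Decomposes table[i][j] into overall + row_eff[i] + col_eff[j] + resid[i][j]
    and returns (overall, row_eff, col_eff, resid).
    """
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(row) for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep the row (probe) medians out of the residuals.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            resid[i] = [v - row_med[i] for v in resid[i]]
            row_eff[i] += row_med[i]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep the column (replicate) medians out of the residuals.
        col_med = [median([resid[i][j] for i in range(n_rows)])
                   for j in range(n_cols)]
        for i in range(n_rows):
            resid[i] = [resid[i][j] - col_med[j] for j in range(n_cols)]
        col_eff = [c + m for c, m in zip(col_eff, col_med)]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        # Stop once a full sweep no longer changes anything.
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid


def summarize_gene(table):
    """Per-replicate gene-level value: overall effect plus replicate effect."""
    overall, _, col_eff, _ = median_polish(table)
    return [overall + c for c in col_eff]


# Toy gene with 3 exon probes (rows) and 3 replicates (columns).
male = [[8.0, 8.5, 15.0],    # 15.0 is a deliberate outlier
        [9.0, 9.5, 10.0],
        [10.0, 10.5, 11.0]]
female = [[7.0, 7.5, 8.0],
          [8.0, 8.5, 9.0],
          [9.0, 9.5, 10.0]]

x_male = summarize_gene(male)      # the outlier does not shift the summary
x_female = summarize_gene(female)
y = [m - f for m, f in zip(x_male, x_female)]  # male minus female log2 ratios
print(x_male, x_female, y)
```

In this toy example the outlying probe value leaves the male summaries at 9.0, 9.5 and 10.0, so each replicate yields a log2 ratio of 1. A mean-based summary would have been pulled upwards in the third replicate.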
Using the limma package (Smyth 2004) from Bioconductor and the simple linear model
$$y_{gk} = \mu_g + \varepsilon_{gk},$$
where $\varepsilon_{gk}$ are the residual error terms and $\mu_g$ is the expected log2 ratio of expression between male and female flies for gene g, we obtain estimates of $\mu_g$ and their standard errors for each gene by fitting the model to the log2 ratios $y_{gk}$ of the 3 replicates. Estimates and corresponding standard errors are then employed to compute moderated t-statistics by the empirical Bayes shrinkage method (Smyth 2004) implemented in limma, which borrows information across probes and increases the effective degrees of freedom, thus improving the rate of detection of differentially expressed genes. By setting cutoff values of minimum fold-change to 2 and maximum adjusted P-value to 1%, we call in this way 270 genes as being differentially expressed between male and female flies.
Finding differentially regulated splice junctions
We call a splice (exon-exon) junction differentially regulated if its corresponding junction probe changes significantly due to distinct post-transcriptional regulatory events of splicing between male and female flies. Following an approach similar to that of Blanchette et al. (2005), we first estimate the so-called log2 ratio of net-expression, which corresponds to the log2 ratio of expression of a junction probe normalized by its corresponding gene-level log2 ratio of expression. More concretely, let $x^{M}_{jk}$ represent the log2 expression value in replicate k from a male sample hybridized to the Cy5 channel for junction j. Let $x^{F}_{jk}$ represent the analogous value from a female sample hybridized to the Cy3 channel. The log2 ratio of expression between a male and a female fly for junction j in replicate k is thus defined as
$$y_{jk} = x^{M}_{jk} - x^{F}_{jk}.$$
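Because the linear model used for the gene-level ratios contains only an intercept, its least-squares fit reduces to the replicate mean and its standard error. The following minimal sketch shows that unmoderated fit together with a fold-change filter. It computes an ordinary t-statistic, not the moderated t-statistic of limma: the empirical Bayes shrinkage across probes and the multiple-testing adjustment are deliberately left out. The function names and example values are hypothetical, and reading a fold-change cutoff of 2 as an absolute log2 ratio of at least 1 is an assumption.

```python
import math
from statistics import mean, stdev


def intercept_fit(log_ratios):
    """Fit y_k = mu + eps_k by least squares.

    Returns (mu_hat, standard_error, ordinary_t). The t-statistic is NaN
    when the replicates have zero variance.
    """
    n = len(log_ratios)
    mu_hat = mean(log_ratios)
    se = stdev(log_ratios) / math.sqrt(n)
    t_stat = mu_hat / se if se > 0 else float("nan")
    return mu_hat, se, t_stat


def passes_fold_change(mu_hat, min_fold_change=2.0):
    """Assumed reading of the cutoff: |mu_hat| >= log2(min_fold_change)."""
    return abs(mu_hat) >= math.log2(min_fold_change)


# Three replicate log2 ratios for one hypothetical gene or junction.
y = [1.2, 1.0, 1.4]
mu_hat, se, t_stat = intercept_fit(y)
print(mu_hat, se, t_stat, passes_fold_change(mu_hat))
```

With only 3 replicates the ordinary standard error is itself poorly estimated. This is the motivation for the shrinkage step used in the analysis, which stabilizes the variance estimates by pooling information across probes.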
Let g(j) denote the gene containing junction j and let $y_{g(j)k}$ be the log2 ratio of expression for gene g(j) in replicate k, that is, the same quantity as the previous $y_{gk}$ values but retrieved for the gene that contains junction j. Then, the log2 ratio of net-expression for junction j and replicate k is calculated as
$$y^{net}_{jk} = y_{jk} - y_{g(j)k} = x^{M}_{jk} - x^{F}_{jk} - \left(x^{M}_{g(j)k} - x^{F}_{g(j)k}\right),$$
which, by re-organizing terms as
$$y^{net}_{jk} = x^{M}_{jk} - x^{M}_{g(j)k} - \left(x^{F}_{jk} - x^{F}_{g(j)k}\right),$$
can be interpreted as the log2 ratio between male and female flies of the expression value of a junction normalized by its corresponding gene-level expression. However, unlike Blanchette et al. (2005) and analogously to the previously described gene expression analysis, we again fit a simple linear model
$$y^{net}_{jk} = \mu^{net}_{j} + \varepsilon_{jk},$$
to the log2 ratios of net-expression $y^{net}_{jk}$ in order to estimate the expected log2 ratio of net-expression $\mu^{net}_{j}$ for junction j and its standard error. Again, using the empirical Bayes method implemented in limma and setting cutoffs of fold-change to 2 and adjusted P-value to 1%, we call 986 junctions as being differentially regulated between male and female flies.
Analysis of the hnRNP splicing microarray data
The raw hnRNP splicing microarray data from Blanchette et al. (2009) were downloaded from GEO through accession GSE13940. Since these data were produced with the same microarray platform as that used in this work, the same pre-processing and differential analysis steps were taken, with the exception of the experimental design, which consisted of a common reference design with one control RNA source hybridized to the Cy3 channel while the different knockout experiments and replicates were hybridized to the Cy5 channel.
We also employed a slightly smaller minimum fold-change cutoff value of 1.5, due to the narrower response to a knockout as opposed to the response observed for sex-specific differences, while the maximum adjusted P-value cutoff was the same as in the sex-determination analysis (1%). This analysis yielded four lists of differentially regulated junctions, one for each hnRNP factor, which were considered for enrichment with respect to the junctions differentially regulated by sex-specific changes.
References
Blanchette M, Green RE, Brenner SE, Rio DC (2005) Global analysis of positive and negative pre-mRNA splicing regulators in Drosophila. Genes Dev 19(11): 1306-1314.
Blanchette M, Green RE, MacArthur S, Brooks AN, Brenner SE et al. (2009) Genome-wide analysis of alternative pre-mRNA splicing and RNA-binding specificities of the Drosophila hnRNP A/B family members.