Integrative Transcriptome Analysis Reveals Common Molecular Subclasses of Human Hepatocellular Carcinoma
Supplementary Information

Hoshida, et al., Integrative Transcriptome Analysis Reveals Common Molecular Subclasses of Human Hepatocellular Carcinoma

Data analysis

Preprocessing of microarray datasets: Non-HCC and replicated samples were removed based on the sample annotation attached to each dataset. To integrate the datasets generated by different microarray platforms, each probe ID was converted into a gene symbol (http://www.genenames.org/), and probes sharing a symbol were collapsed by averaging their signal intensities. For Affymetrix GeneChip datasets, only probe sets with a minimal 3-fold differential expression and absolute difference >100 units across the samples were included (after applying floor and ceiling values of 20 and 16,000 units, respectively). Gene filtering for two-channel cDNA array datasets was based on a minimal 2-fold differential expression across the samples and an absolute fold change >2. In addition, only genes having missing values in less than 20% of the samples were included. Missing values were imputed using a k-nearest neighbor algorithm (1) (ImputeMissingValues module, GenePattern).

Identification of common HCC subclasses:

i) Subclass Mapping (SubMap)

Subclass Mapping (SubMap) was used to identify corresponding subclasses between the training datasets (2). The subclasses to be mapped (candidate subclasses) were defined in each dataset, before collapsing into gene symbols, using three unsupervised clustering methods: hierarchical clustering (HC, http://biosun1.harvard.edu/complab/dchip/), non-negative matrix factorization (3) (NMF, http://www.broad.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=89) and k-means clustering (kmeans function of Matlab software, Mathworks). For the dendrogram generated by HC, subclass splitting was traced from the root, and a split was allowed if each resulting subclass contained at least 10% of the samples in the dataset.
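The HC splitting rule can be illustrated with a minimal Python sketch. It is not the authors' implementation: the nested-tuple dendrogram representation, the function names `leaves`, `chain` and `candidate_subclasses`, and the choice to recurse until no further split passes the 10% threshold are all assumptions made for illustration.

```python
from functools import reduce


def leaves(node):
    """Return the sample labels under a dendrogram node (tuple = internal node)."""
    if isinstance(node, tuple):
        left, right = node
        return leaves(left) + leaves(right)
    return [node]


def candidate_subclasses(node, n_total, min_frac=0.10):
    """Trace splits from the root; accept a split only if BOTH children hold
    at least min_frac of all samples (assumed recursive reading of the rule)."""
    if isinstance(node, tuple):
        left, right = node
        min_size = min_frac * n_total
        if len(leaves(left)) >= min_size and len(leaves(right)) >= min_size:
            return (candidate_subclasses(left, n_total, min_frac)
                    + candidate_subclasses(right, n_total, min_frac))
    return [leaves(node)]


def chain(labels):
    """Hypothetical helper: build a 'caterpillar' tree that peels off one sample per split."""
    return reduce(lambda acc, label: (acc, label), labels)


# Toy dendrogram with 20 samples: two branches of 12 and 8 samples.
tree = (chain([f"s{i}" for i in range(12)]),
        chain([f"s{i}" for i in range(12, 20)]))
clusters = candidate_subclasses(tree, n_total=20)
sizes = [len(c) for c in clusters]
print(sizes)  # [12, 8]
```

In this toy tree the root split is accepted (12 and 8 samples both exceed 10% of 20), whereas every deeper split would isolate a single sample and is therefore rejected.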
For NMF and k-means clustering, the number of clusters tested (k) was 2 to 7, and the size threshold for a candidate subclass was set at 5% of the cases in a given dataset, because these methods tend to call small distinct subclasses at smaller k.

For each clustering algorithm, candidate subclasses were mapped between datasets using the Subclass Mapping algorithm (SubMap module, GenePattern). First, over-expressed marker genes of a candidate subclass in the first dataset (A) were chosen using the signal-to-noise ratio (SNR) (4). Similarly, genes in a second dataset (B) were rank-ordered according to the extent of over-expression in a candidate subclass in B using SNR. Enrichment of the marker genes from A in the rank-ordered gene list from B was measured using the gene set enrichment score (ES) (5), and a nominal p-value was computed by the random permutation test (n=1,000). The procedure was repeated with the roles of A and B switched, yielding an enrichment p-value in the reverse direction. The two bi-directional p-values for each pair of candidate subclasses were summarized using Fisher’s inverse chi-square statistic, F (6). The significance of F was estimated based on a null distribution for F generated by randomly picking a pair of nominal p-values from the null for ES corresponding to the pair of candidate subclasses. The nominal p-values for F were adjusted by Bonferroni correction. A candidate subclass connection with Bonferroni-corrected p < 0.05 was regarded as significant (Supplementary Fig. S5).

ii) HCC subclass signatures

For the common subclasses defined by each clustering algorithm, meta-analysis marker genes were selected from the 1,950 genes in the intersection of the three training datasets. Over-expression of a gene in a subclass was measured using SNR in each dataset, and its significance was computed as a nominal p-value based on the random permutation test (n=1,000).
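The building blocks named above (SNR, a permutation-based nominal p-value, and Fisher's statistic F = -2 Σ ln p) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the GenePattern code: the SNR form (difference of class means divided by the sum of class standard deviations), the use of the sample standard deviation, the one-sided test, and the (hits + 1) / (n + 1) p-value convention are assumptions, and the expression values are invented. The sketch only computes F; it does not reproduce the empirical null distribution used to assess its significance.

```python
import math
import random
import statistics


def snr(values, in_class):
    """Signal-to-noise ratio: (mean_in - mean_out) / (sd_in + sd_out)."""
    inside = [v for v, c in zip(values, in_class) if c]
    outside = [v for v, c in zip(values, in_class) if not c]
    return (statistics.mean(inside) - statistics.mean(outside)) / (
        statistics.stdev(inside) + statistics.stdev(outside))


def permutation_p(values, in_class, n_perm=1000, seed=0):
    """One-sided nominal p-value for over-expression, obtained by shuffling labels."""
    rng = random.Random(seed)
    observed = snr(values, in_class)
    labels = list(in_class)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if snr(values, labels) >= observed:
            hits += 1
    # Assumed convention: add 1 so the p-value is never exactly 0 (needed for ln p).
    return (hits + 1) / (n_perm + 1)


def fisher_statistic(p_values):
    """Fisher's inverse chi-square statistic: F = -2 * sum(ln p_i)."""
    return -2.0 * sum(math.log(p) for p in p_values)


# Invented expression values for one gene: 4 subclass samples, 6 other samples.
values = [9.1, 8.7, 9.4, 8.9, 5.2, 4.8, 5.5, 5.1, 4.9, 5.3]
in_class = [True] * 4 + [False] * 6
p_nominal = permutation_p(values, in_class)
f_stat = fisher_statistic([p_nominal, p_nominal, p_nominal])
print(p_nominal, f_stat)
```

Reusing the same p-value three times in `fisher_statistic` only demonstrates the arithmetic; in the procedure described here, each dataset would contribute its own nominal p-value.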
The three nominal p-values for the gene computed in the three training datasets were summarized using Fisher’s inverse chi-square statistic (6). Significance of the statistic was evaluated using a null distribution generated by picking three nominal p-values for the SNR computed in the random permutation tests, and was subsequently corrected using the false discovery rate (FDR) (7). An FDR of 0.005 was used as the significance threshold. The significant genes selected in all three clustering methods were regarded as the marker genes of common HCC subclasses (HCC subclass signature).

iii) Prediction of HCC subclasses

The properties of the gene expression signal vary widely across datasets due to platform differences, lab-to-lab and day-to-day variation, etc. This makes it practically infeasible to build a model of gene expression signal trained on one particular dataset and apply it to the rest, as is done in standard machine learning prediction procedures. In addition, the meta-analysis signature is associated only with the summary statistic, F, rather than with gene expression signal. To address these problems, we designed a nearest neighbor-based method that simply assesses the direction of change of the signature, i.e., whether each of the subclass signatures is ON or OFF (NearestTemplatePrediction module, GenePattern). Briefly, a hypothetical expression pattern of each subclass (template) was defined as a vector having the same length as the full set of subclass signature genes. In the template for a subclass, values for the corresponding signature genes were set to 1 and the rest were set to 0. For each sample, a prediction was made based on the closest “template” by the cosine distance. The confidence of the prediction was evaluated as a nominal p-value based on a null distribution generated by randomly picking the same number of genes 1,000 times from the same sample’s microarray data, and corrected for multiple hypothesis testing using FDR. An FDR < 0.05 was regarded as significant.
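The template-based prediction can be sketched as follows. This is a minimal illustration and not the NearestTemplatePrediction module itself: the function names, the assumption that expression values are already standardized per gene, the choice to build the null from distances to the predicted template, and the (hits + 1) / (n + 1) convention are all assumptions, and the toy sample is invented. The FDR correction across samples is omitted.

```python
import math
import random


def cosine_distance(x, y):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norms


def predict_subclass(sample, signature, n_perm=1000, seed=0):
    """sample: gene -> standardized expression (assumed preprocessing).
    signature: subclass name -> list of marker genes.
    Returns (predicted subclass, nominal p-value)."""
    genes = [g for markers in signature.values() for g in markers]
    templates = {}
    for name, markers in signature.items():
        members = set(markers)
        # Template: 1 for this subclass's markers, 0 for all other signature genes.
        templates[name] = [1.0 if g in members else 0.0 for g in genes]

    values = [sample[g] for g in genes]
    distances = {name: cosine_distance(values, t) for name, t in templates.items()}
    label = min(distances, key=distances.get)
    observed = distances[label]

    # Null: same number of genes drawn at random from the same sample.
    rng = random.Random(seed)
    pool = list(sample.values())
    hits = 0
    for _ in range(n_perm):
        null_values = rng.sample(pool, len(genes))
        if cosine_distance(null_values, templates[label]) <= observed:
            hits += 1
    return label, (hits + 1) / (n_perm + 1)


# Invented toy sample: 200 background genes plus two 5-gene signatures.
toy_rng = random.Random(1)
sample = {f"bg{i}": toy_rng.gauss(0.0, 1.0) for i in range(200)}
signature = {"S1": [f"a{i}" for i in range(5)],
             "S2": [f"b{i}" for i in range(5)]}
sample.update({g: 2.0 for g in signature["S1"]})   # S1 markers ON
sample.update({g: -1.5 for g in signature["S2"]})  # S2 markers OFF
label, p_value = predict_subclass(sample, signature)
print(label, p_value)
```

Because only the direction of the signature genes matters for the cosine distance, a sketch like this does not depend on the absolute scale of the expression values, which is the motivation given above for avoiding a model trained on one dataset's signal.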
To ensure the significance of the prediction, we evaluated whether confident predictions could occur by chance. First, for each cohort, we generated 100 datasets by randomly picking the same number of genes as in the subclass signature 100 times. The prediction was performed on each randomly generated dataset, and we counted how many samples received a prediction with statistically significant confidence (i.e., FDR < 0.05). As a summary of the results, we computed the average proportion of samples that received a high-confidence prediction. The randomly generated datasets yielded high-confidence predictions in less than 1% of the samples, indicating that high-confidence predictions are unlikely to arise by chance.

Microarrays for fixed tissues

A panel of 6,000-gene DASL probes was designed by prioritizing the most variable genes in previously generated microarray datasets including 24 studies, 2,149 samples, and 15 tissue types. The top 6,000 transcriptionally informative genes in this set cover 70-90% of the genes in published microarray signatures and a panel of biological pathways (8). Hybridized arrays were scanned with a BeadArray Reader (Illumina). Poor-quality scans, defined as having a %P-call smaller than 65%, were eliminated as described elsewhere (8), and the scanned raw data were normalized using the cubic spline algorithm (9) (IlluminaDASL pipeline, GenePattern).

Immuno-staining

Immunohistochemical staining: Immunohistochemical staining was performed on 5 micron FFPE sections after antigen retrieval in a pressure cooker with NaCitrate buffer. Antibodies used were beta-catenin (1:2,000 dilution; BD), pAKT (1:100 dilution; Cell Signaling), and p53 (1:1,200 dilution; Immunotech), followed by detection using the Envision+ DAB system (Dako). Immunohistochemical stains for beta-catenin, pAKT, and p53 were evaluated by a pathologist blinded to the results of the gene expression profiling.
Staining was visually assessed in at least 1,000 tumor cells in the areas of greatest immunopositivity, and the results were scored in a binary system as follows. For beta-catenin, a semi-quantitative quick multiplicative method (10) was first used to assign a score reflecting both the distribution and the intensity of staining. Briefly, a numerical score for the percentage of cells showing membranous staining was assigned: score 0 = no positive cells, 1 = 1-4% of cells positive, 2 = 5-19%, 3 = 20-39%, 4 = 40-59%, 5 = 60-79%, 6 = 80-100% of cells positive. Intensity of staining was also assigned a score: 0 = no detectable staining, 1 = weak intensity, 2 = moderate intensity, 3 = strong intensity. The score for the percentage of positive cells was multiplied by the score for the intensity of staining to yield an overall score. Cases with multiplicative scores of 12 or greater were classified as positive for strong beta-catenin staining. For pAKT, cases showing definite nuclear and/or cytoplasmic positivity in any cells were considered positive and those with no staining were considered negative. For p53, cases showing nuclear positivity in ≥ 10% of cells were considered positive, whereas those with less than 10% of nuclei positive were considered negative.

Immunofluorescence staining: Cells grown on multiwell chamber slides were fixed for 20 minutes in 4% paraformaldehyde in PBS. After a PBS wash, non-specific antibody binding was blocked by incubating in 1% bovine serum albumin, 3% normal goat serum, and 0.1% Triton X-100 in PBS for one hour at room temperature. Primary antibody for beta-catenin in block solution was applied at 1:2,000 dilution for 1 hour at room temperature or overnight at 4 °C. After PBS washes, goat anti-mouse Alexa Fluor-546 conjugated IgG secondary antibody (Invitrogen/Molecular Probes) was applied at 1:1,000 dilution in block solution for one hour at room temperature.
After PBS washes, slides were coverslipped with Vectashield Mounting Medium with DAPI (Vector Labs).

HCC subclass signature in HCC mouse models

To search for dysregulated molecular pathways involved in the HCC subclasses, we tested whether the subclass signatures were induced in in vivo experimental models of HCC. For this purpose, we used a publicly available dataset of HCC transgenic mice (11) (NCBI Gene Expression Omnibus, GSE1897). The mouse gene identifiers were converted into those of human orthologous genes using a mapping table (Jackson Laboratory, http://www.jax.org/).