American University in Cairo AUC Knowledge Fountain

Theses and Dissertations

6-2015

META-analysis of microarray data to assess gender biased differential expression in hepatic tissue

Amira Salah eldin Mahmoud ElBakry

Recommended Citation

APA Citation
ElBakry, A. S. (2015). META-analysis of microarray data to assess gender biased differential gene expression in hepatic tissue [Master’s thesis, the American University in Cairo]. AUC Knowledge Fountain. https://fount.aucegypt.edu/etds/1399

MLA Citation
ElBakry, Amira Salah eldin Mahmoud. META-analysis of microarray data to assess gender biased differential gene expression in hepatic tissue. 2015. American University in Cairo, Master's thesis. AUC Knowledge Fountain. https://fount.aucegypt.edu/etds/1399

The American University in Cairo

School of Sciences and Engineering

META-ANALYSIS OF MICROARRAY DATA TO ASSESS GENDER BIASED DIFFERENTIAL GENE EXPRESSION IN HEPATIC TISSUE

A Thesis Submitted to

The Biotechnology Graduate Program

In partial fulfillment for the pre-requisite

requirement for full admission to

the Ph.D Program in Applied Sciences

By: Amira ElBakry

Under the Supervision of: Dr. Rania Siam

Chair of the Biology Department

May/2015

ACKNOWLEDGMENTS

My unreserved gratitude goes to Dr. Rania Siam, for her unconditional support, invaluable help and constant encouragement. Thank you for making this thesis possible, and for pushing everything forward for me and with me. Working with you is a pleasure and a privilege.

Additionally, Mustafa Adel has provided much-needed help in many aspects of this thesis. Thank you for taking the time, putting in the effort, and always replying to my emails and answering my many questions.

Appreciation is directed to Dr. Ahmed Moustafa, for supporting my idea and his invaluable coaching in bioinformatics. I learned an incredible amount in a very short time.

Special thanks to Dr. Sherif El-Khamisy for inspiring this thesis; it was his idea on which I built most of the foundation for this work.

Also, acknowledgment is due to Ali El Behery, for helping me multiple times with my hopeless programming and resolving every single error. My appreciation also goes to Yasmeen El Howeedy, for taking the time to share her experience and providing valuable help.

Thank you to everyone in the biology department who made my day easier or happier, or just listened to my complaining.

Finally, my eternal gratitude goes to my family, for their endless support every step of the way.

ABSTRACT

Hepatocellular carcinoma (HCC) is the second deadliest cancer globally, and with an estimated 782,000 new cases in 2012, it is the fifth most common cancer in men and ninth in women. HCC is of particular concern in Egypt because of the high prevalence of Hepatitis C Virus (HCV). Due to its poor prognosis, HCC is the leading cause of cancer-related deaths in Egypt. A gender disparity is observed in liver cancer cases, with higher prevalence in men by three to five fold. This sex bias is even more pronounced in mouse models of HCC, in which it was found to be sex hormone-dependent. Some studies have attempted to elucidate the molecular mechanisms of this disparity, but with inconclusive and sometimes contradictory outcomes, they remain largely unresolved. Understanding the natural protective mechanisms in females would allow for the development of preventative and therapeutic strategies for patients at risk for HCC or already afflicted with the disease. In this study, we applied a meta-analysis approach to already available microarray data from human normal liver tissues to identify genes differentially expressed between males and females. Microarray datasets were downloaded from the Gene Expression Omnibus database, pre-processed using the Robust Multiarray Average method, and analyzed for differential expression. The combination of 2 distinct datasets and analysis using a p-value cut-off of 0.05 and fold change cut-off of 2 revealed male up-regulated genes including RPS4Y1, EIF1AY, CYorf15B, UTY, DDX3Y and USP9Y. Female up-regulated genes included XIST, PNPLA4 and PZP. Our results confirm gender-specific differential expression patterns found in other tissues and call for further investigation using a larger sample size and more sensitive approaches such as RNA-Sequencing and targeted -level studies.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS
1. Introduction
1.1 Gender Bias in Hepatocellular Carcinoma
1.2 Molecular Mechanisms of Gender Bias in HCC
1.3 Studying Gene Expression using Microarrays
1.4 Microarray Data Pre-processing
1.5 Differential Expression Analysis
1.6 Accounting for Batch Effects
2. Materials and Methods
2.1 Data Collection and Processing
2.2 Data Exploration and Differential Expression Analysis
2.3 Dataset Merging and Batch Effect Removal
2.4 Gene ID conversion and functional annotation
3. Results
3.1 Dataset Collection and Processing
3.2 Individual Dataset Analysis Using the T-test Method
3.3 Individual Dataset Analysis Using the Limma Package and Bayesian statistics
3.4 Analysis of Merged Datasets Using the Student’s T-test
3.5 Batch Effects Removal
3.6 Gene Signature Validation
4. Discussion
TABLES
FIGURES
REFERENCES

LIST OF TABLES

Table 1. Summary of the datasets of microarray studies using human normal liver tissue

Table 2. Sample information of three microarray datasets with gender information

Table 3. Summary of differentially expressed probes found in all three datasets and their corresponding gene names

Table 4. Summary of differentially expressed probes identified in GSE14323 using the Limma Package

Table 5. Differentially expressed probes identified in GSE23343 using the Limma Package

Table 6. Differentially expressed probes and their corresponding gene names, identified in merged datasets using the t-test method

Table 7. Differentially expressed probes and their corresponding genes in dataset GSE14323 after batch effect removal using fRMA

Table 8. Differentially expressed probes and their corresponding genes in dataset GSE23343 after batch effect removal using fRMA

Table 9. Differentially expressed probes and their corresponding genes in merged datasets after batch effect removal using fRMA or using ComBat

LIST OF FIGURES

Figure 1: Hierarchical Cluster Analysis of 3 microarray datasets

Figure 2: Differentially Expressed probes in individual datasets (t-test)

Figure 3: Differentially Expressed probes in individual datasets (Limma)

Figure 4: Differentially Expressed probes in merged datasets

Figure 5: Hierarchical Cluster Analysis of merged microarray datasets

Figure 6: Differentially Expressed probes in individual datasets after batch effect removal

Figure 7: Differentially Expressed probes in merged datasets after batch effect removal

Figure 8: Gene Signature Validation

ABBREVIATIONS

ANOVA      Analysis of variance
ComBat     Combining Batches of Gene Expression Microarray Data
DAVID      Database for Annotation, Visualization and Integrated Discovery
DDX3Y      DEAD (Asp-Glu-Ala-Asp) box polypeptide 3Y
DEN        Diethylnitrosamine
DWD        Distance-weighted discrimination
EIF1AY     Eukaryotic initiation factor 1A-Y
FC         Fold Change
fRMA       Frozen robust multichip average
GEO        Gene Expression Omnibus
HBV        Hepatitis B Virus
HCC        Hepatocellular carcinoma
HCV        Hepatitis C Virus
IL-6       Interleukin 6
KDM5D      Lysine (K)-specific demethylase 5D
LARP4B     La ribonucleoprotein domain family, member 4B
MM         Mismatch
PM         Perfect Match
PNPLA4     Patatin-like phospholipase domain containing 4
PRL        Prolactin
PRLR       Prolactin Receptor
PZP        Pregnancy Zone Protein
RMA        Robust Multichip Average
RNA-Seq    RNA Sequencing
RPS4Y1     Ribosomal protein S4, Y-linked 1
RVM        Random variance model
SAM        Significance analysis of microarrays
SVA        Surrogate variable analysis
TLR        Toll-like Receptor
TNF-α      Tumor Necrosis Factor-α
USP9Y      Ubiquitin specific peptidase 9, Y-linked
UTY        Ubiquitously transcribed tetratricopeptide repeat gene
WT         Wild Type
XIST       X (inactive)-specific transcript

1. Introduction

1.1 Gender Bias in Hepatocellular Carcinoma

Hepatocellular carcinoma (HCC) is one of the most common cancers in the world, with an estimated 782,000 new cases in 2012 1. HCC is the fifth most common cancer in men and ninth in women, and with about 746,000 deaths in 2012, it is the second deadliest cancer globally 1. Risk factors for HCC include alcoholic liver disease, infection with Hepatitis C virus (HCV) or Hepatitis B Virus (HBV), and aflatoxin exposure 2. HCC is of particular concern in Egypt because of the high prevalence of HCV. A recent study showed that more than 91% of HCC patients surveyed tested positive for HCV 3, which is in agreement with a previous report showing that almost 89% of surveyed HCC patients were HCV positive 4. Owing to its poor prognosis, HCC is the leading cause of cancer-related death in Egypt, and second in incidence to breast cancer 1.

One important aspect of HCC is a global gender disparity, with a higher prevalence in men by two to four fold, despite equal exposure to risk factors such as HCV and HBV 5. This female protection seems to be hormone-dependent, as indicated by the rise in liver cancer incidence among women after menopause 6. Furthermore, hormone replacement therapy in women is associated with lower HCC incidence 7. This sex bias is even more pronounced in mouse models of HCC, where it was also found to be sex hormone-dependent 8,9.

HCC is chemically induced in mice using a postnatal injection of diethylnitrosamine (DEN), which causes DNA damage and hepatocyte death. This triggers inflammatory responses from Kupffer cells, which further promote compensatory hepatocyte proliferation and lead to tumor formation 10,11. In experiments with DEN-induced HCC, almost all male mice develop the disease, while only 30% of female mice progress to HCC 9. This male susceptibility is significantly reduced by estrogen administration or castration. On the other hand, ovariectomy and/or testosterone supplementation significantly reduced female protection 9. This disparity has also been shown to be androgen-receptor-dependent in DEN-induced HCC mouse models 12. Mice lacking functional androgen receptors showed resistance to DEN-induced carcinogenesis, and females treated with testosterone showed a higher incidence of liver tumors than untreated females 12. In another study, transgenic mice expressing Hepatitis B Surface Antigen and/or p53 were exposed to aflatoxin in different combinations 8. All male mice with these three risk factors developed tumors, compared to only 62% of females 8. Furthermore, males in all other groups developed HCC more frequently than their female counterparts. Some studies have addressed the molecular mechanisms of this disparity in an attempt to elucidate the underlying causes of this phenomenon, but the picture is far from complete.

1.2 Molecular Mechanisms of Gender Bias in HCC

Liver cancer is strongly associated with inflammation 13, and thus studies investigating the molecular mechanisms underlying the observed gender bias in HCC have found links between sex hormones and inflammatory responses in the liver.

In a key article published in 2007, Naugler et al. found that interleukin-6 (IL-6) plays an important role in promoting DEN-induced liver carcinogenesis in mice through signaling via a Toll-like receptor (TLR) adaptor protein 14. Males had a higher level of serum IL-6 after DEN administration and a higher level of liver injury than females. IL6-/- males had lower HCC incidence than wild type (WT) males, while no difference was observed between WT and IL6-/- females. Estrogen administration to males and ovariectomized females reduced liver injury through suppression of IL-6 production. While this study identified the estrogen receptor ERα as the receptor responsible for this effect, this was later questioned by another group. Bigsby et al. found that ERα status did not affect tumorigenesis in females, and its absence in male mice resulted in fewer tumors 15. Interestingly, estradiol treatment reduced the number of tumors in WT male mice, but not in ERα-/- mice, suggesting a protective effect of ERα only for exogenously administered estradiol 15.

Another hormone, prolactin (PRL), has also been implicated as an important factor in HCC resistance in women. Hartwell et al. found that the pituitary hormone PRL (expressed more in females) restricts innate immune responses in the liver by inhibiting c-Myc activation 16. PRL was found to signal through the hepatocyte-predominant short-form prolactin receptor (PRLR-S) to inhibit IL-1β, TNF-α, and TLR-4 induced innate responses. PRL acts by ubiquitinating a group of proteins in a “Trafasome”, thereby inhibiting their downstream interactions with c-Myc activating pathways. This is in contrast to previous studies, which showed no role for PRL in HCC gender bias: tumor incidence in PRLR knockout mice was comparable to WT, and in females, ovariectomy induced tumorigenesis regardless of PRLR status 15. This difference has been attributed mainly to the biological differences between PRLR and PRL knockout mice.

In an attempt to identify potential oncogenes and their role in sex-bias occurrence on a larger scale, one group used the sleeping beauty transposon system to screen for potential oncogenes in mice. They found that transposon insertions in the epidermal growth factor receptor (EGFR) gene were more common in tumors from male mice than female mice 17. Additionally, gene expression analysis in human liver samples revealed differences in male and female liver tissues predominantly pertaining to inflammation, carcinogenesis and reproduction 16. Together, these results indicate a role for sex hormones, pituitary hormones and the innate immune system in the gender disparity observed in HCC, the details of which remain largely uncharacterized.

The paradoxical results obtained by different groups highlight important points of consideration while studying this aspect of HCC. First, the use of the DEN-induced HCC model represents a point of variation, as the treatment regimen varies in dose, age of mice at administration, and time of sample collection post DEN administration. Another factor to be considered is the genetic background of the mice, which, as demonstrated by Bigsby et al., affects the number of tumors obtained by a certain treatment and the sex-dependent response 15. For example, ovariectomy did not affect tumorigenesis in C57Bl/6J females, but showed an effect in the mixed background strain 129Ola-X-C57Bl/6J. Additionally, HCC in humans is largely associated with other external factors 2 (such as alcohol, viral infection, chemical exposure), which are not represented in animal models. Therefore, it is evident that while mice provide an invaluable experimental tool, they do not accurately reflect the pathogenesis of HCC in humans, nor the sex bias observed. A valuable source of information would be clinical samples from HCC patients and human liver tissues to provide a more accurate and reliable

1.3 Studying Gene Expression using Microarrays

One way to extract valuable information from clinical data is to utilize already available genome-wide microarray data from human liver tissues to identify genes with potential roles in the sex bias observed in HCC. High density oligonucleotide arrays represent a very useful tool for studying gene expression as they allow the quantitative detection of mRNA. Microarray chips, such as the Affymetrix GeneChip, use tens of thousands of synthetic oligonucleotides (25 bases long) that hybridize to specific sequences of target genes 18,19. Commercially available platforms allowed the widespread use of microarrays for high throughput transcriptome studies, and the accumulation of large amounts of gene expression data. Microarray studies are often used to study disease pathogenesis or treatment effectiveness by comparing gene expression in normal and diseased tissues, treated and untreated samples or multiple stages of one disease, among many other alternatives. These approaches allow the large-scale identification of genes that are up-regulated and down-regulated in experiment versus control, allowing subsequent downstream functional analyses. For example, microarray studies have been used to identify gene expression signatures in cancers of the colon 20, prostate 21, breast 22, liver 23,24 and lung 25 to identify diagnostic and prognostic biomarkers and therapeutic targets. Public databases, such as the NCBI Gene Expression Omnibus (GEO) 26,27 and ArrayExpress 28, allow access to a large number of microarray datasets.

Although various gene expression studies addressed different aspects of HCC pathogenesis, only one study attempted to dissect the molecular basis of the observed gender bias. Hartwell et al. carried out gene expression analysis on 7 male and 7 female human liver samples to identify a gene signature that could be protective of females as compared to males 16. Their findings distinguished about 500 genes that are differentially expressed between the 2 groups, many of which have functions in cell cycle, inflammation and cancer 16.

Our aim in the current study is to build on these results by using a larger sample size to define a more specific signature, which can be followed in patients that develop HCC. One approach to achieve this is to use data from multiple studies that are already published. However, lack of correlation between platforms and experimental variations among labs do not allow a direct comparison between heterogeneous studies 29,30. One way to overcome this is to carry out a meta-analysis of the data. Meta-analysis is a set of statistical techniques that allow the combination of data from independent, but relevant, studies 31. Combining data from various studies using a meta-analysis approach increases the statistical power to allow the detection of small, but consistent changes that are otherwise missed in single-study analyses 32. Furthermore, statistical approaches directed to overcome heterogeneity among the different datasets increase the sensitivity and reproducibility of the results, when compared to individual studies 33. Another advantage of meta-analysis is its relatively low cost, as the data are already available and many analysis tools are open-source.

The success of this approach is evident from the increasing number of researchers employing it to generate novel information from already existing data, such as common transcriptional profiles or gene signatures of certain cancers, and even new prognostic biomarkers 34–37. Rhodes et al. used meta-analysis on 40 independent microarray cancer studies and identified a “meta-signature” of cancer transformation and progression 35. In a more clinically-driven approach, Mehra et al. combined 417 samples and identified GATA3 as a prognostic marker for breast cancer 34. Also, Ewald et al. used patient-derived microarray data to identify pathways involved in the stage-wise progression of bladder cancer from papillary to muscle-invasive tumors 38. In addition, efforts to compare and unify meta-analysis approaches and workflows have been emerging to enhance the applicability of such studies and improve their outcome 31,39,40. Overall, meta-analysis is a powerful approach to use already-available data to answer novel questions, which is what we aim to accomplish in this study. However, as explained below, careful consideration must be given to using public microarray datasets for differential expression analysis, owing to the heterogeneity of their experimental design, the platforms used, and the data processing methods.

1.4 Microarray Data Pre-processing

The output of microarray experiments, i.e. raw data, has to be pre-processed in order to reach meaningful, informative values. Affymetrix microarray chips contain around 11-20 probes for each gene, arranged in pairs known as “probe pairs”. Probe pairs consist of perfect match (PM) probes, which match the target gene completely, and mismatch (MM) probes, which have the same sequences as PM probes, but with a mismatched base in the middle (13th base) to account for non-specific binding 18. Therefore, in order to have a single expression value for each gene, the data from all probes in a probe set need to be summarized into one expression value. In addition, normalization of the data is crucial to account for obscuring variations that occur during sample preparation, handling, production and processing of the arrays.

One of the most widely-used methods for pre-processing of Affymetrix microarray data is the Robust Multichip Average (RMA) method. The RMA expression measure is generated by background correction of probe-level PM values, followed by quantile normalization of the data and then linear fitting using the median polish method, to generate log expression values 41. The quantile normalization approach utilizes information from all arrays with the goal of making the distribution of probe intensities the same for all arrays in a particular set analyzed 42. Therefore, it is important to process all arrays from an experiment together as one batch when using RMA.
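The summarization step of RMA can be illustrated with a small sketch of Tukey's median polish. The thesis performed pre-processing in R; the Python below is only an illustration, and the function name `median_polish`, the iteration limit and the toy probe set are assumptions made for this example. Rows are probes, columns are arrays, and the per-array expression estimate is taken as the overall value plus the column (array) effect.

```python
from statistics import median

def median_polish(matrix, max_iter=10, tol=1e-6):
    """Tukey median polish on a probes x arrays matrix of log2 intensities.

    Returns (overall, row_effects, col_effects). Hypothetical helper,
    not the implementation used by the affy package.
    """
    resid = [list(row) for row in matrix]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep out row (probe) medians.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            resid[i] = [x - row_med[i] for x in resid[i]]
            row_eff[i] += row_med[i]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep out column (array) medians.
        col_med = [median(resid[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for i in range(n_rows):
            resid[i] = [resid[i][j] - col_med[j] for j in range(n_cols)]
        col_eff = [c + m for c, m in zip(col_eff, col_med)]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        # Stop once a full sweep no longer changes anything.
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff

# Toy probe set: 3 probes (rows) measured on 3 arrays (columns).
probe_set = [[7, 9, 11],
             [8, 10, 12],
             [9, 11, 13]]
overall, probe_effects, array_effects = median_polish(probe_set)
expression = [overall + c for c in array_effects]  # one value per array
```

Because medians rather than means are swept out, a single outlying probe has little influence on the summarized value, which is the robustness the method's name refers to.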

When compared to other normalization methods, such as non-linear and scaling methods (used in the Affymetrix algorithms), quantile normalization was most successful at reducing variance and produced the smallest distances between arrays in pairwise comparisons, which remained constant across intensities 42. It also performed better in terms of bias and speed. This improvement in performance could be attributed to using all arrays in the normalization process, rather than a single baseline, which is less representative of the complete data 42.
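A minimal sketch of quantile normalization may help make the idea concrete. It is written in Python for illustration only (the thesis used R), and `quantile_normalize` and the toy intensities are hypothetical. Each array is sorted, the values at each rank are averaged across arrays to form a reference distribution, and every array then receives the reference value corresponding to the rank of each of its probes. Ties are broken by position here rather than averaged, which is a simplification.

```python
def quantile_normalize(arrays):
    """Force every array to share one empirical intensity distribution.

    arrays: list of equal-length lists, one list of probe intensities
    per array. Hypothetical helper for illustration only.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays)
                 for k in range(n_probes)]
    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized

raw = [[5.0, 2.0, 3.0],   # array 1
       [4.0, 1.0, 6.0]]   # array 2
norm = quantile_normalize(raw)
```

Since the reference distribution depends on every array in the set, adding or removing an array changes all normalized values, which is why the arrays of one experiment are processed together.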

Additionally, algorithms such as the Affymetrix MAS 5.0 algorithm and dChip software developed by Li and Wong rely on subtracting the MM signal to correct for non-specific binding 43. However, RMA does not use the MM probes for subtraction, and relies only on PM values, as studies have shown that MM probes detect signal in addition to non-specific binding, and that methods utilizing PM-MM or PM/MM values add noise and result in a biased signal, respectively 41,44. In a series of spike-in experiments, RMA was compared to other expression measures, such as Li and Wong’s model-based expression indexes (MBEI) and the MAS 5.0 signal, and performed better on 3 criteria: (1) precision, as estimated by the gene-specific standard deviations (SD) across replicates; (2) consistency of estimates for fold change; and (3) the specificity and sensitivity when using fold change to detect differential expression 41,45.

1.5 Differential Expression Analysis

The simplest way to identify differentially expressed genes is using the Fold Change (FC) indicator, which evaluates the average log ratio between 2 conditions or groups and considers genes with FC above a certain (arbitrary) threshold differentially expressed. For example, if the FC threshold is set at 2, genes that are found to be more abundant or less abundant under one condition than under the other by more than 2 fold are considered to be differentially expressed. However, FC is not a statistical test; it does not provide any level of confidence and does not account for variance across samples. Therefore, it is important to employ a statistical method to account for this variance and to standardize differential expression.
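As a small illustration of the fold change criterion, the sketch below works on log2-scale expression values, where a 2-fold change corresponds to an absolute difference of 1 between group means. It is written in Python for illustration only; the function names and the expression values are hypothetical and are not data from this study.

```python
from math import log2
from statistics import mean

def log2_fold_change(group_a, group_b):
    """Difference of mean log2 expression, group A minus group B."""
    return mean(group_a) - mean(group_b)

def passes_fold_change(log2_fc, fold_threshold=2.0):
    """True if the change is at least fold_threshold in either direction."""
    return abs(log2_fc) >= log2(fold_threshold)

# Hypothetical log2 expression values for one probe.
male = [9.0, 8.5, 9.5]
female = [7.0, 6.5, 7.5]

lfc = log2_fold_change(male, female)   # difference on the log2 scale
fold = 2 ** abs(lfc)                   # back-transformed fold change
selected = passes_fold_change(lfc)
```

Note that nothing in this calculation uses the spread of the values within each group, which is the limitation described above.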

A simple and popular method to employ is a standard t-test, which has been widely used to detect differentially expressed genes in microarray studies 46. The t-test is conducted for each gene and the error variance is estimated based on the log ratios. For each gene, a t-statistic is computed and then converted to a p-value. Typically, genes with p-values falling below a specific threshold are considered significant. Despite the popular application of the t-test to microarray data analysis, this approach has its drawbacks, particularly with low-variance genes and small sample sizes, where bias is introduced and statistical power is compromised, respectively 47,48. In response to these criticisms, numerous statistical approaches have been developed to address different areas of concern, in attempts to improve variance estimates, accuracy and statistical power. The abundance and variety of these methodologies make it difficult to decide which method is most effective in analyzing gene expression data and which is most appropriate for specific experimental settings.
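The low-variance problem can be seen in a short sketch of the pooled-variance (Student's) t-statistic. This is an illustrative Python example, not the code used in this study; `student_t` and the expression values are hypothetical, and the conversion of the statistic to a p-value (which requires the t distribution) is left out.

```python
from math import sqrt
from statistics import mean, variance

def student_t(a, b):
    """Two-sample Student's t-statistic with pooled variance.

    Returns (t, degrees_of_freedom). Hypothetical helper.
    """
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2

# A probe with a large difference between groups and ordinary variance.
t_big, df = student_t([9.0, 8.5, 9.5], [7.0, 6.5, 7.5])

# A probe with a negligible difference (about 1.07-fold on the log2 scale)
# but an extremely small within-group variance.
t_low_var, _ = student_t([8.100, 8.101, 8.102], [8.000, 8.001, 8.002])
```

In this toy case the second probe receives a far larger t-statistic than the first, even though its change is biologically negligible, because its variance estimate happens to be tiny. With only a few samples per group such variance estimates are unstable, which is the motivation for the moderated approaches discussed next.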

Fortunately, studies were designed to specifically address this issue, and some have utilized data in which the differentially expressed genes are already known (for example, using spike-in experiments) to evaluate the performance of statistical tests 47–50. For example, one study has compared the performance of 8 statistical methods of variance modeling in gene expression data analysis: Welch's t-test, analysis of variance (ANOVA), Wilcoxon's test, significance analysis of microarrays (SAM), random variance model (RVM), Limma, VarMixt and SMVar 47. Using Limma resulted in the best overall performance compared to the other 7 methods, where the most improvement was achieved compared to the t-test, especially with small sample size, and in terms of ease of use and speed of execution.

Limma is a popular method used for microarray data analysis, which has recently been adapted for RNA-sequencing (RNA-Seq) data as well. The principal idea in Limma is the use of gene-wise linear model fitting to analyze an entire experiment as a whole 51. A linear model is fitted to each row (gene), generating regression coefficients and standard errors for the compared groups. The parallel nature of gene expression experiments motivates the use of empirical Bayes statistics, which allow sharing of information between genes to account for variances across genes and samples and to obtain Bayes posterior variance estimators 51. The estimated sample variances are squeezed towards a common variance, leading to more stable inferences from small sample sizes 52. In addition, the use of gene-wise linear modeling allows flexibility in handling different experimental designs. In testing for differential expression, empirical Bayes statistics, such as moderated t-statistics and p-values, are generated for each coefficient of the linear model in order to assess the significance of these changes. Limma is able to reduce the false positive rate in genes with low variance and improve the power for detecting genes with large variances. The use of empirical Bayes statistics in Limma has been found to compare favorably with other significance testing methods in gene expression analysis, particularly with small sample sizes 49,50.
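The variance "squeezing" can be sketched as a weighted average of the gene's own sample variance and a prior variance, weighted by their degrees of freedom, which is the form of the posterior variance in the empirical Bayes model underlying Limma. The Python below is an illustration only: in Limma the prior variance and prior degrees of freedom are estimated from all genes, whereas here they are supplied as hypothetical inputs, and the function name and numbers are assumptions.

```python
from math import sqrt

def moderated_t(coef, s2, df, unscaled_sd, s2_prior, df_prior):
    """Moderated t-statistic from a shrunken (posterior) variance.

    coef        estimated log2 fold change for the gene
    s2, df      the gene's residual variance and its degrees of freedom
    unscaled_sd standard error of coef divided by the residual SD
    s2_prior, df_prior  prior variance and prior degrees of freedom
                        (hypothetical values here, not estimated)
    Returns (t, posterior_variance, total_degrees_of_freedom).
    """
    s2_post = (df_prior * s2_prior + df * s2) / (df_prior + df)
    t = coef / (unscaled_sd * sqrt(s2_post))
    return t, s2_post, df_prior + df

# A low-variance gene: 3 vs 3 samples, log2 difference of 0.1.
t_mod, s2_post, df_total = moderated_t(
    coef=0.1,
    s2=1e-6, df=4,
    unscaled_sd=sqrt(1 / 3 + 1 / 3),
    s2_prior=0.05, df_prior=4,
)
```

With these illustrative inputs the tiny sample variance is pulled towards the prior value, so the moderated statistic is small, and the added prior degrees of freedom are what provide extra power when sample sizes are small.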

1.6 Accounting for Batch Effects

Another important source of variation in microarray data that needs to be addressed is the variation introduced by non-biological factors that cause differences between samples, known as the “batch effect”. Batch effects are introduced when samples are run in different “batches”, which could be different runs, different days, using different reagents or different technicians 53–55. Batch effects can affect the downstream processing of the data, resulting in lower power to detect real changes, or in the more serious consequence of false or misleading biological conclusions. Although batch effects have been demonstrated in early microarray experiments 56, surprisingly, they have been reported in only a small percentage of studies. For example, one study reported that out of 219 papers published using microarray data, less than 10% addressed batch effects 53. Moreover, upon examining data from 9 high throughput studies, Leek et al. found all of them had considerable batch effects, with substantial percentages of measured features affected (32.1-99.5%). While batch effects can be minimized through careful experimental design, the only way to avoid them completely is by running samples as one batch. Since this is rarely feasible, several methods for batch effect removal have been developed, including Distance-weighted discrimination (DWD) 57, mean-centering (PAMR) 58, Surrogate variable analysis (SVA) 59, Geometric ratio-based method (Ratio_G) 60 and Combating Batch Effects When Combining Batches of Gene Expression Microarray Data (ComBat) 54. When these 5 methods were compared and their performance assessed based on accuracy, precision and variation reduction (batch effect removal), ComBat showed the best overall performance. In addition, ComBat was robust in handling small batches, with which other methods did not perform well 53. This is attributable to the empirical Bayes framework employed in the ComBat approach, which estimates location and scale adjustment parameters for each gene while borrowing information across genes to “shrink” the batch effect parameter estimates towards an overall mean of the estimates 54. These estimates are then used to adjust the data for batch effects, providing robust adjustment for small batches (<10) 54.
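To illustrate what a location and scale adjustment does, the sketch below standardizes one gene within each batch and then rescales it to the grand mean and the pooled within-batch standard deviation. This is deliberately a simplified stand-in and not ComBat itself: it omits the empirical Bayes shrinkage across genes and does not protect biological covariates such as sex, so it would only be reasonable when groups are balanced across batches. It is written in Python for illustration, and `adjust_gene` and the values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def adjust_gene(values, batches):
    """Per-gene location/scale batch adjustment (no shrinkage).

    values:  expression values of one gene across samples
    batches: batch label of each sample, same order as values
    """
    grand_mean = mean(values)
    groups = {}
    for v, b in zip(values, batches):
        groups.setdefault(b, []).append(v)
    # Location (mean) and scale (SD) estimated separately in each batch.
    stats = {b: (mean(g), stdev(g)) for b, g in groups.items()}
    # Pooled within-batch SD used as the common target scale.
    num = sum((len(g) - 1) * stats[b][1] ** 2 for b, g in groups.items())
    den = sum(len(g) - 1 for g in groups.values())
    pooled_sd = sqrt(num / den)
    adjusted = []
    for v, b in zip(values, batches):
        batch_mean, batch_sd = stats[b]
        z = (v - batch_mean) / (batch_sd or 1.0)  # guard against zero SD
        adjusted.append(z * pooled_sd + grand_mean)
    return adjusted

values = [1.0, 2.0, 3.0, 11.0, 13.0, 15.0]
batches = ["A", "A", "A", "B", "B", "B"]
adjusted = adjust_gene(values, batches)
```

With only three samples per batch, the per-batch mean and standard deviation used here are noisy estimates. ComBat's shrinkage of these estimates towards values pooled across genes is what makes it more robust for small batches.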

Although normalization of microarray data does not account for batch effects, some modifications of existing methods attempt to address these effects in batches of datasets or single arrays that can be analyzed individually or before combining them with others. The frozen robust multichip average (fRMA) extends the advantages of multichip processing to single-array analysis by using a large dataset of representative samples to create a reference distribution for the subsequent quantile normalization. Pre-computed estimates of probe effects are used in concert with data from the set being analyzed to generate the summary expression values. fRMA was found to perform similarly to RMA when the data were pre-processed as one batch, but outperformed RMA in terms of precision when analyzing multiple batches.

Using the above tools for analyzing gene expression data from human liver tissues, we aim to identify differentially expressed genes between males and females that could be responsible for the inherent HCC resistance in females, which would be a starting point for subsequent in silico analysis and in vitro and in vivo experiments.

2. Materials and Methods

2.1 Data Collection and Processing

The NCBI GEO database (http://www.ncbi.nlm.nih.gov/geo/) was searched for “human liver”, with the search filters set as follows: “Organism” set to “Homo sapiens”, “Data type” set to “expression profiling by array”, and “Attribute name” set to “tissue”, to satisfy our inclusion criteria for whole-transcriptome studies using human tissues from normal livers. In order to unify gene IDs and normalization, we only selected data produced using Affymetrix chips. Studies using cell lines, animals, non-coding RNA profiling, or different platforms were excluded. For the selected datasets, raw (.cel) files were downloaded and pre-processed using the RMA method to generate expression values 41,45, which are summary measures of background-corrected, normalized, log-transformed probe intensities. RMA was applied using the “justRMA()” function in the Bioconductor affy package 61. All statistical analyses in this study were implemented in the R statistical computing environment (http://www.r-project.org/), and all packages are available from the open source Bioconductor project 62,63 (http://www.bioconductor.org/about/).

2.2 Data Exploration and Differential Expression Analysis

As an initial exploratory step, hierarchical cluster analysis was carried out using the “hclust” function in R, with the default method of complete linkage, which defines the distance between 2 clusters as the maximum distance between their components. Distance matrices were constructed using the Pearson distances between columns (1 - Pearson’s correlation coefficient), and the analysis was displayed as a dendrogram.
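
The two ingredients described above, the Pearson distance and complete linkage, can be sketched in plain Python as follows. This is an illustration of the technique rather than the R “hclust” code used in this study; the function names and the three toy samples are hypothetical.

```python
from math import sqrt


def pearson_distance(x, y):
    """Return 1 - Pearson's correlation coefficient between two vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sd_x * sd_y)


def complete_linkage(cluster_a, cluster_b, samples):
    """Distance between two clusters: the maximum pairwise distance."""
    return max(
        pearson_distance(samples[i], samples[j])
        for i in cluster_a
        for j in cluster_b
    )


def agglomerate(samples):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[name] for name in samples]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = complete_linkage(clusters[i], clusters[j], samples)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical expression profiles (one vector of probe values per sample).
samples = {
    "s1": [1.0, 2.0, 3.0, 4.0],
    "s2": [2.0, 4.0, 6.0, 8.5],
    "s3": [4.0, 3.0, 2.0, 1.0],
}
merges = agglomerate(samples)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

The merge heights recorded in the history correspond to the heights at which branches join in a dendrogram.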

To identify differentially expressed genes between male and female groups, we used 2 methods: a standard t-test between the 2 groups and the Limma Bioconductor package, which implements linear modeling and Bayesian statistics51. Using the t-test, p-values were calculated for each probe, and a cut-off of 0.05 was set so that only probes with p-values below this limit were selected. The absolute fold change cut-off was set to 2, so that only genes up- or down-regulated by at least 2-fold were selected. Probes that met both of these criteria were retained for further downstream analysis.
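
A minimal Python sketch of the two-criterion filter is given below. It assumes that the expression values are on the log2 scale (as RMA output is), so that a 2-fold change corresponds to an absolute difference of 1 between group means. The p-values are taken as given inputs rather than computed, and the probe names and numbers are hypothetical.

```python
from math import log2
from statistics import mean


def log2_fold_change(group_a, group_b):
    """Difference of group means; a log2 ratio because the inputs are log2 values."""
    return mean(group_a) - mean(group_b)


def is_selected(p_value, log2_fc, p_cutoff=0.05, fold_cutoff=2.0):
    """True when the probe passes both the p-value and the fold change cut-off."""
    return p_value < p_cutoff and abs(log2_fc) >= log2(fold_cutoff)


# Hypothetical log2 expression values and p-values for two probes.
probes = {
    "probe_A": {"male": [9.0, 9.4, 9.2], "female": [5.1, 5.0, 4.9], "p": 0.001},
    "probe_B": {"male": [7.0, 7.2, 7.1], "female": [6.8, 6.9, 6.7], "p": 0.010},
}
selected = [
    name
    for name, d in probes.items()
    if is_selected(d["p"], log2_fold_change(d["male"], d["female"]))
]
print(selected)
```

In this toy example probe_B has a small p-value but changes by less than 2-fold, so only probe_A is retained.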

For the Limma package, the “lmFit” function was used to estimate fold changes and standard errors by fitting a linear model, specified by the design matrix, for each gene52. The fitted model was then processed using the “eBayes” function to apply empirical Bayes statistical methods and generate statistics such as the moderated t-statistic and its associated p-value, adjusted p-values for multiple testing, and average log expression values52. The selection cut-off for differentially expressed genes was set to an adjusted p-value of 0.05 (using Holm’s sequential Bonferroni multiple testing correction method)64. This correction accounts for the errors introduced by testing multiple hypotheses and so minimizes false positives.
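
Holm’s sequential Bonferroni adjustment can be sketched in a few lines of Python. This is a stand-alone illustration with hypothetical p-values, not the R code used in this study: the smallest p-value is multiplied by the number of tests m, the next by m - 1, and so on, with the adjusted values forced to be non-decreasing and capped at 1.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for four probes.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = holm_adjust(raw)
significant = [i for i, p in enumerate(adjusted) if p < 0.05]
print(adjusted, significant)
```

All four raw p-values are below 0.05, but only two remain below the cut-off after adjustment, which shows why the adjusted cut-off is the stricter criterion.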

2.3 Dataset Merging and Batch Effect Removal

The datasets were combined using the inSilicoMerging Bioconductor package65, which allows the merging of different datasets and the use of different methods for batch effect removal. Using this package, we applied the ComBat approach, which utilizes empirical Bayes statistics to correct for variations between arrays due to the use of different methodologies or to data being generated in different laboratories, as is the case in this study54.
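
To make the notion of a batch adjustment concrete, the following Python sketch applies simple per-batch mean centering to one probe. This is deliberately not ComBat: it is a much simpler, location-only adjustment without the empirical Bayes estimation, and the function name and values are hypothetical.

```python
from statistics import mean


def center_batches(values, batches):
    """Shift each batch so that its mean equals the overall mean."""
    grand_mean = mean(values)
    batch_means = {
        b: mean([v for v, vb in zip(values, batches) if vb == b])
        for b in set(batches)
    }
    return [v - batch_means[b] + grand_mean for v, b in zip(values, batches)]


# Hypothetical log2 values of one probe measured in two laboratories.
values = [5.0, 6.0, 7.0, 9.0, 10.0, 11.0]
batches = ["lab1", "lab1", "lab1", "lab2", "lab2", "lab2"]
adjusted = center_batches(values, batches)
print(adjusted)
```

After centering, the systematic offset between the two laboratories is removed while the differences between samples within each laboratory are preserved.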

Another method employed for batch effect adjustment was fRMA, which allows pre-processing of individual datasets by using pre-computed probe effects and variances during the normalization process66. In this way, individually fRMA-processed datasets can be combined without the need for further batch effect removal. Datasets were pre-processed using the “frma” function in the frma Bioconductor package, then combined using the inSilicoMerging package, and then analyzed for differentially expressed probes using the Limma package.

2.4 Gene ID Conversion and Functional Annotation

To convert probe IDs to gene names and identify the location and function of these genes, the Database for Annotation, Visualization and Integrated Discovery (DAVID) version 6.7 (http://david.abcc.ncifcrf.gov/) was used67,68. Each list of probes was pasted into the Gene List Manager and analyzed either with the Gene ID Conversion tool, to identify the genes corresponding to the selected probes, or with the Gene Functional Classification tool, to group genes into functional groups.

3. Results

3.1 Dataset Collection and Processing

The GEO search yielded 739 studies in January 2015, of which 7 studies that provided expression data for normal human liver were selected (summarized in Table 1). All selected datasets were generated using the Affymetrix U133 Plus 2.0 Array, except GSE14323, which was generated using the Affymetrix GeneChip Human Genome U133A 2.0 Array. Of those, only 2 (GSE23343⁶⁹ and GSE14951⁷⁰) had gender information readily available in the database. We contacted the principal author of each study, and only the author of GSE14323 provided the missing gender data71. Consequently, we had a total of 3 studies qualifying for the subsequent analysis. Collectively, the datasets included 27 male and 19 female normal liver samples, adding up to 46 samples in total (summarized in Table 2). As the data provided in the expression matrices were processed differently by each group, raw .cel files were downloaded and pre-processed using the same RMA method45.

3.2 Individual Dataset Analysis Using the T-test Method

For each dataset, hierarchical clustering analysis was performed using the complete linkage method72 and plotted as dendrograms (Figure 1). Hierarchical clustering did not generate a distinct gender cluster in any of the three datasets (data not shown). Each dataset was then divided into 2 groups of male and female samples, which were compared using a t-test statistic and fold change (FC). Cut-offs of 0.05 for the p-value and 2 for FC were used to filter the probes that were differentially expressed between the 2 groups. The data are represented as heat maps depicting up-regulated and down-regulated probes in each dataset (Figure 2). For dataset GSE14323, a total of 6 probes were differentially expressed, and in GSE23343 a total of 19 differentially expressed probes were identified. Finally, dataset GSE14951 showed only 2 differentially expressed probes, which did not produce complete segregation of the 2 groups. The DAVID67,68 tool was used to map the probe IDs to gene names. Table 3 summarizes the differentially expressed probes in all datasets, highlighting their overlapping probes and their corresponding genes. Since multiple probes are used to monitor the same gene, the total number of genes is sometimes smaller than the number of probes. For GSE14323, 6 probes corresponded to 5 genes, including genes involved in spermatogenesis, X-inactivation and protein biosynthesis. For GSE23343, 19 probes corresponded to only 11 genes, including genes involved in hexose metabolism, histone demethylation and protein deubiquitination. Only one gene was found to be differentially expressed in all three datasets: ribosomal protein S4, Y-linked 1 (RPS4Y1), a Y-linked gene.
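
The difference between probe counts and gene counts can be illustrated with the 6 probes reported for GSE14323. The Python sketch below groups probes by gene name; the probe-to-gene assignments are our reading of the GSE14323 column of Table 3, and the variable names are illustrative.

```python
# Probe-to-gene mapping for GSE14323, as read from Table 3.
probe_to_gene = {
    "201909_at": "Ribosomal protein S4, Y-linked 1",
    "203649_s_at": "Phospholipase A2, group IIA",
    "204409_s_at": "Eukaryotic translation initiation factor 1A, Y-linked",
    "205000_at": "DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked",
    "214218_s_at": "X (inactive)-specific transcript",
    "221728_x_at": "X (inactive)-specific transcript",
}

# Group the probes that monitor the same gene.
gene_to_probes = {}
for probe, gene in probe_to_gene.items():
    gene_to_probes.setdefault(gene, []).append(probe)

print(len(probe_to_gene), "probes map to", len(gene_to_probes), "genes")
```

Two probes monitor the X (inactive)-specific transcript, which is why 6 probes collapse to 5 genes.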

Most of the genes found were either Y-linked or X-linked, with only 3 out of 20 found on autosomes. Additionally, none of the genes in any dataset clustered into functional groups when using the DAVID gene functional classification tool. Finally, because of the apparent heterogeneity of its sample pool, GSE14951 was excluded from further analysis.

3.3 Individual Dataset Analysis Using the Limma Package and Bayesian Statistics

The Limma Bioconductor package51 was used to further analyze the data as an alternative method to determine differentially expressed genes between the 2 groups in 2 datasets: GSE14323 and GSE23343 (figure 3). For GSE14323, 9 probes were found to be differentially expressed, which included 5 of the 6 probes found by the t-test method. The 4 “new” probes had already been identified for the other dataset (GSE23343) using the t-test. On the other hand, only 8 probes were identified for GSE23343, and again, considerable overlap was observed, as 7 out of the 8 probes were also among the 19 probes identified by the t-test method (data summarized in tables 4 and 5). Using the Limma package added only one new gene, La ribonucleoprotein domain family, member 4B (LARP4B), found on chromosome 10.
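
The overlap reported for GSE14323 can be checked with simple set operations. In the Python sketch below, the assignment of probes to the Limma and t-test columns is our reading of Table 4, and the variable names are illustrative.

```python
# Probes identified in GSE14323 by each method, as read from Table 4.
t_test_probes = {
    "201909_at", "203649_s_at", "204409_s_at",
    "205000_at", "214218_s_at", "221728_x_at",
}
limma_probes = {
    "201909_at", "204409_s_at", "204410_at", "205000_at", "205001_s_at",
    "206700_s_at", "214131_at", "214218_s_at", "221728_x_at",
}

shared = t_test_probes & limma_probes      # found by both methods
limma_only = limma_probes - t_test_probes  # the "new" probes
t_test_only = t_test_probes - limma_probes

print(len(shared), sorted(limma_only), sorted(t_test_only))
```

The counts agree with the text: 5 shared probes, 4 probes found only by Limma, and 1 probe found only by the t-test.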

3.4 Analysis of Merged Datasets Using the t-test

To test if combining these 2 datasets would provide additional information, or allow for the detection of changes that were not previously detected, the 2 datasets were merged and a t-test was used to identify significantly up- or down-regulated probes (figure 4). Ten probes were identified as differentially expressed, corresponding to 7 genes. All of these probes had been identified in the individual dataset analyses, and the genes they correspond to are Y- or X-linked, with the exception of just one (summarized in table 6). A hierarchical cluster analysis of the merged datasets resulted in the clustering of the 2 datasets into separate groups (figure 5), regardless of the gender of the sample.

3.5 Batch Effects Removal

To correct for differences between data generated in different laboratories, or at distinct time periods, batch effect removal tools were used. Two methods were employed to correct for batch effects: either fRMA normalization66 prior to merging the datasets, or the ComBat54 tool after merging the datasets. In both cases, the Limma package51 was employed to identify differentially expressed genes. For GSE14323, 12 probes were identified, corresponding to 9 genes (figure 6A). Three new genes for this dataset were identified after batch effect removal: patatin-like phospholipase domain containing 4 (PNPLA4), ubiquitously transcribed tetratricopeptide repeat gene, Y-linked (UTY), and ubiquitin specific peptidase 9, Y-linked (USP9Y) (table 7). PNPLA4 is X-linked, while UTY and USP9Y are Y-linked. Also, USP9Y was identified before for the other dataset using Limma. For GSE23343, 9 probes were found to be differentially expressed (figure 6B). As shown in table 8, these probes correspond to only 4 genes, none of which are new compared to previously identified probes for this dataset.

After merging the 2 fRMA-normalized datasets, we obtained 10 differentially expressed probes (figure 7). These corresponded to 8 genes, 7 of which were sex-chromosome linked, and all of which had already been identified in earlier individual dataset analyses. Using the ComBat method for batch effect removal, 13 probes were identified as differentially expressed, corresponding to 10 genes, with considerable overlap with those identified using fRMA (table 9). Out of these ten, only one was not previously identified for either dataset: lysine (K)-specific demethylase 6A (KDM6A), an X-linked gene.

3.6 Gene Signature Validation

Recently published microarray data showed around 500 genes to be significantly differentially expressed between normal male and female human liver tissue16. Since our analysis did not reveal a broad gender bias in gene expression in human liver tissue, we tested whether this gene set would produce a similar signature in our two datasets. The gene set successfully distinguished male from female samples, which clustered into two distinct classes, but no clear signature of up- and down-regulated genes could be discerned.

4. Discussion

We used data from microarray studies of human liver tissue to compare female and male gene expression and identify differential expression patterns. We identified a subset of X- and Y-linked genes that are up-regulated in females and males, respectively.

As microarray data available from public databases are found in a host of different formats, it is usually recommended to start any analysis from raw files that are pre-processed in a unified manner, to make the output comparable. We chose the RMA pre-processing approach due to its effectiveness and precision, as it is becoming the method of choice for Affymetrix microarray data processing.

The initial clustering of the individual datasets did not show any gender-based segregation, but as the analysis was done on all probes for each dataset, subsets of differentially expressed genes could not be determined or visualized using this exploratory step. Therefore, the t-test statistical analysis was carried out to identify differentially expressed genes between males and females. Although using the t-test for this application has been criticized, it was used in the initial exploration of the data because the experimental design is simple (only 2 groups compared, representing 2 conditions) and also for comparison to other methods. Furthermore, the fold change was used as an additional filtering criterion to select for biologically meaningful differences. Although the number of identified probes was small for each dataset (19 for GSE23343, 6 for GSE14323, and 2 for GSE14951), there was considerable overlap between them, indicating consistency for those genes. The most notable is RPS4Y1, which is up-regulated in males in all datasets; this is not surprising given that it is a Y-linked gene (functional analysis for relevant probes will be discussed in later sections).

To verify these results and refine our analysis with statistical approaches that are more appropriate for small sample sizes, we used the Limma package for differential expression. The Limma package has been shown to be superior in performance and accuracy to various other statistical methods and to have increased power compared to the t-test. When we used Limma for individual dataset analysis, we identified more probes for GSE14323 and fewer for GSE23343, and overall identified just one new gene: LARP4B, an autosomal gene found on chromosome 10.

It is worth noting that the cut-off used for Limma was an adjusted p-value that corrects for multiple testing, which is a stricter cut-off than the standard p-values used in a t-test. This indicates that Limma is indeed more powerful than the t-test, and was able to detect changes that are more significant than the t-test could account for; therefore, Limma was used for all subsequent differential expression analysis. This would also explain the lower number of probes obtained for GSE23343. Similar to the individual analysis using the t-test, considerable overlap was found between the differentially expressed genes identified using Limma for the 2 datasets.

Merging the two datasets in a “meta-analysis” approach is expected to increase the statistical power and allow for the detection of other small changes in gene expression. However, using a t-test and fold change to filter out differentially expressed genes did not add any new probes compared to those identified by individual analysis. On further exploration of the data, a hierarchical cluster analysis revealed that the datasets clustered into 2 different groups, regardless of gender. This indicates that non-biological differences arising from the different origins of the samples (i.e. batch effects) are interfering with the analysis and could be obscuring gender-specific changes. Therefore, before carrying out any further analysis on the merged datasets, we used batch effect removal tools to overcome this variation.

For this purpose, 2 methods were used: fRMA and ComBat. fRMA was used to preprocess the raw data for both sets before merging. The advantage of this approach is that each set is analyzed individually and more datasets can be added to the analysis without the need to repeat pre-processing or batch effect removal steps that were already done. For GSE14323, three new genes were identified using Limma, while none were newly found for GSE23343. This could mean that the inter-sample non-biological variations are marginal, or that these batch effects were not obscuring considerable biological changes.

Finally, for the purpose of the meta-analysis approach we are employing in this study, the 2 datasets were merged and processed using ComBat for batch effect removal and Limma for the identification of differentially expressed genes. This was compared to merging the fRMA-processed datasets. Again, similar to previous results, merging the fRMA datasets did not result in detecting new or different genes, and using ComBat only added one new gene: KDM6A.

Throughout our analysis, the differences in output we observed were obtained when we used different statistical methods, e.g. t-test versus Limma, or RMA versus fRMA or ComBat. Merging the datasets using the same pre-processing and downstream analysis algorithms resulted in almost the same outcome, which indicates that merging the 2 datasets does not provide additional power to the analysis and does not add information that would otherwise have gone undetected. However, we only used 2 (relatively small) datasets, with sample sizes below 25 for each group, which may have reduced the power of the study.

Compared to the ~500 genes identified by Hartwell et al. to be differentially expressed between males and females in normal human liver tissue, our identified genes are very few. Although using these genes in hierarchical analysis allowed us to cluster males and females in each dataset, they did not exhibit any consistent pattern of up- and down-regulated genes. However, we examine below the genes that were found in common between that study and ours and, additionally, genes that were found to be consistently differentially expressed in our analysis.

For the male-dominated genes, we identified 6 that are consistently found to be up-regulated in males: RPS4Y1, EIF1AY, CYorf15B, UTY, DDX3Y and USP9Y, all of which are Y-linked. The first three were also found to be up-regulated in males by Hartwell et al. The most prominent gene, identified by all approaches employed in this study, is RPS4Y1, which is part of the 40S ribosomal small subunit73. The sex bias in RPS4Y1 expression was observed in human brain74 and heart75, and was found to be pronounced in human prostate cancer tissues and cell lines76. RPS4Y1 has a homologue on the X chromosome, RPS4X, which is known to escape X-inactivation, but was not found to be differentially expressed in our sample pool73. Therefore, up-regulation of RPS4Y1 does not seem to be tissue-specific, and its role in any potential tumor transformation is unknown.

DEAD (Asp-Glu-Ala-Asp) box polypeptide 3Y (DDX3Y) is an RNA helicase that is essential for spermatogenesis. DDX3Y was found to be widely transcribed, but the protein was only found in the testis77. The X-linked homologue, DDX3X, is also ubiquitously expressed, and both function as nucleo-cytoplasmic shuttles for RNA77,78. DDX3Y was also up-regulated in human heart, while DBY (another member of the DEAD box protein family) was also up-regulated in male brain tissue. Another protein that is important for spermatogenesis is USP9Y, which is ubiquitously expressed in embryonic and adult tissues and is also found to be up-regulated in brain, heart and prostate tissues of males74–76. It is a ubiquitin-specific protease that functions in the de-ubiquitination of target proteins, and thus plays a role in protein turnover and regulation79,80.

Eukaryotic translation initiation factor 1A, Y-linked (EIF1AY) is a translation initiation factor that interacts with the ribosome, and is required for achieving the maximum rate of protein biosynthesis81. It may also function in stabilizing the binding of the initiator Met-tRNA to 40S ribosomal subunits82. UTY is also found to be up-regulated in male prostate, heart and brain tissues; it encodes a protein with tetratricopeptide repeats that has been found to have multiple splice variants, the functions of which are unknown83. It has been found to code for a male-specific minor histocompatibility antigen involved in stem cell graft rejection84. Finally, CYorf15B has also been found to be up-regulated in liver and heart tissue, but so far has been found to be a pseudogene of the human taxilin gamma (TXLNG) gene.

For the genes found to be up-regulated in females, 2 are X-linked, X (inactive)-specific transcript (XIST) and PNPLA4, and the third, PZP, is autosomal, found on chromosome 12. XIST is found in the X inactivation center (XIC)85 and functions in the silencing of X-linked genes through X-chromosome inactivation. XIST was found to be non-protein coding and instead functions as a structural RNA in the nucleus86. XIST is found to be up-regulated in human brain (specifically neurons), heart and liver16,74,83. PNPLA4 belongs to a family of lipid hydrolases and is a potent retinyl ester hydrolase in keratinocytes, affecting their morphology87. It is expressed in a variety of tissues, albeit in different transcript lengths, suggesting differential processing across tissues88.

Finally, the only autosomal gene found to be up-regulated in the liver of females in our study that has been previously reported is PZP. PZP is a plasma protein that functions as a peptidase inhibitor89. It has been found to associate with TGF-beta-1 and TGF-beta-2 and to regulate their plasma clearance90. Since TGF-beta signaling is highly implicated in hepatic carcinogenesis, PZP may play a role in this pathway. This is further corroborated by data from a genome-wide association study, in which single nucleotide polymorphisms in PZP were associated with high serum AST91.

It is not unexpected to find X- and Y-linked genes to be differentially expressed between females and males, but to explain why only a few of the >1300 genes on the X chromosome have been found to be up-regulated in females, and only a few Y-linked genes up-regulated in males, some considerations have to be taken into account. Inactivation of one X chromosome in the somatic cells of females results in similar expression patterns for X-linked genes in males and females. Dosage compensation for the active X chromosome results in hypertranscription of the X-linked genes to reach the level of autosomal genes (present in 2 copies). This could explain the lack of differential expression of most X-linked genes observed in our study93. To add another level of complication, some genes are found to escape this X-inactivation, which results in higher expression of these genes in females compared to males92. However, this inactivation is found not to be consistent between individuals, or within cells and tissues of an individual, leading to variations in expression that are yet to be further analyzed94. Additionally, due to the presence of various X-Y homologues (homologues of the same gene serving the same function), cross-hybridization of the Y-specific probes to X-homologues has been reported and could compromise the ability of the Y-chromosome probes to differentiate male and female samples, thereby resulting in similar expression levels74,95. Finally, a lack of correlation between mRNA levels and protein levels has been observed for some genes, as is the case for DDX3Y, necessitating downstream functional analysis at the protein level before solid conclusions can be made77.

Overall, while the differentially expressed genes identified in this study did not produce any functional clusters using the gene functional classification tool in DAVID, they are consistent with other studies comparing male and female human tissues. The expression of these genes is not specific to the liver, as they are found in other tissues as well, including brain, heart and prostate. However, several studies have linked sex chromosomal aberrations and Y-linked gene expression to cancer. For example, up-regulation of Y-linked genes such as RPS4Y1, UTY, EIF1AY and USP9Y has been found in prostate cancer tissue and cell lines, compared to benign prostatic hyperplasia and normal testis76,96. In contrast, Y-chromosomal deletions have also been associated with prostate cancer97–99, male breast carcinomas100 and pancreatic adenocarcinomas101. Loss of the Y chromosome in peripheral blood cells was found to be associated with a higher risk of cancer in men102. Additionally, rearrangements of the X chromosome, including deletions and gains, have been associated with breast103, ovarian104 and uterine cervix cancer105. Loss of heterozygosity (LOH) has been found in endocrine carcinomas of the gastroenteropancreatic tract, lung and colorectal cancer106–108. Therefore, there is a link between sex-chromosome genes and both gender-specific and non-specific cancers, which remains to be fully elucidated.

Although this study provides a preliminary meta-analysis of gender bias in normal liver tissue utilizing two microarray datasets, further studies should analyze more samples, with gender information available for the utilized datasets. A larger sample pool would allow the stratification of patients into age groups. As cancer is largely a disease of ageing, analyzing data from more uniform age groups could be more informative. In addition, more advanced techniques, such as RNA-Seq, could provide more information about mutations in X- and Y-chromosome genes, as well as a more accurate and sensitive measurement of X-Y homologues that could have interfered with the microarray analysis. Finally, specific studies of already identified genes in the liver and their corresponding proteins would provide more information about their role in liver biology and regulation during carcinogenesis.

TABLES

Table 1. Summary of the datasets of microarray studies using human normal liver tissue. Datasets with available gender information are highlighted

Accession | Sample size | Sample types | Gender | Platform
GSE14323 | 19 | Normal liver | Yes | Affymetrix (GPL571)
GSE14951 | 4 | Normal liver (prior to transplantation) | Yes | Affymetrix (GPL570)
GSE23343 | 7 | Normal liver | Yes | Affymetrix (GPL570)
GSE23343 | 10 | Normal liver (with diabetes) | Yes | Affymetrix (GPL570)
GSE13471 | 4 | Normal liver | None | Affymetrix (GPL570)
GSE6222 | 2 | Normal liver | None | Affymetrix (GPL570)
GSE45436 | 39 | Normal liver | None | Affymetrix (GPL570)
GSE38941 | 10 | Normal liver | None | Affymetrix (GPL570)

Table 2. Sample information of three microarray datasets with gender information

Accession | Male | Female
GSE14323 | 12 | 7
GSE14951 | 5 | 5
GSE23343 | 10 | 7
Total | 27 | 19

Table 3. Summary of differentially expressed probes found in all three datasets and their corresponding gene names. Common probes found between datasets are highlighted.

GSE14951 | GSE14323 | GSE23343 | Gene Name
201909_at | 201909_at | 201909_at | Ribosomal protein S4, Y-linked 1
224590_at | - | 224590_at | X (inactive)-specific transcript (non-protein coding)
- | 203649_s_at | - | Phospholipase A2, group IIA (platelets, synovial fluid)
- | 204409_s_at | 204409_s_at | Eukaryotic translation initiation factor 1A, Y-linked
- | 205000_at | 205000_at | DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked
- | 214218_s_at | 214218_s_at | X (inactive)-specific transcript (non-protein coding)
- | 221728_x_at | 221728_x_at | X (inactive)-specific transcript (non-protein coding)
- | - | 206700_s_at | Lysine (K)-specific demethylase 5D
- | - | 207063_at | Chromosome Y open reading frame 14
- | - | 207330_at | Pregnancy-zone protein
- | - | 214131_at | Chromosome Y open reading frame 15B
- | - | 204410_at | Eukaryotic translation initiation factor 1A, Y-linked
- | - | 223645_s_at | Chromosome Y open reading frame 15B
- | - | 223646_s_at | Chromosome Y open reading frame 15B
- | - | 224588_at | X (inactive)-specific transcript (non-protein coding)
- | - | 205001_s_at | DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked
- | - | 227614_at | Hexokinase domain containing 1
- | - | 227671_at | X (inactive)-specific transcript (non-protein coding)
- | - | 228492_at | Ubiquitin specific peptidase 9, Y-linked
- | - | 235942_at | Hypothetical LOC401629

Table 4. Summary of differentially expressed probes identified in GSE14323 using the Limma Package compared to those found using the T-test method. Common probes are highlighted.

GSE14323
Limma | T-test | Gene Name | Y or X-linked
201909_at | 201909_at | Ribosomal protein S4, Y-linked 1 | Y
204409_s_at | 204409_s_at | Eukaryotic translation initiation factor 1A, Y-linked | Y
- | 203649_s_at | Phospholipase A2, group IIA (platelets, synovial fluid) | Chr. 1
204410_at | - | Eukaryotic translation initiation factor 1A, Y-linked | Y
205000_at | 205000_at | DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked | Y
205001_s_at | - | DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked | Y
206700_s_at | - | Lysine (K)-specific demethylase 5D | Y
214131_at | - | Chromosome Y open reading frame 15B | Y
214218_s_at | 214218_s_at | X (inactive)-specific transcript (non-protein coding) | X
221728_x_at | 221728_x_at | X (inactive)-specific transcript (non-protein coding) | X

Table 5. Differentially expressed probes identified in GSE23343 using the Limma Package compared to those found using the T-test method. Common probes are highlighted.

GSE23343
Limma | T-test | Gene Name | Y or X-linked
201909_at | 201909_at | Ribosomal protein S4, Y-linked 1 | Y
204409_s_at | 204409_s_at | Eukaryotic translation initiation factor 1A, Y-linked | Y
- | 204410_at | Eukaryotic translation initiation factor 1A, Y-linked | Y
205000_at | 205000_at | DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked | Y
- | 205001_s_at | DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked | Y
- | 206700_s_at | Lysine (K)-specific demethylase 5D | Y
- | 207063_at | Chromosome Y open reading frame 14 | Y
- | 207330_at | Pregnancy-zone protein | Chr. 12
- | 214131_at | Chromosome Y open reading frame 15B | Y
- | 214218_s_at | X (inactive)-specific transcript (non-protein coding) | X
221728_x_at | 221728_x_at | X (inactive)-specific transcript (non-protein coding) | X
- | 223645_s_at | Chromosome Y open reading frame 15B | Y
- | 223646_s_at | Chromosome Y open reading frame 15B | Y
224588_at | 224588_at | X (inactive)-specific transcript (non-protein coding) | X
224590_at | 224590_at | X (inactive)-specific transcript (non-protein coding) | X
- | 227614_at | Hexokinase domain containing 1 | Chr. 10
227671_at | 227671_at | X (inactive)-specific transcript (non-protein coding) | X
- | 228492_at | Ubiquitin specific peptidase 9, Y-linked | Y
- | 235942_at | Hypothetical LOC401629 | Y
214216_s_at | - | La ribonucleoprotein domain family, member 4B | Chr. 10

Table 6. Differentially expressed probes and their corresponding gene names, identified in merged datasets using the t-test method.

Merged Datasets (GSE14323 & GSE23343)
Probe ID | Gene Name
201909_at | Ribosomal protein S4, Y-linked 1
204409_s_at | Eukaryotic translation initiation factor 1A, Y-linked
204410_at | Eukaryotic translation initiation factor 1A, Y-linked
205000_at | DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked
205001_s_at | DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked
206700_s_at | Lysine (K)-specific demethylase 5D
207330_at | Pregnancy-zone protein
214131_at | Chromosome Y open reading frame 15B
214218_s_at | X (inactive)-specific transcript (non-protein coding)
221728_x_at | X (inactive)-specific transcript (non-protein coding)

Table 7. Differentially expressed probes and their corresponding genes in dataset GSE14323 after batch effect removal using fRMA. Newly identified genes are marked with an asterisk.

GSE14323 fRMA
Probe ID | Gene Name
214131_at | Chromosome Y open reading frame 15B
204410_at | Eukaryotic translation initiation factor 1A, Y-linked
205001_s_at | DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked
206700_s_at | Lysine (K)-specific demethylase 5D
204409_s_at | Eukaryotic translation initiation factor 1A, Y-linked
205000_at | DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked
201909_at | Ribosomal protein S4, Y-linked 1
209739_s_at | patatin-like phospholipase domain containing 4 *
221728_x_at | X (inactive)-specific transcript (non-protein coding)
214218_s_at | X (inactive)-specific transcript (non-protein coding)
206624_at | ubiquitin specific peptidase 9, Y-linked *
211149_at | ubiquitously transcribed tetratricopeptide repeat gene, Y-linked *

Table 8. Differentially expressed probes and their corresponding genes in dataset GSE23343 after batch effect removal using fRMA.

GSE23343
Probe ID | Gene Name
221728_x_at | X (inactive)-specific transcript (non-protein coding)
227671_at | X (inactive)-specific transcript (non-protein coding)
214218_s_at | X (inactive)-specific transcript (non-protein coding)
224588_at | X (inactive)-specific transcript (non-protein coding)
224590_at | X (inactive)-specific transcript (non-protein coding)
205001_s_at | DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked
204409_s_at | Eukaryotic translation initiation factor 1A, Y-linked
205000_at | DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked
201909_at | Ribosomal protein S4, Y-linked 1

Table 9. Differentially expressed probes and their corresponding genes in merged datasets after batch effect removal using fRMA or using ComBat, common probes are highlighted.

Probe ID Gene Name Merged fRMA Merged ComBat Ribosomal protein S4, Y-linked 1 201909_at 201909_at Eukaryotic translation initiation factor 1A, Y-linked 204409_s_at 204409_s_at Eukaryotic translation initiation factor 1A, Y-linked 204410_at 204410_at DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked 205000_at 205000_at DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked 205001_s_at 205001_s_at Lysine (K)-specific demethylase 5D 206700_s_at 206700_s_at ubiquitously transcribed tetratricopeptide repeat gene, Y-linked 211149_at 211149_at Chromosome Y open reading frame 15B 214131_at 214131_at La ribonucleoprotein domain family, member 4B 214216_s_at X (inactive)-specific transcript (non-protein coding) 221728_x_at 221728_x_at X (inactive)-specific transcript (non-protein coding) 214218_s_at X (inactive)-specific transcript (non-protein coding) 203992_s_at lysine (K)-specific demethylase 6A 206624_at ubiquitin specific peptidase 9, Y-linked 209739_s_at

FIGURES

Figure 1: Hierarchical cluster analysis of 3 microarray datasets of human normal liver tissues: A. GSE14951, B. GSE14323, C. GSE23343.

Figure 2: Differentially expressed probes in individual datasets (t-test). Heat maps of differentially expressed probes (t-test, P<0.05, fold change >2) between male and female samples of human normal liver. Datasets: A. GSE14951, B. GSE14323, C. GSE23343. [Sample groups labelled on the heat maps: Female, Male.]
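The two thresholds named in the Figure 2 caption (P<0.05 and fold change >2) can be applied as a joint filter. The sketch below is only an illustration in Python under stated assumptions: expression values are taken to be on the log2 scale, the per-probe P values are assumed to have been computed already by a t-test, and the probe names, values and the function `select_probes` are hypothetical toy inputs rather than data or code from this thesis.

```python
import math

def select_probes(log2_female, log2_male, p_values, p_cutoff=0.05, min_fold=2.0):
    """Return probes with P < p_cutoff and absolute fold change > min_fold."""
    selected = []
    for probe, p in p_values.items():
        mean_f = sum(log2_female[probe]) / len(log2_female[probe])
        mean_m = sum(log2_male[probe]) / len(log2_male[probe])
        # On the log2 scale, a fold change of 2 is a difference of 1.
        log2_fc = mean_m - mean_f
        if p < p_cutoff and abs(log2_fc) > math.log2(min_fold):
            selected.append(probe)
    return selected

# Toy log2 intensities and P values (made up for illustration only).
log2_female = {
    "probe_a": [4.0, 4.2, 3.8],
    "probe_b": [7.0, 7.1, 6.9],
    "probe_c": [5.0, 5.2, 4.8],
    "probe_d": [6.0, 6.0, 6.0],
}
log2_male = {
    "probe_a": [8.0, 8.1, 7.9],
    "probe_b": [7.2, 7.0, 7.1],
    "probe_c": [6.5, 6.4, 6.6],
    "probe_d": [6.5, 6.5, 6.5],
}
p_values = {"probe_a": 0.001, "probe_b": 0.6, "probe_c": 0.2, "probe_d": 0.01}

selected = select_probes(log2_female, log2_male, p_values)
print(selected)
```

In this toy input only `probe_a` passes both criteria: `probe_c` has a large fold change but a P value above the cutoff, and `probe_d` is significant but changes by less than twofold.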

Figure 3: Differentially expressed probes in individual datasets (Limma). Heat maps of differentially expressed probes between male and female samples of human normal liver identified using the Limma package (P<0.05). A. Dataset GSE14323, B. Dataset GSE23343. [Sample groups labelled on the heat maps: A. Male, Female; B. Female, Male.]

Figure 4: Differentially expressed probes in merged datasets. Heat map of differentially expressed probes between male and female samples of human normal liver tissues, identified in merged datasets (GSE14323, GSE23343) using the t-test method (P<0.05, fold change >2). [Sample groups labelled on the heat map: Female, Male.]

Figure 5: Hierarchical Cluster Analysis of merged microarray datasets (GSE14323, GSE23343) of human normal liver tissues.

Figure 6: Differentially expressed probes in individual datasets after batch effect removal. Heat maps of differentially expressed probes (Limma, P<0.05) between male and female human normal liver samples in microarray datasets A. GSE14323 and B. GSE23343 after removal of batch effects using the fRMA method. [Sample groups labelled on the heat maps: Male, Female.]

Figure 7: Differentially expressed probes in merged datasets after batch effect removal. Heat maps of differentially expressed probes (Limma, P<0.05) between male and female human normal liver samples in merged microarray datasets (GSE14323 and GSE23343) after removal of batch effects using A. fRMA and B. ComBat methods. [Sample groups labelled on the heat maps: Female, Male.]

Figure 8: Gene signature validation. Hierarchical cluster analysis for microarray datasets A. GSE14323 and B. GSE23343, using a subset of 500 genes identified previously by Hartwell et al.16

REFERENCES

1. Ferlay, J. et al. GLOBOCAN 2012 v1.0, Cancer Incidence and Mortality Worldwide: IARC CancerBase No. 11 [Internet]. Lyon Fr. Int. Agency Res. Cancer 2013 at
2. El-Serag, H. B. Hepatocellular Carcinoma. N. Engl. J. Med. 365, 1118–1127 (2011).
3. Shaker, M. K., Abdella, H. M., Khalifa, M. O. & Dorry, A. K. E. Epidemiological characteristics of hepatocellular carcinoma in Egypt: a retrospective analysis of 1313 Cases. Liver Int. n/a–n/a (2013). doi:10.1111/liv.12209
4. el-Zayadi, A.-R. et al. Hepatocellular carcinoma in Egypt: a single center study over a decade. World J. Gastroenterol. WJG 11, 5193–5198 (2005).
5. Bosch, F. X., Ribes, J., Díaz, M. & Cléries, R. Primary liver cancer: worldwide incidence and trends. Gastroenterology 127, S5–S16 (2004).
6. Mucci, L. A. et al. Age at menarche and age at menopause in relation to hepatocellular carcinoma in women. BJOG Int. J. Obstet. Gynaecol. 108, 291–294 (2001).
7. Yu, M.-W. et al. Role of reproductive factors in hepatocellular carcinoma: Impact on hepatitis B- and C-related risk. Hepatol. Baltim. Md 38, 1393–1400 (2003).
8. Ghebranious, N. & Sell, S. Hepatitis B injury, male gender, aflatoxin, and p53 expression each contribute to hepatocarcinogenesis in transgenic mice. Hepatol. Baltim. Md 27, 383–391 (1998).
9. Nakatani, T., Roy, G., Fujimoto, N., Asahara, T. & Ito, A. Sex hormone dependency of diethylnitrosamine-induced liver tumors in mice and chemoprevention by leuprorelin. Jpn. J. Cancer Res. Gann 92, 249–256 (2001).
10. Maeda, S., Kamata, H., Luo, J.-L., Leffert, H. & Karin, M. IKKbeta couples hepatocyte death to cytokine-driven compensatory proliferation that promotes chemical hepatocarcinogenesis. Cell 121, 977–990 (2005).
11. Verna, L., Whysner, J. & Williams, G. M. N-nitrosodiethylamine mechanistic data and risk assessment: bioactivation, DNA-adduct formation, mutagenicity, and tumor initiation. Pharmacol. Ther. 71, 57–81 (1996).
12. Kemp, C. J., Leary, C. N. & Drinkwater, N. R. Promotion of murine hepatocarcinogenesis by testosterone is androgen receptor-dependent but not cell autonomous. Proc. Natl. Acad. Sci. U. S. A. 86, 7505–7509 (1989).
13. Berasain, C. et al. Inflammation and Liver Cancer: New Molecular Links. Ann. N. Y. Acad. Sci. 1155, 206–221 (2009).
14. Naugler, W. E. et al. Gender Disparity in Liver Cancer Due to Sex Differences in MyD88-Dependent IL-6 Production. Science 317, 121–124 (2007).
15. Bigsby, R. M. & Caperell-Grant, A. The role for estrogen receptor-alpha and prolactin receptor in sex-dependent DEN-induced liver tumorigenesis. Carcinogenesis 32, 1162–1166 (2011).
16. Hartwell, H. J., Petrosky, K. Y., Fox, J. G., Horseman, N. D. & Rogers, A. B. Prolactin prevents hepatocellular carcinoma by restricting innate immune activation of c-Myc in mice. Proc. Natl. Acad. Sci. 111, 11455–11460 (2014).
17. Keng, V. W. et al. Sex bias occurrence of hepatocellular carcinoma in Poly7 molecular subclass is associated with EGFR. Hepatology 57, 120–130 (2013).
18. Lipshutz, R. J., Fodor, S. P. A., Gingeras, T. R. & Lockhart, D. J. High density synthetic oligonucleotide arrays. Nat. Genet. 21, 20–24 (1999).

19. Lockhart, D. J. et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. 14, 1675–1680 (1996).
20. Alon, U. et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. 96, 6745–6750 (1999).
21. Dhanasekaran, S. M. et al. Delineation of prognostic biomarkers in prostate cancer. Nature 412, 822–826 (2001).
22. Van ’t Veer, L. J. et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536 (2002).
23. Chen, X. et al. Gene expression patterns in human liver cancers. Mol. Biol. Cell 13, 1929–1939 (2002).
24. Okabe, H. et al. Genome-wide analysis of gene expression in human hepatocellular carcinomas using cDNA microarray: identification of genes involved in viral carcinogenesis and tumor progression. Cancer Res. 61, 2129–2137 (2001).
25. Garber, M. E. et al. Diversity of gene expression in adenocarcinoma of the lung. Proc. Natl. Acad. Sci. 98, 13784–13789 (2001).
26. Barrett, T. et al. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 41, D991–D995 (2013).
27. Edgar, R. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30, 207–210 (2002).
28. Kolesnikov, N. et al. ArrayExpress update--simplifying data submissions. Nucleic Acids Res. 43, D1113–1116 (2015).
29. Kuo, W. P., Jenssen, T.-K., Butte, A. J., Ohno-Machado, L. & Kohane, I. S. Analysis of matched mRNA measurements from two different microarray technologies. Bioinforma. Oxf. Engl. 18, 405–412 (2002).
30. Irizarry, R. A. et al. Multiple-laboratory comparison of microarray platforms. Nat. Methods 2, 345–350 (2005).
31. Ramasamy, A., Mondry, A., Holmes, C. C. & Altman, D. G. Key Issues in Conducting a Meta-Analysis of Gene Expression Microarray Datasets. PLoS Med. 5, e184 (2008).
32. Choi, J. K., Yu, U., Kim, S. & Yoo, O. J. Combining multiple microarray studies and modeling interstudy variation. Bioinformatics 19, i84–i90 (2003).
33. Hong, F. et al. RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis. Bioinformatics 22, 2825–2827 (2006).
34. Mehra, R. Identification of GATA3 as a Breast Cancer Prognostic Marker by Global Gene Expression Meta-analysis. Cancer Res. 65, 11259–11264 (2005).
35. Rhodes, D. R. et al. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc. Natl. Acad. Sci. 101, 9309–9314 (2004).
36. Grützmann, R. et al. Meta-analysis of microarray data on pancreatic cancer defines a set of commonly dysregulated genes. Oncogene 24, 5079–5088 (2005).
37. Lee, H. K. Coexpression Analysis of Human Genes Across Many Microarray Data Sets. Genome Res. 14, 1085–1094 (2004).
38. Ewald, J. A., Downs, T. M., Cetnar, J. P. & Ricke, W. A. Expression Microarray Meta-Analysis Identifies Genes Associated with Ras/MAPK and Related Pathways in Progression of Muscle-Invasive Bladder Transition Cell Carcinoma. PLoS ONE 8, e55414 (2013).

39. Hong, F. & Breitling, R. A comparison of meta-analysis methods for detecting differentially expressed genes in microarray experiments. Bioinformatics 24, 374–382 (2008).
40. Tseng, G. C., Ghosh, D. & Feingold, E. Comprehensive literature review and statistical considerations for microarray meta-analysis. Nucleic Acids Res. 40, 3785–3799 (2012).
41. Irizarry, R. A. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 31, 15e–15 (2003).
42. Bolstad, B. M., Irizarry, R. A., Astrand, M. & Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–193 (2003).
43. Li, C. & Hung Wong, W. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol. 2, RESEARCH0032 (2001).
44. Naef, F., Hacker, C. R., Patil, N. & Magnasco, M. Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays. Genome Biol. 3, RESEARCH0018 (2002).
45. Irizarry, R. A. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–264 (2003).
46. Cui, X. & Churchill, G. A. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol. 4, 210 (2003).
47. Jeanmougin, M. et al. Should We Abandon the t-Test in the Analysis of Gene Expression Microarray Data: A Comparison of Variance Modeling Strategies. PLoS ONE 5, e12336 (2010).
48. Murie, C., Woody, O., Lee, A. Y. & Nadon, R. Comparison of small n statistical tests of differential expression applied to microarrays. BMC Bioinformatics 10, 45 (2009).
49. Jeffery, I. B., Higgins, D. G. & Culhane, A. C. Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data. BMC Bioinformatics 7, 359 (2006).
50. Kooperberg, C., Aragaki, A., Strand, A. D. & Olson, J. M. Significance testing for small microarray experiments. Stat. Med. 24, 2281–2298 (2005).
51. Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47–e47 (2015).
52. Smyth, G. K. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3, Article3 (2004).
53. Chen, C. et al. Removing Batch Effects in Analysis of Expression Microarray Data: An Evaluation of Six Batch Adjustment Methods. PLoS ONE 6, e17238 (2011).
54. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
55. Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11, 733–739 (2010).
56. Lander, E. S. Array of hope. Nat. Genet. 21, 3–4 (1999).
57. Benito, M. et al. Adjustment of systematic microarray data biases. Bioinforma. Oxf. Engl. 20, 105–114 (2004).
58. Sims, A. H. et al. The removal of multiplicative, systematic bias allows integration of breast cancer gene expression datasets – improving meta-analysis and prediction of prognosis. BMC Med. Genomics 1, 42 (2008).

59. Leek, J. T. & Storey, J. D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, 1724–1735 (2007).
60. Luo, J. et al. A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. Pharmacogenomics J. 10, 278–291 (2010).
61. Gautier, L., Cope, L., Bolstad, B. M. & Irizarry, R. A. affy--analysis of Affymetrix GeneChip data at the probe level. Bioinforma. Oxf. Engl. 20, 307–315 (2004).
62. Gentleman, R. C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004).
63. Huber, W. et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115–121 (2015).
64. Holm, S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 65–70 (1979).
65. Taminau, J. et al. Unlocking the potential of publicly available microarray data using inSilicoDb and inSilicoMerging R/Bioconductor packages. BMC Bioinformatics 13, 335 (2012).
66. McCall, M. N., Bolstad, B. M. & Irizarry, R. A. Frozen robust multiarray analysis (fRMA). Biostatistics 11, 242–253 (2010).
67. Huang, D. W., Sherman, B. T. & Lempicki, R. A. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37, 1–13 (2009).
68. Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57 (2008).
69. Misu, H. et al. A liver-derived secretory protein, selenoprotein P, causes insulin resistance. Cell Metab. 12, 483–495 (2010).
70. Conti, A. et al. Wide gene expression profiling of ischemia-reperfusion injury in human liver transplantation. Liver Transplant. Off. Publ. Am. Assoc. Study Liver Dis. Int. Liver Transplant. Soc. 13, 99–113 (2007).
71. Mas, V. R. et al. Genes involved in viral carcinogenesis and tumor initiation in hepatitis C virus-induced hepatocellular carcinoma. Mol. Med. Camb. Mass 15, 85–94 (2009).
72. Everitt, B. S., Landau, S. & Leese, M. Cluster Analysis. (London: Arnold.).
73. Fisher, E. M. et al. Homologous ribosomal protein genes on the human X and Y chromosomes: escape from X inactivation and possible implications for Turner syndrome. Cell 63, 1205–1218 (1990).
74. Vawter, M. P. et al. Gender-Specific Gene Expression in Post-Mortem Human Brain: Localization to Sex Chromosomes. Neuropsychopharmacology 29, 373–384 (2004).
75. Isensee, J. et al. Sexually dimorphic gene expression in the heart of mice and men. J. Mol. Med. 86, 61–74 (2008).
76. Dasari, V. K. et al. Expression analysis of Y chromosome genes in human prostate cancer. J. Urol. 165, 1335–1341 (2001).
77. Ditton, H. J., Zimmer, J., Kamp, C., Rajpert-De Meyts, E. & Vogt, P. H. The AZFa gene DBY (DDX3Y) is widely transcribed but the protein is limited to the male germ cells by translation control. Hum. Mol. Genet. 13, 2333–2341 (2004).
78. Yedavalli, V. S. R. K., Neuveut, C., Chi, Y.-H., Kleiman, L. & Jeang, K.-T. Requirement of DDX3 DEAD box RNA helicase for HIV-1 Rev-RRE export function. Cell 119, 381–392 (2004).

79. Lee, K. H. et al. Ubiquitin-specific protease activity of USP9Y, a male infertility gene on the Y chromosome. Reprod. Fertil. Dev. 15, 129–133 (2003).
80. Brown, G. M. et al. Characterisation of the coding sequence and fine mapping of the human DFFRY gene and comparative expression analysis and mapping to the Sxrb interval of the mouse Y chromosome of the Dffry gene. Hum. Mol. Genet. 7, 97–107 (1998).
81. Marintchev, A., Kolupaeva, V. G., Pestova, T. V. & Wagner, G. Mapping the binding interface between human eukaryotic initiation factors 1A and 5B: a new interaction between old partners. Proc. Natl. Acad. Sci. U. S. A. 100, 1535–1540 (2003).
82. Luna, R. E. et al. The Interaction between Eukaryotic Initiation Factor 1A and eIF5 Retains eIF1 within Scanning Preinitiation Complexes. Biochemistry (Mosc.) 52, 9510–9518 (2013).
83. Laaser, I., Theis, F. J., de Angelis, M. H., Kolb, H.-J. & Adamski, J. Huge splicing frequency in human Y chromosomal UTY gene. Omics J. Integr. Biol. 15, 141–154 (2011).
84. Vogt, M. H. J. et al. UTY gene codes for an HLA-B60–restricted human male-specific minor histocompatibility antigen involved in stem cell graft rejection: characterization of the critical polymorphic amino acid residues for T-cell recognition. Blood 96, 3126–3132 (2000).
85. Brown, C. J. et al. The human XIST gene: analysis of a 17 kb inactive X-specific RNA that contains conserved repeats and is highly localized within the nucleus. Cell 71, 527–542 (1992).
86. Brockdorff, N. et al. Conservation of position and exclusive expression of mouse Xist from the inactive X chromosome. Nature 351, 329–331 (1991).
87. Kienesberger, P. C., Oberer, M., Lass, A. & Zechner, R. Mammalian patatin domain containing proteins: a family with diverse lipolytic activities involved in multiple biological functions. J. Lipid Res. 50, S63–S68 (2008).
88. Lee, W. C., Salido, E. & Yen, P. H. Isolation of a new gene GS2 (DXS1283E) from a CpG island between STS and KAL1 on Xp22.3. Genomics 22, 372–376 (1994).
89. Valnickova, Z. et al. Activated human plasma carboxypeptidase B is retained in the blood by binding to alpha2-macroglobulin and pregnancy zone protein. J. Biol. Chem. 271, 12937–12943 (1996).
90. Philip, A., Bostedt, L., Stigbrand, T. & O’Connor-McCourt, M. D. Binding of transforming growth factor-beta (TGF-beta) to pregnancy zone protein (PZP). Comparison to the TGF-beta-alpha2-macroglobulin interaction. Eur. J. Biochem. 221, 687–693 (1994).
91. Chalasani, N. et al. Genome-Wide Association Study Identifies Variants Associated With Histologic Features of Nonalcoholic Fatty Liver Disease. Gastroenterology 139, 1567–1576.e6 (2010).
92. Nguyen, D. K. & Disteche, C. M. Dosage compensation of the active X chromosome in mammals. Nat. Genet. 38, 47–53 (2006).
93. Sudbrak, R. et al. X chromosome-specific cDNA arrays: identification of genes that escape from X-inactivation and other applications. Hum. Mol. Genet. 10, 77–83 (2001).
94. Heard, E. Dosage compensation in mammals: fine-tuning the expression of the X chromosome. Genes Dev. 20, 1848–1867 (2006).
95. Skaletsky, H. et al. The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature 423, 825–837 (2003).
96. Lau, Y. F. & Zhang, J. Expression analysis of thirty one Y chromosome genes in human prostate cancer. Mol. Carcinog. 27, 308–321 (2000).

97. Jordan, J. J., Hanlon, A. L., Al-Saleem, T. I., Greenberg, R. E. & Tricoli, J. V. Loss of the short arm of the Y chromosome in human prostate carcinoma. Cancer Genet. Cytogenet. 124, 122–126 (2001).
98. König, J. J. et al. Loss and gain of chromosomes 1, 18, and Y in prostate cancer. The Prostate 25, 281–291 (1994).
99. Lundgren, R. et al. Cytogenetic analysis of 57 primary prostatic adenocarcinomas. Genes. Chromosomes Cancer 4, 16–24 (1992).
100. Teixeira, M. R. et al. Chromosome banding analysis of gynecomastias and breast carcinomas in men. Genes. Chromosomes Cancer 23, 16–20 (1998).
101. Wallrapp, C. et al. Loss of the Y chromosome is a frequent chromosomal imbalance in pancreatic cancer and allows differentiation to chronic pancreatitis. Int. J. Cancer 91, 340–344 (2001).
102. Forsberg, L. A. et al. Mosaic loss of chromosome Y in peripheral blood is associated with shorter survival and higher risk of cancer. Nat. Genet. 46, 624–628 (2014).
103. Piao, Z. & Malkhosyan, S. R. Frequent loss Xq25 on the inactive X chromosome in primary breast carcinomas is associated with tumor grade and axillary lymph node metastasis. Genes. Chromosomes Cancer 33, 262–269 (2002).
104. Choi, C. et al. Loss of heterozygosity at chromosome segment Xq25-26.1 in advanced human ovarian carcinomas. Genes. Chromosomes Cancer 20, 234–242 (1997).
105. Kersemaekers, A. M., van de Vijver, M. J., Kenter, G. G. & Fleuren, G. J. Genetic alterations during the progression of squamous cell carcinomas of the uterine cervix. Genes. Chromosomes Cancer 26, 346–354 (1999).
106. Azzoni, C. et al. Xq25 and Xq26 identify the common minimal deletion region in malignant gastroenteropancreatic endocrine carcinomas. Virchows Arch. 448, 119–126 (2006).
107. Bottarelli, L. et al. Sex Chromosome Alterations Associate with Tumor Progression in Sporadic Colorectal Carcinomas. Clin. Cancer Res. 13, 4365–4370 (2007).
108. D’Adda, T. et al. Malignancy-associated X chromosome allelic losses in foregut endocrine neoplasms: further evidence from lung tumors. Mod. Pathol. 18, 795–805 (2005).
