
Health Science Campus

FINAL APPROVAL OF DISSERTATION Doctor of Philosophy in Biomedical Science ( Biology)

Aneuploidy: Using genetic instability to preserve a haploid genome?

Submitted by: Ramona Ramdath

In partial fulfillment of requirements for the degree of Doctor of Philosophy in Biomedical Science

Examination Committee Signature/Date

Major Advisor: David Allison, .., Ph.D.

Academic Advisory Committee: James Trempe, Ph.D.

David Giovanucci, Ph.D.

Randall Ruch, Ph.D.

Ronald Mellgren, Ph.D.

Senior Associate Dean College of Graduate Studies Michael S. Bisesi, Ph.D.

Date of Defense: April 10, 2009

Aneuploidy: Using genetic instability to preserve a haploid genome?

Ramona Ramdath

University of Toledo, Health Science Campus

2009

Dedication

I dedicate this dissertation to my grandfather who died of lung cancer two years ago, but who always instilled in us the value and importance of education.

And to my mom and sister, both of whom have been pillars of support and stimulating conversations. To my sister, Rehanna, especially- I hope this inspires you to achieve all that you want to in life, academically and otherwise.

Acknowledgements

As we go through these academic journeys, there are so many along the way that make an impact not only on our work, but on our lives as well, and I would like to say a heartfelt thank you to all of those people:

My Committee members- Dr. James Trempe, Dr. David Giovanucchi, Dr. Ronald Mellgren and Dr. Randall Ruch for their guidance, suggestions, support and confidence in me.

My major advisor- Dr. David Allison, for his constructive criticism and positive reinforcement.

Kay Langenderfer, our secretary, who has always been there to help us in whatever we needed, regardless of the time of day or night, and who always advocated for me.

Dr. Andrea Kalinoski, who taught me many of the techniques that I learnt, and who helped me understand the projects in the laboratory.

Peter Bazeley, without whom this project would have been impossible. His knowledge and expertise were invaluable to me, and all the long hours and countless times that he explained concepts to me were far beyond his call of duty.

Dr. Saad Sikhanderkhel who worked tirelessly on getting the database up and running.

Dr. David Weaver, for his help not only with managing our SNP data so efficiently, but also, for doing our expression arrays.

Case Western Reserve Expression and Genotyping core facility- Debora Poruban, Tom Merk and Dr. Martina Veigl, for doing all of our SNP arrays, and for working with us to help us achieve our goals.

Dr. Seiichi Matsui and Jeffrey LaDuca for doing the SKY imaging for us.

Dr. Amira Gohara, who spent many early morning hours patiently going through slides with me to identify cancerous and normal regions. Even with her exceptionally busy schedule, she always accommodated my requests for help, and did it with such pleasure. It was a pleasure working with you Dr. Gohara.

University of Michigan, Microscopy and Imaging Centre for the use of their “devil child” Laser Capture Microdissection Microscope.

All the friends I have made in the last few years, especially during my time in the Cicila laboratory.

Kyle Robinson, my best friend, who always kept me on my toes, asking pertinent questions, and in other cases providing solid answers. He has supported me; been a shoulder to lean on when I was frustrated; been a listening ear when I needed one, or when I wanted to discuss ideas, and been patient with me, no matter what.

My family- particularly my sister, and in many ways, most importantly, my mom, who has through the years picked me up when I fell, stayed up late with me working on projects, and been with me throughout all of my endeavors. Never have I known a stronger woman, thank you for being a superb role model.

Table of Contents

Introduction
Literature Review
Materials and Methods
Results
Discussion
Conclusions
Summary
References
Appendix 1
Appendix 2
Appendix 3
Appendix 4
Appendix 5
Appendix 6
Appendix 7
Appendix 8
Appendix 9
Appendix 10
Abstract

Introduction

Although lifestyle and environment play a role in carcinogenesis, it is well known that cancer is a genetic disease in which the guardians of the genome have been mutated or lost, resulting in uncontrolled division and genetic instability, which favor tumor progression. Genetic instability is defined as subtle changes in DNA sequence, such as point mutations, that can subsequently result in chromosomal losses, gains, translocations or copy number changes. The changes from genetic instability can, in turn, lead to the loss of genomic information from one homologous chromosome (parent), known as loss of heterozygosity (LOH) (Olshen et al.) (Argos et al., 2008; Dutt & Beroukhim, 2007; Zheng, Peng, Li, & He, 2005). Thus, LOH can potentially be used as a gauge of genomic instability. LOH can be measured using single nucleotide polymorphism (SNP) assays. SNPs are common genetic variations of one nucleotide, widely distributed throughout the genome. They are widely used as markers for genetic diseases, and can detect LOH in relatively small regions of the genome.

As a result of genetic instability, many tumors are aneuploid and no longer have normal diploid genomes (Zheng et al., 2005). By definition, aneuploidy is a change in chromosome number that is not an exact multiple of the normal haploid karyotype (Torres, Williams, & Amon, 2008). Due to the genetic instability characteristic of aneuploid cells, these cancers frequently contain structurally abnormal chromosomes, with changes such as translocations and deletions.

Another subtle type of genetic instability, copy number alterations (CNAs), can lead to genome-wide deregulation of gene expression, even in diploid cancers, and contribute to the development or progression of several cancers (Pollack et al., 2002).

Spectral karyotyping (SKY) is an accurate technique of staining each chromosome to visualize chromosomal aberrations, including CNAs (Schrock et al., 2006).

Another commonly employed method to look at genetic instability of chromosomes is to investigate the genome-wide alterations of gene expression. Gene expression arrays can be used to explore the expression of genes that are essential for survival and also determine which genes are involved in carcinogenesis and tumor progression (Christensen, McCoy, & Ford, 2008).

Our first hypothesis is that aneuploidy occurs in cancer cells as a means of retaining at least a haploid genome. We will use genome-wide single nucleotide polymorphism (SNP) patterns to test this hypothesis. The novel method developed in this project, which we call Probability Stripes, will aid the detection of LOH in the SNP patterns to look at aneuploidy, chromosomal instability and LOH. This method also eliminates noise in the SNP data.

The probability stripes model will be applied to: 1) normal samples obtained from healthy twin sisters, to obtain a baseline level of LOH (if any) and to illustrate how the model works; and 2) the prostate cancer cell line DU-145, to visualize the LOH patterns in cancer cell lines. These data, together with copy number analysis done using spectral karyotyping, will allow us to see if aneuploid chromosomal abnormalities and SNP-detected LOH interact. This investigation into the molecular mechanisms of aneuploidy could potentially yield an important prognostic indicator for cancer patients.

Our second hypothesis is that there will be genes expressed throughout the genome, both in tissues and in long-term cell lines, which may be responsible for the retention of structurally abnormal chromosomes found in aneuploid cancers.

To test this hypothesis, we first take a genome-wide look at the expression of all human genes for which gene expression arrays are available, in a large variety of normal and cancerous cell lines and tissues, to define a set of universally expressed genes. We will then determine whether or not the universally expressed cell survival genes (hypothesis 2) are preserved in the haploid genome that is retained by the cancer cell lines after LOH formation (hypothesis 1).

The motivation behind our first long term goal of improving the SNP-LOH assay is to obtain a better screening method for deciding optimal, individualized treatment regimens for breast cancer patients. We currently do not have a method to effectively and individually screen breast cancer patients to determine the best patient treatment regimen from the staging criteria of individual breast cancers. This results in many women receiving treatments that may not be needed or beneficial. Our work may aid in determining whether the observed genetic events can be used as a screen which may be part of a diagnostic panel. It is possible that the use of a whole genome SNP technique to obtain a LOH pattern may distinguish different tumor grades and thereby help predict better treatment regimens for individual patients.

The motivation behind our second hypothesis is to explain the reason for selectively retained chromosomal abnormalities in aneuploid cancers. The model proposed in this work is fundamentally different from the standard model, which states that such chromosomal abnormalities are selectively retained in aneuploid cancers in order to preserve tumor-promoting oncogenes (Felsher, 2008; Lundberg et al., 2008).

Further, our present study will also involve a study of tissue-specific gene expression, both in vivo and in vitro. In the future, this new knowledge may serve as a foundation on which to build further targeted (tissue-specific) gene-directed anti-cancer therapy.

The literature section will focus on the knowledge that has already been gained in the various fields that our hypotheses build on. For hypothesis 1, this includes looking at previous work on LOH and copy number variants (CNVs), and the underlying genetic instability of chromosomal aneuploidy, using SNP-array technology. Our analysis method and visualization of this data, using the so-called “Probability Stripes”, was created in our laboratory, based on a technique used by Theodor Boveri at the beginning of the last century. Thus, we will look at the general history of this method. We apply this method of analysis to SNP data from normal DNA samples from healthy twin sisters, and a prostate cancer cell line. The cell line probability stripes will be compared with spectral karyotyping (SKY) results of the same prostate cancer cell line chromosomes. All of these results will be discussed in terms of what others have observed.

For hypothesis 2, the literature search will review the insights provided to cancer research by gene expression results. Also, general information on housekeeping genes and their necessity will be discussed, to emphasize the importance of and validate the universally expressed genes found in our study. We will link the gene expression results to the SKY and SNP results of the aneuploid cancer cell lines to specifically address hypothesis 2.

Literature

Loss of heterozygosity (LOH), Copy Number Variations (CNV) and Single Nucleotide Polymorphism (SNP)

Loss of heterozygosity (LOH) or reduction to homozygosity refers to the change from a heterozygous state to a homozygous state, and is a result of loss of genetic information from either homologous chromosome (parent), leading to the loss of chromosomal information (Zheng et al., 2005).
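The definition above can be illustrated with a minimal sketch, assuming paired genotype calls from normal and tumor DNA at the same SNP positions, encoded as two-letter strings such as "AB". The function name `call_loh`, the genotype encoding and the example data are illustrative assumptions and are not taken from this work.

```python
def call_loh(normal, tumor):
    """Compare paired normal and tumor genotypes, one SNP at a time.

    Genotypes are two-character strings such as "AA", "AB" or "BB"
    (a hypothetical encoding chosen for this sketch).
    Returns one label per SNP:
      "LOH"           - heterozygous in normal, homozygous in tumor
      "retained"      - heterozygous in both samples
      "uninformative" - homozygous in normal, so LOH cannot be assessed
    """
    if len(normal) != len(tumor):
        raise ValueError("normal and tumor must cover the same SNPs")
    labels = []
    for n, t in zip(normal, tumor):
        normal_het = n[0] != n[1]
        tumor_het = t[0] != t[1]
        if not normal_het:
            labels.append("uninformative")
        elif tumor_het:
            labels.append("retained")
        else:
            labels.append("LOH")
    return labels


# Made-up example data for five SNPs.
normal_calls = ["AB", "AA", "AB", "BB", "AB"]
tumor_calls = ["AA", "AA", "AB", "BB", "BB"]
print(call_loh(normal_calls, tumor_calls))
```

Only SNPs that are heterozygous in the normal sample can reveal a change to homozygosity, which is why homozygous normal calls are labeled as uninformative in this sketch.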

Loss of heterozygosity is the most common molecular genetic alteration observed in human cancers (Zheng et al., 2005). It is therefore one of the most efficient markers to 1) study chromosomal deletion and instability, and 2) more precisely locate a deleted region (Komura et al., 2006; Tai et al., 2006; Walker & Morgan, 2006; Zhao et al., 2004; Zheng et al., 2005). LOH is regarded as a mechanism, employed by a tumor, for disabling tumor suppressor genes (TSGs) during the course of oncogenesis, thereby supporting its evolution and progression (Beroukhim et al., 2006). Our present work adds to this knowledge by looking not only at chromosomal regions that are lost, but also at those that are retained by the tumor. Uncovering LOH exploits the presence of DNA polymorphisms throughout the genome, which are measured by polymorphic markers known as single nucleotide polymorphisms (SNPs) (Gaasenbeek et al., 2006; Tucker & Friedman, 2002).

Humans share 99.9% of the genome, but LOH detection techniques take advantage of the variable portion of the genome (de Stahl et al., 2008). Single nucleotide polymorphisms are the most frequent form of DNA variation in the genome, occurring in the order of 10 million sites at intervals of approximately every 1200 base pairs, making it theoretically possible to identify a genetic marker in every gene, although they can be located in, near or between genes (Clifford et al., 2000; Dutt & Beroukhim, 2007; Tai et al., 2006; Zhao et al., 2004; Zheng et al., 2005). Their abundance, relatively even spacing, and stability across the genome offer significant diagnostic potential for human diseases including cancers, compared to other techniques such as fragment length polymorphisms or microsatellite markers (Davis & Hammarlund, 2006; Zhao et al., 2004). High density SNP arrays (those that probe more than 262,000 SNPs across the genome) also allow for detection of deletions as small as 2.5 kb (Davis & Hammarlund, 2006). This technique has become important for identifying genetic events involved in the development and progression of human cancers (Komura et al., 2006; Torring et al., 2007).

In addition to their high-throughput and high-resolution features, SNP arrays have the ability to detect chromosomal copy number variation (Torring et al., 2007). It has been found that many aberrant SNP genotypes have significant deletions and duplications of most chromosomes (or parts thereof) (Redon et al., 2006). These deletions, duplications, insertions, and complex multi-site variants, collectively termed copy number variations (CNVs) or copy number polymorphisms (CNPs), are found in all humans. Complex multi-site variants involve several gains or losses of homologous sequences at multiple sites in the genome (Redon et al., 2006). Investigating CNVs in conjunction with LOH allows further discovery into the mechanisms involved in those structural abnormalities.

Another advantage of SNP array technology is that normal DNA (as a control) is not necessary. Although normal paired DNA will allow higher resolution, samples without paired normal controls will still be informative because statistical analyses are applied to identify strings of consecutive homozygous SNPs that are longer than would be expected to appear by chance alone (Dutt & Beroukhim, 2007). Furthermore, comparison to normal samples may not be required because the new generation of SNP arrays provides dense marker coverage, which allows comparison to neighboring SNPs, and subsequent detection of LOH regions (Beroukhim et al., 2006).
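The idea of flagging strings of consecutive homozygous SNPs that are longer than expected by chance can be sketched as follows. This is a deliberately simplified illustration, not the Probability Stripes method or any published algorithm: it assumes that SNPs are independent and share one heterozygosity rate, so that a run of k homozygous calls has chance probability (1 - het_rate)^k. The values `het_rate=0.3` and `alpha=0.001`, the function names and the example calls are all assumptions made for the example.

```python
def homozygous_runs(genotypes):
    """Return (start_index, length) for every maximal run of homozygous calls."""
    runs = []
    start = None
    for i, g in enumerate(genotypes):
        if g[0] == g[1]:          # homozygous call, e.g. "AA" or "BB"
            if start is None:
                start = i
        elif start is not None:   # a heterozygous call ends the current run
            runs.append((start, i - start))
            start = None
    if start is not None:         # run extends to the last SNP
        runs.append((start, len(genotypes) - start))
    return runs


def candidate_loh_runs(genotypes, het_rate=0.3, alpha=0.001):
    """Flag runs whose chance probability (1 - het_rate) ** length is below alpha.

    Assumes independent SNPs with a common heterozygosity rate (a simplification).
    """
    flagged = []
    for start, length in homozygous_runs(genotypes):
        p_chance = (1.0 - het_rate) ** length
        if p_chance < alpha:
            flagged.append((start, length, p_chance))
    return flagged


# Made-up tumor genotype calls without a paired normal sample.
calls = ["AB", "AA", "BB", "AB"] + ["AA"] * 25 + ["AB", "BB"]
for start, length, p in candidate_loh_runs(calls):
    print(f"candidate LOH run at SNP {start}, length {length}, p = {p:.2e}")
```

A real analysis would also need to account for linkage between neighboring SNPs, genotyping error and the number of runs tested, none of which this sketch attempts.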

As the resolution of SNP arrays continues to increase, they become more powerful as tools to simultaneously study several types of genetic events underlying cancer. These genetic events include LOH, copy number changes and homologous mitotic recombination (see Mechanisms of LOH for further discussion) (Dutt & Beroukhim, 2007; Oosting et al., 2007; Zhao et al., 2004). With the ability to simultaneously map regions of copy number changes and LOH in individual tumors, these markers 1) can show how alleles of candidate disease loci correlate with particular disease (Clifford et al., 2000); and 2) can help classify individual cancers based upon a comprehensive characterization of the genetic changes they have undergone (Dutt & Beroukhim, 2007). With these advantages, it is possible that in the future, SNP array platforms will supplement or replace current diagnostic standards (such as the prostate specific antigen test) (Hahn et al., 2007).

In previous work from our laboratory, we proposed a model for LOH formation (Figure 1). The model consists of simple scenarios of how LOH can lead to tumor progression or death of the cell. In panel A, the yellow symbol represents an inactivating mutation in a tumor suppressor gene. The yellow diamond is a normal copy of that allele on the homologous chromosome. In this case, the mutation is complemented by a normal allele, so there is no phenotypic effect on the growth of the tumor. If, however, the normal homolog is lost, then there is a whole homolog LOH, and the tumor will probably grow due to the effect of the mutated tumor suppressor gene. In panel B, the green LR represents a lethal recessive mutation, which is on the same homolog as the mutated tumor suppressor gene. On the other homolog, linked to the normal allele of the tumor suppressor gene, is the normal copy of a cell survival gene, represented by the green diamond. A loss of this cell survival gene will lead to death of the cell, since there will no longer be a functional copy of this essential gene. If, however, only the normal copy of the tumor suppressor gene is lost, the cell will not die, since the intact cell survival gene is still there. However, the tumor will grow, because there is no functional copy of the tumor suppressor gene (panel C). If the lost portion of the chromosome containing the normal copy of the tumor suppressor gene is actually retained as a translocation or simply by itself, the tumor will not progress (panel D) (Nestor, Hollopeter, Matsui, & Allison, 2007).

Figure 1. Model of LOH. Panel A: Chromosome loss leads to tumor progression. Panel B: Loss of the lower homolog leads to cell death. Panel C: Loss of fragment leads to tumor progression. Panel D: Loss of fragment leads to tumor progression. Reprinted with permission from Karger Science, Cytogenetic and Genome Research (2007). 116, 235-247.

Mechanisms of LOH

Single nucleotide polymorphism array analysis can, in some cases, distinguish between the genetic mechanisms that lead to loss of heterozygosity (LOH), which include hemizygous deletions (deletion of 1 allele), chromosomal loss and subsequent duplication, mitotic recombination, gene conversion and mitotic nondisjunction (Zhao et al., 2004). When there is a loss of heterozygosity but no further structural change, the SNP assay detects a correlating LOH and copy number change, suggesting that this LOH event may be caused by hemizygous deletion, or a deletion of 1 allele (Zhao et al., 2004).

Some of these mechanisms lead to a change in the normal number of two alleles in non-transformed tissues. While LOH is not always equivalent to copy number losses, based on existing data and theoretical considerations, at least six types of change might cause LOH with or without an accompanying chromosomal copy number change: (a) complete elimination of one chromosomal homolog and duplication of the other homolog (regardless of the order), which results in whole-homolog LOH; (b) LOH resulting from break-induced replication (non-reciprocal recombination, classically termed “mitotic recombination” in cancer genetics), which extends from the telomere to any part in the chromosome arm, resulting in a broken-chromosome LOH; (c) centromeric recombinations of somatic chromosomes; (d) LOH resulting from mitotic gene conversion, or the conversion of one allele to another due to faulty mismatch repair; (e) a deletion eliminating part of one homolog; and (f) mitotic nondisjunction, which occurs when the sister chromatids fail to separate correctly at the metaphase plate during mitosis, usually leading to a chromosome duplication (Cleton-Jansen et al., 2004; Gaasenbeek et al., 2006; Mohamedali et al., 2007; Nestor et al., 2007; Torring et al., 2007; Walker & Morgan, 2006). In mechanisms a, b and e, a chromosomal copy number change will be observed.

Mitotic recombinations can result in LOH over whole arms of chromosomes or from the point of recombination to the telomere. In these cases, SNP analysis will show the chromosomal copy number as normal, although 1 allele (or arm of the chromosome) was lost and the remaining allele (or arm of the chromosome) was duplicated. This is referred to as copy-number neutral LOH or uniparental disomy and confers homozygosity across the entire affected region (Cleton-Jansen et al., 2004; Liu et al., 2008; Mohamedali et al., 2007; Torring et al., 2007; Walker & Morgan, 2006). The majority of the regions of LOH detected using SNP arrays arise from copy neutral events (Beroukhim et al., 2006).
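The distinction drawn above between LOH accompanied by a copy number change and copy-number neutral LOH can be written as a small decision rule. The sketch below is only an illustration under the assumption that an LOH call and an integer copy number estimate are already available for a region; the function name `classify_loh_region` and its labels are hypothetical and do not correspond to any particular software package.

```python
def classify_loh_region(has_loh, copy_number, normal_copies=2):
    """Label a genomic region from an LOH call and a copy number estimate.

    has_loh       - True if the region shows loss of heterozygosity
    copy_number   - estimated number of copies of the region
    normal_copies - expected copies in non-transformed tissue (two alleles)
    """
    if not has_loh:
        return "no LOH"
    if copy_number < normal_copies:
        # LOH together with a copy number loss, e.g. a hemizygous deletion
        return "LOH with copy number loss"
    if copy_number == normal_copies:
        # one allele lost and the remaining allele duplicated
        return "copy-number neutral LOH (uniparental disomy)"
    return "LOH with copy number gain"


# Made-up regions: (LOH call, copy number estimate)
for region in [(True, 1), (True, 2), (True, 3), (False, 2)]:
    print(region, "->", classify_loh_region(*region))
```

The rule only separates the outcomes visible to the assay. Several of the mechanisms listed above can produce the same outcome, so a label here does not identify a single mechanism.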

Many genetic researchers have successfully applied SNP technology, as illustrated by the following examples: (1) SNP microarrays have allowed the detection of multiple regions of LOH in cell lines, which were not detected by previously used methods (Gaasenbeek et al., 2006), and (2) SNP microarrays have allowed the identification of small regions of chromosomal gain or loss in lung tumor cells (Tai et al., 2006). There is evidence that different stages and histological grades of tumors are the result of different mechanisms of LOH (Cleton-Jansen et al., 2004). For example, Cleton-Jansen et al. found that grade I breast cancer shows physical deletion of chromosome arm 16q in the preinvasive stage, whereas grade III shows LOH followed by mitotic recombination (Cleton-Jansen et al., 2004). This indicates that as a tumor progresses, different mechanisms may work in conjunction with each other to make the tumor more aggressive.

A major goal of this work is to produce LOH profiles using SNP arrays. Many other techniques, such as microsatellite markers and Comparative Genomic Hybridization, have been used to detect LOH regions in the past, but have been unable to do so with the precision, accuracy and efficiency which SNP arrays provide (Walker & Morgan, 2006). As mentioned previously, SNP arrays detect both copy number changes and LOH, whereas other methods can detect only one or the other (Gaasenbeek et al., 2006). Hence, we chose to use SNP arrays over the other methods.

Aneuploidy


One of the hallmarks of cancers is genomic instability resulting in aneuploidy (Hedley, Rugg, & Gelber, 1987; Liu et al., 2008; Torres et al., 2008). Changes in chromosome number within different cells of the same tumor are classified as aneuploid (Torres et al., 2008). This definition also includes the structural abnormalities caused by genetic instability such as translocations and deletions. The theories for how and why aneuploidy occurs will be discussed in later sections.

Monosomy, where there is only one copy of a chromosome, and trisomy, where there are three copies of the same chromosome, are types of aneuploidy. In general, the gain of genetic information due to trisomies is better tolerated than the loss that is incurred due to monosomies (de Stahl et al., 2008; Torres et al., 2008). In humans, several genes have been identified which, when present in only one copy, result in disease. Similarly, several different genes have been identified that result in disease when there are more than two copies, as in gene duplication (Conrad, Andrews, Carter, Hurles, & Pritchard, 2006). Although aneuploidy is frequent in human oogenesis, and also occurs occasionally in human preimplantation embryos and mammalian brain (Piotrowski et al., 2008), aneuploidy frequently causes death of the embryo and has been associated with disease, sterility, loss of cellular differentiation and tumor formation (Hedley et al., 1987; Torres et al., 2008).
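As a small worked illustration of these definitions, the sketch below checks whether a total chromosome count is an exact multiple of the human haploid number (23) and names the copy state of a single autosome. The function names and example counts are assumptions made for illustration. A count-based check is only a rough proxy, since a cell that gains one chromosome and loses another would pass it while still being aneuploid.

```python
HAPLOID_NUMBER = 23  # human haploid chromosome count


def is_aneuploid(total_chromosomes, haploid=HAPLOID_NUMBER):
    """True when the count is not an exact multiple of the haploid number."""
    if total_chromosomes <= 0:
        raise ValueError("chromosome count must be positive")
    return total_chromosomes % haploid != 0


def describe_copies(copies, normal=2):
    """Name the copy state of one autosome in an otherwise diploid cell."""
    if copies == 1:
        return "monosomy"
    if copies == 3:
        return "trisomy"
    if copies == normal:
        return "disomy (normal)"
    return f"{copies} copies"


# Made-up total chromosome counts for four cells.
for count in (46, 47, 45, 92):
    print(count, "aneuploid" if is_aneuploid(count) else "euploid")
print(describe_copies(1), describe_copies(3))
```

Here 46 (diploid) and 92 (tetraploid) are exact multiples of 23 and are reported as euploid, whereas 45 and 47 are not.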

Aneuploidy interferes with cellular growth and proliferation. The genetic instability associated with cancer aneuploidy may aid in the evolution of the tumor toward a state of high proliferative and/or invasive/metastatic capacity. To enable this, the tumor must also acquire mutations that allow its cells to tolerate the adverse effects of aneuploidy. These progression favoring mutations may be amplified by increasing the selection of the aberrant genomic regions with these abnormalities, leading to greater numbers of translocations, deletions and duplications in a given tumor (Zheng et al., 2005).

The existence of this cycle of retaining aberrant chromosomes or chromosomal regions to allow tumor progression is supported by data that in many cancers certain aberrations occur more frequently than would be expected by chance alone. This suggests there is a selective pressure to preserve these aberrations. The current thinking is that this is done to help preserve oncogenes. However, the vast majority of selectively retained chromosomal abnormalities in aneuploid human cancers do not contain progression-favoring oncogenes (Nestor et al., 2007). The selective pressures to retain these chromosomal abnormalities without oncogenes are poorly understood. We suggest that the selective pressure to retain these chromosomal abnormalities is to retain genes essential for cell survival.

With regard to gene duplications, the selection may be due to the ability of conserved aberrations to turn pathways on and off, such as those responsible for oncogene activation and tumor suppressor gene (TSG) inactivation, which are necessary for tumor development and survival (Cahill, Kinzler, Vogelstein, & Lengauer, 1999; Guttman et al., 2007; Zhao et al., 2004). This involves a change in gene dosage, in some cases of contiguous genes, where tumor suppressor genes may be inactivated by a physical deletion and oncogenes may be enhanced by amplification (Cahill et al., 1999).

Deletions and amplifications can also alter gene expression. It has been suggested that some genomic aberrations may be selected because they alter the expression of multiple genes, which then simultaneously promote tumor progression (Bild et al., 2006; Tsafrir et al., 2006). Further evidence for these selected aberrations is the preferential amplification of alleles from a single parent (rather than both alleles), containing a beneficial mutation for the tumor (Dutt & Beroukhim, 2007). Therefore, from all the evidence mentioned above, it appears that if a genomic abnormality does not contribute to the overall fitness of the cancer, it is unlikely to be conserved (Guttman et al., 2007).

Additionally, the observed lack of essential functional genes relating to cell signaling, cell proliferation and numerous kinase- and phosphorylation-related categories in copy number variant (CNV) regions also shows that selective pressure plays a role in evolution. Generally, CNVs do not cause altered copy numbers for essential genes that are vital for development. Instead, CNVs may be selected to provide sufficient gene dosage for sensitive oncogenes or tumor suppressor genes that could predispose to early-onset tumorigenesis and progression (Redon et al., 2006). This evidence for selective pressure in the somatic evolution of cancers gives credence to our hypothesis that there is a selective process to conserve essential genes in order for the tumor cells to survive and progress.

The mechanisms of aneuploidy

How tumor cells acquire extra chromosomes and maintain them during cell division is currently unknown (Thompson & Compton, 2008). However, many theories have been investigated and suggestions have been presented. It is presumed that most cancer aneuploidy is preceded by an initial 2C to 4C tetraploidization (Nestor et al., 2007). Cells enter mitosis due to a build-up of cyclin B. If mitosis is not completed within a given time, the cyclin B breaks down and the cells re-enter interphase with a 4C chromosome number. Aneuploid cancers are then thought to be formed by chromosomal losses from the tetraploid (4C) precursor cells.

A reduction in activity due to the reduction in gene dosage (known as haplo-insufficiency) and protein stoichiometry imbalances (abundance of or protein level) are two possible reasons for the defects associated with chromosomal losses (de Stahl et al., 2008; Nestor et al., 2007; Torres et al., 2008). Protein stoichiometry imbalances may also be one possible reason for the defects associated with chromosome gains. Some studies propose that it is the additive effects of increases of a large number of genes, leading to many protein stoichiometry imbalances, that are responsible for much of the cellular defects, growth disadvantage and some of the developmental abnormalities associated with aneuploidy. Gene expression also plays a role; an increase in the number of copies of a gene leads to greater expression of that gene, and a greater deviation from the normal level of expression of that particular gene (Thompson & Compton, 2008; Torres et al., 2008). Torres et al. (2008) speculate that eliminating proteins that are in excess is probably easier than upregulating protein production to compensate for a deficiency. An inability to compensate for gene losses could also explain why chromosome gains are better tolerated than chromosome losses (Torres et al., 2008). While this may be true, we suggest that the tumor cells can and do compensate for this deficiency when it is necessary to preserve genes that are required for their survival and growth.

Theoretically, some losses and/or gains of genomic DNA may result from a combination of genomic instability and the requirements for long term growth in tissue culture. The aneuploidy seen in long-term cancer cell lines may not accurately reflect alterations of genes “driving” carcinogenesis (Liu et al., 2008). However, it is extremely difficult to distinguish whether the aberrations are due to long term cell culture or are cancer-related abnormalities. The changes in the cell lines as a result of long term tissue culture may be less significant. For example, when lymphoblast cell lines are compared to the original tissue source, there is a 99.9% similarity between the original tumors and the cell lines with SNP analysis. This suggests that the effects of culturing these cells are minimal and do not significantly interfere with allelic or genotypic associations (gains and losses) (Simon-Sanchez et al., 2007).

The phenomenon of generating genetic instability in cancers may not be entirely random, as was previously believed (Cahill et al., 1999; Nestor et al., 2007). It is well established that cancer cells frequently have defects in DNA repair mechanisms and gene surveillance pathways. While the precise mechanisms creating the genetic instability are still a matter of debate, recent data strongly suggest that many, or most, tumors have evidence for genetic damage, mutations, and chromosomal abnormalities. Further, there is genetic heterogeneity within tumors (different abnormalities within different cells of the same tumor) and even among tumors of the same type. The apparent non-randomness of these changes may be due to selection, but overall it does provide evidence that genetic instability occurs to a significant degree in most cancers (Cahill et al., 1999; Guttman et al., 2007).

Many solid tumors that are genetically unstable and highly aneuploid display high rates of multiple whole chromosome mis-segregation in a phenomenon known as chromosomal instability (CIN). This increased rate of chromosomal mis-segregation promotes tumor cell evolution and growth, which complicates therapeutic efforts. Chromosomal mis-segregation events occur independently during colony formation, explaining why chromosome numbers among cells within individual colonies may vary in vitro. Aneuploidy and CIN are two separate phenotypes, but the relationship between CIN and aneuploidy is not clearly defined. CIN helps perpetuate aneuploidy because once cells become susceptible to chromosomal instability, they experience several instances of amplification or loss (Holland & Cleveland, 2008; Thompson & Compton, 2008; Tsafrir et al., 2006).

Another proposed mechanism for aneuploidy is mitotic infidelity, characterized by the persistently elevated chromosome mis-segregation rate observed in aneuploid tumor cells (Thompson & Compton, 2008). The first specific type of mitotic fidelity problem is defective spindle checkpoint function, which permits premature entry into anaphase before chromosome alignment (Weaver & Cleveland, 2008). This theory has, however, been challenged by the newer proposal that the defect is a lag in anaphase rather than premature entry into anaphase (Holland & Cleveland, 2008). The second type is multipolar spindle assembly, which induces multipolar anaphase and thereby causes an unequal distribution of chromosomes to the daughter cells (Thompson & Compton, 2008). The third mitotic malfunction, which has been shown to be a common mechanism of chromosome mis-segregation in aneuploid tumor cells, lies in the kinetochores. Normally, a kinetochore extends microtubule bundles toward one spindle pole, which allows the chromatids aligned at the metaphase plate to be separated. When kinetochores are abnormal, a single kinetochore extends microtubule bundles toward both spindle poles, so that parts of the same chromatid are pulled in different directions and the chromatid fails to segregate properly at anaphase onset (Thompson & Compton, 2008).

Lastly, a related but distinct mechanism of aneuploidy is general cell cycle dysfunction: data support the view that cell cycle deregulation impairs the ability of the cell to respond adequately to potentially carcinogenic events. As a result, cells accumulate genetic defects, which in turn promote the development of a more aggressive phenotype (Kibel et al., 2008).

There are thus multiple theories for the mechanisms that produce aneuploidy, including protein imbalance, gene dosage imbalance, genetic instability, chromosomal instability, mitotic infidelity and cell cycle dysfunction. Some of these have more supporting evidence than others, and some are difficult to demonstrate. Although we have not specifically investigated how aneuploidy occurs, we have looked in detail at why it occurs.

Why does aneuploidy occur?

As with the mechanisms driving aneuploidy, there are multiple hypotheses for why 90% of cancer cells are aneuploid (Weaver & Cleveland, 2008). Since, in general, tumor cells promote aneuploidy, many of the theories for why aneuploidy occurs look at the benefit it bestows on the tumor cells themselves.

The first theory is that aneuploidy could be a late event in tumorigenesis, caused by the inactivation of the p53 pathway.

The second is that aneuploidy may be an early and causative event during tumorigenesis. Evidence has been found to support this notion in low-grade, small adenomas and atypical ductal hyperplastic cells. Since these lesions are at a very early stage of development, the observation that there is a low degree of loss of heterozygosity, as well as discrete chromosome gains, indicates the early stages of aneuploidy. In specific tissues or during certain developmental stages, aneuploidy may be causative of tumorigenesis.

The third is that the stress caused by aneuploidy precipitates an increase in mutation rate, gene amplification and/or genomic instability, which then sets the cells on the path of tumorigenesis. These are the very events that cause aneuploid cells to proliferate, thereby feeding a never-ending cycle.

The fourth is that among the first mutations that occur in aneuploid cells are those that allow cells to tolerate the adverse effects of aneuploidy. These combined with growth and proliferation-promoting genomic changes, such as amplification of oncogenes and loss of tumor suppressor genes, now promote growth and eventually lead to the selection of tumor cells with high proliferative capacity.

The fifth is that aneuploidy could potentially provide tumors with the ability to fine-tune gene dosages to promote growth in a particular environment within the body.

Finally, the sixth is that it confers a special advantage on the cells (other than those aforementioned), and that it shields them from lethal mutations. By providing multiple copies of essential genes, aneuploidy could protect cells from lethal events.

Thus, Torres et al. speculate that the mechanisms that inhibit proliferation and induce stress are the reasons why aneuploidy promotes tumor growth and development (Torres et al., 2008). Our hypothesis supports the sixth theory, since we believe that aneuploidy preserves essential genes required for survival by both the tumor and the healthy cells in the body, while not expending energy to replicate genes that are not required.


Theodor Boveri, the discoverer of aneuploidy

Among Theodor Boveri's many seminal contributions to biology is the discovery that an abnormal number of chromosomes disrupts development (Torres et al., 2008).

Even during the earlier part of the last century, on solely embryological and cytological grounds, Boveri speculated that some aneuploid cells could proliferate better than wild-type cells or in ways that wild-type cells could not (Manchester, 1995; Torres et al., 2008).

Boveri concluded: ‘only one possibility remains, namely that not a definite number, but a definite combination of chromosomes is essential for normal development, and this means nothing else than that the individual chromosomes must possess different qualities’ (Manchester, 1995). He also noted that any chromosomal combination that deviates from the normal, including chromosome gain or loss, leads either to abnormal development, where the cell or organism remains viable but does not function in a typical way, or to death of the organism (Boveri, 1907; Manchester, 1995; Weaver & Cleveland, 2008).

Our first hypothesis centers on the core conclusions drawn by Theodor Boveri, restated with present-day techniques and knowledge: that the mutations, rearrangements, translocations and deletions that create sites of chromosomal abnormalities are all governed by a haploid genome. The method that we used to analyze these chromosomal aberrations in humans is an extension of the method Boveri used when he studied chromosomal abnormalities in the sea urchin.

LOH Analysis Method

Single nucleotide polymorphisms have implicated genes that are involved in the development and progression of particular diseases, and they can subsequently identify people who may be at higher risk for these diseases (Dutt & Beroukhim, 2007). However, to do this, and to identify particular patterns and profiles that may exist between cancers of one type at different stages or between cancers in different tissues, some form of comparison has to be made. This comparison is made possible by the fact that, in general, the fractional LOH rates (degree of allelic loss) in low- and intermediate-grade tumors are significantly lower than those in high-grade tumors, and are usually associated with different clinical behaviors (Janne et al., 2004; Wang, Buraimoh, Iglehart, & Richardson, 2006).

Researchers have started investigating different tissue types and different tumors to gain a more comprehensive understanding of the genes and LOH patterns involved in particular cancers. Various approaches have been taken to the problem of comparing different samples, including hierarchical clustering, nearest neighbor comparisons and probability analysis. One research group has shown that hierarchical clustering based on genome-wide LOH patterns can distinguish different types of tumor cells based on their shared LOH (Zhao et al., 2004). Another group uses the high density of SNP markers to compare each SNP to its neighbors, thereby detecting significant LOH regions (Beroukhim et al., 2006). Other research groups have inferred chromosomal regions of LOH in multiple types of cancer using probability analysis based on the assumption that long stretches of homozygosity along a chromosome are very unlikely to occur by chance (Wong et al., 2004). This probability analysis can also be used to detect LOH regions shared among different cancer lines. Additional examples are seen in lung, colorectal and prostate cancers. In lung cancer, clustering of LOH data can distinguish small-cell lung cancer from non-small-cell lung cancer with reasonable accuracy (Janne et al., 2004). In colorectal cancer, a set of five carcinoma-specific events can accurately distinguish adenomas (benign) from carcinomas (malignant), giving each a characteristic pattern of genomic imbalances (Lips et al., 2007). Such analysis has identified specific genomic regions highly associated with tumor initiation, metastasis and high-grade disease (Torring et al., 2007). In prostate cancer, poorly-differentiated (more advanced) tumors have a major impact on prognosis compared to well-differentiated (less advanced) tumors. An increase in both the overall LOH frequency and the number of affected chromosomal loci was found in grade II and III tumors when compared to grade I prostate tumors (Hugel & Wernert, 1999). With the advance in technology, researchers are now able to detect significant patterns of LOH that can be used to distinguish more malignant tumors from those that are benign or less aggressive.

Many researchers also use the Bayesian Robust Linear Model with Mahalanobis distance classifier (BRLMM) or the Dynamic Model (DM) algorithm to analyze the data and produce graphical representations of each chromosome (Affymetrix, 2007). However, preliminary studies indicate that the DM algorithm gives much higher rates of homozygosity, longer lengths of homozygosity and more frequent homozygous tracts than the BRLMM algorithm (Curtis, 2007). For this reason, we decided to develop our own method of data analysis.

Loss of heterozygosity patterns observed in many cancers, including prostate, lung, and colorectal cancers, can be used as part of diagnostic panels, and might therefore complement clinical data to guide individualized treatment protocols (Lips et al., 2007). Using the large datasets generated by SNP arrays, we will build on this previous work to develop a genome-wide view of the LOH profile in prostate cancer. We have developed a different method of analyzing the data, built on a method used by an early 20th century biologist, Theodor Boveri.

Theodor Boveri’s work included the use of artificially fertilized sea urchin eggs. He observed that when dispermic fertilization occurred, the egg would divide into four daughter cells without ever going through the 2-cell stage, and in most cases the embryo would die. To simulate the random chromosomal assortments occurring in the double-fertilized eggs, Boveri used marbles. In the sea urchin, a normally fertilized egg would contain 36 chromosomes and a double-fertilized egg would contain 54 chromosomes. Boveri therefore used a set of 54 marbles: three marbles were labeled “#1”, three were labeled “#2” and so on, up to 18, with each set of marbles labeled 1-18 representing one contributed haploid genome. These marbles were then randomly sorted into three or four separate bins on 200 separate occasions. After each sort, Boveri counted the number of bins that had received a complete set of marbles labeled 1-18. He found that the proportions of bins containing at least one copy of all 18 marbles (a complete haploid set) in the three- and four-bin sorts matched the proportions of viable cells found in the three- and four-cell embryos (Baltzer, 1967; Boveri, 1907).
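The marble sort can be approximated in a few lines of code. The following Python sketch is an illustration only and is not Boveri's actual procedure: it assumes that each of the 54 marbles falls into a bin independently and with equal probability, and the function names (`sort_marbles`, `complete_bins`, `simulate`) are hypothetical. It reports the fraction of bins that receive at least one marble of every label from 1 to 18.

```python
import random

N_LABELS = 18   # labels 1..18, one per chromosome of the haploid set
COPIES = 3      # three marbles per label, 54 marbles in total


def sort_marbles(n_bins, rng):
    """Drop every marble into a uniformly random bin (an assumption).

    Each bin is returned as the set of labels it received.
    """
    bins = [set() for _ in range(n_bins)]
    for label in range(1, N_LABELS + 1):
        for _ in range(COPIES):
            bins[rng.randrange(n_bins)].add(label)
    return bins


def complete_bins(bins):
    """Count bins holding at least one marble of every label."""
    return sum(1 for b in bins if len(b) == N_LABELS)


def simulate(n_bins, n_sorts=200, seed=0):
    """Fraction of bins with a complete set over n_sorts random sorts."""
    rng = random.Random(seed)
    complete = sum(complete_bins(sort_marbles(n_bins, rng))
                   for _ in range(n_sorts))
    return complete / (n_sorts * n_bins)


if __name__ == "__main__":
    for n_bins in (3, 4):
        print(n_bins, "bins:", simulate(n_bins))
```

With only three marbles per label, at most three bins can ever be complete, so the four-bin fraction is bounded by 0.75 regardless of how the marbles fall. A different assumption about how the marbles are dealt would give different fractions.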

We use an extension of Theodor Boveri’s method to compare random and observed datasets in order to identify significant lengths of consecutive homozygous SNPs, or LOH regions, in the genome. We apply this algorithm first to normal samples.
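The idea of comparing observed and random datasets can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm developed in this work: genotype calls are assumed to be coded as "AA", "AB" or "BB", the random datasets are generated by shuffling the observed calls, and a run is treated as significant when it is longer than the 95th percentile of the longest runs seen in the shuffled data. The function names and the cutoff are hypothetical.

```python
import random


def homozygous_runs(calls):
    """Return (start_index, length) for each maximal run of homozygous calls."""
    runs, start = [], None
    for i, genotype in enumerate(calls):
        is_hom = genotype in ("AA", "BB")
        if is_hom and start is None:
            start = i
        elif not is_hom and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(calls) - start))
    return runs


def null_threshold(calls, n_shuffles=1000, quantile=0.95, seed=0):
    """Longest-run length at the given quantile over shuffled copies of calls."""
    rng = random.Random(seed)
    shuffled = list(calls)
    longest = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        longest.append(max((n for _, n in homozygous_runs(shuffled)), default=0))
    longest.sort()
    return longest[min(int(quantile * n_shuffles), n_shuffles - 1)]


def significant_runs(calls, **kwargs):
    """Observed homozygous runs that exceed the shuffled-data threshold."""
    cutoff = null_threshold(calls, **kwargs)
    return [(s, n) for s, n in homozygous_runs(calls) if n > cutoff]
```

Shuffling preserves the overall proportion of homozygous calls while destroying their order, so a run that survives the cutoff is longer than would be expected from the genotype frequencies alone. It does not account for SNP spacing or linkage disequilibrium.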

Normal samples

Extended regions of homozygosity in normal samples

When looking at single nucleotide polymorphisms in normal samples, it is generally thought that the genotypes would be mostly heterozygous, since SNPs by their very nature sample the small, variable portion of the genome that is not common to all humans (Gibson, Morton, & Collins, 2006; Simon-Sanchez et al., 2007).

However, multiple recent studies have shown the opposite: long stretches of homozygosity are present in normal samples, and their occurrence is fairly common (Gibson et al., 2006).

The results of a colorectal study indicate that there were long regions of homozygosity in both the normal and tumor samples (Bacolod et al., 2008). Another study, using healthy subjects from four outbred populations of differing ethnic backgrounds (the HapMap samples), also found these long stretches of homozygosity. This study found a total of 1393 homozygous tracts greater than 1 Mb in length with a minimum SNP density of 1 SNP every 5 kb. In most studies, long stretches of homozygosity were defined as an uninterrupted sequence of at least 1 Mb (Curtis, 2007).
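The tract definition quoted above can be written as a simple filter. The sketch below is illustrative only; the function name and parameters are hypothetical, and it assumes that a tract is described by its start and end coordinates in base pairs together with the number of SNPs it contains. It uses the "at least 1 Mb" convention and interprets the density requirement as an average spacing of no more than 5 kb per SNP.

```python
def is_long_homozygous_tract(start_bp, end_bp, n_snps,
                             min_length_bp=1_000_000,
                             max_spacing_bp=5_000):
    """True if the tract is long enough and densely enough covered by SNPs."""
    length = end_bp - start_bp
    if n_snps <= 0 or length < min_length_bp:
        return False
    # Average spacing between SNPs must not exceed one SNP per 5 kb.
    return length / n_snps <= max_spacing_bp


# Hypothetical tracts: (start, end, number of SNPs)
tracts = [(0, 2_000_000, 500), (0, 2_000_000, 300), (0, 500_000, 500)]
kept = [t for t in tracts if is_long_homozygous_tract(*t)]
```

Only the first hypothetical tract passes: the second is too sparsely covered and the third is too short.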

These extended lengths of homozygosity in normal samples may initially have been thought to result from cytogenetic abnormalities such as uniparental disomy, but further investigation revealed that they are a result of common ancestry or residual consanguinity that can usually be traced to the parents having a common ancestor. This rationale is supported by the observation that individuals with at least one large tract of homozygosity are likely to harbor other large regions of homozygosity (Bacolod et al., 2008; Curtis, 2007; Simon-Sanchez et al., 2007). A block of DNA that is inherited together from generation to generation is known as a haplotype (Curtis, 2007).

Some studies have shown that these haplotype blocks have lower than average recombination rates. This reflects a high level of relatedness in their ancestry, which is too recent to have been influenced by the local recombination intensity. These regions of low recombination would allow longer regions of homozygosity to remain intact and persist over generations, which provides further evidence that these homozygous areas are inherited from a common ancestor. These regions also appear to have lower than normal mutation rates (Gibson et al., 2006).

Thus, the common ancestor explanation for the extended tracts of homozygosity can be summarized in two broad categories: 1) the parents have a relatively recent common ancestor, so recent that the shared segment has not yet been broken up by recombination, especially in areas of low recombination rates; or 2) the parents have a relatively distant common ancestor, but a lack of recombination in the region has enabled the ancestral segment to persist intact (Gibson et al., 2006).

Another observation that supports the common ancestry explanation is that SNPs are thought to be of ancient origin. Since these extended tracts of homozygosity are measured by SNPs, and these SNPs are most likely inherited from ancient ancestors, the tracts are more likely a reflection of ancestry (Gibson et al., 2006).

Another possible minor influence on extended tracts of homozygosity is the pattern of linkage disequilibrium (LD) (Gibson et al., 2006). LD is the tendency for specific alleles to be inherited together more often than would be expected under random segregation. In the human genome, there are regions of strong LD broken up by small regions of intense recombination (Gibson et al., 2006). LD could be another reason for long stretches of homozygous SNPs in normal samples, since regions of LD would inevitably have SNP probes in them. Long tracts of homozygosity are significantly more common in regions with high linkage disequilibrium and low recombination, and the location of tracts is similar across all populations (Beroukhim et al., 2006; Gibson et al., 2006).

It is also important to note that about one of every two to three common SNPs is shared among ethnic populations, including those of African, Asian and European ancestry (Anno, Abe, & Yamamoto, 2008; Guthery, Salisbury, Pungliya, Stephens, & Bamshad, 2007). However, there are also multiple alleles and allele combinations that are particularly associated with one ethnic cohort. It has been found that multiple alleles that contribute to a particular phenotype are on the same chromosome and are likely to form haplotypes. Some haplotypes are present in all populations, and some are population-specific (Anno et al., 2008). When one population has more haplotypes than another (for example, Ashkenazi Jews compared to a northern and western European population), this indicates close, recent common ancestors (Olshen et al., 2008).

Children produced from inbreeding or a consanguineous mating (between two genetically related individuals) have even longer stretches of homozygosity, which further supports the idea that many of these regions are inherited from common ancestors. Interestingly, there appears to be a higher occurrence of parental consanguinity in older North Americans, with >6% of sampled individuals possessing tracts of homozygous genotypes larger than 5 Mb (Bacolod et al., 2008; Curtis, 2007; Gibbs & Singleton, 2006).

Although long stretches of homozygosity are observed in normal human samples, it is important to note that normal samples can still be used for comparative purposes, and that SNPs can still be used to look at cancers. The difference between the normal and cancerous samples is the length of the homozygous stretches. The longer length is usually associated with cancerous samples (Bacolod et al., 2008).

It is thought that longer homozygous segments cover multiple dose-dependent genes, that is, genes that require two copies to function effectively. If a homozygous segment covers one of these genes, only one copy of the gene is present, indicating that the gene is likely not functioning at its fullest capacity. This is one way in which a decreased copy number (only one copy) can lead to disease (Bacolod et al., 2008).

Copy number variants (CNVs) in normal samples

No large stretches of the genome are exempt from copy number variants (Redon et al., 2006). Knowledge of CNVs in normal samples is vital to studying disease, since to correctly interpret genomic data relating to cancer and other diseases, abnormal CNVs must be distinguished from CNVs that frequently occur in the healthy human genome. CNVs reflect the instability of these genomic regions; they disrupt genes and modify gene dosage, which influences gene expression and phenotypic variation. It is interesting to note that CNVs are independent of ethnicity. It has been suggested that this indicates either that they are recurrent events that occurred independently in multiple founders, or that they are evolutionarily ancient and were present in early human populations prior to the separation of the ethnic groups (de Stahl et al., 2008; Redon et al., 2006; Sebat et al., 2004; Sharp et al., 2005).

In a broad sense, there are five types of CNVs: (1) deletions; (2) duplications; (3) deletions and duplications at the same locus; (4) multi-allelic loci; and (5) complex loci whose precise nature is difficult to discern (Redon et al., 2006).

Much research has been done on CNVs in normal samples. Liu et al. observed small copy number variations in presumably normal prostate epithelial cells (Liu et al., 2008). Sebat et al. observed copy number variation of 70 different genes in normal adults, including genes involved in neurological function, regulation of cell growth, regulation of metabolism, and several genes known to be associated with disease (Sebat et al., 2004). De Stahl et al. identified 315 autosomal CNV regions in normal samples, which encompassed approximately 3.5% of the human genome, and most of these regions overlap with genes (de Stahl et al., 2008). Sharp et al. identified 119 regions of copy number variation. They also defined a set of 130 non-redundant regions of potential genomic instability in the human genome, termed rearrangement hotspots. These are regions that are more likely to be genetically unstable, and therefore more likely to have CNVs (Sharp et al., 2005). In our normal samples, we therefore expect to see some long stretches of LOH and some CNVs.

Alterations in the copy number of metabolic and immune-related genes often have profound effects on an individual’s metabolic rate or resistance to environmental pathogens. CNVs that fall within those genes are therefore potentially significant susceptibility factors for some common human diseases, because they may lead to increased dosage of gene products or to the inactivation of a gene affected by a breakpoint position. If these genes are located in genetically unstable regions, they will be more susceptible to CNVs, including deletions and positive selection after gene duplication (Conrad et al., 2006; Piotrowski et al., 2008; Sharp et al., 2005).

CNVs may also affect tumor suppressor genes or oncogenes. When this occurs in the right cell and in the correct window of time, it may lead to increased proliferation and ultimately cancer (Piotrowski et al., 2008). Therefore, further analysis of CNVs is necessary in the context of human genetic diversity and evolution, as well as disease susceptibility (de Stahl et al., 2008).

Twins

It is widely believed that monozygotic (MZ) twins (twins that form from a single zygote, the result of fertilization of one egg by one sperm) are genetically identical, and that phenotypic discordances are attributable to environmental influences that modify the expression of genes while the genetic sequence itself remains identical. However, there is now a growing body of evidence that MZ twins are not always genetically identical. Differing epigenetic modifications within an MZ twin pair can lead to differential expression of genes, including disease-related genes. In some cases, MZ twins can also show different degrees of severity of certain diseases (Gringras & Chen, 2001).

A twin pair will only remain identical if “post-zygotic genetic, post-zygotic epigenetic and post-zygotic environmental factors” affect each twin equally. This would include (but is not limited to) in utero differences that could account for their phenotypic and genotypic differences, such as differences in the anatomical formation of the amniotic sac, chorion and placenta, and the timing of the twinning division: the earlier twinning takes place, the fewer supportive structures the twins will share (Gringras & Chen, 2001).

There is evidence not only that MZ twins may have different gene expression, but also that they may have different karyotypes; that is, the number or morphology of their chromosomes may vary. Each twin may also have individually incurred different mutations over time (Gringras & Chen, 2001).

Bruder et al. present evidence for large-scale CNVs among MZ twins and suggest that these variations may be common, generally occurring in somatic development (Bruder et al., 2008).

Our study uses monozygotic (MZ) twin sisters to illustrate a novel model created in our lab to analyze SNP data. Twins were chosen to test the model on normal samples prior to applying it to cancerous samples. Based on the aforementioned information, the twins in our study are expected to be very similar but not identical, since our assay is able to detect CNVs and variations in LOH. Once a baseline is established for the LOH level in normal samples, we have a control against which to compare future cancer samples.

Gene Expression

Our second hypothesis states that there will be universally expressed genes distributed throughout the genome. To facilitate this investigation, gene expression arrays were used.

Gene expression and cancer

Cancer is a heterogeneous, complex disease, with patient-specific responses to treatment and significant variability in patient outcome. The technology used to explore this heterogeneity must be able to unravel the complexity of the disease, and genome-wide high-throughput technology has promised to do that by revolutionizing cancer research. One such technology is the gene expression array. This technology has not only been able to provide gene expression profiles that are condition- and tissue-specific, but also has the potential to predict treatment response and general prognosis. Within the last few years, the evolution of this technology has already permitted the development of several genetic classifiers, in the form of both groups of genes that are expressed in a specific disease and whole-genome gene expression profiles of a disease. It is believed that in some types of cancer, these profiles can provide a better prognostic classification than the traditional methods used today (Roukos, Murray, & Briasoulis, 2007).

One group, investigating formalin-fixed, paraffin-embedded samples of node-negative, ER-positive breast cancer, used gene expression profiling to create a distant recurrence score that predicts the likelihood of the breast cancer recurring in the future. Their model is a 21-gene expression profile of the primary tumor, which includes the estrogen receptor (ER) gene. Prior to using the gene expression profile, they found that younger patients (those less than 50 years of age) had higher rates of distant recurrence at 10 years than older patients, and that patients with smaller tumors (diameter 2 cm or less) had lower estimated rates of distant recurrence at 10 years than those with larger tumors. However, when the gene expression profile was included, age and tumor size were no longer statistically significant. These results indicate that the genes represented in their model are better indicators of recurrence than age and tumor size. The model also has the power to predict chemotherapy efficacy (Paik et al., 2004; Roukos et al., 2007). This finding is significant since both age and tumor size are routinely used as predictors of recurrence in breast cancer and are incorporated into current treatment guidelines (Paik et al., 2006).

Another breast cancer study used gene expression profiling to predict the outcome of the disease, producing a “prognosis profile”. The patients were all younger than 53 and had either stage I or II breast cancer. There was a relatively even distribution between lymph-node-positive and lymph-node-negative cancers. Regardless of the presence or absence of lymph-node involvement, the prognosis profile had strong predictive power with respect to metastases via the bloodstream, but was independent of metastases via the lymphatic system. On this basis, the authors concluded that the gene expression profile is a more powerful prognostic indicator in young patients than the present standard systems. These findings provide evidence that metastatic potential is acquired relatively early in tumorigenesis, contrary to the widely accepted idea that it is usually a late occurrence, which further illustrates how gene expression profiling can add new knowledge to the field (van de Vijver et al., 2002).

The Netherlands Cancer Institute developed a gene expression signature for distant metastases in breast cancer. This signature, referred to as “MammaPrint” or the 70-gene signature, was used to categorize 295 patients into good and poor prognosis groups, based on their 10-year survival outcome. It was shown to predict distant metastasis and survival of patients with early-stage breast cancer significantly better than conventional clinicopathologic factors, because it added new, independent prognostic information (Roukos et al., 2007).

Gene expression profiles and signatures can, with relative accuracy, predict multiple factors involved in tumorigenesis, including prognosis, distant metastases, recurrence rates and response to treatment. This is because they illustrate patterns of pathway deregulation in tumors and clinically relevant associations with disease outcomes. Since atypical cells, genes and pathways ultimately affect the expression of a variety of genes, this technology allows the researcher to identify oncogenic cells, deregulated pathways and networks of genes that work together to incur a tumorigenic state (Bild et al., 2006; Ding et al., 2008). Although the examples shown here were in breast samples, the same advantage and wealth of knowledge was found in other cancer types including lung and ovary using gene expression arrays (Bild et al., 2006).

This technology has already started moving into the clinical realm where the hope is that it will help to provide cancer patients with personalized treatment (Roukos et al., 2007).

Gene expression and copy number variants (CNV)

Widespread DNA copy number alteration can lead directly to global deregulation of gene expression, which may contribute to the development or progression of cancer.

DNA copy number variations are found widely distributed throughout the cancer genome, and studies have shown that there is a remarkable correlation between gene copy number variations and gene expression in tumor cells. Genes that have a higher copy number than normal have higher expression levels than normal for that gene, and vice versa (Ding et al., 2008; Pollack et al., 2002). Overall, 63% of significantly overexpressed genes also display DNA content gains, and 62% of the down-regulated genes show a noticeable loss of DNA content (Tsafrir et al., 2006).

Multiple studies have shown that there is a significant correlation between copy number variations and gene expression levels, for both duplications (copy number increase) and deletions (copy number decrease). In a breast cancer study by Pollack et al. that examined both breast cancer cell lines and primary tumors, genes were divided into four classes (no change, and low-, medium- and high-level amplification, i.e. copy number increase) and compared with their respective gene expression levels. This study found a statistically significant correlation between these four classes of copy number change and the expression of the respective genes. In addition, a study using 75 different tumor samples showed the same correlation (Ding et al., 2008; Pollack et al., 2002).
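The kind of comparison described above can be illustrated with a small calculation. The Python sketch below is an assumption-laden example rather than the analysis performed in the cited studies: the function names are hypothetical, the statistic shown is a plain Pearson correlation, and the copy number, class and expression values are toy numbers invented for illustration only.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def mean_expression_by_class(classes, expression):
    """Average expression level within each copy number class."""
    groups = {}
    for cls, value in zip(classes, expression):
        groups.setdefault(cls, []).append(value)
    return {cls: mean(values) for cls, values in groups.items()}


# Toy values, invented for illustration (not data from any cited study).
copy_number = [2, 2, 3, 4, 6, 8]
classes = ["none", "none", "low", "medium", "high", "high"]
expression = [1.0, 1.1, 1.6, 2.1, 2.9, 4.2]

r = pearson(copy_number, expression)
by_class = mean_expression_by_class(classes, expression)
```

A value of `r` close to 1 together with rising class means would correspond to the dosage-dependent expression pattern described in the text.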

Investigating copy number variations using SNP and gene expression data simultaneously provides an excellent complementary set of techniques, which enables a more biologically relevant interpretation of the expression data by highlighting the dependence of gene expression on gene dosage (Tsafrir et al., 2006).

There are several studies supporting the positive correlation between copy number changes and gene expression level changes. These studies show that 1) there is a high degree of copy number-dependent gene expression in tumors; 2) gene expression profiles can potentially be used to predict or infer DNA copy number aberrations, particularly aneuploidy; 3) cancer therapies could be designed around the global imbalances in gene expression in cancer relative to the normal cell; and 4) even beyond the amplification of specific oncogenes and deletion of specific tumor suppressor genes, there is a possible role for widespread DNA copy number alterations in tumorigenesis. Our hypothesis builds on these findings, specifically the last, since we think that tumors retain, through translocation and amplification, not only oncogenes but also other genes that are responsible for basic cellular functions, including housekeeping genes.

Despite the large body of evidence that supports the association between copy number alterations and gene expression, the correlation is not always perfect. On occasion, in large areas of DNA gain, the expression of some genes is down-regulated in tumors, or expressed at a similar level as normal tissue (Tsafrir et al., 2006).

Housekeeping Genes

Researchers vary on the detailed definition of housekeeping genes (HK) (or maintenance genes or essential genes), but agree on the basic definition- constitutively expressed genes found in all human cells, that are critical for the maintenance of basal cellular function and reproduction. These functions include cell cycling, intermediary metabolism, gene transcription, protein , cell signaling/communication and structure/motility (Eisenberg & Levanon, 2003; Hsiao et al., 2001; Tu et al., 2006;

Warrington, Nair, Mahadevappa, & Tsyganskaya, 2000). Some researchers define housekeeping genes as genes that are constitutively expressed in both fetal and adult cells, indicating that they are important for development as well as continued survival. This definition may therefore be more encompassing of truly essential genes (Warrington et al., 2000). In another study, Thomas et al. found significant overlap between human oncogenes and tumor suppressor genes and “essential genes”, indicating that many of the HK genes are also oncogenes or tumor suppressor genes (Thomas et al., 2003).

Recently, as their importance has become clearer, housekeeping genes have been investigated in greater detail, including their structure, function and evolution. Eisenberg and Levanon compared the full length of HK genes, and of their parts, including exons, introns, and untranslated regions, with those of non-HK genes. They found that all parts of the housekeeping genes were, on average, significantly shorter than those of other genes (Eisenberg & Levanon, 2003; Tu et al., 2006). Since these genes are constitutively expressed and must constantly be transcribed, being shorter would be beneficial to the organism: shorter transcribed DNA segments mean less chance of mutations occurring (Tu et al., 2006). It could be argued that selection for shorter genes should have eliminated the introns in highly expressed genes. However, this may not happen because introns play important roles, such as splicing regulation. Therefore, there is a balance between the advantageous contribution of the introns and the selective pressure for shortening (Eisenberg & Levanon, 2003).

It has also been found that HK gene expression levels are high (Eisenberg & Levanon, 2003). This can be explained by the results of one study suggesting that HK genes may produce an excess of transcript (Warrington et al., 2000). Warrington et al. compared the abundance of all non-HK transcripts in tissues with the abundance of the HK transcripts, and found that the HK transcripts were relatively more abundant (Warrington et al., 2000). However, other groups point out that HK genes are not necessarily the most highly expressed genes in all tissues (Zhang & Li, 2004).

Based on the definition of HK genes, they are functionally very important for the survival of the organism. Tu et al. have shown that there is stronger selective pressure on these genes than on non-HK genes (Eisenberg & Levanon, 2003; Tu et al., 2006). Stronger selective pressure means pressure for the gene to be kept intact, and pressure against changes or mutations (Thomas et al., 2003). A random mutation in an HK gene will cause a severe phenotype or disease; therefore, strong selective pressure works against that occurrence (Tu et al., 2006). This functional importance also implies that these genes have a slower evolutionary rate than both disease genes and other genes (Tu et al., 2006; Zhang & Li, 2004). The stronger selective pressure and slower evolutionary rate help maintain the fitness of the organism, since genes with more essential functions and a larger effect on fitness experience extremely strong selective pressure (Thomas et al., 2003). Although the mutation rates of HK genes and non-HK genes are identical, HK genes have fewer total mutations because they are shorter (Duret & Mouchiroud, 2000). Also, comparative sequence studies of the coding regions of mouse, human and rat HK and tissue-specific genes have shown slower evolution rates in the HK genes. The slower evolution of HK genes is consistent with increased lethality from coding mutations in this gene class (Duret & Mouchiroud, 2000; Tu et al., 2006; Winter, Goodstadt, & Ponting, 2004).

Since the very essence of HK genes is their important function, researchers have continually examined this aspect of these genes. Their findings indicate that functional descriptors such as pathways, networks and ontological categories can be used as standard tools for the quantitative characterization of gene expression (Dezso et al., 2008). They have also found that HK genes and disease genes differ in function: housekeeping genes are enriched in protein and several other fundamentally important physiological processes, while many disease-related genes are more relevant to sensing and responding to internal/external signals, which are non-essential functions (Tu et al., 2006).

The importance of the HK genes in humans can be further supported by looking at the expression of their homologs in other species closely related to humans; the homologous genes should also be essential in those species (Tu et al., 2006). However, the amount of support that can be obtained in this way may be restricted by the lack of knowledge of essential genes in other higher animals (Tu et al., 2006). Despite that, Thomas et al. were able to classify a small portion of their list of HK genes as embryonic lethal in the mouse: in their set of 3,035 genes, 104 were homologous to mouse essential genes (Thomas et al., 2003). We will also examine our list of essential genes for mouse embryonic lethality.

While HK genes have to be constitutively expressed in all cells of the body, there are also genes that are expressed only at certain times, and tissue-specific genes, which are expressed only in certain types of tissue (Tu et al., 2006). These could potentially be targets and biomarkers for disease treatment and diagnostics (Dezso et al., 2008). In our study we look in great detail at a subset of tissue-specific genes, namely prostate-specific genes.

Prostate cancer

Prostate cancer is the third most common cancer in men throughout the world and the most common cancer among men in North America and some parts of Africa (Huang et al., 2007). In 2008, the estimated number of new prostate cancer cases was 186,320 in the United States alone. This is the highest incidence of new cancers and the second leading cause of cancer-related death in men (Calvo et al., 2005; "Facts and Figures - Breast Cancer 2008," 2008).

Current treatment options for prostate cancer include surgery, which either removes the surrounding lymph nodes but not the prostate, or completely removes the prostate; hormonal treatment, which works by inhibiting the activity of hormones that drive tumor growth and progression; and radiation therapy ("Prostate Cancer Treatment," 2008). These options prolong life, but do not prevent or predict recurrence, and most patients will have recurrence or advanced disease. For example, the problem with hormone therapy (other than the side effects) is that the tumors change their hormone status and stop responding to the treatment (Desai, Jimenez, Kao, & Gardner, 2006). One potential clinical application of our work is the use of targeted tissue-specific genes for anti-cancer therapy. This method may work better than the present treatment plans because it works with the genetic composition of tumors, and affects not only the prostate but also any cells that may have metastasized throughout the body, without significantly affecting non-prostate cells.

Cancer “results from a complex reorganization of multiple genetic pathways and networks which collectively provide the cancer cell with immortality and, all too often, invincibility” (Calvo et al., 2005). Prostate cancer is no exception: its initiation and progression form a multi-step process that includes chromosomal and gene expression alterations leading to uncontrolled cell proliferation, inhibition of apoptosis, invasion, angiogenesis and metastasis (Calvo et al., 2005). Our hypothesis uses this knowledge to look at both genetic instability and gene expression simultaneously.

An interesting example of how abnormal genes influence prostate cancer is the known tumor suppressor gene Retinoblastoma (Rb), which plays a role in prostate carcinogenesis: alterations of both alleles lead to uncontrolled proliferation. This was demonstrated by rescue experiments conducted by Calvo et al., in which DU-145, a prostate cancer cell line with a truncated form of the Rb protein, was transfected with functional Rb and subsequently lost its tumorigenicity (Calvo et al., 2005).

Using SNP technology, one group found frequent monosomy with accompanying gains, losses and recurrent amplification of regions that harbor a number of ‘risk’ SNPs that have previously been identified in prostate cancer. This group was able to identify specific regions of LOH (Liu et al., 2008).

Using gene expression arrays, changes in gene expression caused by either hypomethylation or hypermethylation can be detected. These alterations lead to chromosomal instability and transcriptional gene silencing in many cancers, including prostate cancer (Huang et al., 2007). Using SNP and expression arrays simultaneously allows for more accurate data and sounder conclusions.

Materials and Methods

Cell culture. Only human cell lines were used in this study. The PC-3 cell line (derived from a metastatic bone site of prostate adenocarcinoma) and the DU-145 cell line (derived from a metastatic brain site of prostate carcinoma) were a generous gift from Dr. Steven Selman at the University of Toledo, Toledo, Ohio. Both cell lines were grown as monolayers in RPMI 1640. The MRC-5 (normal lung fibroblast cells), Calu-6 (anaplastic lung carcinoma), H520 (squamous cell lung carcinoma) and A549 (lung carcinoma) cell lines were a generous gift from Dr. James Wiley at the University of Toledo, Toledo, Ohio. The MRC-5 and Calu-6 lines were grown as monolayers in MEM (Minimum Essential Medium). The H520 line was grown as a monolayer in RPMI 1640. The A549 line was grown as a monolayer in DMEM. The RWPE (normal prostate epithelial cells), CCD-34Lu (normal lung fibroblast cells), BUD-8 (normal skin fibroblast), LNCap (derived from a prostate cancer metastatic to a left supraclavicular lymph node) and 22Rv1 (prostate carcinoma) cell lines were obtained from the American Type Culture Collection (www.atcc.com). The RWPE line was grown as a monolayer in keratinocyte-serum free media supplemented with human recombinant EGF (5 ng/ml) and bovine pituitary extract (0.05 g/ml). The CCD-34Lu and BUD-8 cell lines were grown as monolayers in MEM. The LNCap and 22Rv1 lines were both grown as monolayers in RPMI 1640. SUIT-2 (derived from pancreatic adenocarcinoma) was grown as a monolayer in McCoy’s 5A media. These 12 cell lines are summarized in Appendix 1A. All media were supplemented with 10% Fetal Bovine Serum (FBS) (Invitrogen Corporation, Carlsbad, CA) (Nestor et al., 2007) unless otherwise stated. The non-transformed normal human lymphocytes were grown in suspension in RPMI 1640 supplemented with 10% FBS (Invitrogen Corporation, Carlsbad, CA) and 1% Phytohemagglutinin (PHA) (Sigma-Aldrich Corporation, St. Louis, MO).

Lymphocyte extraction from whole blood. Within an hour of blood collection, an equal volume of 0.9% saline was added to the sample and mixed by inversion. Using an 18G needle, 10 ml of Ficoll (GE Healthcare BioSciences, Chalfont St. Giles, United Kingdom) was added to a 50 ml tube, and 10 ml of the sample was gently layered on top of the Ficoll. This was centrifuged at 1600 rpm for 20 minutes at room temperature. The layer of Ficoll was removed and discarded, and the middle buffy coat containing the lymphocytes was extracted and transferred to a fresh tube. Fifteen ml of wash media was added and mixed by gentle inversion. This was centrifuged at 1000 rpm for 10 minutes at room temperature. The wash media was poured off, and the pelleted cells were resuspended in 5 ml RPMI 1640 media supplemented with 10% FBS (Invitrogen Corporation, Carlsbad, CA) and 1% Phytohemagglutinin (PHA) (Sigma-Aldrich Corporation, St. Louis, MO) and grown at 37°C in 5% CO2 for 3 days.

DNA Extraction from cell lines and lymphocytes. DNA was extracted from 5 x 10⁶ cells from a 75 cm² flask in exponential growth as outlined in Nestor et al., 2007. Briefly, the cells were directly lysed with Proteinase (Qiagen, Valencia, CA), and the lysates were loaded onto DNeasy spin columns. The DNA bound to the columns during centrifugation and was then washed twice. Pure DNA was eluted in reduced TE buffer (10 mM Tris HCl, 0.1 mM EDTA, pH 8.0), ready for use. DNA yield for each sample was determined by measuring the concentration of DNA in reduced TE buffer by its absorbance at 260 nm with an Eppendorf Biophotometer (www.eppendorf.com, Westbury, NY). Each sample was adjusted to a concentration of 100 ng/µl for a total of 1 µg dsDNA with an A260/A280 purity ratio greater than 1.8.

250K Nsp SNP microarray and Copy Number Variation. Genomic DNA samples were sent to the Gene Expression and Genotyping Facility (GEGF), which is supported by the Comprehensive Cancer Center of Case Western Reserve University and University Hospitals of Cleveland (P30 CA43703). Genomic DNA was digested with Nsp I, and the fragments were ligated to adaptors, labeled and hybridized to the 250K SNP Chip. Labeled DNA fragments from each cell type were hybridized to GeneChip® Mapping Arrays (www.Affymetrix.com). There are approximately 262,000 probes on this array. All protocols were followed precisely as written and directed by Affymetrix (www.affymetrix.com, Santa Clara, CA) (Affymetrix, 2006-2007). SNP heterozygosity was measured only for chromosomes 1-22, as it could not be assayed for the sex chromosomes in the male-derived cell lines. The same DNA sample from each twin was run in triplicate. Both SNP calls and Copy Number Variations (CNVs) were calculated from SNP hybridization signal intensity data from the experimental sample relative to the intensity distributions derived from a reference set containing over 100 ethnically diverse individuals. Gaussian kernel smoothing was then performed (Affymetrix, 2007).

Pseudo-distribution analysis and Probability Stripes. To detect non-random possible loss of heterozygosity (LOHs) (L-SNP runs), a variation of Boveri’s sorting method (Baltzer, 1967) was used, which creates a pseudo-distribution of L-SNP runs for comparison to the real data. L-SNPs can arise in three ways: 1) “non-informative markers”, in which both parental chromosomes have the same base pair at the SNP site; 2) assay errors; or 3) true LOHs, in which one of the parental alleles is lost from the genome. The pseudo-distribution was constructed by simulating 1000 random arrangements of the LOH and heterozygous SNP calls present in each chromosome under study (one chromosome at a time) to detect non-random runs of consecutive L-SNPs. The longest consecutive L-SNP run in each of the 1000 random arrangements was identified. This was then compared to the longest run in the actual data. A significant L-SNP run (p-value < 0.001) in the actual data was one that was longer than all of the longest L-SNP runs found in 1000 random rearrangements of a given chromosome's SNPs. In other words, the level of significance (p-value) was calculated by counting the number of (longest) random L-SNP runs that were longer than the true L-SNP run (original data). For example, if 10 of the 1000 random runs are longer than the true (raw data) L-SNP run, then it is given a p-value of 10/1000 = 0.01. The number of random rearrangements of a given chromosome can be increased as necessary to find lower p-values, such as p < 0.0001 or p < 10⁻⁵.
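The permutation procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in this study: the function names, the single-letter call encoding ("L" for an L-SNP, "H" for a heterozygous SNP) and the fixed random seed are all assumptions made for the example.

```python
import random


def longest_run(calls, label="L"):
    """Return the length of the longest consecutive run of `label` in `calls`."""
    best = current = 0
    for call in calls:
        current = current + 1 if call == label else 0
        best = max(best, current)
    return best


def pseudo_distribution(calls, n_shuffles=1000, seed=0):
    """Longest L-SNP run in each of `n_shuffles` random rearrangements of one chromosome."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the sketch is repeatable
    shuffled = list(calls)
    longest = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        longest.append(longest_run(shuffled))
    return longest


def run_p_value(observed_run, longest_random_runs):
    """Fraction of rearrangements whose longest run is longer than the observed run."""
    longer = sum(1 for run in longest_random_runs if run > observed_run)
    return longer / len(longest_random_runs)


# Worked example from the text: 10 of 1000 random runs are longer than the true run.
example_p = run_p_value(12, [8] * 990 + [15] * 10)
```

When no random rearrangement produces a longer run, the count is zero and the result is reported as p < 1/n (p < 0.001 for 1000 rearrangements), which is why more rearrangements are needed to resolve lower p-values.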

Probability stripe analysis is an extension of the pseudo-distribution model to objectively detect LOH genomic regions. This method involves re-labeling non-significant LOH regions as heterozygous and filtering out non-significant heterozygous regions. The first step of this method was to set a “stripe width” for each chromosome in each cell line. The stripe width is obtained from the pseudo-distribution for that chromosome, and is the length of the L-SNP run with a p-value of 0.001 (i.e. the longest L-SNP run in the pseudo-distribution). SNPs are subsequently assigned to probability stripes as follows. The length of each L-SNP run is divided by the chromosome's stripe width. If the run is shorter than the stripe width, the SNPs within the run are re-labeled as heterozygous SNPs. In other words, these L-SNP runs are too short to be considered statistically significant, and since they are surrounded by heterozygous SNPs, the LOH designations are likely erroneous. If the run is at least one stripe width long but shorter than two stripe widths, it is labeled as a single LOH probability stripe. If the length of the run is ≥ 2 stripe widths, two stripes are established, where the first is 1 stripe width long and the second is 1 stripe width plus any remainder of the run. For example, if the probability stripe width of a given chromosome is 20, then runs of 20-39 L-SNPs would be classified as one LOH stripe, runs of 40-59 L-SNPs as two LOH stripes, and so on.

After all of the L-SNP runs were transformed, the heterozygous SNP (H-SNP) runs were transformed into heterozygous probability stripes (HET stripes) in a similar fashion. The only difference is that heterozygous SNP runs that were statistically insignificant in length were removed from the chromosome, rather than being converted to L-SNPs. Because the surrounding L-SNP runs had already been transformed, any short L-SNP runs had already been converted to heterozygous SNPs. In other words, a statistically insignificant heterozygous run is surrounded by highly probable LOH regions; therefore, it was filtered out of the chromosome as erroneous. As a result of probability stripe transformation, each chromosome comprises only significant heterozygous and LOH regions.

SNP sites for which an allele cannot be determined are called “No calls”. No calls are simply deleted in our algorithm.

Probability stripe transformation accomplished two principal tasks: 1) it re-assigned SNPs as heterozygous when their designation as an LOH was probabilistically unlikely, and 2) it converted the genomic maps of the chromosomes (Appendix 9) from individual SNPs into larger functional regions, that is, probability stripes, which facilitates visual assessment of the chromosomes.
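The transformation steps described in this section can be summarized as a short Python sketch. It is illustrative only and is not the laboratory's implementation: the call encoding ("L", "H", and "N" for a no call), the separate `het_width` parameter for heterozygous runs, and the `(label, length)` output format are assumptions made for the example.

```python
from itertools import groupby


def runs(calls):
    """Collapse a sequence of calls into [label, run_length] pairs."""
    return [[label, len(list(group))] for label, group in groupby(calls)]


def merge_runs(run_list):
    """Join neighbouring runs that carry the same label."""
    merged = []
    for label, length in run_list:
        if merged and merged[-1][0] == label:
            merged[-1][1] += length
        else:
            merged.append([label, length])
    return merged


def split_into_stripes(length, width):
    """floor(length / width) stripes; the last stripe absorbs any remainder."""
    n_stripes = length // width
    return [width] * (n_stripes - 1) + [width + length % width]


def probability_stripes(calls, loh_width, het_width):
    # No calls ("N") are simply deleted.
    calls = [c for c in calls if c in ("L", "H")]
    # Step 1: L-SNP runs shorter than the stripe width are re-labeled heterozygous.
    relabeled = merge_runs(
        [["L" if (label == "L" and n >= loh_width) else "H", n]
         for label, n in runs(calls)]
    )
    # Step 2: heterozygous runs that are too short are removed from the chromosome.
    kept = [[label, n] for label, n in relabeled if label == "L" or n >= het_width]
    # Step 3: every remaining run is cut into probability stripes.
    stripes = []
    for label, n in kept:
        width = loh_width if label == "L" else het_width
        stripes.extend((label, size) for size in split_into_stripes(n, width))
    return stripes


toy_calls = "H" * 30 + "L" * 5 + "H" * 10 + "NNN" + "L" * 45 + "H" * 4 + "L" * 20
toy_stripes = probability_stripes(toy_calls, loh_width=20, het_width=10)
```

In this toy chromosome the run of 5 L-SNPs is absorbed into the surrounding heterozygous region, the run of 4 heterozygous SNPs between two significant LOH runs is dropped, and the 45-SNP LOH run becomes two stripes of 20 and 25 SNPs.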

Spectral Karyotyping (SKY). An actively growing culture of the DU-145 cell line was sent to the SKY Core Resource, Roswell Park Cancer Institute (Buffalo, NY). SKY analysis was performed on 36 karyotypes of the DU-145 cell line. The procedure followed Nestor et al., 2007. Briefly, cells were treated with 0.06 µg/ml Colcemid (Sigma-Aldrich Corporation, St. Louis, MO) for 2-4 hours, and harvested. The cells were treated with Cancer Hypotonic Solution (CHS) and then with Carnoy’s fixative (3:1 methanol:glacial acetic acid). Chromosome spreads were prepared using air-drying methods. After sequential digestion with RNAse and pepsin according to the procedure recommended by Applied Spectral Imaging, Inc. (ASI, Vista, CA), the chromosome preparations were denatured in 70% formamide and hybridized to the human Spectral Karyotyping (SKY) paint probes. An algorithm combines the fluorochromes, and images are captured using a Nikon microscope equipped with a spectral cube and interferometer module (ASI, Vista, CA) (Nestor et al., 2007).

RNA extraction from cell lines. Cells were lysed by adding 7.5 ml TRIzol reagent (Invitrogen Corporation, Carlsbad, CA) to a 75 cm² flask and manually scraping the cells. The lysate was transferred to a 15 ml tube and incubated at room temperature for 5 minutes. Chloroform (1.5 ml) was added, and the tube was shaken vigorously for 15 seconds, followed by a 3 minute incubation at room temperature. To separate the three phases, the sample was centrifuged at 12,000 x g for 15 minutes at 4°C. The colorless aqueous phase (top), containing the RNA, was transferred to a fresh tube and 3.75 ml isopropyl alcohol was added. This was incubated at room temperature for 10 minutes and then centrifuged at 12,000 x g for 10 minutes at 4°C. The supernatant was removed, and the pellet was washed with 75% ethanol and centrifuged at 12,000 x g for 5 minutes at 4°C. Finally, the supernatant was removed and the RNA was left to air dry for 10 minutes. The RNA was dissolved in 50 µl RNAse-free water and stored at -80°C.

Human Genome U133 plus 2.0 Gene Expression array. Samples of total RNA were reverse transcribed into cDNA using an oligo dT primer, containing the T7 polymerase promoter sequence at the 5’ end. Second strand cDNA (-DNA) was synthesized to use as a template for in vitro transcription via the T7 RNA polymerase, thus generating cRNA in the presence of biotinylated ribonucleotide tri-phosphates

(rNTPs). The labeled cRNA from each sample was individually hybridized onto the oligo-chip, and the signal was detected by the biotin-streptavidin-fluorochrome method (Calvo et al., 2005). The total number of probes on the Human Genome U133 plus 2.0 Gene Expression chip is 56,000, representing approximately 33,000 genes. All protocols were followed precisely as written and directed by Affymetrix (www.affymetrix.com, Santa Clara, CA) (Affymetrix, 2004). Expression analysis of the calls categorized the individual probes as present (including marginal expression) or absent. Degrees of positive expression were not considered in this analysis.

Databases for cell lines and tissues for in-silico analysis. Gene expression data for 198 cell lines, from both cancerous and normal tissue sources, were obtained from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) (Barrett et al., 2006; Edgar, Domrachev, & Lash, 2002), and are summarized in Appendix 1B. Gene expression data for 2035 tissues from different types of cancers were obtained from expO (Expression Project For Oncology) (http://www.intgen.org/expo.cfm).

Exon/intron data were obtained from the EID Database (http://hsc.utoledo.edu/depts/bioinfo/database.html). Genes were mapped to EID entries using the gene symbols for the probe sets. Wilcoxon rank sum p-values (R version 2.8.1) were calculated by combining all exon and intron lengths for each gene.

Gene Expression Index (GEI) Calculation. To identify genes that were expressed in one group of cell lines relative to another, our laboratory developed a simple formula. As an example, tissue-specific gene expression in prostate cell lines is detected as follows:

GEI = (number of prostate cell lines expressing gene X) - (number of non-prostate cell lines expressing gene X)

Since there were only 5 prostate cell lines in our experimental dataset of a total of 12 cell lines, the highest GEI possible was 5.
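As a worked illustration of the GEI, the sketch below computes the index from present/absent calls. The function name and data layout are assumptions, and the gene names and calls are invented toy values rather than data from this study; only the cell line names are taken from the cell culture section.

```python
def gene_expression_index(gene, present_calls, target_lines):
    """GEI = (# target cell lines expressing gene) - (# other cell lines expressing gene)."""
    in_target = sum(1 for line in target_lines if gene in present_calls[line])
    in_other = sum(
        1 for line, genes in present_calls.items()
        if line not in target_lines and gene in genes
    )
    return in_target - in_other


PROSTATE = {"PC-3", "DU-145", "LNCap", "22Rv1", "RWPE"}
OTHER = {"MRC-5", "Calu-6", "H520", "A549", "CCD-34Lu", "BUD-8", "SUIT-2"}

# Toy present calls: GENE_P is called present only in prostate lines,
# GENE_U is called present in all 12 lines (hypothetical genes).
present_calls = {line: {"GENE_P", "GENE_U"} for line in PROSTATE}
present_calls.update({line: {"GENE_U"} for line in OTHER})

gei_specific = gene_expression_index("GENE_P", present_calls, PROSTATE)   # 5 - 0
gei_universal = gene_expression_index("GENE_U", present_calls, PROSTATE)  # 5 - 7
```

With 5 prostate and 7 non-prostate cell lines, the index ranges from 5 (expressed only in prostate lines) down to -7 (expressed only in non-prostate lines); a universally expressed gene scores -2.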

Housekeeping genes list generation. A list of 3,516 housekeeping genes was compiled from 5 publications (Dezso et al., 2008; Eisenberg & Levanon, 2003; Hsiao et al., 2001; Tu et al., 2006; Warrington et al., 2000), each with a different number of housekeeping genes in its list. The lists consisted of 451, 535, 575, 1789 and 2374 genes; after all repeated genes were removed, the final list consisted of 3,516 genes.
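The list compilation amounts to a union of the five published lists. The sketch below shows the idea on toy lists; matching genes by identical symbol is an assumption, and the gene names are placeholders. The arithmetic at the end uses only the list sizes reported above.

```python
def merge_gene_lists(gene_lists):
    """Combine several gene lists, keeping each gene symbol once (first-seen order)."""
    merged = []
    seen = set()
    for genes in gene_lists:
        for gene in genes:
            if gene not in seen:
                seen.add(gene)
                merged.append(gene)
    return merged


# Toy lists with placeholder gene names (not the published lists).
toy_lists = [["G1", "G2", "G3"], ["G2", "G4"], ["G3", "G4", "G5"]]
merged = merge_gene_lists(toy_lists)
removed = sum(len(genes) for genes in toy_lists) - len(merged)

# Sizes reported in the text: five lists in, one de-duplicated list out.
total_entries = 451 + 535 + 575 + 1789 + 2374
repeats_removed = total_entries - 3516
```

The five published lists hold 5,724 entries in total, so 2,208 repeated entries were removed to reach the final 3,516 genes.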

Mouse Embryonic Lethal Gene List. A list of 16,764 mouse genes with human orthologs was obtained from the Mouse Genome Informatics (MGI) website (HMD_Human5) (www.informatics.jax.org). A separate list of 2,429 mouse embryonic lethal genes was also obtained from MGI. These two lists were compared to identify the genes common to both, yielding a list of 1,955 mouse embryonic lethal genes with human orthologs.
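The comparison of the two MGI lists is an intersection keyed on the mouse gene. A minimal sketch follows, assuming the ortholog list is held as a mouse-to-human mapping; the function name and the gene symbols are hypothetical.

```python
def lethal_genes_with_human_ortholog(mouse_to_human, embryonic_lethal):
    """Return {mouse_gene: human_ortholog} for lethal mouse genes that have a human ortholog."""
    return {
        mouse_gene: mouse_to_human[mouse_gene]
        for mouse_gene in sorted(embryonic_lethal)
        if mouse_gene in mouse_to_human
    }


# Toy inputs with placeholder symbols (not MGI data).
mouse_to_human = {"Ma": "HA", "Mb": "HB", "Mc": "HC"}
embryonic_lethal = {"Mb", "Mc", "Md"}
common = lethal_genes_with_human_ortholog(mouse_to_human, embryonic_lethal)
```

Here "Md" is embryonic lethal but has no human ortholog in the mapping, so it is excluded, in the same way that 474 of the 2,429 embryonic lethal genes did not appear in the final list of 1,955.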

Dilution Experiment. DNA was extracted from PC-3 (prostate cancer cell line) and RWPE (benign prostate cell line), and each was diluted to 100 ng/µl. Specific percentages of DNA from the normal cell line (RWPE) were added to the cancer line according to Table I. The total number of H-SNPs gained in an otherwise homozygous region was recorded.

Table I. Dilution Percentages for RWPE (benign prostate cell line).

Dilution %    PC-3 (µl)    RWPE (µl)
0             50           0
1             45.5         0.5
5             47.5         2.5
10            45           5
20            40           10
25            37.5         12.5
100           0            50

Results

Probability Stripes

Normal samples- twins. The probability stripes algorithm (see Materials and Methods for a detailed description) was applied to fresh DNA harvested from lymphocyte cultures of a pair of monozygous twins. This was done to obtain a baseline of loss of heterozygosity (LOH) in normal samples and to test the pseudo-distribution and probability stripes algorithms, using a p-value of 0.001 (see Materials and Methods). We noticed differences not only between the twins (which may have been normal), but also between the replicates of identical DNA samples (Table II). Compared to the raw SNP data (Table II), there was a striking difference after the application of the probability stripes algorithm, specifically with regard to the percent similarities between the replicates of each twin (Table III). Thus, the probability stripes algorithm decreased the variability in LOHs detected in replicate DNA samples and between monozygous twins (Tables II, III).

Table II. Percent similarity of the raw SNP data amongst, and between, twin E and twin O, before the probability stripes algorithm was applied.

            Twin E 2   Twin E 3   Twin O 1   Twin O 2   Twin O 3
Twin E 1    91%        91%        89%        91%        91%
Twin E 2               95%        90%        94%        94%
Twin E 3                          89%        94%        95%
Twin O 1                                     89%        89%
Twin O 2                                                93%

Table III. Percent similarity of stripes amongst and between twin E and twin O, after the probability stripes algorithm was applied.

            Twin E 2   Twin E 3   Twin O 1   Twin O 2   Twin O 3
Twin E 1    98%        98%        97%        98%        98%
Twin E 2               99%        97%        98%        98%
Twin E 3                          97%        98%        99%
Twin O 1                                     97%        97%
Twin O 2                                                98%

Percentages shown in red are the percent similarities between replicates, which are much different from what they were prior to the probability stripes algorithm being applied. The numbers in the headings indicate each sample replicate.

Appendices 2 and 3 show application of the probability stripes algorithm at a p-value of 0.001 to the individual repeat samples for each twin. There are many visible differences between the probability stripes of the three replicates of identical DNA for each twin. After the probability stripes algorithm (p = 0.001) was applied to the average SNP values of all three repeats for each twin, we still found significant regions of LOH, which remained in the same chromosomal areas for both twins (Figures 2a and b). In short, many of the chromosomes are very similar between the twin “E” and twin “O” averages, although minor differences are still found (Figures 2a and b). The LOHs found in the normal chromosomes of both twins are most likely due to residual consanguinity. Of interest, the blank spaces seen in chromosomes 1, 9, 16 and at the beginning of chromosomes 13, 14, 15, 21 and 22 indicate areas without SNP coverage. In the latter chromosomes this is because they are acrocentric chromosomes, without distinct p chromosomal arms.

In summary, any study of cancer LOHs must take these naturally occurring LOHs into consideration. This would require the performance of SNP-LOH assays of non-transformed, healthy DNA samples from patients whose tumors are being studied.

Figure 2. Average probability stripes at a 0.001 p-value for each chromosome. a. Twin “E”; b. Twin “O”.

Normal samples- twins. Copy Number Variation. As a quick test of copy number variants (CNVs) in normal samples, we randomly chose 6 chromosomes (chromosomes 3, 5, 10, 15, 18, 21) and looked for copy number variants in each twin. Copy number was measured using the Copy Number Analysis Tool (CNAT), which is part of the SNP analysis software (see Materials and Methods). Even in this small sampling of chromosomes, we found significant regions of copy number variation on 2 of the 6 chromosomes, most of which were common to both twins, and one of which was not (Figure 3 shows only the 2 chromosomes with significant CNVs). This test was done using an average of the three replicates for each twin. In summary, any study of CNVs in cancers must take the CNVs normally occurring in non-transformed tissues into account.

Figure 3. Significant Copy Number Variants (CNVs) in the twins in chromosomes 15 and 21. Panels: Chromosome 15 and Chromosome 21, each showing Twin E and Twin O along a position axis from 0 to 100 Mb. Markers indicate deletions (copy number of 1) and duplications (copy number of 3). There are areas of copy number variants in some chromosomes in normal human samples.

Contamination. Dilution Experiment. To determine how sensitive the SNP assay is, we performed a simple dilution experiment (see Materials and Methods). Chromosome 9 is shown as an example in Figure 4. In PC-3, this region of chromosome 9 is homozygous. Figure 4 illustrates that as higher concentrations of RWPE (benign prostate cell line) DNA are added, the number of heterozygous SNPs increases, indicating that contamination from normal cells in a cancer sample is detected by the SNP assay and can create false heterozygous regions. The area of chromosome 9 that was completely homozygous gradually shows areas of heterozygosity as more normal RWPE DNA is added. Contamination as low as 1% can be detected with the SNP assay. When studying CNVs and LOHs in cancer cells, these results highlight the importance of using normal controls and of ensuring that the cancer samples are free of normal cell influences.

Figure 4. An example of the percentage of heterozygous SNPs (represented by the different color dots) as different percentages of RWPE (benign prostate cell line) were added to PC-3 (prostate cancer cell line). This is a portion of chromosome 9. Panels: 100% RWPE (0% PC-3); 25% RWPE; 20% RWPE; 10% RWPE; 5% RWPE; 1% RWPE; 0% RWPE (100% PC-3).

DU-145 prostate cancer cell line. The probability stripes algorithm allows clear visualization of the LOH regions in the DU-145 cancer cell line (Figure 5).

Figure 5. Probability stripes of DU-145

The probability stripes algorithm was applied to the DU-145 prostate cancer cell line. There are multiple significant areas of LOH in this cancer cell line, indicating several areas of genetic instability. The spaces seen in chromosomes 1, 9, 16 and the beginning of 13, 14, 15, 21 and 22 indicate areas of non-SNP coverage. In the latter chromosomes this is because these are acrocentric chromosomes, and do not have a distinct p chromosomal arm.

DU-145 shows extensive areas of LOH. Chromosomes 13 and 22 have completely lost one homolog, as is apparent from the entirely blue chromosomes. There are no completely heterozygous chromosomes in DU-145, although some are considered heterozygous because they have fewer blue areas than the other chromosomes, in which more than 25% of the chromosome is blue. These include chromosomes 2, 5, 7, 8, 14, 15, 17, 20 and 21, which appear as mostly red chromosomes. The remaining chromosomes are all broken chromosome LOHs, that is, chromosomes that have lost some portion of one homolog while still maintaining a complete haploid genome.

DU-145 Spectral Karyotyping and SNP data. Spectral Karyotyping (SKY) was used to look at chromosome copy numbers and chromosomal abnormalities of the DU-145 prostate cancer line. The average number of normal homologs and marker chromosomes per karyotype for DU-145 is summarized in Appendix 4. There was an excellent correlation between the SKY data and the probability stripes for the DU-145 cell line. The numbers of SKY-detected normal homologs and/or marker chromosomes (abnormal chromosomes found in greater than 60% of the cells examined) maintain at least a haploid map, as seen in the probability stripes for the chromosomes in DU-145 (Figure 5).

The blue areas in the probability stripes of the DU-145 chromosomes in Figure 6 represent homologous haploid chromosomal regions, or DNA from only one parental chromosome in that part of the karyotype, and the red areas represent heterozygous chromosomal areas. Briefly, the probability stripes represent only the chromosome shown, with the heterozygosity or homozygosity applying only to that chromosome and not representing any translocated parts of other chromosomes. As an example, the DU-145 cell line shows a broken chromosome LOH, as seen in the partly homozygous and partly heterozygous probability stripe for chromosome 6 only (Figure 6). Similarly, chromosomes 1 and 11 in DU-145 are also broken chromosomes, and study of the probability stripes shows which homolog or translocation the heterozygosity is derived from. DU-145 (Figure 6) also has a whole homolog loss, as seen by both the haploid probability stripes (all blue) and the lack of a completely intact normal-appearing homolog by SKY.

Figure 6. DU-145- Probability stripes and SKY correlation in both whole homolog loss and broken chromosome LOHs

(Panels: chromosome 13, whole homolog loss; chromosome 6, broken chromosome; chromosome 1, broken chromosome; and one further panel labeled as a broken chromosome.)

Areas of blue in the probability stripes are LOH, corresponding with only one genetically unique parental homolog in the SKY pictures. The red areas in the probability stripes are heterozygous regions, and correspond to the presence of 2 genetically unique parental homologs in the SKY pictures.

Gene Expression

Universally expressed genes. Cells. We first looked at the expression data of 12 experimental cell lines (summarized in Appendix 1A) prepared in this laboratory to determine the total number of genes that were expressed by all of these cell lines. The microarray chip has a total of approximately 56,000 probe sets, represented as the starting point on the left side of the graph (Figure 7). As more cell lines were investigated, the total number of genes expressed by all samples decreased, until there was a relative plateau at 13,123 probe sets, corresponding to 7,742 genes (Figure 7, right side of graph).

Figure 7. Congruent expression in 12 experimental cell lines

(Y-axis: total number of probe sets, 0 to 60,000. X-axis: cell lines, beginning with the total number of probe sets on the chip, followed by PC-3, A549, H520, 22Rv1, Calu-6, BUD-8, RWPE, LNCap, MRC-5, SUIT-2, DU-145 and CCD-34Lu.)

The probe sets expressed in all 12 cell lines total 13,321, representing 7,742 genes most likely required for the survival of these cell lines.

We then looked at the expression results from the 198 cell lines publicly available from the GEO database (see Materials and Methods), and compared them to our 12 cell lines (Figure 8). This comparison was done by dividing the range of 198 cell lines into 13 intervals, with each gene assigned according to the number of cell lines in which it is expressed. For example, the cumulative number of genes expressed in 1 to 18 samples would be in interval 1, those expressed in 19 to 36 samples would be in interval 2, and so on. We see that between intervals 2 and 10 (variably expressed genes), on average, there are approximately 485 genes expressed in the experimental cell lines, and 721 genes expressed in the GEO database cell lines.

Figure 8. Gene expression comparison between our 12 experimental cell lines and 198 database cell lines (GEO)

(Y-axis: number of probe sets, 0 to 7,000. X-axis: expression frequency intervals. Series: in silico (GEO) and experimental lines.)

Appendix 5 has a breakdown of the number of genes expressed in these intervals in both datasets. The number of genes expressed in intervals 2-10 (variably expressed) is relatively similar between our 12 cell lines and the database of 198 cell lines. However, there is a discrepancy at the end intervals, 0 and 1, and 11 and 12. The decreased numbers of genes in the 0 and 12 expression intervals and the increased numbers of genes in the 1 and 11 expression intervals of the in silico (GEO) database (Figure 8) probably reflect the accumulation of false positive and false negative assay errors at both ends of the larger GEO database (see Appendix 10 for a detailed explanation).

Further comparison between the two datasets revealed 1,918 probe sets representing 1,508 genes expressed by all 198 in-silico cell lines and all 12 experimental cell lines.

Variably Expressed Genes. Cells. While there is a group of genes that are expressed in all the cell lines, there is also a group that is only expressed in some cell lines. These are the variably expressed genes, which are genes expressed in at least 1, but not all 210 cell line samples. Figure 9 illustrates 670 genes (from 2,150 probe sets) that are commonly expressed between both the experimental and GEO database cell lines, in intervals 2-10 (variable genes). Appendix 5 contains the number of genes in each dataset and the number of genes in common between both datasets.

Figure 9. Variably expressed genes in common between both the 12 experimental cell lines and the 198 GEO cell lines

(Y-axis: number of genes, 0 to 200. X-axis: expression frequency intervals, 0 to 12.)

Variably Expressed Genes. Tissues Only (not in cell lines). We then looked at the genes expressed in the 2,035 tissues of the expO database. Here again, the 2,035 tissues were divided into 13 intervals (0-12) based on the number of samples a gene is expressed in. For example, interval 0 contains all genes expressed in 1-157 samples; interval 1 contains all genes expressed in 158-313 samples, and so on. Figure 10 shows 173 genes (represented in Figure 10 as 205 probes) that are expressed in at least 158 tissues but not in any cell lines.
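The interval assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the bins are assumed to be of equal width over the sample range, which reproduces the 1-157 range given for interval 0 but may differ by a sample or so from the exact boundaries used elsewhere in this work.

```python
from collections import Counter

N_INTERVALS = 13  # intervals 0-12, as described in the text


def expression_interval(n_expressing, n_samples, n_intervals=N_INTERVALS):
    """Return the interval index for a gene expressed in n_expressing samples."""
    if not 1 <= n_expressing <= n_samples:
        raise ValueError("n_expressing must be between 1 and n_samples")
    # Assumption: equal-width bins over 1..n_samples (hypothetical rule,
    # chosen so that 1-157 of 2,035 samples falls in interval 0).
    return min(n_intervals - 1, (n_expressing - 1) * n_intervals // n_samples)


def interval_histogram(samples_per_gene, n_samples):
    """Count genes per interval; genes expressed in no sample are skipped."""
    return Counter(
        expression_interval(n, n_samples)
        for n in samples_per_gene.values()
        if n > 0
    )
```

With 2,035 tissues, a gene expressed in 157 samples falls in interval 0 and a gene expressed in 158 samples falls in interval 1 under this rule.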

Figure 10. Number of genes expressed in 158 through 2,035 tissues

(Y-axis: number of probes, 0 to 80. X-axis: 158 to 2,035 tissues, expression intervals 1 – 12.)

Probes representing genes expressed only in tissues, but not in any cell lines.

Variably Expressed Genes. Cell lines and Tissues. A comparison was done between the genes variably expressed in cell lines (intervals 2-10, Figure 9) and all 2,035 tissues, and we found 310 probe sets, representing 291 different genes, which were variably expressed in both cell lines and tissues (Figure 11).

Figure 11. Variably expressed genes common between the tissues and the cell lines

(Y-axis: number of genes, 0 to 60. X-axis: expression frequency interval, 0 to 12.)

Universally Expressed Genes. Cells and Tissues. The 1,508 genes found to be expressed in all 210 cell lines were then compared to the tissue database consisting of 2,035 tissues, to obtain genes in common between the cell lines and tissues. The criterion for comparison was that the gene must be expressed in all 210 cell lines and also expressed in at least 99.75% of the 2,035 tissues. We found 973 probe sets, corresponding to 778 different genes, that are expressed in all 210 cell lines and in 99.75-100% of the 2,035 tissues (Figure 12, red bar). These are genes for cell growth and survival and will be referred to as “universally expressed genes”, since they are expressed in all of the cell lines and most of the tissues in this study.
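As a minimal sketch of the selection criterion just described, the check below tests one gene at a time. The function name and the representation of expression calls as lists of booleans are assumptions made for illustration; the actual analysis pipeline is not reproduced here.

```python
def is_universally_expressed(cell_line_calls, tissue_calls,
                             min_tissue_fraction=0.9975):
    """Apply the criterion from the text to one gene.

    cell_line_calls and tissue_calls are lists of booleans (hypothetical
    representation): True where the gene is called expressed in a sample.
    """
    if not cell_line_calls or not tissue_calls:
        raise ValueError("both call lists must be non-empty")
    # The gene must be expressed in every cell line ...
    in_all_cell_lines = all(cell_line_calls)
    # ... and in at least 99.75% of the tissues.
    tissue_fraction = sum(tissue_calls) / len(tissue_calls)
    return in_all_cell_lines and tissue_fraction >= min_tissue_fraction
```

For example, a gene expressed in all 210 cell lines and 2,033 of 2,035 tissues passes, whereas a gene missing from a single cell line does not, regardless of its tissue expression.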

Figure 12. Expression in 210 cell lines and 99% of the 2,035 tissues

(Y-axis: number of probe sets, 0 to 1,000. X-axis: number of tissues expressing genes, interval 12.)

Number of tissues    Percent of samples
2016 to 2020         99.00 - 99.25%
2021 to 2025         99.25 - 99.50%
2026 to 2030         99.50 - 99.75%
2031 to 2035         99.75 - 100.0%

The red bar represents the common probe sets among most of the 2,245 samples, which represent 778 genes.

We conducted five tests to investigate these 778 universally expressed genes in order to answer the following questions:

1) Locations of these genes – were these genes clustered to specific chromosomes or were they distributed relatively evenly throughout the genome?

To address this question, we created ideograms of the chromosomal locations of each of the genes (Figure 13). The red lines are the 778 universally expressed genes for growth and survival, and the blue lines are the rest of the genes represented on the chip, common to both tissues and cell lines, which were used as controls. It is clear that the 778 genes are distributed throughout the genome, on all chromosomes except Y.

Figure 13. Locations of the universally expressed genes

(Ideograms of chromosomes 1 to 22, X and Y.)

The 778 universally expressed genes are distributed throughout the genome.

2) Housekeeping genes- are our universally expressed genes housekeeping genes?

To address this question, we compiled a list of 3,516 housekeeping genes from various sources in the literature (See Materials and Methods, Housekeeping genes), and found that 600 of our 778 genes were in common with the list of 3,516 housekeeping genes. The list of the gene symbols of our 778 genes divided into housekeeping and non-housekeeping genes can be found in Appendix 6A and B.

3) Functional analysis- were these universally expressed genes involved in essential functions?

To delve into this question, we used the Gene Ontologies (GO) database (Affymetrix- NetAffx Annotations) (www.affymetrix.com), which consists of three separate ontologies: molecular function, biological process, and cellular component. Based on the information in these sections, we classified the genes as having intracellular metabolic functions (performed in all cells and tissues, and thus essential) or differentiated/developmental functions (performed only in specialized cells and tissues) (Figure 14). This figure shows that a significantly larger percentage of the universally expressed genes have intracellular metabolic functions, or functions that are important to all cells and tissues, when compared to the genes expressed only in tissues or the variably expressed genes. Appendix 7 contains a list of all the gene symbols and their individual functional classification.

Figure 14. Functional analysis of the genes in the universally expressed genes, tissues only and variably expressed genes classes

(Y-axis: percent of genes, 0 to 100. Bars: intracellular metabolic genes and developmental/differentiated genes for each class. Number of genes per class: universally expressed, 778; tissue only, 183; variable, 291.)

p-values           Univ. expressed   Tissue           Variable
Univ. expressed    -----             2.51 x 10^-54    1.40 x 10^-43
Tissue             2.51 x 10^-54     -----            0.0006
Variable           1.40 x 10^-43     0.0006           -----

Intracellular Metabolic Genes were genes that performed functions in all cells, while Developmental/Differentiated genes were genes that performed more specialized functions.

4) Embryonic lethality- were these genes embryonic lethal in the mouse?

We looked at this question by comparing our list of 778 universally expressed genes with the generated list of 1,955 mouse embryonic lethal genes with human orthologs (See Materials and Methods, Mouse Embryonic Lethal Genes). This yielded 140 of 778 (18%) of our universally expressed genes as orthologs of known embryonic lethal genes in the mouse, which was similar to the 49 of 291 (17%) from the variably expressed genes. A list of the universally expressed genes that may be embryonic lethal can be found in Appendix 8. This also supports the conclusion that the universally expressed genes have important cell survival and growth functions. A comparison of the 1,955 mouse embryonic lethal genes with human homologs to the tissue specific genes yields only 14 of 183 (8%) genes, indicating that few genes in the tissue specific class are essential for cell survival and growth.

5) Gene Lengths- what were the lengths of the introns and exons in each of these genes?

To address this question, we examined the full length of the introns and the exons for each gene investigated. We see that there is a significant (p = 1.46 x 10^-9) difference in the composite length of the introns and exons between the universally expressed genes for growth and survival and the variably expressed genes. However, there does not appear to be a significant difference between the tissues only genes and the universally expressed genes (Figure 15).

Figure 15. Comparison of the composite length of introns and exons in the different groups of genes- universally expressed, tissue only or variable.

(Y-axis: average composite length (Kbp), 0 to 100. X-axis: universally expressed, tissue only, variable. Sample counts shown on the figure: 146, 270 and 698.)

p-values           Univ. expressed   Tissue          Variable
Univ. expressed    -----             0.9747          1.46 x 10^-9
Tissue             0.9747            -----           8.62 x 10^-5
Variable           1.46 x 10^-9      8.62 x 10^-5    -----

Fewer samples are seen here due to the lack of introns/exons length information available for all the genes that we originally investigated.

Tissue-specific Genes. A further look into the variably expressed genes in both cells and tissues revealed subsets of genes that were expressed specifically in different tissue types (data not shown). We looked in greatest detail at genes specifically expressed in prostate cells, and this work is detailed below.

Prostate specific Genes. Cells. In our 12 experimental lines, we looked at the gene expression index (See Materials and Methods, Gene Expression Index), as a way to compare between the prostate specific genes and those of non-prostate cell types. A GEI of 5 indicates genes that are expressed in all 5 prostate cell lines, and no non-prostate cell line; a GEI of 4 indicates either 4 prostate and no non-prostate cell line, or 5 prostate and 1 non-prostate cell line (Figure 16). The five prostate cell lines were of epithelial origins and the five lung cell lines consisted of three epithelial and two mesenchymal (fibroblast) cell lines. There was a significant difference (p = 1 x 10^-4) when the genes in the prostate specific group were compared to the lung specific group. There were 46 genes (represented by 66 probe sets) in the prostate compared to 19 genes (represented by 28 probe sets) in the lungs.

Figure 16. GEI +5 to +4: Comparison of prostate specific probe sets to other non-prostate probe sets

(Y-axis: number of probe sets.)

There were more probes that were expressed in the prostate compared to those in the lung.
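The two GEI examples given above are consistent with an index computed as the number of prostate cell lines expressing a gene minus the number of non-prostate cell lines expressing it. The sketch below encodes that reading; it is an inference from those two examples only, the full definition is in Materials and Methods, and the function name and boolean inputs are hypothetical.

```python
def gene_expression_index(target_calls, other_calls):
    """Compute a GEI-style score for one gene.

    Assumed rule (inferred from the GEI 5 and GEI 4 examples in the text):
    number of target (e.g. prostate) cell lines expressing the gene minus
    the number of other cell lines expressing it.
    """
    return sum(bool(c) for c in target_calls) - sum(bool(c) for c in other_calls)


# Example with 5 prostate and 7 non-prostate lines (12 lines in total).
all_prostate_only = gene_expression_index([True] * 5, [False] * 7)
four_prostate_only = gene_expression_index([True] * 4 + [False], [False] * 7)
five_plus_one_other = gene_expression_index([True] * 5, [True] + [False] * 6)
```

Under this assumed rule, the first case scores 5 and the other two both score 4, matching the examples in the text.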

Also, many genes were expressed in the prostate cell lines, but not in the prostate tissues. Figure 17 shows a summary of the number of genes expressed in 2-5 prostate cell lines, but not in any of the 69 prostate tissues.

Figure 17. Number of prostate specific genes in each of the 5 prostate cell lines from our 12 experimental cell lines

(Axis labels: number of prostate cell lines, 2 to 5; prostate cell lines.)

Prostate-specific Genes. Tissues Only. We also found a number of genes that were expressed only in prostate tissues, and not in any prostate or other cell lines. These genes are shown in Figure 18, which demonstrates the ratios of percentages of prostate to non-prostate gene expression in the 69 prostate and 1,966 non-prostate tissue samples. The specificity of the prostate expression of the genes was also independently confirmed by chi squared analysis (p < 0.001).

Figure 18. Prostate specific gene expression in tissues only. These are not expressed in any of the 210 cell lines

(Y-axis: ratios of prostate to non-prostate gene expression, in bins from 3 to 12, 13 - 60, and > 100. X-axis: prostate tissues only.)

Prostate-specific Genes. Tissues and Cells. Genes with +5 to +2 GEIs in prostate cell lines were tested for percent expression ratios in the 69 prostate and 1,966 non-prostate tissues. Figure 19 shows the results of this test. There are 84 prostate specific genes that were expressed in both prostate cell lines and prostate tissues (p < 0.001). The higher the percent expression ratio, the higher the specificity for expression in prostate cells.
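The percent expression ratio used above can be illustrated with a short calculation. This is a sketch under the assumption that the ratio is the percentage of prostate tissues expressing a gene divided by the percentage of non-prostate tissues expressing it; the function name is hypothetical, and returning infinity when no non-prostate tissue expresses the gene is a choice made here for illustration.

```python
import math


def percent_expression_ratio(n_prostate_expressing, n_other_expressing,
                             n_prostate=69, n_other=1966):
    """Ratio of percent expression in prostate vs. non-prostate tissues.

    Defaults are the sample counts given in the text (69 and 1,966).
    """
    pct_prostate = 100.0 * n_prostate_expressing / n_prostate
    pct_other = 100.0 * n_other_expressing / n_other
    if pct_other == 0:
        # Assumption: no expression outside prostate is treated as an
        # unbounded ratio (or undefined if prostate expression is also zero).
        return math.inf if pct_prostate > 0 else math.nan
    return pct_prostate / pct_other
```

A gene expressed in all 69 prostate tissues and in half of the 1,966 non-prostate tissues would have a ratio of 2 under this definition.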

Figure 19. Common prostate specific genes expressed in both tissues and cell lines.

(Y-axis: ratios of prostate to non-prostate gene expression, in bins from 3 to 12, 13 - 60, and > 100. X-axis: prostate cell lines and tissues.)

Prostate-specific genes- Summary. Figure 20 is a model explaining the genes expressed in prostate cell lines and tissues. In this figure, the prostate tissues have two differently shaped cells representing the different types of cells that may be present. The grey cells are epithelial as in the cell lines, and the yellow cells could be any other cell type, such as stromal or blood vessel cells, which are usually present in tissue samples.

The red squares represent the universally expressed genes for growth and survival, expressed in all cell lines and 99.75 % of the tissues. The black circles are genes expressed in cell lines only, and may have been turned on in response to different requirements for growth in cell culture, including immortalizing. The green circles represent genes that are expressed only in prostate epithelial cells both in vivo and in vitro. The blue circles are those genes that are expressed in prostate tissues but not cell lines, indicating that these genes perform differentiated in vivo functions specific to the prostate, or that these genes were more expressed by cell types (stroma, lymphocytes, etc) not selected for culture growth.

Figure 20. Summary of genes expressed in cell lines and tissues

(Panels: prostate cell lines; prostate tissues. Legend: universally expressed genes; genes expressed exclusively in cell lines; genes expressed exclusively in epithelial cells; genes expressed exclusively in tissues.)

DU-145 prostate cancer and the universally expressed genes- The link. To look at a global perspective of the link between the universally expressed genes for growth and survival expression data and the loss of heterozygosity and copy number variations, we look at DU-145 (prostate cancer cell line) as a specific example. This will allow us to see the connection between the prostate specific genes, universally expressed genes and the regions that are lost and retained in the cell line (Figure 21).

The red ideograms represent the locations of the 778 universally expressed genes. The green lines and stars in the ideogram represent genes that are specific to DU-145 (not found in any other cell line in this study) (Figure 21). These figures show primarily broken chromosome LOH examples in DU-145, although there is an example of a heterozygous chromosome, and a whole homolog loss chromosome as well. The SNP and SKY data correspond very well as seen in Figure 6. The addition of the universally expressed genes locations helps to provide evidence for our hypothesis, which will be further explained in the discussion.

In general, from the examples shown in DU-145, the observation is that all 8 broken chromosomes contain translocations. Six of the 8 broken chromosomes also appear to have maintained heterozygosity in regions that have a higher density of cell survival genes, as shown in black circles and arrows indicating the corresponding heterozygous regions in Figure 21. Some chromosomes also seemed to have two normal appearing homologs in the SKY, but are not actually heterozygous in the probability stripes.

Figure 21. Universally expressed genes, SNP and SKY correlation in broken, heterozygous and whole homolog loss chromosomes in DU-145

The red lines on the ideograms represent the genomic locations of the 778 universally expressed genes for growth and survival, the green lines and stars represent DU-145 specific genes, and the black circles and arrows represent heterozygous regions that correlate to high density areas of cell survival genes.

(Panels. Broken chromosomes: 1, 3, 6, 9, 10, 12, 16 and 18. Heterozygous chromosome: 5. Whole-homolog LOH: 13.)

Discussion

Pseudo-distribution and probability stripes. Prior to the creation of the probability stripes, we did a simple scatter plot of all the heterozygous SNPs on one line, and all the homozygous SNPs on another line, to produce a single nucleotide polymorphisms (SNP) map, in order to visualize the large quantity of data generated by the SNP assay. Appendix 9 shows an example of these SNP maps using DU-145 prostate cancer cell line. However, there were two limitations with this method which provided the impetus to develop a better method of analysis and visualization of the SNP data: (1) there were multiple single Heterozygous SNPs (H-SNPs) interspersed within areas of loss of heterozygosity SNPs (L-SNPs) which, if erroneous, would incorrectly truncate the runs of consecutive L-SNPs making up LOHs; and (2) this approach does not allow evaluation of the spatial information. To try to overcome these limitations, the pseudo-distribution and spatial information present in the 250K SNP-maps were combined to facilitate the detection of LOH - producing probability stripes.

Twins. Similarity comparison. We looked at the percent similarity between the twin data prior to applying the probability stripes algorithm as follows: we compared replicate SNP assays 1) of the same DNA sample, and 2) between the twins (Table II). We observed between 91 and 95% similarity between replicates of twin E, and between 89 and 93% similarity between replicates of twin O. This large discrepancy between replicates of the same DNA sample is due to noise, or errors, in the SNP assay. As a result, our novel probability stripes algorithm was applied to the twin data, treating the replicate SNP maps of each twin as individual samples. After applying the algorithm, there were still differences in the replicates which could be accounted for by assay error, although now the error rate was in the expected range (~5%). This was objectively shown by the finding that the percent similarity between the replicates significantly improved after application of the probability stripe algorithm (Table III). Our results of 98-99% similarity between the twins are now also in accordance with a study done using the 50K Xba1 SNP GeneChip, where Montgomery et al. found a concordance rate of 99.995% between monozygotic twins (Montgomery et al., 2005).
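A percent similarity of the kind reported above could be computed as follows. This is a hedged sketch, not the procedure used to produce Tables II and III: it assumes similarity means the percentage of SNP loci with identical calls in two assays, and that loci with a no call in either assay are skipped. The function name and the use of None for a no call are hypothetical.

```python
def percent_similarity(calls_a, calls_b, no_call=None):
    """Percentage of loci with identical calls in two SNP assays.

    Assumptions: both lists are ordered by the same SNP loci, and loci
    where either assay has no call are left out of the comparison.
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("call lists must cover the same loci")
    compared = 0
    matches = 0
    for a, b in zip(calls_a, calls_b):
        if a == no_call or b == no_call:
            continue  # skip loci without a call in one of the assays
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no loci with calls in both assays")
    return 100.0 * matches / compared
```

For two replicates that agree at three of four called loci, this returns 75.0.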

Twins. Probability Stripes. Figures 2a and b show the average probability stripes for each twin, which were derived from the 3 individual replicates (Appendix 2 and 3). The individual replicates did appear to have some differences, but an average was used to create Figures 2a and b as follows: each SNP was coded as “H” (Heterozygous), or “L” (Homozygous) or “” (No Call) for each sample. The averages were calculated using the 3 individual replicate samples. At a given SNP locus, 3 of 3 or 2 of 3 calls were used as the average. If all 3 SNPs at a given locus were different from each other, they were discarded. In general the average probability stripes are similar to each other, but there are small variations as in chromosomes 1, 3, 4, 5, 7, 10, 12, 15, 18, 20 and 21. The differences between the twins could be accounted for by a combination of assay error and environmental influences (Gringras & Chen, 2001). The areas of LOH seen in normal samples from both twins confirm other studies that have shown long stretches of homozygosity throughout the human genome resulting from common ancestry. Specifically, the finding that both twins have similar LOH patterns (Table III and Figures 2a and b) supports the common ancestry theory. Common ancestry refers to the twins originating from a population with some degree of inbreeding, leading to both parental alleles being the same, or homozygous, at a specific locus, extending over a significant length of that portion of the chromosome (Bacolod et al., 2008; Curtis, 2007; Gibson et al., 2006; Simon-Sanchez et al., 2007). These areas are important to note, especially if these normal samples are going to be used as controls for cancer samples, since these LOH areas in the normal samples should not be attributed to somatically acquired LOHs in the genotype.
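The replicate averaging rule described above (use a call seen in 3 of 3 or 2 of 3 replicates, discard the locus if all three differ) amounts to a majority vote. A minimal sketch follows; the function names are hypothetical, and the code "N" for a no call is a placeholder chosen for this example only.

```python
from collections import Counter


def consensus_call(calls):
    """Majority call from three replicate SNP calls at one locus.

    Returns the call made by at least 2 of the 3 replicates, or None when
    all three replicates differ (the locus is then discarded).
    """
    if len(calls) != 3:
        raise ValueError("expected exactly three replicate calls")
    call, count = Counter(calls).most_common(1)[0]
    return call if count >= 2 else None


def consensus_map(rep1, rep2, rep3):
    """Apply consensus_call locus by locus across three replicate maps."""
    return [consensus_call(trio) for trio in zip(rep1, rep2, rep3)]
```

For instance, replicate calls of "H", "H" and "L" give "H", while "H", "L" and "N" give no consensus and the locus is dropped.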

An interesting point that the results from the replicates raise is the importance of using triplicates for this SNP assay, given the large amount of variability in replicates of the same DNA sample.

Twins. Copy number variants (CNVs) comparison. Using the Copy Number Analysis Tool (CNAT), as part of the SNP analysis software, we looked at copy number variations (CNVs) in our normal twin samples. Some researchers have observed in normal samples that there are locations such as 8p and 15q13-14 that contained clusters of three to four CNVs, which may be evidence that these regions are “hotspots” of copy number variation (Sebat et al., 2004). We compared a few regions of copy number variations observed in our twin samples with the rearrangement “hotspots” from Sharp et al., 2005. The significant regions of duplications on chromosome 15 corresponded very well with significant “hotspot” regions of rearrangement in the human genome. Chromosome 15 CNVs were relatively similar between the twins, but Chromosome 21 was different (Figure 3). Our data found that 15q11 also contained an area of CNVs for the twins in our study. Although there are only two samples in this study, this test supports previously published data that CNVs also exist in healthy adults, and that there are specific regions of a chromosome that are more prone to rearrangements and copy number variations (de Stahl et al., 2008; Redon et al., 2006; Sebat et al., 2004; Sharp et al., 2005). This information is necessary to know when looking at cancer, to ensure that copy number variants are not associated with cancer when they actually occur in a large proportion of the healthy population. This comparison was simply done as a quick test in the twins to confirm that these variants do exist in normal cells, and is a factor that should be considered when SNP analysis is done on normal tissues of a cancer patient as a control to detect tumor-specific LOHs and CNVs.

Contamination. Dilution Experiment. Figure 4 shows that the SNP assay can detect as low as 1% contamination from normal cells in a tumor sample. This indicates the importance of ensuring that the cancer samples that are being studied are pure and free of all normal cells, as contamination with normal cells will result in false areas of heterozygosity.

Summary- Normal cells. Any detailed study of human cancer LOH must also study the DNA from the patients’ normal cells to correct for naturally occurring LOHs and CNVs. Also, any study of fresh tumors must carefully dissect the cancer cells from the normal cells in the tumor in order to prevent the masking of tumor LOHs.

DU-145 probability stripes and SKY data. Our probability stripes algorithm was applied to the DU-145 prostate cancer cell line (Figure 5). Although aneuploidy is selected against in normal proliferating cells early in transformation (Thompson & Compton, 2008), this does not apply to established aneuploid cancers or cancer cell lines. The karyotypes of the DU-145 cell line are clearly aneuploid by spectral karyotyping (SKY).

Spectral Karyotyping (SKY) data were used in this study on the DU-145 prostate cancer cell line to: 1) confirm areas of chromosomal LOH and heterozygosity; 2) visualize specific abnormalities, such as translocations and deletions; and 3) investigate SKY copy number variations in the prostate cancer cell lines. SKY abnormalities seen in greater than 60% of the karyotypes investigated are referred to as marker chromosomes; those that are seen in less than 20% of the karyotypes investigated are referred to as low-copy number structurally abnormal chromosomes (SACs); and those occurring between 20% and 60% are referred to as intermediate-copy number SACs. These abnormalities explain many of the aberrant probability stripe maps observed. For example (Figure 6), for one DU-145 chromosome the probability stripes show a broken chromosome LOH, and the SKY data show that part of one of the homologs has been deleted and replaced by part of another chromosome. The combination of the SNP and SKY data helps to paint a complete picture. Based on the SKY data, there appear to be 2.5 copies of the chromosome (normal = 2 copies). However, the probability stripes show only 1.5 homologs, which would indicate that the extra copy of the chromosome seen in the SKY data is probably a duplicate of the other normal appearing homolog. Using SNP data, a duplicate homolog would be considered as 1 homolog, not 2 separate homologs, since they are, in theory, genetically identical to each other. The same explanation would be true for chromosome 6. In chromosome 11, the (11:21) translocation may be a duplicate of the normal appearing homolog. The other translocations may confer the heterozygosity seen in the probability stripes. In another DU-145 example, chromosome 13, the probability stripes show a whole homolog loss of heterozygosity. The SKY data show that there is not even one normal appearing homolog, but there are 2 translocations that account for the haploid chromosome seen in the probability stripes. These findings together suggest that normal and abnormal chromosomes play complementary roles in supporting the cancer genomes.
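The frequency thresholds used above to name SKY abnormalities can be written as a small classifier. This is only a sketch of the stated cut-offs; the function name and the inputs (number of karyotypes showing the abnormality and number of karyotypes examined) are hypothetical.

```python
def classify_sky_abnormality(n_karyotypes_with, n_karyotypes_total):
    """Classify a structurally abnormal chromosome by how often it is seen.

    Thresholds follow the text: more than 60% is a marker chromosome,
    less than 20% is a low-copy number SAC, and 20% to 60% is an
    intermediate-copy number SAC.
    """
    if n_karyotypes_total <= 0 or not 0 <= n_karyotypes_with <= n_karyotypes_total:
        raise ValueError("counts must satisfy 0 <= with <= total and total > 0")
    percent = 100.0 * n_karyotypes_with / n_karyotypes_total
    if percent > 60:
        return "marker chromosome"
    if percent < 20:
        return "low-copy number SAC"
    return "intermediate-copy number SAC"
```

An abnormality seen in 7 of 10 karyotypes would be called a marker chromosome, and one seen in 1 of 10 a low-copy number SAC.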

The possibility that considerable numbers of small, but significant, clusters of L-SNPs due to true LOHs could occur during the somatic chromosomal evolution of cancers and cell lines without the production of structurally abnormal chromosomes would seem to be remote. Balanced reciprocal translocations are one possible chromosomal mechanism of creating broken chromosome LOHs while still preserving normal diploid homologs. Again, it is also important to acknowledge that there are regions of smaller LOH in the normal DNA samples, some of which may be present in multiple populations. These regions should not be included as part of LOHs that occur with cancer progression. The need for normal control samples for comparison to cancer samples is evident.

Hypothesis 1 summary and conclusions: The areas of L-SNPs observed in the normal twin samples are characteristic of residual consanguinity, possibly produced by hundreds of generations of variable inbreeding among a closed population. In contrast, LOHs acquired during the evolution of aneuploid cancer cells often involve loss of entire parental homologs producing whole-homolog LOHs, or loss of large genomic regions from major chromosome breaks, leading to broken-chromosome LOH with the preservation of a haploid genome. The application of the probability stripes algorithm allows for easier visualization and comparison of the data, by objectively set levels, or p values, of the probability stripe widths. This algorithm allows a significant reduction of the noise in the data for both normal and prostate cancer cell lines.

Gene Expression. The series of experiments and analyses done using 210 cell lines and 2,035 tissues were to see if a set of genes existed that are universally expressed in both cancer and normal samples to explain the selective retention of abnormal chromosomes.

Congruent expression. Cells. Figure 7 shows that in our 12 experimental lines there are 7,742 genes that appear to be commonly expressed among all 12 cell lines, which indicates that these genes are important for the survival of these cell lines. Given the fact that these samples are from various tissues in the body (Appendix 1A), we can eliminate the chance that these congruently expressed genes were only specifically expressed in one type of tissue.

Comparison between our cell lines and the 198 GEO database cell lines. In general there was agreement in the data between our 12 cell lines and the 198 cell lines in the GEO database. A more in-depth explanation for any discrepancies can be found in Appendix 10. We found that there were 1,508 genes that were commonly expressed in all 210 cell lines studied.

Universally expressed genes, tissues and cell lines. We then compared those 1,508 genes to the genes expressed in the tissues. We found 778 genes that are expressed in all 210 cell lines and in at least 99.75% of the 2,035 tissues; we consider these to be the universally expressed genes for growth and survival (Figure 12). These genes are expressed in multiple cell lines of differing origins, both cancer and normal, as well as in various cancerous tissues from different organs. They are therefore expressed under multiple conditions and in every cell type and tissue origin investigated, indicating that they are extremely important for many adult cells to grow and/or survive. Each of the further investigations that we performed on this group of 778 genes helped us to identify additional characteristics of these universally expressed genes for growth and survival.
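The selection described above amounts to two set-membership filters: an intersection across cell lines, followed by a frequency cutoff across tissues. The following is a minimal Python sketch under stated assumptions, not the pipeline used in this study. The function name `universally_expressed`, the input format (a mapping from sample name to the set of genes called expressed) and the placeholder gene names are all hypothetical.

```python
from collections import Counter


def universally_expressed(cell_line_calls, tissue_calls, min_tissue_fraction=0.9975):
    """Genes present in every cell line and in at least
    `min_tissue_fraction` of the tissue samples.

    Both arguments map a sample name to the set of genes called
    "expressed" in that sample (a hypothetical input format).
    """
    if not cell_line_calls or not tissue_calls:
        raise ValueError("need at least one cell line and one tissue sample")
    # Step 1: genes expressed in all cell lines (set intersection).
    in_all_lines = set.intersection(*(set(g) for g in cell_line_calls.values()))
    # Step 2: count the tissue samples in which each gene is expressed.
    tissue_counts = Counter(g for genes in tissue_calls.values() for g in set(genes))
    required = min_tissue_fraction * len(tissue_calls)
    return {g for g in in_all_lines if tissue_counts[g] >= required}


# Toy data with placeholder gene names (not real results).
cell_lines = {
    "line_1": {"G1", "G2", "G3"},
    "line_2": {"G1", "G2"},
    "line_3": {"G1", "G2", "G3"},
}
tissues = {
    "tissue_1": {"G1", "G2"},
    "tissue_2": {"G1", "G2"},
    "tissue_3": {"G1"},
    "tissue_4": {"G1", "G2"},
}

strict = universally_expressed(cell_lines, tissues)         # default 99.75% cutoff
relaxed = universally_expressed(cell_lines, tissues, 0.75)  # looser cutoff
```

With only four toy tissues, the default cutoff requires a gene to be present in all of them, so only G1 passes; relaxing the cutoff to 75% also admits G2. With 2,035 tissues, the 99.75% cutoff would tolerate a gene being absent from up to five samples.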

Locations of the universally expressed genes. As seen in Figure 13, the 778 genes are widely distributed throughout the genome and are found on all chromosomes except Y. Because these essential genes are located on each of the autosomes and on X, this distribution supports the idea that retention of a haploid genome is necessary for the survival of all cells, including cancer cells.

Housekeeping Genes. Six hundred of our 778 (77%) universally expressed genes for growth and survival are known housekeeping genes (Appendix 6). This supports our claim that these genes are essential to the survival of all cells. The remaining 178 genes may also be housekeeping genes that are not represented in the list that we compiled; comparison against a larger collection of housekeeping genes may reveal this. It is also entirely possible that some of these genes are housekeeping genes that have not yet been classified as such. Very few genes in this group appear to produce viable mouse knockouts, which would argue against those particular genes being universally expressed genes for growth and survival.

Also, 145 of our 778 universally expressed genes were fetal genes, that is, genes that are expressed in the early stages of development and continue to be expressed throughout adulthood (identified by comparison with the gene list of Warrington et al., 2000). These genes would be critical to an organism’s survival, given that they are expressed from early development through adulthood.

Functional Analysis of the universally expressed genes. Housekeeping genes generally carry out universally important functions, and our list of 778 universally expressed genes contains a large proportion of genes with intracellular metabolic (essential) functions when compared with genes expressed in tissues only or with genes expressed in some, but not all, tissues and cell lines (variably expressed) (Figure 14 and Appendix 7). This provides further evidence that these genes are housekeeping genes and are needed for the general survival of all cells.

Mouse embryonic lethal genes. We found that 140 (18%) of our 778 universally expressed genes were mouse embryonic lethal, indicating that their deletion in humans could lead to death of the organism (Appendix 8). We are unaware of mouse knock-outs for the remaining 638 universally expressed genes, as they were not in the list of 2,429 mouse embryonic lethal genes with human orthologs. Thomas et al. found that 3.4% of their set of 3,035 housekeeping genes were embryonic lethal (Thomas et al., 2003). Our finding supports the theory that these genes are essential for the survival of the organism. It is possible that our list of 140 mouse embryonic lethal genes is not exhaustive. We also found that 17% of our variably expressed genes are embryonic lethal, which may be because these genes are important for the survival and growth of certain cell subtypes, such as prostate specific or lung specific cells. In contrast, only 8% of the tissue specific group were embryonic lethal, which implies that most of these genes are not as important for the growth and survival of cells as the variably expressed and universally expressed genes.
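The proportion and the remainder quoted for the universally expressed genes follow directly from the counts given above. The short check below recomputes them; only the counts 140 and 778 come from the text, and the helper name `percent` is illustrative.

```python
def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100.0 * part / whole, 1)


universal_genes = 778    # universally expressed genes (from the text)
embryonic_lethal = 140   # of those, mouse embryonic lethal (from the text)

# Share of the universally expressed genes that are embryonic lethal.
lethal_pct = percent(embryonic_lethal, universal_genes)
# Genes that do not appear in the embryonic-lethal list.
not_in_lethal_list = universal_genes - embryonic_lethal

print(lethal_pct, not_in_lethal_list)
```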

Intron/Exon Length. Figure 15 shows a significant difference in gene length between the universally expressed genes and the variably expressed genes. Previous studies show that shorter introns may be more common in universally expressed or housekeeping genes, since these genes are constantly being transcribed in every cell. Shorter genes would have fewer chances of transcription error and would be more energetically efficient to transcribe (Eisenberg & Levanon, 2003). This is further evidence that our 778 genes are housekeeping genes and are expressed in multiple cells and tissues.

Prostate specific genes. Cells. We then looked at a subset of variably expressed genes (Figures 9 and 11) to investigate the presence of tissue specific genes, specifically prostate specific genes. In our 12 cell lines, comparing prostate specific probe sets with lung specific probe sets gave a p value of 1 × 10^-4, which indicates that the number of prostate specific genes differed significantly from the number of lung specific genes (Figure 16). This shows that there is a subset of genes that are more highly or more frequently expressed in certain cell lines than in others.

Some of these genes were also expressed in tissues. However, there was a subset of genes that were expressed only in prostate cell lines, but not in prostate tissues (Figure 17). Presumably, either those genes were selectively upregulated to allow growth in cell culture conditions, or their expression in vivo was lost.

Prostate specific genes. Tissues only. The ratios shown in Figure 18 represent genes that are more frequently expressed in prostate tissues. These are a subset of genes that are expressed in prostate tissues only, but not prostate cell lines. These genes may have been expressed by stromal, or other types of cells, that are present in the prostate tissues but not selected for growth in prostate cell lines.

Prostate specific genes. Tissues and Cell lines. We directly compared the prostate specific genes that the cell lines and tissues have in common. All of the genes with GEI +5 to +2 in the prostate cell lines were tested for prostate-specific expression in prostate tissues. A higher ratio of prostate to non-prostate gene expression in the tissues indicates increased specificity of a gene for prostate tissues (Figure 19). Therefore, genes with a ratio of > 13 (the top of the graph) are expressed in more prostate tissues than genes with ratios of < 13 (the bottom of the graph). Thus, there is also a subset of prostate-specific genes that are expressed in both prostate cell lines and prostate tissues more than they are expressed in any other cell or tissue type.
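One way to read the prostate to non-prostate ratio is as the fraction of prostate tissues expressing a gene divided by the fraction of non-prostate tissues expressing it. The sketch below assumes that definition, which is an illustration and may differ from the exact calculation behind Figure 19. The function name and the example counts are hypothetical; only the cutoff of 13 comes from the text.

```python
def prostate_specificity_ratio(prostate_hits, n_prostate, other_hits, n_other):
    """Ratio of expression frequency in prostate vs non-prostate tissues.

    prostate_hits / n_prostate: prostate samples expressing the gene / total.
    other_hits / n_other: non-prostate samples expressing the gene / total.
    """
    if n_prostate <= 0 or n_other <= 0:
        raise ValueError("sample counts must be positive")
    if other_hits == 0:
        # Never seen outside prostate: unbounded specificity (or 0 if unseen).
        return float("inf") if prostate_hits > 0 else 0.0
    # (prostate_hits / n_prostate) / (other_hits / n_other), as one division.
    return (prostate_hits * n_other) / (n_prostate * other_hits)


CUTOFF = 13  # threshold separating the top and bottom of the graph (from the text)

# Hypothetical counts for two genes.
specific_gene = prostate_specificity_ratio(45, 50, 60, 1000)
broad_gene = prostate_specificity_ratio(10, 50, 200, 1000)

print(specific_gene > CUTOFF, broad_gene > CUTOFF)
```

Rearranging the two fractions into a single division keeps the arithmetic exact for integer counts, which avoids small floating-point differences near the cutoff.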

Hypothesis 2 summary and conclusions: We found a group of 778 genes that are expressed in 210 cell lines and > 99.75% of 2,035 tissues. We believe that these universally expressed genes (in our study samples) are mainly housekeeping genes with essential cell survival and growth functions that are located throughout the human genome. We also found a subset of 83 genes that are more highly expressed in prostate cell lines and tissues than in other tissue types. We think that these genes have the potential to be used as targets for anti-prostate cancer therapy. Additionally, there are two other classes of prostate specific genes: 1) those expressed in cell lines, but not tissues, and 2) those expressed in tissues, but not cell lines. A summary of the findings on the prostate specific genes is shown in Figure 20 and is discussed in detail in the Results section.

THE LINK BETWEEN HYPOTHESES 1 AND 2.

DU-145 comparison of probability stripes, SKY and expression data. The combination of the broken chromosome LOH probability stripes and the SKY data for DU-145 chromosome 1 (Figure 21) shows a duplication of one of the homologs. Therefore, the heterozygosity in this chromosome’s probability stripes comes from the conserved region of chromosome 1 in the (1:4) translocation. The conservation of this heterozygous region suggests that there may have been a mutation in one of the cell survival genes (red lines). That area of chromosome 1 may therefore have been specifically retained in the translocation to keep at least one functional copy of the gene(s). Other chromosomes, including 6, 9, 10, 12, 16 and 18, show similar examples of this phenomenon: all of those chromosomes are broken chromosome LOH in the probability stripes analysis. Chromosomes 1, 3, 6, 9, 12, 16 and 18 also have duplicates of normal appearing homologs, presumably with inactivating mutations in universally expressed genes for growth and survival. The duplication of these normal appearing homologs, presumably carrying some inactivated cell survival genes, may have been selected for gene dosage compensation for the part of the opposing parental homolog lost in the LOH. We speculate that in these cases there may be multiple mutated cell survival genes in these chromosomes, and that the duplicated homolog may actually have a functional copy of another gene, which we would not be able to observe with our analysis. One homolog may therefore be compensating for a mutation in the other homolog, and a translocation would not be needed.

Chromosome 5 in DU-145 (Figure 21) is generally a heterozygous chromosome, but it can still be explained by our universally expressed gene conservation hypothesis, since there are more universally expressed genes, as well as a DU-145 specific gene, in one arm of the chromosome. The conservation of these small regions could therefore again serve to conserve functional copies of a gene that may be mutated in another homolog. It is also possible that these cell survival genes are conserved even where there is no mutation, in order to maintain the gene dosage needed for the gene to function at its fullest capacity.

Chromosome 13 in DU-145 (Figure 21) shows a whole homolog loss, but it can also be explained by our universally expressed gene conservation hypothesis. In the SKY data there is no normal appearing homolog, and in the probability stripes data the chromosome is completely homozygous. This indicates that conserving the translocated regions of this chromosome conserved at least the haploid genome of this chromosome. The cell survival genes on chromosome 13 seem to be located mainly on the top portion of the q arm, so these would be the primary genes conserved in the haploid genome.

If the translocated regions conserve one gene that may be mutated on a homolog, why are some of the translocated areas so large, and why do some appear to include regions that contain no cell survival genes? This may be because other genes, “bystander genes”, are linked to the conserved gene or genes in that area and are therefore retained as well. This agrees with other studies, which found that “it is likely that bystander genes, located in close physical proximity to cancer pathway genes, are included within these segments but do not confer a selective advantage” (Tsafrir et al., 2006). Those translocations are presumably formed by chance. Also, extraneous, non-functional DNA can be carried in chromosomal abnormalities and may still be capable of mitotic segregation. The bystander genes do not functionally affect the 778 genes, but may simply enlarge the conserved regions, so that both the universally expressed genes and the bystander genes are conserved.

In many of the broken chromosomes (Figure 21, black circles and arrows), heterozygous regions of the chromosomes (probability stripes) seemed to correspond to areas with a higher density of universally expressed genes (ideograms, red lines). There are two possible reasons for this. First, many of the cell survival genes may have mutations and therefore require a functional copy of the gene, which is achieved by retaining that portion of the chromosome in a translocation. Second, many of these universally expressed genes may require a specific gene dosage: one functional copy is not sufficient for the gene to function at its fullest capacity, so a second copy is retained as a translocation to ensure the survival of the tumor (gene dosage compensation theory).
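The visual correspondence between heterozygous regions and gene-dense areas could be quantified as a fold enrichment: the share of universally expressed genes that fall inside the heterozygous regions, divided by the share of the chromosome length that those regions cover. The sketch below is one hypothetical way to do this and was not part of the analysis reported here; the function name, the interval format and the toy coordinates are assumptions.

```python
def enrichment_in_regions(gene_positions, regions, chromosome_length):
    """Fold enrichment of genes inside `regions` relative to a uniform spread.

    gene_positions: gene start coordinates (e.g. base pairs).
    regions: non-overlapping (start, end) intervals, end exclusive.
    A value above 1 means the regions hold more genes than their
    share of the chromosome length would predict.
    """
    covered = sum(end - start for start, end in regions)
    if not gene_positions or covered <= 0 or chromosome_length <= 0:
        raise ValueError("need genes, non-empty regions and a positive length")
    inside = sum(
        1 for pos in gene_positions
        if any(start <= pos < end for start, end in regions)
    )
    # (inside / total genes) / (covered / chromosome length), as one division.
    return (inside * chromosome_length) / (len(gene_positions) * covered)


# Toy coordinates in arbitrary units (not real data).
genes = [5, 10, 15, 50, 90]
heterozygous = [(0, 20)]
fold = enrichment_in_regions(genes, heterozygous, chromosome_length=100)
print(fold)
```

In the toy example, three of five genes lie in a region covering one fifth of the chromosome, so the regions hold three times as many genes as a uniform distribution would predict. A fold value alone does not establish significance; a permutation or binomial test would be needed for that.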

Our laboratory has previously found, through the use of SNP and SKY technology, that a haploid genome is always preserved, even if the haploid genotypes do not exist as whole chromosomes, but rather as translocations (Nestor et al., 2007). The results of the present work help to solidify this finding and add further evidence that there is strong selective pressure to retain cell survival genes in at least a haploid genome.

This selective pressure to retain cell survival genes is so strong, that in cases where a cell survival gene is mutated, the gene is still conserved in a translocated portion of that chromosome carried in a chromosomal abnormality.


Conclusions

Hypothesis 1: Aneuploidy is used to retain at least a haploid genome, and using the novel method of probability stripes allows not only for easy visualization of the retained haploid genome, but also for noise elimination in the data in cancerous samples. Normal samples also exhibit a reduction in noise with the use of the probability stripes algorithm.

Hypothesis 2: We found a group of 778 genes that are expressed in multiple cell types and tissues, in both cancer and normal samples that are distributed throughout the genome. We believe these are universally expressed genes for growth and survival, and therefore housekeeping genes. We also found a subset of genes that are more highly expressed in specific tissues, such as prostate.

Overall conclusion: The tumor genome strives to retain a haploid genome. In cases where there is a mutation of an essential gene, that part of the chromosome may be retained on a translocation to ensure at least one functional copy of the gene. The data is consistent with the hypotheses.

Summary

We have found that aneuploidy is used to maintain a haploid genome containing essential cell survival genes. The widely accepted explanation for why certain areas of the cancer genome are retained is that they maintain oncogenes, which are essential for the survival, growth and progression of the tumor. Our results show an additional reason why areas are retained: to conserve housekeeping genes that are essential for the survival not only of the tumor, but of normal cells as well.

We have used a novel method of looking at the lost regions in cancer samples. Many studies, including our present study, use single nucleotide polymorphisms (SNPs) to look at loss of heterozygosity (LOH) and aneuploidy in the human genome. However, our method of visualization and analysis differs from those cited in the literature. Our novel technique not only allows visualization of the LOH regions but also reduces noise in the data, allowing a more accurate description of the genetically unstable regions.

Additionally, our sample set for studying gene expression had a very different composition from that of most other studies looking at a universal expression pattern. Our study compared a large-scale combination of human tissues and cell lines of various origins from both normal and cancerous samples.

References

Affymetrix. (2004). GeneChip Expression Analysis Technical Manual P/N 701021 Rev. 5.

Affymetrix. (2006-2007). Manual: Affymetrix Genome-wide Human SNP Nsp/Sty Assay

5.0.

Affymetrix. (2007). Copy Number and Genotype Analysis of FFPE-extracted DNA:

Recommendations and Guidelines for GeneChip Mapping Arrays. Technical Note

Anno, S., Abe, T., & Yamamoto, T. (2008). Interactions between SNP alleles at multiple

loci contribute to skin color differences between caucasoid and mongoloid

subjects. Int J Biol Sci., 4(2), 81-86.

Argos, M., Kibriya, M. ., Jasmine, F., Olopade, O., Su, T., Hibshoosh, H., et al. (2008).

Genomewide scan for loss of heterozygosity and chromosomal amplification in

breast carcinoma using single-nucleotide polymorphism arrays. Cancer Genet

Cytogenet. , 182(2), 69-74.

Bacolod, M., Schemmann, G., Wang, S., Shattock, R., Giardina, S., Zeng, Z., et al.

(2008). The signatures of autozygosity among patients with colorectal cancer.

Cancer Res, 68(8), 2610-2621.

Baltzer, F. (1967). Theodor Boveri: The Life of a Great Biologist 1862-1915 (D. Rudnick,

Trans. 8th ed.). Sunderland: Sinauer Associates.

Barrett, T., Troup, D., Wilhite, S., Ledoux, P., Rudnev, D., Evangelista, C., et al. (2006).

NCBI GEO: mining tens of millions of expression profiles--database and tools

update. Nucleic Acids Res.

Beroukhim, R., Lin, M., Park, Y., Hao, K., Zhao, X., Garraway, L., et al. (2006). Inferring

loss-of-heterozygosity from unpaired tumors using high-density oligonucleotide

SNP arrays. PLoS Comput Biol., 2(5).

Bild, A. H., Yao, G., Chang, J. T., Wang, Q., Potti, A., Chasse, D., et al. (2006).

Oncogenic pathway signatures in human cancers as a guide to targeted

therapies. Nature, 439(7074), 353-357.

Boveri, T. (1907). Die Entwicklung dispermer Seeigel-Eier. Ein Beitrag zur

Befruchtungslehre und zur Theorie des Kerns (Zellenstudien VI). (Translated).

Bruder, C. E. G., Piotrowski, A., Gijsbers, A. A. C. J., Andersson, R., Erickson, S., Diaz

de Ståhl, T., et al. (2008). Phenotypically Concordant and Discordant

Monozygotic Twins Display Different DNA Copy-Number-Variation Profiles. Am J

Hum Genet. , 82(3), 763-771.

Cahill, D. P., Kinzler, K. ., Vogelstein, B., & Lengauer, C. (1999). Genetic instability

and darwinian selection in tumours. Trends in Cell Biology, 9(12), M57-M60.

Calvo, A., Gonzalez-Moreno, O., Yoon, C., Huh, J., Desai, K., Nguyen, Q., et al. (2005).

Prostate cancer and the genomic revolution: Advances using microarray

analyses. Mutat Res., 576(1-2), 66-79.

Christensen, K., McCoy, E., & Ford, H. (2008). The six family of homeobox genes in

development and cancer. Adv Cancer Res., 101, 93-126.

Cleton-Jansen, A.-M., Buerger, H., Haar, N. t., Philippo, K., van de Vijver, M. J., Boecker,

W., et al. (2004). Different mechanisms of loss of heterozygosity

in well- versus poorly differentiated ductal breast cancer. Genes, Chromosomes

and Cancer, 41(2), 109-116.

Clifford, R., Edmonson, M., Hu, Y., Nguyen, C., Scherpbier, T., & Buetow, K. H. (2000).

Expression-based genetic/physical maps of single-nucleotide polymorphisms

identified by the cancer genome anatomy project. Genome Res., 10(8), 1259-

1265.

Conrad, D. F., Andrews, T. D., Carter, N. P., Hurles, M. E., & Pritchard, J. K. (2006). A

high-resolution survey of deletion polymorphism in the human genome. Nat

Genet, 38(1), 75-81.

Curtis, D. (2007). Extended homozygosity is not usually due to cytogenetic abnormality.

BMC Genet, 8(67).

Davis, M., & Hammarlund, M. (2006). Single-nucleotide polymorphism mapping.

Methods in Molecular Biology, 351, 75-92.

de Stahl, T., Sandgren, J., Piotrowski, A., Bruder, C., Nord, H., Andersson, R., et al.

(2008). Profiling of copy number variations (CNVs) in healthy individuals from

three ethnic groups using a human genome 32 K BAC-clone-based array. Hum

Mutat. , 29(3), 398-408.

Desai, P., Jimenez, J., Kao, C., & Gardner, T. (2006). Future innovations in treating

advanced prostate cancer. Urol Clin North Am., 33(2), 247-272.

Dezso, Z., Nikolsky, Y., Sviridov, E., Shi, W., Serebriyskaya, T., Dosymbekov, D., et al.

(2008). A comprehensive functional analysis of tissue specificity of human gene

expression. BMC Biol. , 6, 49.

Ding, L., Getz, G., Wheeler, D., Mardis, E., McLellan, M., Cibulskis, K., et al. (2008).

Somatic mutations affect key pathways in lung adenocarcinoma. Nature. ,

455(7216), 1069-1075.

Duret, L., & Mouchiroud, D. (2000). Determinants of substitution rates in mammalian

genes: expression pattern affects selection intensity but not mutation rate. Mol

Biol Evol, 17(1), 68-74.

Dutt, A., & Beroukhim, R. (2007). Single nucleotide polymorphism array analysis of

cancer. Curr Opin Oncol, 19(1), 43-49.

Edgar, R., Domrachev, M., & Lash, A. (2002). Gene Expression Omnibus: NCBI gene

expression and hybridization array data repository. Nucleic Acids Res, 30(1), 207-210.

Eisenberg, E., & Levanon, E. Y. (2003). Human housekeeping genes are compact.

Trends in Genetics, 19(7), 362-365.

Facts and Figures - Breast Cancer 2008. (2008). American Cancer Society.

Felsher, D. (2008). Tumor dormancy and oncogene addiction. APMIS, 116(7-8), 629-

637.

Gaasenbeek, M., Howarth, K., Rowan, A. J., Gorman, P. A., Jones, A., Chaplin, T., et al.

(2006). Combined Array-Comparative Genomic Hybridization and Single-

Nucleotide Polymorphism-Loss of Heterozygosity Analysis Reveals Complex

Changes and Multiple Forms of Chromosomal Instability in Colorectal Cancers.

Cancer Res, 66(7), 3471-3479.

Gibbs, J., & Singleton, A. (2006). Application of genome-wide single nucleotide

polymorphism typing: simple association and beyond. PLoS Genet, 2(10), e150.

Gibson, J., Morton, N. E., & Collins, A. (2006). Extended tracts of homozygosity in

outbred human populations. Hum. Mol. Genet., 15(5), 789-795.

Gringras, P., & Chen, W. (2001). Mechanisms for differences in monozygous twins. Early

Hum Dev., 64(2), 105-117.

Guthery, S., Salisbury, B., Pungliya, M., Stephens, J., & Bamshad, M. (2007). The

structure of common genetic variation in United States populations. Am J Hum

Genet., 81(6), 1221-1231.

Guttman, M., Mies, C., Dudycz-Sulicz, K., Diskin, S. J., Baldwin, D. A., Stoeckert, C. J.,

et al. (2007). Assessing the Significance of Conserved Genomic Aberrations

Using High Resolution Genomic Microarrays. PLoS Genetics, 3(8), e143.

Hahn, N., Kelley, M., Klaunig, J., Koch, M., Li, L., & Sweeney, C. (2007). Constitutional

polymorphisms of prostate cancer: prognostic and diagnostic implications. Future

Oncol., 3(6), 665-682.

Hedley, D. W., Rugg, C. A., & Gelber, R. D. (1987). Association of DNA Index and S-

Phase Fraction with Prognosis of Nodes Positive Early Breast Cancer. Cancer

Res, 47(17), 4729-4735.

Holland, A., & Cleveland, D. (2008). Beyond genetics: surprising determinants of cell fate

in antitumor drugs. Cancer Cell, 14.

Hsiao, L.-L., Dangond, F., Yoshida, T., Hong, R., Jensen, R. V., Misra, J., et al. (2001). A

compendium of gene expression in normal human tissues. Physiol. Genomics,

7(2), 97-104.

Huang, Y.-C., Lee, C.-M., Chen, M., Chung, M.-Y., Chang, Y.-H., Huang, W. J.-S., et al.

(2007). Haplotypes, Loss of Heterozygosity, and Expression Levels of Glycine N-

Methyltransferase in Prostate Cancer. Clin Cancer Res, 13(5), 1412-1420.

Hugel, A., & Wernert, N. (1999). Loss of Heterozygosity (LOH), malignancy grade and

clonality in microdissected prostate cancer. British Journal of Cancer, 79(3/4),

551-557.

Janne, P. A., Li, C., Zhao, X., Girard, L., Chen, T.-H., Minna, J., et al. (2004). High-

resolution single-nucleotide polymorphism array and clustering analysis of loss of

heterozygosity in human lung cancer cell lines. Oncogene, 23(15), 2716-2726.

Kibel, A. S., Jin, C. H., Klim, A., Luly, J., Roehl, K. A., Wu, W. S., et al. (2008).

Association between polymorphisms in cell cycle genes and advanced prostate

carcinoma. The Prostate, 68(11), 1179-1186.

Komura, D., Shen, F., Ishikawa, S., Fitch, K. R., Chen, W., Zhang, J., et al. (2006).

Genome-wide detection of human copy number variations using high-density

DNA oligonucleotide arrays. Genome Res., 16(12), 1575-1584.

Lips, E. H., de Graaf, E. J., Tollenaar, R., van Eijk, R., Oosting, J., Szuhai, K., et al.

(2007). Single nucleotide polymorphism array analysis of chromosomal instability

patterns discriminates rectal adenomas from carcinomas. The Journal of

Pathology, 212(3), 269-277.

Liu, W., Xie, C., Zhu, Y., Li, T., Sun, J., Cheng, Y., et al. (2008). Homozygous deletions

and recurrent amplifications implicate new genes involved in prostate cancer.

Neoplasia., 10(8), 897-907.

Lundberg, G., Rosengren, A., Hakanson, U., Stewenius, H., Jin, Y., Stewenius, Y., et al.

(2008). Binomial mitotic segregation of MYCN-carrying double minutes in

neuroblastoma illustrates the role of randomness in oncogene amplification. Plos

One, 3(8), e3099.

Manchester, K. L. (1995). Theodor Boveri and the origin of malignant tumours. Trends in

Cell Biology, 5(10), 384-387.

Mohamedali, A., Gaken, J., Twine, N. A., Ingram, W., Westwood, N., Lea, N. C., et al.

(2007). Prevalence and prognostic significance of allelic imbalance by single-

nucleotide polymorphism analysis in low-risk myelodysplastic syndromes. Blood,

110(9), 3365-3373.

Montgomery, G., Campbell, M., Dickson, P., Herbert, S., Siemering, K., Ewen-White, K.,

et al. (2005). Estimation of the rate of SNP genotyping errors from DNA extracted

from different tissues. Twin Res Hum Genet., 8(4), 346-352.

Nestor, A. L., Hollopeter, S. L., Matsui, S.-I., & Allison, D. C. (2007). A model for genetic

complementation controlling the chromosomal abnormalities and loss of

heterozygosity formation in cancer. Cytogenetic and Genome Research, 116,

235-247.

Olshen, A., Gold, B., Lohmueller, K., Struewing, J., Satagopan, J., Stefanov, S., et al.

(2008). Analysis of genetic variation in Ashkenazi Jews by high density SNP

genotyping. BMC Genet., 9(14).

Oosting, J., Lips, E. H., van Eijk, R., Eilers, P. H. C., Szuhai, K., Wijmenga, C., et al.

(2007). High-resolution copy number analysis of paraffin-embedded archival

tissue using SNP BeadArrays. Genome Res., 17(3), 368-376.

Paik, S., Shak, S., Tang, G., Kim, C., Baker, J., Cronin, M., et al. (2004). A Multigene

Assay to Predict Recurrence of Tamoxifen-Treated, Node-Negative Breast

Cancer. N Engl J Med, 351(27), 2817-2826.

Paik, S., Tang, G., Shak, S., Kim, C., Baker, J., Kim, W., et al. (2006). Gene Expression

and Benefit of Chemotherapy in Women With Node-Negative, Estrogen

Receptor-Positive Breast Cancer. J Clin Oncol, 24(23), 3726-3734.

Piotrowski, A., Bruder, C., Andersson, R., de Stahl, T., Menzel, U., Sandgren, J., et al.

(2008). Somatic mosaicism for copy number variation in differentiated human

tissues. Hum Mutat., 29(9), 1118-1124.

Pollack, J. R., Sørlie, T., Perou, C. M., Rees, C. A., Jeffrey, S. S., Lonning, P. E., et al.

(2002). Microarray analysis reveals a major direct role of DNA copy number

alteration in the transcriptional program of human breast tumors. Proceedings of

the National Academy of Sciences of the United States of America, 99(20),

12963-12968.

Prostate Cancer Treatment. National Institutes of Health. National Cancer Institute

(2008).

Redon, R., Ishikawa, S., Fitch, K., Feuk, L., Perry, G., Andrews, T., et al. (2006). Global

variation in copy number in the human genome. Nature, 444(7118), 444-454.

Roukos, D., Murray, S., & Briasoulis, E. (2007). Molecular genetic tools shape a

roadmap towards a more accurate prognostic prediction and personalized

management of cancer. Cancer Biol Ther., 6(3), 308-312.

Schrock, E., Zschieschang, P., O'Brien, P., Helmrich, A., Hardt, T., Matthaei, A., et al.

(2006). Spectral karyotyping of human, mouse, rat and ape chromosomes--

applications for genetic diagnostics and research. Cytogenet Genome Res.,

114(3-4), 199-221.

Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Maner, S., Massa, H.,

Walker, M., Chi, M., Navin, N., et al. (2004). Large-Scale Copy Number

Polymorphism in the Human Genome. Science, 305(5683), 525-528.

Sharp, A., Locke, D., McGrath, S., Cheng, Z., Bailey, J., Vallente, R., et al. (2005).

Segmental duplications and copy-number variation in the human genome. Am J

Hum Genet. , 77(1), 78-88.

Simon-Sanchez, J., Scholz, S., Fung, H., Matarin, M., Hernandez, D., Gibbs, J., et al.

(2007). Genome-wide SNP assay reveals structural genomic variation, extended

homozygosity and cell-line induced alterations in normal individuals. Hum Mol

Genet, 16(1), 1-14.

Tai, A. L. S., Mak, W., Ng, P. K. M., Chua, D. T. T., Ng, M. Y. M., Fu, L., et al. (2006).

High-throughput Loss-of-Heterozygosity Study of Chromosome 3p in Lung

Cancer Using Single-Nucleotide Polymorphism Markers. Cancer Res, 66(8),

4133-4138.

Thomas, M. A., Weston, B., Joseph, M., Wu, W., Nekrutenko, A., & Tonellato, P. J.

(2003). Evolutionary Dynamics of Oncogenes and Tumor Suppressor Genes:

Higher Intensities of Purifying Selection than Other Genes. Mol Biol Evol, 20(6),

964-968.

Thompson, S. L., & Compton, D. A. (2008). Examining the link between chromosomal

instability and aneuploidy in human cells. J. Cell Biol., 180(4), 665-672.

Torres, E., Williams, B., & Amon, A. (2008). Aneuploidy: cells losing their balance.

Genetics, 179(2), 737-746.

Torring, N., Borre, M., Sorensen, K., Anderson, C., Wiuf, C., & Orntoft, T. (2007).

Genome-Wide analysis of allelic imbalance in prostate cancer using the

Affymetrix 50K SNP mapping array. British Journal of Cancer, 96, 499-506.

Tsafrir, D., Bacolod, M., Selvanayagam, Z., Tsafrir, I., Shia, J., Zeng, Z., et al. (2006).

Relationship of Gene Expression and Chromosomal Abnormalities in Colorectal

Cancer. Cancer Res, 66(4), 2129-2137.

Tu, Z., Wang, L., Xu, M., Zhou, X., Chen, T., & Sun, F. (2006). Further understanding

human disease genes by comparing with housekeeping genes and other genes.

BMC Genomics, 7(31).

Tucker, T., & Friedman, J. (2002). Pathogenesis of hereditary tumors: beyond the "two-

hit" hypothesis. Clinical Genetics, 62, 345-357.

van de Vijver, M. J., He, Y. D., van 't Veer, L. J., Dai, H., Hart, A. A. M., Voskuil, D. W., et

al. (2002). A Gene-Expression Signature as a Predictor of Survival in Breast

Cancer. N Engl J Med, 347(25), 1999-2009.

Walker, B., & Morgan, G. (2006). Use of Single nucleotide polymorphism-based mapping

Arrays to detect Copy Number Changes and Loss of Heterozygosity in Multiple

Myeloma. Clinical Lymphoma and Myeloma, 7(3), 186-191.

Wang, Z. C., Buraimoh, A., Iglehart, J. D., & Richardson, A. L. (2006). Genome-wide

analysis for loss of heterozygosity in primary and recurrent phyllodes tumor and

fibroadenoma of breast using single nucleotide polymorphism arrays. Breast

Cancer Res Treat, 97(3), 301-309.

Warrington, J. A., Nair, A., Mahadevappa, M., & Tsyganskaya, M. (2000). Comparison of

human adult and fetal expression and identification of 535

housekeeping/maintenance genes. Physiol. Genomics, 2(3), 143-147.

Weaver, B., & Cleveland, D. (2008). The aneuploidy paradox in cell growth and

tumorigenesis. Cancer Cell, 14, 431-433.

Winter, E., Goodstadt, L., & Ponting, C. (2004). Elevated rates of protein secretion,

evolution, and disease among tissue-specific genes. Genome Res., 14, 54-61.

Wong, K.-K., Tsang, Y. T. M., Shen, J., Cheng, R. S., Chang, Y.-M., Man, T.-K., et al.

(2004). Allelic imbalance analysis by high-density single-nucleotide polymorphic

allele (SNP) array with whole genome amplified DNA. Nucl. Acids Res., 32(9),

e69-.

Zhang, L., & Li, W.-H. (2004). Mammalian Housekeeping Genes Evolve More Slowly

than Tissue-Specific Genes. Mol Biol Evol, 21(2), 236-239.

Zhao, X., Li, C., Paez, J. G., Chin, K., Janne, P. A., Chen, T.-H., et al. (2004). An

Integrated View of Copy Number and Allelic Alterations in the Cancer Genome

Using Single Nucleotide Polymorphism Arrays. Cancer Res, 64(9), 3060-3071.

Zheng, H. T., Peng, Z. H., Li, S., & He, L. (2005). Loss of heterozygosity analyzed by

single nucleotide polymorphism array in cancer. World J Gastroenterol, 11(43),

6740-6744.

Appendix 1. List of cell lines

A. List of 12 experimental cell lines

Tissue      Normal/Cancer   Number of Samples
lung        cancer          3
            normal          2
pancreas    cancer          1
prostate    cancer          4
            normal          1
skin        normal          1

B. List of 198 GEO database cell lines

Tissue      Normal/Cancer   Number of Samples
blood       cancer          8
bone        cancer          2
brain       cancer          18
            normal          1
            normal          6
            normal          1
            cancer          14
            normal          8
kidney      normal          11
liver       cancer          1
            normal          1
lung        cancer          82
            normal          11
lymph       cancer          2
muscle      cancer          1
ovary       normal          7
pancreas    normal          2
prostate    cancer          4
            normal          2
salivary    normal          2
skin        cancer          13
urinary     cancer          1

Appendix 2. Twin “E” – probability stripes at a p-value = 0.001 for each repeated assay of the same DNA sample

[Figure panels: Repeat 1, Repeat 2, Repeat 3]

Appendix 3. Twin “O” – probability stripes at a p-value = 0.001 for each repeated assay of the same DNA sample

[Figure: probability-stripe panels for Repeat 1, Repeat 2, and Repeat 3]


Appendix 4. Summary table of DU-145 karyotypes: number of marker chromosomes and normal chromosomes that create at least a haploid genome.

Csome   SKY # Normal (apparent)   # of marker Csomes   Probability Stripes Designation
1       2                         1                    BC
2       2                         1                    HET
3       2                         1                    BC
4       1                         1                    BC
5       2                         2                    HET
6       2                         1                    BC
7       3                         0                    HET
8       2                         0                    HET
9       2                         2                    BC
10      2                         2                    BC
11      1                         3                    BC
12      2                         1                    BC
13      0                         2                    WC
14      2                         2                    HET
15      1                         1                    HET
16      2                         1                    BC
17      3                         0                    HET
18      2                         1                    BC
19      2                         2                    BC
20      1                         2                    HET
21      3                         2                    HET
22      2                         0                    WH

BC = Broken chromosome LOH
HET = Heterozygous chromosomes
WH = Whole homolog LOH
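The Appendix 4 rows can be tallied programmatically as a quick consistency check. The following is a minimal sketch, not part of the original analysis: the rows are transcribed from the table above, the names `KARYOTYPE`, `tally_designations`, and `chromosomes_without_copy` are hypothetical, and counting a chromosome as "represented" whenever it has at least one normal or marker chromosome is an assumption about how the caption should be read.

```python
from collections import Counter

# Rows transcribed from Appendix 4:
# (chromosome, apparent normal copies by SKY, marker chromosomes, designation).
KARYOTYPE = [
    (1, 2, 1, "BC"), (2, 2, 1, "HET"), (3, 2, 1, "BC"), (4, 1, 1, "BC"),
    (5, 2, 2, "HET"), (6, 2, 1, "BC"), (7, 3, 0, "HET"), (8, 2, 0, "HET"),
    (9, 2, 2, "BC"), (10, 2, 2, "BC"), (11, 1, 3, "BC"), (12, 2, 1, "BC"),
    (13, 0, 2, "WC"), (14, 2, 2, "HET"), (15, 1, 1, "HET"), (16, 2, 1, "BC"),
    (17, 3, 0, "HET"), (18, 2, 1, "BC"), (19, 2, 2, "BC"), (20, 1, 2, "HET"),
    (21, 3, 2, "HET"), (22, 2, 0, "WH"),
]


def tally_designations(rows):
    """Count how many chromosomes carry each probability-stripe designation."""
    return Counter(designation for _, _, _, designation in rows)


def chromosomes_without_copy(rows):
    """List chromosomes with neither a normal nor a marker chromosome (assumed check)."""
    return [chrom for chrom, normal, marker, _ in rows if normal + marker == 0]


if __name__ == "__main__":
    print(dict(tally_designations(KARYOTYPE)))
    print("unrepresented chromosomes:", chromosomes_without_copy(KARYOTYPE))
```

The designation for chromosome 13 is kept as "WC", exactly as printed in the table, even though the legend defines only BC, HET, and WH.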


Appendix 5. Number of genes expressed in the variably expressed cell lines, both experimental and in-silico

Bin   Experimental (12)   In-silico (198)   Genes in common
2     687                 1063              189
3     545                 721               95
4     468                 649               71
5     450                 605               71
6     389                 543               45
7     416                 595               48
8     412                 655               58
9     462                 683               53
10    536                 986               92
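To put the "common" column of Appendix 5 in proportion, the shared genes in each bin can be expressed as a percentage of each data set. This is only an illustrative sketch using the counts transcribed from the table above; the helper names `overlap_percent` and `summarize` are hypothetical, and the percentages are not reported in the dissertation itself.

```python
# Rows transcribed from Appendix 5: (bin, experimental, in_silico, common).
ROWS = [
    (2, 687, 1063, 189),
    (3, 545, 721, 95),
    (4, 468, 649, 71),
    (5, 450, 605, 71),
    (6, 389, 543, 45),
    (7, 416, 595, 48),
    (8, 412, 655, 58),
    (9, 462, 683, 53),
    (10, 536, 986, 92),
]


def overlap_percent(common, total):
    """Percentage of `total` genes that also appear in the other data set."""
    return 100.0 * common / total


def summarize(rows):
    """Return one (bin, % of experimental, % of in-silico) tuple per table row."""
    return [
        (b, overlap_percent(c, exp), overlap_percent(c, ins))
        for b, exp, ins, c in rows
    ]


if __name__ == "__main__":
    for b, pct_exp, pct_ins in summarize(ROWS):
        print(f"bin {b:>2}: {pct_exp:5.1f}% of experimental, "
              f"{pct_ins:5.1f}% of in-silico")
```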

Appendix 6A. List of gene symbols of 600 housekeeping genes from our 778 universally expressed cell survival genes

ACLY  ATP5J2  CCT4  CYB5R3  EIF3EIP  GNG5
ACTB  ATP5L  CCT8  CYCS  EIF3F  GOLGA4
ACTG1  ATP5O  CD164  DARS  EIF3H  GOLGA7
ACTR2  ATP6AP2  CD40  DAZAP2  EIF3K  GOLGB1
ACTR3  ATP6V0E1  CD59  DBI  EIF4A1  GOLPH3
ACVR1B  ATP6V1A  CD63  DCTN3  EIF4A2  GORASP2
ADAR  ATP6V1B2  CDC16  DDX17  EIF4B  GPI /// LOC100133951
ALDOA  AZIN1  CDC2L2  DDX21  EIF4G2  GPX4
ANAPC13  B2M  CDC5L  DDX39  EIF4H  GRHPR
ANP32B  BAT1  CDIPT  DDX5  EIF5  GSTO1
ANXA2  BAT3  CDK2AP1  DERA  ENO1  GTF3A
ANXA5  BCLAF1  CDK4  DLG1  ERP29  H2AFV
AP2S1  BIRC2  CFL1  DNAJA1  FAM89B  H2AFY
AP3S1  BNIP3L  CHSY1  DNAJB6  FBL  H2AFZ
APLP2  BRE  CKS1B  DRG1  FBXO21  H3F3A /// H3F3B
APP  BRP44  CLCN3  DSTN  FBXO9  HADHB
ARF1  BTF3  CLIC4  DUT  FDFT1  HBXIP
ARF4  BTG1  CLTC  DYNC1I2  FKBP15  hCG_1781062 /// SRP9
ARHGAP1  BUB3  CNBP  DYNLL1  FKBP1A  hCG_21078 /// RPL27A
ARL5A  BUD31  CNIH  DYNLRB1  FLJ21865  HDGF
ARL6IP5  BZW1  CNOT1  DYNLT1  FNTA  HERPUD1
ARPC2  C11orf10  COG2  ECHS1  FTH1  HEXB
ARPC3  C14orf166  COPB1  EDF1  FTL  HIGD1A
ARS2  C15orf15  COPB2  EEF1A1  FTSJ2  HIGD2A
ASAH1  CACYBP  COPS8  EEF1B2  G3BP1  HINT1
ATP13A3  CALM1 /// CALM2 /// CALM3  COX11  EEF1D  GABARAP  HMGB1
ATP1A1  CALR  COX4I1  EEF1G  GABARAPL2  HMGB2
ATP1B3  CANX  COX5B  EEF2  GAK  HMGN1
ATP2A2  CAP1  COX6A1  EIF1  GALNT1  HMGN2
ATP5A1  CAPRIN1  COX6C  EIF1B  GAPDH  HMGN3
ATP5B  CAPZA1  COX7A2  EIF2AK1  GDI2  HNRNPA1
ATP5C1  CAPZB  COX7A2L  EIF2B2  GGA1  HNRNPA2B1
ATP5E  CARS  COX7C  EIF2S1  GLS  HNRNPA3
ATP5F1  CBX1  COX8A  EIF2S2  GLTSCR2  HNRNPC
ATP5H  CCNI  CSDE1  EIF3A  GNAS  HNRNPH1
ATP5I  CCT2  CSNK1A1  EIF3D  GNB1  HNRNPK
ATP5J  CCT3  CSNK2B /// LY6G5B  EIF3E  GNB2L1  HNRNPM

HNRNPU  LOC649299 /// RPL36A  NDUFB2  PHB2  PTTG1IP  RPL23
HNRPDL  LOC653566 /// SPCS2  NDUFB5  PJA2  PUM1  RPL23A
HSBP1  LOC653737 /// LOC729402 /// RPL21  NDUFB8  PLOD3  PUM2  RPL24
HSD17B10  LTA4H  NDUFS6  POLR1D  QARS  RPL24 /// SLC36A2
HSD17B12  LUC7L2  NFE2L1  POLR2C  RAB14  RPL26L1
HSP90AB1  MACF1  NME1 /// NME2  POLS  RAB1A  RPL27
HSP90B1  MAP4  NOL5A  PPIA  RAB2A  RPL28
HSPA5  MARS  NOL7  PPME1  RAB5A  RPL29 /// RPL29P4
HSPA8  MAT2A  NONO  PPP1CB  RAB6A  RPL3
HSPA9  MAT2B  NPM1  PPP1CC  RAC1  RPL30
HSPD1  MATR3  NPTN  PPP1R11  RAE1  RPL31
HSPE1  MAX  NRD1  PPP2CB  RALBP1  RPL32
IK  MCL1  NUCKS1  PPT1  RPL35
ILF2  MDH2  NUDC  PRDX6  RBBP4  RPL35A
ILVBL  MED13L  NUTF2  PRKAR1A  RBBP7  RPL36
IMMT  METTL5  OAZ1  PRPF31  RBM16  RPL36A
ISCU  MFAP1  OGT  PRPSAP1  RBM39  RPL36AL
ITGB1  MKNK1  OPHN1  PSAP  RBMX  RPL37
JTB  MKNK2  OS9  PSMA4  RBPJ  RPL37A
JUND  MORF4L1  PABPC1  PSMB1  REXO2  RPL38
KARS  MORF4L2  PABPC3  PSMB2  RHEB  RPL39
KDELR2  MRCL3 /// MRLC2  PABPN1  PSMB4  RHOA  RPL4
KIAA0391 /// PSMA6  MRLC2  PAPOLA  PSMB5  RNF13  RPL41
KIAA0494  MRPL15  PARK7  PSMC1  RNPS1  RPL5
KLHDC2  MTDH  PCBP2  PSMC2  RPL10  RPL6
KPNB1  MYD88  PCMT1  PSMD14  RPL10A  RPL7
KTN1  NACA  PCNP  PSMD2  RPL11  RPL7A
LAMP1  NAMPT  PDCD6IP  PSMD4  RPL12  RPL8
LARP1  NAP1L1  PDIA3  PSMD6  RPL13  RPL9
LASP1  NARS  PDIA6  PSME2  RPL13A  RPLP0
LDHA  NCL  PEA15  PSMG2  RPL14  RPLP0 /// RPLP0-like
LOC100131713 /// RPL29 /// RPL29P4  NCOR2  PEBP1  PTBP1  RPL14 /// RPL14L  RPLP1
LOC100133665 /// TALDO1  NDRG1  PEX16  PTDSS1  RPL15  RPLP2
LOC390354 /// RPL18A  NDUFA5  PFDN5  PTGES3  RPL17  RPN2
LOC439992 /// RPS3A  NDUFA6  PFN1  PTMA  RPL18  RPS10
LOC643287 /// PTMA  NDUFAB1  PGAM1  PTP4A1  RPL19  RPS11
LOC644315 /// RPS7  NDUFB11  PGK1  PTP4A2  RPL22  RPS12

RPS13  SEC63  SSR2  TSN  YWHAQ
RPS14  SEPHS2  SSR4  TSPAN3  YWHAZ
RPS15  SERBP1  STAT1  TTC3  YY1
RPS15A  SERINC3  STAU1  TUBA1B  ZFR
RPS16  SERP1  STRAP  TUBB  ZNF207
RPS17  SET  SUB1  TXN  ZNF706
RPS19  SFRS1  SUMO1  TXNDC4  ZNHIT3
RPS2  SFRS10  SUMO2  UBA52
RPS20  SFRS11  SUPT16H  UBE2B
RPS23  SFRS2  SYNCRIP  UBE2D3
RPS24  SFRS3  TAF10  UBE2L3
RPS25  SFRS5  TARDBP  UBE2L6
RPS27  SFRS7  TARS  UBE2N
RPS27A /// UBA52 /// UBB /// UBC  SFRS9  TAX1BP1  UBE2Q1
RPS27A /// UBB /// UBC  SH3GLB1  TBC1D9B  UCRC
RPS28  SHFM1  TBCA  UGP2
RPS29  SLC20A1  TEGT  UPF3A
RPS3  SLC25A16  TIMM8B  UQCRB
RPS3A  SLC25A3  TJP1  UQCRFS1
RPS4X  SLC25A5  TKT  UQCRQ
RPS5  SLC25A6  TM9SF2  USP22
RPS6  SLC39A6  TMBIM4  USP4
RPS7  SLC7A1  TMED10  USP9X
RPS8  SMS  TMED2  VAPA
RPS9  SNRPD2  TMEM123  VCL
RTCD1  SNRPG  TMEM189-UBE2V1 /// UBE2V1  VDAC2
RTN4  SNW1  TMEM59  VDAC3
RYK  SNX3  TMEM66  VEZF1
SAP18  SON  TMSB10  VPS4B
SAR1A  SPCS1  TNPO1  WBSCR22
SCAMP2  SPEN  TOMM20  WIPF2
SCP2  SPG21  TOMM70A  XBP1
SDCBP  SPOP  TPI1  XPO1
SDHC  SPTBN1  TPP1  XRCC5
SEC11A  SRP14  TPT1  YBX1
SEC61B  SRRM1  TRMT5  YME1L1
SEC61G  SSB  TSEN34  YWHAB

Appendix 6B. List of gene symbols of 175 non-housekeeping genes from our 778 universally expressed cell survival genes

AASDHPPT  DKC1  LOC388524 /// LOC653162 /// RP11-556K13.1 /// RPSA  RTF1  TSPYL4
ABCD3  DLD  LOC644191 /// LOC728937 /// RPS26  RWDD1  TTLL5
ABCE1  DNAJC9  LPGAT1  SDCCAG1  U2AF2
ACSL3  DNTTIP2  LSM6  SEC23A  UAP1
ADH5  DOCK5 /// PPP2R2A  MAPRE1  SEC62  UBE2G1
ADNP  DYRK1A  MYL6 /// MYL6B  SENP6  UBE2V2
ADSS  EFHA1  NBN  SERTAD2  USP3
AHCYL1  EIF3C /// EIF3CL  NBPF10  SF3B1  UTP14C
AKAP11  ELAVL1  NCBP2  SFPQ  VPS26A
AMD1  EXOC1  NCK1  SFRS12  VPS35
ARL3  FICD  NDUFS5 /// RPL10  SIAH1  XPOT
ARL6IP1  GCLC  NEK7  SIP1  YTHDC1
ATG12  GCSH /// LOC730107  NFE2L2  SIRT3  ZBED5
BCAS2  GLO1  NGRN  SKIV2L2  ZC3H11A
BRD4  GOLGA5  NME1-NME2 /// NME2  SLC30A9  ZMPSTE24
BTBD1  hCG_16001 /// RPL23A  NOLC1  SLC5A3  ZNF160
C19orf22  hCG_1757335 /// RAP1B  NUDT4 /// NUDT4P1  SMARCC2  Septin 2
CALU  HDAC2  NUP153  SMARCE1  Septin 7
CAPZA2  HIBCH  NUP50  SMC1A  15 kDa selenoprotein
CBX3 /// LOC653972  HIF1A  OPA1  SNRK
CDK2  HLA-DMB  OSBP  SNRPA1
CDKN1B  HMGB3  PER2  SNRPE
CHD9  HNRNPA0  PIN4  SP3
CNOT7  HNRNPH3  POM121 /// POM121C  SPTLC1
CORO1C  HNRNPR  PPP2R5E  SQLE
COX15  HPRT1  PRKCI  SRP72
CPNE1 /// RBM12  HSPH1  PRKRIR  SSR1
CRLF3  IFNGR1  PRPF40A  ST13
CROP  IFT88  PTPRO  STK38L
CUGBP1  INPP5A  PURA  SUMO2 /// SUMO4
CUL3  IPO5  RACGAP1  TGOLN2
CUX1  ITGAE  RAD1  TLK1
DAZAP1  IVNS1ABP  RANBP2  TMED5
DDX10  JARID2  RANBP9  TMEM41B
DDX18  JMJD1A  RCN2  TMF1
DDX3X  KIAA1012  RIOK3  TOMM40
DEK  KPNA2 /// LOC728860  RNF6  TOP1
DGUOK  LARP4  RNMT  TPRKB
DHFR  LARP7  RRM1  TRAM1
DICER1  LOC100130553 /// RPS18  RRP1B  TROVE2

105 Appendix Table 7A. Functional Classification of “Universally expressed genes” expressed in all cell lines and > 99.75% tissues. Intra Met = Intracellular Metabolic Function Diff/Dev = Differentiated and/or Developmental Function Unfilled spaces = unknown function Probe Id Gene Symbol/Title Category Probe Id Gene Symbol/Title Category 15 kDa Intra Met Intra Met 201322_at ATP5B 200902_at selenoprotein 205711_x_at ATP5C1 Intra Met 202169_s_at AASDHPPT Intra Met 217801_at ATP5E Intra Met 202850_at ABCD3 Intra Met 211755_s_at ATP5F1 Intra Met 201872_s_at ABCE1 Diff/Dev 210149_s_at ATP5H Intra Met 201128_s_at ACLY Intra Met 207335_x_at ATP5I Intra Met 201661_s_at ACSL3 Intra Met 202325_s_at ATP5J Intra Met 200801_x_at ACTB Intra Met 202961_s_at ATP5J2 Intra Met 213214_x_at ACTG1 Diff/Dev 208746_x_at ATP5L Intra Met 200728_at ACTR2 Intra Met 200818_at ATP5O Intra Met 200996_at ACTR3 Intra Met 201443_s_at ATP6AP2 Intra Met 213198_at ACVR1B Diff/Dev 214150_x_at ATP6V0E1 Intra Met 201786_s_at ADAR Intra Met 201972_at ATP6V1A Intra Met 208847_s_at ADH5 Intra Met 201089_at ATP6V1B2 Intra Met 201773_at ADNP Intra Met 212461_at AZIN1 Intra Met 221761_at ADSS Intra Met 201891_s_at B2M Diff/Dev 200850_s_at AHCYL1 Intra Met 200041_s_at BAT1 Intra Met 203156_at AKAP11 Intra Met 210208_x_at BAT3 Diff/Dev 200966_x_at ALDOA Diff/Dev 203053_at BCAS2 Intra Met 201197_at AMD1 Intra Met 201084_s_at BCLAF1 Intra Met 209001_s_at ANAPC13 Intra Met 202076_at BIRC2 Intra Met 201305_x_at ANP32B Intra Met 221478_at BNIP3L Intra Met 201590_x_at ANXA2 Diff/Dev 202102_s_at BRD4 Intra Met 200782_at ANXA5 Intra Met 205550_s_at BRE Intra Met 211047_x_at AP2S1 Intra Met 202427_s_at BRP44 Intra Met 202442_at AP3S1 Intra Met 217945_at BTBD1 Intra Met 208248_x_at APLP2 Diff/Dev 208517_x_at BTF3 Intra Met 214953_s_at APP Diff/Dev 200921_s_at BTG1 Intra Met 200065_s_at ARF1 Intra Met 201457_x_at BUB3 Intra Met 201096_s_at ARF4 Intra Met 205690_s_at BUD31 Intra Met 202117_at ARHGAP1 Intra Met 200777_s_at 
BZW1 Intra Met 202641_at ARL3 Intra Met 218213_s_at C11orf10 Intra Met 218150_at ARL5A Intra Met 217768_at C14orf166 Intra Met 211935_at ARL6IP1 Intra Met 217915_s_at C15orf15 Intra Met 200761_s_at ARL6IP5 Intra Met 55705_at C19orf22 Intra Met 207988_s_at ARPC2 Intra Met 211761_s_at CACYBP Intra Met 208736_at ARPC3 Intra Met CALM1 /// CALM2 /// Diff/Dev 201680_x_at ARS2 207243_s_at CALM3 213702_x_at ASAH1 Diff/Dev 214315_x_at CALR Intra Met 213026_at ATG12 Intra Met 200757_s_at CALU Intra Met 212297_at ATP13A3 Intra Met 200068_s_at CANX Intra Met 220948_s_at ATP1A1 Diff/Dev 213798_s_at CAP1 Intra Met 208836_at ATP1B3 Intra Met 200723_s_at CAPRIN1 Intra Met 209186_at ATP2A2 Diff/Dev 208374_s_at CAPZA1 Intra Met 213738_s_at ATP5A1 Intra Met 201238_s_at CAPZA2 Intra Met


Probe Id Gene Symbol/Title Category Probe Id Gene Symbol/Title Category 37012_at CAPZB Intra Met 201134_x_at COX7C Intra Met 212971_at CARS Intra Met 201119_s_at COX8A Intra Met 201518_at CBX1 Intra Met 205474_at CRLF3 CBX3 /// Intra Met Intra Met 203804_s_at CROP 200037_s_at LOC653972 202646_s_at CSDE1 Diff/Dev 208656_s_at CCNI Diff/Dev 213086_s_at CSNK1A1 Intra Met 201947_s_at CCT2 Intra Met CSNK2B /// Intra Met 200910_at CCT3 Intra Met 201390_s_at LY6G5B 200877_at CCT4 Intra Met 221743_at CUGBP1 Diff/Dev 200873_s_at CCT8 Intra Met 201371_s_at CUL3 Intra Met 208405_s_at CD164 Diff/Dev 214743_at CUX1 Diff/Dev 35150_at CD40 Intra Met 201885_s_at CYB5R3 Intra Met 200983_x_at CD59 Diff/Dev 208905_at CYCS Intra Met 200663_at CD63 Intra Met 201623_s_at DARS Intra Met 202717_s_at CDC16 Intra Met 218443_s_at DAZAP1 Diff/Dev 212401_s_at CDC2L2 Intra Met 200794_x_at DAZAP2 Intra Met 209056_s_at CDC5L Intra Met 202428_x_at DBI Intra Met 201253_s_at CDIPT Intra Met 204246_s_at DCTN3 Intra Met 204252_at CDK2 Intra Met 204977_at DDX10 Intra Met 201938_at CDK2AP1 Intra Met 208718_at DDX17 Intra Met 202246_s_at CDK4 Intra Met 208895_s_at DDX18 Intra Met 209112_at CDKN1B Intra Met 208152_s_at DDX21 Intra Met 200021_at CFL1 Intra Met 201584_s_at DDX39 Intra Met 212616_at CHD9 Intra Met 201210_at DDX3X Intra Met 203044_at CHSY1 Intra Met 200033_at DDX5 Intra Met 201897_s_at CKS1B Intra Met 200934_at DEK Diff/Dev 201734_at CLCN3 Intra Met 218102_at DERA Intra Met 201560_at CLIC4 Intra Met 209549_s_at DGUOK Intra Met 200614_at CLTC Intra Met 202534_x_at DHFR Intra Met 206158_s_at CNBP Intra Met 212888_at DICER1 Diff/Dev 201653_at CNIH Intra Met 201479_at DKC1 Intra Met 200860_s_at CNOT1 Intra Met 209095_at DLD Intra Met 218250_s_at CNOT7 Intra Met 202515_at DLG1 Intra Met 203073_at COG2 Intra Met 200880_at DNAJA1 Intra Met 201358_s_at COPB1 Intra Met 208810_at DNAJB6 Intra Met 201098_at COPB2 Intra Met 213088_s_at DNAJC9 Intra Met 202141_s_at COPS8 Intra Met 202776_at DNTTIP2 Intra 
Met 221676_s_at CORO1C Intra Met 202810_at DRG1 Diff/Dev 211727_s_at COX11 Diff/Dev 201022_s_at DSTN Intra Met 219547_at COX15 Intra Met 208956_x_at DUT Intra Met 200086_s_at COX4I1 Intra Met 211684_s_at DYNC1I2 Intra Met 202343_x_at COX5B Intra Met 200703_at DYNLL1 Diff/Dev 200925_at COX6A1 Intra Met 217918_at DYNLRB1 Diff/Dev 201754_at COX6C Intra Met 201999_s_at DYNLT1 Diff/Dev 201597_at COX7A2 Intra Met 209033_s_at DYRK1A Intra Met 201256_at COX7A2L Intra Met 201135_at ECHS1 Intra Met


Probe Id Gene Symbol/Title Category Probe Id Gene Symbol/Title Category 201503_at G3BP1 Intra Met 209058_at EDF1 Intra Met 200645_at GABARAP Intra Met 204892_x_at EEF1A1 Intra Met 209046_s_at GABARAPL2 Intra Met EEF1A1 /// Intra Met 40225_at GAK Intra Met 213477_x_at EEF1AL3 201723_s_at GALNT1 Intra Met 200705_s_at EEF1B2 Intra Met 212581_x_at GAPDH Intra Met 203113_s_at EEF1D Intra Met 202922_at GCLC Diff/Dev 200689_x_at EEF1G /// TUT1 Intra Met 213129_s_at GCSH /// LOC730107 Intra Met 204102_s_at EEF2 Intra Met 200009_at GDI2 Intra Met 212410_at EFHA1 Intra Met 45572_s_at GGA1 Intra Met 202021_x_at EIF1 Intra Met 200681_at GLO1 Intra Met 201738_at EIF1B Intra Met 221510_s_at GLS Intra Met 217736_s_at EIF2AK1 Intra Met 217807_s_at GLTSCR2 Intra Met 202461_at EIF2B2 Intra Met 200780_x_at GNAS Diff/Dev 201142_at EIF2S1 Intra Met 200744_s_at GNB1 Intra Met 208726_s_at EIF2S2 Intra Met 200651_at GNB2L1 Diff/Dev 200595_s_at EIF3A Intra Met 207157_s_at GNG5 Intra Met 210949_s_at EIF3C /// EIF3CL Intra Met 201567_s_at GOLGA4 Intra Met 200005_at EIF3D Intra Met 218241_at GOLGA5 Intra Met 208697_s_at EIF3E Intra Met 217819_at GOLGA7 Intra Met 217719_at EIF3EIP Intra Met 201057_s_at GOLGB1 Intra Met 200023_s_at EIF3F Intra Met 217803_at GOLPH3 Intra Met 201592_at EIF3H Intra Met 208842_s_at GORASP2 Intra Met 212716_s_at EIF3K Intra Met 208308_s_at GPI /// LOC100133951 Diff/Dev 201530_x_at EIF4A1 Intra Met 201106_at GPX4 Diff/Dev 200912_s_at EIF4A2 Intra Met 201347_x_at GRHPR Intra Met 211937_at EIF4B Intra Met 201470_at GSTO1 Intra Met 200004_at EIF4G2 Intra Met 201338_x_at GTF3A Intra Met 206621_s_at EIF4H Intra Met 212205_at H2AFV Intra Met 208705_s_at EIF5 Intra Met 207168_s_at H2AFY Intra Met 201726_at ELAVL1 Diff/Dev 200853_at H2AFZ Intra Met 201231_s_at ENO1 Intra Met 209069_s_at H3F3A /// H3F3B Intra Met 201216_at ERP29 Intra Met 201007_at HADHB Intra Met 222127_s_at EXOC1 Intra Met 202300_at HBXIP Diff/Dev 32209_at FAM89B hCG_16001 /// Intra Met 211623_s_at FBL 
Intra Met 208825_x_at RPL23A 212229_s_at FBXO21 Intra Met hCG_1757335 /// Intra Met 210638_s_at FBXO9 Intra Met 200833_s_at RAP1B 208647_at FDFT1 Intra Met hCG_1781062 /// Intra Met 219910_at FICD Intra Met 201273_s_at SRP9 hCG_21078 /// 31826_at FKBP15 Intra Met Intra Met 203034_s_at RPL27A 200709_at FKBP1A Intra Met 201833_at HDAC2 Intra Met 65635_at FLJ21865 Intra Met 200896_x_at HDGF Intra Met 200090_at FNTA Intra Met 217168_s_at HERPUD1 Intra Met 200748_s_at FTH1 Diff/Dev 201944_at HEXB Intra Met 212788_x_at FTL Diff/Dev 213374_x_at HIBCH Intra Met 218356_at FTSJ2 Intra Met

108 Probe Id Gene Symbol/Title Category Probe Id Gene Symbol/Title Category 200989_at HIF1A Intra Met 211945_s_at ITGB1 Intra Met 217845_x_at HIGD1A Intra Met 206245_s_at IVNS1ABP Intra Met 209329_x_at HIGD2A Intra Met 203297_s_at JARID2 Diff/Dev 200093_s_at HINT1 Intra Met 212689_s_at JMJD1A Diff/Dev 203932_at HLA-DMB Intra Met 200048_s_at JTB Intra Met 200679_x_at HMGB1 Intra Met 203752_s_at JUND Intra Met 208808_s_at HMGB2 Intra Met 200079_s_at KARS Intra Met 203744_at HMGB3 Diff/Dev 200698_at KDELR2 Intra Met 200943_at HMGN1 Intra Met 208805_at KIAA0391 /// PSMA6 Intra Met 208668_x_at HMGN2 Intra Met 201778_s_at KIAA0494 Intra Met 209377_s_at HMGN3 Intra Met 207305_s_at KIAA1012 Intra Met 201054_at HNRNPA0 Intra Met 217906_at KLHDC2 Intra Met KPNA2 /// 200016_x_at HNRNPA1 Intra Met Intra Met 205292_s_at HNRNPA2B1 Intra Met 201088_at LOC728860 211929_at HNRNPA3 Intra Met 208974_x_at KPNB1 Intra Met 200014_s_at HNRNPC Intra Met 200915_x_at KTN1 Intra Met 201031_s_at HNRNPH1 Intra Met 201553_s_at LAMP1 Intra Met 208990_s_at HNRNPH3 Intra Met 212137_at LARP1 Intra Met Intra Met 200097_s_at HNRNPK Intra Met 212714_at LARP4 Intra Met 200072_s_at HNRNPM Intra Met 212785_s_at LARP7 Intra Met 208766_s_at HNRNPR Intra Met 200618_at LASP1 Intra Met 200594_x_at HNRNPU Intra Met 200650_s_at LDHA LOC100130553 /// 201993_x_at HNRPDL Intra Met Intra Met 201049_s_at RPS18 202854_at HPRT1 Diff/Dev LOC100130862 /// Intra Met 200941_at HSBP1 Intra Met 201398_s_at TRAM1 202282_at HSD17B10 Intra Met LOC100131713 /// Intra Met 217869_at HSD17B12 Intra Met 200823_x_at RPL29 /// RPL29P4 LOC100133665 /// 200064_at HSP90AB1 Intra Met Intra Met 200598_s_at HSP90B1 Intra Met 201463_s_at TALDO1 211936_at HSPA5 Intra Met LOC388524 /// LOC653162 /// Intra Met Intra Met 208687_x_at HSPA8 RP11-556K13.1 /// 200691_s_at HSPA9 Intra Met 213801_x_at RPSA 200806_s_at HSPD1 Intra Met LOC390354 /// Intra Met 205133_s_at HSPE1 Intra Met 200869_at RPL18A Intra Met LOC439992 /// 206976_s_at HSPH1 Intra 
Met 202727_s_at IFNGR1 Diff/Dev 200099_s_at RPS3A LOC643287 /// 204703_at IFT88 Diff/Dev Diff/Dev 216515_x_at PTMA /// PTMAP5 200066_at IK Diff/Dev LOC644315 /// Intra Met Intra Met 200052_s_at ILF2 200082_s_at RPS7 210624_s_at ILVBL Intra Met LOC649299 /// Intra Met 200955_at IMMT Intra Met 217256_x_at RPL36A 203006_at INPP5A Diff/Dev LOC653737 /// 211954_s_at IPO5 Intra Met LOC729402 /// Intra Met 209075_s_at ISCU Diff/Dev 200012_x_at RPL21 LOC728937 /// 205055_at ITGAE Intra Met Intra Met 217753_s_at RPS26


Probe Id Gene Symbol/Title Category Probe Id Gene Symbol/Title Category 202651_at LPGAT1 Intra Met 201757_at NDUFS5 /// RPL10 Intra Met 205036_at LSM6 Intra Met 203606_at NDUFS6 Intra Met 208771_s_at LTA4H Intra Met 212530_at NEK7 Intra Met 220099_s_at LUC7L2 200759_x_at NFE2L1 Intra Met 207358_x_at MACF1 Intra Met 201146_at NFE2L2 Intra Met 212566_at MAP4 Intra Met 217722_s_at NGRN Diff/Dev 200713_s_at MAPRE1 Intra Met 201577_at NME1 /// NME2 Intra Met Intra Met NME1-NME2 /// 201475_x_at MARS Diff/Dev 200768_s_at MAT2A Diff/Dev 201268_at NME2 217993_s_at MAT2B Intra Met 200875_s_at NOL5A Intra Met 214363_s_at MATR3 Intra Met 202882_x_at NOL7 Intra Met 209332_s_at MAX Intra Met 211951_at NOLC1 Intra Met 200798_x_at MCL1 Diff/Dev 200057_s_at NONO Intra Met 209036_s_at MDH2 Intra Met 200063_s_at NPM1 Intra Met 212208_at MED13L Intra Met 202228_s_at NPTN Diff/Dev 221570_s_at METTL5 Intra Met 208709_s_at NRD1 Diff/Dev 203406_at MFAP1 217802_s_at NUCKS1 Intra Met 209467_s_at MKNK1 Intra Met 210574_s_at NUDC Intra Met NUDT4 /// 218205_s_at MKNK2 Intra Met Intra Met 212181_s_at NUDT4P1 217982_s_at MORF4L1 Intra Met 202097_at NUP153 Intra Met 201994_at MORF4L2 Intra Met 213682_at NUP50 Intra Met 201318_s_at MRCL3 /// MRLC2 Diff/Dev 202397_at NUTF2 Intra Met 221474_at MRLC2 Intra Met 200077_s_at OAZ1 Intra Met 218027_at MRPL15 Intra Met 209240_at OGT Intra Met 212250_at MTDH Intra Met 212213_x_at OPA1 Diff/Dev 209124_at MYD88 Diff/Dev 206323_x_at OPHN1 Diff/Dev 212082_s_at MYL6 /// MYL6B Diff/Dev 200714_x_at OS9 200735_x_at NACA Intra Met 201800_s_at OSBP Intra Met 217738_at NAMPT Intra Met 215157_x_at PABPC1 Intra Met 208752_x_at NAP1L1 Intra Met 208113_x_at PABPC3 Intra Met 200027_at NARS Intra Met 201544_x_at PABPN1 Diff/Dev 202907_s_at NBN Intra Met 222035_s_at PAPOLA Intra Met 212854_x_at NBPF10 Intra Met 200006_at PARK7 Diff/Dev 201517_at NCBP2 Intra Met 204031_s_at PCBP2 Intra Met 211063_s_at NCK1 Intra Met 208857_s_at PCMT1 Intra Met 200610_s_at NCL Intra Met 
217816_s_at PCNP Intra Met 207760_s_at NCOR2 Intra Met 217746_s_at PDCD6IP Intra Met 200632_s_at NDRG1 Intra Met 208612_at PDIA3 Intra Met 201304_at NDUFA5 Intra Met 207668_x_at PDIA6 Intra Met 202001_s_at NDUFA6 Intra Met 200788_s_at PEA15 Intra Met 202077_at NDUFAB1 Intra Met 210825_s_at PEBP1 Intra Met 218320_s_at NDUFB11 Intra Met 205251_at PER2 Intra Met 218200_s_at NDUFB2 Intra Met 49878_at PEX16 Intra Met 203621_at NDUFB5 Intra Met 200634_at PFN1 Diff/Dev 201227_s_at NDUFB8 Intra Met


Probe Id Gene Symbol/Title Category Probe Id Gene Symbol/Title Category 200886_s_at PGAM1 Intra Met 208615_s_at PTP4A2 Intra Met 200738_s_at PGK1 Intra Met 211600_at PTPRO Intra Met 201600_at PHB2 Intra Met 200677_at PTTG1IP Diff/Dev 204571_x_at PIN4 Intra Met 201166_s_at PUM1 Intra Met 201133_s_at PJA2 Intra Met 216221_s_at PUM2 Intra Met 202185_at PLOD3 Intra Met 204020_at PURA Diff/Dev 218258_at POLR1D Intra Met 217846_at QARS Intra Met 214263_x_at POLR2C Intra Met 200927_s_at RAB14 Intra Met 202466_at POLS Intra Met 208724_s_at RAB1A Intra Met POM121 /// Intra Met Intra Met 208731_at RAB2A 213360_s_at POM121C 209089_at RAB5A Intra Met 201293_x_at PPIA Intra Met 201047_x_at RAB6A Intra Met 49077_at PPME1 Intra Met 208640_at RAC1 Intra Met 201407_s_at PPP1CB Intra Met 222077_s_at RACGAP1 Diff/Dev 200726_at PPP1CC Intra Met 204461_x_at RAD1 Intra Met 201500_s_at PPP1R11 Intra Met 201558_at RAE1 Intra Met 201375_s_at PPP2CB Intra Met 202845_s_at RALBP1 Intra Met 202313_at PPP2R2A Intra Met 200750_s_at RAN Intra Met 203338_at PPP2R5E Intra Met 201711_x_at RANBP2 Intra Met 200975_at PPT1 Intra Met 202582_s_at RANBP9 Intra Met 200845_s_at PRDX6 Intra Met 217301_x_at RBBP4 Intra Met 200603_at PRKAR1A Intra Met 201092_at RBBP7 Diff/Dev 209678_s_at PRKCI Intra Met 212168_at RBM12 Intra Met 209323_at PRKRIR Intra Met 203250_at RBM16 Intra Met 202408_s_at PRPF31 Intra Met 207941_s_at RBM39 Intra Met 213729_at PRPF40A Diff/Dev 213762_x_at RBMX Intra Met 202529_at PRPSAP1 Intra Met 211974_x_at RBPJ Diff/Dev 200871_s_at PSAP Intra Met 201486_at RCN2 Intra Met 203396_at PSMA4 Intra Met 218194_at REXO2 Intra Met 200876_s_at PSMB1 Intra Met 201453_x_at RHEB Intra Met 200039_s_at PSMB2 Intra Met 200059_s_at RHOA Intra Met 202244_at PSMB4 Intra Met 202130_at RIOK3 Intra Met 208799_at PSMB5 Intra Met 201779_s_at RNF13 Intra Met 204219_s_at PSMC1 Intra Met 203403_s_at RNF6 Intra Met 201068_s_at PSMC2 Intra Met 202683_s_at RNMT Intra Met 212296_at PSMD14 Intra Met 200060_s_at RNPS1 
Intra Met 200830_at PSMD2 Intra Met 200725_x_at RPL10 Intra Met 200882_s_at PSMD4 Intra Met 200036_s_at RPL10A Diff/Dev 202753_at PSMD6 Intra Met 200010_at RPL11 Intra Met 201762_s_at PSME2 Intra Met 200088_x_at RPL12 Intra Met 218467_at PSMG2 Intra Met 208929_x_at RPL13 Intra Met 211271_x_at PTBP1 Intra Met 200715_x_at RPL13A Intra Met 201433_s_at PTDSS1 Intra Met 213588_x_at RPL14 Intra Met 200627_at PTGES3 Intra Met 200074_s_at RPL14 /// RPL14L Intra Met 200772_x_at PTMA Diff/Dev 221475_s_at RPL15 Intra Met 200733_s_at PTP4A1 Intra Met


Gene Symbol/Title Category Probe Id Probe Id Gene Symbol/Title Category Intra Met 200038_s_at RPL17 200819_s_at RPS15 Intra Met Intra Met 200022_at RPL18 200781_s_at RPS15A Intra Met Intra Met 200029_at RPL19 201258_at RPS16 Intra Met Intra Met 208768_x_at RPL22 201665_x_at RPS17 Intra Met Intra Met 200888_s_at RPL23 202649_x_at RPS19 Diff/Dev Intra Met 203012_x_at RPL23A 203107_x_at RPS2 Intra Met Intra Met 200013_at RPL24 200949_x_at RPS20 Intra Met Intra Met 214143_x_at RPL24 /// SLC36A2 200926_at RPS23 Intra Met Intra Met 218830_at RPL26L1 200061_s_at RPS24 Intra Met Intra Met 200025_s_at RPL27 200091_s_at RPS25 Intra Met Intra Met 200003_s_at RPL28 200741_s_at RPS27 Intra Met 213969_x_at RPL29 /// RPL29P4 Diff/Dev RPS27A /// UBB /// Diff/Dev 201217_x_at RPL3 Intra Met 200017_at UBC 200062_s_at RPL30 Intra Met 208904_s_at RPS28 Intra Met 200963_x_at RPL31 Intra Met 201094_at RPS29 Intra Met 200674_s_at RPL32 Intra Met 208692_at RPS3 Intra Met 200002_at RPL35 Intra Met 201257_x_at RPS3A Intra Met 213687_s_at RPL35A Intra Met 200933_x_at RPS4X Intra Met 219762_s_at RPL36 Intra Met 200024_at RPS5 Intra Met 201406_at RPL36A Intra Met 200081_s_at RPS6 Intra Met 207585_s_at RPL36AL Intra Met 213941_x_at RPS7 Intra Met 200092_s_at RPL37 Intra Met 200858_s_at RPS8 Intra Met 201429_s_at RPL37A Intra Met 217747_s_at RPS9 Intra Met 202029_x_at RPL38 Diff/Dev 201477_s_at RRM1 Intra Met 208695_s_at RPL39 Intra Met 212846_at RRP1B Intra Met 200089_s_at RPL4 Intra Met 203594_at RTCD1 Intra Met 201492_s_at RPL41 Intra Met 212301_at RTF1 Intra Met 200937_s_at RPL5 Intra Met 210968_s_at RTN4 Diff/Dev 200034_s_at RPL6 Intra Met 219598_s_at RWDD1 Intra Met 200717_x_at RPL7 Intra Met 202853_s_at RYK Diff/Dev 217740_x_at RPL7A Intra Met 208742_s_at SAP18 Intra Met 200936_at RPL8 Intra Met 201542_at SAR1A Intra Met 200032_s_at RPL9 Intra Met 218143_s_at SCAMP2 Intra Met 201033_x_at RPLP0 Intra Met 211733_x_at SCP2 Intra Met RPLP0 /// RPLP0- Intra Met 200958_s_at SDCBP Intra Met 
214167_s_at like 218649_x_at SDCCAG1 Intra Met Intra Met 200763_s_at RPLP1 215088_s_at SDHC Intra Met Intra Met 200909_s_at RPLP2 201290_at SEC11A Intra Met Intra Met 208689_s_at RPN2 212887_at SEC23A Intra Met Intra Met 200095_x_at RPS10 203133_at SEC61B Intra Met Intra Met 200031_s_at RPS11 203484_at SEC61G Intra Met Intra Met 213377_x_at RPS12 208942_s_at SEC62 Intra Met Intra Met 200018_at RPS13 201916_s_at SEC63 Intra Met Intra Met 208645_s_at RPS14 202318_s_at SENP6 Intra Met

112 Probe Id Gene Symbol/Title Category Probe Id Gene Symbol/Title Category 200961_at SEPHS2 Intra Met 214988_s_at SON Intra Met 200778_s_at septin 2 Intra Met 213168_at SP3 Intra Met 213151_s_at septin 7 Intra Met 217927_at SPCS1 Intra Met 209669_s_at SERBP1 Intra Met 201240_s_at SPCS2 Intra Met 221471_at SERINC3 Intra Met 201997_s_at SPEN Intra Met 200971_s_at SERP1 Intra Met 215383_x_at SPG21 Intra Met 202656_s_at SERTAD2 Intra Met 204640_s_at SPOP Intra Met 200630_x_at SET Intra Met 212071_s_at SPTBN1 Intra Met 211185_s_at SF3B1 Intra Met 202277_at SPTLC1 Intra Met 201586_s_at SFPQ Intra Met 209218_at SQLE Intra Met 211784_s_at SFRS1 Intra Met 200007_at SRP14 Intra Met 200892_s_at SFRS10 Intra Met 208095_s_at SRP72 Intra Met 200686_s_at SFRS11 Intra Met 201225_s_at SRRM1 Intra Met 212721_at SFRS12 Intra Met 201138_s_at SSB Intra Met 200753_x_at SFRS2 Intra Met 200891_s_at SSR1 Intra Met 202899_s_at SFRS3 Intra Met 200652_at SSR2 Intra Met 203380_x_at SFRS5 Intra Met 201004_at SSR4 Intra Met 214141_x_at SFRS7 Intra Met 207040_s_at ST13 Intra Met 201698_s_at SFRS9 Intra Met 200887_s_at STAT1 Intra Met 210101_x_at SH3GLB1 Intra Met 207320_x_at STAU1 Intra Met 202276_at SHFM1 Intra Met 212572_at STK38L Intra Met 202981_x_at SIAH1 Diff/Dev 200870_at STRAP Intra Met 211115_x_at SIP1 Intra Met 212857_x_at SUB1 Intra Met 49327_at SIRT3 Intra Met 211069_s_at SUMO1 Intra Met 212896_at SKIV2L2 Intra Met 213881_x_at SUMO2 Intra Met 201920_at SLC20A1 Intra Met 215452_x_at SUMO2 /// SUMO4 Intra Met 210686_x_at SLC25A16 Intra Met 217815_at SUPT16H Intra Met 200030_s_at SLC25A3 Intra Met 217833_at SYNCRIP Intra Met 200657_at SLC25A5 Intra Met 200055_at TAF10 Intra Met 212085_at SLC25A6 Intra Met 200020_at TARDBP Intra Met 202614_at SLC30A9 Intra Met 201263_at TARS Intra Met 202088_at SLC39A6 Intra Met 200976_s_at TAX1BP1 Intra Met 212944_at SLC5A3 Diff/Dev 212052_s_at TBC1D9B Intra Met 212295_s_at SLC7A1 Intra Met 203667_at TBCA Intra Met 201321_s_at SMARCC2 Intra Met 
200804_at TEGT Diff/Dev 211988_at SMARCE1 Intra Met 212043_at TGOLN2 Intra Met 201589_at SMC1A Intra Met 218357_s_at TIMM8B Intra Met 202043_s_at SMS Intra Met 202011_at TJP1 Intra Met 209481_at SNRK Intra Met 208700_s_at TKT Intra Met 206055_s_at SNRPA1 Intra Met 202606_s_at TLK1 Intra Met 200826_at SNRPD2 Intra Met 201078_at TM9SF2 Intra Met 203316_s_at SNRPE Intra Met 219206_x_at TMBIM4 Intra Met 205644_s_at SNRPG Intra Met 212352_s_at TMED10 Intra Met 201575_at SNW1 Intra Met 200087_s_at TMED2 Intra Met 200067_x_at SNX3 Intra Met 202194_at TMED5 Intra Met

113 Probe Id Gene Symbol/Title Category Probe Id Gene Symbol/Title Category 211967_at TMEM123 Intra Met 200083_at USP22 Intra Met TMEM189-UBE2V1 Intra Met 221654_s_at USP3 Intra Met 201001_s_at /// UBE2V1 211800_s_at USP4 Intra Met 212622_at TMEM41B Intra Met 201099_at USP9X Diff/Dev 200620_at TMEM59 Intra Met 203614_at UTP14C Diff/Dev 200847_s_at TMEM66 Intra Met 208780_x_at VAPA Intra Met 214948_s_at TMF1 Intra Met 200931_s_at VCL Intra Met 217733_s_at TMSB10 Intra Met 211662_s_at VDAC2 Intra Met 207657_x_at TNPO1 Intra Met 208845_at VDAC3 Intra Met 200662_s_at TOMM20 Intra Met 202171_at VEZF1 Intra Met 202264_s_at TOMM40 Intra Met 201807_at VPS26A Intra Met 201519_at TOMM70A Intra Met 217727_x_at VPS35 Intra Met 208901_s_at TOP1 Intra Met 218171_at VPS4B Intra Met 200822_x_at TPI1 Intra Met 207628_s_at WBSCR22 Intra Met 200742_s_at TPP1 Diff/Dev 212050_at WIPF2 Intra Met 219030_at TPRKB Intra Met 200670_at XBP1 Intra Met 211943_x_at TPT1 Intra Met 208775_at XPO1 Intra Met 221952_x_at TRMT5 Intra Met 212160_at XPOT Intra Met 212852_s_at TROVE2 Intra Met 208643_s_at XRCC5 Intra Met 218132_s_at TSEN34 Intra Met 208627_s_at YBX1 Intra Met 201515_s_at TSN Intra Met 201351_s_at YME1L1 Intra Met 200972_at TSPAN3 Intra Met 212455_at YTHDC1 Intra Met 212928_at TSPYL4 Intra Met 217717_s_at YWHAB Intra Met 208073_x_at TTC3 Intra Met 200693_at YWHAQ Intra Met 214672_at TTLL5 Intra Met 200638_s_at YWHAZ Intra Met 201090_x_at TUBA1B Intra Met 200047_s_at YY1 Diff/Dev 209026_x_at TUBB Diff/Dev 218263_s_at ZBED5 Intra Met 208864_s_at TXN Intra Met 205788_s_at ZC3H11A Intra Met 208959_s_at TXNDC4 Intra Met 201857_at ZFR Diff/Dev 218381_s_at U2AF2 Intra Met 202939_at ZMPSTE24 Intra Met 209340_at UAP1 Intra Met 214715_x_at ZNF160 Intra Met 221700_s_at UBA52 Diff/Dev 200829_x_at ZNF207 Intra Met 202333_s_at UBE2B Diff/Dev 218059_at ZNF706 Intra Met 200667_at UBE2D3 Intra Met 212544_at ZNHIT3 Intra Met 209141_at UBE2G1 Intra Met 200682_s_at UBE2L3 Intra Met 201649_at UBE2L6 Intra 
Met 201524_x_at UBE2N Intra Met 217978_s_at UBE2Q1 Intra Met 209096_at UBE2V2 Intra Met 218190_s_at UCRC Intra Met 205480_s_at UGP2 Intra Met 214323_s_at UPF3A Intra Met 205849_s_at UQCRB Intra Met 208909_at UQCRFS1 Intra Met 201568_at UQCRQ Intra Met

Appendix Table 7B. Functional Classification of “Tissue Only” genes

Intra Met = Intracellular Metabolic Function
Diff/Dev = Differentiated and/or Developmental Function

Probe Id Gene Symbol Category Probe Id Gene Symbol Category 206466_at ACSBG1 Intra Met 207406_at CYP7A1 Diff/Dev 205997_at ADAM28 Diff/Dev 208335_s_at DARC Diff/Dev 209612_s_at ADH1B Intra Met 209782_s_at DBP Diff/Dev 207293_s_at AGTR2 Diff/Dev 207814_at DEFA6 Diff/Dev 204941_s_at ALDH3B2 Intra Met 214027_x_at DES /// FAM48A Diff/Dev 204704_s_at ALDOB Diff/Dev 208175_s_at DMP1 Diff/Dev 205639_at AOAH Diff/Dev 206065_s_at DPYS Intra Met 205146_x_at APBA3 Intra Met 206642_at DSG1 Diff/Dev 213592_at APLNR Diff/Dev 222314_x_at EGO Diff/Dev 220023_at APOB48R Diff/Dev 219436_s_at EMCN Diff/Dev 1555076_at ARHGAP25 Intra Met 209962_at EPOR Diff/Dev 219087_at ASPN Intra Met 221056_x_at EPS15L1 Intra Met 203295_s_at ATP1A2 Diff/Dev 211626_x_at ERG Diff/Dev 205638_at BAI3 Diff/Dev 205225_at ESR1 Diff/Dev 206119_at BHMT Intra Met 215851_at EVI1 Diff/Dev 219191_s_at BIN2 Intra Met 210992_x_at FCGR2C Diff/Dev 1554575_a_at BPNT1 Diff/Dev FCGR3A /// Diff/Dev 209642_at BUB1 Intra Met 204006_s_at FCGR3B 218232_at C1QA Diff/Dev 205237_at FCN1 Diff/Dev 210168_at C6 Diff/Dev 205866_at FCN3 Diff/Dev Diff/Dev 220414_at CALML5 1555191_a_at FHL5 Intra Met 202965_s_at CAPN6 Intra Met 204829_s_at FOLR2 Intra Met 208063_s_at CAPN9 Intra Met 214560_at FPR3 Diff/Dev 207317_s_at CASQ2 Diff/Dev 205285_s_at FYB Diff/Dev 201432_at CAT Intra Met 203397_s_at GALNT3 Intra Met 210133_at CCL11 Diff/Dev 204904_at GJA4 Diff/Dev 206407_s_at CCL13 Diff/Dev 221415_s_at GJA9 /// MYCBP Diff/Dev 205392_s_at CCL14 /// CCL15 Diff/Dev 206930_at GLYAT Diff/Dev 204606_at CCL21 Diff/Dev Diff/Dev CCR5 /// 205495_s_at GNLY Diff/Dev 206991_s_at LOC727797 206681_x_at GP2 Intra Met 203645_s_at CD163 Diff/Dev 1563034_at GPD1 Diff/Dev 206206_at CD180 Diff/Dev 221288_at GPR22 Diff/Dev 206517_at CDH16 Diff/Dev 207003_at GUCA2A Diff/Dev 208168_s_at CHIT1 Diff/Dev 210321_at GZMH Intra Met 206149_at CHP2 Intra Met 206666_at GZMK Intra Met 221295_at CIDEA Intra Met 203902_at HEPH Intra Met 205101_at CIITA Diff/Dev 1556426_at HEXA 
Diff/Dev CLCNKA /// 208569_at HIST1H2AB Intra Met 207047_s_at CLCNKB Intra Met 209728_at HLA-DRB4 Diff/Dev 219947_at CLEC4A Diff/Dev 207577_at HTR4 Diff/Dev 219890_at CLEC5A Diff/Dev 210439_at ICOS Diff/Dev 206244_at CR1 Diff/Dev 1553297_a_at CSF3R Intra Met 221331_x_at CTLA4 Diff/Dev 205242_at CXCL13 Diff/Dev

115 Probe Id Gene Symbol Category IGH@ /// IGHA1 /// IGHA2 /// IGHG1 /// IGHG2 /// IGHG3 /// IGHM /// IGHV4-31 /// LOC100126583 /// LOC100133739 /// Diff/Dev 217281_x_at LOC652494 211649_x_at IGH@ /// IGHA1 /// IGHA2 /// IGHG1 /// IGHM /// LOC642131 Diff/Dev 217022_s_at IGH@ /// IGHA1 /// IGHA2 /// IGHV3OR16-13 /// LOC100126583 Diff/Dev 217198_x_at IGH@ /// IGHD /// IGHG1 /// LOC100134331 /// LOC652128 Diff/Dev IGHA1 /// IGHD /// IGHG1 /// IGHG3 /// IGHM /// IGHV4-31 /// Diff/Dev 211650_x_at IGHV@ /// LOC100126583 216510_x_at IGHA1 /// IGHD /// IGHG1 /// IGHM /// IGHV4-31 /// IGHV@ Diff/Dev IGHA1 /// IGHG1 /// IGHG3 /// IGHM /// IGHV4-31 /// IGHV@ /// Diff/Dev 211868_x_at LOC100133739 217360_x_at IGHA1 /// IGHG1 /// IGHG3 /// IGHM /// LOC652494 Diff/Dev 217084_at IGHA1 /// IGHG1 /// IGHM /// IGHV4-31 /// IGHV@ Diff/Dev 211640_x_at IGHA1 /// IGHG1 /// LOC100133862 Diff/Dev 1555710_at IGHG1 Diff/Dev 211634_x_at IGHM /// LOC100133862 Diff/Dev 215949_x_at IGHM /// LOC652494 Intra Met 216829_at IGK@ /// IGKC /// IGKV1-5 /// LOC647506 /// LOC652694 Diff/Dev 216576_x_at IGKC /// IGKV1-5 /// LOC647506 /// LOC652694 Diff/Dev 209138_x_at IGL@ Diff/Dev 216365_x_at IGL@ /// IGLV3-19 Diff/Dev 216853_x_at IGLV3-19 Diff/Dev 208402_at IL17A Diff/Dev 207008_at IL8RB Diff/Dev 1563003_at ITGAX Diff/Dev 1554710_at KCNMB1 Diff/Dev 214470_at KLRB1 Diff/Dev 207409_at LECT2 Diff/Dev 207240_s_at LHCGR Diff/Dev 210784_x_at LILRA6 /// LILRB3 Diff/Dev 207697_x_at LILRB2 Diff/Dev 210225_x_at LILRB3 Diff/Dev 215173_at LRRC50 Intra Met 205154_at LRRN2 Intra Met 210629_x_at LST1 Diff/Dev 206480_at LTC4S Diff/Dev 202018_s_at LTF Diff/Dev 219059_s_at LYVE1 Diff/Dev 1562440_at MAP3K13 Diff/Dev 205819_at MARCO Intra Met 222348_at MAST4 Intra Met 221862_at MIER2 Intra Met 1555728_a_at MS4A4A Intra Met 219666_at MS4A6A Intra Met


Probe Id Gene Symbol Category Probe Id Gene Symbol Category 208422_at MSR1 Diff/Dev 206568_at TNP1 Diff/Dev 219796_s_at MUPCDH Intra Met 207741_x_at TPSAB1 Intra Met 207424_at MYF5 Diff/Dev TPSAB1 /// 205951_at MYH1 Diff/Dev 205683_x_at TPSB2 Intra Met 1568760_at MYH11 Diff/Dev 207134_x_at TPSB2 Intra Met Diff/Dev 206797_at NAT2 Intra Met 211902_x_at TRA@ Diff/Dev 207075_at NLRP3 Diff/Dev 216191_s_at TRA@ /// TRD@ Diff/Dev 206418_at NOX1 Intra Met 213830_at TRD@ Diff/Dev 209798_at NPAT Intra Met 219725_at TREM2 207202_s_at NR1I2 Diff/Dev 202341_s_at TRIM2 Intra Met 207152_at NTRK2 Diff/Dev 220507_s_at UPB1 Intra Met Diff/Dev 205907_s_at OMD Diff/Dev 204787_at VSIG4 Diff/Dev 220005_at P2RY13 Diff/Dev 202112_at VWF 207838_x_at PBXIP1 Diff/Dev 205714_s_at ZMYND10 Intra Met 219656_at PCDH12 Diff/Dev 216453_at ZNF287 Intra Met 203242_s_at PDLIM5 Intra Met 1559921_at PECAM1 Diff/Dev PGA3 /// PGA4 /// Diff/Dev 213265_at PGA5 204213_at PIGR Diff/Dev 207415_at PLA2R1 Intra Met 205093_at PLEKHA6 Intra Met 212821_at PLEKHG3 Intra Met 205913_at PLIN Intra Met 221529_s_at PLVAP Intra Met 1555777_at POSTN Diff/Dev 206007_at PRG4 Intra Met 208169_s_at PTGER3 Diff/Dev 205577_at PYGM Diff/Dev 205326_at RAMP3 Diff/Dev 1565358_at RARA Intra Met 203403_s_at RNF6 Intra Met 216030_s_at SEMG2 Diff/Dev 206664_at SI Intra Met 219519_s_at SIGLEC1 Diff/Dev 1555116_s_at SLC11A1 Diff/Dev 1560884_at SLC17A1 Intra Met 207429_at SLC22A2 Diff/Dev 220722_s_at SLC5A7 Diff/Dev 1555319_at STAB1 Diff/Dev 206835_at STATH Diff/Dev 208854_s_at STK24 Intra Met 209306_s_at SWAP70 Diff/Dev 1555565_s_at TAPBP Diff/Dev 1558452_at TMEM144 Intra Met 205611_at TNFSF12 Diff/Dev

Appendix Table 7C. Functional Classification of “Variably Expressed” genes

Intra Met = Intracellular Metabolic Function

Diff/Dev = Differentiated and/or Developmental Function

Probe Id Gene Symbol Category Probe Id Gene Symbol Category 219962_at ACE2 Diff/Dev 206704_at CLCN5 Intra Met 207990_x_at ACRV1 Diff/Dev 213317_at CLIC5 Diff/Dev 202952_s_at ADAM12 Intra Met 219302_s_at CNTNAP2 Diff/Dev 208268_at ADAM28 Diff/Dev 205832_at CPA4 Intra Met 205481_at ADORA1 Diff/Dev 210688_s_at CPT1A Diff/Dev Diff/Dev 219977_at AIPL1 205489_at CRYM Diff/Dev 211298_s_at ALB Diff/Dev 206224_at CST1 Intra Met 205208_at ALDH1L1 Intra Met 214974_x_at CXCL5 Diff/Dev 205257_s_at AMPH Diff/Dev 206336_at CXCL6 Diff/Dev 205609_at ANGPT1 Diff/Dev 210816_s_at CYB561 Intra Met 208353_x_at ANK1 Intra Met 207498_s_at CYP2D6 Diff/Dev 203299_s_at AP1S2 Intra Met DAZ1 /// DAZ2 /// Diff/Dev Diff/Dev 209870_s_at APBA2 208281_x_at DAZ3 /// DAZ4 207158_at APOBEC1 Intra Met 205818_at DBC1 Intra Met 202208_s_at ARL4C Intra Met 205311_at DDC Intra Met 202986_at ARNT2 Diff/Dev 213631_x_at DHODH Intra Met 214982_at ASCC3L1 Intra Met 208386_x_at DMC1 Diff/Dev 206684_s_at ATF7 Intra Met 215135_at DNPEP Intra Met 219902_at BHMT2 Intra Met Diff/Dev 210201_x_at BIN1 Diff/Dev 215982_s_at DOM3Z 202357_s_at C2 /// CFB Diff/Dev 204751_x_at DSC2 Intra Met 211592_s_at CACNA1C Diff/Dev 205741_s_at DTNA Diff/Dev 210641_at CAPN9 Intra Met 204014_at DUSP4 Intra Met 207686_s_at CASP8 Intra Met 215082_at ELOVL5 Intra Met 200951_s_at CCND2 Intra Met 201340_s_at ENC1 Diff/Dev 210436_at CCT8 Intra Met 220977_x_at EPB41L5 Diff/Dev 206508_at CD70 Intra Met 204505_s_at EPB49 Diff/Dev 204510_at CDC7 Intra Met 204718_at EPHB6 Intra Met 207149_at CDH12 Intra Met 217053_x_at ETV1 Intra Met 206898_at CDH19 Intra Met 205585_at ETV6 Diff/Dev 203440_at CDH2 Intra Met 221680_s_at ETV7 Diff/Dev 210601_at CDH6 Diff/Dev 204503_at EVPL Diff/Dev 219534_x_at CDKN1C Diff/Dev 211051_s_at EXTL3 Intra Met 209832_s_at CDT1 Intra Met 219189_at FBXL6 Intra Met 201884_at CEACAM5 Diff/Dev 208240_s_at FGF1 Diff/Dev 211657_at CEACAM6 Diff/Dev 217310_s_at FOXJ3 Intra Met 204039_at CEBPA Diff/Dev 206708_at FOXN2 Intra Met 
205382_s_at CFD Diff/Dev 219764_at FZD10 Diff/Dev 214486_x_at CFLAR Intra Met 205850_s_at GABRB3 Intra Met 204697_s_at CHGA Diff/Dev Diff/Dev 204591_at CHL1 Diff/Dev 208138_at GAST 207486_x_at CHN2 Intra Met 219508_at GCNT3 Intra Met 209763_at CHRDL1 Diff/Dev 210627_s_at GCS1 Intra Met 205007_s_at CIB2 Intra Met 218114_at GGA1 Intra Met

118 Probe Id Gene Symbol Category Probe Id Gene Symbol Category 208915_s_at GGA2 Intra Met 204891_s_at LCK Diff/Dev 209411_s_at GGA3 Intra Met 210948_s_at LEF1 Diff/Dev 204222_s_at GLIPR1 Intra Met 213880_at LGR5 Diff/Dev 205279_s_at GLRB Diff/Dev 214213_x_at LMNA Intra Met 206896_s_at GNG7 Diff/Dev LOC100130648 /// 204983_s_at GPC4 Diff/Dev SSX1 /// SSX10 /// 220773_s_at GPHN Intra Met SSX2 /// SSX2B /// SSX3 /// SSX5 /// 209167_at GPM6B Diff/Dev 215881_x_at SSX7 /// SSX9 Intra Met 206971_at GPR161 Diff/Dev 206723_s_at LPAR2 Diff/Dev 210264_at GPR35 Diff/Dev 215063_x_at LRRC40 Intra Met 219936_s_at GPR87 Diff/Dev 211019_s_at LSS Diff/Dev 217008_s_at GRM7 Diff/Dev 210128_s_at LTB4R Diff/Dev 202947_s_at GYPC Diff/Dev 203514_at MAP3K3 Intra Met 217937_s_at HDAC7 Diff/Dev 209086_x_at MCAM Diff/Dev 211327_x_at HFE Diff/Dev Diff/Dev 214616_at HIST1H3E Intra Met 205375_at MDFI 219976_at HOOK1 Diff/Dev 203003_at MEF2D Intra Met 218959_at HOXC10 Diff/Dev 213764_s_at MFAP5 Intra Met 205975_s_at HOXD1 Diff/Dev 206560_s_at MIA Intra Met 214604_at HOXD11 Diff/Dev 209241_x_at MINK1 Diff/Dev 219984_s_at HRASLS Intra Met 212078_s_at MLL Diff/Dev 221169_s_at HRH4 Diff/Dev 204259_at MMP7 Intra Met 205404_at HSD11B1 Diff/Dev 214614_at MNX1 Diff/Dev 215712_s_at IGFALS Diff/Dev 205413_at MPPED2 Diff/Dev 207160_at IL12A Diff/Dev 219786_at MTL5 Diff/Dev 210118_s_at IL1A Diff/Dev 217204_at MTRF1L Intra Met 205067_at IL1B Diff/Dev 204798_at MYB Intra Met 203828_s_at IL32 Diff/Dev 222153_at MYEF2 Intra Met 204989_s_at ITGB4 Intra Met 215331_at MYH15 Diff/Dev 208083_s_at ITGB6 Diff/Dev 217274_x_at MYL4 Diff/Dev 209784_s_at JAG2 Diff/Dev 209949_at NCF2 Intra Met 220776_at KCNJ14 Diff/Dev 216114_at NCKIPSD Intra Met 211806_s_at KCNJ15 Intra Met 216882_s_at NEBL Intra Met 205952_at KCNK3 Diff/Dev 213298_at NFIC Intra Met 204401_at KCNN4 Intra Met 204621_s_at NR4A2 Diff/Dev 214849_at KCTD20 Intra Met 210841_s_at NRP2 Diff/Dev 201728_s_at KIAA0100 Intra Met 213131_at OLFM1 Diff/Dev 207029_at 
KITLG Diff/Dev 201981_at PAPPA Diff/Dev 214591_at KLHL4 Intra Met 213534_s_at PASK Intra Met 204734_at KRT15 Diff/Dev 206935_at PCDH8 Intra Met 205157_s_at KRT17 Diff/Dev 210650_s_at PCLO Intra Met 213680_at KRT6B Diff/Dev 219295_s_at PCOLCE2 Intra Met 204385_at KYNU Intra Met 205549_at PCP4 Diff/Dev 204584_at L1CAM Diff/Dev 215671_at PDE4B Diff/Dev 209270_at LAMB3 Diff/Dev 210837_s_at PDE4D Diff/Dev

119 Probe Id Gene Symbol Category Probe Id Gene Symbol Category 204200_s_at PDGFB Diff/Dev 207096_at SAA4 Diff/Dev 205226_at PDGFRL Diff/Dev 213257_at SARM1 Diff/Dev 206347_at PDK3 Intra Met 215754_at SCARB2 Intra Met 210170_at PDLIM3 Intra Met 205508_at SCN1B Intra Met 202928_s_at PHF1 Intra Met 210432_s_at SCN3A Intra Met Diff/Dev SERF1A /// 41469_at PI3 Diff/Dev 204746_s_at PICK1 Diff/Dev 219982_s_at SERF1B 204144_s_at PIGQ Intra Met 202376_at SERPINA3 Diff/Dev 219788_at PILRA Diff/Dev 211361_s_at SERPINB13 Diff/Dev 204269_at PIM2 Diff/Dev 220357_s_at SGK2 Diff/Dev 210969_at PKN2 Intra Met 207351_s_at SH2D2A Diff/Dev 206178_at PLA2G5 Intra Met 205751_at SH3GL2 Diff/Dev 207415_at PLA2R1 Intra Met 210135_s_at SHOX2 Diff/Dev 202924_s_at PLAGL2 Intra Met 211349_at SLC15A1 Intra Met 217610_at POLR2J4 Intra Met 202800_at SLC1A3 Diff/Dev 205096_at POM121 Intra Met 211842_s_at SLC24A1 Diff/Dev 206063_x_at PPIL2 Intra Met 217289_s_at SLC37A4 Intra Met 218849_s_at PPP1R13L Intra Met 205799_s_at SLC3A1 Intra Met 204086_at PRAME Diff/Dev 204394_at SLC43A1 Intra Met 216638_s_at PRLR Diff/Dev 206058_at SLC6A12 Diff/Dev 205515_at PRSS12 Intra Met 210542_s_at SLCO3A1 Intra Met 207638_at PRSS7 Intra Met 205396_at SMAD3 Intra Met 211741_x_at PSG3 Diff/Dev 217189_s_at SMG7 Intra Met 208776_at PSMD11 Intra Met 209420_s_at SMPD1 Diff/Dev 210367_s_at PTGES Intra Met 207827_x_at SNCA Diff/Dev 205911_at PTHR1 Diff/Dev 206359_at SOCS3 Diff/Dev 206482_at PTK6 Intra Met 205236_x_at SOD3 Intra Met 219654_at PTPLA Diff/Dev 219257_s_at SPHK1 Diff/Dev 205846_at PTPRB Intra Met 214549_x_at SPRR1A Diff/Dev 213362_at PTPRD Intra Met 200917_s_at SRPR Intra Met 210675_s_at PTPRR Diff/Dev 214597_at SSTR2 Diff/Dev 218186_at RAB25 Intra Met 206626_x_at SSX1 Intra Met 204680_s_at RAPGEF5 Diff/Dev 210394_x_at SSX4 /// SSX4B Intra Met 205080_at RARB Diff/Dev 203759_at ST3GAL4 Intra Met 221440_s_at RBBP9 Diff/Dev 205743_at STAC Intra Met 205205_at RELB Diff/Dev 205170_at STAT2 Diff/Dev 205879_x_at RET 
Diff/Dev 211085_s_at STK4 Intra Met 210751_s_at RGN Intra Met 215518_at STXBP5L Intra Met 206290_s_at RGS7 Diff/Dev 211385_x_at SULT1A2 Diff/Dev 211515_s_at RIPK5 Intra Met 205342_s_at SULT1C2 Intra Met 211753_s_at RLN1 Diff/Dev 205759_s_at SULT2B1 Diff/Dev 215040_at RNASEH2B Intra Met 214954_at SUSD5 Intra Met 205529_s_at RUNX1T1 Intra Met 216272_x_at SYDE1 Intra Met 204198_s_at RUNX3 Diff/Dev 207540_s_at SYK Diff/Dev 204351_at S100P Diff/Dev 206552_s_at TAC1 Diff/Dev

Probe Id      Gene Symbol  Category
203938_s_at   TAF1C  Intra Met
204878_s_at   TAOK2  Diff/Dev
207689_at     TBX10  Diff/Dev
205513_at     TCN1  Intra Met
219735_s_at   TFCP2L1  Diff/Dev
215455_at     TIMELESS  Diff/Dev
206271_at     TLR3  Diff/Dev
206025_s_at   TNFAIP6  Diff/Dev
218368_s_at   TNFRSF12A  Diff/Dev
219423_x_at   TNFRSF25  Diff/Dev
202807_s_at   TOM1  Intra Met
202479_s_at   TRIB2  Intra Met
206911_at     TRIM25  Intra Met
202504_at     TRIM29  Intra Met
219736_at     TRIM36  Intra Met
209859_at     TRIM9  Intra Met
206425_s_at   TRPC3  Intra Met
220558_x_at   TSPAN32  Intra Met
210614_at     TTPA  Diff/Dev
204858_s_at   TYMP  Diff/Dev
201387_s_at   UCHL1  Intra Met
204532_x_at   UGT1A1 /// UGT1A10 /// UGT1A4 /// UGT1A6 /// UGT1A8 /// UGT1A9  Diff/Dev
208358_s_at   UGT8  Diff/Dev
203965_at     USP20  Intra Met
206219_s_at   VAV1  Diff/Dev
203797_at     VSNL1  Intra Met
213081_at     ZBTB22  Diff/Dev
203958_s_at   ZBTB40  Intra Met
207117_at     ZNF117  Intra Met
214823_at     ZNF204  Intra Met
220748_s_at   ZNF580  Intra Met

Appendix 8. List of 140 human genes (of our 778 universally expressed genes) with mouse embryonic lethal gene orthologs

Gene Symbol  Gene Name  Chromosomal Location
ACLY         ATP citrate lyase  17q12-q21
ACTB         actin, beta  7p15-p12
ACTR3        ARP3 actin-related protein 3 homolog (yeast)  2q14.1
ACVR1B       activin A receptor, type IB  12q13
ADAR         adenosine deaminase, RNA-specific  1q21.1-q21.2
ADNP         activity-dependent neuroprotector homeobox  20q13.13
AMD1         adenosylmethionine decarboxylase 1  6q21-q22
APLP2        amyloid beta (A4) precursor-like protein 2  11q23-q25|11q24
ARHGAP1      Rho GTPase activating protein 1  11p12-q12
ARPC3        actin related protein 2/3 complex, subunit 3, 21kDa  12q24.11
ARS2         arsenate resistance protein 2  7q21
ASAH1        N-acylsphingosine amidohydrolase (acid ceramidase) 1  8p22-p21.3
ATP1A1       ATPase, Na+/K+ transporting, alpha 1 polypeptide  1p21
ATP2A2       ATPase, Ca++ transporting, cardiac muscle, slow twitch 2  12q23-q24.1
B2M          beta-2-microglobulin  15q21-q22.2
BAT3         HLA-B associated transcript 3  6p21.3
BRD4         bromodomain containing 4  19p13.1
BTF3         basic transcription factor 3  5q13.2
BUB3         BUB3 budding uninhibited by benzimidazoles 3 homolog (yeast)  10q26
CALR         calreticulin  19p13.3-p13.2
CDK2         cyclin-dependent kinase 2  12q13
CDK4         cyclin-dependent kinase 4  12q14
CDKN1B       cyclin-dependent kinase inhibitor 1B (p27, Kip1)  12p13.1-p12
CFL1         cofilin 1 (non-muscle)  11q13
CNBP         CCHC-type zinc finger, nucleic acid binding protein  3q21
COPS8        COP9 constitutive photomorphogenic homolog subunit 8 (Arabidopsis)  2q37.3
CUGBP1       CUG triplet repeat, RNA binding protein 1  11p11
CUL3         cullin 3  2q36.2
CUX1         cut-like homeobox 1  7q22.1
CYCS         cytochrome c, somatic  7p15.2
DDX17        DEAD (Asp-Glu-Ala-Asp) box polypeptide 17  22q13.1
DDX5         DEAD (Asp-Glu-Ala-Asp) box polypeptide 5  17q21
DICER1       dicer 1, ribonuclease type III  14q32.13
DKC1         dyskeratosis congenita 1, dyskerin  Xq28
DLD          dihydrolipoamide dehydrogenase  7q31-q32
DLG1         discs, large homolog 1 (Drosophila)  3q29
DNAJB6       DnaJ (Hsp40) homolog, subfamily B, member 6  7q36.3
DYRK1A       dual-specificity tyrosine-(Y)-phosphorylation regulated kinase 1A  21q22.13
EIF2S1       eukaryotic translation initiation factor 2, subunit 1 alpha, 35kDa  14q23.3
EIF4G2       eukaryotic translation initiation factor 4 gamma, 2  11p15
ENO1         enolase 1, (alpha)  1p36.3-p36.2
FBL          fibrillarin  19q13.1
FDFT1        farnesyl-diphosphate farnesyltransferase 1  8p23.1-p22
FKBP1A       FK506 binding protein 1A, 12kDa  20p13
FTH1         ferritin, heavy polypeptide 1  11p13
GALNT1       UDP-N-acetyl-alpha-D-galactosamine:polypeptide N-acetylgalactosaminyltransferase 1 (GalNAc-T1)  18q12.1
GAPDH        glyceraldehyde-3-phosphate dehydrogenase  12p13
GCLC         glutamate-cysteine ligase, catalytic subunit  6p12
GLS          glutaminase  2q32-q34
GNAS         GNAS complex locus  20q13.3
GPX4         glutathione peroxidase 4 (phospholipid hydroperoxidase)  19p13.3
H2AFZ        H2A histone family, member Z  4q24
HDAC2        histone deacetylase 2  6q21
HIF1A        hypoxia-inducible factor 1, alpha subunit (basic helix-loop-helix transcription factor)  14q21-q24
HMGB1        high-mobility group box 1  13q12
HMGN1        high-mobility group nucleosome binding domain 1  21q22.3|21q22.2
HNRNPC       heterogeneous nuclear ribonucleoprotein C (C1/C2)  14q11.2
HNRNPU       heterogeneous nuclear ribonucleoprotein U (scaffold attachment factor A)  1q44
HPRT1        hypoxanthine phosphoribosyltransferase 1  Xq26.1
HSP90AB1     heat shock protein 90kDa alpha (cytosolic), class B member 1  6p12
HSPA5        heat shock 70kDa protein 5 (glucose-regulated protein, 78kDa)  9q33-q34.1
IFT88        intraflagellar transport 88 homolog (Chlamydomonas)  13q12.1
ITGB1        integrin, beta 1 (fibronectin receptor, beta polypeptide, antigen CD29 includes MDF2, MSK12)  10p11.2
JARID2       jumonji, AT rich interactive domain 2  6p24-p23
KPNB1        karyopherin (importin) beta 1  17q21.32
LAMP1        lysosomal-associated membrane protein 1  13q34
LDHA         lactate dehydrogenase A  11p15.4
MACF1        microtubule-actin crosslinking factor 1  1p32-p31
MAX          MYC associated factor X  14q23
MCL1         myeloid cell leukemia sequence 1 (BCL2-related)  1q21
MORF4L1      mortality factor 4 like 1  15q24
MORF4L2      mortality factor 4 like 2  Xq22
NAMPT        nicotinamide phosphoribosyltransferase  7q22.2
NBN          nibrin  8q21
NCK1         NCK adaptor protein 1  3q21
NCOR2        nuclear receptor co-repressor 2  12q24
NFE2L1       nuclear factor (erythroid-derived 2)-like 1  17q21.3
NPM1         nucleophosmin (nucleolar phosphoprotein B23, numatrin)  5q35
NUP50        nucleoporin 50kDa  22q13.31
OPA1         optic atrophy 1 (autosomal dominant)  3q28-q29|3q28-q29
PDIA3        protein disulfide isomerase family A, member 3  15q15
PFN1         profilin 1  17p13.3
PHBP         prohibitin 2  12p13
PPIA         peptidylprolyl isomerase A (cyclophilin A)  7p13
PPME1        protein phosphatase methylesterase 1  11q13.4
PRKAR1A      protein kinase, cAMP-dependent, regulatory, type I, alpha (tissue specific extinguisher 1)  17q23-q24
PLOD3        procollagen-lysine, 2-oxoglutarate 5-dioxygenase 3  7q22
PRKCI        protein kinase C, iota  3q26.3
PSAP         prosaposin  10q21-q22
PSMC1        proteasome (prosome, macropain) 26S subunit, ATPase, 1  14q32.11
PSMD4        proteasome (prosome, macropain) 26S subunit, non-ATPase, 4  1q21.2
PTDSS1       phosphatidylserine synthase 1  8q22
PTGES3       prostaglandin E synthase 3 (cytosolic)  12q13.3|12
RAC1         Ras-related C3 botulinum toxin substrate 1 (rho family, small GTP binding protein Rac1)  7p22
RACGAP1      Rac GTPase activating protein 1  12q13.13
RAE1         RAE1 RNA export 1 homolog (S. pombe)  20q13.31
RANBP2       RAN binding protein 2  2q12.3
RBM39        RNA binding motif protein 39  20q11.22
RBPJ         recombination signal binding protein for immunoglobulin kappa J region  4p15.2
RPL24        ribosomal protein L24  3q12
RPS19        ribosomal protein S19  19q13.2
RPS20        ribosomal protein S20  8q12
RTN4         reticulon 4  2p16.3
RYK          RYK receptor-like tyrosine kinase  3q22
SF3B1        splicing factor 3b, subunit 1, 155kDa  2q33.1
SFRS1        splicing factor, arginine/serine-rich 1  17q21.3-q22
SFRS2        splicing factor, arginine/serine-rich 2  17q25.2
SFRS3        splicing factor, arginine/serine-rich 3  6p21
SIAH1        seven in absentia homolog 1 (Drosophila)  16q12
SIP1         survival of motor neuron protein interacting protein 1  14q13
SLC5A3       solute carrier family 5 (sodium/myo-inositol cotransporter), member 3  21q22.12
SLC7A1       solute carrier family 7 (cationic amino acid transporter, y+ system), member 1  13q12-q14
SP3          Sp3 transcription factor  2q31
SPEN         spen homolog, transcriptional regulator (Drosophila)  1p36.33-p36.11
SPTBN1       spectrin, beta, non-erythrocytic 1  2p21
SPTLC1       serine palmitoyltransferase, long chain base subunit 1  9q22.2
SSB          Sjogren syndrome antigen B (autoantigen La)  2q31.1
SSR1         signal sequence receptor, alpha  6p24.3
STRAP        serine/threonine kinase receptor associated protein  12p12.3
SUMO1        SMT3 suppressor of mif two 3 homolog 1 (S. cerevisiae)  2q33
TAF10        TAF10 RNA polymerase II, TATA box binding protein (TBP)-associated factor, 30kDa  11p15.3
TAX1BP1      Tax1 (human T-cell leukemia virus type I) binding protein 1  7p15
TKT          transketolase  3p14.3
TMED10       transmembrane emp24-like trafficking protein 10 (yeast)  14q24.3
TOP1         topoisomerase (DNA) I  20q12-q13.1
TPI1         triosephosphate isomerase 1  12p13
TPT1         tumor protein, translationally-controlled 1  13q12-q14
TXN          thioredoxin  9q31
UBE2B        ubiquitin-conjugating enzyme E2B (RAD6 homolog)  5q23-q31
UBE2L3       ubiquitin-conjugating enzyme E2L 3  22q11.21
UBE2N        ubiquitin-conjugating enzyme E2N (UBC13 homolog, yeast)  12q22
VCL          vinculin  10q22.2
VEZF1        vascular endothelial zinc finger 1  17q22
VPS26A       vacuolar protein sorting 26 homolog A (S. pombe)  10q21.1
XBP1         X-box binding protein 1  22q12.1|22q12
XRCC5        X-ray repair complementing defective repair in Chinese hamster cells 5 (double-strand-break rejoining)  2q35
YBX1         Y box binding protein 1  1p34
YWHAQ        tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, theta polypeptide  2p25.1
YY1          YY1 transcription factor  14q
ZFR          zinc finger RNA binding protein  5p13.3

Appendix 9. Single Nucleotide Polymorphism (SNP) Map for DU-145

[Figure: SNP map for DU-145, with panels labeled for chromosomes 1 through 22.]

Appendix 10. Detailed explanation for discrepancies in the end intervals of the comparison between our 12 cell lines and the 198 GEO cell lines (Figure 8)

The end-interval discrepancies can be explained by assay error. This comparison looks at approximately 56,000 probes in 210 samples (about 11.8 million probe measurements), so an error in 10,000 probes (7,000 + 3,000 from Figure 8) is only about 0.1% error. Apart from technical errors in the assay, one reason for this error is that in some samples more or fewer genes may have been expressed than we calculated, which would skew the graph in one direction or the other. It would also mean that some genes should be represented in an adjacent interval. For example, interval 1 represents all the genes that were expressed in 1 to 18 samples. If, however, one of those genes was actually expressed in 20 samples, it should be represented in interval 2 (19-36 samples). This possible source of error is less significant in the middle intervals (2-10), because misassignments between neighboring intervals can balance out. The end intervals, however, have a neighbor on only one side and so cannot obtain that balance. We saw that 1,508 genes were commonly expressed in all 210 cell lines, indicating that these genes are essential for the survival of all cell lines studied (Figure 8).
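The arithmetic and the interval assignment described above can be checked with a short sketch. It assumes that the 0.1% figure is taken relative to all probe-by-sample measurements (56,000 × 210), since that is the reading that reproduces the quoted value, and that every interval is 18 samples wide, as interval 1 (1-18) and interval 2 (19-36) suggest. The names used here are illustrative only and are not taken from the original analysis.

```python
# Figures quoted in Appendix 10.
PROBES = 56_000               # approximate number of probes per array
SAMPLES = 210                 # 12 cell lines + 198 GEO cell lines
DISCREPANT = 7_000 + 3_000    # end-interval discrepancies read from Figure 8

# Assumption: the error is expressed relative to every probe-by-sample measurement.
TOTAL_MEASUREMENTS = PROBES * SAMPLES
ERROR_PERCENT = 100 * DISCREPANT / TOTAL_MEASUREMENTS

# Assumption: every interval spans 18 samples (1-18, 19-36, ...).
INTERVAL_WIDTH = 18


def interval_for(n_samples: int, width: int = INTERVAL_WIDTH) -> int:
    """Return the 1-based interval for a gene expressed in n_samples samples."""
    if n_samples < 1:
        raise ValueError("a gene must be expressed in at least one sample")
    return (n_samples - 1) // width + 1


if __name__ == "__main__":
    print(f"total measurements: {TOTAL_MEASUREMENTS:,}")
    print(f"error: {ERROR_PERCENT:.3f}%")
    print(f"gene expressed in 18 samples -> interval {interval_for(18)}")
    print(f"gene expressed in 20 samples -> interval {interval_for(20)}")
```

Under these assumptions the discrepancy comes to roughly 0.085% of all measurements, consistent with the approximately 0.1% quoted above, and a gene expressed in 20 samples rather than 18 moves from interval 1 to interval 2.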

Abstract

Aneuploidy occurs in many cancers and is a sign of genetic instability in the genome. We used two different techniques, single nucleotide polymorphisms (SNPs) and gene expression arrays, to investigate two aspects of aneuploidy: the pattern of genetic instability causing aneuploidy, and the reason aneuploidy occurs. SNP technology was employed to identify areas of loss of heterozygosity (LOH) in a prostate cancer cell line (DU-145) as a means of examining genetic instability. Using a novel method of analysis, we were able to eliminate noise in both normal and cancer samples and to visualize large quantities of data graphically. We observed several areas of genetic instability that can be linked to the cancer phenotype. Using gene expression arrays, we investigated the expression of genes in 210 human cell lines and 2,035 human tissues of multiple origins, both cancer and normal, to examine why aneuploidy occurs. We found 778 genes that are commonly expressed in all 2,245 samples, are distributed throughout the human genome, and have mostly metabolic functions, indicating that these are housekeeping genes, or universally expressed genes, for cell growth and survival. Additionally, we found a subset of variably expressed genes, from which tissue-specific genes were also chosen and investigated. We looked at prostate-specific tissues and found that some genes were more highly expressed in the prostate tissues and cell lines than in non-prostate samples. We provide evidence that the broken chromosomes in the prostate cell lines are retaining specific regions of their genome, in the form of translocations, as a means of conserving universally expressed genes. This allows the maintenance of at least a haploid genome, even in cases where an essential gene is mutated or non-functional.
