(12) INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)
(19) World Intellectual Property Organization, International Bureau
(10) International Publication Number: WO 2014/134728 A1
(43) International Publication Date: 12 September 2014 (12.09.2014)

(51) International Patent Classification: C12Q 1/68 (2006.01), G06F 19/20 (2011.01), C40B 30/00 (2006.01)
(21) International Application Number: PCT/CA2014/050174
(22) International Filing Date: 6 March 2014 (06.03.2014)
(25) Filing Language: English
(26) Publication Language: English
(30) Priority Data: 61/774,271, 7 March 2013 (07.03.2013), US
(71) Applicants: UNIVERSITE DE MONTREAL [CA/CA]; 2900 Edouard-Montpetit, Montreal, Quebec H3T 1J4 (CA). THE WALTER AND ELIZA HALL INSTITUTE OF MEDICAL RESEARCH [AU/AU]; 1G Royal Parade, Parkville, Victoria 3052 (AU).
(72) Inventors: SAUVAGEAU, Guy; 7390, de Tilly, Montreal, Quebec H3R 3E3 (CA). MACRAE, Tara; 4584 Ave. Hingston, Montreal, Quebec H4A 2K1 (CA). SARGEANT, Tobias; 23 Olinda Crescent, Olinda, Melbourne, Victoria 3788 (AU).
(74) Agent: GOUDREAU GAGE DUBUC; 2000, McGill College, #2200, Montreal, Quebec H3A 3H3 (CA).
(81) Designated States (unless otherwise indicated, for every kind of national protection available): AE, AG, AL, AM, AO, AT, AU, AZ, BA, BB, BG, BH, BN, BR, BW, BY, BZ, CA, CH, CL, CN, CO, CR, CU, CZ, DE, DK, DM, DO, DZ, EC, EE, EG, ES, FI, GB, GD, GE, GH, GM, GT, HN, HR, HU, ID, IL, IN, IR, IS, JP, KE, KG, KN, KP, KR, KZ, LA, LC, LK, LR, LS, LT, LU, LY, MA, MD, ME, MG, MK, MN, MW, MX, MY, MZ, NA, NG, NI, NO, NZ, OM, PA, PE, PG, PH, PL, PT, QA, RO, RS, RU, RW, SA, SC, SD, SE, SG, SK, SL, SM, ST, SV, SY, TH, TJ, TM, TN, TR, TT, TZ, UA, UG, US, UZ, VC, VN, ZA, ZM, ZW.
(84) Designated States (unless otherwise indicated, for every kind of regional protection available): ARIPO (BW, GH, GM, KE, LR, LS, MW, MZ, NA, RW, SD, SL, SZ, TZ, UG, ZM, ZW), Eurasian (AM, AZ, BY, KG, KZ, RU, TJ, TM), European (AL, AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, LT, LU, LV, MC, MK, MT, NL, NO, PL, PT, RO, RS, SE, SI, SK, SM, TR), OAPI (BF, BJ, CF, CG, CI, CM, GA, GN, GQ, GW, KM, ML, MR, NE, SN, TD, TG).
Published:
- with international search report (Art. 21(3))
- with sequence listing part of description (Rule 5.2(a))

(54) Title: METHODS AND GENES FOR NORMALIZATION OF GENE EXPRESSION

[FIG. 2]

(57) Abstract: Novel genes that exhibit minimal variation in expression level across different samples and which may be used as housekeeping genes for normalization in quantitative gene expression measurements are disclosed. A novel method for the identification of housekeeping genes using whole Transcriptome Shotgun Sequencing (RNA-seq) is also disclosed.

METHODS AND GENES FOR NORMALIZATION OF GENE EXPRESSION

CROSS REFERENCE TO RELATED APPLICATIONS

The present application claims the benefit of U.S. Provisional Application Serial No. 61/774,271 filed on March 7, 2013, which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

The present invention generally relates to the normalization of measured levels of a gene of interest in applications involving gene expression analysis.

BACKGROUND

Normalization of measured levels of a gene of interest against a stably expressed control gene is the most important action leading to accuracy in quantitative reverse-transcriptase PCR (qRT-PCR) experiments. However, although control gene levels can vary greatly depending on the samples used, control genes are usually selected based solely on convention [1-6]. The control genes most commonly used were originally selected due to their high expression levels in all tissues rather than their low variability among tissues. Numerous studies have shown that the expression of these genes can vary considerably [1-5], thus casting doubt on the accuracy of relative quantification values. A couple of studies aimed at identifying more stable control genes relied on microarray data meta-analysis [7, 8]. However, microarray data are susceptible to errors resulting from hybridization artifacts and saturation of fluorescent signal, and require complicated normalization [10-12]. Leukemia and other cancer samples are prone to higher variability of gene expression compared to normal tissues due to clonal selection and genetic instability. Given the increased interest in expression profiling and identification of marker genes in cancer for personalized medicine, there is a clear need for optimal normalization of gene expression data by identifying control genes with the least possible variation. There is thus a need for the identification of genes suitable for normalization of gene expression. The present description refers to a number of documents, the content of which is herein incorporated by reference in their entirety.

SUMMARY OF THE INVENTION

In a first aspect, the present invention provides a method for comparing expression levels of a test gene in a plurality of samples, comprising: a) measuring the expression of one or more of the control genes depicted in Table 1 below in said plurality of samples;

Table 1

b) measuring the expression of the test gene in said plurality of samples; c) normalizing the expression of the test gene in each sample by comparing expression of the one or more control genes across the samples, and applying normalization to the test gene to obtain normalized expression levels of the test gene; and d) comparing the normalized expression levels of the test gene across said plurality of samples.

In another aspect, the present invention provides a method for normalizing the levels of a test gene present in a plurality of samples comprising a) measuring the expression of one or more of the control genes depicted in Table 1 across said plurality of samples; b) comparing the expression levels of the one or more control genes across said plurality of samples; c) deriving a value for normalizing expression of the one or more control genes across said plurality of samples; and d) normalizing the expression of the test gene in said plurality of samples based on the value obtained in step c).

In an embodiment, the one or more control genes encodes/encode proteins involved in RNA splicing/processing, and is/are KHDRBS1, RBM22, SNW1, CASC3, SF3A1, POLR2C, PAPOLA, HNRNPH3, HNRNPUL1, RBM8A, GTF2F1, USP39, U2AF1, XRN2 and/or ADAR (i.e. one or any combination of the just-noted RNA splicing/processing genes).

In another embodiment, the above-mentioned one or more control genes encode proteins involved in proteasome/ubiquitination and is/are USP4, UBE2I, PSMF1, PSMA1, VCP, PSMD6, PSMD7, KHDRBS1, and/or VPS4A, in a further embodiment UBE2I, PSMF1, PSMA1, PSMD6 and/or VPS4A (i.e. one or any combination of the just-noted proteasome/ubiquitination genes). In another embodiment, the above-mentioned one or more control genes is/are HNRNPL, PCBP2, GNB1, SLC25A3, ZNF207, UBE2I, VPS4A, PSMF1, PSMA1, SRSF9 and/or PSMD6 (i.e. any combination thereof). In an embodiment, the control gene is HNRNPL. In an embodiment, the control gene is PCBP2. In an embodiment, the control gene is GNB1. In an embodiment, the control gene is SLC25A3. In an embodiment, the control gene is ZNF207. In an embodiment, the control gene is UBE2I. In an embodiment, the control gene is VPS4A. In an embodiment, the control gene is PSMF1. In an embodiment, the control gene is PSMA1. In an embodiment, the control gene is SRSF9. In an embodiment, the control gene is PSMD6.

In an embodiment, the expression is measured at the mRNA level. In a further embodiment, the mRNA is reverse transcribed to cDNA prior to the measuring. In a further embodiment, the mRNA or cDNA is amplified prior to said measuring. In a further embodiment, the amplification is by PCR, more particularly real time PCR (RT-PCR) (e.g., quantitative RT-PCR, qRT-PCR).

In an embodiment, the above-mentioned plurality of samples comprises a normal cell sample. In another embodiment, the above-mentioned plurality of samples comprises a tumor cell sample. In another embodiment, the plurality of samples comprises both a normal cell sample and a tumor cell sample. In a further embodiment, the tumor cell sample is a leukemia cell sample, a breast cancer cell sample, a colon cancer cell sample, a kidney cancer cell sample and/or a lung cancer cell sample, more particularly a leukemia cell sample.

In another aspect, the present invention provides a method for identifying a gene useful for normalizing the expression of a test gene across a plurality of samples, comprising a) performing whole Transcriptome Shotgun Sequencing (RNA-seq) on said plurality of samples; b) comparing the level of expression of the genes of the transcriptome across the plurality of samples; and c) identifying the gene(s) exhibiting a coefficient of variation (CV) of about 25% or less and a maximum fold-change (MFC) of about 10 or less across the plurality of samples.
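By way of non-limiting illustration, the selection criteria of steps b) and c) may be sketched as follows. The sketch assumes that expression values (e.g., RPKM) are already available per gene and per sample; the function name, the gene names and the values are hypothetical, and the use of the population standard deviation is an assumption, since the description only states that the CV equals the standard deviation divided by the mean.

```python
from statistics import mean, pstdev


def select_control_genes(rpkm, max_cv=0.25, max_mfc=10.0):
    """Return genes whose CV and MFC across samples fall under the thresholds.

    rpkm maps a gene name to its expression values, one value per sample.
    The result maps each retained gene to its (CV, MFC) pair.
    """
    selected = {}
    for gene, values in rpkm.items():
        lowest, average = min(values), mean(values)
        if lowest <= 0 or average <= 0:
            continue  # MFC is undefined when a sample shows no expression
        cv = pstdev(values) / average  # coefficient of variation (assumed: population SD)
        mfc = max(values) / lowest     # maximum fold-change: highest over lowest value
        if cv <= max_cv and mfc <= max_mfc:
            selected[gene] = (cv, mfc)
    return selected


# Hypothetical expression values for three genes across four samples.
example = {
    "GENE_A": [100.0, 110.0, 95.0, 105.0],  # stable
    "GENE_B": [10.0, 200.0, 50.0, 5.0],     # highly variable
    "GENE_C": [40.0, 0.0, 42.0, 41.0],      # not expressed in one sample
}
stable = select_control_genes(example)
```

With these hypothetical values only GENE_A is retained. Stricter thresholds can be applied by passing smaller values for max_cv and max_mfc.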

In an embodiment, the MFC is about 5 or less, more particularly about 2 or less. In an embodiment, the CV is about 20% or less, more particularly about 15% or less. Other objects, advantages and features of the present invention will become more apparent upon reading of the following non-restrictive description of specific embodiments thereof, given by way of example only with reference to the accompanying drawings.

BRIEF DESCRIPTION OF DRAWINGS

In the appended drawings:

FIG. 1 shows the distribution of coefficient of variation of control genes in relation to all genes in combined TCGA RNA-seq data. Mean expression represents the average of all RPKM values for a given gene across the combined TCGA data set (1933 samples). Coefficient of variation equals the standard deviation divided by the mean RPKM. Each dot represents a single gene: grey dots represent entire transcriptome; the larger dark dots represent new control genes with expression greater than or less than 100 RPKM; the standard control genes are indicated. The curves represent the 5th, 25th, 50th and 75th quantiles of coefficient of variation for a given expression level computed over windows of 2000 ranked genes centered about a given mean RPKM value;

FIG. 2 shows the average expression stability of control genes in qRT-PCR. Average expression stability (M) was calculated with the GeNorm algorithm [16] based on qRT-PCR for the indicated control gene on a panel of 14 leukemia samples and one cord blood sample. Lower M values relate to genes which proved to have more stable expression levels across the samples used;

FIG. 3 shows the correlation between RPKM and delta Ct of CD33 calculated with different control genes. dCt represents the difference between the Ct value of CD33 and that of the indicated control gene, for a given leukemic sample, measured by qRT-PCR. RPKM is plotted on a log2 scale and represents the Reads Per Kilobase of transcript per Million mapped reads obtained for each leukemic sample by RNA-seq. ρ represents the Spearman correlation coefficient between the RPKM and the dCt obtained with the indicated control gene; and

FIG. 4 shows the comparison of EIF4H gene expression values calculated with GAPDH or HNRNPL. RQ represents relative quantification of EIF4H determined by qRT-PCR, calculated using the ddCt method with either GAPDH or HNRNPL as the control gene, relative to the CD34+ cord blood (CB) sample. The X axis indicates the leukemic sample ID. CV (expressed as a percentage) indicates the coefficient of variation and equals the standard deviation divided by the mean RQ of CD33 calculated using the indicated control gene. MFC (maximum fold change) represents the maximum RQ value divided by the minimum RQ value.
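For illustration only, the idea behind an average expression stability value such as M may be sketched as follows. This is not the GeNorm software [16] itself; it is a minimal sketch which assumes that M for a gene is the mean, over all other candidate genes, of the standard deviation across samples of the log2 ratio between the two genes. The gene names and relative quantities are hypothetical.

```python
from math import log2
from statistics import mean, stdev


def expression_stability(quantities):
    """Compute a geNorm-style stability value M for each candidate gene.

    quantities maps a gene name to relative quantities, one per sample,
    listed in the same sample order for every gene. At least two genes
    and two samples are required.
    """
    genes = list(quantities)
    m_values = {}
    for gene in genes:
        pairwise_sd = []
        for other in genes:
            if other == gene:
                continue
            # Log2 ratio of the two genes in each sample.
            ratios = [log2(a / b) for a, b in zip(quantities[gene], quantities[other])]
            # Variation of that ratio across samples (sample standard deviation).
            pairwise_sd.append(stdev(ratios))
        m_values[gene] = mean(pairwise_sd)
    return m_values


# Hypothetical relative quantities for three candidate genes in three samples.
m_example = expression_stability({
    "CTRL_1": [1.0, 2.0, 4.0],
    "CTRL_2": [1.0, 2.0, 4.0],  # varies together with CTRL_1
    "CTRL_3": [1.0, 4.0, 1.0],  # varies independently
})
```

In this hypothetical example CTRL_3 receives the highest M value, consistent with lower M values relating to more stably expressed genes.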

DISCLOSURE OF INVENTION

In the present description, a number of terms are extensively utilized. In order to provide a clear and consistent understanding of the specification and claims, including the scope to be given to such terms, the following definitions are provided. The use of the word "a" or "an" when used in conjunction with the term "comprising" in the claims and/or the specification may mean "one" but it is also consistent with the meaning of "one or more", "at least one", and "one or more than one". As used in this specification and claim(s), the words "comprising" (and any form of comprising, such as "comprise" and "comprises"), "having" (and any form of having, such as "have" and "has"), "including" (and any form of including, such as "includes" and "include") or "containing" (and any form of containing, such as "contains" and "contain") are inclusive or open-ended and do not exclude additional, un-recited elements or method steps. Throughout this application, the term "about" is used to indicate that a value includes the standard deviation of error for the device or method being employed to determine the value. In general, the terminology "about" is meant to designate a possible variation of up to 10%. Therefore, a variation of 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10% of a value is included in the term "about". An "isolated nucleic acid molecule", as is generally understood and used herein, refers to a polymer of nucleotides, and includes, but is not limited to, DNA and RNA. The "isolated" nucleic acid molecule is purified from its natural in vivo state, obtained by cloning or chemically synthesized. Nucleotide sequences are presented herein by single strand, in the 5' to 3' direction, from left to right, using the one-letter nucleotide symbols as commonly used in the art and in accordance with the recommendations of the IUPAC-IUB Biochemical Nomenclature Commission. 
As used herein, "gene" is meant to broadly include any nucleic acid sequence transcribed into an RNA molecule, whether the RNA is coding (e.g., mRNA) or non-coding (e.g., ncRNA). A number of gene/protein names and/or accession numbers are referred to herein. Accessing the corresponding sequence information based on gene/protein names and/or accession numbers can be readily done by any person of ordinary skill in the art from a number of publicly available gene databanks. "Hybridization" or "nucleic acid hybridization" refers generally to the hybridization of two single stranded nucleic acid molecules having complementary base sequences, which under appropriate conditions will form a thermodynamically favored double stranded structure. The term "hybridizes" as used herein may relate to hybridizations under stringent or non-stringent conditions. The setting of conditions is well within the skill of the artisan and can be determined according to protocols described in the art. The term "hybridizing sequences" preferably refers to sequences which display a sequence identity of at least 40%, preferably at least 50%, more preferably at least 60%, even more preferably at least 70%, particularly preferred at least 80%, more particularly preferred at least 90%, even more particularly preferred at least 95% and most preferably at least 97% identity. Examples of hybridization conditions can be found in the two laboratory manuals referred to above (Sambrook et al., 2000, supra and Ausubel et al., 1994, supra), or further in Higgins and Hames (Eds.) "Nucleic acid hybridization, a practical approach" IRL Press Oxford, Washington DC, (1985), and are commonly known in the art. 
In the case of a hybridization to a nitrocellulose filter (or other such support like nylon), as for example in the well-known Southern blotting procedure, a nitrocellulose filter can be incubated overnight at a temperature representative of the desired stringency condition (60-65°C for high stringency, 50-60°C for moderate stringency and 40-45°C for low stringency conditions) with a labeled probe in a solution containing high salt (6x SSC or 5x SSPE), 5x Denhardt's solution, 0.5% SDS, and 100 µg/ml denatured carrier DNA (e.g., salmon sperm DNA). The non-specifically binding probe can then be washed off the filter by several washes in 0.2x SSC/0.1% SDS at a temperature which is selected in view of the desired stringency: room temperature (low stringency), 42°C (moderate stringency) or 65°C (high stringency). The salt and SDS concentration of the washing solutions may also be adjusted to accommodate the desired stringency. The selected temperature and salt concentration are based on the melting temperature (Tm) of the DNA hybrid. Of course, RNA-DNA hybrids can also be formed and detected. In such cases, the conditions of hybridization and washing can be adapted according to well-known methods by the person of ordinary skill. Stringent conditions will be preferably used (Sambrook et al., 2000, supra). Other protocols or commercially available hybridization kits (e.g., ExpressHyb™ from BD Biosciences Clontech) using different annealing and washing solutions can also be used as well known in the art. As is well known, the length of the probe and the composition of the nucleic acid to be determined constitute further parameters of the hybridization conditions. Note that variations in the above conditions may be accomplished through the inclusion and/or substitution of alternate blocking reagents used to suppress background in hybridization experiments. 
Typical blocking reagents include Denhardt's reagent, BLOTTO, heparin, denatured salmon sperm DNA, and commercially available proprietary formulations. The inclusion of specific blocking reagents may require modification of the hybridization conditions described above, due to problems with compatibility. Hybridizing nucleic acid molecules also comprise fragments of the above described molecules. Furthermore, nucleic acid molecules which hybridize with any of the aforementioned nucleic acid molecules also include complementary fragments, derivatives and allelic variants of these molecules. Additionally, a hybridization complex refers to a complex between two nucleic acid sequences by virtue of the formation of hydrogen bonds between complementary G and C bases and between complementary A and T bases; these hydrogen bonds may be further stabilized by base stacking interactions. The two complementary nucleic acid sequences hydrogen bond in an antiparallel configuration. A hybridization complex may be formed in solution (e.g., Cot or Rot analysis) or between one nucleic acid sequence present in solution and another nucleic acid sequence immobilized on a solid support (e.g., membranes, filters, chips, pins or glass slides to which, e.g., cells have been fixed). The terms "complementary" or "complementarity" refer to the natural binding of polynucleotides under permissive salt and temperature conditions by base-pairing. For example, the sequence "A-G-T" binds to the complementary sequence "T-C-A". Complementarity between two single-stranded molecules may be "partial", in which only some of the nucleic acids bind, or it may be complete when total complementarity exists between single-stranded molecules. The degree of complementarity between nucleic acid strands has significant effects on the efficiency and strength of hybridization between nucleic acid strands. 
This is of particular importance in amplification reactions, which depend upon binding between nucleic acid strands. By "sufficiently complementary" is meant a contiguous nucleic acid base sequence that is capable of hybridizing to another sequence by hydrogen bonding between a series of complementary bases. Complementary base sequences may be complementary at each position in sequence by using standard base pairing (e.g., G:C, A:T or A:U pairing) or may contain one or more residues (including abasic residues) that are not complementary by using standard base pairing, but which allow the entire sequence to specifically hybridize with another base sequence in appropriate hybridization conditions. Contiguous bases of an oligomer are preferably at least about 80% (81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100%), more preferably at least about 90% complementary to the sequence to which the oligomer specifically hybridizes. The term "identical" or "percent identity" in the context of two or more nucleic acid or amino acid sequences as used herein, refers to two or more sequences or subsequences that are the same, or that have a specified percentage of amino acid residues or nucleotides that are the same (e.g., 60% or 65% identity, preferably 70-95% identity, more preferably at least 95% identity), when compared and aligned for maximum correspondence over a window of comparison, or over a designated region as measured using a sequence comparison algorithm as known in the art, or by manual alignment and visual inspection. Sequences having, for example, 60% to 95% or greater sequence identity are considered to be substantially identical. Such a definition also applies to the complement of a test sequence. Preferably the described identity exists over a region that is at least about 15 to 25 amino acids or nucleotides in length, more preferably, over a region that is about 50 to 100 amino acids or nucleotides in length. 
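As a simple illustration of the "percent identity" concept, the following sketch counts identical positions between two sequences that have already been aligned. It is not any of the alignment programs named in this description; the function name and the example sequences are hypothetical. The count_gaps parameter is an assumption introduced here to show that the way gapped positions are counted changes the resulting value.

```python
def percent_identity(aligned_a, aligned_b, count_gaps=True):
    """Percent identity between two pre-aligned sequences of equal length.

    Gaps are written as '-'. With count_gaps=True, a column holding a gap in
    one sequence counts as a mismatch; with count_gaps=False such columns are
    left out of the comparison. Columns with a gap in both sequences are
    always ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = compared = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        gapped = x == "-" or y == "-"
        if gapped and (not count_gaps or x == y):
            continue
        compared += 1
        if not gapped and x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical alignment of two short nucleotide sequences.
with_gaps = percent_identity("ACGT-A", "ACGTTA")
without_gaps = percent_identity("ACGT-A", "ACGTTA", count_gaps=False)
```

Here five of six aligned columns are identical when the gapped column is counted, whereas all five compared columns are identical when it is ignored, which shows how disregarding gaps can overestimate identity.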
Those having skill in the art will know how to determine percent identity between/among sequences using, for example, algorithms such as those based on CLUSTALW computer program (Thompson Nucl. Acids Res. 22 (1994), 4673-4680) or FASTDB (Brutlag Comp. App. Biosci. 6 (1990), 237-245), as known in the art. Although the FASTDB algorithm typically does not consider internal non-matching deletions or additions in sequences, i.e., gaps, in its calculation, this can be corrected manually to avoid an overestimation of the % identity. CLUSTALW, however, does take sequence gaps into account in its identity calculations. Also available to those having skill in this art are the BLAST and BLAST 2.0 algorithms (Altschul, Nucl. Acids Res. 25 (1997): 3389-3402). The BLASTN program for nucleic acid sequences uses as defaults a word length (W) of 11, an expectation (E) of 10, M=5, N=4, and a comparison of both strands. For amino acid sequences, the BLASTP program uses as defaults a wordlength (W) of 3, and an expectation (E) of 10. The BLOSUM62 scoring matrix (Henikoff Proc. Natl. Acad. Sci., USA, 89, (1989), 10915) uses alignments (B) of 50, expectation (E) of 10, M=5, N=4, and a comparison of both strands. Moreover, the present invention also relates to nucleic acid molecules the sequence of which is degenerate in comparison with the sequence of an above-described hybridizing molecule. When used in accordance with the present invention the term "being degenerate as a result of the genetic code" means that due to the redundancy of the genetic code different nucleotide sequences code for the same amino acid. The present invention also relates to nucleic acid molecules which comprise one or more mutations or deletions, and to nucleic acid molecules which hybridize to one of the herein described nucleic acid molecules, which show (a) mutation(s) or (a) deletion(s). A "probe" is meant to include a nucleic acid oligomer or aptamer that hybridizes specifically to a target sequence in a nucleic acid or its complement, under conditions that promote hybridization, thereby allowing detection of the target sequence or its amplified nucleic acid. Detection may either be direct (i.e., resulting from a probe hybridizing directly to the target or amplified sequence) or indirect (i.e., resulting from a probe hybridizing to an intermediate molecular structure that links the probe to the target or amplified sequence). A probe's "target" generally refers to a sequence within an amplified nucleic acid sequence (i.e., a subset of the amplified sequence) that hybridizes specifically to at least a portion of the probe sequence by standard hydrogen bonding or "base pairing." 
Sequences that are "sufficiently complementary" allow stable hybridization of a probe sequence to a target sequence, even if the two sequences are not completely complementary. A probe may be labeled or unlabeled. A probe can be produced by molecular cloning of a specific DNA sequence or it can also be synthesized. Numerous primers and probes which can be designed and used in the context of the present invention can be readily determined by a person of ordinary skill in the art to which the present invention pertains. As used herein, a "primer" defines an oligonucleotide which is capable of annealing to a target sequence, thereby creating a double stranded region which can serve as an initiation point for nucleic acid synthesis under suitable conditions. Primers can be, for example, designed to be specific for certain alleles so as to be used in an allele-specific amplification system. The primer's 5' region may be non-complementary to the target nucleic acid sequence and include additional bases, such as a promoter sequence (which is referred to as a "promoter primer"). Those skilled in the art will appreciate that any oligomer that can function as a primer can be modified to include a 5' promoter sequence, and thus function as a promoter primer. Similarly, any promoter primer can serve as a primer, independent of its functional promoter sequence. Of course, the design of a primer from a known nucleic acid sequence is well known in the art. Oligos can comprise a number of types of different nucleotides. Skilled artisans can easily assess the specificity of selected primers and probes by performing computer alignments/searches using well-known databases (e.g., Genbank™). Primers and probes can be designed based upon exon or intron sequences present in the mRNA transcript using publicly available sequence databases such as the NCBI Reference Sequence (RefSeq) database. 
Where necessary or desired, primers and probes are designed to detect the maximum number of transcripts for the gene of interest without detecting gene products with similar sequence such as homologs. Those skilled in the art will recognize that primer and probe design requires several steps, such as mapping the target sequence to the genome, identifying exon-exon junctions and designing a primer at each junction, and identifying SNPs and transcript variants that can be detected simultaneously or separately with a set of primers. Other factors that can influence primer design include, without being restricted to: primer length, melting temperature (Tm), G/C content, specificity, complementary primer sequence, primer dimers and 3' sequence. For general use, optimal primers and probes can be designed using any commercially or otherwise publicly available primer/probe design software, such as PrimerExpress™ (Applied Biosystems) or Primer3™ (http://primer3.sourceforge.net). "Amplification" or "amplification reaction" refers to any in vitro procedure for obtaining multiple copies ("amplicons") of a target nucleic acid sequence or its complement, or fragments thereof. In vitro amplification refers to production of an amplified nucleic acid that may contain less than the complete target region sequence or its complement. In vitro amplification methods include, e.g., transcription-mediated amplification, replicase-mediated amplification, polymerase chain reaction (PCR) amplification, ligase chain reaction (LCR) amplification and strand-displacement amplification (SDA including multiple strand-displacement amplification method (MSDA)). Replicase-mediated amplification uses self-replicating RNA molecules, and a replicase such as Qβ-replicase (e.g., Kramer et al., U.S. Pat. No. 4,786,600). PCR amplification is well known and uses DNA polymerase, primers and thermal cycling to synthesize multiple copies of the two complementary strands of DNA or cDNA (e.g., Mullis et al., U.S. Pat. Nos. 4,683,195, 4,683,202, and 4,800,159). LCR amplification uses at least four separate oligonucleotides to amplify a target and its complementary strand by using multiple cycles of hybridization, ligation, and denaturation (e.g., EP Pat. App. Pub. No. 0 320 308). SDA is a method in which a primer contains a recognition site for a restriction endonuclease that permits the endonuclease to nick one strand of a hemimodified DNA duplex that includes the target sequence, followed by amplification in a series of primer extension and strand displacement steps (e.g., Walker et al., U.S. Pat. No. 5,422,252). Two other known strand-displacement amplification methods do not require endonuclease nicking (Dattagupta et al., U.S. Patent No. 6,087,133 and U.S. Patent No. 6,124,120 (MSDA)). Those skilled in the art will understand that the oligonucleotide primer sequences of the present invention may be readily used in any in vitro amplification method based on primer extension by a polymerase (see generally Kwoh et al., 1990, Am. Biotechnol. Lab. 8:14-25; Kwoh et al., 1989, Proc. Natl. Acad. Sci. USA 86, 1173-1177; Lizardi et al., 1988, BioTechnology 6:1197-1202; Malek et al., 1994, Methods Mol. Biol., 28:253-260; and Sambrook et al., 2000, Molecular Cloning - A Laboratory Manual, Third Edition, CSH Laboratories). As commonly known in the art, the oligonucleotides are designed to bind to a complementary sequence under selected conditions. Sequencing technologies such as Sanger sequencing, pyrosequencing, sequencing by ligation, massively parallel sequencing, also called "Next-generation sequencing" (NGS), and other high-throughput sequencing approaches with or without sequence amplification of the target can also be used to detect and quantify the presence of target nucleic acid in a sample. The terminologies "level" and "amount" are used herein interchangeably when referring to a marker which is measured.

Normalization of gene expression

Genes that exhibit minimal variation in expression level (e.g., protein and/or mRNA levels) across a variety of samples, cell types and biological conditions provide valuable controls for relative quantification of nucleic acids. Normalizing quantitative data with housekeeping or control genes has many applications, from identifying genes regulated during embryogenesis to developing new cancer diagnostics. The use of housekeeping or control genes is necessary for proper interpretation of quantitative gene expression measurements in samples (e.g., clinical samples, such as tumor samples), notably to correct expression data for differences in cellular input, nucleic acid (RNA) quality, reverse transcription (RT) and amplification efficiency between samples, for example. In the studies described herein, the present inventors have identified several control (normalizing or housekeeping) genes whose expression is stable in leukemia samples, in samples from other cancer types and in normal samples, and thus that may potentially be used as general control genes for most human tissues. These "housekeeping" genes may thus be utilized as references, internal controls and reference values in the quantification of gene expression and of RNA and mRNA by means of methods such as Northern Blotting, Ribonuclease Protection Assay, capillary electrophoresis, microarrays, RNA-seq, and quantitative real-time PCR.

Accordingly, in a first aspect, the present invention provides a method of determining the expression level of one or more genes of interest (or test genes) in a cell of a sample, such as one or more cells of a sample from a subject. The method comprises determining the expression level(s) of said one or more genes of interest in said sample, determining the expression level(s) of one or more of the control genes depicted in Table 1 in said sample, and comparing said expression level(s) of said one or more genes of interest to the expression level of said one or more control genes. The expression level(s) of one or more control genes of the invention provides a means to "normalize" the expression data from the genes of interest for comparison of data from a sample. Stated differently, expression of a gene of interest is calculated in a manner "relative to" expression of one or more control genes of the invention. The normalization may also be used for comparisons between samples, especially when they are conducted in separate experiments. Alternatively, the control genes may be used in the same manner as other "housekeeping" gene sequences known to the skilled person.
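As a non-limiting illustration of calculating expression "relative to" a control gene, the following sketch applies the ddCt method to qRT-PCR threshold cycle (Ct) values. It assumes an amplification efficiency of approximately 100% (a doubling per cycle) for both the test gene and the control gene; the function name and the Ct values are hypothetical.

```python
def relative_quantity(ct_test, ct_control, ct_test_ref, ct_control_ref):
    """Relative quantification (RQ) of a test gene by the ddCt method.

    ct_test / ct_control: Ct of the test and control gene in the sample.
    ct_test_ref / ct_control_ref: the same Ct values in the reference
    (calibrator) sample. Assumes roughly 100% PCR efficiency.
    """
    d_ct_sample = ct_test - ct_control             # dCt in the sample of interest
    d_ct_reference = ct_test_ref - ct_control_ref  # dCt in the reference sample
    dd_ct = d_ct_sample - d_ct_reference           # ddCt
    return 2.0 ** (-dd_ct)


# Hypothetical Ct values: the test gene crosses the threshold two cycles
# earlier in the sample than in the reference, at equal control gene Ct.
rq = relative_quantity(ct_test=24.0, ct_control=20.0,
                       ct_test_ref=26.0, ct_control_ref=20.0)
```

With these hypothetical values the ddCt is -2, corresponding to a four-fold higher normalized level of the test gene in the sample than in the reference.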

In another aspect, the present invention provides a method for comparing expression levels of a test gene in a plurality (e.g., two or more) of samples, comprising: a) measuring the expression of one or more of control genes depicted in Table 1 in said plurality of samples; b) measuring the expression of the test gene in said plurality of samples; c) normalizing the expression of the test gene in each sample by i) comparing expression of the one or more control genes across the samples, and ii) applying normalization to the test gene to obtain normalized expression levels of the test gene; and d) comparing the normalized expression levels of the test gene across said plurality of samples.

In another aspect, the present invention provides a method for normalizing the levels of a test gene present in a plurality of samples, comprising: a) measuring the expression of one or more of control genes depicted in Table 1 across said plurality of samples; b) comparing the expression levels of the one or more control genes across said plurality of samples; c) deriving a value for normalizing expression of the one or more control genes across said plurality of samples; and d) normalizing the expression of the test gene in said plurality of samples based on the value obtained in step c).
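Steps b) to d) of the above method can be sketched as follows. This is only a minimal illustration under one assumed convention: the normalizing value for each sample is taken as the control gene level in that sample divided by the mean control gene level across all samples. The function names, sample labels and numbers are hypothetical and are not taken from the studies described herein.

```python
from statistics import mean


def normalization_factors(control_levels):
    """Derive one normalizing value per sample (step c).

    control_levels maps a sample label to the measured level of the
    control gene. Assumed convention: factor = level in the sample
    divided by the mean level across all samples.
    """
    reference = mean(control_levels.values())
    return {sample: level / reference for sample, level in control_levels.items()}


def normalize_test_gene(test_levels, factors):
    """Normalize the test gene in every sample with its factor (step d)."""
    return {sample: test_levels[sample] / factors[sample] for sample in test_levels}


# Hypothetical measurements: sample_B carries twice the input of sample_A.
control = {"sample_A": 100.0, "sample_B": 200.0}
test = {"sample_A": 50.0, "sample_B": 100.0}

factors = normalization_factors(control)
normalized = normalize_test_gene(test, factors)
print(normalized)
```

In this toy case the two normalized values coincide, which reflects that the apparent two-fold difference in the test gene is explained entirely by the difference in the control gene.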

In another aspect, the present invention provides a method for normalizing gene expression analysis data with the aid of one or more control genes, comprising: a) carrying out at least one gene expression analysis assay on a test gene in a plurality of samples; b) examining the expression of one or more of control genes depicted in Table 1 jointly in the same assay as a basis for the normalization of the gene expression analysis data of the samples to be examined, c) detecting signals from the gene expression analyses which reflect the level of gene expression of the test gene and of the one or more control genes; d) subjecting the signal data obtained in step c) to a mathematical transformation in order to at least weaken the technical variability of the signal data; and e) normalizing thereby the transformed signal data of the samples to be examined.

In another aspect, the present invention provides the use of one or more of control genes depicted in Table 1 for normalizing the levels of a test gene (or of a plurality of test genes) across a plurality of samples.

Table 2: List of 119 housekeeping genes identified in the studies described herein, with their corresponding GenBank accession numbers

No.  Gene       GenBank Accession #

1    ABM        NM_001012750, NM_001012751, NM_001012752, NM_001178116, NM_001178119, NM_001178120, NM_001178121, NM_001178122, NM_001178123,

     GNB1       NM_002074, NM_001282538, NM_001282539
     GORASP2    NM_001201428, NM_015530
     GTF2F1     NM_002096
     HDAC3      NM_003883
     HNRNPA2B1  NM_002137, NM_031243
     HNRNPC     NM_001077442, NM_001077443, NM_004500, NM_031314
     HNRNPD     NM_002138, NM_031369, NM_031370
     HNRNPH3    NM_012207, NM_021644
     HNRNPK     NM_002140, NM_031262, NM_031263
     HNRNPL     NM_001005335, NM_001533
     HNRNPU     NM_004501, NM_031844
     HNRNPUL1   NM_007040
     IDH3B      NM_001258384, NM_006899, NM_174855, NM_174856
     IK         NM_006083
     KARS       NM_001130089, NM_005548
     KHDRBS1    NM_006559
     LSM14A     NM_00114093, NM_015578
     MAPRE1     NM_012325
     MARS       NM_004990
     MLF2       NM_005439
     MMADHC     NM_015702
     MORF4L1    NM_001265605, NM_006791, NM_206839
     MRFAP1     NM_033296
     MRPL9      NM_031420
     MTA2       NM_004739
     MYL12B     NM_001144944, NM_001144945, NM_033546
     NOL7       NM_016167
     NRD1       NM_001101662, NM_001242361, NM_002525

     OCIAD1     NM_001079839, NM_001079840, NM_001079841, NM_001079842, NM_001168254, NM_017830
     PAPOLA     NM_001252006, NM_032632
     PCBP2      NM_001098620, NM_001128911, NM_001128912, NM_001128913, NM_001128914, NM_005016, NM_031989
     POLR2C     NM_032940
     PSMA1      NM_002786, NM_148976
     PSMB1      NM_002793
     PSMD2      NM_002808
     PSMD6      NM_014814, NM_001271780, NM_001271779, NM_001271781
     PSMD7      NM_002811
     PSME1      NM_006263, NM_176783
     PSME3      NM_001267045, NM_005789, NM_176863
     PSMF1      NM_006814, NM_178578
     PTPRA      NM_002836, NM_080840, NM_080841

118  ZC3H11A    NM_001271675, NM_014827
119  ZNF207     NM_001032293, NM_001098507, NM_003457

"Normalizing" or "normalization" as used herein refers to the correction of raw gene expression values/data between different samples for sample to sample variations, to take into account differences in "extrinsic" parameters such as cellular input, nucleic acid (RNA) or protein quality, efficiency of reverse transcription (RT), amplification, labeling, purification, etc., i.e. differences not due to actual "intrinsic" variations in gene expression by the cells in the samples (which is what needs to be measured in gene expression studies/analyses). Such normalization is performed by correcting the raw gene expression values/data for a test gene (or gene of interest) based on the gene expression values/data measured for one or more "housekeeping" or "control" genes, i.e. genes whose expression is known to be constant (i.e. to show relatively low variability) between the cells of different tissues and under different experimental conditions. Assume, for example, that the level of a test gene measured in a first sample is two-fold higher than the level measured in a second sample. If the level of the "housekeeping" or "control" gene is substantially the same between the two samples, it may be concluded that the test gene is expressed at higher levels (about 2-fold) in the first sample, relative to the second. If, however, the level of the control gene measured in the first sample is also two-fold higher than the level measured in the second sample, it may be concluded that the test gene is expressed at similar levels in the two samples. Thus, normalization ensures accurate comparison of expression of a test gene between different samples. Normalization may be performed by dividing, across all samples, the expression value(s) obtained for the test gene by the expression value(s) obtained for the control gene(s).

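The two-fold example given above can be worked through numerically. The following is a minimal sketch of normalization by division, as described in the preceding paragraph; the function name and the raw values are hypothetical and serve only to reproduce the two scenarios.

```python
def normalize(test_level, control_level):
    """Divide the raw test gene level by the raw control gene level."""
    if control_level <= 0:
        raise ValueError("control gene level must be positive")
    return test_level / control_level


# Scenario 1: the test gene is two-fold higher in the first sample and
# the control gene is the same in both samples.
ratio_case1 = normalize(200.0, 50.0) / normalize(100.0, 50.0)

# Scenario 2: the test gene and the control gene are both two-fold
# higher in the first sample.
ratio_case2 = normalize(200.0, 100.0) / normalize(100.0, 50.0)

print(ratio_case1)  # 2.0: genuinely higher expression in the first sample
print(ratio_case2)  # 1.0: similar expression once normalized
```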
In an embodiment, the expression of the control genes is measured in the same plate (run) as the gene of interest. In another embodiment, the expression of the control genes is measured in a different plate (run) than the gene of interest.

In the above-noted methods, the control genes may be used singly (the level of one control gene is used for normalizing the expression of the test nucleic acid) or in combination (the levels of two or more control genes are used for normalizing the expression of the test gene). Any combination (e.g., any combination of 2, 3, 4, 5, 6 or more control nucleic acids) of the control nucleic acids corresponding to the genes depicted in Table 1 may be used in the above-noted methods. In an embodiment, the one or more control genes depicted in Table 1 may be used in combination with one or more known housekeeping genes. When more than one control gene is used, the amounts of gene expression of such control genes can be averaged, combined together by straight addition, or combined by a defined algorithm. Also, the expression level of one or a plurality of test genes (2, 3, 4, 5, 6 or more test genes) may be measured and normalized using the above methods. The methods of the invention may be advantageously used in an array based format (e.g. protein array, antibody array, gene/nucleic acid or oligonucleotide array), and thus a plurality of genes may be evaluated for their expression at the same time. One or more of the control genes of the invention may also be evaluated as part of the same experiment.
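The options for combining several control genes can be sketched as follows. Averaging and straight addition are named in the paragraph above; the geometric mean is included here only as one assumed example of a "defined algorithm" and is not stated in the source. The function name and the expression values are hypothetical.

```python
from math import prod
from statistics import mean


def combine_controls(levels, method="mean"):
    """Combine the levels of several control genes into one reference value."""
    if not levels:
        raise ValueError("at least one control gene level is required")
    if method == "mean":  # arithmetic average
        return mean(levels)
    if method == "sum":  # straight addition
        return sum(levels)
    if method == "geometric_mean":  # assumed example of a defined algorithm
        return prod(levels) ** (1.0 / len(levels))
    raise ValueError(f"unknown method: {method}")


# Hypothetical levels of two control genes measured in one sample.
control_levels = [100.0, 400.0]
test_level = 50.0

for method in ("mean", "sum", "geometric_mean"):
    reference = combine_controls(control_levels, method)
    print(method, reference, test_level / reference)
```

Whichever combination is chosen, the same method would need to be applied to every sample for the normalized values to remain comparable.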

In an embodiment, the one or more control genes is/are a gene encoding a protein involved in RNA splicing/processing. In a further embodiment, the one or more control genes is/are KHDRBS1, RBM22, SNW1, CASC3, SF3A1, POLR2C, PAPOLA, HNRNPH3, HNRNPUL1, RBM8A, GTF2F1, USP39, U2AF1, XRN2 and/or ADAR.

In another embodiment, the one or more control genes is/are a gene encoding a protein involved in protein degradation, e.g. proteasome/ubiquitination (proteasome/ligase activity). In a further embodiment, the one or more control genes is/are USP4, UBE2I, PSMF1, PSMA1, VCP, PSMD6, PSMD7, KHDRBS1, and/or VPS4A, in yet a further embodiment UBE2I, PSMF1, PSMA1, PSMD6 and/or VPS4A. In another embodiment, the control gene is a gene encoding a protein involved in transcription.

In another embodiment, the control gene is a gene encoding a protein involved in translation. The expression of a gene may be measured by quantitating the level of a gene product, such as a nucleic acid (mRNA) or the translated protein encoded by the gene. In an embodiment, the expression of the one or more control genes and/or test gene is measured at the protein level. Examples of methods to measure the amount/level of a protein in a sample include, but are not limited to: Western blot, immunoblot, enzyme-linked immunosorbent assay (ELISA), "sandwich" immunoassays, radioimmunoassay (RIA), immunoprecipitation, surface plasmon resonance (SPR), chemiluminescence, fluorescent polarization, phosphorescence, immunohistochemical (IHC) analysis, matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry, microcytometry, microarray, antibody array, microscopy (e.g., electron microscopy), flow cytometry, and proteomic-based assays.

In another embodiment, the expression of the one or more control genes and/or test genes is measured at the nucleic acid (mRNA, cDNA) level. Numerous detection and quantification technologies may be used to determine the expression level of the one or more control nucleic acids and/or test nucleic acids, including but not limited to: PCR; RT-PCR; RT-qPCR; NASBA; Northern blot technology; a hybridization array; branched nucleic acid amplification/technology; TMA; LCR; high-throughput sequencing or next generation sequencing (NGS) methods such as RNA-seq; in situ hybridization technology; and an amplification process followed by HPLC detection or MALDI-TOF mass spectrometry. In a particular embodiment, an amplification process is performed by PCR. Commercially available systems for quantitative PCR may be used, for example, "Real Time PCR System" of Applied Biosystems®, LightCycler® from Roche, iCycler® from BioRad®, and others. The detection methods described herein are meant to exemplify how the present invention may be practiced and are not meant to limit the scope of the invention. It is contemplated that other sequence-based methodologies for detecting the presence of a nucleic acid in a subject sample may be employed according to the invention. In embodiments of the invention, all or part of a nucleic acid may be amplified and detected by methods such as the polymerase chain reaction (PCR) and variations thereof, such as, but not limited to, quantitative PCR (Q-PCR), reverse transcription PCR, and real-time PCR (including as a means of measuring the initial amounts of mRNA copies for each sequence in a sample). Such methods would utilize one or two primers that are complementary to portions of a nucleic acid, where the primers are used to prime nucleic acid synthesis. The newly synthesized nucleic acids are optionally labeled and may be detected directly or by hybridization to a polynucleotide of the invention.
The newly synthesized nucleic acids may be contacted with polynucleotides (containing sequences) under conditions which allow for their hybridization. Additional methods to detect the expression of expressed nucleic acids include RNase protection assays, including liquid phase hybridizations, and in situ hybridization of cells. As would be understood by the skilled person, detection of expression of nucleic acids may be performed by the detection of expression of any appropriate portion or fragment of these nucleic acids, or the entire nucleic acids. Preferably, the portions are sufficiently large to contain unique sequences relative to other sequences expressed in a sample. Moreover, the skilled person would recognize that either strand of a nucleic acid may be detected as an indicator of expression of the nucleic acid. This follows because the nucleic acids are expressed as RNA molecules in cells, which may be converted to cDNA molecules for ease of manipulation and detection. The resultant cDNA molecules may have the sequences of the expressed RNA as well as those of the complementary strand thereto. Thus either the RNA sequence strand or the complementary strand may be detected. Of course, it is also possible to detect the expressed RNA without conversion to cDNA.

In an embodiment, the method comprises performing a reverse transcription of mRNA molecules present in a sample; and amplifying the target cDNA and the one or more control cDNAs using primers hybridizing to the cDNAs. The invention is readily practiced with the use of cell-containing samples, although any protein and/or nucleic acid containing sample which may be assayed for gene expression levels may be used in the practice of the invention. Non-limiting examples of samples for use with the invention include a clinical sample, such as, but not limited to, a fixed sample, a fresh sample, or a frozen sample. The sample may be an aspirate, a cytological sample (including blood or other bodily fluid), or a tissue specimen (e.g., a biopsy), which includes at least some information regarding the in situ context of cells in the specimen, so long as appropriate cells or nucleic acids are available for determination of gene expression levels. Samples may be processed prior to analysis as long as the ability to detect the markers of the present invention is preserved. Sample processing may include preservation and storage, as well as treating the samples to physically disrupt tissue or cell structure, thus releasing intracellular components into a solution which may further contain buffers, salts, detergents, and the like, which are used to prepare the sample for analysis. Cells may be isolated from a fluid sample, such as with centrifugation, filtration or sedimentation. Body fluids such as urine and blood may require the addition of one or more stabilizing agents, such as when further testing is to be performed hours or days after sample collection. Further processing of the sample may require one or more storage or preservation steps to be reversed, such as the removal of stabilizing and preserving agents.
Tissue samples may be homogenized or otherwise prepared for analysis by well-known techniques including but not limited to: sonication; mechanical disruption; chemical lysis such as detergent lysis; and combinations thereof. Samples may also be physically divided; exposed to a chemical reaction such as a deparaffinization and/or a precipitation procedure; exposed to a separation process such as separation in a centrifuge; exposed to a washing procedure; preserved; fixed; frozen; or the like. Samples, such as tissue, may be frozen, dehydrated, or preserved with a chemical agent such as formalin. Fixed tissue samples may be embedded in paraffin, which eases storage and transportation and facilitates the creation of slides used by a pathologist to visually inspect and assess the sample, or frozen in a medium such as RNALater® or Trizol®. Tissue section preparation for surgical pathology may be frozen and prepared using standard techniques. Immunohistochemistry and in situ hybridization binding assays on tissue sections can be performed on fixed cells. In accordance with the present invention, RNA may be extracted from a biological sample in a number of ways, e.g., using an organic extraction or a solid surface target capture method.

In an embodiment, a sample of the invention may be one that is obtained from a "healthy" subject, for example from a healthy/normal tissue or organ (i.e. comprising "normal" cells). In an embodiment, a sample of the invention may be one that is obtained from an "ill" subject, for example from a tissue or organ comprising "abnormal" cells (e.g., cells affected by a disease). In an embodiment, a sample of the invention may be one that is suspected or known to contain tumor cells. Alternatively, a sample of the invention may be a "tumor sample" or "tumor containing sample" or "tumor cell containing sample" of tissue or fluid isolated from an individual suspected of being afflicted with, or at risk of developing, cancer. In an embodiment, the tumor or cancer sample is a leukemia cell sample, e.g., acute lymphoblastic leukemia (ALL), chronic lymphocytic leukemia (CLL), acute myelogenous leukemia (AML) or chronic myelogenous leukemia (CML), a breast cancer cell sample, a colon cancer cell sample (e.g., colon adenocarcinoma), a renal/kidney cancer cell sample (e.g., clear cell kidney cancer), or a lung cancer cell sample (e.g., lung adenocarcinoma).

In an embodiment, the method comprises the following steps: a) isolating the control gene RNA and the test gene RNA from a first sample; b) contacting, under suitable hybridization conditions, the control RNA and the test gene RNA with detectably-labelled oligonucleotides that specifically bind to said control RNA and said test gene RNA, thereby obtaining a control gene hybridization signal and a test gene hybridization signal; c) quantitatively detecting the control gene and test gene hybridization signals; and d) normalizing said test gene hybridization signal based on said control gene hybridization signal, thereby obtaining a normalized test gene hybridization signal.
In an embodiment, the method comprises the following steps: a) isolating the control gene RNA and the test gene RNA from a first sample; b) preparing a control gene cDNA and a test gene cDNA using said control gene RNA and the test gene RNA; c) contacting, under suitable hybridization conditions, the control gene cDNA and the test gene cDNA with detectably-labelled oligonucleotides that specifically bind to said control cDNA and said test gene cDNA, thereby obtaining a control gene hybridization signal and a test gene hybridization signal; d) quantitatively detecting the control gene and test gene hybridization signals; and e) normalizing said test gene hybridization signal based on said control gene hybridization signal, thereby obtaining a normalized test gene hybridization signal. In an embodiment, the above method further comprises the following step b1): performing an amplification reaction on said control gene cDNA and test gene cDNA prior to said contacting.

In an embodiment, the method further comprises performing the above-mentioned steps a) to d) or a) to e) on a second sample, and comparing the normalized test gene hybridization signals in said first and second samples, wherein a higher normalized test gene hybridization signal in said first sample is indicative that said test gene is present at higher levels in said first sample, a lower normalized test gene hybridization signal in said first sample is indicative that said test gene is present at lower levels in said first sample, and similar normalized test gene hybridization signals in said first and second samples are indicative that said test gene is present at similar levels in said first and second samples.

In the studies described herein, the present inventors have shown that the use of whole Transcriptome Shotgun Sequencing (RNA-seq) permits the identification of the most stable genes across multiple samples for use as endogenous control ("housekeeping") genes. Accordingly, in another aspect, the present invention provides a method for identifying a gene useful for normalizing the expression of a test gene across a plurality of samples (a housekeeping gene), comprising a) performing whole Transcriptome Shotgun Sequencing (RNA-seq) on said plurality of samples; b) comparing the level of expression of the genes of the transcriptome across the plurality of samples; and c) identifying the gene(s) exhibiting a coefficient of variation (CV) of about 25% or less and a maximum fold-change (MFC) of about 10 or less across the plurality of samples.

RNA-seq uses recently developed deep-sequencing technologies. In general, a population of RNA (total or fractionated, such as poly(A)+) is converted to a library of cDNA fragments with adaptors attached to one or both ends. Each molecule, with or without amplification, is then sequenced in a high-throughput manner to obtain short sequences from one end (single-end sequencing) or both ends (paired-end sequencing). The reads are typically 30-400 bp, depending on the DNA-sequencing technology used. RNA-seq can be done with a variety of platforms including the Illumina Genome Analyzer® platform, the ABI Solid Sequencing® platform, the Roche/Life Science's 454 Sequencing® platform, or the Helicos Biosciences tSMS® platform, for example. Following sequencing, the resulting reads are either aligned to a reference genome or reference transcripts, or assembled de novo without the genomic sequence to produce a genome-scale transcription map that consists of both the transcriptional structure and/or level of expression for each gene. Expression values typically used for RNA-seq are normalized for gene length and the total number of reads for each sample (Reads Per Kilobase of transcript per Million mapped reads: RPKM), allowing for comparison between data sets. The coefficient of variation (CV) represents the standard deviation of the RPKM divided by the mean RPKM (for the RPKM values obtained from the plurality of samples), and the maximum fold-change (MFC) represents the maximum RPKM divided by the minimum RPKM (for the RPKM values obtained from the plurality of samples). In embodiments, the above-mentioned CV is of about 24%, 23%, 22%, 21%, 20%, 19%, 18%, 17%, 16%, 15%, 14%, 13%, 12%, 11%, 10%, 9%, 8%, 7%, 6% or 5%, or less. In embodiments, the above-mentioned MFC is of about 9, 8, 7, 6, 5, 4, 3, 2, 1, or less.
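The RPKM value referred to above follows directly from its name: reads mapped to a transcript, per kilobase of transcript, per million mapped reads in the sample. A minimal sketch is given below; the function name and the read counts are hypothetical, and real pipelines may differ in how reads and transcript lengths are counted.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    kilobases = transcript_length_bp / 1_000
    millions_of_reads = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions_of_reads)


# Hypothetical gene: 500 reads on a 2,000 bp transcript in a sample
# with 10 million mapped reads.
value = rpkm(500, 2_000, 10_000_000)
print(value)  # 25.0
```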
In an embodiment, the plurality of samples are samples from several subjects suffering from the same condition/disease (e.g., tumor samples from subjects suffering from the same cancer), and/or subjects suffering from different conditions/diseases (e.g., tumor samples from subjects suffering from different types of cancer, or samples from subjects suffering from unrelated diseases/conditions), and/or samples from normal subjects. In an embodiment, the plurality of samples is from the same organs or tissues (e.g., liver biopsy samples). In an embodiment, the plurality of samples is from different organs or tissues (e.g., blood samples vs. tissue biopsies).

MODE(S) FOR CARRYING OUT THE INVENTION

The present invention is illustrated in further detail by the following non-limiting examples.

Example 1: Materials and methods

Patient samples. The leukemia samples used in the Leucegene data set were collected by the Quebec Leukemia Cell Bank as described [14]. Human cord blood samples were collected from healthy volunteers by Hema-Quebec.

RNA-seq. RNA-seq was performed as previously described [14].

qRT-PCR. Total RNA was isolated from leukemic and CD34+ cord blood cells using Trizol® solution, according to the manufacturer's protocol (Invitrogen®/Life Technologies®, Burlington, ON, Canada). Human CD34+ cord blood cells were isolated from total cord blood using the RosetteSep® Cord Blood CD34 Pre-enrichment kit, followed by the EasySep® Human Cord Blood CD34+ Selection kit, according to manufacturer's guidelines (STEMCELL® Technologies, Vancouver, BC, Canada), yielding 70-86% CD34+. CD34+ cord blood samples from five different individuals were immediately used for reverse transcription. Moreover, CD34+ cord blood samples from twelve additional individuals were sorted using a FACS Aria® cell sorter (Becton-Dickinson, San Jose, CA, USA) to keep only CD34_APC7CD45RA_PE cells (Antibodies: Becton-Dickinson, San Jose, CA, USA) before proceeding with reverse transcription. Reverse transcription of total RNA was performed using MMLV reverse transcriptase and random hexamers according to manufacturer's guidelines (Invitrogen®/Life Technologies®, Burlington, ON, Canada). Expression assays were performed to measure gene expression levels using 2X Fast Master Mix® (Applied Biosystems®/Life Technologies®, Burlington, ON, Canada), standard primers (Invitrogen®/Life Technologies®, Burlington, ON, Canada) and a specific probe from the Universal Probe Library® (Roche Diagnostics®, Laval, QC, Canada). qRT-PCR reactions were done on the ABI 7900HT® Fast Real-Time PCR System (Applied Biosystems®/Life Technologies®, Burlington, ON, Canada).
For RQ (relative quantification) calculations, from a given test sample, the Ct (threshold cycle) values for each gene were normalized to the control gene (dCt = Ct Target - Ct Control) and compared to the mean dCt from the CD34+ cord blood sample (calibrator) using the ddCt method (ddCt = dCt Sample - dCt Calibrator; RQ = 2^(-ddCt)). qRT-PCR cycling conditions were as follows: 2 minutes at 50°C and 10 minutes at 95°C, followed by 40 cycles of 15 seconds at 95°C and 1 minute at 59°C.
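The RQ calculation described above can be sketched as follows. The formulas (dCt, ddCt and RQ = 2^(-ddCt)) are those given in the text; the function names and the Ct values are hypothetical. The base of 2 implicitly treats the amount of product as doubling at each cycle.

```python
from statistics import mean


def delta_ct(ct_target, ct_control):
    """dCt = Ct Target - Ct Control, computed within one sample."""
    return ct_target - ct_control


def relative_quantification(dct_sample, dct_calibrator):
    """RQ = 2^(-ddCt), with ddCt = dCt Sample - dCt Calibrator."""
    ddct = dct_sample - dct_calibrator
    return 2.0 ** (-ddct)


# Hypothetical calibrator: mean dCt over three CD34+ cord blood samples.
calibrator_dcts = [
    delta_ct(25.0, 20.0),
    delta_ct(26.0, 21.0),
    delta_ct(24.5, 19.5),
]
dct_calibrator = mean(calibrator_dcts)

# Hypothetical test sample in which the target crosses the threshold
# one cycle earlier, relative to the control gene, than in the calibrator.
dct_sample = delta_ct(24.0, 20.0)
rq = relative_quantification(dct_sample, dct_calibrator)
print(rq)  # 2.0
```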

Example 2: Variability of commonly used control genes in RNA-seq data

For the present studies, RNA-seq data obtained in the Leucegene project, acquired from a panel of 55 leukemia patient samples (43 AML, 12 ALL) from the Quebec Leukemia Cell Bank (BCLQ), were used. Furthermore, RNA-seq data from various cancers and associated normal tissues, including AML, breast, lung, colon and kidney (all publicly available from The Cancer Genome Atlas (TCGA)), were also analyzed. The combined TCGA data set represents data from a total of 1933 patients (207 normal tissue and 1726 cancer tissue samples), as shown in Table 3.

Table 3: RNA-seq data sets analysed in this study

* n=43 AML; n=12 ALL
† n=5 CD34+; n=12 CD34+CD45RA-
RNA-seq for Leucegene data was performed by Illumina HiSeq2000®
RNA-seq for TCGA data was performed by Illumina HiSeq2000® or Illumina GA®

To assess gene expression stability, the variability in Reads Per Kilobase of transcript per Million mapped reads (RPKM) values between different patient samples across a given RNA-seq data set was examined. This was achieved by calculating the coefficient of variation (CV) and the maximum fold change (MFC) for each gene across multiple samples within each data set, where CV represents the standard deviation divided by the mean RPKM, and MFC represents the maximum RPKM divided by the minimum RPKM value. The expression stability of 19 commonly used control genes in the Leucegene and the combined TCGA data sets was first analyzed. Standard control genes were ranked from lowest to highest CV (Table 4a). Using this approach, it was found that the most stable commonly used control gene, in both data sets, was TATA Binding Protein (TBP), yielding a CV equal to 22.8% or 44.9% and an MFC equal to 2.5 or 12.2, in the Leucegene or combined TCGA data sets, respectively. Abelson (ABL1), a control gene commonly used for leukemia samples, yielded a slightly lower CV in the combined TCGA data set (39.8%), but had a high MFC (26.9). The majority of commonly used control genes exhibited variability, with CV values ranging from 27.2 to 69.1% in Leucegene (median CV = 42.6%), and 47.0 to 116.2% in the combined TCGA data (median CV = 61.4%). It was noted that the variability of the genes was higher in the combined TCGA data, which represents a more diverse collection of samples from five different cancer types and three different normal tissue types. This higher degree of variation in the combined TCGA data was more obvious in the MFC values, which are more greatly affected by extreme differences of expression in individual samples. MFC values ranged from 2.5 to 31.7 fold in Leucegene (median = 8.3), and 12.2 to 639.5 fold in the combined TCGA data (median = 84.0).
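The two stability metrics defined above can be computed for a single gene as in the following minimal sketch. The function names and RPKM values are hypothetical. The text does not state whether the sample or the population standard deviation was used; the sample standard deviation (statistics.stdev) is assumed here.

```python
from statistics import mean, stdev


def coefficient_of_variation(rpkm_values):
    """CV (%) = standard deviation / mean RPKM * 100.

    Assumption: sample standard deviation; needs at least two samples.
    """
    return 100.0 * stdev(rpkm_values) / mean(rpkm_values)


def maximum_fold_change(rpkm_values):
    """MFC = maximum RPKM / minimum RPKM across the samples."""
    lowest = min(rpkm_values)
    if lowest <= 0:
        return float("inf")  # undefined when a sample has no expression
    return max(rpkm_values) / lowest


# Hypothetical RPKM values of one gene in three samples.
rpkm_values = [8.0, 10.0, 12.0]
cv = coefficient_of_variation(rpkm_values)
mfc = maximum_fold_change(rpkm_values)
print(cv, mfc)  # 20.0 1.5
```

Because MFC depends only on the two most extreme samples, a single outlier can inflate it sharply while leaving CV almost unchanged, which is consistent with the wider MFC ranges reported for the combined TCGA data.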

Table 4a. Variability of most commonly used control genes in Leucegene and combined TCGA RNA-seq data sets.

                   Leucegene                 TCGA Combined
rank  gene         mean  CV (%)     MFC      mean  CV (%)     MFC
1     TBP           8.1    22.8     2.5       6.7    44.9    12.2
2     YWHAZ       144.6    27.2     3.2     284.9    70.0    55.1
3     PGK1        189.9    28.4     3.4     212.6    62.0    31.0
4     LDHA        144.7    34.2    10.2     401.7    66.6    42.4
5     ALDOA       244.0    35.5     3.6     736.7    60.3   105.0
6     HPRT1        30.7    40.0     6.7      23.1    56.5   304.5
7     ABL1         17.0    40.1     5.7      13.9    39.8    26.9
8     SDHA         31.1    40.7    12.2      52.9    61.8    74.2
9     UBC         499.1    41.3     5.2    1260.8    47.0   102.0
10    GAPDH      2206.7    42.6     8.3    1954.8    70.7    60.7
11    ACTB       1617.9    48.7     5.4    2069.5    47.4    45.2
12    G6PD         43.5    52.6     6.9      23.9   106.7   639.5
13    VIM        1700.4    53.4    17.0     824.2    90.0   192.0
14    TUBA1A      251.0    53.8     8.4     148.7    55.6    64.6
15    PFKP         56.4    55.3    13.3      52.9   116.2   521.0
16    B2M        1798.6    55.5    13.9    2506.3    61.4    91.9
17    GUSB         45.6    55.9    10.6      44.7    61.2    84.0
18    PGAM1        12.9    65.5    14.4     125.4    60.1    95.4
19    HMBS         18.2    69.1    31.7      11.4    80.5   202.8

Mean relates to RPKM values within each data set. CV indicates the coefficient of variation and equals the standard deviation divided by the mean RPKM, expressed as a percentage. MFC (maximum fold change) represents the maximum divided by the minimum RPKM value of the data set. Rank is based on lowest to highest CV.

The expression stability of 12 candidate control genes identified by de Jonge et al. [7] as being the most stably expressed genes in a collection of microarray experiments was examined. This gene list consists of 10 ribosomal protein coding genes, as well as SRP14 and OAZ1 (Table 4b). Using the above approach, it was found that the candidates identified from microarray data showed variability similar to that of the standard housekeeping genes, with a median CV equal to 48.5% or 51.6% and a median MFC equal to 8.3 or 44.5, in the Leucegene or combined TCGA data sets, respectively. The most stable gene from this list was Signal Recognition Particle 14kDa (SRP14). Of note, while these genes presented similar variability in the Leucegene data set as compared to the commonly used control genes, they did prove to be slightly less variable in the combined TCGA data set. However, there was still significant variability within the TCGA data, which showed %CV values up to 82.0 for RPS16, and MFC values up to 1208.3 for RPL9.

Table 4b. Variability of genes identified as stable in microarray experiments, in Leucegene and combined TCGA RNA-seq data sets.

                   Leucegene                 TCGA
rank  gene         mean  CV (%)     MFC      mean  CV (%)     MFC
1     SRP14       132.2    24.9     3.2     145.6    31.8    10.9
2     RPL4       1276.1    40.4     5.6     734.0    44.3    51.0
3     RPL6        324.0    43.1     6.5     565.4    48.9    78.6
4     OAZ1        421.9    44.4     4.5     273.7    42.5    18.5
5     RPL22       156.6    45.4     9.0     192.4    39.2    25.8
6     RPL24       798.9    48.1     7.5     778.6    54.3    36.5
7     RPL27      1292.6    48.8    10.5     682.5    60.4    38.1
8     RPS13       935.8    55.0     8.8     662.8    47.7    29.3
9     RPS20       636.3    55.1     8.0     667.0    58.6    52.9
10    RPS29       559.0    56.2     8.7     490.9    65.8   100.3
11    RPS16      1104.4    61.6     9.2     794.2    82.0   192.1
12    RPL9         99.7   124.3    72.6    1007.0    66.3  1208.3

Genes identified by de Jonge et al. [7]. Mean relates to RPKM values within each data set. CV indicates the coefficient of variation and equals the standard deviation divided by the mean RPKM, expressed as a percentage. MFC (maximum fold change) represents the maximum divided by the minimum RPKM value of the data set. Rank is based on lowest to highest CV.

Example 3: Selection of improved control genes from Leucegene RNA-seq data

In order to identify improved control genes with the most stable expression, cut-offs for %CV and MFC that were lower than the values obtained for the majority of commonly used control genes were established. Within the Leucegene data set, the entire transcriptome of 21,892 genes was analyzed, and those which had a %CV less than 25 and an MFC less than 5, for two different ranges of expression: mean RPKM greater than or less than 100 (but greater than 25), were selected. These genes were then ranked from lowest to highest %CV (Table 5). Using these criteria, 20 candidate control genes with mean RPKM levels greater than 100, and 99 candidate control genes with mean RPKM levels less than 100 (Table 5 contains the best 20 genes; the full list is available in Table 6), were identified. The full list of 119 genes with their descriptions is available in Table 7. Of these, 15 genes were selected for validation based on their high ranking in the Leucegene data, as well as having relatively stable expression in the various TCGA data sets (Table 7). The newly identified candidate control genes are: HNRNPK, PCBP2, SLC25A3, GNB1, HNRNPL, SRP14 (RPKM>100); and PSMD6, PSMA1, PSMF1, VPS4A, SF3B2, EIF4H, ZNF207, UBE2I (RPKM<100). EIF4H had slightly higher expression in the various TCGA data sets, and was therefore included in the panel of genes with higher expression for subsequent analyses.
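The selection criteria described above (%CV less than 25, MFC less than 5, mean RPKM greater than 25, split into two expression ranges at a mean RPKM of 100) can be sketched as a simple filter. This is an illustrative sketch only: the function name, gene labels and RPKM values are hypothetical, the sample standard deviation is assumed, and a mean RPKM of exactly 100 is arbitrarily placed in the lower range because the text does not specify it.

```python
from statistics import mean, stdev


def select_candidates(rpkm_by_gene, max_cv=25.0, max_mfc=5.0,
                      min_mean=25.0, tier_cutoff=100.0):
    """Return (high, low) lists of gene names ranked by increasing %CV.

    rpkm_by_gene maps a gene name to its RPKM values across samples
    (at least two samples per gene).
    """
    high, low = [], []
    for gene, values in rpkm_by_gene.items():
        avg = mean(values)
        if avg <= min_mean:
            continue  # expression too low to be considered
        cv_percent = 100.0 * stdev(values) / avg
        lowest = min(values)
        mfc = max(values) / lowest if lowest > 0 else float("inf")
        if cv_percent >= max_cv or mfc >= max_mfc:
            continue  # not stable enough
        tier = high if avg > tier_cutoff else low
        tier.append((cv_percent, gene))
    return [g for _, g in sorted(high)], [g for _, g in sorted(low)]


# Hypothetical RPKM values for four genes across four samples.
toy_data = {
    "GENE_A": [100.0, 110.0, 120.0, 130.0],  # stable, mean above 100
    "GENE_B": [30.0, 33.0, 36.0, 33.0],      # stable, mean between 25 and 100
    "GENE_C": [10.0, 200.0, 50.0, 90.0],     # highly variable
    "GENE_D": [5.0, 5.0, 5.5, 6.0],          # stable but mean below 25
}
high, low = select_candidates(toy_data)
print(high, low)
```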

Table 5. Selection of candidate control genes based on Leucegene RNA-seq data

Mean relates to RPKM values within each data set. CV indicates the coefficient of variation and equals the standard deviation divided by the mean RPKM, expressed as a percentage. MFC (mean fold change) represents the maximum divided by minimum RPKM value of the data set. Rank is based on lowest to highest CV. Criteria for gene selection were CV < 25%, MFC < 5 in Leucegene AML_ALL data. All genes fitting criteria for expression > 100 RPKM are shown; the expression < 100 RPKM table contains the 18 genes with the lowest CV in Leucegene AML_ALL data, as well as two other selected candidates (full list of 99 genes available in Table 6). Genes listed in bold were selected for validation studies.

Table 6: Full list of candidate control genes, < 100 RPKM, Leucegene data

rank  gene        mean    CV (%)  MFC
55    OCIAD1      33.5    23.0    2.7
56    RNF7        27.1    23.0    3.3
57    SERP1       61.0    23.0    3.8
58    U2AF1       89.5    23.1    3.8
59    HDAC3       26.4    23.1    3.0
60    SF3A1       41.9    23.2    3.5
61    MARS        40.6    23.3    3.9
62    SRSF9       91.9    23.4    3.0
63    MRPL9       35.4    23.5    3.3
64    TMEM50A     40.5    23.5    3.3
65    PAPOLA      43.1    23.6    3.3
66    DOCK2       58.9    23.7    3.3
67    PTPRA       27.4    23.8    3.8
68    MAPRE1      78.9    23.9    2.8
69    E2F4        37.5    23.9    3.0
70    DCAF8       26.9    23.9    3.6
71    GTF2F1      40.4    24.0    2.7
72    UBE2Z       27.7    24.0    3.3
73    ADAR        49.3    24.1    3.4
74    RAB7A       78.4    24.1    3.4
75    TH1L        32.7    24.1    3.3
76    COPB1       34.4    24.1    3.3
77    KARS        64.1    24.2    3.1
78    UBE2I       56.1    24.2    4.1
79    SON         56.2    24.3    3.8
80    UBQLN1      39.5    24.4    4.8
81    GABARAPL2   41.2    24.4    4.6
82    ATXN2L      52.9    24.4    3.3
83    USP39       29.0    24.5    3.5
84    SUMO3       70.2    24.5    3.1
85    DNAJC7      33.7    24.5    3.1
86    TCEB1       31.0    24.5    3.3
87    MTA2        45.1    24.5    2.5
88    MMADHC      46.8    24.6    2.8
89    POLR2C      30.3    24.6    2.9
90    GORASP2     27.3    24.6    3.8
91    CMTM3       31.6    24.6    2.9
92    COPS5       25.7    24.7    4.0
93    PSME3       29.9    24.7    4.6
94    AUP1        89.7    24.7    3.3
95    CMPK1       37.9    24.7    3.3
96    TRIP12      28.0    24.8    3.5
97    ACP1        38.9    24.9    3.3
98    HNRNPH3     68.1    24.9    3.3
99    TMED2       71.2    24.9    3.4

Genes listed in bold were selected for validation studies.

Table 7: Full list of the 119 genes identified herein (having a %CV less than 25 and a MFC less than 5) with their descriptions

Gene  Description
ABI1  abl-interactor 1; May act in negative regulation of cell growth and transformation by interacting with nonreceptor tyrosine kinases ABL1 and/or ABL2. May play a role in regulation of EGF-induced Erk pathway activation. Involved in cytoskeletal reorganization and EGFR signaling. Together with EPS8 participates in transduction of signals from Ras to Rac. In vitro, a trimeric complex of ABI1, EPS8 and SOS1 exhibits Rac specific guanine nucleotide exchange factor (GEF) activity and ABI1 seems to act as an adapter in the complex. Regulates ABL1/c-Abl-mediated phosphorylation of MENA. Recrui [...]
ACIN1  apoptotic chromatin condensation inducer 1; Component of a splicing-dependent multiprotein exon junction complex (EJC) deposited at splice junction on mRNAs. The EJC is a dynamic structure consisting of a few core and several more peripheral nuclear and cytoplasmic associated factors that join the complex only transiently either during EJC assembly or during subsequent mRNA metabolism. Induces apoptotic chromatin condensation after activation by CASP3
ACP1  acid phosphatase 1, soluble; Acts on tyrosine phosphorylated proteins, low-MW aryl phosphates and natural and synthetic acyl phosphates. Isoform 3 does not possess phosphatase activity
ADAR  adenosine deaminase, RNA-specific; Converts multiple adenosines to inosines and creates I/U mismatched base pairs in double-helical RNA substrates without apparent sequence specificity. Has been found to modify more frequently adenosines in AU-rich regions, probably due to the relative ease of melting A/U base pairs as compared to G/C pairs. Functions to modify viral RNA genomes and may be responsible for hypermutation of certain negative-stranded viruses. Edits the messenger for glutamate receptor (GLUR) subunits by site-selective adenosine deamination. Produces low-level editin [...]
ADD1  adducin 1 (alpha); Membrane-cytoskeleton-associated protein that promotes the assembly of the spectrin-actin network. Binds to calmodulin
ANAPC5  anaphase promoting complex subunit 5; Component of the anaphase promoting complex/cyclosome (APC/C), a cell cycle-regulated E3 ubiquitin ligase that controls progression through mitosis and the G1 phase of the cell cycle. The APC/C complex acts by mediating ubiquitination and subsequent degradation of target proteins: it mainly mediates the formation of 'Lys-11'-linked polyubiquitin chains and, to a lower extent, the formation of 'Lys-48'- and 'Lys-63'-linked polyubiquitin chains
ARF1  ADP-ribosylation factor 1; GTP-binding protein that functions as an allosteric activator of the cholera toxin catalytic subunit, an ADP-ribosyltransferase. Involved in protein trafficking among different compartments. Modulates vesicle budding and uncoating within the Golgi complex. Deactivation induces the redistribution of the entire Golgi complex to the endoplasmic reticulum, suggesting a crucial role in protein trafficking. In its GTP-bound form, its triggers the association with coat proteins with the Golgi membrane. The hydrolysis of ARF1-bound GTP, which is mediated by ARFGAPs [...]
ATP5B  ATP synthase, H+ transporting, mitochondrial F1 complex, beta polypeptide; Mitochondrial membrane ATP synthase (F(1)F(0) ATP synthase or Complex V) produces ATP from ADP in the presence of a proton gradient across the membrane which is generated by electron transport complexes of the respiratory chain. F-type ATPases consist of two structural domains, F(1) - containing the extramembraneous catalytic core, and F(0) - containing the membrane proton channel, linked together by a central stalk and a peripheral stalk. During catalysis, ATP synthesis in the catalytic domain of F(1) is couple [...]
ATP6V1G1  ATPase, H+ transporting, lysosomal 13kDa, V1 subunit G1; Catalytic subunit of the peripheral V1 complex of vacuolar ATPase (V-ATPase). V-ATPase is responsible for acidifying a variety of intracellular compartments in eukaryotic cells
ATXN2L  ataxin 2-like
AUP1  ancient ubiquitous protein 1; May play a role in the translocation of terminally misfolded proteins from the endoplasmic reticulum lumen to the cytoplasm and their degradation by the proteasome
C1orf144  UPF0485 protein C1orf144 (Putative MAPK-activating protein PM18/PM20/PM22)
C20orf43  UPF0549 protein C20orf43
C6orf62  Uncharacterized protein C6orf62 (HBV X-transactivated gene 12 protein)
CAPRIN1  cell cycle associated protein 1; May regulate the transport and translation of mRNAs of proteins involved in synaptic plasticity in neurons and cell proliferation and migration in multiple cell types (By similarity)
CASC3  cancer susceptibility candidate 3; Component of a splicing-dependent multiprotein exon junction complex (EJC) deposited at splice junction on mRNAs. The EJC is a dynamic structure consisting of a few core proteins and several more peripheral nuclear and cytoplasmic associated factors that join the complex only transiently either during EJC assembly or during subsequent mRNA metabolism. Core components of the EJC, that remains bound to spliced mRNAs throughout all stages of mRNA metabolism, functions to mark the position of the exon-exon junction in the mature mRNA and thereby influence [...]
CCNI  cyclin I
CDC37  microRNA 1181; Co-chaperone that binds to numerous kinases and promotes their interaction with the Hsp90 complex, resulting in stabilization and promotion of their activity
CDV3  CDV3 homolog (mouse)
CMPK1  cytidine monophosphate (UMP-CMP) kinase 1, cytosolic; Catalyzes specific phosphoryl transfer from ATP to UMP and CMP
CMTM3  CKLF-like MARVEL transmembrane domain containing 3
COPB1  coatomer protein complex, subunit beta 1; The coatomer is a cytosolic protein complex that binds to dilysine motifs and reversibly associates with Golgi non-clathrin-coated vesicles, which further mediate biosynthetic protein transport from the ER, via the Golgi up to the trans Golgi network. Coatomer complex is required for budding from Golgi membranes, and is essential for the retrograde Golgi-to-ER transport of dilysine-tagged proteins. In mammals, the coatomer can only be recruited by membranes associated to ADP-ribosylation factors (ARFs), which are small GTP-binding proteins; th [...]
COPS5  COP9 constitutive photomorphogenic homolog subunit 5 (Arabidopsis); Probable subunit of the COP9 signalosome complex (CSN), a complex involved in various cellular and developmental processes. The CSN complex is an essential regulator of the ubiquitin (Ubl) conjugation pathway by mediating the deneddylation of the cullin subunits of the SCF-type E3 ligase complexes, leading to decrease the Ubl ligase activity of SCF-type complexes such as SCF, CSA or DDB2. The complex is also involved in phosphorylation of p53/TP53, c-jun/JUN, IkappaBalpha/NFKBIA, ITPK1 and ICSBP, possibly via [...]
CS  citrate synthase
CSDE1  cold shock domain containing E1, RNA-binding; RNA-binding protein. Required for internal initiation of translation of human rhinovirus RNA. May be involved in translationally coupled mRNA turnover. Implicated with other RNA-binding proteins in the cytoplasmic deadenylation/translational and decay interplay of the FOS mRNA mediated by the major coding-region determinant of instability (mCRD) domain
DAP3  death associated protein 3; Involved in mediating interferon-gamma-induced cell death
DCAF8  DDB1 and CUL4 associated factor 8
DDX5  DEAD (Asp-Glu-Ala-Asp) box polypeptide 5; RNA-dependent ATPase activity. The rate of ATP hydrolysis is highly stimulated by single-stranded RNA. May be involved in pre-mRNA splicing
DLST  dihydrolipoamide S-succinyltransferase (E2 component of 2-oxo-glutarate complex); The 2-oxoglutarate dehydrogenase complex catalyzes the overall conversion of 2-oxoglutarate to succinyl-CoA and CO(2). It contains multiple copies of 3 enzymatic components: 2-oxoglutarate dehydrogenase (E1), dihydrolipoamide succinyltransferase (E2) and lipoamide dehydrogenase (E3)
DNAJC7  DnaJ (Hsp40) homolog, subfamily C, member 7
E2F4  E2F transcription factor 4, p107/p130-binding; Transcription activator that binds DNA cooperatively with DP proteins through the E2 recognition site, 5'-TTTC[CG]CGC-3' found in the promoter region of a number of genes whose products are involved in cell cycle regulation or in DNA replication. The DRTF1/E2F complex functions in the control of cell-cycle progression from G1 to S phase. E2F-4 binds with high affinity to RBL1 and RBL2. In some instances, can also bind RB protein
EIF3I  eukaryotic translation initiation factor 3, subunit I; Component of the eukaryotic translation initiation factor 3 (eIF-3) complex, which is required for several steps in the initiation of protein synthesis. The eIF-3 complex associates with the 40S ribosome and facilitates the recruitment of eIF-1, eIF-1A, eIF-2:GTP:methionyl-tRNAi and eIF-5 to form the 43S preinitiation complex (43S PIC). The eIF-3 complex stimulates mRNA recruitment to the 43S PIC and scanning of the mRNA for AUG recognition. The eIF-3 complex is also required for disassembly and recycling of posttermination ribosom [...]
EIF4H  eukaryotic translation initiation factor 4H; Stimulates the RNA helicase activity of EIF4A in the translation initiation complex. Binds weakly mRNA
EWSR1  Ewing sarcoma breakpoint region 1; Might normally function as a repressor. EWS-fusion-proteins (EFPS) may play a role in the tumorigenic process. They may disturb gene expression by mimicking, or interfering with the normal function of CTD-POLII within the transcription initiation complex. They may also contribute to an aberrant activation of the fusion protein target genes
FAM32A  family with sequence similarity 32, member A
GABARAPL2  GABA(A) receptor-associated protein-like 2; Involved in intra-Golgi traffic. Modulates intra-Golgi transport through coupling between NSF activity and SNAREs activation. It first stimulates the ATPase activity of NSF which in turn stimulates the association with GOSR1 (By similarity)
GNB1  guanine nucleotide binding protein (G protein), beta polypeptide 1; Guanine nucleotide-binding proteins (G proteins) are involved as a modulator or transducer in various transmembrane signaling systems. The beta and gamma chains are required for the GTPase activity, for replacement of GDP by GTP, and for G protein-effector interaction
GORASP2  golgi reassembly stacking protein 2, 55kDa; May be involved in assembly and membrane stacking of the Golgi cisternae, and in the process by which Golgi stacks reform after mitotic breakdown. May regulate the intracellular transport and presentation of a defined set of transmembrane proteins, such as transmembrane TGFA
GTF2F1  general transcription factor IIF, polypeptide 1, 74kDa; TFIIF is a general transcription initiation factor that binds to RNA polymerase II and helps to recruit it to the initiation complex in collaboration with TFIIB. It promotes transcription elongation
HDAC3  histone deacetylase 3; Responsible for the deacetylation of lysine residues on the N-terminal part of the core histones (H2A, H2B, H3 and H4). Histone deacetylation gives a tag for epigenetic repression and plays an important role in transcriptional regulation, cell cycle progression and developmental events. Histone deacetylases act via the formation of large multiprotein complexes. Probably participates in the regulation of transcription through its binding to the zinc-finger transcription factor YY1; increases YY1 repression activity. Required to repress transcription of the POU1F1 [...]
HNRNPA2B1  heterogeneous nuclear ribonucleoprotein A2/B1; Involved with pre-mRNA processing. Forms complexes (ribonucleosomes) with at least 20 other different hnRNP and heterogeneous nuclear RNA in the nucleous
HNRNPC  heterogeneous nuclear ribonucleoprotein C (C1/C2); Binds pre-mRNA and nucleates the assembly of 40S hnRNP particles. Single HNRNPC tetramers bind 230-240 nucleotides. Trimers of HNRNPC tetramers bind 700 nucleotides. May play a role in the early steps of spliceosome assembly and pre-mRNA splicing. Interacts with poly-U tracts in the 3'-UTR or 5'-UTR of mRNA and modulates the stability and the level of translation of bound mRNA molecules
HNRNPD  heterogeneous nuclear ribonucleoprotein D (AU-rich element RNA binding protein 1, 37kDa); Binds with high affinity to RNA molecules that contain AU-rich elements (AREs) found within the 3'-UTR of many proto-oncogenes and cytokine mRNAs. Also binds to double- and single-stranded DNA sequences in a specific manner and functions a transcription factor. Each of the RNA-binding domains specifically can bind solely to a single-stranded non-monotonous 5'-UUAG-3' sequence and also weaker to the single-stranded 5'-TTAGGG-3' telomeric DNA repeat. Binds RNA oligonucleotides with 5'-UUAGGG-3' re [...]
HNRNPH3  heterogeneous nuclear ribonucleoprotein H3 (2H9); Involved in the splicing process and participates in early heat shock-induced splicing arrest. Due to their great structural variations the different isoforms may possess different functions in the splicing reaction
HNRNPK  microRNA 7-1; One of the major pre-mRNA-binding proteins. Binds tenaciously to poly(C) sequences. Likely to play a role in the nuclear metabolism of hnRNAs, particularly for pre-mRNAs that contain cytidine-rich sequences. Can also bind poly(C) single-stranded DNA
HNRNPL  heterogeneous nuclear ribonucleoprotein L; This protein is a component of the heterogeneous nuclear ribonucleoprotein (hnRNP) complexes which provide the substrate for the processing events that pre-mRNAs undergo before becoming functional, translatable mRNAs in the cytoplasm. L is associated with most nascent transcripts including those of the landmark giant loops of amphibian lampbrush
HNRNPU  heterogeneous nuclear ribonucleoprotein U (scaffold attachment factor A); Component of the CRD-mediated complex that promotes MYC mRNA stabilization. Binds to pre-mRNA. Has high affinity for scaffold-attached region (SAR) DNA. Bind to double- and single-stranded DNA and RNA
HNRNPUL1  heterogeneous nuclear ribonucleoprotein U-like 1; Acts as a basic transcriptional regulator. Represses basic transcription driven by several virus and cellular promoters. When associated with BRD7, activates transcription of glucocorticoid-responsive promoter in the absence of ligand-stimulation. Plays also a role in mRNA processing and transport. Binds avidly to poly(G) and poly(C) RNA homopolymers in vitro
IDH3B  isocitrate dehydrogenase 3 (NAD+) beta
IK  Protein Red (Protein RER)(IK factor)(Cytokine IK); Not known. May bind to chromatin
KARS  lysyl-tRNA synthetase; Catalyzes the specific attachment of an amino acid to its cognate tRNA in a 2 step reaction: the amino acid (AA) is first activated by ATP to form AA-AMP and then transferred to the acceptor end of the tRNA. When secreted, acts as a signaling molecule that induces immune response through the activation of monocyte/macrophages. Catalyzes the synthesis of diadenosine oligophosphate (Ap4A), a signaling molecule involved in the activation of MITF transcriptional activity. Interacts with HIV-1 virus GAG protein, facilitating the selective packaging of tRNA(3)(Lys), th [...]
KHDRBS1  KH domain containing, RNA binding, signal transduction associated 1; Recruited and tyrosine phosphorylated by several receptor systems, for example the T-cell, leptin and insulin receptors. Once phosphorylated, functions as an adapter protein in signal transduction cascades by binding to SH2 and SH3 domain-containing proteins. Role in G2-M progression in the cell cycle. Represses CBP-dependent transcriptional activation apparently by competing with other nuclear factors for binding to CBP. Also acts as a putative regulator of mRNA stability and/or translation rates and mediates mRNA n [...]
KIAA0209  dedicator of cytokinesis 2; Involved in cytoskeletal rearrangements required for lymphocyte migration in response of chemokines. Activates RAC1 and RAC2 small GTPases, probably by functioning as a guanine nucleotide exchange factor (GEF), which exchanges bound GDP for free GTP. May also participate in IL2 transcriptional activation via the activation of RAC2

LSM14A  LSM14A, SCD6 homolog A (S. cerevisiae)
MAPRE1  microtubule-associated protein, RP/EB family, member 1; May be involved in microtubule polymerization, and spindle function by stabilizing microtubules and anchoring them at centrosomes. May play a role in cell migration
MARS  methionyl-tRNA synthetase
MLF2  myeloid leukemia factor 2
MMADHC  methylmalonic aciduria (cobalamin deficiency) cblD type, with homocystinuria; Involved in cobalamin metabolism
MORF4  mortality factor 4; Component of the NuA4 histone acetyltransferase (HAT) complex which is involved in transcriptional activation of select genes principally by acetylation of nucleosomal histones H4 and H2A. This modification may both alter nucleosome - DNA interactions and promote interaction of the modified histones with other proteins which positively regulate transcription. This complex may be required for the activation of transcriptional programs associated with oncogene and proto-oncogene mediated growth induction, tumor suppressor mediated growth arrest and replicative senesce [...]
MRFAP1  Mof4 family associated protein 1
MRPL9  mitochondrial ribosomal protein L9
MTA2  metastasis associated 1 family, member 2; May be involved in the regulation of gene expression as repressor and activator. The repression might be related to covalent modification of histone proteins
MYL12B  myosin, light chain 12B, regulatory; Myosin regulatory subunit that plays an important role in regulation of both smooth muscle and nonmuscle cell contractile activity via its phosphorylation. Phosphorylation triggers actin polymerization in vascular smooth muscle. Implicated in cytokinesis, receptor capping, and cell locomotion (By similarity)
NOL7  nucleolar protein 7, 27kDa
NRD1  nardilysin (N-arginine dibasic convertase); Cleaves peptide substrates on the N-terminus of arginine residues in dibasic pairs
OCIAD1  OCIA domain containing 1
PAPOLA  poly(A) polymerase alpha; Polymerase that creates the 3'-poly(A) tail of mRNA's. Also required for the endoribonucleolytic cleavage reaction at some polyadenylation sites. May acquire specificity through interaction with a cleavage and polyadenylation specificity factor (CPSF) at its C-terminus
PCBP2  poly(rC) binding protein 2; Major cellular poly(rC)-binding protein. Binds also poly(rU)
POLR2C  polymerase (RNA) II (DNA directed) polypeptide C, 33kDa; DNA-dependent RNA polymerase catalyzes the transcription of DNA into RNA using the four ribonucleoside triphosphates as substrates. Component of RNA polymerase II which synthesizes mRNA precursors and many functional non-coding RNAs. Pol II is the central component of the basal RNA polymerase II transcription machinery. It is composed of mobile elements that move relative to each other. RPB3 is part of the core element with the central large cleft and the clamp element that moves to open and close the cleft (By similarity)
PSMA1  proteasome (prosome, macropain) subunit, alpha type, 1; The proteasome is a multicatalytic proteinase complex which is characterized by its ability to cleave peptides with Arg, Phe, Tyr, Leu, and Glu adjacent to the leaving group at neutral or slightly basic pH. The proteasome has an ATP-dependent proteolytic activity (By similarity)
PSMB1  proteasome (prosome, macropain) subunit, beta type, 1; The proteasome is a multicatalytic proteinase complex which is characterized by its ability to cleave peptides with Arg, Phe, Tyr, Leu, and Glu adjacent to the leaving group at neutral or slightly basic pH. The proteasome has an ATP-dependent proteolytic activity (By similarity)
PSMD2  proteasome (prosome, macropain) 26S subunit, non-ATPase, 2; Acts as a regulatory subunit of the 26 proteasome which is involved in the ATP-dependent degradation of ubiquitinated proteins
PSMD6  proteasome (prosome, macropain) 26S subunit, non-ATPase, 6; Acts as a regulatory subunit of the 26S proteasome which is involved in the ATP-dependent degradation of ubiquitinated proteins
PSMD7  proteasome (prosome, macropain) 26S subunit, non-ATPase, 7; Acts as a regulatory subunit of the 26S proteasome which is involved in the ATP-dependent degradation of ubiquitinated proteins
PSME1  proteasome (prosome, macropain) activator subunit 1 (PA28 alpha); Implicated in immunoproteasome assembly and required for efficient antigen processing. The PA28 activator complex enhances the generation of class I binding peptides by altering the cleavage pattern of the proteasome
PSME3  proteasome (prosome, macropain) activator subunit 3 (PA28 gamma; Ki); Subunit of the 11S REG-gamma (also called PA28-gamma) proteasome regulator, a donut-shaped homoheptamer which associates with the proteasome. 11S REG-gamma activates the trypsin-like catalytic subunit of the proteasome but inhibits the chymotrypsin-like and postglutamyl-preferring (PGPH) subunits. Facilitates the MDM2-TP53/p53 interaction which promotes ubiquitination- and MDM2-dependent proteasomal degradation of TP53/p53, limiting its accumulation and resulting in inhibited apoptosis after DNA damage. May also be [...]
PSMF1  proteasome (prosome, macropain) inhibitor subunit 1 (PI31); Plays an important role in control of proteasome function. Inhibits the hydrolysis of protein and peptide substrates by the 20S proteasome. Also inhibits the activation of the proteasome by the proteasome regulatory proteins PA700 and PA28
PTPRA  protein tyrosine phosphatase, receptor type, A
RAB7A  RAB7A, member RAS oncogene family; Involved in late endocytic transport. Contributes to the maturation of phagosomes (acidification)
RBM22  RNA binding motif protein 22; Involved in pre-mRNA splicing (Probable). May translocate the cytosolic calcium-binding protein PDCD6 in the nucleus
RBM8A  RNA binding motif protein 8A; Component of a splicing-dependent multiprotein exon junction complex (EJC) deposited at splice junction on mRNAs. The EJC is a dynamic structure consisting of a few core proteins and several more peripheral nuclear and cytoplasmic associated factors that join the complex only transiently either during EJC assembly or during subsequent mRNA metabolism. Core components of the EJC, that remains bound to spliced mRNAs throughout all stages of mRNA metabolism, functions to mark the position of the exon-exon junction in the mature mRNA and thereby influences dow [...]
RHOA  ras homolog gene family, member A; Regulates a signal transduction pathway linking plasma membrane receptors to the assembly of focal adhesions and actin stress fibers. Serves as a target for the yopT cysteine peptidase from Yersinia pestis, vector of the plague, and Yersinia pseudotuberculosis, which causes gastrointestinal disorders. May be an activator of PLCE1. Activated by ARHGEF2, which promotes the exchange of GDP for GTP
RNF114  ring finger protein 114; May play a role in spermatogenesis
RNF7  ring finger protein 7; Probable component of the SCF (SKP1-CUL1-F-box protein) E3 ubiquitin ligase complex which mediates the ubiquitination and subsequent proteasomal degradation of target proteins involved in cell cycle progression, signal transduction and transcription. Through the RING-type zinc finger, seems to recruit the E2 ubiquitination enzyme to the complex and brings it into close proximity to the substrate. Promotes the neddylation of CUL5 via its interaction with UBE2F. May play a role in protecting cells from apoptosis induced by redox agents
SEC22B  SEC22 vesicle trafficking protein homolog B (S. cerevisiae); SNARE involved in targeting and fusion of ER-derived transport vesicles with the Golgi complex as well as Golgi-derived retrograde transport vesicles with the ER
SEC31A  SEC31 homolog A (S. cerevisiae); Component of the coat protein complex II (COPII) which promotes the formation of transport vesicles from the endoplasmic reticulum (ER). The coat has two main functions, the physical deformation of the endoplasmic reticulum membrane into vesicles and the selection of cargo molecules (By similarity)
SERP1  stress-associated endoplasmic reticulum protein 1; Interacts with target proteins during their translocation into the lumen of the endoplasmic reticulum. Protects unfolded target proteins against degradation during ER stress. May facilitate glycosylation of target proteins after termination of ER stress. May modulate the use of N-glycosylation sites on target proteins (By similarity)
SF3A1  splicing factor 3a, subunit 1, 120kDa; Subunit of the splicing factor SF3A required for 'A' complex assembly formed by the stable binding of U2 snRNP to the branchpoint sequence (BPS) in pre-mRNA. Sequence independent binding of SF3A/SF3B complex upstream of the branch site is essential, it may anchor U2 snRNP to the pre-mRNA. May also be involved in the assembly of the 'E' complex
SF3B2  splicing factor 3b, subunit 2, 145kDa; Subunit of the splicing factor SF3B required for 'A' complex assembly formed by the stable binding of U2 snRNP to the branchpoint sequence (BPS) in pre-mRNA. Sequence independent binding of SF3A/SF3B complex upstream of the branch site is essential, it may anchor U2 snRNP to the pre-mRNA. May also be involved in the assembly of the 'E' complex. Belongs also to the minor U12-dependent spliceosome, which is involved in the splicing of rare class of nuclear pre-mRNA intron
SLC25A3  solute carrier family 25 (mitochondrial carrier; phosphate carrier), member 3; Transport of phosphate groups from the cytosol to the mitochondrial matrix. Phosphate is cotransported with H(+)
SNW1  SNW domain containing 1; Involved in vitamin D-mediated transcription. Can function as a splicing factor in pre-mRNA splicing
SON  SON DNA binding protein; Represses hepatitis B virus (HBV) core promoter activity and transcription of HBV genes and production of HBV virions. Binds to the consensus DNA sequence: 5'-GA[GT]AN[CG][AG]CC-3'. Might protect cells from apoptosis. Might be involved in pre-mRNA splicing (By similarity)
SRP14  signal recognition particle 14kDa (homologous Alu RNA binding protein); Signal-recognition-particle assembly has a crucial role in targeting secretory proteins to the rough endoplasmic reticulum membrane. SRP9 together with SRP14 and the Alu portion of the SRP RNA, constitutes the elongation arrest domain of SRP. The complex of SRP9 and SRP14 is required for SRP RNA binding
SRPR  signal recognition particle receptor (docking protein); Component of the SRP (signal recognition particle) receptor. Ensures, in conjunction with the signal recognition particle, the correct targeting of the nascent secretory proteins to the endoplasmic reticulum membrane system
SRSF5  splicing factor, arginine/serine-rich 5; Plays a role in constitutive splicing and can modulate the selection of alternative splice sites
SRSF9  splicing factor, arginine/serine-rich 9; Plays a role in constitutive splicing and can modulate the selection of alternative splice sites
SSR2  signal sequence receptor, beta (translocon-associated protein beta); TRAP proteins are part of a complex whose function is to bind calcium to the ER membrane and thereby regulate the retention of ER resident proteins
STX16  syntaxin 16; SNARE involved in a vesicular transport step within the Golgi stack
SUMO1  SUMO1 3; Ubiquitin-like protein which can be covalently attached to target lysines as a monomer. Does not seem to be involved in protein degradation and may function as an antagonist of ubiquitin in the degradation process. Plays a role in a number of cellular processes such as nuclear transport, DNA replication and repair, mitosis and signal transduction. Involved in targeting RANGAP1 to the nuclear pore complex protein RANBP2. Covalent attachment to its substrates requires prior activation by the E1 complex SAE1-SAE2 and linkage to the E2 enzyme UBE2I, and can be promoted
SUMO3  SMT3 suppressor of mif two 3 homolog 3 (S. cerevisiae); Ubiquitin-like protein which can be covalently attached to target lysines either as a monomer or as a lysine-linked polymer. Does not seem to be involved in protein degradation and may function as an antagonist of ubiquitin in the degradation process. Plays a role in a number of cellular processes such as nuclear transport, DNA replication and repair, mitosis and signal transduction. Covalent attachment to its substrates requires prior activation by the E1 complex SAE1-SAE2 and linkage to the E2 enzyme UBE2I, and can be promoted b [...]
SUPT6H  suppressor of Ty 6 homolog (S. cerevisiae); Acts to stimulate transcriptional elongation by RNA polymerase II
TCEB1  transcription elongation factor B (SIII), polypeptide 1 (15kDa, elongin C); SIII, also known as elongin, is a general transcription elongation factor that increases the RNA polymerase II transcription elongation past template-encoded arresting sites. Subunit A is transcriptionally active and its transcription activity is strongly enhanced by binding to the dimeric complex of the SIII regulatory subunits B and C (elongin BC complex)
TH1L  TH1-like (Drosophila); Essential component of the NELF complex, a complex that negatively regulates the elongation of transcription by RNA polymerase II. The NELF complex, which acts via an association with the DSIF complex and causes transcriptional pausing, is counteracted by the P-TEFb kinase complex
TMED2  transmembrane emp24 domain trafficking protein 2; Could have a role in the budding of coatomer-coated and other species of coated vesicles. Could bind cargo molecules to collect them into budding vesicles
TMEM50A  transmembrane protein 50A
TRIP12  thyroid hormone receptor interactor 12; Component of PA700, an ATP-dependent multisubunit protein that activates the proteolytic activities of the multifunctional proteinase (20S proteasome) of the 26S complex. Specifically interacts with the ligand binding domain of the thyroid hormone receptor (in a thyroid hormone T3-independent manner) and with retinoid X receptor (RXR). Could be E3 ubiquitin-protein ligase which accepts ubiquitin from an E2 ubiquitin-conjugating enzyme in the form of a thioester and then directly transfers the ubiquitin to targeted substrates
U2AF1  U2 small nuclear RNA auxiliary factor 1; Plays a critical role in both constitutive and enhancer-dependent splicing by mediating protein-protein interactions and protein-RNA interactions required for accurate 3'-splice site selection. Recruits U2 snRNP to the branch point. Directly mediates interactions between U2AF2 and proteins bound to the enhancers and thus may function as a bridge between U2AF2 and the enhancer complex to recruit it to the adjacent intron
UBE2D3  ubiquitin-conjugating enzyme E2D 3 (UBC4/5 homolog, yeast); Catalyzes the covalent attachment of ubiquitin to other proteins. Mediates the selective degradation of short-lived and abnormal proteins. Functions in the E6/E6-AP-induced ubiquitination of p53/TP53
UBE2I  ubiquitin-conjugating enzyme E2I (UBC9 homolog, yeast); Accepts the ubiquitin-like proteins SUMO1, SUMO2, SUMO3 and SUMO4 from the UBLE1A-UBLE1B E1 complex and catalyzes their covalent attachment to other proteins with the help of an E3 ligase such as RANBP2 or CBX4. Essential for nuclear architecture and segregation
UBE2Z  ubiquitin-conjugating enzyme E2Z; Catalyzes the covalent attachment of ubiquitin to other proteins (By similarity). Specific substrate for UBE1L2, not charged with ubiquitin by UBE1. May be involved in apoptosis regulation
UBQLN1  ubiquilin 1; Links CD47 to the cytoskeleton. Promotes the surface expression of GABA-A receptors (By similarity). Promotes the accumulation of uncleaved PSEN1 and PSEN2 by stimulating their biosynthesis. Has no effect on PSEN1 and PSEN2 degradation
USP39  ubiquitin specific peptidase 39; May play a role in mRNA splicing. It is unsure if the protein really exhibits hydrolase activity. Could be a competitor of ubiquitin C-terminal hydrolases (UCHs)
USP4  ubiquitin specific peptidase 4 (proto-oncogene)
VCP  valosin-containing protein; Necessary for the fragmentation of Golgi stacks during mitosis and for their reassembly after mitosis. Involved in the formation of the transitional endoplasmic reticulum (tER). The transfer of membranes from the endoplasmic reticulum to the Golgi apparatus occurs via 50-70 nm transition vesicles which derive from part-rough, part-smooth transitional elements of the endoplasmic reticulum (tER). Vesicle budding from the tER is an ATP-dependent process. The ternary complex containing UFD1L, VCP and NPLOC4 binds ubiquitinated proteins and is necessary for the e [...]
VPS4A  vacuolar protein sorting 4 homolog A (S. cerevisiae); Involved in late steps of the endosomal multivesicular bodies (MVB) pathway. Recognizes membrane-associated ESCRT-III assemblies and catalyzes their disassembly, possibly in combination with membrane fission. Redistributes the ESCRT-III components to the cytoplasm for further rounds of MVB sorting. MVBs contain intraluminal vesicles (ILVs) that are generated by invagination and scission from the limiting membrane of the endosome and mostly are delivered to lysosomes enabling degradation of membrane proteins, such as stimulated growt [...]
XRN2  5'-3' exoribonuclease 2; Possesses 5'->3' exoribonuclease activity (By similarity). May promote the termination of transcription by RNA polymerase II. During transcription termination, cleavage at the polyadenylation site liberates a 5' fragment which is subsequently processed to form the mature mRNA and a 3' fragment which remains attached to the elongating polymerase. The processive degradation of this 3' fragment by this protein may promote termination of transcription
YME1L1  YME1-like 1 (S. cerevisiae); Putative ATP-dependent protease which plays a role in mitochondrial protein metabolism. Seems to act in the processing of OPA1
ZC3H11A  zinc finger CCCH-type containing 11A
ZNF207  zinc finger protein 207

Example 4: Functional clustering of candidate control genes

The functional classification of the entire list of 119 genes identified from the Leucegene data set was evaluated using the DAVID algorithm [14, 15] (Table 8). Interestingly, a significant proportion of these highly stable genes fell into two main functional categories: RNA splicing/processing, with an enrichment score of 5.92 (e.g., SF3B2); and proteasome/ubiquitin ligase activity, with an enrichment score of 5.76 (e.g., PSMA1). In addition to these functional clusters, 12 genes involved in transcription and 7 genes involved in translation (e.g., EIF4H) were also found. A prominent group of the genes identified (n=8) are the heterogeneous nuclear ribonucleoproteins (e.g., HNRNPL, HNRNPK), some of which are also involved in the above cellular processes.

Table 8: Functional classification of the entire list of 119 genes identified from the Leucegene data set, as assessed using the DAVID algorithm


Example 5: Validation of new control genes in other RNA-seq cancer data sets

The expression stability of the 15 candidate control genes was further examined in 8 different data sets from TCGA, representing 6 different cancer types and normal tissue samples, as well as in normal cord blood data obtained by Leucegene (Table 3). The 15 candidate control genes proved to be very stable in all 4 data sets of normal tissues, each yielding a CV less than or equal to 25% and an MFC less than or equal to 10 (Table 9).

Table 9 : Variability of select candidate endogenous control genes in normal hematopoietic cells and in TCGA data sets

          Leucegene_CD34+ CB_normal    LAML_Tumor
gene      mean    CV (%)  MFC          mean    CV (%)  MFC
HNRNPK    205.3   8.6     1.3          124.3   25.3    8.6
PCBP2     193.1   10.7    1.6          122.9   23.8    4.9
GNB1      114.1   8.7     1.3          144.9   19.0    3.0
EIF4H     111.2   9.8     1.4          57.1    23.2    3.7
SRP14     127.8   10.3    1.4          109.5   21.7    3.9
HNRNPL    144.6   10.8    1.7          52.0    21.5    3.5
SLC25A3   188.4   9.5     1.4          108.4   23.4    3.6
VPS4A     41.7    8.6     1.5          35.1    19.2    4.2
PSMF1     27.4    13.3    1.7          33.1    18.6    3.0
ZNF207    85.1    15.2    1.7          78.9    20.0    3.4
PSMD6     47.9    11.6    1.6          27.0    22.8    5.9
SRSF9     97.3    7.1     1.3          63.9    20.8    4.4
PSMA1     47.1    14.2    1.7          68.9    21.5    3.1
SF3B2     55.7    7.4     1.3          51.0    20.8    3.7
UBE2I     50.0    9.6     1.5          33.4    20.7    3.5

          BRCA_Normal                  BRCA_Tumor
gene      mean    CV (%)  MFC          mean    CV (%)  MFC
HNRNPK    171.5   16.5    2.9          217.5   18.3    3.5
PCBP2     185.6   16.8    3.4          179.0   27.1    8.5
GNB1      100.3   12.3    [illegible]  120.0   31.0    [illegible]
EIF4H     122.8   16.6    2.4          128.6   26.6    13.
SRP14     145.0   18.8    2.5          164.6   30.6    [illegible]
HNRNPL    98.1    15.5    2.2          138.0   20.8    4.7
SLC25A3   164.4   16.3    2.8          149.9   31.0    8.2
VPS4A     31.1    19.7    3.4          28.0    39.1    11.6
PSMF1     32.7    11.6    2.4          40.9    34.3    13.4
ZNF207    44.0    20.7    3.2          53.7    23.6    [illegible]
PSMD6     32.5    14.0    1.9          38.8    28.8    6
SRSF9     56.3    24.4    10.3         91.7    30.9    [illegible]
PSMA1     55.3    15.1    2.1          72.7    35.7    10.7
SF3B2     93.2    20.2    5.0          117.3   35.2    24.4
UBE2I     44.0    17.5    3.7          71.0    30.8    8.5

          KIRC_Normal                  KIRC_Tumor
gene      mean    CV (%)  MFC          mean    CV (%)  MFC
HNRNPK    171.6   18.9    2.5          152.5   23.9    5.0
PCBP2     223.1   13.4    1.9          200.7   28.0    11
GNB1      100.1   15.0    2.0          103.8   28.3    6.4
EIF4H     119.5   18.0    2.4          134.1   25.2    7
SRP14     178.2   25.1    2.6          142.1   23.3    [illegible]
HNRNPL    99.6    16.3    2.1          96.4    16.4    3.7
SLC25A3   213.3   14.0    2.3          145.0   41.6    16.0
VPS4A     36.6    14.2    2.3          35.9    22.9    13.5
PSMF1     32.9    13.1    2.1          38.6    20.4    6
ZNF207    33.4    17.7    2.2          37.9    19.3    4.2
PSMD6     38.6    14.8    2.9          22.5    32.3    6.9
SRSF9     67.6    16.8    2.1          63.7    25.9    [illegible]
PSMA1     45.1    15.6    1.9          56.2    25.9    7.0
SF3B2     87.3    20.6    2.5          73.8    22.4    8
UBE2I     28.2    21.1    2.4          41.8    22.0    6.5

          LUAD_Normal                  LUAD_Tumor                   COAD_Tumor
gene      mean    CV (%)  MFC          mean    CV (%)  MFC          mean    CV (%)  MFC
HNRNPK    131.3   9.1     1.6          154.5   22.7    3.5          155.1   27.0    [illegible]
PCBP2     145.6   9.3     1.5          173.6   37.0    [illegible]  182.2   33.9    9.3
GNB1      104.5   12.0    1.6          110.9   33.1    4.8          94.5    28.9    9
EIF4H     112.3   11.8    1.7          118.0   34.9    8            116.1   22.2    6
SRP14     137.6   21.8    2.2          142.3   33.8    6            105.7   29.1    [illegible]
HNRNPL    93.0    7.3     1.4          129.6   23.5    4.3          150.7   18.8    3.7
SLC25A3   133.9   11.9    1.7          146.1   29.2    6.0          238.2   29.4    [illegible]
VPS4A     32.0    11.0    1.7          33.0    26.5    5.3          36.1    22.8    6.4
PSMF1     33.6    10.5    1.6          36.6    29.2    6.3          46.1    36.7    6
ZNF207    35.3    11.0    1.8          52.9    24.3    4.1          54.0    23.1    7
PSMD6     28.1    15.3    2.1          30.7    28.8    [illegible]  33.6    31.4    12.5
SRSF9     59.3    11.2    1.5          95.2    35.8    [illegible]  102.8   27.7    5.0
PSMA1     46.6    17.8    2.0          73.8    30.6    4.7          90.6    37.6    16.9
SF3B2     89.9    16.1    2.5          106.5   29.9    5.3          83.7    26.0    4
UBE2I     37.2    9.1     1.4          60.2    26.9    3.8          79.9    27.6    4.8

Underline = %CV > 25, MFC > 5

Of note, the candidate genes showed the highest stability in the 17 CD34+ cord blood samples (enriched normal stem and progenitor cells), each yielding a CV less than or equal to 15% and an MFC less than 2. Within the tumor data sets, more variability was observed, with the highest CV being 42% for SLC25A3 in kidney cancer, and the highest MFC being 24 for SF3B2 in breast cancer. However, the majority of the candidate genes exhibited lower variability in all data sets as compared to the standard housekeeping genes. A score was determined for each candidate gene based on the number of data sets analyzed (10 total) in which the CV and MFC values complied with the initial selection criteria (CV<25%, MFC<5). The genes were then ranked according to this scoring system.

The expression variability of the candidate control genes was also calculated using the combined TCGA data set (FIG. 1 and Table 10). As with the standard control genes, the variability was higher than in the individual data sets, reflecting the diversity of tissue types included. Nonetheless, all 15 of the candidate genes displayed stability that was greater than the majority of the commonly used control genes. The CV values were all lower than that of TBP; however, UBE2I and SF3B2 yielded CV values slightly higher than that of ABL1. Only SF3B2 gave an MFC higher than that of ABL1 (Table 10). The majority of the candidate genes had CV values in the lowest 5th quantile and the remainder fell below the 25th quantile, in contrast to the standard control genes, of which HPRT1 and GAPDH were actually more variable than half the genes present at similar expression levels (FIG. 1).

Overall, the 15 newly selected control genes display a greater degree of stability in gene expression compared to the commonly used control genes, as determined by RNA-seq. The highest ranking genes, as determined by having low coefficient of variation (CV) and maximum fold change (MFC) values in the most data sets analyzed, are HNRNPL and ZNF207, with high and medium expression ranges, respectively.

Table 10: Variability of select candidate endogenous control genes in combined TCGA data sets

Expression > 100 RPKM
rank  gene     mean    CV (%)  MFC    score  Chr  Gene description
1     HNRNPL   116.3   32.0    8.8    10.0   19   heterogeneous nuclear ribonucleoprotein L
2     HNRNPK   177.0   28.4    8.8    7.5    9    heterogeneous nuclear ribonucleoprotein K
3     EIF4H    120.4   31.3    17.1   6.5    7    eukaryotic translation initiation factor 4H
4     PCBP2    180.3   30.4    9.8    6.0    12   poly(rC) binding protein 2
5     GNB1     113.2   30.4    16.9   6.0    1    guanine nucleotide binding protein beta
6     SRP14    145.6   31.8    10.9   6.0    15   signal recognition particle 14kDa
7     SLC25A3  156.1   38.0    16.0   6.0    12   solute carrier family 25 member 3

Expression < 100 RPKM
rank  gene     mean    CV (%)  MFC    score  Chr  Gene description
1     ZNF207   50.6    32.3    10.7   9.0    17   zinc finger protein 207
2     UBE2I    57.0    41.7    12.6   7.5    16   ubiquitin-conjugating enzyme E2I
3     VPS4A    32.3    30.3    17.4   7.0    16   vacuolar protein sorting 4 homolog A
4     PSMF1    39.0    31.1    19.1   6.5    20   proteasome inhibitor subunit 1 (PI31)
5     PSMA1    67.8    36.8    18.0   6.5    11   proteasome subunit, alpha type, 1
6     SF3B2    93.7    39.9    34.3   6.5    11   splicing factor 3b, subunit 2
7     SRSF9    80.3    35.6    23.8   6.0    12   serine/arginine-rich splicing factor 9
8     PSMD6    32.1    35.2    16.6   5.5    3    proteasome 26S subunit, non-ATPase, 6

Mean relates to RPKM values within each data set. CV indicates the coefficient of variation and equals the standard deviation divided by the mean RPKM, expressed as a percentage. MFC, maximum fold change, represents the maximum divided by the minimum RPKM value of the data set. Score represents the sum of the number of data sets (out of 10 total) which have a CV < 25% and MFC < 5 for each gene, divided by 2. Chr indicates the chromosome number. Rank is based on highest to lowest score.
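The CV, MFC and score defined in the notes to Table 10 can be illustrated with a minimal Python sketch. It is offered only as an illustration under stated assumptions: the function names and the toy RPKM values are hypothetical, the sample standard deviation is assumed (the description does not say which estimator was used), and the score is read as awarding half a point per data set for each of the two criteria met, which is one interpretation of "the sum of the number of data sets ... divided by 2". All RPKM values are assumed to be greater than zero.

```python
from statistics import mean, stdev


def cv_percent(rpkm):
    """Coefficient of variation: standard deviation / mean RPKM, in percent."""
    return 100.0 * stdev(rpkm) / mean(rpkm)


def max_fold_change(rpkm):
    """Maximum fold change: largest RPKM divided by smallest RPKM."""
    return max(rpkm) / min(rpkm)


def stability_score(datasets, cv_cutoff=25.0, mfc_cutoff=5.0):
    """Half a point per data set for each criterion met (assumed reading)."""
    cv_hits = sum(cv_percent(v) < cv_cutoff for v in datasets.values())
    mfc_hits = sum(max_fold_change(v) < mfc_cutoff for v in datasets.values())
    return (cv_hits + mfc_hits) / 2


# Hypothetical RPKM values for one gene in two data sets (not study data).
toy_gene = {
    "dataset_A": [100.0, 110.0, 90.0, 105.0],
    "dataset_B": [50.0, 400.0, 120.0, 80.0],
}

for name, values in toy_gene.items():
    print(name, round(cv_percent(values), 1), round(max_fold_change(values), 2))
print("score:", stability_score(toy_gene))
```

In this toy input the first data set meets both criteria and the second meets neither, so the gene would receive one point out of a possible two.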

Example 6: QPCR validation of new control genes

In order to assess the effectiveness of the newly identified control genes for quantitative RT-PCR (qRT-PCR) analysis, assays were developed for the candidates using the Universal Probe Library (UPL, Roche) (Table 11). New assays were designed to span intron boundaries, and tested for optimal efficiency by standard curve analysis. SRP14 was excluded due to the inability to design an intron-spanning assay. qRT-PCR was performed for each of the 14 new genes, as well as for 5 standard control genes (GAPDH, ACTB, TBP, HPRT1, ABL1), on cDNA from a panel of 14 leukemia samples (10 AML, 4 ALL) plus one CD34+ cord blood sample (using equal amounts of RNA). The average expression stability (M) of each gene was calculated using the GeNorm algorithm [16] (FIG. 2). By qRT-PCR, all 14 of the newly identified control genes had lower M values than the standard control genes, confirming that they were more stably expressed in the leukemia samples, in agreement with the RNA-seq data, with EIF4H and PSMA1 being the most stable in this experimental condition.

Table 11: Primers used for QPCR assays

Gene     Left primer (SEQ ID NO)           Right primer (SEQ ID NO)         UPL probe  Assay note
HNRNPL   tggaggtgaccgaggaga (1)            cgctcacttttgcctgaga (2)          33         common to transcripts 1,2
HNRNPK   acgcattctgcttcagagc (3)           gggactgaaacactggcatt (4)         43         common to transcripts 1,2,3
EIF4H    ctacgacgatcgggcctac (5)           ttctggctacgggaaccat (6)          70         common to transcripts 1,2
PCBP2    ggctcaatatctaatcaatgtcagg (7)     agcagaaagggattatggatga (8)       32         common to transcripts 1-7
GNB1     tctgggatgcactcaaagc (9)           ccatgccatcgtcagtca (10)          27
SLC25A3  gccagagcagctggttgta (11)          gggtgagaaacaattgcaca (12)        78         common to transcripts 1,2,3
ZNF207   ccattaatgccaggtgttcc (13)         ccacccattggcatcatt (14)          72         common to transcripts 2,3, but not 1
UBE2I    acctcatgaactgggagtgc (15)         tcatctttgaaaagcatccgta (16)      12         common to transcripts 1,2,3,4
VPS4A    gagcctgtggtttgcatgt (17)          aggaggtcgtctgcattcac (18)        49
PSMF1    caccacccacacaccagtc (19)          aagtcttctcccccgacaac (20)        75         common to transcripts 1,2
PSMA1    tgacaatgatgtcactgtttgg (21)       agaccaactgtggctgaacc (22)        37         common to transcripts 1,2,3
SF3B2    catccatggggacctgtact (23)         ggcttcttctccttcagtcg (24)        12
SRSF9    aggaatgggcctcctacaag (25)         cgcatgtgatccttcaggt (26)         27
PSMD6    catataggtcattaacccttggcta (27)    ccggcagcaataaacctg (28)          29
GAPDH    agccacatcgctcagacac (29)          gcccaatacgaccaaatcc (30)         60
HPRT1    tgatagatccattcctatgactgtaga (31)  caagacattctttccagttaaagttg (32)  22
ACTB     attggcaatgagcggttc (33)           tgaaggtagtttcgtggatgc (34)       11
TBP      gaacatcatggatcagaacaaca (35)      atagggattccgggagtcat (36)        87
ABL1     agaaggactaccgcatggag (37)         gagggattccactgccaac (38)         44         common to transcripts 1,2
CD33     aatgacacccaccctaccac (39)         tcagtggggccatgtaactt (40)        75         common to transcripts 1,2
FLT3     ctttaagcacagctccctgaa (41)        tgaccatggaaacaactcctc (42)       5
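The average expression stability measure M referred to in Example 6 can be illustrated with the following minimal Python sketch. It follows the published definition of M in reference [16] (for each gene, the mean over all other genes of the standard deviation of the log2 expression ratios across samples); it is not the GeNorm software itself, and the function name, the gene labels and the input values are hypothetical. The input is assumed to be relative expression quantities on a linear scale, one value per sample, in the same sample order for every gene.

```python
import math
from statistics import mean, stdev


def genorm_m(expr):
    """Return {gene: M}, where a lower M indicates more stable expression.

    expr maps each gene to a list of linear-scale relative quantities,
    one per sample, in the same sample order for all genes.
    """
    genes = list(expr)
    m_values = {}
    for j in genes:
        pairwise_sd = []
        for k in genes:
            if k == j:
                continue
            # Pairwise variation: SD of log2 ratios of gene j to gene k.
            log_ratios = [math.log2(a / b) for a, b in zip(expr[j], expr[k])]
            pairwise_sd.append(stdev(log_ratios))
        m_values[j] = mean(pairwise_sd)
    return m_values


# Hypothetical relative quantities for three genes in four samples.
toy_expr = {
    "gene_A": [1.0, 2.0, 4.0, 8.0],
    "gene_B": [2.0, 4.0, 8.0, 16.0],  # constant ratio to gene_A
    "gene_C": [1.0, 8.0, 1.0, 8.0],   # erratic relative to both
}

m = genorm_m(toy_expr)
for gene in sorted(m, key=m.get):
    print(gene, round(m[gene], 3))
```

Because gene_A and gene_B keep a constant ratio across the toy samples, they obtain the lowest M values, while gene_C, which fluctuates relative to both, obtains the highest.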

Although it is widely presumed that RNA-seq data correlates well with qRT-PCR data, there is little evidence available to address this topic. The expression of CD33 and FLT3 in the same 15 leukemia and cord blood samples was therefore assessed in order to demonstrate correlation between the RPKM and delta Ct (dCt) values for these genes. These two genes were selected due to their known variability of expression in leukemia. The delta Ct values for each sample were calculated using either a standard control gene (GAPDH), or a newly identified control gene (HNRNPL, EIF4H, PSMA1, or SF3B2). Spearman correlation analysis of CD33 expression data demonstrated high correlation between RPKM and dCt (ρ = -0.9714 to -0.9893 for EIF4H), except when GAPDH was used as the control gene (ρ = -0.775) (FIG. 3). Analysis with FLT3 showed similar correlation. The lower degree of correlation between RPKM and dCt when using GAPDH as a control gene demonstrates the importance of proper control gene selection in qRT-PCR experiments.

To further address the importance of proper control gene selection in qRT-PCR analysis, the relative quantification (RQ) values were calculated for a stably expressed gene (EIF4H), using either GAPDH or HNRNPL for normalization (FIG. 4). The RQ of EIF4H was very stable between leukemia samples when HNRNPL was used as the control gene (CV=14%; MFC= .6). However, RQ values of the same samples calculated using GAPDH varied as much as 10.7-fold, with RQ values ranging from 0.22 to 2.29 (CV=88%). Normalization with GAPDH resulted in up to a 5.3-fold difference in EIF4H expression within individual samples, as compared to HNRNPL normalization. These findings highlight the importance of using more stable control genes, as identified in the present study, in qRT-PCR analysis, and further validate the newly identified control genes.
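The relationship between dCt, RQ and the Spearman correlation discussed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the analysis actually performed: dCt is taken as the Ct of the test gene minus the Ct of the control gene, RQ is computed as 2^-(ddCt) assuming an amplification efficiency of about 100%, the first sample is arbitrarily used as calibrator, and all function names and values are hypothetical. A negative correlation is expected because a higher expression level corresponds to a lower Ct.

```python
from math import sqrt
from statistics import mean


def delta_ct(ct_test, ct_control):
    """dCt per sample: Ct of the test gene minus Ct of the control gene."""
    return [t - c for t, c in zip(ct_test, ct_control)]


def relative_quantity(dct, calibrator_dct):
    """RQ = 2^-(dCt - calibrator dCt); assumes ~100% PCR efficiency."""
    return [2.0 ** -(d - calibrator_dct) for d in dct]


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for index in order[i:j + 1]:
            result[index] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical values for four samples (not study data).
rpkm = [5.0, 20.0, 80.0, 320.0]
ct_test = [30.0, 28.1, 26.0, 24.2]
ct_control = [20.0, 20.1, 19.9, 20.0]

dct = delta_ct(ct_test, ct_control)
rq = relative_quantity(dct, calibrator_dct=dct[0])
print("dCt:", [round(d, 2) for d in dct])
print("RQ:", [round(r, 2) for r in rq])
print("Spearman RPKM vs dCt:", round(spearman(rpkm, dct), 4))
```

With a stable control gene, as in this toy input, the dCt values track the RPKM values closely; a variable control gene would add noise to dCt and weaken the correlation, which is the effect described above for GAPDH.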
Table 12 depicts the sequence identifiers (SEQ ID NOs) corresponding to the sequences of the 15 newly selected control genes, which display a greater degree of stability in gene expression compared to the commonly used control genes, as determined by RNA-seq.

Table 12

Although the present invention has been described hereinabove by way of specific embodiments thereof, it can be modified, without departing from the spirit and nature of the subject invention as defined in the appended claims. The scope of the claims should not be limited by the preferred embodiments set forth in the examples, but should be given the broadest interpretation consistent with the description as a whole. In the claims, the word "comprising" is used as an open-ended term, substantially equivalent to the phrase "including, but not limited to". The singular forms "a", "an" and "the" include corresponding plural references unless the context clearly dictates otherwise.

REFERENCES

1. Bustin, S.A., Absolute quantification of mRNA using real-time reverse transcription polymerase chain reaction assays. J Mol Endocrinol, 2000. 25(2): p. 169-93.

2. Lee, P.D., et al., Control genes and variability: absence of ubiquitous reference transcripts in diverse mammalian expression studies. Genome Res, 2002. 12(2): p. 292-7.

3. Suzuki, T., P.J. Higgins, and D.R. Crawford, Control selection for RNA quantitation. Biotechniques, 2000. 29(2): p. 332-7.

4. Thellin, O., et al., Housekeeping genes as internal standards: use and limits. J Biotechnol, 1999. 75(2-3): p. 291-5.

5. Warrington, J.A., et al., Comparison of human adult and fetal expression and identification of 535 housekeeping/maintenance genes. Physiol Genomics, 2000. 2(3): p. 143-7.

6. Huggett, J., et al., Real-time RT-PCR normalisation; strategies and considerations. Genes Immun, 2005. 6(4): p. 279-84.

7. de Jonge, H.J., et al., Evidence based selection of housekeeping genes. PLoS One, 2007. 2(9): p. e898.

8. Popovici, V., et al., Selecting control genes for RT-QPCR using public microarray data. BMC Bioinformatics, 2009. 10: p. 42.

9. Mortazavi, A., et al., Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods, 2008. 5(7): p. 621-8.

10. Oshlack, A., M.D. Robinson, and M.D. Young, From RNA-seq reads to differential expression results. Genome Biol, 2010. 11(12): p. 220.

11. Shendure, J., The beginning of the end for microarrays? Nat Methods, 2008. 5(7): p. 585-7.

12. Wang, Z., M. Gerstein, and M. Snyder, RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 2009. 10(1): p. 57-63.

13. Wilhelm, B.T. and J.R. Landry, RNA-Seq-quantitative measurement of expression through massively parallel RNA-sequencing. Methods, 2009. 48(3): p. 249-57.

14. Huang da, W., B.T. Sherman, and R.A. Lempicki, Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc, 2009. 4(1): p. 44-57.

15. Huang da, W., et al., DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Res, 2007. 35(Web Server issue): p. W169-75.

16. Vandesompele, J., et al., Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biol, 2002. 3(7): p. RESEARCH0034.

17. Mestdagh, P., et al., A novel and universal method for microRNA RT-qPCR data normalization. Genome Biol, 2009. 10(6): p. R64.

18. Smith, R.D., et al., Exogenous reference RNA for normalization of real-time quantitative PCR. Biotechniques, 2003. 34(1): p. 88-91.

19. Mane, S.P., et al., Transcriptome sequencing of the Microarray Quality Control (MAQC) RNA reference samples using next generation sequencing. BMC Genomics, 2009. 10: p. 264.

20. Beillard, E., et al., Evaluation of candidate control genes for diagnosis and residual disease detection in leukemic patients using 'real-time' quantitative reverse-transcriptase polymerase chain reaction (RQ-PCR) - a Europe against cancer program. Leukemia, 2003. 17(12): p. 2474-86.

21. Consortium, E.P., A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol, 2011. 9(4): p. e1001046.

22. Shayegi et al., The level of residual disease based on mutant NPM1 is an independent prognostic factor for relapse and survival in AML. Blood 2013, 122(1):83-92.

23. Papadaki C et al., Monitoring minimal residual disease in acute myeloid leukaemia with NPM1 mutations by quantitative PCR: clonal evolution is a limiting factor. Br J Haematol. 2009, 144 (4):517-23. Epub 2008 Nov 26.

24. Beillard E et al., Evaluation of candidate control genes for diagnosis and residual disease detection in leukemic patients using 'real-time' quantitative reverse-transcriptase polymerase chain reaction (RQ-PCR) - a Europe against cancer program. Leukemia. 2003, 17(12):2474-86.

WHAT IS CLAIMED IS:

1. A method for comparing expression levels of a test gene in a plurality of samples, comprising:
a) measuring the expression of one or more of control genes set forth in Table 1 in said plurality of samples;

Table 1

b) measuring the expression of the test gene in said plurality of samples;
c) normalizing the expression of the test gene in each sample by
i. comparing expression of the one or more control genes across the samples, and
ii. applying normalization to the test gene to obtain normalized expression levels of the test gene; and
d) comparing the normalized expression levels of the test gene across said plurality of samples.

2. A method for normalizing the levels of a test gene present in a plurality of samples comprising a) measuring the expression of one or more of control genes defined in claim 1 across said plurality of samples; b) comparing the expression levels of the one or more control genes across said plurality of samples; c) deriving a value for normalizing expression of the one or more control genes across said plurality of samples; and d) normalizing the expression of the test gene in said plurality of samples based on the value obtained in step c).

3. The method of claim 1 or 2, wherein said one or more control genes encode protein involved in RNA splicing/processing, and is/are KHDRBS1, RBM22, SNW1, CASC3, SF3A1, POLR2C, PAPOLA, HNRNPH3, HNRNPUL1, RBM8A, GTF2F1, USP39, U2AF1, XRN2 and/or ADAR.

4. The method of claim 1 or 2, wherein said one or more control genes encode protein involved in proteasome/ubiquitination and is/are USP4, UBE2I, PSMF1, PSMA1, VCP, PSMD6, PSMD7, KHDRBS1, and/or VPS4A.

5. The method of claim 4, wherein said one or more control genes is/are UBE2I, PSMF1, PSMA1, PSMD6 and/or VPS4A.

6. The method of claim 1 or 2, wherein said one or more control genes is/are HNRNPL, PCBP2, GNB1, SLC25A3, ZNF207, UBE2I, VPS4A, PSMF1, PSMA1, SRSF9 and/or PSMD6.

7. The method of any one of claims 1 to 6, wherein said expression is measured at the mRNA level.

8. The method of claim 7, wherein said mRNA is reverse transcribed to cDNA prior to said measuring.

9. The method of claim 7 or 8, wherein said mRNA or cDNA is amplified prior to said measuring.

10. The method of claim 9, wherein the amplification is by PCR.

11. The method of claim 10, wherein the PCR is real time PCR (RT-PCR).

12. The method of claim 11, wherein the RT-PCR is quantitative RT-PCR (qRT-PCR).

13. The method of any one of claims 1 to 12, wherein said plurality of samples comprises a normal cell sample.

14. The method of any one of claims 1 to 13, wherein said plurality of samples comprises a tumor cell sample.

15. The method of claim 14, wherein said tumor cell sample is a leukemia cell sample, a breast cancer cell sample, a colon cancer cell sample, a kidney cancer cell sample and/or a lung cancer cell sample.

16. The method of claim 15, wherein said tumor cell sample is a leukemia cell sample.

17. Use of one or more of the control genes defined in any one of claims 1 to 6 for normalizing the levels of one or more test genes across a plurality of samples.

18. A method for identifying a gene useful for normalizing the expression of a test gene across a plurality of samples, comprising: a) performing whole Transcriptome Shotgun Sequencing (RNA-seq) on said plurality of samples; b) comparing the level of expression of the genes of the transcriptome across the plurality of samples; and c) identifying the gene(s) exhibiting a coefficient of variation (CV) of about 25% or less and a maximum fold-change (MFC) of about 10 or less across the plurality of samples.

19. The method of claim 18, wherein the MFC is about 5 or less.

20. The method of claim 19, wherein the MFC is about 2 or less.

21. The method of any one of claims 18 to 20, wherein the CV is about 20% or less.

22. The method of claim 21, wherein the CV is about 15% or less.

INTERNATIONAL SEARCH REPORT
International application No. PCT/CA2014/050174

A. CLASSIFICATION OF SUBJECT MATTER
IPC: C12Q 1/68 (2006.01), C40B 30/00 (2006.01), G06F 19/20 (2011.01)

According to International Patent Classification (IPC) or to both national classification and IPC

B. FIELDS SEARCHED

Minimum documentation searched (classification system followed by classification symbols)
C12Q 1/68 (2006.01), C40B 30/00 (2006.01), G06F 19/20 (2011.01)

Documentation searched other than minimum documentation to the extent that such documents are included in the fields searched

Electronic database(s) consulted during the international search (name of database(s) and, where practicable, search terms used)
Databases: Canadian patent database, Intellect, Total patent, Pubmed, google patents. Key words: normalization, minimal variation, expression level, housekeeping genes, quantitative, transcriptome, Shotgun, Sequencing, coefficient of variation, and combinations thereof.

C. DOCUMENTS CONSIDERED TO BE RELEVANT

Category: P,X. Citation: MACRAE, T. et al. RNA-Seq reveals spliceosome and proteasome genes as most consistent transcripts in human cancer cells. PLoS One. 17 September 2013 (17-09-2013); Volume 8(9): e72884. ISSN: 1932-6203. Whole document. Relevant to claim No.: 1-22.

Category: X. Citation: EP 2405022 A2 (DAVIS-BANKAITIS, D. et al.) 11 January 2012 (11-01-2012). Whole document. Relevant to claim No.: 1-22.

Category: A. Citation: WANG, Z. et al. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics. 01 January 2009 (01-01-2009); Volume 10(1): 57-63. ISSN: 1471-0056. Whole document. Relevant to claim No.: 1-22.

Category: A. Citation: WO 2010/065940 A1 (MCCLELLAND, M. et al.) 10 June 2010 (10-06-2010). Whole document. Relevant to claim No.: 1-22.

Further documents are listed in the continuation of Box C. See patent family annex.

Special categories of cited documents:
"A" document defining the general state of the art which is not considered to be of particular relevance
"E" earlier application or patent but published on or after the international filing date
"L" document which may throw doubts on priority claim(s) or which is cited to establish the publication date of another citation or other special reason (as specified)
"O" document referring to an oral disclosure, use, exhibition or other means
"P" document published prior to the international filing date but later than the priority date claimed
"T" later document published after the international filing date or priority date and not in conflict with the application but cited to understand the principle or theory underlying the invention
"X" document of particular relevance; the claimed invention cannot be considered novel or cannot be considered to involve an inventive step when the document is taken alone
"Y" document of particular relevance; the claimed invention cannot be considered to involve an inventive step when the document is combined with one or more other such documents, such combination being obvious to a person skilled in the art
"&" document member of the same patent family

Date of the actual completion of the international search: 23 May 2014 (23-05-2014)
Date of mailing of the international search report: 23 May 2014 (23-05-2014)
Name and mailing address of the ISA/CA: Canadian Intellectual Property Office, Place du Portage I, C114 - 1st Floor, Box PCT, 50 Victoria Street, Gatineau, Quebec K1A 0C9. Facsimile No.: 001-819-953-2476
Authorized officer: Adrian Ali, Ph.D. (819) 934-7930

C (Continuation). DOCUMENTS CONSIDERED TO BE RELEVANT

Category: A. Citation: HELLEMANS, J. et al. qBase relative quantification framework and software for management and automated analysis of real-time quantitative PCR data. Genome Biology. 9 February 2007 (09-02-2007); Volume 8(R19). Whole document. Relevant to claim No.: 1-22.

Category: A. Citation: ENQUOBAHRIE, D. et al. Early pregnancy peripheral blood gene expression and risk of preterm delivery: a nested case control study. BioMed Central: The Open Access Publisher, BMC Pregnancy and Childbirth. 10 December 2009 (10-12-2009); Volume 9(56). Whole document. Relevant to claim No.: 1-22.

Information on patent family members

Patent document cited in search report: EP2405022A2, published 11 January 2012 (11-01-2012)
Patent family members and publication dates:
EP2405022A2      11 January 2012 (11-01-2012)
EP2405022A3      02 May 2012 (02-05-2012)
AU2009268659A1   14 January 2010 (14-01-2010)
CA2730277A1      14 January 2010 (14-01-2010)
EP2315858A2      04 May 2011 (04-05-2011)
WO2010006048A2   14 January 2010 (14-01-2010)
WO2010006048A3   29 April 2010 (29-04-2010)
US2012009581A1   12 January 2012 (12-01-2012)

Patent document cited in search report: WO2010065940A1, published 10 June 2010 (10-06-2010)
Patent family members and publication dates:
WO2010065940A1   10 June 2010 (10-06-2010)
CA2745961A1      10 June 2010 (10-06-2010)
CN102308212A     04 January 2012 (04-01-2012)
EP2370813A1      05 October 2011 (05-10-2011)
EP2370813A4      23 May 2012 (23-05-2012)
US2011236903A1   29 September 2011 (29-09-2011)
US2014011861A1   09 January 2014 (09-01-2014)
