Allele-Specific Expression Detection in Cancer Tissues and Cell Lines
Total Page:16
File Type:pdf, Size:1020Kb
Mayba et al. Genome Biology 2014, 15:405 http://genomebiology.com/2014/15/8/405 METHOD Open Access MBASED: allele-specific expression detection in cancer tissues and cell lines Oleg Mayba1*, Houston N Gilbert2, Jinfeng Liu1, Peter M Haverty1, Suchit Jhunjhunwala1, Zhaoshi Jiang1, Colin Watanabe1 and Zemin Zhang1* Abstract Allele-specific gene expression, ASE, is an important aspect of gene regulation. We developed a novel method MBASED, meta-analysis based allele-specific expression detection for ASE detection using RNA-seq data that aggregates information across multiple single nucleotide variation loci to obtain a gene-level measure of ASE, even when prior phasing information is unavailable. MBASED is capable of one-sample and two-sample analyses and performs well in simulations. We applied MBASED to a panel of cancer cell lines and paired tumor-normal tissue samples, and observed extensive ASE in cancer, but not normal, samples, mainly driven by genomic copy number alterations. Background whichmustbeovercomeforeffectiveASEdetection[3-5]. Transcriptional activity at the different alleles of a gene in In addition, data from multiple heterozygous single nucleo- a non-haploid genome can differ considerably. Both gen- tide variants (SNVs) in the same gene must be integrated, etic and epigenetic determinants govern this allele-specific and the large number of tested genes requires appropriate expression (ASE) [1] and impairment of this highly regu- statistical treatment of the multiplicity of tested hypotheses. lated process can lead to disease [2]. To understand the Despite these obstacles, next-generation sequencing tech- biological role of ASE and its underlying mechanisms, a nology has been recently used to identify putative sites of comprehensive identification of ASE events is required. ASE within and between samples [4,6-14]. Previous work Recent advances in sequencing technology enable investi- using short reads to detect ASE focused either on model gation of entire genomes at increasingly fine resolution. organisms [11,13] or on normal human tissues or cell lines Whole exome DNA sequencing (WES) or whole genome [4,10,12], although limited studies have explored the ASE DNA sequencing (WGS) allows identification of single landscape in cancer [15,16]. Further, there is currently no nucleotide mutations or polymorphisms in all exonic standard and robust way to aggregate information across regions or the entire human genome, respectively, while SNVs into a single measure of ASE for an entire transcript messenger RNA sequencing (RNA-Seq) enables quantita- isoform or gene. Most published studies either tested ASE tive analysis of gene expression. The expression state of at the SNV-level, sometimes requiring agreement across the heterozygous loci detected in WES or WGS assays can SNVs within a gene [3,6,7,10,12,17,18], or used available be investigated in a matched RNA-Seq sample from the phasing information to sum reads across SNVs [4]. A re- same individual, leading to a detailed map of the ASE cent study [13] incorporated phased SNV-level information activity. This approach allows the investigator to uncover into a gene-level statistical model, allowing for extra vari- the instances of complete or near allele silencing, which ability due to alternative splicing effects on allelic ratios at would be impossible using only RNA-Seq data. individual SNVs. However, with the exception of limited Next-generation sequencing of short reads is prone to samples such as those from the HapMap Project [19], most technical biases, for example, over- or under-representation specimens do not have SNV phasing information. In some of certain sequence motifs or inhomogeneous mapping, cases, population genetics-based approaches and existing databases can be used to phase common single nucleotide * Correspondence: [email protected]; [email protected] polymorphisms (SNPs) [20]. However, the ability to phase 1Department of Bioinformatics and Computational Biology, Genentech Inc, South San Francisco, CA 94080, USA common SNPs into individual haplotypes, whether based Full list of author information is available at the end of the article © 2014 Mayba et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Mayba et al. Genome Biology 2014, 15:405 Page 2 of 21 http://genomebiology.com/2014/15/8/405 on previous knowledge or a statistical method, does not cancer phenotypes. Joint analysis of tumors and matched apply to somatic mutations in cancer. This makes it chal- normal samples did not reveal any instances of loss of lenging to assign the ASE status to the mutant allele and imprinting, although several instances of loss of ASE in reduces the ability to study the ASE of mutation-carrying tumor were observed, including a switch of the overex- genes. pressed allele in the mono-allelically expressed pro- To overcome these difficulties, we developed a novel apoptotic factor BCL2L10. Our comprehensive analysis ASE detection method, called MBASED. MBASED as- revealed a rich landscape of ASE in cancer and highlighted sesses ASE by combining information across individual the flexibility and usefulness of our proposed method heterozygous SNVs within a gene without requiring a priori MBASED for ASE detection. knowledge of haplotype phasing; therefore, it can be ap- plied to a wide array of existing RNA-Seq data sets, most Results and discussion of which do not have phasing information available. When MBASED: meta-analysis based allele-specific expression phasing information is present, MBASED takes advantage detection of it to increase the power of ASE detection. In practice, First, we give an overview of our method, MBASED, with even with modest sequencing depths, a large number of detailed descriptions provided in Materials and methods genes show more than one detectable heterozygous exonic and in Supplementary methods in Additional file 1. Given SNV in RNA-Seq data, highlighting the importance of RNA read counts supporting reference and alternative having a framework for aggregating expression infor- alleles at individual SNVs within a unit of expression, mation across individual loci. MBASED provides an estimate of ASE and a correspond- To robustly estimate gene-level ASE from SNV-level ing P-value. A unit of expression can be a gene, a tran- RNA-Seq read counts, MBASED employs a meta-analytic script isoform, an exon, or an individual SNV: MBASED approach [21], used originally to combine information is agnostic with respect to the nature of the unit provided from several studies into a global effect estimate. Our ap- by the user. In this work, we choose the gene as a unit of proach can be used in both one-sample and two-sample ASE, which we define as the union of all exons forming analyses, making MBASED a versatile tool for investigat- individual transcript isoforms. ing allele-specific expression, both within an individual For a given gene, MBASED provides a framework for sample and in the context of differential ASE. aggregation of SNV-level information into a single meas- We applied MBASED to a panel of human lung cancer ure of ASE. The meta-analytic approach adopted by cell lines and paired tumor-normal lung and liver tissue MBASED relies on specification of gene haplotypes, which samples. None of our samples had haplotype phasing may be unknown for many data sets. In one-sample ASE information available, exemplifying a typical situation in analysis, when true haplotypes are unknown, MBASED gene expression studies. Our goal was to investigate the uses RNA read counts at individual SNVs within a gene to landscape of ASE in cancer and to identify potential in- phase SNVs into two haplotypes. We adopt a pseudo- stances of ASE contributing to cancer phenotypes. Previ- phasing approach that assigns an allele with a larger read ous studies of ASE in cancer were limited by sample size count at each SNV to the ‘major’ haplotype, with the im- [15] (three paired tumor-normal samples) or concen- plicit assumption that ASE is consistent in one direction trated on detecting monoallelic expression in the context along the length of the gene. This procedure is not of loss of heterozygosity events [16]. In this study we intended to faithfully reconstruct the true underlying present a general view of ASE, monoallelic or otherwise, haplotypes in all cases, but we expect it to do so for in a panel of 25 cancer samples across 2 tissue types, in- genes showing sufficiently strong ASE. We quantify the cluding direct tumor/normal comparisons. We observed allelic imbalance within a sample as the major allele high rates of ASE (9 to 26%) in tumor tissue samples (haplotype) frequency (MAF) of the gene. The ASE detec- relative to normal tissue samples (0.5 to 2%), as well as tion then becomes a problem of identifying genes with variable ASE rates in cancer cell lines (1 to 31%). We MAF >0.5. Phased counts from the ‘major’ haplotype are found the observed elevated ASE rates in cancer samples transformed into normally distributed scores, and scores to be mainly driven by underlying changes in genomic from individual SNVs are combined into a single gene- copy number and allelic composition. Numerous instances level score using a meta-analytic approach. This score is of genes with recurrent ASE in cancer were attributed to then used to obtain an estimate of underlying allelic recurrent genomic alterations involving known cancer imbalance. The meta-analytic statistical inference requires genes, for example, TP53 and KRAS. We found a number the correct specification of gene haplotypes in order to of mutations with known or suspected roles in cancer, assign proper statistical significance to the observed ASE.