Genomic Sequencing and Analysis of a Chinese Hamster Ovary Cell Line
Hammond et al. BMC Genomics 2011, 12:67
http://www.biomedcentral.com/1471-2164/12/67

RESEARCH ARTICLE Open Access

Genomic sequencing and analysis of a Chinese hamster ovary cell line using Illumina sequencing technology

Stephanie Hammond1,2, Jeffrey C Swanberg1,2, Mihailo Kaplarevic2, Kelvin H Lee1,2*

* Correspondence: [email protected]
1Department of Chemical Engineering, University of Delaware, Newark, DE 19711, USA
Full list of author information is available at the end of the article

© 2011 Hammond et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: Chinese hamster ovary (CHO) cells are among the most widely used hosts for therapeutic protein production. Yet few genomic resources are available to aid in engineering high-producing cell lines.

Results: High-throughput Illumina sequencing was used to generate a 1x genomic coverage of an engineered CHO cell line expressing secreted alkaline phosphatase (SEAP). Reference-guided alignment and assembly produced 3.57 million contigs and CHO-specific sequence information for ~18,000 mouse and ~19,000 rat orthologous genes. The majority of these genes are involved in metabolic processes, cellular signaling, and transport and represent attractive targets for cell line engineering.

Conclusions: This demonstrates the applicability of next-generation sequencing technology and comparative genomic analysis in the development of CHO genomic resources.

Background

With over half of all recombinant therapeutic proteins produced in mammalian cell lines, Chinese hamster ovary (CHO) cells remain the predominant production system for glycosylated biopharmaceuticals [1]. Although improvements in cell engineering, cell line selection, and culture conditions have increased productivity levels [2], the genetic basis underlying hyperproductivity remains poorly defined. The further development of genomic resources will facilitate detailed studies of genome structure, gene regulation, and gene expression in high-producing cell lines and aid in the use of sequence-specific molecular tools in cell line development.

A number of resources are required to support the assembly and annotation of the CHO genome including physical maps, genomic sequences, expressed sequence tag (EST) sequences, and proteomic data. Recent efforts to sequence and characterize bacterial artificial chromosome (BAC) libraries derived from CHO cells provide information for physical mapping of the CHO genome [3,4]. Transcriptomic and proteomic studies are currently used to examine differential expression of high-producing cell lines and to identify gene candidates for host cell engineering [5-7]. Transcriptomic studies which rely on cross-hybridization to mouse DNA microarrays showed some success [8,9], but also demonstrated the need for CHO-specific sequence information. Continued EST sequencing of CHO cell lines has generated databases containing more than 60,000 sequences and allowed for the development of CHO-specific DNA microarrays [10,11]. Furthermore, mapping of CHO EST sequences to a mouse genomic scaffold can potentially reveal structural and regulatory features of the CHO genome [12]. Such studies are limited in that only a subset of genes expressed at sufficiently high levels are captured for sequence analysis, providing little information regarding genome structure or non-transcribed portions of the genome.

At present, there is little genomic sequence data available for CHO cells. This limits the application of high-throughput molecular tools in gene discovery and cell line engineering. CHO cell lines also undergo multiple genomic rearrangements during the generation of high-producing cell lines, necessitating the sequencing of individual cell lines rather than the Chinese hamster [13,14]. Until recently, EST sequences were obtained by traditional Sanger technology [15], but current efforts are employing next-generation sequencing technologies including 454 and Illumina [11,16,17]. 454 pyrosequencing can generate up to 1 Gb of data in a single run, producing average read lengths of 330 bp with an average error rate of 4%, although a major limitation of this technology is the resolution of homopolymer regions [18,19]. Illumina sequencing can produce up to 90 Gb of data in a single run, generating reads up to 100 bp in length with an average error rate of 1-1.5% [19,20]. These technologies have significantly improved sequencing throughput and decreased cost, making mammalian genome sequencing feasible [20].

In this work, Illumina sequencing technology was used to generate an initial genomic sequence library of a Chinese hamster cell line with the goal of making these data available to the community. Comparative genomic analysis of this library was used to identify and functionally classify assembled sequences that were aligned to mouse and rat genes. An initial ~1x coverage of the CHO cell genome provided CHO-specific sequence information for a large number of protein coding genes, including those from functional classes typically underrepresented in EST libraries. This demonstrates that even low coverage genomic sequencing studies of CHO cell lines can increase the amount of sequence information available for this cell line.

Results

Illumina sequencing and reference-guided alignment

Gene-amplified CHO-SEAP cells expressing human secreted alkaline phosphatase (SEAP) were previously generated from CHO-DUK cells as a model for heterologous protein production [21]. Initial sequencing of the CHO-SEAP genome using Illumina technology yielded 2.72 Gb of genomic sequence, which represents a ~1x coverage of the CHO genome, estimated to be 2.8 Gb in size [4]. Reference-guided alignment was utilized in this study because a 1x genomic coverage is insufficient for de novo genome assembly. Previous work suggests that mouse and rat show a high degree of DNA sequence homology with the Chinese hamster [15] and several transcriptomic studies have employed a comparative approach to examine and annotate CHO sequence data [12,16]. Therefore, these species were chosen for comparative analysis. Reads were mapped to the reference genomes using MAQ (Mapping and Assembling with Qualities) software [22]. Nearly 9% of the total reads were aligned to each of the reference genomes, although a slightly higher number of reads were aligned to the mouse genome (Table 1). Each aligned read was assigned a mapping quality that indicates whether the read has a unique alignment or can be aligned to multiple positions in the genome [22]. Based on MAQ mapping qualities, 50% of the short reads from the raw data set were aligned to repetitive regions of the reference genomes (Table 1). In general, results from alignment to both mouse and rat reference genomes suggest that CHO genomic sequences are more similar to mouse genomic sequences, as previously demonstrated [15].

Table 1 Alignment of CHO short reads to mouse and rat reference genomes

Sequence Type                      Mouse                Rat
Total aligned                      6,582,209 (8.71%)    6,281,589 (8.31%)
Aligned to unique regions          3,202,228 (4.47%)    3,046,308 (4.28%)
Aligned to repeat regions          3,379,981 (4.24%)    3,235,281 (4.03%)
Aligned to protein-coding genes*   2,678,662 (3.52%)    2,359,457 (3.12%)

* Pseudogenes, RNA genes, and genes on Y chromosome were excluded from this analysis. Genes include exons, introns, and 5’- and 3’-untranslated regions.

The distribution of CHO-SEAP reads aligned to the mouse and rat reference genomes was examined. The raw number of reads aligned was summed over 5 Mb bins along reference chromosomes (Figure 1). Reads were mapped along all reference chromosomes, although several regions of higher coverage are present in both reference genomes. Many chromosomes in the current build of the mouse genome contain gaps in the 5’ end of the reference sequence, which may account for the low number of reads mapping to the chromosome ends.

Simulation studies based on Sanger technology suggest that most of the genomic coding sequence can be surveyed with less than 2-fold coverage [23] and low-coverage sequencing studies using Sanger [24] and next-generation [25,26] sequencing technologies are successful in producing partial sequences of orthologous genes. Nearly 40% of the aligned CHO reads are mapped to protein-coding mouse and rat genes (Table 1). Sequence information was collected for 97% of known protein-coding genes in the mouse genome and 93% of known protein-coding genes in the rat genome. To further examine gene coverage by this data set, the number of reads aligned to each gene in the mouse and rat protein-coding gene sets was examined (Figure 2). The raw read count was normalized by gene size to generate a normalized read count for each gene. This allows for a better comparison of the number of reads aligned to very small genes (thousands of bases) and very large genes (millions of bases). Only 3% of mouse genes and 7% of rat genes showed no coverage. Most genes showed low to medium coverage by the CHO short reads, with 33% of mouse and 48% of rat genes showing low coverage and 60% of mouse and 40% of rat genes showing medium coverage. A small proportion of both mouse and rat genes, less than 5%, showed a high level of coverage. Because CHO-SEAP cells are

[Figure 1: two panels (Mouse, Rat); vertical axis: Reference Chromosome; horizontal axis: Chromosome Coordinate (Mb); legend: Reads Aligned per bin, from 0-5,000 up to 20,000+]

Figure 1 Distribution of CHO short reads along the mouse and rat reference genomes.