Identification of Transcribed Sequences in Arabidopsis Thaliana by Using High-Resolution Genome Tiling Arrays

Identification of transcribed sequences in Arabidopsis thaliana by using high-resolution genome tiling arrays Viktor Stolc*†‡§, Manoj Pratim Samanta‡§¶, Waraporn Tongprasitʈ, Himanshu Sethiʈ, Shoudan Liang*, David C. Nelson**, Adrian Hegeman**, Clark Nelson**, David Rancour**, Sebastian Bednarek**, Eldon L. Ulrich**, Qin Zhao**, Russell L. Wrobel**, Craig S. Newman**, Brian G. Fox**, George N. Phillips, Jr.**, John L. Markley**, and Michael R. Sussman**†† *Genome Research Facility, National Aeronautics and Space Administration Ames Research Center, Moffett Field, CA 94035; †Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, CT 06520; ¶Systemix Institute, Cupertino, CA 94035; ʈEloret Corporation at National Aeronautics and Space Administration Ames Research Center, Moffett Field, CA 94035; and **Center for Eukaryotic Structural Genomics, University of Wisconsin, Madison, WI 53706 Edited by Sidney Altman, Yale University, New Haven, CT, and approved January 28, 2005 (received for review November 4, 2004) Using a maskless photolithography method, we produced DNA Genome-wide tiling arrays can overcome many of the shortcom- oligonucleotide microarrays with probe sequences tiled through- ings of the previous approaches by comprehensively probing out the genome of the plant Arabidopsis thaliana. RNA expression transcription in all regions of the genome. This technology has was determined for the complete nuclear, mitochondrial, and been used successfully on different organisms (5–12). A recent chloroplast genomes by tiling 5 million 36-mer probes. These study on A. thaliana reported measuring transcriptional activities probes were hybridized to labeled mRNA isolated from liquid of four different cell lines by using 25-mer-based tiling arrays that grown T87 cells, an undifferentiated Arabidopsis cell culture line. were developed with a chromium masking-based photolithog- Transcripts were detected from at least 60% of the nearly 26,330 raphy technique (11). annotated genes, which included 151 predicted genes that were In this study, we used a modified optical synthesis technique not identiﬁed previously by a similar genome-wide hybridization to develop a genome-wide tiling array for the complete genome study on four different cell lines. In comparison with previously of A. thaliana. This procedure used a maskless array synthesizer published results with 25-mer tiling arrays produced by chromium (5–7) that created high-density oligonucleotide arrays rapidly masking-based photolithography technique, 36-mer oligonucleo- without the need to invest resources in creating expensive and tide probes were found to be more useful in identifying intron– unchangeable chromium masks. For this experiment, the mask- exon boundaries. Using two-dimensional HPLC tandem mass spec- less array synthesizer was used to create a whole-genome tiling trometry, a small-scale proteomic analysis was performed with the array containing 5 million different 36-mer oligonucleotide same cells. A large amount of strongly hybridizing RNA was found probes that covered the Arabidopsis genome from end to end, in regions ‘‘antisense’’ to known genes. Similarity of antisense with 10-bp gaps on both strands, for all nuclear, mitochondrial, activities between the 25-mer and 36-mer data sets suggests that and chloroplast chromosomes. The arrays were hybridized with it is a reproducible and inherent property of the experiments. mRNA isolated from an undifferentiated, rapidly dividing, A. Transcription activities were also detected for many of the inter- thaliana cell line (T87; ref. 13) for which preliminary evidence genic regions and the small RNAs, including tRNA, small nuclear had suggested that a large number of genes were being ex- RNA, small nucleolar RNA, and microRNA. Expression of tRNAs pressed. The results of this study are reported here. correlates with genome-wide amino acid usage. Materials and Methods higher plant ͉ transcriptome ͉ maskless array synthesizer Design of the Arrays. The version of the sequence released in January 2002 at GenBank (www.ncbi.nlm.nih.gov) with acces- sion entries NC࿝003076, NC࿝003075, NC࿝003074, NC࿝003071, he flowering plant Arabidopsis thaliana is an important NC࿝003070, NC࿝000932, and NC࿝001284 was used to design the Tgenetic model organism for investigating the basic mecha- arrays. Based on the chromosomal sequences, 13 high-density nisms involved in the growth and development of multicellular arrays were constructed, each with Ϸ400,000 different 36-mer eukaryotes. In addition, it was the first plant for which a oligonucleotide probes. Probes were selected uniformly from complete genome sequence was determined (1). The availability both strands of the chromosomal sequences with gaps of 10 base of a complete genome sequence opens many avenues for new pairs between two consecutive probes covering the entire ge- experiments probing different cellular functions on a genome- nome. Additional details about the probe selection algorithm are GENETICS wide scale (2). However, designing such large-scale experiments provided in Supporting Materials and Methods, which is published is only possible when a complete map of all transcribed regions as supporting information on the PNAS web site. on the genome is available. Typical computationally derived preliminary annotations of the new sequences are error-prone, Hybridization Experiment. The selected oligonucleotides were syn- especially for shorter genes and for genes with multiple introns. thesized on 13 glass-based arrays by using an in situ maskless Therefore, experimentally determined annotation is considered photolithographic synthesis device and hybridized with mRNA to be the most definitive proof for transcription of predicted extracted from the established T87 A. thaliana cultured cell line. genes and for the correct identification of coding regions. Several experimental methods have been developed in recent years to carry out this difficult task. They include cloning and This paper was submitted directly (Track II) to the PNAS ofﬁce. cataloging of expressed sequence tags (ESTs) and full-length Abbreviation: miRNA, microRNA. cDNAs, serial analysis of gene expression (3) and ‘‘sequence ‡V.S. and M.P.S. contributed equally to the work. tagged site’’-based genetic mapping (4). However, none of the §To whom correspondence may be addressed. E-mail: [email protected] or above-mentioned methods comprehensively probe the complete [email protected]. genome sequences for proof of RNA expression. Moreover, ††M.R.S. is a cofounder of NimbleGen Systems, Inc., the company that is commercializing these methods are not sufficiently scalable to test multiple tissue the maskless array synthesizer technology described in this article. types or experimental conditions in a cost-effective manner. © 2005 by The National Academy of Sciences of the USA www.pnas.org͞cgi͞doi͞10.1073͞pnas.0408203102 PNAS ͉ March 22, 2005 ͉ vol. 102 ͉ no. 12 ͉ 4453–4458 Downloaded by guest on September 26, 2021 Sample labeling. Total RNA was extracted from the T87 line (13) Signals on the Genes. Probes matching unique locations on the of liquid grown tissue culture cells of A. thaliana, converted to genome were considered for all subsequent analysis. For each double-stranded cDNA by using a SuperScript Choice System protein-coding gene annotated in the V5 GenBank release of A. (GIBCO͞BRL), and mRNA was labeled by using an oligo(dT) thaliana, the average expression level was computed by taking primer containing the T7 RNA polymerase promoter the arithmetic mean of log-normalized signals on all probes that (5Ј-GGCCAGTGAATTGTAATACGACTCACTATAGG- were at least half within the exon regions of the gene. Of 26,330 GAGGCGG-T24-3Ј). Briefly, 10 ␮g of total RNA was incubated annotated protein-coding genes in the V5 release, 245 (0.9%) with 1ϫ first strand buffer͞10 mM DTT͞500 ␮M dNTPs͞5pM genes were excluded from this study because they did not have primer for 60 min at room temperature. Second-strand synthesis any representative probes for their exons. This result happened was accomplished by incubation with 200 ␮M dNTPs͞0.07 because either those genes were located on repeat regions of the units/␮l DNA ligase͞0.27 units/␮l DNA polymerase I͞0.013 genome or the segments containing them were not present in the units/␮l RNase, 1ϫ second-strand buffer͞10 units T4 DNA earlier release of the chromosome, based on which the arrays polymerase for 2 h. Double-stranded cDNA was purified by were designed. using phenol-chloroform extraction and Eppendorf Phase-Lock To decide whether a gene was expressed, a threshold level Gel tubes, and ethanol was precipitated, washed with 80% representing the ‘‘background signal’’ was calculated for the ethanol, and resuspended in 3 ␮l of water. In vitro transcription whole genome based on the average expression levels in the was used to produce biotin-labeled cRNA from the cDNA by promoter regions of genes (11). Because promoters are known using the Ambion (Austin, TX) MEGAscript T7 kit. Briefly, 1 to be transcriptionally inactive, it is reasoned that the majority of ␮g of double-stranded cDNA was incubated with 7.5 mM ATP signal on them would be hybridization noise and not real signal. and GTP͞5.6 mM UTP and CTP͞1.9 mM bio-11-CTP and Only genes verified by full-length cDNAs, for which promoter bio-16-UTP (Sigma-Aldrich) in 1ϫ transcription buffer and 1ϫ regions were known for certain, were considered in this calcu- T7 enzyme mix for 5 h at 37°C. Before hybridization,

Identification of Transcribed Sequences in Arabidopsis Thaliana by Using High-Resolution Genome Tiling Arrays

An Open-Sourced Bioinformatic Pipeline for the Processing of Next-Generation Sequencing Derived Nucleotide Reads

Efficient Storage and Analysis of Genome Data in Relational Database Systems

Annual Scientific Report 2011 Annual Scientific Report 2011 Designed and Produced by Pickeringhutchins Ltd

CRAM Format Specification (Version 3.0)

Htslib - C Library for Reading/Writing High-Throughput Sequencing Data

Which Genome Browser to Use for My Data ? July 2019

Training Modules Documentation Release 1

Three Data Delivery Cases for EMBL- EBI's Embassy

An Integrated Pipeline for Metagenomic Taxonomy Identification and Compression Minji Kim1* , Xiejia Zhang1,Jonathang.Ligo1, Farzad Farnoud2, Venugopal V

Variant Calling Objective: from Reads to Variants

Future Development of the Samtools Software Package John Marshall, Petr Daněček, James K

Investigation of Chloride Transport Mechanisms in Arabidopsis Thaliana Root