Supplementary Text

S1. PCR Rescue: In addition to the effect of target size on outcome (Supplementary Figure S1a), success also correlated with the level of gene expression, as measured by the number of reported ESTs corresponding to a given gene (Supplementary Figure S1b). A similar result was obtained by comparing PCR rescue success rate to SAGE tag number (data not shown). Comparing these two expression measures to PCR rescue success for genes with both EST and SAGE data available gave a Pearson rank correlation coefficient of 0.752 (99% CI = 0.74-0.76).

S2. DNA Synthesis: Prior to attempting synthesis of several thousand transcripts, MGC conducted a pilot study to assess the feasibility of using DNA synthesis to prepare cDNA clones in the preferred Gateway Entry vector. MGC assigned to four companies 18 different transcripts, ranging in size from 0.2 kb to 4.0 kb, plus three identical transcripts of 6.7 kb, 8.5 kb, and 11 kb, to be delivered within 90 days. All four companies succeeded with synthesis and cloning of the 20 assigned targets of 0.2 kb to 8.5 kb, and three of the four delivered the 11 kb CDS (dystrophin; NM_004006). Based on the outcome of this pilot, MGC decided to proceed with the synthesis of cDNAs for the remaining outstanding genes.

For the large-scale synthesis of 3647 assigned NM-accession targets, the DNA synthesis companies used automated gene design and construction pipelines to design, synthesize, and verify the MGC constructs. All coding sequences were designed, processed, and analyzed using in-house software platforms. The design strategy determined the oligonucleotide composition and assembly strategy for each construct, taking into account the length and complexity of each sequence. Oligonucleotides were synthesized and subsequently assembled using a combination of ligation and PCR-based methods. Codon Devices included a proprietary mutS error filtration strategy to increase the percentage of correctly assembled clones (Carr et al. 2004). Gene constructs were ligated into in-house vectors, cloned, and verified by standard capillary sequencing prior to final subcloning into pENTR223.1, followed by full sequence confirmation of the inserted sequence and flanking sequences.

The sequence configuration of MGC synthetic CDS clones was tested for its effect on protein synthesis. Six mammalian positive control human CDS inserts flanked by the 5' and 3' sequences used in the MGC synthetic clones produced similar levels of protein in HEK293E cells compared to the same six inserts in Gateway expression vectors with previously reported flanking sequences (Waybright et al. 2008). Protein expression of selected synthetic clones as N-terminal fusion proteins was confirmed in E. coli, and also using in vitro reticulocyte and wheat germ T7-coupled transcription/translation systems (Promega).

S3. Sequence Variation Due to RNA Editing: Published examples of mRNA editing in mammalian cells are limited almost exclusively to A-to-I editing, mediated by the Adenosine Deaminases Acting on RNA (ADAR) family of enzymes (Bass 2002; Gommans 2008). ADARs deaminate specific adenosines in extensively double-stranded regions of RNA. The resulting inosine in the edited RNA is read as guanosine by the in vivo cellular machinery, as well as by the enzymes used in cDNA cloning and sequencing. Editing in the CDS of mRNAs has the potential to expand the diversity of mRNAs and encoded proteins, as illustrated by the essential editing of mRNAs for human and mouse glutamate and serotonin receptor subunit proteins (Burns et al. 1997; Sommer et al. 1991). Editing in non-coding regions has been shown to influence alternative splicing (Rueter et al. 1999) and, in some instances, nuclear retention (Kumar and Carmichael 1997), and has been hypothesized to play a role in mRNA stability and the regulation of translation (Liang and Landweber 2007).

Although some preferred sequence motifs have been noted, structural aspects of the RNA are the primary determinants of ADAR-mediated editing (Bass 2002). RNA duplex structures harboring mismatches, bulges, and loops are more selectively edited than completely double-stranded RNA. Editing of protein-coding sequences (CDS) occurs within regions generally forming only partially double-stranded structures (such as between partially complementary sequences of an exon and a nearby intron in a pre-mRNA) and therefore is usually more selective. In contrast, editing in non-coding regions tends to occur in clusters, due to extensively base-paired, duplexed RNA (such as fold-back hairpin structures formed by complementary Alu sequences) (Bass 2002; Kikuno et al. 2002; Morse et al. 2002).

To date, only about 70 human mRNAs have been reported to contain A-to-I editing sites in the CDS, of which at least eleven are supported by two or more published studies (Suppl. Table S3). In contrast, several thousand examples of A-to-I editing in non-coding sequences of the 5' and 3' UTRs and within introns of pre-mRNA sequences have been reported (Athanasiadis et al. 2004; Kim et al. 2004; Levanon et al. 2004). Approximately 90% or more of the edits were found within repeated sequences, primarily of the Alu family. Because Alu sequences compose ~10% of the human genome (Lander et al. 2001) and are concentrated in gene-rich regions, including introns and UTR sequences (Chen et al. 2002), inverted pairs of Alu sequences provide potential substrates for ADAR-mediated editing in most human pre-mRNAs. Far fewer mouse transcripts display A-to-I edits (Kim et al. 2004), consistent with the lack of Alu repeat sequences in rodents (Neeman et al. 2006; Waterston et al. 2002).

We sought to identify candidate A-to-I editing sites in MGC clones. Because MGC has produced only a single full-CDS clone for most genes, we could not use the occurrence of coincident edits within multiple clones to identify loci of selective RNA editing. Therefore we used two different tests to focus on identifying clones statistically enriched for clusters of A-to-G changes compared to the genome sequence.

For the first test, we followed the methodology of Kim et al. (2004) and sought clusters of 5 or more A-to-G changes occurring within a window of 100 nt, where more than half of all differences noted are A-to-G changes. This search identified 113 human clones with such A-to-G clusters (Suppl. Table S4), but no such mouse clones. This discrepancy between human and mouse is consistent with the observations of Kim et al., and 48 of the 113 MGC clones overlap with the predictions made by these investigators (Kim et al. 2004). As a control, a similar search for G-to-A changes revealed no such clusters within the same population of human and mouse transcripts. However, analogous clusters of three C-to-T changes and two T-to-C changes were observed, suggesting a false-positive rate of about 2%.
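The cluster criterion of the first test can be illustrated with a minimal Python sketch. This is not the code used for the analysis; the function name, the input layout (a list of position, genome base, clone base tuples on the transcript strand), and the reading of "more than half of all differences" as a clone-wide majority are all assumptions made for illustration.

```python
def has_a_to_g_cluster(mismatches, window=100, min_changes=5):
    """Return True if a clone shows a cluster of A-to-G changes.

    mismatches: iterable of (position, genome_base, clone_base) tuples,
    given on the transcript strand (hypothetical input layout).
    """
    mismatches = sorted(mismatches)
    total = len(mismatches)
    a_to_g = [pos for pos, ref, alt in mismatches if (ref, alt) == ("A", "G")]

    # Assumption: A-to-G must be more than half of all differences in the clone.
    if total == 0 or 2 * len(a_to_g) <= total:
        return False

    # Slide over the sorted A-to-G positions: min_changes consecutive
    # positions fit in one window if they span fewer than `window` nt.
    for i in range(len(a_to_g) - min_changes + 1):
        if a_to_g[i + min_changes - 1] - a_to_g[i] < window:
            return True
    return False


clustered = [(p, "A", "G") for p in (10, 20, 30, 40, 50)] + [(300, "C", "T")]
print(has_a_to_g_cluster(clustered))
```

The same function could be reused for the control searches (for example G-to-A) by parameterizing the base pair that is counted.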

Our second test identified clones with at least one 100 nt window of sequence with enough changes of a single type that the probability of observing this window by chance is 10^-8. We defined the probability of observing a window with m changes as p = 0.25*N*r^m, where r is the observed mismatch rate per clone and N is the number of genomic instances of the original nucleotide in the sequence window (the number of As for A-to-I editing). (Since transitions are more common than transversions, 0.25 is a slight underestimate of the number of changes expected for any single type of transition.) We set p to 10^-8, which means that for each 100 nt window we assessed whether there are m changes, where m = log(10^-8/(0.25*N))/log(r). We identified 118 clones with at least one such window of m A-to-G changes and only two clones for G-to-A changes (Suppl. Table S5). These 118 clones include 87 clones identified by test 1 (detailed in Suppl. Table S6).
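As a worked illustration of the threshold in the second test, the sketch below solves p = 0.25*N*r^m for m at p = 10^-8. The function names are hypothetical, r is assumed to be a per-nucleotide mismatch rate between 0 and 1, and rounding m up to the next whole number of changes is an assumption, since the text does not state how fractional values were handled.

```python
import math


def window_probability(n_sites, mismatch_rate, m):
    """p = 0.25 * N * r^m, as defined in the text."""
    return 0.25 * n_sites * mismatch_rate ** m


def min_changes_for_window(n_sites, mismatch_rate, p_threshold=1e-8):
    """Smallest whole m for which p does not exceed p_threshold.

    n_sites: N, number of genomic instances of the original nucleotide
             in the 100 nt window (number of As for A-to-I editing).
    mismatch_rate: r, assumed here to be a per-nucleotide rate in (0, 1).
    """
    if n_sites <= 0 or not (0.0 < mismatch_rate < 1.0):
        raise ValueError("need n_sites > 0 and 0 < mismatch_rate < 1")
    m = math.log(p_threshold / (0.25 * n_sites)) / math.log(mismatch_rate)
    # Assumption: round up, and require at least one change.
    return max(1, math.ceil(m))


# Example with illustrative (not reported) values: 25 As in the window
# and a 1% mismatch rate for the clone.
print(min_changes_for_window(25, 0.01))
```

With these illustrative inputs the unrounded value of m is about 4.4, so a window would need five changes of the same type to fall at or below the threshold.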

For each of the 87 clones detected by both tests, we identified the 100 nt window of sequence with the highest frequency of A-to-G changes. Analysis of these windows confirmed the previously reported result that most RNA edits are within Alu-containing UTRs. Specifically, 89% (77/87) of the windows partially or completely overlap Alu sequences and 88% of the 585 A-to-G changes are within UTRs. Details of the candidate edited sites detected are provided in Supplementary Table S6. Eleven clones within the set of 87 common to both tests show evidence of CDS editing, including seven genes that to our knowledge have not been reported previously to have edits in the CDS (Supplementary Table S7).

We also examined MGC clones for 69 genes previously reported as being subject to ADAR-mediated editing in the CDS, 59 of which are represented in MGC (Supplementary Table S3). Three MGC clones in this set of 59 genes showed A-to-G changes, two within the CDS, although none met the threshold of either of our tests. It is not surprising that we do not detect more of these previously reported CDS edits, because our largely “single-clone-per-gene” MGC clone set is not well-suited for detecting highly selectively edited loci (such as the CDS). For two of the 69 human genes (GABRA3 and C20orf30), our analysis provides the second publication supporting A-to-I editing of the CDS. The sites of putative editing are detailed in Supplementary Table S6.

GO-category enrichment, analyzed using DAVID (Dennis et al. 2003; Huang et al. 2008) (http://david.abcc.ncifcrf.gov/), revealed after Benjamini multiple testing correction that the list of 87 edited genes in common to both tests 1 and 2 is significantly enriched for genes encoding zinc finger and KRAB binding domains (p < 5 x 10^-4). However, because these GO terms are the most common among all human genes, the biological relevance of this result is uncertain.

REFERENCES

Athanasiadis, A., Rich, A., and Maas, S. 2004. Widespread A-to-I RNA editing of Alu-containing mRNAs in the human transcriptome. PLoS Biol 2: e391.
Bass, B.L. 2002. RNA editing by adenosine deaminases that act on RNA. Annu Rev Biochem 71: 817-846.
Burns, C.M., Chu, H., Rueter, S.M., Hutchinson, L.K., Canton, H., Sanders-Bush, E., and Emeson, R.B. 1997. Regulation of serotonin-2C receptor G-protein coupling by RNA editing. Nature 387: 303-308.
Carr, P.A., Park, J.S., Lee, Y.J., Yu, T., Zhang, S., and Jacobson, J.M. 2004. Protein-mediated error correction for de novo DNA synthesis. Nucleic Acids Res 32: e162.
Chen, C., Gentles, A.J., Jurka, J., and Karlin, S. 2002. Genes, pseudogenes, and Alu sequence organization across human chromosomes 21 and 22. Proc Natl Acad Sci U S A 99: 2930-2935.
Dennis, G., Jr., Sherman, B.T., Hosack, D.A., Yang, J., Gao, W., Lane, H.C., and Lempicki, R.A. 2003. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 4: P3.
Gommans, W.M. et al. 2008. Diversifying Exon Code Through A-to-I RNA Editing. In RNA and DNA Editing: Molecular Mechanisms and Their Integration into Biological Systems, (ed. H.C. Smith), pp. 3-30. John Wiley & Sons, Inc.
Huang, D.W., Sherman, B.T., and Lempicki, R.A. 2008. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4: 44-57.
Kikuno, R., Nagase, T., Waki, M., and Ohara, O. 2002. HUGE: a database for human large proteins identified in the Kazusa cDNA sequencing project. Nucleic Acids Res 30: 166-168.
Kim, D.D., Kim, T.T., Walsh, T., Kobayashi, Y., Matise, T.C., Buyske, S., and Gabriel, A. 2004. Widespread RNA editing of embedded alu elements in the human transcriptome. Genome Res 14: 1719-1725.
Kumar, M. and Carmichael, G.G. 1997. Nuclear antisense RNA induces extensive adenosine modifications and nuclear retention of target transcripts. Proc Natl Acad Sci U S A 94: 3542-3547.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W. et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.
Levanon, E.Y., Eisenberg, E., Yelin, R., Nemzer, S., Hallegger, M., Shemesh, R., Fligelman, Z.Y., Shoshan, A., Pollock, S.R., Sztybel, D. et al. 2004. Systematic identification of abundant A-to-I editing sites in the human transcriptome. Nat Biotechnol 22: 1001-1005.
Li, J.B., Levanon, E.Y., Yoon, J.K., Aach, J., Xie, B., Leproust, E., Zhang, K., Gao, Y., and Church, G.M. 2009. Genome-wide identification of human RNA editing sites by parallel DNA capturing and sequencing. Science 324: 1210-1213.
Liang, H. and Landweber, L.F. 2007. Hypothesis: RNA editing of microRNA target sites in humans? RNA 13: 463-467.
Morse, D.P., Aruscavage, P.J., and Bass, B.L. 2002. RNA hairpins in noncoding regions of human brain and Caenorhabditis elegans mRNA are edited by adenosine deaminases that act on RNA. Proc Natl Acad Sci U S A 99: 7906-7911.
Neeman, Y., Levanon, E.Y., Jantsch, M.F., and Eisenberg, E. 2006. RNA editing level in the mouse is determined by the genomic repeat repertoire. RNA 12: 1802-1809.
Rueter, S.M., Dawson, T.R., and Emeson, R.B. 1999. Regulation of alternative splicing by RNA editing. Nature 399: 75-80.
Sommer, B., Kohler, M., Sprengel, R., and Seeburg, P.H. 1991. RNA editing in brain controls a determinant of ion flow in glutamate-gated channels. Cell 67: 11-19.
Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P. et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520-562.
Waybright, T., Gillette, W., Esposito, D., Stephens, R., Lucas, D., Hartley, J., and Veenstra, T. 2008. Identification of highly expressed, soluble proteins using an improved, high-throughput pooled ORF expression technology. Biotechniques 45: 307-315.