Chemical synthesis rewriting of a bacterial to achieve design flexibility and biological functionality

Jonathan E. Venetza, Luca Del Medicoa, Alexander Wolfle¨ a, Philipp Schachle¨ a, Yves Buchera, Donat Apperta, Flavia Tschana, Carlos E. Flores-Tinocoa, Marielle¨ van Kootena, Rym Guennouna, Samuel Deutschb, Matthias Christena,1, and Beat Christena,1

aInstitute of Molecular Systems Biology, Eidgenossische¨ Technische Hochschule Zurich,¨ CH-8093 Zurich,¨ Switzerland; and bDepartment of Energy Joint Genome Institute, Walnut Creek, CA 94598

Edited by David Baker, University of Washington, Seattle, WA, and approved March 6, 2019 (received for review October 29, 2018)

Understanding how to program biological functions into artificial been covered. The redesigned chromosomes removed repetitive DNA sequences remains a key challenge in . sequences (tRNA genes, , and transposons) to increase Here, we report the chemical synthesis and testing of Caulobacter targeting fidelity during stepwise homologous replacement as ethensis-2.0 (C. eth-2.0), a rewritten bacterial genome composed well as included the seeding of loxP sites to permit iterative of the most fundamental functions of a bacterial cell. We rebuilt genome reduction on completion of yeast chromosomes. In the the essential genome of Caulobacter crescentus through the pro- beginning of the yeast 2.0 synthesis project, CRISPR had not yet cess of chemical synthesis rewriting and studied the genetic entered the stage, but today, it offers an alternative approach for information content at the level of its essential genes. Within progressive genome reduction. the 785,701-bp genome, we used sequence rewriting to reduce The redundancy of the defining the same amino the number of encoded genetic features from 6,290 to 799. acid by multiple synonymous codons offers the possibility to Overall, we introduced 133,313 base substitutions, resulting in erase and reassign codons throughout an entire genome. Such the rewriting of 123,562 codons. We tested the biological func- rewriting efforts are used to engineer with altered tionality of the genome design in C. crescentus by transposon genetic codes and free up codons for incorporation of artifi- mutagenesis. Our analysis revealed that 432 essential genes of cial amino acids, which do not occur within natural organisms. C. eth-2.0, corresponding to 81.5% of the design, are equal in To date, genome-wide rewriting efforts have been primarily functionality to natural genes. These findings suggest that nei- reported for viral (12–14), and a few are focused on the ther changing mRNA structure nor changing the codon context rewriting of microbial genomes of Escherichia coli, Salmonella, have significant influence on biological functionality of synthetic and S. cerevisiae. Using oligo-mediated recombineering (15), all genomes. Discovery of 98 genes that lost their function identi- 321 instances of the TAG stop codon in E. coli were altered to fied essential genes with incorrect annotation, including a limited TAA, demonstrating the dispensability of a stop codon within set of 27 genes where we uncovered noncoding control fea- the genetic code (16). In an extension of this approach, rewrit- tures embedded within protein-coding sequences. In sum, our ing of 13 sense codons across a set of ribosomal genes (17) results highlight the promise of chemical synthesis rewriting to and genome-wide rewriting of 123 instances of the arginine rare decode fundamental genome functions and its utility toward the codons AGA and AGG (18) were accomplished in E. coli. These design of improved organisms for industrial purposes and health studies unearthed unexpected recalcitrant synonymous rewriting benefits.

Caulobacter crescentus | chemical genome synthesis | genome rewriting | Significance synonymous recoding | de novo DNA synthesis The fundamental biological functions of a living cell are stored within the DNA sequence of its genome. Classical genetic n the early 2000s, the template-independent chemical synthesis approaches dissect the functioning of biological systems by Iof the 7.4-kb polio virus (1) and 5.4-kb phiX174 analyzing individual genes, yet uncovering the essential gene genomes (2) using oligonucleotides has ushered in the field of set of an has remained very challenging. It is argued synthetic genomics. The initial progress on moderately sized viral that the rewriting of entire genomes through the process of genomes has spurred whole-genome synthesis of more complex chemical synthesis provides a powerful and complementary organisms. In 2008 and 2010, the Craig Venter Institute reported research concept to understand how essential functions are the chemical synthesis of genome replicas from Mycoplasma programed into genomes. genitalium (583 kb) and Mycoplasma mycoides (1.1 Mb) (3, 4), respectively. These efforts expanded the chemical synthesis scale Author contributions: M.C. and B.C. designed research; J.E.V., L.D.M., A.W., P.S., Y.B., to megabases and improved in vitro DNA assembly strategies D.A., F.T., C.E.F.-T., M.v.K., R.G., and S.D. performed research; J.E.V., L.D.M., M.C., and and genome transplantation methods. However, the work also B.C. analyzed data; and J.E.V., M.C., and B.C. wrote the paper.y highlighted the challenges of whole-genome synthesis, as a sin- Conflict of interest statement: Eidgenossische¨ Technische Hochschule holds a patent gle missense within the dnaA gene initially prevented application (WO2017085249A1) with M.C. and B.C. as inventors that covers functional boot up. To gain insights into a minimal gene set for cellular life, testing of synthetic genomes. M.C. and B.C. hold shares from Gigabases Switzerland AG.y the teams of Craig Venter built a 473-gene reduced version of This article is a PNAS Direct Submission.y the M. mycoides genome (5). This open access article is distributed under Creative Commons Attribution-NonCommercial- Along with these accomplishments, the concept of whole- NoDerivatives License 4.0 (CC BY-NC-ND).y genome synthesis and genome minimization has been expanded Data deposition: The sequence of the C. eth-2.0 genome reported in this paper has been deposited in the National Center for Biotechnology Information database (GenBank toward the rebuilding of all 16 chromosomes of Saccharomyces accession no. CP035535).y cerevisiae driven by an international consortium composed of 1 To whom correspondence may be addressed. Email: [email protected]. 21 institutions. In 2014, the consortium reported synthesis of ch or [email protected] the artificial yeast chromosome synIII (273 kb) (6). Subse- This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. quently, five additional chromosomes (7–11) were generated, 1073/pnas.1818259116/-/DCSupplemental.y and as of 2018, roughly 40% of the entire yeast genome has Published online April 1, 2019.

8070–8079 | PNAS | April 16, 2019 | vol. 116 | no. 16 www.pnas.org/cgi/doi/10.1073/pnas.1818259116 Downloaded by guest on September 24, 2021 Downloaded by guest on September 24, 2021 h mato oecmlxrwiigshms env DNA al. et novo Venetz de schemes, investigate rewriting to complex Recently, more of 19). impact (18, the 5 sequences of vicinity protein-coding the of in primarily occurred that events and defined fully function. are desired specific their parts a for DNA to encode rewritten assigned only to resulting been The not used function. have was that biologic Methods) features sequence and putative itself. (Materials erase part sites DNA rewriting binding essential the sequence ribosome of functionality or Computational for TSSs, required not ORFs, DNA are that alternative nondisruptable (RBSs) putative the for pinpoints encode lines] may native gray [transpo- the insertions as within transposon plotted of of regions are Absence parts hits pairs. DNA (Tn) base essential son few of a set of entire resolution the identify was to sequencing Transposon used process. parts rewriting DNA synthesis of chemical (B and positions genomes. cation connect oper- rewritten (blue) (blue) basic and Lines the native design cell. run between bacterial to genome a required of genes rewritten system essential a ating of list into entire the reorganized native comprising and the from (gray) extracted were genome parts DNA 1,745 cess; eth-1.0 C. 1. Fig. atdsg,cmiain n hmclsnhssrwiigo the of rewriting synthesis chemical and compilation, design, Part B A ceai ersnaino h iia einpro- design digital the of representation Schematic (A) genome. hits Tn ORFs TSS RBS 0.8 0.4 1.2 n rewritten essential a

Essential t

iv C e 1.6

0

essential genes

. C ei end begin

Caulobacter

sequence rewriting

0

676 rewritten e .

t c

0.2 h

r

C. eth-1.0

- e 2

s 2.0

DNA

. c

0.4

0 e

eoe uhesnilDAparts DNA essential Such genome. g n

C. eth-2.0

e t 0.6 u

oko ftepr identifi- part the of Workflow )

DNA part n

s

o 2.4

m g

e

e n

gene o

m 500 bp 4.0

Caulobacter 2.8 e

0

Caulobacter n 3 and 3.6 3.2 0 termini NA1000 ta at oe eune.DAprswr xrce rmtenative NC no. the accession (NCBI) from Information extracted Biotechnology were pre- pro- parts the endogenous Caulobacter DNA including resolution sequences. genes, pair essential moter base of with from coordinates identified sequencing genome cise that bacterial transposon (30) a high-resolution list dataset building genome published entire for the previously well-annotated parts generated a a DNA computationally essential into We of 29). integrated omics (28, measure- been model multidimensional ribosome-profiling have and which (27) ments essential transcriptome- for and of (22–25) (26) set organism bacterium entire freshwater model the the 1A). from (Fig. encoding sequences Build design DNA to genome List terial Part Essential Results genes. essential its content of information level genetic the the at studied we of writing, synthesis genome cal essential genome the customized crescentus a rebuilding into a By cell the present a sequence. program of We functions to cell. fundamental approach most bacterial design–build–test a applicable of broadly functions fundamental most 2.0 hogpterrdanssapoc opoeterwiigof rewriting the probe high- to genomes. applicable entire approach broadly diagnosis a error lacks currently throughput gene field and the genes been (21), individual has using clusters schemes progress recoding some study pro- to while are made However, functions genomes. biological general into the how gramed onto of light principles shed design will sequence causes system- error imprecise a of with investigation as conjunction atic in well hypothesized rewriting We synonymous as cause. massive (CDSs) underlying that the of at sequences are presence annotations signals coding genome that of control speculated termini translational been the and remained has has transcriptional It debugging embedded 19). and defined, (17, ill challenging remained have ciples synthesis genome 57-codon complete the a with gene underway. toward (21), strategies synthesis of reported novo was replacement rewriting genome de genomic the Ongoing with 20). for (15, conjunction used in been cassettes have methods synthesis omnpolmo aua eoesqecs hc have facilitate than which rather sequences, information are genome biological maintain constraints natural to of Synthesis evolved problem constraints. common synthesis a of multitude a of blocks thesis. of Rewriting Sequence the of location native replication. the at inserted in was stringent replication permitting sequence, copy vector and 10 low shuttle 1 of (31) pMR10Y (Table set across the per- design a seeded and genome were and (ARSs) ADE2) the sequences assembly LEU2, replicating MET14, faithful autonomous HIS3, in for (TRP1, inter- maintenance genes select 1,015 stable bacterial To and mit noncoding, a sequences. 54 of genic protein-coding, functions 676 fundamental including most Cumulatively, the cell. for termed encodes design genome orien- a bp and 1A into organization (Fig. gene concatenated tation preserving and design rules genome design digital predefined to according ee erpr h hmclsnhssof synthesis chemical the report we Here, prin- design rewriting underlying the progress, this Despite ,abceilmnmzdgnm opsdo the of composed genome minimized bacterial a eth-2.0), (C. ewr nbet ban3 o4k N building DNA 4-kb to 3- obtain to unable were We (Caulobacter .eth-1.0 C. Caulobacter A00gnm eune[ainlCne for Center [National sequence genome NA1000 and PNAS .eth-1.0 C. 3) h eutn 785,701- resulting The (31). Appendix) SI rmcmeca N upir u to due suppliers DNA commercial from .eth-1.0 C. eefe)truhtepoeso chemi- of process the through hereafter) srcgie sa xust elcycle cell exquisite an as recognized is .cerevisiae, S. | pi 6 2019 16, April albce ethensis-1.0 Caulobacter uorpi marker auxotrophic cerevisiae , S. .eth-1.0. C. ossso ,6 N parts, DNA 1,761 of consists oEal eNv eoeSyn- Genome Novo de Enable to .Furthermore, Appendix). SI and coli, E. | ecnevdabac- a conceived We o.116 vol. Caulobacter albce ethensis- Caulobacter | Caulobacter, o 16 no. Caulobacter Caulobacter .eth-1.0) (C. 011916.1] rgnof origin .coli E. | 8071

BIOPHYSICS AND SYSTEMS BIOLOGY COMPUTATIONAL BIOLOGY Table 1. Part list used to build the C. eth-1.0 genome design additional information beyond the amino acid code is necessary DNA part category Quantity Size (bp) Fraction (%) for proper functioning. Achieving functional C. eth-2.0 genes, however, will provide fully defined artificial genes composed of Protein-coding sequences 676 660,789 83.9 a minimized number of genetic elements. The precise knowl- Essential 462 471,072 59.8 edge of which genes remain functional and the subsequent repair Semiessential 113 114,270 14.5 of nonfunctional genes will ultimately lead to a fully defined Redundant 15 14,970 1.9 artificial cell. Nonessential 86 60,477 7.7 Noncoding sequences 54 9,726 1.2 Chemical Synthesis of C. eth-2.0. We computationally devised a tRNA 44 3,455 0.4 four-tier DNA assembly strategy starting from 3- to 4-kb assem- rRNA 3 4,387 0.6 bly blocks to build the complete C. eth-2.0 chromosome in yeast ncRNA 7 1,884 0.2 (32) (Fig. 2A). Demonstrating the ease of genome-scale synthesis Intergenic sequences 1,015 96,043 12.2 on sequence rewriting, 235 of 236 blocks were successfully man- Genome replication and assembly 16 19,143 2.5 ufactured (SI Appendix), and only a single DNA block required Click markers* 5 6,352 0.8 custom synthesis. We progressively assembled these initial 236 ARS 10 2,121 0.3 DNA blocks into 37 chromosome segments (19–22 kb in size) pMR10Y† 1 10,670 1.4 and further into 16 megasegments (38–65 kb in size) (Fig. 2A Total no. of DNA parts 1,761 785,701 and SI Appendix) using yeast transformation. To select for the complete chromosome assembly, we applied a click marker strat- *Auxotrophic selection markers that were used to direct the assembly and maintenance of the genome design in yeast. ncRNA, noncoding RNA. egy by introducing five auxotrophic yeast genes (TRP1, HIS3, †The pMR10Y shuttle vector contains a broad host-range RK2-based low MET14, LEU2, and ADE2) split between adjacent megaseg- copy replicon (GenBank accession no. AJ606312.1), a kanamycin selec- ments. On correct chromosome assembly in an engineered yeast tion marker, and oriT function for conjugational transfer from E. coli to strain lacking all auxotrophic marker genes (YJV04), click mark- Caulobacter as well as URA3 marker, ARS, and centromere (CEN) elements ers will form functional genes and reconstitute prototrophy for selection and replication in yeast. (Fig. 2B). Initial attempts to assemble the C. eth-2.0 chromosome from 16 megasegments were not successful. Sequencing of yeast clones with partial C. eth-2.0 assemblies identified two defec- chemical synthesis. Recent bioinformatics work (31) showed tive ARS elements (ARS416 and ARS1213), which prevented that more than three-quarters of all deposited bacterial genome replication of the full-length chromosome. We corrected these sequences are not amenable for low-cost synthesis. We hypothe- sized that computational synonymous rewriting into an easy to synthesize sequence would facilitate chemical synthesis of the Table 2. Sequence rewriting of C. eth-1.0 into C. eth-2.0 leads to 785-kb genome while maintaining the encoded biological func- massive reduction of genetic features tions (SI Appendix, Fig. S1A). We used our previously reported Type C. eth-1.0 C. eth-2.0 Fraction (%) computational DNA design algorithms (31, 32) and generated a synthesis-optimized genome design termed C. eth-2.0. Cumu- Sequence rewriting latively, we introduced 10,172 base substitutions and removed Base substitutions None 133,313 17.0 5,668 synthesis constraints (Table 2). These are composed of Rewritten codons* None 123,562 56.1 1,233 repeats, 93 homopolymeric stretches, and 4,342 regions of Codons high guanine-cytosine (GC) content (Table 2), known to hinder TTG 1,154 0 100 chemical DNA synthesis. Moreover, we erased additional 1,045 TTA 46 0 100 endonuclease restriction sites to facilitate standardized assem- TAG 173 10 94.2 bly of the DNA building blocks into the 785-kb chromosome Alternative genetic features of C. eth-2.0. ORFs† 3,229 407 87.4 TSS‡ 1,730 82 95.3 Sequence Rewriting to Minimize the Number of Genetic Features. RBS§ 1,331 310 76.7 We reasoned that chemical synthesis rewriting offers a powerful Remaining genetic features¶ 6,290 799 experimental approach to probe the accuracy of existing genome DNA synthesis constraints annotations and study where additional layers of information High GC regions# 4,342 0 100 exist beyond the primary amino acid code. Furthermore, funda- Direct repeats ≥8 bp 880 113 87.2 mental functions encoded within the essential genomes can be Hairpins ≥8 bp 606 140 76.9 identified. In addition to the base substitutions introduced for Homopolymers 139 46 66.9 synthesis streamlining, we used computational sequence design Restriction sitesk 1,047 2 99.8 algorithms (31, 32) to deliberately add 123,141 base substitutions Synthesis constraints** 7,014 301 within protein-coding sequences to yield the rewritten C. eth-2.0 *Number of synonymous codon substitutions introduced on sequence design (Table 2) (33). In C. eth-2.0, we replaced 56.1% of all rewriting. codons by synonymous versions. While the amino acid sequence †Number of alternative ORFs residing within the 676 CDSs of C. eth-1.0 and of the 676 annotated genes was maintained, rewriting enabled C. eth-2.0. us to minimize the number of hypothetical genetic elements ‡Number of TSSs internal to CDSs. present within protein-coding sequences of C. eth-2.0. These §Number of ribosome binding sites (RBSs) internal to CDSs. elements include alternative ORFs, predicted gene internal tran- ¶Number of remaining genetic features within CDS of C. eth-1.0 and C. eth-2.0. scriptional start sites (TSSs), and sequence motifs (predicted or # cryptic) that may fine tune rates (Fig. 1B and Mate- Regions of high GC content > 0.8 within a 100-bp window. kTotal number of type IIS restriction sites that were removed (AarI, BsaI, rials and Methods). Overall, we removed 87.4% of all putative BspQI, PacI, PmeI, I-CeuI, I-SceI). Note that the two unique PmeI and PacI ORFs (2,822 of 3,229) (SI Appendix, Fig. S2C), 95.3% of all sites remained within the pMR10Y backbone to facilitate linearization of internal TSSs (1,648 of 1,730), and 76.7% of all predicted ribo- the final assembled chromosome for subsequent analysis by pulsed field gel some stalling motifs (1,021 of 1,331) (Table 2). Testing whether electrophoresis. rewritten genes remain functional will identify genes in which **Number of DNA synthesis constraints of C. eth-1.0 and C. eth-2.0.

8072 | www.pnas.org/cgi/doi/10.1073/pnas.1818259116 Venetz et al. Downloaded by guest on September 24, 2021 Downloaded by guest on September 24, 2021 7 n 1Pwr nrdcddrn emn n megaseg- and and segment yeast during in introduced assembly were ment L12P and S7P genes additional and four the thetase) Only within suppliers. commercial 21 two missense of of pro- one were total by that blocks vided a DNA 17 perfect 2E). Thereof, detected nonsequence ). S3 from (Fig. we emanated Table Appendix, process design, (SI mutations synthesis genome nonsynonymous 785-kb genome the the Across of performance the for morphologies S4B). Fig. the cell Appendix, bearing elec- YJV04 yeast and observation, normal cells this parental showed with micrographs agreement In tron S4A). Fig. rearrangements Appendix, chromosomal occur- or no found mutations within we adaptive generations, of 60 rences over for cultivation. propagation prolonged After on sequencing whole-genome performed be to proven have far, yeast So in (35). sequences genome. difficult GC yeast high low native clone exhibit the to attempts 34) matching 8, closely (5, contents GC chromosomes synthesized chemically PCR ous 2E). diagnostic (Fig. sequencing 2D), eth-2.0 whole-genome (Fig. C. and S3), electrophoresis Fig. Appendix, gel (SI field indicating pulsed by markers, of click presence the firmed auxotrophic six of assembly complete all restored which for of one clones, prototrophy two yielded spheroplasts yeast into yeast. in eeze al. et Venetz pro- to GC-rich sequences the ARS of replication additional efficient five mote added and errors shown. design are (Bottom) 785-kb assembly the (E chromosome controls. of complete as validation the serve and Size digest) (Middle) (D) (YJV04 level chromosomes medium. megasegment yeast (SD) and PacI-digested defined (Top) and level synthetic PmeI- segment and requires at (marker) and coverage Undigested grow pMR10Y. vector to shuttle fails the (YJV04) segments from strain genome parental 37 the into eth-2.0 while assembled C. were markers, boxes) auxotrophic (green all (C complete blocks for markers. the DNA LEU2 into 236 and assembled (blue); TRP1 further PacI auxotrophic and and boxes) PmeI (orange for eth-2.0 megasegments sites C. 16 restriction and the boxes) and (black), (blue ARSs 11 (red), markers 2. Fig. 727 ARS ARS4 ADE2 A ARS_HI LEU2 esqec verified sequence We whether assess To megasegments corrected 16 the of transformation One-step ARS516 sebyof Assembly niaigsal aneac nYV4(SI YJV04 in maintenance stable indicating eth-2.0 , C. ARS1323 hoooewsasmldi igerato rm1 eaemnsb es peols rnfrainadsbeun rwhslcinfor selection growth subsequent and transformation spheroplast yeast by megasegments 16 from reaction single a in assembled was chromosome hoooeb usdfil e lcrpoei.Dgsinwt mIadPc eessa71k oto fthe of portion 771-kb a releases PacI and PmeI with Digestion electrophoresis. gel field pulsed by chromosome a ihG otn xedn 7,wieprevi- while 57%, exceeding content GC high a has fabI

236 Blocks

16 Mega-segments C. eth 37 Segments .eth-2.0 C. ay-are rti)adterbsmlgenes ribosomal the and protein) (acyl-carrier URA3 ARS1213 PmeI, PacI MET14 .eth-2.0 C. .eth-2.0 C.

-2.0 785,701 bp .eth-2.0 C. , ARS209 .eth-2.0 C. in

ceai ersnaino h iclr785,701-bp circular the of representation Schematic (A) cerevisiae . S. epciey oadditional No respectively. coli, E. ssal anandi es,we yeast, in maintained stably is .W usqetycon- subsequently We 2C). (Fig. sasnl iclrchromosome circular single a as tec sebylvlt assess to level assembly each at ARS1113 positive eth-2.0) (C. 2 clone yeast identified Ade and Leu, Met, His, Trp, Ura, lacking medium on selection Growth ) .eth-2.0 C. argS ARS1018 .eth-2.0 C. ARS_Max2 TRP1 ARS416 agnltN syn- (arginyl-tRNA HIS3 hoooe(SI chromosome D B chromosome 565 kb 680 kb 750 kb 825 kb 945 kb Met Ura

J0 YJV04 YJV04 - , Leu - , Trp Marker 16 Mega-segments - - , Ade , His YJV04 digest - - iooa rti eecutr(CETH (fabB, cluster gene protein acid (rpoC , ribosomal fatty components polymerase and RNA LPS in lptD, involved we genes genes, (dnaQ, toxic genes six the replication Among chromosome three S6 ). three Table found additional Appendix, for (SI An selection S5) growth on Table fast deletions toxicity. small the Appendix , acquired alleviating (SI segments to chromosome mutations loci Methods). led genetic bear and toxic strains that 14 suppressor (Materials of of identification segments sequencing genome Whole-genome tolerated that toxic strains evolved formerly identified and analysis Appendix 18.9 suppressor cover (SI collectively a sequence that observed genes suggest- in toxic segments, we of 12 However, for presence S4). efficiency ing segments transfer Table chromosome in and 37 reduction S5 drastic of 25 Fig. in Appendix, absent (SI were genes toxic from that transfer conjugational of coli Quantification the genes. of toxic form of the in design on individual the 37 residing tested genes we Therefore, toxic Caulobacter. eth-2.0 that speculated genomes We C. 37). microbial expression (36, dosage-sensitive natural and products toxic that for encoding genes studies contain sequencing genome Genes. Toxic of Mapping build genome the in fidelity process. the sequence of high a assembly indicating final mosome, the in occurred mutations oiiy vrl,teosre rcino .%(4o 3)toxic 730) of for (14 cause complexes, 1.9% likely of form a fraction as observed that the dosage Overall, subunit proteins toxicity. in imbalance interacting an for suggesting encode genes antiporter sodium-proton digest C. eth-2.0 + C. eth-2.0 Met Ura to lpxD, + Caulobacter 771 kb C. eth-2.0 + , Leu , Trp + + accC, , Ade , His .eth-2.0 C. ol rvn hoooa aneac in maintenance chromosomal prevent would .eth-2.0 C. + + murU, C E .eth-2.0 C. ncnucinwt eunigdemonstrated sequencing with conjunction in PNAS Sequencing coverage [log10] C. eth-2.0 1.0 3.0 1.0 3.0 1.0 3.0 h complete The (B) track). gray (outermost genome YJV04 twspeiul eotdi clone-based in reported previously was It 0 0 0 0 0 700 600 500 400 300 200 0 hoooesget o h presence the for segments chromosome waaF | utpeo h dnie toxic identified the of Multiple nhaA. .W are u genetic out carried We S4). Table , hoooewt i uorpi selection auxotrophic six with chromosome pi 6 2019 16, April 100 ,tognsecdn interacting encoding genes two ), eetv eimSDmedium Selective medium -Met, -Leu,-Ade -Ura, -Trp, -His, Genome coordinates[kb] .eth-2.0 C. ,teS10-spc-alpha the topA), | Mega-segment levelin 10-12) n the and 01304-01323), o.116 vol. Segment levelin hoooe(arrow) chromosome .eth-2.0 C. N sequencing DNA ) | dnaB, o 16 no. C. eth- ± E. coli E. coli . kb 3.6 rarA), | chro- 2.0 8073 E.

BIOPHYSICS AND SYSTEMS BIOLOGY COMPUTATIONAL BIOLOGY genes found in C. eth-2.0 is well in agreement with the previously for proper gene functioning. We asked whether rewritten genes reported average of 2.15 ± 0.8% of toxic genes identified among are functionally equivalent to native genes despite the massive seven E. coli strains (36). We concluded that computational level of sequence modification introduced. We subjected 37 sequence rewriting as part of the chemical synthesis rewriting merodiploid test strains bearing C. eth-2.0 chromosome seg- process does not induce additional gene toxicity. In agreement ments as episomal copies along the native chromosome to with this hypothesis, 6 genes among the identified 14 toxic transposon mutagenesis to test gene functionality. We compared rewritten genes have been previously identified as “unclonable transposon insertion patterns obtained between complementing genes” in E. coli (37). Furthermore, misbalanced expression and noncomplementing conditions and assessed the function- of rpoC, rarA, dnaB, and accC has previously been reported ality of C. eth-2.0 (Materials and Methods and Dataset S1). to elicit toxicity due to imbalance in protein complex subunit Nucleotide substitutions introduced on rewriting and sequence stoichiometry (38–41). Given the precedence of toxicity for optimization of the C. eth-2.0 genome allowed us to unambigu- wild-type genes, we argue that the toxicity of these genes when ously assign transposon insertions to the native Caulobacter ectopically expressed is likely a general property and is not genome and C. eth-2.0 chromosome segments. Cumulatively, attributed to the rewriting process that maintains identical we found 81.5% (432 of 530) of all essential and semiessential proteins. C. eth-2.0 genes to be functional (Fig. 3B and Table 3). Functional rewritten C. eth-2.0 genes encompass a drastic Genome-Wide Functionality Assessment of C. eth-2.0. While reduction in the number of genetic features (annotated, cryptic, throughout the build process, the C. eth-2.0 genome was or predicted) compared with the wild-type Caulobacter genome maintained in heterologous hosts, we next investigated whether annotation. Maintenance of biologic functionality within rewritten genes resume their anticipated function on intro- rewritten genes suggests dispensability of these genetic features, duction into Caulobacter. Functionality assessment and error and hence, it will lead to refinement of the current genome diagnosis of large-scale DNA constructs are major challenges annotation. During the design process of C. eth-2.0, we have for bioengineering of synthetic genomes. To permit parallel reduced the number of genetic features within CDS from 6,290 functionality assessment of rewritten C. eth-2.0 genes, we to 799 (Table 2). The high functionality level of 81.5% observed developed a transposon-based testing approach. This approach within the rewritten C. eth-2.0 suggests that the large majority of assesses the functionality of rewritten genes in merodiploid test the 6,290 previously annotated and predicted genetic features do strains, which harbor episomal copies of C. eth-2.0 chromosome not adopt essential function. Among the genetic features found segments in addition to the native chromosome. The testing to be dispensable were the three formerly assigned antisense approach measures the functional equivalence between native transcripts (sRNAs) CCNA R0109, R0151, and R0194 internal and rewritten C. eth-2.0 genes through genetic complementation. to rpoC, sufB, and atpD that acquired 16, 17, and 62 base In the presence of functional C. eth-2.0 genes, previously essen- substitutions during the rewriting process, respectively (Fig. 4A). tial native genes become dispensable and acquire disruptive Dispensability of the formerly assigned antisense transcripts transposon insertions (Fig. 3A). In contrast, native genes remain suggests that the majority of chromosomally encoded sRNAs essential and do not tolerate disruptive transposon insertions identified by transcriptome analysis (42, 43) do not elicit an in the presence of nonfunctional rewritten genes (Fig. 3A). In essential function. the case of functional C. eth-2.0 genes, such an analysis will prove that rewritten gene variants are functionally equivalent to Sequence Design Flexibility Beyond Protein-Coding Sequences. The essential native Caulobacter genes. Failure in complementation large majority of the 133,313 base substitutions were intro- will identify specific genes where sequence rewriting in C. eth-2.0 duced within protein-coding sequences of C. eth-2.0. How- erases additional genetic control elements that are important ever, a significant number of nonsynonymous substitutions

A B

Merodiploid Caulobacter strain

Native Caulobacter C. eth-2.0 20K 40K 60K 80K 100K 120K 140K chromosome chromosome segments 1-37

160K 180K 200K 220K 240K 260K 280K 300K TnSeq

Fault diagnosis 320K 340K 360K 380K 400K 420K 440K 460K presence absence of Tn insertions

Caulobacter chromosome 480K 500K 520K 540K 560K 580K 600K 620K Functional genes functional non-functional Non-functional genes C. eth-2.0 640K 660K 680K chromosome Non-essential genes

Fig. 3. Fault diagnosis and error isolation across the C. eth-2.0 chromosome. (A) Functionality assessment of the C. eth-2.0 chromosome. Merodiploid strains bearing episomal C. eth-2.0 chromosome segments (orange and blue circle) are subjected to transposon sequencing (TnSeq). Presence of transposon insertions (blue marks) in a previously essential chromosomal gene (gray arrows) indicates functionality of the homologous C. eth-2.0 gene (blue arrow), while absence of insertions indicates a nonfunctional C. eth-2.0 gene (orange arrow). (B) Functionality map of the C. eth-2.0 chromosome with functional genes (blue arrows), nonfunctional genes (orange arrows), and nonessential control genes (gray arrows).

8074 | www.pnas.org/cgi/doi/10.1073/pnas.1818259116 Venetz et al. Downloaded by guest on September 24, 2021 Downloaded by guest on September 24, 2021 eeze al. et Venetz genes. native multiple of inactivation of functionality with reveals testing Transposon box). (gray anticodons and CCNA 4. Fig. functional- gene impair not sequences did noncoding process. and within synthesis tolerated substitutions DNA frequently base novo were that de as the found such facilitate We regions, to noncoding genes, and tRNA intergenic within inserted were § ‡ categories. shown are class † per numbers gene parentheses. total in vs. genes functional of Numbers ing. functional of *Fraction Translation Category of processes Functionality 3. Table ihtetasoo neto atr ftewl-yecnrlsri genmrs,idctn htatsnetasrpsaennseta.( nonessential. are transcripts antisense that indicating tRNA marks), rewritten (green the strain of control structure wild-type letters; secondary the the of of pattern depiction insertion transposon the with genes of CDSs with embedded scripts processes Cellular replication DNA yohtclproteins Hypothetical production Energy aeoiso ee htdslyasgicn nraei functionality. in increase significant a functionality. display in that decrease genes significant of a Categories display that genes of Categories P elcycle Cell factors Translation tRNAs synthetases tRNA proteins Ribosomal rti turnover Protein envelope Cell ausfrfntoaiyercmn n eniheto ifrn gene different of deenrichment and enrichment functionality for values atpD Total .eth-2.0 C. 09 dtdarw)itra to internal arrows) (doted R0194 Right BC AB eunedsg eiiiywti rewritten within flexibility design Sequence ar irpietasoo neto bu ak)i h rsneo complementing of presence the in marks) (blue insertion transposon disruptive carry ‡ § idrn hmclsnhsso RAgnswr rsdb nrdcn aesbtttos(le nteatcdnam hl anann the maintaining while arms anticodon the in (blue) substitutions base introducing by erased were genes tRNA of synthesis chemical hindering ) prn,crmsmlgnstlrt irpietasoo netos(lemrs hogottentv prn edn osimultaneous to leading , native the throughout marks) (blue insertions transposon disruptive tolerate genes chromosomal , ‡ ‡ .eth-2.0 C. .eth-2.0 C. Functional ee sasse ytasoo sequenc- transposon by assessed as genes 36 (81/110) 73.6% 06 (20/33) 60.6% 42 (34/53) 64.2% 03 (121/134) 90.3% 75 (28/32) 87.5% (24/27) 88.9% (19/28) 67.8% (18/22) 81.8% 72 (123/141) 87.2% (26/31) 83.9% (13/15) 86.7% 15 (432/530) 81.5% (22/25) 88.0% (73/84) 86.9% 39 (34/46) 73.9% rpoC, rpoC, ee codn ocellular to according genes .eth-2.0 C. and sufB, and sufB, .eth-2.0 C. genes* atpD atpD Trp curd1,1,ad6 aesbtttos epciey seta hoooa genes chromosomal Essential respectively. substitutions, base 62 and 17, 16, acquired bu ros.O yoyosrwiig nies rncit CCNA transcripts antisense rewriting, synonymous On arrows). (blue ipnaiiyo nies Ns ceai eitn ipnal nies tran- antisense dispensable depicting Schematic RNAs. antisense of Dispensability (A) genes. n tRNA and .eth-2.0 C. 5.49E-03 2.01E-03 7.45E-04 2.60E-04 2.52E-01 2.22E-01 4.73E-02 6.14E-01 1.12E-02 4.69E-01 4.51E-01 2.81E-01 8.48E-02 1.03E-01 P value Tyr † RAgns (C genes. tRNA yeISrsrcinsts(e letters; (red sites restriction IIS Type . 0 falrbsmlgnslkl oti diinlregulatory additional contain likely genes the to ribosomal close of all other Likewise, of features one-third sequence. 40% protein-coding genetic that annotated for the estimated encode than we genes essential findings, hypothetical these on Based (P tively functional 60.6% underrep- and were 64.2 with genes resented, (44, ribosomal level and enzymatic hypothetical the However, at 45). occurs mainly regulation metabolism that transcrip- bacterial observation of of the with their level correlates within finding low embedded This a CDS. elements control contain translational genes and metabolic tional that idea the over cellular with (P among enriched functionality were 90.3% functionality genes metabolic gene that revealed of processes elements Assignment control CDS. translational and harboring within transcriptional classes cellular of gene levels different of high among identification functionality permit gene would processes of analysis the Processes. that Cellular Among Functionality of Gene of Level level high functions biological a imprint that, DNA. sequences, to into suggest exists protein-coding flexibility findings design from sequence These apart 4B). retained even (Fig. complemen- genes transposon-based measurements our tRNA by tation revealed rewritten as Both function their removed. was synthesis 4 tRNA constraints (Fig. synthesis sequences tRNA DNA wild-type of erase the to within arm present introduced anticodon were tRNA that the genes changes within tRNA base despite two functional the remained example, For ity. Tyr oooyei euneptenhneigDNA hindering pattern sequence homopolymeric a , ucinlt etn of testing Functionality ) auso .5- n .1-,rsetvl)(al 3). (Table respectively) 2.01E-3, and 5.45E-4 of values Trp ermvdaTp I etito ie n in and site, restriction IIS Type a removed we , .eth-2.0 C. PNAS au f26E4 Tbe3.Ti supports This 3). (Table 2.60E-4) of value hoooesget bu ak)compared marks) (blue segments chromosome | pi 6 2019 16, April Left .eth-2.0 C. n oooyei eune (red sequences homopolymeric and ) prn.O complementation On operons. | .eth-2.0 C. o.116 vol. 00,CCNA R0109, Trp .I h case the In B). | ee,respec- genes, n tRNA and ereasoned We o 16 no. Schematic B) 05,and R0151, rpoC, | sufB, 8075 Tyr

BIOPHYSICS AND SYSTEMS BIOLOGY COMPUTATIONAL BIOLOGY elements embedded within their protein-coding sequence. From Genetic Control Features Within the Cell Division Genes. We next a gene regulatory perspective, this is not surprising, as protein investigated the presence of additional genetic features within synthesis is the major consumer of cellular energy in the cell division genes murG, murC, ftsQ, and ftsZ, in which (46). Furthermore, the biogenesis of functional ribosome com- genetic complementation failed with the corresponding C. eth- plexes depends on the concerted transcriptional control of many 2.0 counterparts (Fig. 5A). We hypothesized that computational ribosomal operon genes (47). In sum, our analysis suggests that a sequence rewriting likely erased critical control elements needed low level of additional essential regulatory elements is embedded for proper gene expression. Indeed, we found that rewriting of within the protein-coding sequences of metabolic genes. How- an overlapping CDS upstream of murC corrupted the associated ever, a high number of regulatory elements are embedded within promoter region. Similarly, sequence rewriting of ddlB erased an coexpressed ribosomal genes and other multigene core modules internal ribosome binding site necessary for translation of the of the bacterial cell. downstream gene ftsQ (Fig. 5B). The rewriting of ftsW erased an embedded short transcript (ftsWs) necessary for murG transla- Rewritten C. eth-2.0 Operons Encompass Fully Functional Biological tion (29) (Fig. 5B). Finally, we found a short annotated CDS (29) Modules. Although a significant amount of chemical synthesis upstream of the nonfunctional ftsZ gene. However, sequence rewritten C. eth-2.0 genes are functional on an individual basis, analysis revealed that the wild-type sequence contains a hairpin we hypothesized that additive fitness effects might arise when secondary structure, which resembles a transcriptional attenu- multiple synthetic genes were combined. We thus searched for ation element. This may control ftsZ expression depending on chromosomal transposon insertions within essential Caulobacter the metabolic conditions (Fig. 5B). While additional studies are operons leading to simultaneous inactivation of multiple gene needed to unravel the exact molecular functions of these genetic products due to truncation of a polycistronic mRNA transcript. We observed such transposon insertions within 41 formerly essential Caulobacter operon genes (Fig. 4C and Dataset S3), 1 kb suggesting that the chemical synthesis rewritten C. eth-2.0 genes A may indeed fully encompass functional biological modules com- plementing the function of their native counterparts. One exam- ple includes the mreBCD-rodA operon, which is involved in the coordination of cell wall peptidoglycan biosynthesis machinery. This complex is critical for the generation and maintenance of ftsI ftsL ftsW ddlBftsQ ftsA ftsZ bacterial cell shape (48). We found transposon insertions dis- mraW murE murF mraY murD murG murC murB alaS rupting the polycistronic mRNA, suggesting that the function of the chromosomal mreBCD-rodA operon is complemented by the C. eth-2.0 counterparts (Fig. 4C). Similar patterns of trans- B Translational coupling Promoter poson insertions were obtained for the groEL-ES operon (49) RBS and the membrane protein chaperone operon yidC-yidA (50). murG *** murC Both tolerated disruptive transposon insertions throughout the ftsW native sequence, leading to simultaneous inactivation of multiple ftsWs genes. These findings support the idea that additive fitness effects Attenuator do not likely arise when multiple synthetic genes are combined RBS into functional modules, which will ultimately simplify the build RBS ftsQ ftsZ process of artificial chromosomes by using chemical synthesis rewritten DNA. ddlB CETH_03971 Discovery of Essential Genetic Features Within CDS. We reasoned C that the process of chemical synthesis rewriting offers a pow- 240 180 1800 750 erful experimental approach to map hitherto unknown genetic regulatory elements encoded within the protein-coding sequence 160 120 1200 500 and validate the annotation accuracy of an organism’s genome. 80 250 Discovery of genes that lost their function on rewriting sug- 60 600

gested the presence of additional essential genetic features, LacZ activity 0 0 0 0 which have evaded previous genome annotation efforts. Error murG murC ftsQ ftsZ cause classification of the 98 nonfunctional C. eth-2.0 genes (Materials and Methods) pinpointed to 52 instances of impre- cise annotation of the ancestral Caulobacter genome including repaired repaired repaired misannotated promoter regions and incorrect TSSs predictions. repaired This implies that a significant number of protein-coding genes non-functional non-functional non-functional remain misannotated within curated genomes. We found evi- non-functional dence for 27 transcriptional and translational control signals Fig. 5. Fault diagnosis and repair across the C. eth-2.0 chromosome. (A) embedded within protein-coding sequences that were erased Fault diagnosis across the C. eth-2.0 cell division gene cluster. Transposon due to sequence rewriting. This finding suggests that inter- insertions in the wild-type control (green marks) and on complementation nal transcriptional and translational control elements do not with C. eth-2.0 cell division genes (blue marks) are shown. With the excep- often occur within CDS of the Caulobacter genome. In only 13 tion of the four nonfunctional genes murG, murC, ftsQ, and ftsZ (orange instances, we detected nonfunctional genes due to base substitu- arrows), the large majority of rewritten genes are functional (blue arrows). tions introduced outside of protein-coding sequences to optimize (B) Chemical synthesis rewriting reveals genetic control elements present within the cell division gene cluster, including translational coupling sig- synthesis. Furthermore, six genes acquired deleterious mutations nals (murG), internal ribosome binding sites (RBSs; ftsQ), extended promoter during the build and boot-up process (Dataset S2). These find- regions (murC), and attenuator sequences upstream of ftsZ.(C) Insertion of ings suggest that inaccurate annotation of protein-coding se- the wild-type sequence elements upstream of nonfunctional cell division quences is the main cause for losing functionality on synonymous genes restores gene expression as measured by β-galactosidase assays using rewriting. lacZ reporter gen fusions.

8076 | www.pnas.org/cgi/doi/10.1073/pnas.1818259116 Venetz et al. Downloaded by guest on September 24, 2021 Downloaded by guest on September 24, 2021 and the into elements sequence the wild-type of found upstream we sequence the cluster, the gene of division repair cell that the within elements control eeze al. et Venetz lay- information acid genetic additional amino erases encoded of likely of rewriting the but rewriting synonymous sequences maintains that synonymous sequences speculated 133,313 the protein-coding We introduced in codons. we resulting Overall, 123,562 protein-coding 799. substitutions, within to present base 6,290 features from 785,701-bp genetic sequences the of number Within the the sequences. the on of protein-coding cell assessed genome bacterial its we a of rewriting, of level content synthesis information chemical genetic essential of its process of replicating the content independently information an genetic deter- the for probe genes experimentally genes. not essential of did to set they While valuable cell, set. core highly gene the its were “unknowns”) mine of as one-third studies labeled over func- 65 these to unknown and corresponds with “generic,” This as (55). genes labeled 149 cell. (84 replicative encompasses a tions of also creation it the in However, resulted (5), genome 531-kb as under- such encoded mycoides bacteria, our are M. autonomous synthetic advance functions Indeed, biological to genomes. fundamental within DNA how of of rewriting standing However, the functionality. chemical synthesis: of of opportunity DNA key likelihood a misses design design the genome modest conservative increase only to maintained implementing largely changes have sequences, Contempo- 8) genome 5, genomes. how (3, natural into projects understand genome power- programed to synthetic a rary are concept provides functions research synthesis genomes essential complementary chemical entire chal- and of of very rewriting ful process remained the the has that show through genes we essential Herein, of lenging. function uncov- genes, the native individual ering analyzing by systems biological of is that function with proteins hypothetical pre- unknown. were genes for were 91 49 coding essential, and sumably as function, unknown identified of regions regions cell. living noncoding DNA a individual run to the necessary Of collec- information genetic that the elements store noncoding tively and regions sequences of regulatory protein-coding set also, only but identified not The included (30). sequences essential conditions laboratory sequencing the under of transposon survival 12% using that 54). shown work (25, has experimental reported been recent has circuitry More the of iden- simulation been have and circuit tified, regulatory the recent of In components (26–29). many of lifecycle years, the understanding control analy- that our networks global core revolutionized genetic enabled have has that technologies ses genomic (53). order of restricted are advent temporally progression a The in cycle occur cell and integrated and highly differentiation 51, polar (25, control cycle that cell of bacterial feature notable the A 52). of regulation the understanding crescentus C. Discussion synthetic into improve functions further chromosomes. biological will programing that in principles capabilities design our of opportu- DNA formidable repair uncover a to rational provide nity will for genes allow diagnosis noncomplementing error of to and identification deduced Furthermore, designs. be genome rapidly error can identified, are causes elements genetic essential missing after that, yrbidn h seta eoeof genome essential the rebuilding By functioning the dissect approaches genetic classical Although ftsZ murG .Smlry neto fthe of insertion Similarly, 5C). (Fig. levels expression .Ti suggests This 5C). (Fig. expression gene restored also eue eunerwiigt reduce to rewriting sequence used we eth-2.0, C. tanJV-y30md po 7 ee ihna within genes 473 of up made JCVI-syn3.0 strain a mre sa motn oe raimfor organism model important an as emerged has Caulobacter Caulobacter .eth-2.0 C. sta h euaoyevents regulatory the that is eoei seta for essential is genome Caulobacter ftsZ genes eerestored gene Caulobacter murC, through ftsQ, h opeecrmsm xmlf h tlt forapproach genomes. our designer of produce utility rapidly to the syn- exemplify of chromosome low-cost assembly complete Successful higher-order the content. subsequent GC stretches, and high homopolymeric thesis of con- 93 regions synthesis 4,342 repeats, DNA and 1,233 5,668 introduce including remove collectively to straints, and substitutions 32) base used (31, 10,172 we algorithms process, build design genome entire synthe- sequence the genome simplify the To process. facilitates sis rewriting synthesis chemical how defined fully toward a with functions blueprint. organism genetic synthetic genome a essential of construction in rational our gaps yet knowledge understood, current poorly iden- still 98 on are the findings that genes acknowledge nonfunctional We tified unknown. currently attenu- is essential which the the of as upstream such identified element features, ator genetic of discovery the to genes essential transcribed functionality the of our of 432 mRNA, over of that level revealed the analysis at changes drastic such ing essential the of hidden of as sequences genes. well protein-coding essential within as embedded frames elements reading control alternative include These ers. alycl iiingenes division nonfunctional cell identified faulty of eth-2.0 repair C. Targeted proper functioning. for importance fundamental gene of transcrip- protein- are to within that unknown sequences embedded interesting hitherto coding layers control onto be translational nonfunctional. light and will shed tional genes will particular it studies renders future, These rewriting genome the why omics-informed In unravel knowl- previous our efforts. despite in reannotation gaps persisted have that currently edge we of where genome essential revealed the precisely of 20% than less to sponds entire of be fidelity of can annotation approach subset the genomes. validate rewriting func- a experimentally genome protein-coding to that our than used possible together, rather RNA Taken also for tions. is encode it genes within these contain Alternatively, embedded or CDS. elements misannotated genetic their are essential genes unknown these reason- is hitherto that it functionality, conclude their to functionality for sufficient able not transposon-based sequences is protein-coding genes our these the of synonymous solely by on retaining Since functionality detected assessment. lost as that gene genes rewriting of 98 level mapped the cisely at than enzymes of level expression. allosteric the by at rather occurs interactions functions regu- pro- metabolic that high essential fact the of the be lation for might genes explanation metabolic possible functional of A portion cases. most func- in their tionality retain pathways biosynthetic rewritten that suggesting net- core the metabolic of of Among the work role. up layers make significant that a other genes play enzyme-encoding that to probed 134 seem and the not do of proteins control majority for genetic vast exclusively the Further- encode that (56). ORFs suggest context findings codon our including more, factors, con- many is vivo by in translation trolled studies codon previous that reported that genes fact individual the on the given surprising or is function- finding biological structure, This on ality. influence secondary significant genes, no the essential has context most sequence, codon in mRNA sup- that, to primary suggests counterparts the result natural This to viability. functionality port in equal are genes ntelvlo env N ytei,w eendemonstrate herein we synthesis, DNA novo de of level the On rewriting complete in resulted codons all of 56% of Rewriting loehr h dnie e f9 ofntoa ee corre- genes nonfunctional 98 of set identified the Altogether, pre- study our genes, rewritten functional 432 to addition In .eth-2.0 C. h ee ffntoa ee see vr90%, over even is genes functional of level the eth-2.0, C. ee,a xmlfidwti h usto h four the of subset the within exemplified as genes, .eth-2.0 C. orsodn o8.%o l erte essential rewritten all of 81.5% to corresponding Caulobacter PNAS ev sa xeln trigpitt close to point starting excellent an as serve | murG pi 6 2019 16, April rncitm.Dsieincorporat- Despite transcriptome. , murC, ftsZ | and ftsQ, o.116 vol. ee h ucinof function the gene, .eth-2.0 C. | illead will ftsZ, .eth-2.0 C. o 16 no. | into 8077 and

BIOPHYSICS AND SYSTEMS BIOLOGY COMPUTATIONAL BIOLOGY Our results highlight the promise of chemical synthesis rewrit- YJV04. To transform the segments into the yeast cells, a spheroplast proce- ing of entire genomes to understand how the most fundamental dure was applied. The assembly was verified by a junction-amplifying PCR. functions of a cell are programed into DNA. On the systems The correct size of the construct was verified using pulse field agarose gel engineering level, our design–build–test approach enables us to electrophoresis by lysing the yeast cells inside an agarose plug. harness massive design flexibility to produce rewritten genomes The sequence of the C. eth-2.0 construct was verified using the Illumina NextSeq and iSeq systems. that are customized in sequence while maintaining their biolog- ical functionality. On the level of genome synthesis, our findings Construction of Merosynthetic Caulobacter Test Strains. Sequence-confirmed also highlight how chemical synthesis facilitates rewriting of bio- C. eth-2.0 segments were conjugated from E. coli S17-1 into Caulobacter logical information into DNA sequences that can be physically NA1000 to generate a panel of 37 merosynthetic test strains. The occur- manufactured in a highly reliable manner, thereby reducing costs rence of toxic C. eth-2.0 genes was measured by the conjugation frequency and increasing effectiveness of the genome build process. In sum, of the different segments. To pinpoint the toxic genes, the C. eth-2.0 our results highlight the promise of chemical synthesis rewriting segments were sequenced on an Illumina system after the boot up in to decode fundamental genome functions and its utility toward Caulobacter. Using the sequencing data, the mutations within the evolved design of improved organisms for industrial purposes and health C. eth-2.0 segments were analyzed, yielding the precise coordinates of benefits. toxic genes. Materials and Methods Fault Diagnosis of C. eth-2.0 by Transposon Sequencing. To benchmark the functionality of the C. eth-2.0 genes, transposon sequencing was applied Detailed materials and methods are in SI Appendix. The sequence of the C. eth-2.0 genome has been deposited in the NCBI database (GenBank (30). The analysis was conducted using hypersaturated transposon libraries accession no. CP035535). and an Illumina system. The sequencing data were mapped onto the original Caulobacter genome, resulting in a set of all functional C. eth-2.0 genes. Design of C. eth-1.0 Genome and Sequence Rewriting into C. eth-2.0. To After analyzing the nonfunctional genes, a repair of the sequence was done streamline the C. eth-1.0 design (30) for DNA synthesis, the previously using standard cloning techniques. To test the repaired C. eth-2.0 genes, a reported Genome Caligrapher algorithm and sequence design pipeline (31) β on-galactosidase reporter assay was conducted. were applied at a codon recoding probability of 0.56. The streamlined C. ACKNOWLEDGMENTS. We thank R. Schlapbach and L. Poveda from Zurich¨ eth-2.0 design contains a low amount of both synthesis constraints and Functional Genomics Center (ZFGC) for sequencing support; B. Maier and unnecessary genetic features. To enable the retrosynthetic assembly route, members from ScopeM for electron microscopy support; S. Nath from the C. eth-2.0 was partitioned into 3- to 4-kb DNA blocks using the previously Joint Genome Institute (JGI) for DNA synthesis and sequencing support; published Genome Partitioner algorithm (32). F. Rudolf for assistance with yeast marker design; H. Christen for concep- tion of computational algorithms; and Samuel I. Miller, Markus Aebi, and Synthesis and Hierarchical Assembly of the C. eth-2.0 Genome. The parti- Uwe Sauer for critical comments. This work received institutional support tioned 3- to 4-kb DNA blocks for the hierarchical assembly of C. eth-2.0 were from Community Science Program (CSP) DNA Synthesis Award Grants JGI ordered from two commercial suppliers of low-cost de novo DNA synthesis. CSP-1593 (to M.C. and B.C.) and CSP-2840 (to M.C. and B.C.) from the US Department of Energy Joint Genome Institute, Swiss Federal Institute of The blocks were assembled into 20-kb segments and subsequently, into 40- Technology (ETH) Zurich¨ ETH Research Grant ETH-08 16-1 (to B.C.), and to 60-kb megasegments using yeast homologous gap repair. To verify the Swiss National Science Foundation Grant 31003A 166476 (to B.C.). The work assemblies, a junction-amplifying PCR was conducted. conducted by the US Department of Energy Joint Genome Institute, a To assemble the megasegments into the 785-kb C. eth-2.0 genome, Department of Energy Office of Science User Facility, is supported by Office homologous gap repair was done by the newly generated S. cerevisiae strain of Science of the US Department of Energy Contract DE-AC02-05CH11231.

1. Cello J, Paul AV, Wimmer E (2002) Chemical synthesis of poliovirus cdna: Gener- 18. Napolitano MG, et al. (2016) Emergent rules for codon choice elucidated by editing ation of infectious virus in the absence of natural template. Science 297:1016– rare arginine codons in escherichia coli. Proc Natl Acad Sci USA 113:E5588–E5597. 1018. 19. Wang K, et al. (2016) Defining synonymous codon compression schemes by genome 2. Smith HO, Hutchison CA, Pfannkoch C, Venter JC (2003) Generating a syn- recoding. Nature 539:59–64. thetic genome by whole genome assembly: phiX174 bacteriophage from synthetic 20. Lau YH, et al. (2017) Large-scale recoding of a bacterial genome by iterative oligonucleotides. Proc Natl Acad Sci USA 100:15440–15445. recombineering of synthetic dna. Nucleic Acids Res 45:6971–6980. 3. Gibson DG, et al. (2008) Complete chemical synthesis, assembly, and cloning of a 21. Ostrov N, et al. (2016) Design, synthesis, and testing toward a 57-codon genome. genome. Science 319:1215–1220. Science 353:819–822. 4. Gibson DG, et al. (2010) Creation of a bacterial cell controlled by a chemically 22. Holtzendorff J, et al. (2004) Oscillating global regulators control the genetic circuit synthesized genome. Science 329:52–56. driving a bacterial cell cycle. Science 304:983–987. 5. Hutchison CA, et al. (2016) Design and synthesis of a minimal bacterial genome. 23. McGrath PT, et al. (2007) High-throughput identification of transcription start Science 351:aad6253. sites, conserved promoter motifs and predicted regulons. Nat Biotechnol 25:584– 6. Annaluru N, et al. (2014) Total synthesis of a functional designer eukaryotic 592. chromosome. Science 344:55–58. 24. Christen M, et al. (2010) Asymmetrical distribution of the second messenger c-di-GMP 7. Mitchell LA, et al. (2017) Synthesis, debugging, and effects of synthetic chromosome upon bacterial cell division. Science 328:1295–1297. consolidation: Synvi and beyond. Science 355:eaaf4831. 25. Shen X, et al. (2008) Architecture and inherent robustness of a bacterial cell-cycle 8. Richardson SM, et al. (2017) Design of a synthetic yeast genome. Science 355:1040– control system. Proc Natl Acad Sci USA 105:11340–11345. 1044. 26. Schrader JM, et al. (2014) The coding and noncoding architecture of the Caulobacter 9. Shen Y, et al. (2017) Deep functional analysis of synii, a 770-kilobase synthetic yeast crescentus genome. PLoS Genet 10:e1004463. chromosome. Science 355:eaaf4791. 27. Zhou B, et al. (2015) The global regulatory architecture of transcription during the 10. Wu Y, et al. (2017) Bug mapping and fitness testing of chemically synthesized Caulobacter cell cycle. PLoS Genet 11:e1004831. chromosome X. Science 355:eaaf4706. 28. Nierman WC, et al. (2001) Complete genome sequence of Caulobacter crescentus. 11. Xie ZX, et al. (2017) “Perfect” designer chromosome V and behavior of a ring Proc Natl Acad Sci USA 98:4136–4141. derivative. Science 355:eaaf4704. 29. Schrader JM, et al. (2016) Dynamic translation regulation in Caulobacter cell cycle 12. Coleman JR, et al. (2008) Virus attenuation by genome-scale changes in codon pair control. Proc Natl Acad Sci USA 113:E6859–E6867. bias. Science 320:1784–1787. 30. Christen B, et al. (2011) The essential genome of a bacterium. Mol Syst Biol 7:528– 13. Mart´ınez MA, Jordan-Paiz A, Franco S, Nevot M (2016) Synonymous virus genome 528. recoding as a tool to impact viral fitness. Trends Microbiol 24:134–147. 31. Christen M, Deutsch S, Christen B (2015) Genome calligrapher: A web tool for refac- 14. Mueller S, et al. (2010) Live attenuated influenza virus vaccines by computer-aided toring bacterial genome sequences for de Novo DNA synthesis. ACS Synth Biol rational design. Nat Biotechnol 28:723–726. 4:927–934. 15. Wang HH, et al. (2009) Programming cells by multiplex genome engineering and 32. Christen M, Del Medico L, Christen H, Christen B (2017) Genome partitioner: A web accelerated evolution. Nature 460:894–898. tool for multi-level partitioning of large-scale DNA constructs for 16. Isaacs FJ, et al. (2011) Precise manipulation of chromosomes in vivo enables genome- applications. PLoS One 12:e0177234. wide codon replacement. Science 333:348–353. 33. Christen M, Christen B (2019) Caulobacter ethensis CETH2.0 genome sequence. 17. Lajoie M, et al. (2013) Probing the limits of genetic recoding in essential genes. GenBank. Available at https://www.ncbi.nlm.nih.gov/search/all/?term=CP035535. De- Science 342:361–363. posited January 31, 2019.

8078 | www.pnas.org/cgi/doi/10.1073/pnas.1818259116 Venetz et al. Downloaded by guest on September 24, 2021 Downloaded by guest on September 24, 2021 5 okvV,e l 21)Asml flre ihGCbceilDAfamnsin fragments DNA chemically bacterial a G+C high by large, controlled of cell Assembly (2012) bacterial al. a et of VN, Creation Noskov (2010) 35. al. et DG, Gibson 34. 6 iemnA ta.(02 atcleto fmcoilgnsta r oi to toxic are that genes microbial of collection vast A (2012) al. et A, Kimelman 36. 7 oe ,e l 20)Gnm-ieeprmna eemnto fbrir to barriers of determination experimental Genome-wide (2007) al. et R, Sorek 37. 9 hbt ,e l 20)Fntoa vra ewe eaadms rr)i h rescue the in (rara) mgsa and reca between overlap rna Functional of (2005) expression al. et controlled T, on Shibata based switch 39. growth synthetic A (2015) al. et J, Izard 38. 1 hiRe ,Coa E(03 h itncroyaeboi abxlcrirprotein carrier carboxyl carboxylase-biotin dnac biotin by The helicase (2003) JE dnab Cronan of E, regulation Choi-Rhee the in 41. balance Fine (1991) A Kornberg G, Allen 40. 2 uhrodS,BslrB 21)Bceilqou esn:Isrl nvrlneand virulence in role Its sensing: quorum Bacterial (2012) BL Bassler ST, Rutherford 42. 3 tr ,VglJ asra M(01 euainb ml nsi bacteria: in rnas small by Regulation (2011) KM Wassarman J, Vogel G, Storz 43. 5 hbkvV eoaL ohnwk ,SurU(04 oriaino microbial of Coordination (2014) U Sauer K, Kochanowski L, Gerosa V, systems. control Chubukov cellular and 45. proteins Allosteric (1963) F Jacob JP, Changeux J, Monod 44. eeze al. et Venetz yeast. genome. synthesized bacteria. oiotlgn transfer. gene horizontal fsaldrpiainfrsi shrci coli. escherichia in forks replication stalled of polymerase. ope feceihacl ctlcacarboxylase. acetyl-coa coli escherichia of complex coli. escherichia in replication in protein osblte o t control. its for possibilities xadn frontiers. Expanding metabolism. Biol Mol J C yt Biol Synth ACS eoeRes Genome 6:306–329. o ytBiol Syst Mol a e Microbiol Rev Nat o Cell Mol 1:267–273. Science 22:802–809. Science 11:840. odSrn abrPrpc Med Perspect Harbor Spring Cold 329:52–56. 43:880–891. 12:327–340. 318:1449–1452. ilChem Biol J ee Cells Genes ilChem Biol J 266:22096–22101. 10:181–191. 2:a012427. 278:30806–30812. 4 kre M abM 20)Cl-yl rgeso n h eeaino asymmetry of generation the and progression Cell-cycle coordinates (2004) spatially MT compass Laub intracellular JM, Skerker An (2016) 54. L Shapiro TH, Mann K, Lasker 53. 6 hvneFV uhsK 21)Cs o h eei oea rpe ftriplets. of triplet a as code genetic the for Case (2017) KT function. Hughes for FFV, quest Chevance in genes 56. Essential unknowns: Unknown (2016) G Fang A, Danchin 55. control. cycle cell bacterial of design System-level (2009) in L Shapiro in component HH, operating McAdams multifunctional network regulatory 52. and cell-cycle bacterial essential A (2003) an L Shapiro as HH, McAdams Yidc 51. (2007) A Kuhn dnak/dnaj D, and spiral Kiefer Groes/groel independent (2006) 50. Two SL (2005) Gomes F, Z Gueiros-Filho Gitai RL, L, Baldini Shapiro MF, JA, Susin Theriot 49. Z, coli. Pincus escherichia NA, in transcription Dye Rrna (2004) 48. RL protein Gourse absolute T, Gaal Quantifying W, Ross (2014) BJ, Paul JS 47. Weissman C, Gross D, Burkhardt GW, Li 46. nCuoatrcrescentus. Caulobacter in crescentus. Caulobacter in modules cycle cell Lett irbBiotechnol Microb space. and time assembly. protein membrane caulobacter in progression cycle cell during and responses crescentus. stress in roles distinct have caulobacter. in shape cell control structures resources. Genet Rev cellular of allocation underlying principles 157:624–635. reveals rates synthesis alAa c USA Sci Acad Natl 583:3991. 38:749–770. Bacteriol J Science 114:4745–4750. 9:530–540. 188:8044–8053. PNAS 301:1874–1877. a e Microbiol Rev Nat n e Cytol Rev Int | pi 6 2019 16, April 259:113–138. rcNt cdSiUSA Sci Acad Natl Proc urOi Microbiol Opin Curr 2:325–337. | o.116 vol. | 33:131–139. 102:18608–18613. o 16 no. | Annu 8079 FEBS Proc Cell

BIOPHYSICS AND SYSTEMS BIOLOGY COMPUTATIONAL BIOLOGY