arXiv:1809.02716v2 [cs.IT] 27 Oct 2019 trg a nrae rmtclyi eetyas follow years, recent in dramatically increased has storage oe fDAsoaessesi udmnal ifrn fro different fundamentally is systems storage DNA of model epciey[ respectively integers hlegn ocma,adaedsusdnext. discussed are and combat, to challenging epcieedit-distance respective n te tigi h e.I hscs h uptsthsthe has set output the case outp this the as In simply of set. to size the referred the in counting string other by any undetectable is which other, rosi h sa es.Mroe,teodrbtentesl the between order the Moreover, sense. set. usual into sliced the is in string errors the transmission, during Then, prob input. open Finally, c 1. constant. code Remark a another to of up sketch redundancy optim a optimal order-wise present with be we redundancy, to sub the shown single Appendix improve is a in and co correct improved substitutions, to can later multiple required that is are code which that a constant, bits of redundant construction of a provide amount the on bounds string a in result they since th handle include to might easier decoder are the insertions inclu hence are and winners erroneously, majority written all and cluster, that in than reads longer the or shorter either is th length encoding whose after strings dissolved storage, and Reaction DNA synthesized Chain are of molecules context DNA the result corresponding (which In insertions another). string), shorter by a in result which a in decoder the at obtained are strings these consequently, (i.e synthesis in restrictions biochemical to due However, lo h eoe ocutteeatnme fapaacsof Section in (see appearances the networks of re-ignited asynchronous number over have exact transmission restrictions the These count to concentrations. decoder the allow h el rae tigarayeit ntesto strings of set the in exists already string created newly the 1 2 aasoaei ytei N oeue ugssunprecede suggests molecules DNA synthetic in storage Data yial,tedt naDAsoaesse ssoe sapool a as stored is system storage DNA a in data the Typically, usiuinerrta cuspirt mlfiainb PC by amplification to prior occurs that error substitution A olwn omldfiiino h hne oe nSection in model channel the of definition formal a Following n ftecvaso hsapoc sta rosi synthesi in errors that is approach this of caveats the of One of set a as encoded is stored be to data the model, this In h dtdsac ewe w tig stemnmmnme o number minimum the is strings two between distance edit The sln stenme fisrin snteult h ubro number the to equal not is insertions of number the as long As muto eudn ista r eurdt orc errors correct whi to that required is are paradigm. paper that correcting this bits in redundant con of result several amount surprising provide The and constants. errors, to substitution under channel mrigapiain nDAsoae mn tes nthis In others. among storage, DNA in applications emerging eateto lcrclEgneig aionaInstitu California Engineering, Electrical of Department M h neeti hne oesi hc h aai eta nu an as sent is data the which in models channel in interest The and h hne hc sdsusdi hspprcnessentially can paper this in discussed is which channel The 16 .Ec niiulsrnsi ujc ovrostpso err of types various to subject is strings individual Each ]. L PR sapid hc rsial mlfistenme fc of number the amplifies drastically which applied, is (PCR) uhthat such nCdn vrSie Information Sliced over Coding On substitutions 1 hn aoiyvt shl ihnec lse nodrto order in cluster each within held is vote majority a Then, . < M r h anfcso hspaper. this of focus main the are , 2 L i Sima Jin yia ausfor values typical ; , eae Raviv Netanel C nSection In . L .I I. eein,isrin,adsbtttosta unoeto one turn that substitutions and insertions, deletions, f III eein,a vn htocr nngiil probability negligible in occurs that event an deletions, f r icre,adtermiigoe r lsee accordi clustered are ones remaining the and discarded, are usrnso qa egh n ahsbtigi ujc to subject is substring each and length, equal of substrings M nalne tig,adsbtttos(.. replacements (i.e., substitutions and string), longer a in NTRODUCTION eetin terest tutos oeo hc r hw ob smttclyopt asymptotically be to shown are which of some structions, ). eteifrainvco ssie noasto nree st unordered of set a into sliced is vector information the le n e ucsflpooyeipeettos[ implementations prototype successful few a ing Abstract n ec,tedcdrwl uptastof set a output will decoder the hence, and , flnt ifrn from different length of e nthe in ded sodrws qiaett h mutrqie nteclass the in required amount the to equivalent order-wise is ,wiig n eunig(.. edn) h underlying the reading), (i.e., sequencing and writing) ., and sernossrn nteotu e.I hscnet delet context, this In set. output the in string erroneous is lweee h ubro usiuin sacntn.T fur To constant. a is substitutions of number the whenever al ae eaayetemnmlrdnac fbnr oe o t for codes binary of redundancy minimal the analyze we paper csi otdrn rnmsin n hyarv sa unord an as arrive they and transmission, during lost is ices n esfrftr eerhaedsusdi Section in discussed are research future for lems t iia-ei counterpart. digital-media its m tst h usiuingnrtsasrn hc sntequa not is which string a generates substitution the set, ut aesz steerrfe n.Teeerrpten,wiha which patterns, error These one. free error the as size same ttto.Ti osrcini hw ob pia pt som to up optimal be to shown is construction This stitution. nieaslto.Te,aceia rcs called process chemical a Then, solution. a inside a nueete n ftopsil ro atrs none In patterns. error possible two of one either induce can R VI unordered L ntuto nSection in onstruction tdavne ndniyaddrblt.Teitrs nDNA in interest The durability. and density in advances nted btsbtttosaegvni Section in given are substitutions mbat and , II ih as h C rcs oapiyasrn htwas that string a amplify to process PCR the cause might s aaa e fsrnsoe orsmo lhbt the alphabet, four-symbol a over strings of set a as data e M ahsrn nteslto,btmrl oetmt relative estimate to merely but solution, the in string each rvoswr sdsusdi Section in discussed is work previous , r urnl ihnteodro antd of magnitude of order the within currently are h osrcinfrasnl usiuini eeaie t generalized is substitution single a for construction the oigoe sets over coding odrdsto iaysrnshsicesdltl,deto due lately, increased has strings binary of set nordered eo ehooy aaea915 A USA CA, 91125, Pasadena Technology, of te fsotsrnsta r isle nieaslto,and solution, a inside dissolved are that strings short of tig flength of strings uptset output eohaBruck Jehoshua ese stkn tigo eti length certain a of string a taking as seen be aho.Frhroe urn ehooyde not does technology current Furthermore, fashion. r,sc sdltos(.. msin fsymbols, of omissions (i.e., deletions as such ors, pe fec tig nteraigprocess, reading the In string. each of opies ftedcdn loih (Figure algorithm decoding the of 2 oe htas nsapiain in applications finds also that model a , L usiuinerr,hwvr r more are however, errors, Substitution . oeu ihtems ieyoii of origin likely most the with up come VII L h oei aal fcorrecting of capable is code The . vracranapae,frsome for alphabet, certain a over another. . M IV III nSection In . − pe n lower and Upper . 2 ,[ ], 1 VIII foesymbol one of tig.I the In strings. ig,the rings, clerror ical 10 7 mlup imal ,[ ], substitution Polymerase gt their to ng . 7 1 ). and osand ions 4 channel his ,[ ], V N ered 10 16 ther to l we as re 2 ]. o e , , 2
DATA (encoding) ⇓ M 0,1 L (synthesis) (PCR) xi i=1 { M} { } ∈ ⇒ ⇒
(sequencing) (majority) (clustering) ⇒ ⇒
Fig. 1. An illustration of a typical operation of a DNA storage system. The data at hand is encoded to a set of M binary strings of length L each. These strings are then synthesized, possibly with errors, into DNA sequences, that are placed in a solution and amplified by a PCR process. Then, the DNA sequences are read, clustered by similarity, and the output set is decided by a majority vote. In the illustrated example, one string is synthesized in error, which causes the output set to be in error. If the erroneous string happens to be equal to another existing string, the output set is of size M − 1, and otherwise, it is of size M.
It follows from the sphere-packing bound [18, Sec. 4.2] that without the slicing operation, one must introduce at least K log(N) redundant bits at the encoder in order to combat K substitutions. The surprising result of this paper, is that the slicing operation does not incur a substantial increase in the amount of redundant bits that are required to correct these K substitutions. In the case of a single substitution, our codes attain an amount of redundancy that is asymptotically equivalent to the ordinary (i.e., unsliced) channel, whereas for a larger number of substitutions we come close to that, but prove that a comparable amount of redundancy is achievable.
II. PRELIMINARIES To discuss the problem in its most general form, we restrict our attention to binary strings. For integers M and L such 3 L 0,1 L L 0,1 L that M 2 we denote by { M} the family of all subsets of size M of 0, 1 , and by { M} the family of subsets ≤ { } L ≤ L L 0,1 0,1 of size at most M of 0, 1 . In our channel model, a word is an element W { M} , and a code { M} is a set of words (for clarity, we{ refer} to words in a given code as codewords). To prevent∈ ambiguity with classicalC ⊆ coding theoretic terms, the elements in a word W = x1,..., xM are referred to as strings. We emphasize that the indexing in W is merely a notational convenience, e.g., by the{ lexicographic} order of the strings, and this information is not available at the decoder. For K ML, a K-substitution error (K-substitution, in short), is an operation that changes the values of at most K ≤ 0,1 L different positions in a word. Notice that the result of a K-substitution is not necessarily an element of { M} , and might 0,1 L be an element of { } for some M K T M. This gives rise to the following definition. T − ≤ ≤ 0,1 L M 0,1 L Definition 1. For a word W { } , a ball (W ) { } centered at W is the collection of all subsets ∈ M BK ⊆ j=M K j of 0, 1 L that can be obtained by a K-substitution in W . − { } S Example 1. For M =2, L =3, K =1, and W = 001, 011 , we have that { } (W )= 001, 011 , 101, 011 , 011 , 000, 011 , 001, 111 , 001 , 001, 010 . BK {{ } { } { } { } { } { } { }} 0,1 L In this paper, we discuss bounds and constructions of codes in { M} that can correct K substitutions (K-substitution codes, for short), for various values of K. The size of a code, which is denoted by , is the number of codewords (that is, sets) in it. The redundancy of the code, a quantity that measures the amount of redundant|C| information that is to be added to L the data to guarantee successful decoding, is defined as r( ) , log 2 log( ), where the logarithms are in base 2. C M − |C| 3We occasionally also assume that M ≤ 2cL for some 0