bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Structure of the space of taboo-free sequences

Cassius Manuel ·

.

Abstract Models of sequence evolution typically assume that all sequences are possible. However, restriction enzymes that cut DNA at specific recogni- tion sites provide an example where carrying a recognition sequence can be lethal. Motivated by this observation, we studied the set of strings over a finite alphabet with taboos, that is, with prohibited substrings. The taboo-set is referred to as T and any allowed string as a taboo-free string. We consider the graph Γn(T) whose vertices are taboo-free strings of length n and whose edges connect two taboo-free strings if their Hamming distance equals 1. Any (random) walk on this graph describes the evolution of a DNA sequence that avoids deleterious taboos. We describe the construction of the vertex set of Γn(T). Then we state conditions under which Γn(T) and its suffix subgraphs are connected. Moreover, we provide a simple algorithm that can determine, for an arbitrary T, if all these graphs are connected. We concluded that bac- terial taboo-free Hamming graphs are nearly always connected, although 4 properly chosen taboos are enough to disconnect one of its suffix subgraphs.

Austrian Science Fund (FWF, grant number I-1824-B22).

Corresponding author: Cassius Manuel ORCID ID: https://orcid.org/0000-0002-6297-4762 Center for Integrative Bioinformatics Vienna, Labs , Medical University of Vienna Dr. Bohr Gasse 9, A-1030 Vienna, Austria E-mail: [email protected] Arndt von Haeseler ORCID ID: https://orcid.org/0000-0002-3366-4458 Center for Integrative Bioinformatics Vienna, Max Perutz Labs University of Vienna, Medical University of Vienna Dr. Bohr Gasse 9, A-1030 Vienna, Austria Faculty of Computer Science, University of Vienna W¨ahringerStr. 29, A-1090 Vienna, Austria. E-mail: [email protected] bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

2 Cassius Manuel, Arndt von Haeseler

Keywords Bacteriophage DNA evolution · Endonuclease-dependent evo- lution · Restriction-enzyme dependent evolution · Restriction-modification system · Hamming graph with taboos · Connectivity of Hamming graphs Mathematics Subject Classification (2010) 05C40 · 92D15

1 Introduction

In bacteria, restriction enzymes cleave foreign DNA to stop its propagation. To do so, a double-stranded cut is induced by a so-called recognition sequence, a DNA sequence of length 4 − 8 base pairs ([2]). As part of their recognition- modification system, bacteria can escape the lethal effect of their own restric- tion enzymes by modifying recognition sequences in their own DNA ([14]). Nevertheless, a significant avoidance of recognition sequences has been ob- served in bacterial DNA ([7], [19]). Also in bacteriophages, the avoidance of the recognition sequences is evolutionary advantageous ([19]). Therefore the recognition sequence is, as we call it, a taboo in a majority of fragments of both host and foreign DNA. Motivated by this biological phenomenon, we studied the Hamming graph Γn(T), whose vertices are strings of length n over a finite alphabet Σ not containing any taboo of the set T as substring. By definition, an edge connects two strings of this graph if their Hamming distance is 1, i.e. if they differ by a single substitution. Moreover, given a taboo-free string s, we consider s the subgraph Γn(T) of Γn(T) induced by every taboo-free string with suffix s. Suffix s represents a conserved DNA fragment, that is, a sequence which remained invariable during evolution ([20], [6]).

Instead of studying just the connectivity of Γn(T), here we define conditions s under which every graph of the form Γn(T) is connected. Note that this result e is stronger than just the connectivity of Γn(T), for Γn(T) = Γn(T), where e is the empty string. If taboo-set T satisfies that, for every taboo-free string s s and integer n, graph Γn(T) is connected, then evolution can explore all the space of taboo-free sequences by simple point mutation, no matter which DNA suffix fragments remain invariable. Notice that we assume that the taboo-set T does not change in the course of evolution.

2 Spaces of taboo-free sequences

We start with a non-biological example of a space of taboo-free strings. For the binary alphabet Σ = {0, 1} and T = {11}, graphs Γn(T) are known as Fibonacci cubes, proposed in [9] and [10] as a network topology in parallel computing. Fibonacci cubes are connected, bipartite and contain a Hamilto- nian path. Enumeration results regarding e.g. the Wiener Index or the degree distribution have been obtained, mainly using generating functions, in [13], [5] and [16], among others (see the compendium [12]). bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Structure of the space of taboo-free sequences 3

Other types of taboo strings assuming a binary alphabet Σ = {0, 1} have been considered in [10] for T = {1 ... 1}, while the case T = {s} for an arbitrary binary string s was studied in [11]. The number of strings over any finite alphabet with taboos has been studied, using generating functions, in [8], while related software is given in [17]. Therefore our terminology will be based on this previous work: From now on we will use the terms string to refer to a sequence of symbols over an alphabet Σ, while (DNA) sequence is reserved for biological contexts, where the alphabet consists of the four nucleotides, i.e. Σ = {A, C, G, T } . Given a string s, its length is denoted by |s|. Given two strings s1, s2 of equal length, then d(s1, s2) denotes their Hamming distance, that is, the number of sites at which the corresponding symbols differ.

Given a finite set of strings T, we call T a taboo-set if, for any t ∈ T, |t|≥ 2. We refer to the strings of T as taboos. The length of the longest taboo in T will be denoted by M := max {|t|}t∈T.A taboo-free string is a string not having any taboo of T as substring. The set of taboo-free strings of length s n is denoted by Vn(T). Given a taboo-free string s, for n ≥ |s|, Vn (T) denotes the set of strings of Vn(T) with suffix s. The Hamming graph of taboo-free strings of length n, denoted by Γn(T) := (Vn(T), En(T)), is the graph with vertex set Vn(T) such that two vertices u, v ∈ Vn(T) are adjacent if their Hamming distance equals 1. See Fig. s 1 for an example where Γn(T) is disconnected for n ≥ 3. Analogously, Γn(T) s is the Hamming graph with vertex set Vn (T).

Fig. 1 Graph Γn(T) for n ∈ [1, 5] for binary alphabet Σ = {0, 1} and T = {11, 000}. Set Vn+1(T) is constructed by adding every allowed symbol at the beginning of each string in Vn(T).

Taboo-sets as generated by the avoidance of restriction sequences can assume various levels of complexities. We discuss some examples from RE- BASE ([18]). Note that many restriction enzymes of this database have an unknown recognition sequence, hence our taboo-sets may underestimate the actual amount of taboos. Before studying the examples, we will briefly review essential nomenclature for DNA sequences. bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

4 Cassius Manuel, Arndt von Haeseler

DNA is double-stranded, where A pairs with T and G pairs with C, thus it suffices to discuss only one of the strands. We adopt the convention that, given any of the strands, the DNA sequence is always represented from the 5’ end to the 3’ end (which is chemically determined). As a consequence, given a DNA sequence, its complementary DNA sequence, the one lying on the opposite strand, is obtained by inverting the order of the symbols and carrying through substitutions A ↔ T and C ↔ G. If a DNA sequence s is identical to its complementary DNA sequence, we say that s is an inverted repeat (see [22]). For example, sequence CCGG is an inverted repeat. The fact that DNA is double-stranded implies that each recognition se- quence induces taboos in pairs, namely itself and its complementary DNA sequence. For example, if AGGGC is a recognition sequence, then also the complementary strand GCCCT is a taboo. If, however, the recognition se- quence is an inverted repeat such as T GCA, then this pair is actually one single recognition sequence. This is the case in most of type II restriction- modification systems ([7]).

2.1 Turneriella parva

The Turneriella parva (REBASE organism number 8970) strain produces a restriction enzyme with recognition sequence GAT C, an inverted repeat. Simi- larly, another of its enzymes has recognition sequences GGACC and GGT CC. Thus, these restriction enzymes generate the taboo-set [ TT.pa = {GAT C} {GGACC, GGT CC}. (1)

2.2 Helicobacter pylori

In H. pylori 21-A-EK1 (studied in [1]), many restriction enzymes have been A S G S C S T identified. For the sake of clarity, let us write TH.py = T T T T, a where T denotes those taboos in TH.py whose first symbol is a ∈ Σ. Then we have

A T ={AC ◦ Σ ◦ GT }, G 2 [ [ T =(GT ◦ Σ ◦ AC) {GT CAC, GT GAC} [ {GT AC, GAGG} (2) C T ={CCGG, CCT C, CAT G}, T T ={T GCA}, where GT ◦ Σ2 ◦ AC represents taboos of the type GGabCC with a, b ∈ Σ, and so on for analogous notations. In Section 10 we will show that every graph s Γn(TH.py) is connected. Thus, in particular, Γn(TH.py) is connected. bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Structure of the space of taboo-free sequences 5

2.3 An imaginary bacterium

The taboo-set can significantly influence evolution in the cases where some s Γn(T) is disconnected. To explain this, we will create a plausible, nonexistent example. Suppose that a strain of Bacterium imaginara has taboo-set [ TB.im = {ACCC, T CCC, CGCC, GGCC} {GGGT, GGGA, GGCG}, where the second set contains the complementary DNA sequences of the first set, except that of GGCC, which is an inverted repeat. At first glance, taboo- set TB.im seems less restrictive than TH.py, which has 6 taboos of length four and 22 taboos of length five or more. However, we will show in Section 10 that CCC the graph Γn (TB.im) is disconnected for n ≥ 5. This happens because the GCCC first set in TB.im represents a ”wall of taboos” between graphs Γn (TB.im) CCCC and Γn (TB.im). This produces the following evolutionary implications: Assume that we have two correctly aligned DNA fragments fα and fβ from viruses α and β that infect Bacterium imaginara. Assume moreover that we can write fα = rαGCCC and fβ = rβCCCC for some strings rα and rβ, as also that the suffix CCC is invariable due to functional constrains. Then fα cannot have evolved from fβ by simple point mutations, because at some point in evolu- tion a taboo string is produced, what would be lethal for the carrier. Thus, the standard models of sequence evolution do not apply ([21]).

Thus, taboo-set TB.im is an interesting source of information to study evo- lution. This motivates our study, from a general point of view, of the Hamming graph with taboos.

3 Outline of the theory of Γn(T)

We will characterize those taboo-sets T such that every graph of the form s Γn(T) is connected. To do this we introduce a very general type of taboo-sets, which we call left proper (Sec. 5, Def. 4). Left proper taboo-sets satisfy that, for each string s ∈ VM (T), a symbol a ∈ Σ exists such that as is taboo-free. Our biological examples discussed so far are left proper. For the sake of clarity, consider the DNA-alphabet Σ = {A, C, G, T } and the taboo-set T = {AA, CCC}. Since T is left proper, for any taboo-free string s of length at least M = 3 there exists a string w of any desired length such that ws is taboo-free (Def. 3 of k-prefix, Prop. 5 and Prop. 6). Then synchronization comes into play (Def. 6). Two taboo-free strings s1 and s2 are left k-synchronized if there exists a string w of length k such that ws1 and ws2 are taboo-free. More informally, synchronization sticks together connected components that partition the Hamming graph. Prop. 7 states that, if s1 and s2 are left (M − 1)-synchronized, then they are left k-synchronized for every k ∈ N. For taboo-set T = {AA, CCC}, given two taboo-free strings bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

6 Cassius Manuel, Arndt von Haeseler

s1 and s2, it holds that T T s1 and T T s2 are taboo-free. Thus s1 and s2 are left k-synchronized for every k ∈ N. In Sec. 6 and 7, we realize that not the whole string s is relevant to construct s graph Γn(T), but only the biggest prefix of s which is a suffix of a taboo, which we call s[1, ks] (Prop. 10). In particular, we infer the graph isomorphism s s[1,ks] Γn(T) ' Γn (T) (Theorem 15). This implies that, to study every graph s s Γn(T), it is enough to consider Γn(T) for s ∈ {e, A, C, CC}, where e is the empty string. In Section 8 we introduce the quotient graph, a useful tool to describe the structure of graphs with many vertices. Given a partition of the vertices, the quotient graph represents the edges connecting the sets of this partition. s1 s2 Lemma 19 states that, if Vn (T) and Vn (T) are two of the sets of this partition, then they are adjacent in the quotient graph for any n ∈ N iff d(s1, s2) = 1 and s1 and s2 are left (M − 1)-synchronized. For T = {AA, CCC}, every 2 strings are left 2-synchronized, implying that only condition d(s1, s2) = 1 is needed.

s Combining all these results, we can prove the connectivity of any Γn(T) s (Sec. 9). This is done as follows: We check first if Γ|s|+3(T) is connected for any s ∈ {e, A, C, CC}. Using the classification of Theorem 15, that implies w that, for any w ∈ V3(T), Γ|w|+3(T) is connected. Then we repeatedly make w F aw use of the quotient graph induced by partition Vn (T) = a∈L1(w) Vn (T) for s w ∈ V3(T), which allows us to inductively prove that every Γn(T) is connected (Lemma 20 and Theorem 21). Remarkably, connectivity of the biological examples we have dealt with (as also of T = {AA, CCC}) can be proven by applying Prop. 24, which does not make use of most of the theory we developed. In fact, only the definition of k- suffix and left k-synchronization are sufficient to understand and apply Prop. 24. However, our theory can handle a much more general types of taboo-sets.

4 Basic notation and properties of set Vn(T)

Our alphabet comprises m ≥ 2 symbols Σ = {a1, ··· , am}. The set of strings over Σ of length n will be referred to as Σn, with cardinality |Σn| = mn. The empty string will be denoted by e, and satisfies |e| = 0 and {e} = Σ0.

If string s1 is substring of string s2, we write s1 ≺ s2, while s1 6≺ s2 denotes that s1 is not a substring of s2. By convention, e ≺ s for any string s. Given two strings s1 and s2, we define s1s2 as the concatenation of s1 and s2. Note that es = se = s for any s. Given a set of strings S = {s1, ··· sk} and a string s, we define s ◦ S := {ss1, ··· ssk} and S ◦ s := {s1s, ··· sks}. Given two string k sets S1 and S2, we define S1 ◦ S2 := {s1s2}s1∈S1;s2∈S2 , and S1 = 1≤i≤kS1. If S1 and S2 are disjoint, then the disjoint union of S1 and S2 will be denoted F by S1 S2. bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Structure of the space of taboo-free sequences 7

Definition 1 Given symbols b1, ··· , bn ∈ Σ, let us consider string s = b1 ··· bn. For i ∈ N and j ∈ N0, we define ( b ··· b if i ≤ j ≤ n s[i, j] := i j e otherwise,

i.e. s[i, j] denotes the substring of s starting at the i-th symbol up to the j-th symbol, and e when this substring is not well-defined (for example if j = 0). Given a set of strings S, we define

S[i, j] := {s[i, j] | s ∈ S}.

Example 1 If Σ = {D,O,R,W } and s = WORD, then we have s[1, 1] = W , s[1, 2] = WO, s[2, 4] = ORD and s[1, 0] = e.

Proposition 1 For any two taboo-sets T1 and T2 and any n ∈ N, it holds that: S T S a) Set T1 T2 is a taboo-set and Vn(T1) Vn(T2) = Vn(T1 T2). b) If for every t1 ∈ T1 there exists t2 ∈ T2 such that t2 ≺ t1, then for any n ∈ N, Vn(T2) ⊂ Vn(T1). Proof S S a) Every t ∈ T1 T2 has length at least 2, thus T1 T2 is a taboo-set. The T fact that a string s ∈ Vn(T1) Vn(T2) satisfies t1 6≺ s for any t1 ∈ T1 and S t2 6≺ s for any t2 ∈ T2 is equivalent to s satisfying t 6≺ s for any t ∈ T1 T2. b) Consider s ∈ Vn(T2). Assume that s∈ / Vn(T1); then there exists t1 ∈ T1 such that t1 ≺ s. But there also exists a t2 ∈ T2 such that t2 ≺ t1, thus t2 ≺ s, a contradiction. Hence s ∈ Vn(T1). 

The set Vn(T) can be obtained from other taboo-sets than T. In this sense, taboo-sets are not unique, as we illustrate in the following proposition. Proposition 2 Given string s, for any n ≥ 1 + |s| it holds that  [  Vn({s}) = Vn (s ◦ Σ) (Σ ◦ s) .

Proof S – ⊆ : Any taboo in T1 := (s ◦ Σ) (Σ ◦ s) has s ∈ T2 := {s} as substring,  S  thus Prop. 1.b implies Vn({s}) ⊂ Vn (s ◦ Σ) (Σ ◦ s) .  S  – ⊇ : Assume there is w ∈ Vn (s ◦ Σ) (Σ ◦ s) with s ≺ w. Since |w| = n and n ≥ |s| + 1, the substring s is either preceded or followed by some symbol a ∈ Σ. This contradicts {as, sa}⊂ (s ◦ Σ) ∪ (Σ ◦ s). 

Prop. 2 implies that, for any T, we can construct many taboo-sets T0 such that 0 0 0 Vn(T) = Vn(T ) for any n ≥ max(M,M ), where M denotes the length of the longest taboo in T0. bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

8 Cassius Manuel, Arndt von Haeseler

F Example 2 If T = T1 T2 with T2 = (s◦ Σ) ∪ (Σ ◦ s), Prop. 1.a and 2 imply 0 F 0 that T := T1 {s} satisfies Vn(T) = Vn(T ) for any n ≥ M. Repeating this 0 S 0 process, we can construct a taboo-set T such that (s ◦ Σ) (Σ ◦ s) 6⊂ T for 0 any string s and satisfying Vn(T) = Vn(T ) for any n ≥ M. Example 2 motivates the following definition. Definition 2 A taboo-set T is minimal if the following conditions hold: S – For every j ∈ [0, M − 1] and s ∈ Vj(T), we have (s ◦ Σ) (Σ ◦ s) 6⊂ T. – For every different t1, t2 ∈ T, it holds that t1 6≺ t2. When a taboo-set T is not minimal, it can be understood that this T is unnecessarily complicated. For computational purposes, it is recommendable to force T to be minimal. Apart from this, also M, the length of the longest taboo, is of great importance: If M is known, the following result guarantees that the concatenation of three taboo-free strings is a taboo-free string.

Proposition 3 Given T, consider strings s1, s2, and s3 of respective lengths

j1, j2, j3 ∈ N0. If s1s2 ∈ Vj1+j2 (T), s2s3 ∈ Vj2+j3 (T) and j2 ≥ M − 1, then

s1s2s3 ∈ Vj1+j2+j3 (T).

Proof Assume either j1 > 0 or j3 > 0, yielding j := j1 + j2 + j3 ≥ M. For each i ∈ [1, j + 1 − M], the fact that j2 ≥ M − 1 implies that either s[i, i + M − 1] ≺ s1s2 or s[i, i + M − 1] ≺ s2s3, hence each s[i, i + M − 1] is taboo-free and the result follows. 

5 Prefixes and suffixes of a taboo-free string

s Given a taboo-free string s, the construction of set Vn (T) for n > |s| depends on which strings can be concatenated to the left side of s, yielding ws ∈ Vn(T). This object is of interest for us.

Definition 3 For taboo-set T, j ∈ N0 and string s ∈ Vj(T), let k ∈ N0 be given. The k-prefixes of s are the elements of the set Lk(s), defined as k k s L (s) := {w ∈ Σ such that ws is taboo-free } = Vj+k(T)[1, k]. If Lk(s) 6= ∅, then we will say that s is k-prefixable. Similarly, the k-suffixes k k of s, denoted R (s), are the strings w ∈ Σ such that sw ∈ Vj+k(T). When Rk(s) 6= ∅, we say that s is k-suffixable. Example 3 If Σ = {0, 1} and T = {011, 1111}, then L1(11) = {1}, while L2(11) = ∅. Hence string 11 is 1-prefixable, but not 2-prefixable. Moreover, R1(11) = {0, 1}, hence string 11 is 1-suffixable.

By construction, given s ∈ Vj(T), for any n ∈ N0 it holds that s n Vn+j(T) = L (s) ◦ s. (3) s n That is, Vn+j(T) is L (s) with suffix s appended. Moreover, the k-prefixes s of a string s induce a partition of the set Vn (T). bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Structure of the space of taboo-free sequences 9

Proposition 4 Given a taboo-set T, an integer k ∈ N0 and a string s ∈ Vj(T) with j ∈ N0, for any n ≥ j + k it holds that

s G ws Vn (T) = Vn (T), w∈Lk(s)

s that is, the set Vn (T) can be partitioned into the disjoint sets of taboo-free strings of length n with suffix ws, where w ∈ Lk(s).

k s Proof If s is not k-prefixable, then L (s) = ∅ and Vn (T) = ∅, hence the equation holds. Otherwise, the inclusion ⊃ is clear, while the ⊂ follows from the fact that, for any string w ∈ Σk preceding the suffix s, this w must k necessarily belong to L (s). 

∗ Clearly, if a taboo-free string s ∈ Vj(T) is k -prefixable, then it is also k-prefixable for any integer k < k∗, while nothing can be said a priori about the case k > k∗. We can, however, define a type of taboo-sets satisfying that, if s has length at least M, then s is k-prefixable for any k ∈ N.

Definition 4 A taboo-set T is left proper if every s ∈ VM (T) is 1-prefixable. Analogously, T is right proper if every s ∈ VM (T) is 1-suffixable.

Example 4 If Σ = {A, C, G, T } and T = Σ ◦ A, then AC ∈ V2(T) and AC is not 1-suffixable. Thus T is not left proper.

Proposition 5 Consider a left proper taboo-set T, j ∈ N and s ∈ Vj(T), such that either of the following conditions holds. a) j ≥ M b) j ≤ M − 1 and s is (M − j)-prefixable

Then s is k-prefixable for any k ∈ N.

Proof If condition a) applies, then consider s[1, M] ∈ VM (T), yielding that s is 1-prefixable, that is, there exists a ∈ Σ with as ∈ Vj+1(T). Proceeding analogously with (as)[1, M], we would infer that s is 2-prefixable. Continuing with this process, we deduce that s is k-prefixable for any k ∈ N. s If condition b) holds, then we can take any string in VM (T) and proceed as we did assuming a).  Prop. 5 is the reason why we only study proper taboo-sets, for the existence of arbitrary k-suffixes is necessary in many of our proofs. In particular, we will mostly focus on the case in which T is left proper, although analogous results for right proper taboo-sets are obtained by reversing the order of the symbols composing the string. When the length of a taboo-free string is at least M, then Prop. 5 is easy to apply. Otherwise, one needs to study the k-prefixability of this string. To that end, the following definition can be useful. bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

10 Cassius Manuel, Arndt von Haeseler

Definition 5 For set of strings S, the suffixes of S are defined as  [ [  [ suf(S) := s[i, |s|] {e}. s∈S i∈[2,|s|] Example 5 If S = {ACG, GGG, T T C, CC} then suf(S) = {CG, G, GG, T C, C, e}.

If T is left proper, then suf(VM (T)) is of great interest, for it indicates which small suffixes are prefixable, as the following proposition indicates.

Proposition 6 Consider a left proper taboo-set T, j ∈ N0 and s ∈ Vj(T). s a) If j ≤ M − 1 and s∈ / suf(VM (T)), then for every n ≥ M,Vn (T) = ∅. s b) If either j ≥ M or s ∈ suf(VM (T)), then for n ≥ max(j,M), Vn (T) 6= ∅. Proof s a) If j ≤ M − 1 and s∈ / suf(VM (T)), since suf(VM (T)) ⊂ suf(VM (T)), it s s holds that VM (T) = ∅. This implies that Vn (T) is empty for every n ≥ M, because otherwise s s ∅ ( Vn (T)[n − M + 1,n] ⊂ VM (T), what would be a contradiction. b) If j ≥ M, since T is left proper, Prop. 5 does the work. If s ∈ suf(VM (T)), then s is (M − j)-prefixable, so again we apply Prop. 5. 

s To prove the connectivity of graphs Γn(T), we will need to know whether two different strings have a k-prefix in common, hence a definition is needed. Definition 6 Given a taboo-set T, we say that two taboo-free strings s and r (maybe of different length) are left k-synchronized if Lk(s) T Lk(r) 6= ∅. If Rk(s) T Rk(r) 6= ∅, then we say that s and r are right k-synchronized. Clearly, two taboo-free strings that are left k∗-synchronized are also left k-synchronized for any k ≤ k∗ (one simply has to ”cut” the last k symbols of ∗ ∗ Lk (s) T Lk (r)). The following proposition states when we can also guarantee this property for k > k∗ if T is left proper. Proposition 7 Consider a left proper taboo-set T and two taboo-free strings s, r, with length greater than zero, such that s and r are left (M−1)-synchronised. Then s and r are left k-synchronized for any k ∈ N. Proof Clearly the case k ≤ M − 1 holds. For k > M − 1, consider a string w ∈ LM−1(s) T LM−1(r). We know that ws and wr are taboo-free strings with length at least M. Since is left proper, Prop. 5 implies that w is k0- T 0 prefixable for any k0 ∈ N. For any k0, take x ∈ Lk (w), and consider strings xws and xwr. The fact that |w| = M − 1, together with the fact that xw and the pair ws, wr are taboo-free, allows applying Prop. 3, hence xws and 0 0 xwr are also taboo-free. All in all, xw ∈ LM−1+k (s) T LM−1+k (r). We write 0 k := M − 1 + k , yielding the result for any k > M − 1.  bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Structure of the space of taboo-free sequences 11

For cases where the number of taboos is small, the following proposition is an easy way to prove that every two taboo-free strings of length M are left k-synchronized.

Proposition 8 Take a left proper taboo-set T such that any w1, w2 ∈ VM (T) with d(w1, w2) = 1 are left 1-synchronized. Then, for each w1, w2 ∈ VM (T) with d(w1, w2) = 1, w1 and w2 are left k-synchronized for any k ∈ N0.

Proof Given any left 1-synchronized pair w1, w2 with d(w1, w2) = 1, there exists a ∈ Σ such that aw1 and aw2 are taboo-free. Since (awi)[1, M] ∈ VM (T) for i ∈ {1, 2} and the Hamming distance between these two strings is at most 1, aw1, aw2 are 1-synchronized, hence there exists b ∈ Σ such that baw1 and baw2 are taboo-free, i.e. w1 and w2 are left 2-synchronized. Continuing with this process, the result follows. 

6 Simplification of suffixes

ws Consider two taboo-free strings w, s and the set Vn+|w|+|s|(T). In general, it ws w is not true that Vn+|w|+|s|(T) = Vn+|w|(T) ◦ s, because it can happen that, for w some r ∈ Vn+|w|(T), the concatenation rs creates a taboo substring. Only if ws w w Vn+|w|+|s|(T) = Vn+|w|(T) ◦ s, one can simply take Vn+|w|(T) an append an s at the end of each of its strings. This property is studied in this section.

M−1 j Proposition 9 For a taboo-set T, consider w ∈ Σ and s ∈ Σ , j ∈ N0, such that ws ∈ VM−1+j(T). Then for any n ≥ M − 1,

ws w Vn+j(T) = Vn (T) ◦ s.

Proof The inclusion ⊂ is clear. The inclusion ⊃ follows from the fact that, if w we are given rw ∈ Vn (T) such that ws ∈ VM−1+j(T), since |w| = M − 1, we can apply Prop. 3, yielding that rws is a taboo-free string. 

Definition 7 Given a taboo-set T and s ∈ Vj(T), j ≥ 0, we define n o

ks := max i ∈ [0, j] s[1, i] ∈ suf(T) ,

i.e. ks denotes the length of the longest prefix of s being the suffix of a taboo.

Note that ks is always well defined, for s[1, 0] = e ∈ suf(T), hence ks ∈ [0, min (M − 1, j)]. Using Def.7 , Prop. 9 can be improved.

Proposition 10 Let a taboo-set T, j ∈ N0 and s ∈ Vj(T) be given. For any n ∈ N0 it holds that

V s ( ) = V s[1,ks]( ) ◦ s[k + 1,j]. n+j T n+ks T s bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

12 Cassius Manuel, Arndt von Haeseler

Proof The result is obvious if j = 0 or n = 0, hence assume j > 0 and n > 0. Along the reasoning that follows, note that it also applies for the case ks = 0. Clearly V s ( ) ⊂ V s[1,ks]( ) ◦ s[k + 1,j]. For r ∈ V ( ), consider rs[1, k ] ∈ n+j T n+ks T s n T s V s[1,ks]( ). We need to prove that the string n+ks T

rs[1, ks]s[ks + 1,j] = rs

is taboo-free. But otherwise, since rs[1, ks] and s are taboo-free, there would exist integers c, d such that 1 ≤ c ≤ |r| ≤ |r|+ks < d ≤ |r|+j and (rs)[c, d] ∈ T. ∗ ∗ Take k := d − |r| > ks, what yields s[1, k ] ∈ suf(T), contradicting the maximality of ks. Hence rs is taboo-free, as desired. 

Corollary 11 Given a taboo-set T, j ∈ N0 and s ∈ Vj(T), then for any k ∈ N0 k k L (s) = L (s[1, ks]). k s Proof By construction, L (s) = Vj+k(T)[1, k]. Prop. 10 yields   V s ( )[1, k] = V s[1,ks]( ) ◦ s[k + 1,j] [1, k] = k+j T k+ks T s = V s[1,ks]( )[1, k] = Lk(s[1, k ]). k+ks T s  Corollary 12 Given taboo-set T and any pair of taboo-free strings s and r, the following statements are equivalent for any k ∈ N0: – s and r are left k-synchronized – s[1, ks] and r[1, kr] are left k-synchronized. Proof Strings s and r are left k-synchronized iff Lk(s) T Lk(r) 6= ∅. We just have to apply Corollary 11. 

All in all, we learnt in this section that string s[1, ks] is informative enough s k to construct Vn (T) or L (s).

s 7 Isomorphisms of graphs Γn(T)

We will make use of common terminology of graph theory (cf. [24]). In this work, the term ”graph” denotes a simple undirected graph. We say that graph G1 = (V1,E1) is subgraph of G2 = (V2,E2) if V1 ⊂ V2 and E1 ⊂ E2, and write G1 ⊂ G2. Given a graph G = (V,E) and a subset S ⊂ V , the subgraph induced by S in G, G(S) = (S, ES), has vertex set S and, for any u, v ∈ S, {u, v} ∈ ES s iff {u, v} ∈ E. To avoid a too complex notation, given a subset Vn (T) ⊂ Vn(T) s for some taboo-free string s, we refer to the subgraph induced by Vn (T) in s s Γn(T) as Γn(T). In particular, Γn(T) ⊂ Γn(T).

Two graphs G1 = (V1,E1) and G2 = (V2,E2) are isomorphic, denoted by G1 ' G2, if there exists a bijection f : V1 → V2 such that, for every u, v ∈ V1, {u, v} ∈ E1 iff {f(u), f(v)} ∈ E2. That is, G1 and G2 are isomorphic if there exists an edge-preserving bijection between their vertex sets. bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Structure of the space of taboo-free sequences 13

Proposition 13 Given T, j ∈ N0 and s ∈ Vj(T), consider a taboo-free string w satisfying ws ∈ V|w|+j(T) and such that, for a given n ≥ |w|, ws w Vn+j(T) = Vn (T) ◦ s. ws w Then Γn+j(T) and Γn (T) are isomorphic. ws ws w Proof Since the vertex set of Γn+j(T) is Vn+j(T) = Vn (T)◦s, we will establish this bijection using the map w w f : Vn (T) ◦ s → Vn (T) w 7→ w[1, n]. The map f is well defined and bijective. Moreover, this is an edge-preserving n j bijection: Given any pair of strings s1, s2 ∈ Σ and any string s ∈ Σ , d(s1, s2) = 1 iff d(s1s, s2s) = 1.  s Propositions 13 and 9 imply that, given Γn+j(T) for some s ∈ Vj(T) with s s[1,M−1] j ≥ M, we can do Γn+j(T) ' Γn+M−1 (T). Going further, Prop. 10 yields Γ s ( ) ' Γ s[1,ks]( ), n+j T n+ks T s or differently said, we have split graphs Γn(T) into classes, each of which is s[1,ks] represented by Γn (T):

Proposition 14 Consider a taboo-set T, j ∈ N0 and s ∈ Vj(T). Then a unique w ∈ suf(T) exists such that w = s[1, ks]. Moreover, for any n ≥ 0, s w Γn+j(T) ' Γn+|w|(T). s Prop. 14 does not describe in which cases Vn+j(T) = ∅. However, if T is left proper, Prop. 6 implies that this happens iff j ≤ M − 1 and s∈ / suf(VM (T)). This suggests that we can state a version of Prop. 14 for left proper T, although before doing so let us introduce a new object.

Definition 8 Given a left proper taboo-set T, the long suffix classification lsc(T) is defined as

lsc(T) := {w ∈ suf(T) such that ∃s ∈ VM (T) satisfying s[1, ks] = w}, that is, lsc(T) is the set of all suffixes of taboos that are the longest prefix of at least one taboo-free string of length M.

Example 6 If Σ1 = {A, C, G, T } and T1 = {AA, CC, GG, T T }, then [ lsc(T1) ⊂ suf(T1) = {A, C, G, T, e} = Σ1 {e}.

For any s ∈ V2(T1), we see ks > 0, hence e∈ / lsc(T1). Moreover,

{AC, CG, GT, T A} ⊂ V2(T1), 0 0 yielding lsc(T1) = Σ1. If we consider Σ2 := {A, C, G, T, C }, where C could 0 represent a 5-methylcytosine, and T2 := T1, then string s = C A satisfies ks = 0, hence lsc(T2) = suf(T2). bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

14 Cassius Manuel, Arndt von Haeseler

s The following theorem classfies graphs Γn(T) for left proper T.

Theorem 15 Given a left proper taboo-set T and j ∈ N0, consider s ∈ Vj(T) s such that either j ≥ M or s ∈ suf(VM (T)). Then Vn+|s|(T) 6= ∅ for n ≥ 0. T There exists a unique w ∈ suf(VM (T)) suf(T) such that w = s[1, ks], which s w satisfies Γn+j(T) ' Γn+|w|(T) for n ≥ 0. If j ≥ M, then w ∈ lsc(T).

Proof Prop. 6 yields V s ( ) 6= ∅ for n ≥ 0, and Γ s ( ) ' Γ s[1,ks]( ) for n+|s| T n+j T n+ks T n ≥ 0 follows from Prop. 14. Hence we can set w := s[1, ks], which by definition belongs to suf(T). We also have w ∈ suf(VM (T)), which follows from s being k-suffixable for any k ∈ N0 and Prop. 6. This w is trivially unique since ks is uniquely determined given s. If j ≥ M, the fact that s[1, M] ∈ VM (T) and the definition of lsc(T) imply that w ∈ lsc(T). 

To compute lsc(T), we recommend that T be minimal, for otherwise the complexity of the problem increases. Theorem 15 implies that \ lsc(T) ⊂ suf(VM (T)) suf(T), (4) thus we define the short suffix classification as  \  ssc(T) := suf(VM (T)) suf(T) − lsc(T), (5)

or equivalently, \ ssc(T) ={w ∈ suf(T) suf(VM (T)) s.t. there exists i ∈ [1, M − 1 − |w|] (6) M−|w| satisfying w ◦ R (w)[1, i] ⊂ suf(T)}. Eq. 6 is quite technical, but if w satisfies the stronger condition |w| < M −1 and w ◦ Ri(w) ⊂ suf(T) for some i ∈ [1, M − 1 − |w|], then any s ∈ w ◦ Ri(w) satisfies s[1, ks + i] ∈ suf(T), hence w∈ / lsc(T). In words, ssc(T) contains every taboo suffix w such that, for some i and every i-suffix r, string wr is the suffix of a taboo.

Example 7 If Σ1 = {A, C, G, T } and T1 = {AA, CC, GG, T T }, then e ∈ ssc(T1), for e ◦ Σ ⊂ suf(T1).

8 The quotient graph

The quotient graph is a simple tool used to describe some properties related to the connectivity of graphs (see [3]). Definition 9 Take a graph G = (V,E) and a partition of vertex set V , namely F V = b∈J Vb for some index set J. Then the quotient graph of G, denoted as Q(G)=(J, EJ ), is the graph whose vertices are J and such that {b1, b2} ∈ EJ iff an edge connects a vertex

in Vb1 with a vertex in Vb2 . bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Structure of the space of taboo-free sequences 15

3 1 5

8

7 a b c d 2 6 4

Fig. 2 Example of a quotient graph. Given graph G = (V,E) on the left hand side, where F F F V = {1, 2, 3, 4, 5, 6, 7, 8}, we consider partition V = Va Vb Vc Vd, which induces the quotient graph Q(G) on the right hand side.

See Figure 2 for a graphical example of a quotient graph. Our strategy to prove connectivity will be centered at the usage of following proposition, whose proof is simple enough to be omitted. F Proposition 16 Given graph G = (V,E), consider a partition V = b∈J Vb. If every induced subgraph G(Vb) is connected and the quotient graph Q(G) is connected, then G is also connected.

The following is a similar result.

Proposition 17 Given graph G = (V,E), the following are equivalent: – G is connected. – For every partition of V , the quotient graph Q(G) is connected.

Both Prop. 16 and 17 are basic, although productive results. The following proposition will highlight the strong interaction between the quotient graph and the synchronization of taboo-free strings.

Proposition 18 Given taboo-set T, j ∈ N0 and n ∈ N0, consider graph F Γn+j(T), a subset S ⊆ Vn+j(T) and a partition S = b∈J Sb. Assume that s1 s2 Sb1 = Vn+j(T) and Sb2 = Vn+j(T) for a pair of different b1, b2 ∈ J and differ- ent s1, s2 ∈ Vj(T), and consider the quotient graph Q(Γn+j(T)(S)) = {J, EJ }.

Then {b1, b2} ∈ EJ iff s1 and s2 are left n-synchronized and d(s1, s2) = 1.

Proof By definition, b1 and b2 are adjacent in Q(Γn+j(T)(S)) iff in graph s1 s2 Γn+j(T) an edge connects a vertex in Vn+j(T) with a vertex in Vn+j(T). Since d(s1, s2) ≥ 1, this edge exists iff d(s1, s2) = 1 and there exists s ∈ Vn(T) such that ss1, ss2 ∈ Vn+j(T). The last condition is the definition of s1 and s2 being left n-synchronized. 

Prop. 18 combines very well with Prop. 7, where we proved that, if T is left proper and strings s, r are left (M − 1)-synchronized, then s and r are bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

16 Cassius Manuel, Arndt von Haeseler

∗ left k-synchronized for any k ∈ N0. Moreover, if s, r are left k -synchronized, then they are left k-synchronized for any k < k∗. We now take S = V s ( ) for some taboo-free string s and consider partition n0 T s F ws V ( ) = k V ( ) for some k ∈ with n ≥ j + k . Then for two n0 T w∈L 0 (s) n0 T 0 N 0 0 k0 different w1, w2 ∈ L (s), we set s1 := w1s and s2 := w2s. Prop. 18 implies that w and w are adjacent in Q(Γ s ( )) iff s and s are are left (n −j−k )- 1 2 n0 T 1 2 0 0 synchronized and d(w1, w2) = 1. If n0 − j − k0 = M − 1, then we infer that w1 and w2 are left k-synchronized for arbitrary k, thus w1 and w2 are adjacent in s P (Γn(T)) for arbitrary n ≥ j + k0. All in all, we have the following lemma.

Lemma 19 Given a left proper taboo-set T, j ∈ N0, s ∈ Vj(T) and k ∈ N, s F ws consider, for any n ≥ j + k, partition Vn (T) = w∈Lk(s) Vn (T) and quotient s k graph Q(Γn(T)) = (L (s), ELk(s)). Then it holds that

s s s Q(Γj+k(T)) ⊇ Q(Γj+k+1(T)) ⊇ · · · ⊇ Q(Γj+k+M−1(T)) = s s = Q(Γj+k+M (T)) = Q(Γj+k+M+1(T)) = ··· .

s s If Q(Γj+k+M−1(T)) is connected, then Q(Γn(T)) is connected for n ≥ j + k. Proof We apply the mentioned k-synchronization properties and Prop. 18. If n > n ≥ j + k, then Q(Γ s ( )) ⊆ Q(Γ s ( )). If n ≥ j + k + M − 1 and 1 2 n1 T n2 T 1 n ≥ j + k + M − 1, then Q(Γ s ( )) = Q(Γ s ( )). Regarding connectivity, 2 n1 T n2 T given graphs G1 and G2 with the same vertex set V1 = V2 such that G1 ⊂ G2, if subgraph G1 is connected, then G2 is connected.  Figure 3 visualizes Lemma 19 for alphabet Σ = {a, b, c}, taboo-set T = {ba, aa, ac, cc} (which is left proper), suffix s = b and k = 1.

s 9 Connectivity of Γn(T) and Γn(T)

s We are finally ready to study the connectivity of graphs Γn(T) for |s| ≥ M. w Lemma 20 Given a left proper T, for any w ∈ VM (T) consider the set V2M (T) and partition w G aw V2M (T) = V2M (T), a∈L1(w) w inducing the quotient graph P (Γ2M (T)). Then the following are equivalent: w 1 a) For every w ∈ VM (T), P (Γ2M (T)) = (L (w), EL1(w)) is connected. w b) For every w ∈ VM (T) and integer n ≥ M, Γn (T) is connected. Proof Prop. 17 states that, in a connected graph, every quotient graph is connected, thus b) implies a) by considering n = 2M. Let us prove by induction that a) implies b). For n = M and w ∈ VM (T), we have that w VM (T) = {w}, bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Structure of the space of taboo-free sequences 17

a a a

b c b c b c

Fig. 3 Visualization of Lemma 19 for Σ = {a, b, c}, T = {ba, aa, ac, cc}, s = b and k = 1. It holds that L1(b) = {a, b, c}.

w w hence ΓM (T) is connected. For the inductive step, assume that Γn (T) is con- nected for every w ∈ VM (T) and up to an integer n ≥ M. We will prove that w also every Γn+1(T) is connected. Consider

w G aw Vn+1(T) = Vn+1(T). a∈L1(w) Let us write w separating the first M − 1 symbols from the last one, that is M−1 1 aw arc w = rc for r ∈ Σ and c ∈ Σ. Then for any a ∈ L (w), Vn+1(T) = Vn+1(T). arc ar Since |r| = M −1, Prop. 9 implies Vn+1(T) = Vn (T)◦c, while the isomorphism of Prop. 13 yields aw arc ar Γn+1(T) = Γn+1(T) ' Γn (T). aw Thus every Γn+1(T) is connected, for the induction hypothesis implies that ar w Γn (T) is connected since ar ∈ VM (T). To prove that graph Γn+1(T) is con- nected, it remains to apply Prop. 16, so we need to prove that the quotient w F aw w graph induced by partition Vn+1(T) = a∈L1(w) Vn+1(T), namely Q(Γn+1(T)), is connected. w F aw We know that, given partition V2M (T) = a∈L1(w) V2M (T), the quotient graph w Q(Γ2M (T)) is connected. Applying Lemma 19 with j = M, s = w and k = 1, w 0 every Q(Γn0 (T)) is connected for n ≥ M, as desired.  Lemma 20 is very interesting: We wanted to characterize the connectivity s of graphs Γn(T) for s ∈ VM (T) and n ≥ M. We have proved that it is enough bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

18 Cassius Manuel, Arndt von Haeseler

w to study a finite number of graphs, namely Q(Γ2M (T)) for s ∈ VM (T), that is, |VM (T)| graphs. Let us state the connectivity results that follow from Lemma 20 and Theorem 15.

Theorem 21 Given a left proper T, the following are equivalent: s a) For any j ≥ M, any s ∈ Vj(T) and any integer n ≥ j, Γn(T) is connected. w b) For any w ∈ VM (T) and any integer n ≥ M, Γn (T) is connected. r c) For any r ∈ lsc(T), ΓM+|r|(T) is connected. r F ar d) For any r ∈ lsc(T), partition VM+|r|(T) = a∈L1(r) VM+|r|(T), induces a r connected P (ΓM+|r|(T)). Proof Implication a) ⇒ b) is obvious, while b) ⇒ c) is consequence of Theorem 15. Moreover, c) ⇒ d) follows from Prop. 17. To prove that d) ⇒ a), we split it into two parts, d) ⇒ b) and b) ⇒ a). Implication b) ⇒ a) follows from s s[1,M] isomorphism Γn+j(T) ' Γn+M (T) and s[1, M] ∈ VM (T). 1 1 It remains to prove d) ⇒ b). Corollary 11 implies L (w) = L (w[1, kw]). 1 For any w ∈ VM (T) and a, b ∈ L (w), we claim that aw and bw are left k-synchronized iff aw[1, kw] and bw[1, kw] are left k-synchronized. Indeed, the implication ⇒) is obvious, so let us prove ⇐). Given a taboo-free string r ∈ Vj(T) such that raw[1, kw] and rbw[1, kw] are taboo-free, we want to prove that also raw and rbw are taboo-free. But if that were not the case, it would be the consequence of either (raw)[c, d] ∈ T or (rbw)[c, d] ∈ T for some integers 1 ≤ c ≤ j < j + 1 + kw ≤ d ≤ j + 1 + M. However, that contradicts the maximality of kw, yielding ⇐).

This claim and Prop. 18 imply that, if we write r = w[1, kw] for some w ∈ r F ar VM (T), given partition VM+|r|(T) = a∈L1(r) VM+|r|(T), it holds that r w Q(Γn(T)) ' Q(Γn (T)).

Theorem 15 implies that, for every w ∈ VM (T), there exists r = w[1, kw] ∈ lsc(T). Applying Lemma 20, finally b) ⇒ d) follows.  It is worth noticing how simpler the connectivity problem has become. s Initially, we were studying whether every Γn(T) for s ∈ Vj(T) and j ≥ M is connected, obtaining in Lemma 20 that this is equivalent to the connectivity w of graphs Γ2M (T) for w ∈ VM (T), which are |VM (T)| graphs. Now we see, using Theorem 21, that we only need to prove the connectivity of | lsc(T)| ≤ r r | suf(T)| ≤ (M − 1)|T| + 1 graphs, namely either P (ΓM+|r|(T)) or ΓM+|r|(T) for r ∈ lsc(T). We give an example. Example 8 Take Σ = {A, C, G, T } and T = {AA, CCC}, which is left proper. Since M = 3 and lsc(T) = suf(T) = {e, A, C, CC}, the connectivity of graphs A C CC Γ3(T), Γ4 (T), Γ4 (T), Γ5 (T) w implies that any Γn (T) with w ∈ suf(T) is connected. Proposition 14 implies s that, for any taboo-free string s and n ≥ |s|, Γn(T) is connected. bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Structure of the space of taboo-free sequences 19

s Theorem 21 characterizes the connectivity of every Γn+|s|(T) for |s| ≥ M. We know from Theorem 15 that there exists T s r r ∈ lsc(T) ⊆ suf(VM (T)) suf(T) such that Γn+|s|(T) ' Γn+|r|(T). Since T ssc(T) := suf(VM (T)) suf(T) − lsc(T), to complete our characterization of the connectivity of every Hamming graph with taboos, it remains to consider p the connectivity of graph Γn (T) for p ∈ ssc(T). The techniques employed so far are flexible and give the following.

Proposition 22 Given a left proper T and p ∈ ssc(T), assume that, for every r r ∈ lsc(T), ΓM+|r|(T) is connected. Given k ∈ N, if partition

p G wp V|p|+k+M−1(T) = V|p|+k+M−1(T) w∈Lk(p)

k satisfies that (wp)[1, kwp] ∈ lsc(T) for each w ∈ L (p), and moreover p p Q(Γ|p|+k+M−1(T)) is connected, then Γn (T) is connected for n ≥ |p| + k.

Proof For n ≥ |p| + k, given partition

p G wp Vn (T) = Vn (T), w∈Lk(p)

wp subgraphs Γn (T) are connected due to (wp)[1, kwp] ∈ lsc(T). Moreover, since p p Q(Γ2M−1(T)) is connected, Lemma 19 with s = p implies that Q(Γn (T)) is connected for n ≥ |p| + k. The result follows applying Prop. 16.  p In Prop. 22, one can always take k = M −|p| and just check if Q(Γ2M−1(T)) p or Γ2M−1(T) is connected. However, when | ssc(T)| has just a few elements, it is easier to try a smaller k.

In general, if T is not very restrictive, proving connectivity becomes easier since most of strings are left k-synchronized. In Prop. 23 only previous results are used, while in Prop. 24 we study this case more exhaustively in a self- contained manner. Note that, when taboo-set T is minimal, the assumptions of Prop. 24 are much easier to check.

Proposition 23 Given a left proper T such that every pair of strings w1, w2 ∈ VM (T) with d(w1, w2) = 1 is left 1-synchronized, it holds that:

r a) For any r ∈ lsc(T) and n ∈ N0, Γn+|r|(T) is connected. p p b) For any p ∈ ssc(T) with connected ΓM (T), Γn (T) is connected for n ≥ M.

Proof Prop. 8 implies that every pair w1, w2 ∈ VM (T) is left k-synchronized w 1 for any k ∈ N. Any quotient graph Q(Γn (T)) = (L (w), EL1(w)) induced by w F aw partition Vn (T) = a∈L1(w) Vn (T) is completely connected, hence Theorem p F wp 21 implies a). Similarly with partition Vn (T) = w∈LM−|p|(p) Vn (T), since p p Q(Γn (T)) ' ΓM (T) for n ≥ M, Prop. 22 implies b).  bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

20 Cassius Manuel, Arndt von Haeseler

Example 9 For Σ = {A, C, G, T } and T = {AA, CCC}, the strings T w1 and T w2 are taboo-free for w1, w2 ∈ V3(T), hence they are left 1-synchronized. Since s lsc(T) = suf(T), for any taboo-free string s and n ≥ |s|, Γn(T) is connected. Proposition 24 Given taboo-set and set Ψ( ) := S t[2, |t|], if ev- T T t∈T ery pair of taboo-free strings w1, w2 ∈ Ψ(T) with |w1| ≥ |w2| and  d w1[1, |w2|] , w2 ≤ 1 is left 1-synchronized, then it holds that: a) Every taboo-free string is 1-suffixable. In particular, T is left proper. b) Every two taboo-free strings s1, s2 with d(s1, s2) = 1 are left 1-synchronized. s c) Graph Γn(T) is connected for every taboo-free string s and n ≥ |s|. Proof a) Consider any taboo-free string s. Assume that, for each a ∈ Σ, as is not taboo-free, that is, that for some integer ca ≥ 2, (as)[1, ca] ∈ T. WLOG

assume ca1 ≤ · · · ≤ cam and consider s[2, cam ] ∈ Ψ(T), which is not left 1-synchronized with itself, what contradicts the assumptions of the state- ment, hence s is 1-suffixable. Taking s ∈ VM (T) we see that T is left proper. b) Given taboo-free strings s1, s2 such that d(s1, s2) = 1, assume that they are not 1-synchronized. Then for every a ∈ Σ, either (as1)[1, ca] ∈ T or S (as2)[1, ca] ∈ T for some ca ≥ 2. Denote by C1 ⊂ a∈Σ{ca} those ca such that (as1)[1, ca] ∈ T, and analogously with C2. If C1 were empty, then s2 would not be 1-suffixable, contradicting a). Thus, both C1 and C2 must be nonempty. Consider d1 := max{c : c ∈ C1} and d2 := max{c : c ∈ C2}. It holds that s1[2, d1] ∈ Ψ(T) and s2[2, d2] ∈ Ψ(T). Moreover, we have that the pair s1[2, d1], s2[2, d2] is not left 1-synchronized. Since d(s1, s2) = 1, that contradicts the assumptions of the statement, hence s1 and s2 must be left 1-synchronized, as desired. s c) Clearly Γ|s|(T) is connected, so let us proceed by induction. Assume s s Γn(T) is connected for a fixed n ≥ |s| and consider Γn+1(T). Since s s s s Vn+1(T) ⊂ Σ ◦ Vn (T), if |Vn (T)| = 1, then Γn+1(T) is connected. Oth- s erwise we take different s1, s2 ∈ Vn+1(T); we will prove that they are con- s nected. We know that s1, s2 ∈ Σ ◦ Vn (T), hence let us write s1 = c1w1 and s s2 = c2w2 for ci ∈ Σ and wi ∈ Vn (T). If w1 = w2, the result is obvious, so assume w1 6= w2. s By hypothesis, Γn(T) is connected, thus there exists a path of vertices s of Vn (T), namely y1, ··· , yD, such that d(yi, yi+1) = 1, y1 = w1 and yD = w2. For every j ∈ [1, D − 1], the pair yj, yj+1 is left 1-synchronized, thus there exists bj ∈ Σ such that bjyj and bjyj+1 are taboo-free. Since s d(bjyj, bjyj+1) = 1, bjyj and bjyj+1 are adjacent in Γn+1(T). Moreover ev- ery pair of taboo-free strings contained in Σ◦yi is adjacent for i ∈ [1, D−1]. Since the relation ”being connected” is transitive, vertices s1 ∈ Σ ◦ y1 and s2 ∈ Σ ◦ yD are connected, as desired.  Example 10 If Σ = {A, C, G, T } and T = {AA, CC, GG, T T }, then Ψ(T) = {A, C, G, T }. Every pair of strings in Ψ(T) is left 1-synchronized, s hence for every taboo-free s and n ≥ |s|, Γn(T) is connected. bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Structure of the space of taboo-free sequences 21

Now we aim to find a lower bound for the number of taboos needed to s s disconnect one of the graphs Γn(T). Let Γn(T) denote the subgraph induced in Γn(T) by all the taboo-free strings with prefix s. Then the following Corollary of Prop. 24 holds. Corollary 25 Consider an alphabet Σ with m := |Σ| and a taboo-set T. The following holds: s a) If |T[1, 1]| < m, then for any taboo-free string s and n ≥ |s|, Γn(T) is connected. S s b) If | t[|t|, |t|]| < m, then for any taboo-free string s and n ≥ |s|, Γn( ) t∈T T is connected. S c) If | [1, 1]| < m or | t[|t|, |t|]| < m, then Γn( ) is connected for n ≥ 0. T t∈T T Proof 1 T 1 a) Assume that taboo-free strings s1, s2 satisfy L (s1) L (s2) = ∅. That is, for each a ∈ Σ, either as1 or as2 has a taboo as prefix, contradicting |T[1, 1]| < m. Therefore we can apply Prop. 24, implying a). b) It follows from statement a), reversing the order of the symbols composing the strings. e e c) Since Γn(T) = Γn(T), it follows from a) and b).  s The upper bound |T| < m, which guarantees connectivity for every Γn(T), is the best one achievable without making additional assumptions. This is consequence of Corollary 25 and the following examples. Example 11 If Σ = {0, 1} and T = {10, 01}, then T is left proper and |T[1, 1]| = |T[2, 2]| = 2 = |Σ|. For n ≥ 2, Vn(T) = {0 ··· 0, 1 ··· 1}, what makes 0 1 Γn(T) disconnected. The trivial graphs Γn (T) and Γn (T) are both connected.

Example 12 For m ≥ 3, Σ = {a1, ··· , am} and the left proper taboo-set G T = {a3a1, a4a1, a5a1, ··· , ama1} {a1a2, a2a2},

a1 we claim that Γn (T) is disconnected for n ≥ 3. Indeed,

a1 a1a1 G a2a1 Vn (T) = Vn (T) Vn (T) =     a2a1a1 G a1a1a1 G G aia2a1 = Vn (T) Vn (T) Vn (T) , i∈[3,m]

a2a1a1 F a1a1a1 F aia2a1 so take s ∈ Vn (T) Vn (T) and r ∈ i∈[3,m] Vn (T). It holds that a1 d(s, r) ≥ 2, hence we found two disconnected components in graph Γn (T). This is coherent with |T[1, 1]| = m. Since |T[2, 2]| = |{a1, a2}| ≤ m, Prop. 25 implies that Γn(T) is connected for any n ∈ N. i) To generalize this example, for i ∈ N0, denote by si := a1···a1 the concatena- tion of i a1’s. The taboo-set G Ti = {a3si, a4si, ··· , amsi} {a1a2si−1, a2a2si−1}

si satisfies that graph Γn (Ti) is disconnected for n ≥ i + 2. It holds that T = T1, where s0 = e and s1 = a1. bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

22 Cassius Manuel, Arndt von Haeseler

10 Connectivity of the examples of Section 2

A permutation of the symbols of alphabet Σ does not alter any of the re- sults that we proved along this work. Moreover, by reversing the order of the symbols, any statement regarding e.g. left-properness and suffixes has an anal- ogous one in which right-properness and suffixes are involved. On the other hand, taboo-sets induced by restriction enzymes remain invariant when we interchange every recognition sequence by its reverse complement. Therefore, s note that, for a bacterial taboo-set T, if we prove that every graph Γn(T) is s connected, then also every graph Γn(T) is connected.

10.1 Turneriella parva

s Since |TT.pa[1, 1]| < m, Corollary 25 implies that every graph Γn(TT.pa) is connected. Therefore the evolution of the DNA sequences can potentially reach any other taboo-free DNA sequence, no matter which suffix was conserved along this process.

10.2 Helicobacter pylori

We want to apply Prop. 24. Take any w1, w2 ∈ Ψ(TH.py) and as- sume that they are not left 1-synchronized. In particular WLOG we 1 1 can assume that T/∈ L (w1), implying w1 = GCA. If C/∈ L (w1), then w1 ∈ {CGG, CT C, AT G}, which is absurd since it contradicts w1 = GCA. 1 Therefore it must be C/∈ L (w2), yielding w2 ∈ {CGG, CT C, AT G}. In any case, d(w1, w2) ≥ 2, so Prop. 24 can be applied, therefore every graph s Γn(TH.py) is connected. In particular, Γn(TH.py) is connected.

10.3 An imaginary bacterium

Proposition 24 cannot be applied because CCC and GCC are not left 1- CCC synchronized. Consistently, the graph Γn (TB.im) is disconnected for n ≥ 5. CCC To prove this, take Vn (TB.im), which satisfies

CCC GCCC [ CCCC Vn (TB.im) = Vn (TB.im) Vn (TB.im) =

 AGCCC [ T GCCC  [  GCCCC [ CCCCC  Vn (TB.im) Vn (TB.im) Vn (TB.im) Vn (TB.im) ,

GCCC CCCC implying that, for any strings s1 ∈ Vn (TB.im) and s2 ∈ Vn (TB.im), it holds that d(s1, s2) ≥ 2. Thus we found two disconnected components in CCC GCCC CCCC Γn (TB.im), namely Γn (TB.im) and Γn (TB.im). bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Structure of the space of taboo-free sequences 23

11 Concluding remarks

Using the results proven in this work, it is possible to decide whether every s Hamming graph Γn(T) is connected. In most of biological cases, with just a small number of recognition sequences, it seems very likely that all these graphs are connected. This can be checked for most of instances using Prop. 24. In this sense, our work supports the assumption tacitly made when studying DNA- sequence evolution – namely, that every DNA sequence can reach any other sequence by single-point substitutions. On the other hand, our formal framework is a first and necessary step to study the effect of restriction enzymes in the DNA composition of bacteria and virus. Consider, for example, the phylogenetic studies in [1], where our H. pylori taboo-set TH.py was taken from. Is the inferred evolutionary time between two H. pylori populations affected by TH.py? Has their GC content varied due to the taboos of restriction enzymes? To give an answer, the random mutation of DNA can be modelled in a Monte Carlo framework, although the s connectivity of graphs Γn(TH.py) is of fundamental importance. More in general, recent progress has been made in the usage of restriction enzymes for the treatment of viral infections ([23]). Since one or just a few SNPs can significantly alter the symptoms or even the mortality associated to a virus infection (cfr. [4], [25], [15]), the techniques employed here could help to identify which adjacent SNPs hinder the harmful ones by inducing the cleave of the restriction enzymes. Although such strategies are currently unexplored, this work could be a first theoretical guide to a successful treatment.

12 Competing interests

The authors declare that they have no competing interests.

13 Authors’ contributions

Cassius Manuel wrote this manuscript with the guidance of Arndt von Hae- seler. All authors read and approved the manuscript.

14 Acknowledgements

We would like to thank Michael Charleston (University of Tasmania) for his constructive criticism. In fact the idea of studying taboo-free sequences origi- nated from discussions with Mike while Arndt received a visiting fellowship at the University of Tasmania. This work was supported by the Austrian Science Fund (FWF, grant number I-1824-B22) to Arnd von Haeseler. bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

24 Cassius Manuel, Arndt von Haeseler

References

1. Ailloud, F., Didelot, X., Woltemate, S., Pfaffinger, G., Overmann, J., Bader, R.C., Schulz, C., Malfertheiner, P., Suerbaum, S.: Within-host evolution of Helicobacter py- lori shaped by niche-specific adaptation, intragastric migrations and selective sweeps. Nature Communications 10(1), 2273 (2019). DOI 10.1038/s41467-019-10050-1. URL https://doi.org/10.1038/s41467-019-10050-1 2. Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K., Watson, J.: Molecular Biology of the Cell, 5th edn., chap. 8, pp. 532–534. Garland (2004) 3. Bader, D., Meyerhenke, H., Sanders, P., Wagner, D.: Graph partitioning and graph clustering. Proceedings of the 10th DIMACS implementation challenge workshop (2013). DOI 10.1090/conm/588 4. Collery, M.M., Kuehne, S.A., McBride, S.M., Kelly, M.L., Monot, M., Cockayne, A., Dupuy, B., Minton, N.P.: What’s a SNP between friends: The influence of single nu- cleotide polymorphisms on virulence and phenotypes of Clostridium difficile strain 630 and derivatives. Virulence 8(6), 767–781 (2017). DOI 10.1080/21505594.2016.1237333 5. Ellis-Monaghan, J.A., Pike, D.A., Zou, Y.: Decycling of Fibonacci cubes. Australasian Journal of Combinatorics 35, 31 (2006) 6. Fitch, W.M., Margoliash, E.: A method for estimating the number of invari- ant amino acid coding positions in a gene using cytochrome c as a model case. Biochemical Genetics 1(1), 65–71 (1967). DOI 10.1007/BF00487738. URL https://doi.org/10.1007/BF00487738 7. Gelfand, M., Koonin, E.: Avoidance of palindromic words in bacterial and archaeal genomes: A close connection with restriction enzymes. Nucleic acids research 25, 2430– 9 (1997). DOI 10.1093/nar/25.24.0 8. Goulden, P.I., Jackson, D.: An inversion theorem for cluster decompositions of sequences with distinguished subsequences. Journal of The London Mathematical Society-second Series 20, 567–576 (1979). DOI 10.1112/jlms/s2-20.3.567 9. Hsu., W.J.: Fibonacci cubes - a new interconnection topology. IEEE Trans. Parallel Distrib. Syst. 4(1), 129–132 (1993). DOI 10.1109/71.205649 10. Hsu, W.J., Chung, M.J.: Generalized Fibonacci Cubes. 1993 International Conference on Parallel Processing - ICPP’93 1, 299–302 (1993). DOI 10.1109/ICPP.1993.95 11. Ili´c,A., Klavˇzar,S., Rho, Y.: Generalized Fibonacci cubes. Discrete Mathematics 312, 2 – 11 (2012) 12. Klavˇzar.,S.: Structure of Fibonacci cubes: a survey. Journal of Combinatorial Opti- mization 25, 505–522 (2013). DOI 10.1007/s10878-011-9433-z 13. Klavˇzar,S., Mollard, M., Petkovˆsek,M.: The degree sequence of Fibonacci and Lucas cubes. Discrete Mathematics 311(14), 1310 – 1322 (2011) 14. Kommireddy, V., Nagaraja, V.: Diverse functions of restriction-modification systems in addition to cellular defense. Microbiology and molecular biology reviews : MMBR 77, 53–72 (2013). DOI 10.1128/MMBR.00044-12 15. Kondoh, T., Letko, M., Munster, V.J., Manzoor, R., Maruyama, J., Furuyama, W., Miyamoto, H., Shigeno, A., Fujikura, D., Takadate, Y., Yoshida, R., Igarashi, M., Feld- mann, H., Marzi, A., Takada, A.: Single-nucleotide polymorphisms in human NPC1 influence filovirus entry into cells. The Journal of Infectious Diseases 218(5), S397– S402 (2018). DOI 10.1093/infdis/jiy248 16. Munarini, E., Salvi, N.Z.: Structural and enumerative properties of the Fibonacci cubes. Discrete Mathematics 255(1-3), 317–324 (2002) 17. Noonan, J., Zeilberger, D.: The GouldenJackson cluster method: extensions, applica- tions and implementations. Journal of Difference Equations and Applications 5(4-5), 355–377 (1999). DOI 10.1080/10236199908808197 18. Roberts, R.J., Vincze, T., Posfai, J., Macelis, D.: REBASEa database for DNA restriction and modification: enzymes, genes and genomes. Nucleic Acids Research 43(D1), D298–D299 (2014). DOI 10.1093/nar/gku1046. URL https://doi.org/10.1093/nar/gku1046 19. Rocha, E., Danchin, A., Viari, A.: Evolutionary role of restriction/modification sys- tems as revealed by comparative genome analysis. Genome Research 11 (2001). DOI 10.1101/gr.GR-1531RR bioRxiv preprint doi: https://doi.org/10.1101/824847; this version posted October 30, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Structure of the space of taboo-free sequences 25

20. Shoemaker, J.S., Fitch, W.M.: Evidence from nuclear sequences that invariable sites should be considered when sequence divergence is calculated. Molecular Biology and Evolution 6(3), 270–289 (1989). DOI 10.1093/oxfordjournals.molbev.a040550. URL https://doi.org/10.1093/oxfordjournals.molbev.a040550 21. Strimmer, K., von Haeseler, A.: Genetic distances and nucleotide substitution models. In: P. Lemey, M. Salemi, V. Anne-Mieke (eds.) The Phylogenetic Handbook: a Prac- tical Approach to Phylogenetic Analysis and Hypothesis Testing, 2 edn., pp. 111–141. Cambridge University Press, Cambridge (2009). DOI 10.1017/CBO9780511819049.006 22. Ussery, D.W., Wassenaar, T.M., Borini, S.: Computing for Comparative Microbial Genome: Bioinformatics for Microbiologists, 1st edn. Springer. (2008) 23. Weber, N.D., Aubert, M., Dang, C.H., Stone, D., Jerome, K.R.: Dna cleavage en- zymes for treatment of persistent viral infections: recent advances and the pathway forward. Virology 454-455, 353–361 (2014). DOI 10.1016/j.virol.2013.12.037. URL https://www.ncbi.nlm.nih.gov/pubmed/24485787 24. Wilson, R.J.: Introduction to Graph Theory. John Wiley & Sons, Inc., New York, NY, USA (1986) 25. Yuan, L., Huang, X.Y., Liu, z.y., Zhang, F., Zhu, X.L., Yu, J.Y., Ji, X., Xu, Y., Li, G., Li, C., Wang, H.J., Deng, Y.Q., Wu, M., Cheng, M.L., Ye, Q., Xie, D.Y., Li, X.F., Wang, X., Shi, W., Qin, C.F.: A single mutation in the prM protein of Zika virus contributes to fetal microcephaly. Science 358, eaam7120 (2017). DOI 10.1126/science.aam7120