
Oliveira et al. Algorithms Mol Biol (2019) 14:21 https://doi.org/10.1186/s13015-019-0156-5 Algorithms for Molecular Biology

RESEARCH Open Access

Super short operations on both gene order and intergenic sizes

Andre R. Oliveira1*, Géraldine Jean2, Guillaume Fertin2, Ulisses Dias3 and Zanoni Dias1

Abstract

Background: The evolutionary distance between two genomes can be estimated by computing a minimum length sequence of operations, called rearrangements, that transform one genome into another. Usually, a genome is modeled as an ordered sequence of genes, and most of the studies in the genome rearrangement literature consist in shaping biological scenarios into mathematical models: for instance, allowing different genome rearrangement operations at the same time, adding constraints to these rearrangements (e.g., each rearrangement can affect at most a given number of genes), or considering that a rearrangement implies a cost depending on its length rather than a unit cost. Most of the works, however, have overlooked some important features inside genomes, such as the presence of sequences of nucleotides between genes, called intergenic regions.

Results and conclusions: In this work, we investigate the problem of computing the distance between two genomes, taking into account both gene order and intergenic sizes. The genome rearrangement operations we consider here are constrained types of reversals and transpositions, called super short reversals (SSRs) and super short transpositions (SSTs), which affect up to two (consecutive) genes. We denote by super short operations (SSOs) any SSR or SST. We present 3-approximation algorithms for SSRs, SSTs, and SSOs when the orientation of the genes is not considered, and 5-approximation algorithms for SSRs and SSOs when it is considered. We also show that these algorithms improve their approximation factors when the input permutation has a higher number of inversions: the approximation factor decreases from 3 to either 2 or 1.5, and from 5 to either 3 or 2.

Keywords: Genome rearrangements, Intergenic regions, Super short operations, Approximation algorithms

Background

Given two genomes $G_1$ and $G_2$, one way to estimate their evolutionary distance is to compute the minimum number of large scale events, called genome rearrangements, that are needed to transform $G_1$ into $G_2$. The minimality is required due to the commonly accepted parsimony principle, while the allowed genome rearrangements depend on the model, i.e. on the classes of events that supposedly happen during evolution.

Prior to counting rearrangement events, one needs to model the input genomes. Previous works [1–3] have defined genomes as ordered sequences of elements (genes). Variants within this setting can occur. For instance, each gene may appear either once or several times in a genome. In the latter case, genomes are modeled as strings, while in the former case they are modeled as permutations. Besides, genomes modeled as permutations may be signed or unsigned (the sign of an element represents the orientation of that gene in the DNA strand it lies on).

Concerning genome rearrangements, the most commonly studied events are the reversal, which consists in taking a continuous sequence in the genome, reversing it, and putting it back at the same location [4], and the transposition, which consists in taking a continuous sequence in the genome and putting it back in a different location [5]. A more recent and general type of genome rearrangement is the DCJ (Double-Cut and Join) [3], which cuts a genome between adjacent genes $a$ and $b$, and adjacent genes $c$ and $d$, and joins either $a$ to $c$ and $b$ to $d$, or $a$ to $d$ and $b$ to $c$.

*Correspondence: [email protected]
1 Institute of Computing, University of Campinas, Campinas, Brazil
Full list of author information is available at the end of the article

© The Author(s) 2019. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Since the mid-nineties, a very large amount of work has been done for computing distances between pairs of genomes, depending on the genome model and the allowed set of rearrangements. We refer the reader to the book by Fertin et al. [6] for a survey of algorithmic aspects.

In populations where rearrangement events that affect a very large portion of the genes are rare, we can restrict events to span at most $k$ genes, for some value of $k$ [7, 8]. During an analysis with closely-related pairs of bacterial genomes, the number of short inversions (inversions affecting up to three genes) was discovered to be very high, especially inversions of a single gene (which we call 1-reversal) [9]. There are also other works showing the prevalence of short inversions in bacterial genomes [10] and genomes [11, 12].

As previously mentioned, most of the works have assumed that a genome is an ordered sequence of genes. It has been argued that this model could underestimate the "true" evolutionary distance, and that other genome features should be taken into account to circumvent this problem [13, 14]. Indeed, genomes carry more information than just their ordered sequences of genes. In particular, consecutive genes are separated by DNA sequences called intergenic regions, each having different lengths in terms of number of nucleotides. These lengths may be used along with gene order to generate a more realistic model for genomes.

This recently led some authors to model a genome as an ordered sequence of genes, together with an ordered list of its intergenic sizes, and to consider the problem of computing the DCJ distance, either in the case where insertions and deletions of nucleotides are forbidden [15], or allowed [16].

Biller and coauthors [13] used the intergenic regions to define what they called fragile regions, regions where rearrangements are more likely to act. After identifying these fragile regions, practical tests showed that considering rearrangements on non-fragile regions can yield incoherent distance estimations. When using the equiprobable model (i.e., rearrangements can occur in any position with the same probability), practical tests [16] showed that statistical properties of the inferred scenarios for DCJs using intergenic regions are closer to the true ones than scenarios which do not use them.

In this work, we also consider genomes as ordered sequences of genes together with their intergenic sizes, in cases where the gene sequence is an unsigned or signed permutation and the considered rearrangement operations are super short reversal (or SSR, i.e. a reversal of (gene) length at most two), super short transposition (or SST, i.e. a transposition affecting only two genes), or both (super short operation or SSO). In this context, our goal is to determine the minimum number of SSRs/SSTs/SSOs that transform one genome into another.

This paper is organized as follows. In Section 2 we provide the notations that we will use throughout the paper, and we introduce novel ideas that will prove useful for studying the problem. In Sections 3-7 we derive lower and upper bounds on the sought distance for five different variants, which help us design an approximation algorithm of constant factor for each of these five problems. Section 8 presents a practical analysis of the five algorithms on simulated instances. Section 9 concludes the paper.

Definitions

We represent a genome $G$ with $n$ genes as an instance with (i) an $n$-tuple and (ii) $n + 1$ intergenic regions. If there are no duplicated genes, the $n$-tuple is a (possibly signed) permutation $\pi = (\pi_1 \pi_2 \cdots \pi_{n-1} \pi_n)$, with $|\pi_i| \in \{1, 2, \ldots, (n-1), n\}$, for $1 \le i \le n$, and $|\pi_i| = |\pi_j|$ if, and only if, $i = j$. If gene orientation is known, each element from $\pi$ has a $+$ or $-$ sign that indicates the gene orientation it represents, and we say that $\pi$ is a signed permutation; $\pi$ is an unsigned permutation otherwise. We denote by $\iota$ the identity permutation, in which all elements are in ascending order and with positive signs. The extended permutation is obtained from $\pi$ by adding two new elements: $\pi_0 = 0$ and $\pi_{n+1} = (n+1)$.

The intergenic region $r_i^{\pi}$ is located before element $\pi_i$ from the extended permutation $\pi$, for $1 \le i \le n + 1$. We denote by $\ell(r_i^{\pi})$ the length of intergenic region $r_i^{\pi}$, i.e., the number of nucleotides in $r_i^{\pi}$, with $\ell(r_i^{\pi}) \in \mathbb{N}$ for $1 \le i \le n + 1$. Let $r^{\pi} = (\ell(r_1^{\pi}), \ldots, \ell(r_{n+1}^{\pi}))$. An instance here is then formed by $(\pi, r^{\pi})$.

A reversal $\rho(i, j, x, y)$ applied over an instance $(\pi, r^{\pi})$, with $1 \le i \le j \le n$, $0 \le x \le \ell(r_i^{\pi})$, $0 \le y \le \ell(r_{j+1}^{\pi})$, and $x, y \in \mathbb{N}$, is an operation that generates $(\pi', r^{\pi'})$ by (i) reversing the order and the orientation of the elements in the subset of adjacent elements $\{\pi_i, \ldots, \pi_j\}$; (ii) reversing the order of intergenic regions in the subset of adjacent intergenic regions $\{r_{i+1}^{\pi}, \ldots, r_j^{\pi}\}$ when $j > i + 1$; (iii) cutting two intergenic regions: $r_i^{\pi}$ after $x$ nucleotides and $r_{j+1}^{\pi}$ after $y$ nucleotides such that $\ell(r_i^{\pi'}) = x + y$ and $\ell(r_{j+1}^{\pi'}) = (\ell(r_i^{\pi}) - x) + (\ell(r_{j+1}^{\pi}) - y)$.

A reversal $\rho(i, j, x, y)$ is also called a $g$-reversal, where $g = (j - i) + 1$. A super short reversal is a 1-reversal or a 2-reversal, i.e. a reversal that affects only one or two elements from $\pi$.

A transposition $\tau(i, j, k, x, y, z)$ applied over an instance $(\pi, r^{\pi})$, with $1 \le i < j < k \le n + 1$, $0 \le x \le \ell(r_i^{\pi})$, $0 \le y \le \ell(r_j^{\pi})$, $0 \le z \le \ell(r_k^{\pi})$, and $x, y, z \in \mathbb{N}$, is an operation that generates $(\pi', r^{\pi'})$ by (i) exchanging subsets of adjacent elements $\{\pi_i, \ldots, \pi_{j-1}\}$ and $\{\pi_j, \ldots, \pi_{k-1}\}$; (ii) moving subsets of adjacent intergenic regions $\{r_{j+1}^{\pi}, \ldots, r_{k-1}^{\pi}\}$
and $\{r_{i+1}^{\pi}, \ldots, r_{j-1}^{\pi}\}$ to start at positions $(i + 1)$ and $(i + k - j + 1)$, respectively; (iii) cutting three intergenic regions: $r_i^{\pi}$, $r_j^{\pi}$, and $r_k^{\pi}$ such that $\ell(r_i^{\pi'}) = x + \ell(r_j^{\pi}) - y$, $\ell(r_{i+k-j}^{\pi'}) = \ell(r_i^{\pi}) - x + z$, and $\ell(r_k^{\pi'}) = \ell(r_k^{\pi}) - z + y$.

A transposition $\tau(i, j, k, x, y, z)$ is called a $g$-transposition, where $g = k - i$, and we say that a $g$-transposition is super short if $g = 2$.

Figure 1 shows a sequence of two super short reversals and one super short transposition that transforms the permutation $\pi = (1\ 3\ 4\ 2\ 5)$ with $r^{\pi} = (3, 5, 2, 1, 2, 8)$ into $\iota = (1\ 2\ 3\ 4\ 5)$ with $r^{\iota} = (3, 2, 6, 4, 5, 1)$.

Fig. 1 A sequence of two super short reversals and one super short transposition that transforms $\pi = (1\ 3\ 4\ 2\ 5)$, with $r^{\pi} = (3, 5, 2, 1, 2, 8)$, into $\iota = (1\ 2\ 3\ 4\ 5)$, with $r^{\iota} = (3, 2, 6, 4, 5, 1)$. Intergenic regions are represented by rectangles, whose dimensions vary according to their sizes. The 1-reversal $\rho(5, 5, 2, 7)$ applied in a transforms $\pi$ into $\pi' = \pi$, and it cuts $r^{\pi}$ after position 2 at $r_5^{\pi}$ and after position 7 at $r_6^{\pi}$, resulting in $\ell(r_5^{\pi'}) = 9$, $\ell(r_6^{\pi'}) = 1$, and $r^{\pi'} = (3, 5, 2, 1, 9, 1)$. The 2-reversal $\rho(3, 4, 1, 5)$ applied in b transforms $\pi'$ into $\pi'' = (1\ 3\ 2\ 4\ 5)$, and it cuts $r^{\pi'}$ after position 1 at $r_3^{\pi'}$ and after position 5 at $r_5^{\pi'}$, resulting in $\ell(r_3^{\pi''}) = 6$, $\ell(r_5^{\pi''}) = 5$, and $r^{\pi''} = (3, 5, 6, 1, 5, 1)$. Finally, the 2-transposition $\tau(2, 3, 4, 0, 4, 1)$ applied in c transforms $\pi''$ into $\iota$, and it cuts $r^{\pi''}$ in position 0 at $r_2^{\pi''}$, after position 4 at $r_3^{\pi''}$, and after position 1 at $r_4^{\pi''}$, resulting in $\ell(r_2^{\iota}) = 2$, $\ell(r_3^{\iota}) = 6$, $\ell(r_4^{\iota}) = 4$, and $r^{\iota} = (3, 2, 6, 4, 5, 1)$, as shown in d

Given an instance $(\pi, r^{\pi})$, a pair of elements $(\pi_i, \pi_j)$ from $\pi$ is called an inversion if $\pi_i > \pi_j$ and $i < j$, with $\{i, j\} \subseteq [1..n]$. We denote the number of inversions in a permutation $\pi$ by $inv(\pi)$. For the example in Fig. 1a, pairs $(3, 2)$ and $(4, 2)$ are the only inversions, thus $inv(\pi) = 2$.

Given two instances $(\pi, r^{\pi})$ and $(\alpha, r^{\alpha})$ representing genomes $G_1$ and $G_2$ respectively such that $\pi$ and $\alpha$ have the same number of elements, $(\ell(r_i^{\pi}) - \ell(r_i^{\alpha}))$ is the imbalance between intergenic regions $r_i^{\pi}$ and $r_i^{\alpha}$, with $1 \le i \le m$, where $m = n + 1$ is the number of intergenic regions.

Given two instances $(\pi, r^{\pi})$ and $(\alpha, r^{\alpha})$ such that (i) $\pi$ and $\alpha$ have the same number of elements and (ii) $\sum_{i=1}^{m} \ell(r_i^{\pi}) = \sum_{i=1}^{m} \ell(r_i^{\alpha})$, let $\Delta_j(r^{\pi}, r^{\alpha}) = \sum_{i=1}^{j} (\ell(r_i^{\pi}) - \ell(r_i^{\alpha}))$ denote the cumulative sum of imbalances between intergenic regions of $\pi$ and $\alpha$ from positions 1 to $j$, with $1 \le j \le m$. Since $\sum_{i=1}^{m} \ell(r_i^{\pi}) = \sum_{i=1}^{m} \ell(r_i^{\alpha})$, we have that $\Delta_m(r^{\pi}, r^{\alpha}) = 0$.

From now on, we will consider that (i) the target permutation $\alpha$ is such that $\alpha = \iota$; (ii) $\pi$ and $\iota$ have the same number of elements; and (iii) $\sum_{i=1}^{m} \ell(r_i^{\pi}) = \sum_{i=1}^{m} \ell(r_i^{\alpha})$. By doing this, we can compute the distance of $\pi$, denoted by $d(\pi)$, that consists in finding the minimum number of super short operations that sorts $\pi$ and transforms $r^{\pi}$ into $r^{\iota}$.

Let $(\pi, r^{\pi})$ and $(\iota, r^{\iota})$ be two instances such that $\pi$ and $\iota$ have the same number of elements and $\sum_{i=1}^{m} \ell(r_i^{\pi}) = \sum_{i=1}^{m} \ell(r_i^{\iota})$. The intergenic graph, denoted by $I(\pi, r^{\pi}, r^{\iota}) = (V, E)$, is such that $V$ is composed of two sets of vertices: intergenic vertices (one for each $r_i^{\pi} \in r^{\pi}$), and permutation vertices (one for each $\pi_i$ of the extended permutation $\pi$). The set $E$ is composed of inversion edges: an edge $e = (r_i^{\pi}, r_{i+2}^{\pi}) \in E$ if there is a $j \ne i$ such that $(\pi_i, \pi_j)$ or $(\pi_j, \pi_{i+1})$ is an inversion, with $1 \le i \le n-1$ and $1 \le j \le n$.

We divide vertices of an intergenic graph $I(\pi, r^{\pi}, r^{\iota})$ into components. A component starts and ends with permutation vertices. Besides, the first component starts with the permutation vertex $\pi_0$, and the last component ends with the permutation vertex $\pi_{n+1}$. Consecutive components share exactly one permutation vertex, i.e., the last permutation vertex $\pi_i$ of a component is the first permutation vertex of its adjacent component to the right.

If a component $c$ starts with vertex $\pi_i$ and ends with vertex $\pi_j$, with $i < j$, then $r_k^{\pi} \in c$ for $i < k \le j$ and $\pi_k \in c$ for $i < k < j$. Besides, any two intergenic vertices that are connected to each other by an inversion edge must belong to the same component. Thus, if $c$ ends with $\pi_j$, then $e = (r_j^{\pi}, r_{j+2}^{\pi}) \notin E$.

The idea is that components break $(\pi, r^{\pi})$ according to $I(\pi, r^{\pi}, r^{\iota})$ into smaller pieces, where it is possible to make a local redistribution of intergenic regions and elements from $\pi$ (with no need to exchange them between components) transforming $(\pi, r^{\pi})$ into $(\iota, r^{\iota})$. This requires that any component $c$ starting with $\pi_i$ and ending with $\pi_j$ must have $\sum_{k=i+1}^{j} (\ell(r_k^{\pi}) - \ell(r_k^{\iota})) = 0$.

Formally, given an intergenic graph $I(\pi, r^{\pi}, r^{\iota})$, a component $c$ is a minimal set of vertices from $V$ in which: (i) any two intergenic vertices that are connected to each other by an inversion edge must belong to the same component, (ii) if $(r_i^{\pi}, r_j^{\pi}) \in E$, with $i < j$, then $\{\pi_{i-1}, \pi_j\} \subseteq c$ and for any $i < k < j$, $\{r_k^{\pi}, \pi_k\} \subseteq c$, and (iii) the sum of imbalances of its intergenic regions from $r^{\pi}$ with respect to $r^{\iota}$ is equal to zero, i.e. $\sum_{k \text{ s.t. } r_k^{\pi} \in c} (\ell(r_k^{\pi}) - \ell(r_k^{\iota})) = 0$.

A component with one intergenic vertex is called trivial, and is called non-trivial otherwise. The number of intergenic vertices in a component $c$ is denoted by $c_r$. A component $c$ is odd if $c_r$ is odd, and it is even otherwise. The number of components in an intergenic graph $I(\pi, r^{\pi}, r^{\iota})$ is denoted by $C(I(\pi, r^{\pi}, r^{\iota}))$, the number of odd components is denoted by $C_{odd}(I(\pi, r^{\pi}, r^{\iota}))$, and the number of even components is denoted by $C_{even}(I(\pi, r^{\pi}, r^{\iota}))$. Figure 2 shows three examples of intergenic graphs.

In the next two lemmas, we analyze the impact of applying super short operations on the number of components.

Lemma 1 Given an instance $(\pi, r^{\pi})$ and a target instance $(\iota, r^{\iota})$, let $(\pi', r^{\pi'})$ be the resulting instance after applying a 1-reversal. It follows that $C(I(\pi', r^{\pi'}, r^{\iota})) \le C(I(\pi, r^{\pi}, r^{\iota})) + 1$.

Proof Recall that a 1-reversal $\rho(i, i, x, y)$ is applied over intergenic regions $r_i^{\pi}$ and $r_{i+1}^{\pi}$, with $1 \le i \le n$. Besides, since 1-reversals neither create nor remove inversions from $\pi$, intergenic graphs $I(\pi', r^{\pi'}, r^{\iota}) = (V', E')$ and $I(\pi, r^{\pi}, r^{\iota}) = (V, E)$ satisfy $E' = E$.

If $r_i^{\pi} \in c$ and $r_{i+1}^{\pi} \notin c$, this 1-reversal is applied over two different components, which means that $r_i^{\pi}$ is the last intergenic region of $c$, so $\Delta_i(r^{\pi}, r^{\iota}) = 0$. If $x + y \ne \ell(r_i^{\pi})$, we have that $C(I(\pi', r^{\pi'}, r^{\iota})) = C(I(\pi, r^{\pi}, r^{\iota})) - 1$, as shown in Fig. 3a.

Consider now that $\{r_i^{\pi}, r_{i+1}^{\pi}\} \subseteq c$. If $i < n$ and $(r_i^{\pi'}, r_{i+2}^{\pi'}) \in E'$, or $i > 1$ and $(r_{i-1}^{\pi'}, r_{i+1}^{\pi'}) \in E'$, then $C(I(\pi', r^{\pi'}, r^{\iota})) = C(I(\pi, r^{\pi}, r^{\iota}))$. Otherwise, we have two cases to consider: $C(I(\pi', r^{\pi'}, r^{\iota})) = C(I(\pi, r^{\pi}, r^{\iota}))$, if $\Delta_i(r^{\pi'}, r^{\iota}) \ne 0$ (as shown in Fig. 3b); and $C(I(\pi', r^{\pi'}, r^{\iota})) = C(I(\pi, r^{\pi}, r^{\iota})) + 1$ if $\Delta_i(r^{\pi'}, r^{\iota}) = 0$ (as shown in Fig. 3c).
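To make the reversal and transposition definitions from the Definitions section concrete, the following minimal Python sketch applies them to unsigned instances and replays the three operations of Fig. 1. The function names `reversal` and `transposition` and the list-based representation (`pi` for the permutation, `r` for the intergenic sizes, with 1-based operation indices as in the text) are illustrative assumptions, not the authors' implementation; gene signs are ignored.

```python
def reversal(pi, r, i, j, x, y):
    """Apply rho(i, j, x, y) to an unsigned instance (pi, r); i, j are 1-based."""
    assert 1 <= i <= j <= len(pi)
    assert 0 <= x <= r[i - 1] and 0 <= y <= r[j]
    new_pi = pi[:i - 1] + pi[i - 1:j][::-1] + pi[j:]
    new_r = list(r)
    new_r[i:j] = r[i:j][::-1]                  # regions r_{i+1}..r_j change order
    new_r[i - 1] = x + y                       # new length of r_i
    new_r[j] = (r[i - 1] - x) + (r[j] - y)     # new length of r_{j+1}
    return new_pi, new_r


def transposition(pi, r, i, j, k, x, y, z):
    """Apply tau(i, j, k, x, y, z) to an unsigned instance (pi, r); 1-based indices."""
    assert 1 <= i < j < k <= len(pi) + 1
    assert 0 <= x <= r[i - 1] and 0 <= y <= r[j - 1] and 0 <= z <= r[k - 1]
    new_pi = pi[:i - 1] + pi[j - 1:k - 1] + pi[i - 1:j - 1] + pi[k - 1:]
    inner_a = r[i:j - 1]                       # r_{i+1}..r_{j-1}, travel with the first block
    inner_b = r[j:k - 1]                       # r_{j+1}..r_{k-1}, travel with the second block
    new_r = (r[:i - 1]
             + [x + r[j - 1] - y] + inner_b    # new r_i, then the second block's regions
             + [r[i - 1] - x + z] + inner_a    # new r_{i+k-j}, then the first block's regions
             + [r[k - 1] - z + y]              # new r_k
             + r[k:])
    return new_pi, new_r


if __name__ == "__main__":
    # Scenario of Fig. 1: two super short reversals, then one super short transposition.
    pi, r = [1, 3, 4, 2, 5], [3, 5, 2, 1, 2, 8]
    steps = [(reversal, (5, 5, 2, 7)),
             (reversal, (3, 4, 1, 5)),
             (transposition, (2, 3, 4, 0, 4, 1))]
    for op, args in steps:
        pi, r = op(pi, r, *args)
        print(op.__name__, args, "->", pi, r)
```

The last step yields the identity permutation with intergenic sizes (3, 2, 6, 4, 5, 1), which is the target instance of Fig. 1.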

Fig. 2 Intergenic graphs $I(\pi, r^{\pi}, r^{\iota})$, $I(\pi', r^{\pi'}, r^{\iota})$, and $I(\iota, r^{\iota}, r^{\iota})$, with $\pi = (3\ 1\ 2\ 4\ 5\ 7\ 6)$, $r^{\pi} = (15, 6, 4, 12, 8, 13, 9, 2)$, $\pi' = (1\ 3\ 2\ 4\ 5\ 7\ 6)$, $r^{\pi'} = (10, 6, 9, 12, 8, 13, 9, 2)$, $\iota = (1\ 2\ 3\ 4\ 5\ 6\ 7)$, and $r^{\iota} = (10, 15, 8, 7, 5, 9, 13, 2)$. Black squares represent intergenic vertices, and the numbers inside them indicate their sizes. Rounded rectangles in blue represent components. Note that in (a) there are three edges in $I(\pi, r^{\pi}, r^{\iota})$, $C(I(\pi, r^{\pi}, r^{\iota})) = 2$, and $C_{odd}(I(\pi, r^{\pi}, r^{\iota})) = 2$ since there are five intergenic vertices in $c_1$ and three intergenic vertices in $c_2$. We also have in (a) all values for $(\ell(r_i^{\pi}) - \ell(r_i^{\iota}))$ and $\Delta_i(r^{\pi}, r^{\iota})$, with $1 \le i \le 8$. The instance $(\pi', r^{\pi'})$ is the result of applying $\rho(1, 2, 8, 2)$ to $(\pi, r^{\pi})$. In (b) we can see that, compared to $I(\pi, r^{\pi}, r^{\iota})$, $I(\pi', r^{\pi'}, r^{\iota})$ has one more component, and $e_1$ was removed. In (c) we can see that when we reach the target instance $(\iota, r^{\iota})$ the number of components is equal to the number of intergenic regions in $\iota$ (i.e., $C(I(\iota, r^{\iota}, r^{\iota})) = n + 1 = 8$)
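The quantities used in Fig. 2 (inversion edges, cumulative imbalances, and components) can be computed directly from an instance. The sketch below is one possible reading of the definitions, not the authors' code: it assumes that a component ends after intergenic region $r_k$ exactly when $\Delta_k(r^{\pi}, r^{\iota}) = 0$ and no inversion edge $(r_i, r_{i+2})$ spans that boundary (that is, neither $i = k-1$ nor $i = k$ carries an edge). The function names are hypothetical, and the permutation is assumed unsigned.

```python
from itertools import accumulate


def inversion_count(pi):
    """Number of pairs (i, j), i < j, with pi[i] > pi[j]; quadratic for clarity."""
    n = len(pi)
    return sum(1 for a in range(n) for b in range(a + 1, n) if pi[a] > pi[b])


def inversion_edges(pi):
    """1-based indices i such that the edge (r_i, r_{i+2}) exists."""
    n = len(pi)
    edges = set()
    for i in range(1, n):  # i = 1..n-1
        # (pi_i, pi_j) is an inversion for some j > i
        right = any(pi[i - 1] > pi[j - 1] for j in range(i + 1, n + 1))
        # (pi_j, pi_{i+1}) is an inversion for some j < i
        left = any(pi[j - 1] > pi[i] for j in range(1, i))
        if right or left:
            edges.add(i)
    return edges


def components(pi, r_pi, r_iota):
    """Components as (first, last) 1-based ranges of intergenic positions."""
    m = len(r_pi)
    delta = list(accumulate(a - b for a, b in zip(r_pi, r_iota)))
    edges = inversion_edges(pi)
    comps, start = [], 1
    for k in range(1, m + 1):
        spanned = k in edges or (k - 1) in edges  # an edge crosses the cut after r_k
        if delta[k - 1] == 0 and not spanned:
            comps.append((start, k))
            start = k + 1
    return comps


if __name__ == "__main__":
    r_iota = [10, 15, 8, 7, 5, 9, 13, 2]
    print(components([3, 1, 2, 4, 5, 7, 6], [15, 6, 4, 12, 8, 13, 9, 2], r_iota))
    print(components([1, 3, 2, 4, 5, 7, 6], [10, 6, 9, 12, 8, 13, 9, 2], r_iota))
    print(len(components([1, 2, 3, 4, 5, 6, 7], r_iota, r_iota)))
```

On the three instances of Fig. 2 this reading gives two components (positions 1 to 5 and 6 to 8), three components, and eight components, in agreement with the caption.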

Fig. 3 Example of intergenic graphs for all possible values of $C(I(\pi', r^{\pi'}, r^{\iota}))$ with respect to $C(I(\pi, r^{\pi}, r^{\iota}))$ where $(\pi', r^{\pi'})$ is the resulting instance after applying a 1-reversal to $(\pi, r^{\pi})$. When the 1-reversal is applied over two components at the same time and $r_i^{\pi'} \ne r_i^{\pi}$, $C(I(\pi', r^{\pi'}, r^{\iota})) = C(I(\pi, r^{\pi}, r^{\iota})) - 1$, as shown in a. Otherwise, we have that $C(I(\pi', r^{\pi'}, r^{\iota})) = C(I(\pi, r^{\pi}, r^{\iota})) + k$, $k \in \{0, 1\}$, as shown in b and c

Lemma 2 Given an instance $(\pi, r^{\pi})$ and a target instance $(\iota, r^{\iota})$, let $(\pi', r^{\pi'})$ be the resulting instance after applying either a 2-reversal or a 2-transposition. It follows that $C(I(\pi', r^{\pi'}, r^{\iota})) \le C(I(\pi, r^{\pi}, r^{\iota})) + 2$.

Proof If a 2-reversal or 2-transposition is applied to intergenic regions of two different components in $I(\pi, r^{\pi}, r^{\iota})$, then we are necessarily creating a new inversion, and the graph $I(\pi', r^{\pi'}, r^{\iota})$ has either $C(I(\pi', r^{\pi'}, r^{\iota})) = C(I(\pi, r^{\pi}, r^{\iota})) - 2$ (as shown in Fig. 4a) or $C(I(\pi', r^{\pi'}, r^{\iota})) = C(I(\pi, r^{\pi}, r^{\iota})) - 1$ (as shown in Fig. 4b).

Consider now that this operation is applied to intergenic regions of a same component in $I(\pi, r^{\pi}, r^{\iota})$, and exchanges elements $\pi_i$ and $\pi_{i+1}$, with $1 \le i < n - 1$. If the intergenic graph $I(\pi', r^{\pi'}, r^{\iota}) = (V', E')$ has $(r_i^{\pi'}, r_{i+2}^{\pi'}) \in E'$, then $C(I(\pi', r^{\pi'}, r^{\iota})) = C(I(\pi, r^{\pi}, r^{\iota}))$. Otherwise, we have three cases to consider:

1. $C(I(\pi', r^{\pi'}, r^{\iota})) = C(I(\pi, r^{\pi}, r^{\iota}))$, if $\Delta_i(r^{\pi'}, r^{\iota}) \ne 0$ and $\Delta_{i+1}(r^{\pi'}, r^{\iota}) \ne 0$ (as shown in Fig. 4c);
2. $C(I(\pi', r^{\pi'}, r^{\iota})) = C(I(\pi, r^{\pi}, r^{\iota})) + 1$ if either $\Delta_i(r^{\pi'}, r^{\iota}) = 0$ or $\Delta_{i+1}(r^{\pi'}, r^{\iota}) = 0$ (as shown in Fig. 4d);
3. $C(I(\pi', r^{\pi'}, r^{\iota})) = C(I(\pi, r^{\pi}, r^{\iota})) + 2$ otherwise (as shown in Fig. 4e).

In the following sections, we will explore five different problems concerning super short operations but also considering intergenic regions, namely Sorting by Super Short Reversals (SbSSR), Sorting by Super Short Transpositions (SbSST), Sorting by Super Short Reversals and Super Short Transpositions (SbSSO), Sorting by Signed Super Short Reversals (SbSigSSR), and Sorting by Signed Super Short Reversals and Super Short Transpositions (SbSigSSO). Table 1 summarizes our results concerning general permutations (GP), permutations with $n\ell$ inversions for some $\ell \ge 1$ ($\ell$IP), permutations with at least $n$ inversions (1IP), and permutations with at least $2n$ inversions (2IP).

Sorting by Super Short Reversals

In this section, we analyze the version of the problem when only super short reversals (i.e., 1-reversals and 2-reversals) are allowed to transform $(\pi, r^{\pi})$ into $(\iota, r^{\iota})$. First, we state that if a non-trivial component $c$ of an intergenic graph $I(\pi, r^{\pi}, r^{\iota})$ has no edge (i.e., there is no inversion inside $c$), then it is always possible to split $c$ into two components with a 1-reversal.

Lemma 3 If a component $c$ of an intergenic graph $I(\pi, r^{\pi}, r^{\iota})$ with $c_r \ge 2$ contains no edge, then there is always a pair of consecutive intergenic regions to which we can apply a 1-reversal that splits $c$ into two components $c'$ and $c''$ such that $c'_r + c''_r = c_r$.

Fig. 4 Example of intergenic graphs for all possible values of $C(I(\pi', r^{\pi'}, r^{\iota}))$ with respect to $C(I(\pi, r^{\pi}, r^{\iota}))$ where $(\pi', r^{\pi'})$ is the resulting instance after applying a 2-reversal or a 2-transposition to $(\pi, r^{\pi})$. When the 2-reversal or 2-transposition is applied over two components at the same time, $C(I(\pi', r^{\pi'}, r^{\iota})) < C(I(\pi, r^{\pi}, r^{\iota}))$, as shown in a and b. Otherwise, we have that $C(I(\pi, r^{\pi}, r^{\iota})) \le C(I(\pi', r^{\pi'}, r^{\iota})) \le C(I(\pi, r^{\pi}, r^{\iota})) + 2$, as shown in c–e

Proof Let $p_i$ be the index in $r^{\pi}$ of the $i$-th intergenic region inside component $c$. The last intergenic region of $c$ is at position $p_{c_r}$. By definition of a component, and since $c$ contains no edge, for any $p_1 \le j < p_{c_r}$ we have that $\Delta_j(r^{\pi}, r^{\iota}) \ne 0$. Note that since $c_r > 1$ we have that $\Delta_{p_1}(r^{\pi}, r^{\iota}) = (\ell(r_{p_1}^{\pi}) - \ell(r_{p_1}^{\iota})) \ne 0$.

If $\Delta_{p_1}(r^{\pi}, r^{\iota}) > 0$, let $p_i = p_1$ and let $k$ be the index of the element from $\pi$ located right after $r_{p_1}^{\pi}$. Apply the reversal $\rho(k, k, \ell(r_{p_1}^{\iota}), 0)$.

Otherwise, we have that $\Delta_{p_1}(r^{\pi}, r^{\iota}) < 0$, and we need to find two intergenic regions $r_{p_i}^{\pi}$ and $r_{p_{i+1}}^{\pi}$, for $1 \le i < c_r$, such that $\Delta_{p_i}(r^{\pi}, r^{\iota}) < 0$ and $\Delta_{p_{i+1}}(r^{\pi}, r^{\iota}) \ge 0$. Since, by definition of a component, $\Delta_{p_{c_r}}(r^{\pi}, r^{\iota}) = 0$, such a pair always exists. Let $k$ be the index of the element from $\pi$ located right after $r_{p_i}^{\pi}$. Apply the reversal $\rho(k, k, \ell(r_{p_i}^{\pi}), -\Delta_{p_i}(r^{\pi}, r^{\iota}))$.

In both cases, the resulting instance has $\Delta_{p_i}(r^{\pi'}, r^{\iota}) = 0$ and, since the 1-reversal only moves nucleotides between $r_{p_i}^{\pi}$ and $r_{p_{i+1}}^{\pi}$, for any $i + 1 \le j \le c_r$ we have that $\Delta_{p_j}(r^{\pi'}, r^{\iota}) = \Delta_{p_j}(r^{\pi}, r^{\iota})$; thus, as before, all intergenic regions from $p_{i+1}$ to $p_{c_r}$ must belong to the same component.

This 1-reversal splits $c$ into two components: $c'$ with all intergenic regions in positions $p_1$ to $p_i$, and $c''$ with all intergenic regions in positions $p_{i+1}$ to $p_{c_r}$.
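The case analysis in the proof of Lemma 3 translates into a short procedure. The following sketch, under the assumption that an edge-free component occupies the consecutive intergenic positions `first..last` (1-based) and that the permutation is unsigned (so a 1-reversal leaves it unchanged), returns the parameters of the splitting 1-reversal. Both function names are hypothetical helpers introduced for illustration.

```python
from itertools import accumulate


def splitting_one_reversal(r_pi, r_iota, first, last):
    """Return (k, k, x, y) for the 1-reversal of Lemma 3 on an edge-free component."""
    delta = list(accumulate(a - b for a, b in zip(r_pi, r_iota)))

    def d(p):
        return delta[p - 1]  # cumulative imbalance up to position p (1-based)

    if d(first) > 0:
        # First region is too long: keep exactly its target length.
        return (first, first, r_iota[first - 1], 0)
    for p in range(first, last):
        if d(p) < 0 and d(p + 1) >= 0:
            # Pull the missing -d(p) nucleotides from the next region.
            return (p, p, r_pi[p - 1], -d(p))
    raise ValueError("positions do not form a non-trivial edge-free component")


def apply_one_reversal(r, k, x, y):
    """Effect of rho(k, k, x, y) on the intergenic sizes only."""
    new_r = list(r)
    new_r[k - 1] = x + y
    new_r[k] = (r[k - 1] - x) + (r[k] - y)
    return new_r


if __name__ == "__main__":
    r_pi, r_iota = [2, 1, 6], [3, 3, 3]   # pi = (1 2), one component of size 3
    op = splitting_one_reversal(r_pi, r_iota, 1, 3)
    print(op, apply_one_reversal(r_pi, op[0], op[2], op[3]))
```

In the small example the cumulative imbalances are (-1, -3, 0), so the reversal is applied at the second position and the resulting sizes (2, 4, 3) have cumulative imbalances (-1, 0, 0): the component is split into one of size two and one of size one.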

Let $\Delta_{odd}(r^{\pi}, r^{\iota}) = \sum_{i=1,\ i \bmod 2 = 1}^{n+1} (\ell(r_i^{\pi}) - \ell(r_i^{\iota}))$ denote the cumulative sum of imbalances of intergenic regions from $\pi$ and $\iota$ in odd positions only. Using Lemmas 1, 2 and 3, we show in the following two lemmas the minimum and maximum number of super short reversals needed to transform $\pi$ into $\iota$ and $r^{\pi}$ into $r^{\iota}$.

Lemma 4 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, $m$ the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$, and let $\phi^r = 0$ if $\Delta_{odd}(r^{\pi}, r^{\iota}) = 0$ and $\phi^r = 1$ otherwise. It follows that $d(\pi) \ge \max\{\frac{m - C(I(\pi, r^{\pi}, r^{\iota}))}{2}, inv(\pi) + \phi^r\}$.

Proof In order to sort $\pi$, we need to remove all inversions, and since a 2-reversal can remove only one inversion, we have that $d(\pi) \ge inv(\pi)$. Besides, since 2-reversals exchange material between intergenic regions of same parity only, then $d(\pi) \ge inv(\pi) + \phi^r$, with $\phi^r = 1$ if $\Delta_{odd}(r^{\pi}, r^{\iota}) \ne 0$ (in this case we will need at least one 1-reversal to exchange material between an intergenic region located at an odd position and an intergenic region located at an even position), and $\phi^r = 0$ otherwise.

On the other hand, by Lemmas 1 and 2, we can increase the number of components by at most two with a super short reversal, so to reach $m$ trivial components we need at least $\frac{m - C(I(\pi, r^{\pi}, r^{\iota}))}{2}$ super short reversals.

Lemma 5 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. We have that $d(\pi) \le inv(\pi) + m - C(I(\pi, r^{\pi}, r^{\iota}))$.

Proof While $\pi \ne \iota$, $\pi$ has at least one pair of consecutive elements $(\pi_i, \pi_{i+1})$ that is an inversion. Suppose that we first remove all inversions from $\pi$ using $inv(\pi)$ 2-reversals of type $\rho(i, i + 1, \ell(r_i^{\pi}), 0)$, i.e., without modifying its intergenic region lengths. Let $\pi'$ be the resulting permutation, which has $r^{\pi'} = r^{\pi}$. The number of components in $I(\pi', r^{\pi'}, r^{\iota})$ cannot be smaller than $C(I(\pi, r^{\pi}, r^{\iota}))$, since any 2-reversal removing an inversion is applied inside a same component. By Lemma 3, we can go from $C(I(\pi', r^{\pi'}, r^{\iota}))$ to $m$ components using $m - C(I(\pi', r^{\pi'}, r^{\iota}))$ 1-reversals, which results in no more than $m - C(I(\pi, r^{\pi}, r^{\iota}))$ 1-reversals.

Finally, using Lemmas 4 and 5, we prove that it is possible to obtain a 3-approximation for this problem.

Theorem 1 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. The value of $d(\pi)$ is 3-approximable.

Proof Let $k = C(I(\pi, r^{\pi}, r^{\iota}))$, and let $\phi^r = 0$ if $\Delta_{odd}(r^{\pi}, r^{\iota}) = 0$, or $\phi^r = 1$ otherwise. If $\frac{m-k}{2} \ge inv(\pi) + \phi^r$ then, by Lemma 4, $d(\pi) \ge \frac{m-k}{2}$, and, by Lemma 5, $d(\pi) \le m - k + inv(\pi) \le m - k + \frac{m-k}{2} \le 3\,\frac{m-k}{2}$. Otherwise, $\frac{m-k}{2} < inv(\pi) + \phi^r$, so $m - k < 2\,inv(\pi) + 2\phi^r$. By Lemma 4, $d(\pi) \ge inv(\pi) + \phi^r$, and, by Lemma 5, $d(\pi) \le m - k + inv(\pi) \le 2\,inv(\pi) + 2\phi^r + inv(\pi) \le 3\,inv(\pi) + 2\phi^r$.

Algorithm 1 describes a 3-approximation algorithm that transforms an instance $(\pi, r^{\pi})$ into $(\iota, r^{\iota})$ using super short reversals. Computing $inv(\pi)$ takes $O(n \log n)$ time and it takes up to $O(n^2)$ time to build $I(\pi, r^{\pi}, r^{\iota})$. Computing $\Delta_i(r^{\pi}, r^{\iota})$ and $C(I(\pi, r^{\pi}, r^{\iota}))$ takes $O(n)$ time, and it takes constant time to update them. The while loop in line 5 (resp. line 14) iterates up to $O(n^2)$ times, so the overall complexity of Algorithm 1 is $O(n^2)$.

Let $\delta_n$ denote the set of all permutations $\pi$ with $n$ elements, and let $\delta_{n,k}$ denote the set of all permutations $\pi \in \delta_n$ such that $inv(\pi) \le k$. For $n = 12$ there are 762,007 permutations in $\delta_{12,12}$, which corresponds to 0.16% of the $12!$ permutations from $\delta_{12}$, and for $n > 12$ the number of permutations in $\delta_{n,n}$ never corresponds to more than 0.05% of the $n!$ permutations from $\delta_n$ [17]. Besides, for $n > 18$ the number of permutations in $\delta_{n,2n}$ never exceeds 0.03% of the $n!$ permutations from $\delta_n$ [17].
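The count of permutations with few inversions quoted above can be reproduced with the classical recurrence for permutations by number of inversions: inserting the largest element into a permutation of size $s - 1$ adds between 0 and $s - 1$ inversions. The sketch below is an independent check, not taken from the paper or from reference [17]; the function name `count_up_to` is a hypothetical helper.

```python
from math import factorial


def count_up_to(n, k):
    """Number of permutations of n elements with at most k inversions."""
    row = [1] + [0] * k  # size 1: only the identity, with 0 inversions
    for size in range(2, n + 1):
        # row[j] becomes the number of permutations of `size` elements
        # with exactly j inversions, truncated at j <= k.
        row = [sum(row[j - t] for t in range(min(size - 1, j) + 1))
               for j in range(k + 1)]
    return sum(row)


if __name__ == "__main__":
    count = count_up_to(12, 12)
    share = 100 * count / factorial(12)
    print(count, f"{share:.2f}%")  # 762007 0.16%
```

The same function can be called with `k = n` or `k = 2 * n` for larger `n` to explore how quickly these fractions shrink.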

Algorithm 1 has a better approximation factor when the number of inversions is at least $n$, as explained in the following theorem.

Theorem 2 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. If $inv(\pi) \ge n$, Algorithm 1 has an approximation factor of $(1 + \frac{1}{\ell})$, where $\ell = \frac{inv(\pi)}{n} \ge 1$.

Proof Let $k = C(I(\pi, r^{\pi}, r^{\iota}))$, and let $\phi^r = 0$ if $\Delta_{odd}(r^{\pi}, r^{\iota}) = 0$ and $\phi^r = 1$ otherwise. Suppose now that $inv(\pi) = n\ell$ for some $\ell \ge 1$. Since $\frac{m-k}{2} < n$, by Lemma 4 we have that $d(\pi) \ge n\ell$. Algorithm 1 applies $n\ell$ 2-reversals and up to $m - k < n$ 1-reversals, which results in no more than $n\ell + n - 1 < n(\ell + 1)$ super short reversals.

Corollary 2.1 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. If $inv(\pi) \ge n$ (resp. $inv(\pi) \ge 2n$), Algorithm 1 has an approximation factor of at most 2 (resp. 1.5).

Sorting by Super Short Transpositions

In this section, we analyze the version of the problem when only super short transpositions are allowed. First, we investigate how 2-transpositions split non-trivial components from an intergenic graph $I(\pi, r^{\pi}, r^{\iota})$.

Lemma 6 If a component $c$ of an intergenic graph $I(\pi, r^{\pi}, r^{\iota})$ with $c_r > 2$ (resp. $c_r = 2$) has no edge, then we can apply two 2-transpositions that split $c$ into three components $c'$, $c''$, and $c'''$ such that $c'_r + c''_r + c'''_r = c_r$ (resp. two components $c'$ and $c''$ such that $c'_r = c''_r = 1$).

Proof Note that any 2-transposition will increase or decrease the number of inversions by one. By Lemma 2, a 2-transposition that removes an inversion can increase the number of components by at most two units, and a 2-transposition creating an inversion cannot increase the number of components. Since there is no inversion in $c$, for each 2-transposition removing an inversion from $c$ we have a 2-transposition creating that inversion before.

Now we explain how to increase the number of components by two units when $c_r \ge 3$. Let $p_i$ be the index in $r^{\pi}$ of the $i$-th intergenic region inside component $c$. If there is no intergenic region $r_j$ inside $c$ in which the cumulative sum is negative, apply $\tau(p_1, p_2, p_3, x, y, 0)$ in such a way that $x = \min\{\ell(r_{p_1}^{\iota}) + \ell(r_{p_2}^{\iota}), \ell(r_{p_1}^{\pi})\}$ and $y = \ell(r_{p_2}^{\pi}) + \ell(r_{p_1}^{\iota}) + \ell(r_{p_2}^{\iota}) - x$. Now apply $\tau(p_1, p_2, p_3, \ell(r_{p_1}^{\iota}), 0, 0)$. These two 2-transpositions split $c$ into three components: $c'$ with $r_{p_1}$, $c''$ with $r_{p_2}$, and $c'''$ with the remaining intergenic regions from $c$. Note that $c'$ and $c''$ are odd, and $c'''$ has the same parity as $c$.

Otherwise, we can find a pair of consecutive intergenic regions $r_{p_i}$ and $r_{p_{i+1}}$ inside $c$ such that $\Delta_{p_i}(r^{\pi}, r^{\iota}) < 0$ and $\Delta_{p_{i+1}}(r^{\pi}, r^{\iota}) \ge 0$, and since $\Delta_{p_{c_r}}(r^{\pi}, r^{\iota}) = 0$, such a pair always exists. If $c_r$ is even or if $c_r$ is odd but $p_i$ is even, apply $\tau(p_{i-1}, p_i, p_{i+1}, x, y, 0)$ such that $x = \ell(r_{p_{i-1}}^{\pi})$ and $y = \ell(r_{p_i}^{\pi}) + \Delta_{p_{i-1}}(r^{\pi}, r^{\iota})$, followed by $\tau(p_{i-1}, p_i, p_{i+1}, x', 0, y')$ such that $x' = \ell(r_{p_{i-1}}^{\iota})$ and $y' = \ell(r_{p_i}^{\iota})$.

If $c_r$ and $p_i$ are odd, apply $\tau(p_i, p_{i+1}, p_{i+2}, x, y, 0)$ such that $x = \ell(r_{p_i}^{\pi})$ and $y = \ell(r_{p_{i+1}}^{\pi}) + \Delta_{p_i}(r^{\pi}, r^{\iota})$, followed by $\tau(p_i, p_{i+1}, p_{i+2}, x', 0, y')$ such that $x' = \ell(r_{p_i}^{\iota})$ and $y' = \ell(r_{p_{i+1}}^{\iota})$. These two 2-transpositions split $c$ into three components, so if $c_r$ is even then we will end up with two odd components and one even component, and if $c_r$ is odd we will end up with three odd components due to the choice of the position defined above.

If $c_r = 2$ and $p_1 > 1$ we apply $\tau(p_1 - 1, p_1, p_2, x, y, 0)$ such that $x = \ell(r_{p_1 - 1}^{\pi})$ and $y = \ell(r_{p_1}^{\pi})$, followed by $\tau(p_1 - 1, p_1, p_2, x', 0, y')$ such that $x' = x$ and $y' = \ell(r_{p_1}^{\iota})$. If $c_r = 2$ and $p_1 = 1$ we apply $\tau(p_1, p_2, p_2 + 1, \ell(r_{p_1}^{\pi}), 0, 0)$ followed by $\tau(p_1, p_2, p_2 + 1, \ell(r_{p_1}^{\iota}), 0, 0)$. These two transpositions transform $c$ into two trivial components.

The following lemma gives the number of transpositions needed to transform a permutation $\pi$ and its intergenic regions $r^{\pi}$ into $\iota$ with its intergenic regions $r^{\iota}$ when $inv(\pi) = 0$.

Lemma 7 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. If $inv(\pi) = 0$, then $d(\pi) = m - C(I(\pi, r^{\pi}, r^{\iota})) + C_{even}(I(\pi, r^{\pi}, r^{\iota}))$.

Proof If a 2-transposition applied on a component $c$ of $I(\pi, r^{\pi}, r^{\iota})$ increases the number of components by two units, we can assume by the proof of Lemma 6 that it transforms $c$ into three components $c'$, $c''$, and $c'''$ such that two of them are odd components and the other has the same parity as $c_r$.

If $c$ is odd, and if we can always increase the number of components by two units, we end up with a component with only one intergenic region, but if $c$ is even, at some point we will have to increase the number of components by one unit, creating two odd components. This means that for each even component we need to apply two 2-transpositions that increase the number of components by one unit only. Since we can always apply pairs of transpositions that do not increase the number of even components, it follows that $d(\pi) = m - C(I(\pi, r^{\pi}, r^{\iota})) + C_{even}(I(\pi, r^{\pi}, r^{\iota}))$.

Lemmas 8 and 9 respectively show the lower and upper bounds for finding $d(\pi)$ using super short transpositions.

Lemma 8 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. It follows that $d(\pi) \geq \max\left\{\frac{m - C(I(\pi, r^{\pi}, r^{\iota})) + C_{even}(I(\pi, r^{\pi}, r^{\iota}))}{2}, \mathrm{inv}(\pi)\right\}$.

Proof In order to sort $\pi$ we need to remove all inversions, and since a 2-transposition can remove only one inversion, we necessarily have that $d(\pi) \geq \mathrm{inv}(\pi)$. Besides, by Lemma 2, we can increase the number of components by at most two with a super short transposition. Let $k = C(I(\pi, r^{\pi}, r^{\iota})) - C_{even}(I(\pi, r^{\pi}, r^{\iota}))$. To reach $m$ trivial components, and considering also Lemma 7, we need at least $\frac{m-k}{2}$ super short transpositions. Thus, $d(\pi) \geq \max\left\{\frac{m-k}{2}, \mathrm{inv}(\pi)\right\}$.

Lemma 9 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. We have that $d(\pi) \leq \mathrm{inv}(\pi) + m - C(I(\pi, r^{\pi}, r^{\iota})) + C_{even}(I(\pi, r^{\pi}, r^{\iota}))$.

Proof Suppose that we first remove all inversions of $\pi$ using $\mathrm{inv}(\pi)$ 2-transpositions of type $\tau(i, i + 1, i + 2, \ell(r^{\pi}_i), 0, 0)$, and let $\pi'$ be the resulting permutation. The value of $C(I(\pi', r^{\pi'}, r^{\iota}))$ cannot be smaller than $C(I(\pi, r^{\pi}, r^{\iota}))$ since any 2-transposition removing an inversion is applied inside a same component. Let $k = C(I(\pi, r^{\pi}, r^{\iota})) - C_{even}(I(\pi, r^{\pi}, r^{\iota}))$ and let $k' = C(I(\pi', r^{\pi'}, r^{\iota})) - C_{even}(I(\pi', r^{\pi'}, r^{\iota}))$.

Let us analyze the parity of any component that a 2-transposition breaks: (i) if it transforms an odd component into two, then one component must be odd; (ii) if it transforms an even component into two, then the two components are both odd or both even; (iii) if it transforms an even component into three, then two components must be odd; (iv) if it transforms an odd component into three, then either two components are even or the three components are odd. This means that $k' \geq k$.

By Lemma 7, we can go from $C(I(\pi', r^{\pi'}, r^{\iota}))$ to $m$ components using $m - k'$ 2-transpositions, which results, by the analysis above, in no more than $m - k$ 2-transpositions.

Finally, using Lemmas 8 and 9, we prove that it is possible to obtain a 3-approximation for this problem.

Theorem 3 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. The value of $d(\pi)$ is 3-approximable.

Proof Let $k = C(I(\pi, r^{\pi}, r^{\iota})) - C_{even}(I(\pi, r^{\pi}, r^{\iota}))$. If $\frac{m-k}{2} \geq \mathrm{inv}(\pi)$ then, by Lemma 8, $d(\pi) \geq \frac{m-k}{2}$, and, by Lemma 9, $d(\pi) \leq m - k + \mathrm{inv}(\pi) \leq m - k + \frac{m-k}{2} \leq 3\,\frac{m-k}{2}$.

Otherwise, $\frac{m-k}{2} < \mathrm{inv}(\pi)$, so $m - k < 2\,\mathrm{inv}(\pi)$. By Lemma 8, $d(\pi) \geq \mathrm{inv}(\pi)$, and, by Lemma 9, $d(\pi) \leq m - k + \mathrm{inv}(\pi) \leq 2\,\mathrm{inv}(\pi) + \mathrm{inv}(\pi) \leq 3\,\mathrm{inv}(\pi)$.

Algorithm 2 describes a 3-approximation algorithm that transforms an instance $(\pi, r^{\pi})$ into $(\iota, r^{\iota})$ using Super Short Transpositions. Similarly to Algorithm 1, Algorithm 2 has a time complexity of $O(n^2)$.

Algorithm 2 has a better approximation factor when the number of inversions is strictly greater than $n$, as stated in the following theorem.

Theorem 4 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. If $\mathrm{inv}(\pi) > n$, Algorithm 2 has an approximation factor of $(1 + \frac{1}{\ell})$, where $\ell = \frac{\mathrm{inv}(\pi)}{n} \geq 1$.

Proof Similar to proof of Theorem 2, given that $m - C(I(\pi, r^{\pi}, r^{\iota})) + C_{even}(I(\pi, r^{\pi}, r^{\iota})) \leq n + 1$.
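The bounds of Lemmas 8 and 9 are simple arithmetic once the component counts are known. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the values of $C(I(\pi, r^{\pi}, r^{\iota}))$ and $C_{even}(I(\pi, r^{\pi}, r^{\iota}))$ are assumed to be supplied by the caller, since building the intergenic graph is outside the scope of this sketch. The counts used in the example call are illustrative only.

```python
from math import ceil


def count_inversions(perm):
    """Count pairs i < j with perm[i] > perm[j] (plain O(n^2) scan)."""
    n = len(perm)
    return sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j])


def sst_bounds(perm, components, even_components):
    """Return (lower, upper) bounds on d(pi) in the sense of Lemmas 8 and 9.

    `components` and `even_components` stand for C(I) and C_even(I); they are
    assumed to have been computed elsewhere from the intergenic graph.
    """
    m = len(perm) + 1                    # number of intergenic regions
    k = components - even_components
    inv = count_inversions(perm)
    # d(pi) is an integer, so rounding (m - k) / 2 up keeps the bound valid.
    lower = max(ceil((m - k) / 2), inv)
    upper = inv + (m - k)
    return lower, upper


# Hypothetical instance: 3 elements, 4 intergenic regions, 2 odd components.
lower, upper = sst_bounds([2, 1, 3], components=2, even_components=0)
print(lower, upper, upper / lower)
```

By the argument in the proof of Theorem 3, the ratio `upper / lower` never exceeds 3 for consistent inputs.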

Corollary 4.1 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. If $\mathrm{inv}(\pi) \geq n$ (resp. $\mathrm{inv}(\pi) \geq 2n$) Algorithm 2 has an approximation factor of at most 2 (resp. 1.5).

Sorting by Super Short Reversals and Super Short Transpositions
In this section we analyze the version of the problem when both super short reversals and Super Short Transpositions are allowed to transform any $(\pi, r^{\pi})$ into $(\iota, r^{\iota})$.

Lemma 10 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. It follows that $d(\pi) \geq \max\left\{\frac{m - C(I(\pi, r^{\pi}, r^{\iota}))}{2}, \mathrm{inv}(\pi)\right\}$.

Proof Directly from Lemmas 2, 4, and 8.

Lemma 11 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. We have that $d(\pi) \leq \mathrm{inv}(\pi) + m - C(I(\pi, r^{\pi}, r^{\iota}))$.

Proof Suppose that first we remove all inversions of $\pi$ using $\mathrm{inv}(\pi)$ 2-reversals of type $\rho(i, i + 1, \ell(r^{\pi}_i), 0)$, and let $\pi'$ (resp. $r^{\pi'}$) be the resulting permutation (resp. intergenic regions). Let $k = C(I(\pi, r^{\pi}, r^{\iota}))$ and let $k' = C(I(\pi', r^{\pi'}, r^{\iota}))$. We have that $k' \geq k$, since 2-reversals removing inversions are always applied inside a same component.

Analogous to Lemma 9, and assuming that $k' = k + \ell$ for some $\ell \geq 0$, then $C_{even}(I(\pi', r^{\pi'}, r^{\iota})) \leq C_{even}(I(\pi, r^{\pi}, r^{\iota})) + \ell$. We use the procedure described in Lemma 9 on components $c$ with $c_r \geq 3$, applying two 2-transpositions that increase the number of components by two units. For components $c$ with $c_r = 2$, we apply a 1-reversal as described in Lemma 1, breaking them into two odd components.

The above procedure applies $\mathrm{inv}(\pi)$ 2-reversals, $n - k' - C_{even}(I(\pi', r^{\pi'}, r^{\iota}))$ 2-transpositions, and $C_{even}(I(\pi', r^{\pi'}, r^{\iota}))$ 1-reversals, which results in no more than $\mathrm{inv}(\pi) + m - C(I(\pi, r^{\pi}, r^{\iota}))$ operations.

Now we prove that it is possible to obtain a 3-approximation for this problem.

Theorem 5 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. The value of $d(\pi)$ is 3-approximable.

Proof Similar to proof of Theorem 1, using Lemmas 10 and 11.

Algorithm 3 describes a 3-approximation algorithm that transforms an instance $(\pi, r^{\pi}, r^{\iota})$ into $(\iota, r^{\iota}, r^{\iota})$ using both super short reversals and super short transpositions. Like Algorithms 1 and 2, it has a time complexity of $O(n^2)$.

As in the previous algorithms, Algorithm 3 has a better approximation factor when the number of inversions is at least $n$, as explained in the following theorem.

Theorem 6 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. If $\mathrm{inv}(\pi) \geq n$, Algorithm 3 has an approximation factor of $(1 + \frac{1}{\ell})$, where $\ell = \frac{\mathrm{inv}(\pi)}{n} \geq 1$.

Proof Analogous to proof of Theorem 2.

Corollary 6.1 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. If $\mathrm{inv}(\pi) \geq n$ (resp. $\mathrm{inv}(\pi) \geq 2n$) Algorithm 3 has an approximation factor of at most 2 (resp. 1.5).
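The guarantees stated in the theorems and corollaries of this paper (factor 3 in general and $1 + \frac{1}{\ell}$ when $\mathrm{inv}(\pi) \geq n$ for unsigned permutations; factor 5 and $1 + \frac{2}{\ell}$ for signed permutations) can be tabulated with a few lines of code. The helper below is only an illustrative sketch; its name and signature are assumptions and do not appear in the paper.

```python
def approximation_guarantee(n, inversions, signed):
    """Return the worst-case factor stated for an instance with the given size.

    n          -- number of elements of the permutation
    inversions -- inv(pi)
    signed     -- True for the signed problems (general factor 5),
                  False for the unsigned ones (general factor 3)
    """
    general = 5.0 if signed else 3.0
    if n <= 0 or inversions < n:
        return general                      # only the general factor applies
    ell = inversions / n                    # ell >= 1 in this branch
    refined = 1.0 + (2.0 if signed else 1.0) / ell
    return min(general, refined)


for inv in (50, 100, 200):
    print(inv, approximation_guarantee(100, inv, False), approximation_guarantee(100, inv, True))
```

For $\ell = 1$ and $\ell = 2$ this reproduces the factors 2 and 1.5 (unsigned) and 3 and 2 (signed) given by the corollaries.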

Sorting by Signed Super Short Reversals
In this section, we analyze the version of the problem when super short reversals are allowed to transform $(\pi, r^{\pi})$ into $(\iota, r^{\iota})$, where $\pi$ and $\iota$ are signed permutations.

Given a signed permutation $\pi$, let $S^{even-}_{\pi}$ be the set of elements from $\pi$ such that $||\pi_i| - i|$ is even and $\pi_i < 0$, and let $S^{odd+}_{\pi}$ be the set of elements from $\pi$ such that $||\pi_i| - i|$ is odd and $\pi_i > 0$. Sets $S^{even-}_{\pi}$ and $S^{odd+}_{\pi}$ capture the negative and positive elements from $\pi$ that end with negative signs after any sequence of 2-reversals that puts all elements in their correct positions (i.e., removes all inversions). Let $\varphi^{neg}$ be the number of elements in $S^{even-}_{\pi} \cup S^{odd+}_{\pi}$.

The following lemma, proved by Galvão et al. [8], gives the exact number of super short reversals needed to transform $\pi$ into $\iota$.

Lemma 12 Given a signed permutation $\pi$, $d(\pi) = \mathrm{inv}(\pi) + \varphi^{neg}$.

This lemma helps us to state the following lower bound for our problem.

Lemma 13 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. We have that $d(\pi) \geq \mathrm{inv}(\pi) + \max\{\varphi^{r}, \varphi^{neg}\}$.

Proof Directly from Lemmas 4 and 12.

The following lemma states an upper bound for this problem.

Lemma 14 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. We have that $d(\pi) \leq \mathrm{inv}(\pi) + \max\{\varphi^{r}, \varphi^{neg}\} + 2(m - C(I(\pi, r^{\pi}, r^{\iota})))$.

Proof Let $k = C(I(\pi, r^{\pi}, r^{\iota}))$ and let $\ell = \max\{\varphi^{r}, \varphi^{neg}\}$. Suppose that we first remove all inversions of $\pi$ using $\mathrm{inv}(\pi)$ 2-reversals of type $\rho(i, i + 1, \ell(r^{\pi}_i), 0)$, and let $\pi'$ (resp. $r^{\pi'}$) be the resulting permutation (resp. intergenic regions). Let $k' = C(I(\pi', r^{\pi'}, r^{\iota}))$. We have that $k' \geq k$. We apply $m - k' \leq m - k$ 1-reversals that split every non-trivial component from $I(\pi', r^{\pi'}, r^{\iota})$ into two components according to Lemma 3, and let $\pi''$ (resp. $r^{\pi''}$) be the resulting permutation (resp. intergenic regions).

At this point, we have a permutation $\pi''$ such that $r^{\pi''} = r^{\iota}$, and $\pi''$ has no more than $\ell + (m - k') \leq \ell + (m - k)$ negative elements. We just need to apply up to $\ell + (m - k')$ 1-reversals of type $\rho(i, i, \ell(r^{\pi''}_i), 0)$ (i.e., without modifying the length of its intergenic regions) to each negative element from $\pi''$, and the lemma follows.

Using Lemmas 13 and 14, we prove that the value of $d(\pi)$ is 5-approximable.

Theorem 7 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. The value of $d(\pi)$ is 5-approximable.

Proof Let $k = C(I(\pi, r^{\pi}, r^{\iota}))$, and let $\ell = \max\{\varphi^{r}, \varphi^{neg}\}$. If $\frac{m-k}{2} \geq \mathrm{inv}(\pi) + \ell$ then, by Lemma 12, $d(\pi) \geq \frac{m-k}{2}$, and, by Lemma 14, $d(\pi) \leq 2(m - k) + \mathrm{inv}(\pi) + \ell \leq 2(m - k) + \frac{m-k}{2} \leq 5\,\frac{m-k}{2}$.

Otherwise, $\frac{m-k}{2} < \mathrm{inv}(\pi) + \ell$, so $2(m - k) < 4(\mathrm{inv}(\pi) + \ell)$. By Lemma 12, $d(\pi) \geq \mathrm{inv}(\pi) + \ell$, and, by Lemma 14, $d(\pi) \leq 2(m - k) + \mathrm{inv}(\pi) + \ell \leq 4(\mathrm{inv}(\pi) + \ell) + \mathrm{inv}(\pi) + \ell \leq 5(\mathrm{inv}(\pi) + \ell)$.

Algorithm 4 describes a 5-approximation algorithm that transforms a signed instance $(\pi, r^{\pi}, r^{\iota})$ into $(\iota, r^{\iota}, r^{\iota})$ using Signed Super Short Reversals. As in previous algorithms, the time complexity of Algorithm 4 is $O(n^2)$.
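The quantity $\varphi^{neg}$ used in Lemma 12 depends only on the signed permutation, so it is easy to compute. The sketch below is an illustration under stated assumptions rather than the authors' code: function names are hypothetical, positions are 1-indexed as in the text, an element counts when $||\pi_i| - i|$ is even and $\pi_i < 0$ or when it is odd and $\pi_i > 0$, and intergenic regions are ignored entirely.

```python
def count_inversions(perm):
    """Count pairs i < j with |perm[i]| > |perm[j]| (signs are ignored)."""
    n = len(perm)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if abs(perm[i]) > abs(perm[j]))


def phi_neg(perm):
    """Number of elements that are negative once every inversion is removed."""
    count = 0
    for i, value in enumerate(perm, start=1):       # positions are 1-indexed
        even_displacement = abs(abs(value) - i) % 2 == 0
        if value < 0 and even_displacement:         # element of S^{even-}
            count += 1
        elif value > 0 and not even_displacement:   # element of S^{odd+}
            count += 1
    return count


def signed_ssr_distance(perm):
    """inv(pi) + phi^neg, the value given by Lemma 12 (no intergenic regions)."""
    return count_inversions(perm) + phi_neg(perm)


for perm in ([-1], [2, 1], [-2, -1]):
    print(perm, signed_ssr_distance(perm))
```

For instance, $(+2\ +1)$ needs one 2-reversal, which yields $(-1\ -2)$, followed by two 1-reversals, matching the value 3 returned by the sketch.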

As in the previous algorithms, Algorithm 4 has a better approximation factor when the number of inversions is at least $n$, as explained in the following theorem.

Theorem 8 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. If $\mathrm{inv}(\pi) \geq n$, Algorithm 4 has an approximation factor of $(1 + \frac{2}{\ell})$, where $\ell = \frac{\mathrm{inv}(\pi)}{n} \geq 1$.

Proof Let $k = C(I(\pi, r^{\pi}, r^{\iota}))$. Suppose now that $\mathrm{inv}(\pi) = n\ell$ for some $\ell \geq 1$. Since $\frac{m-k}{2} < n$, by Lemma 4 we have that $d(\pi) \geq n\ell$. Algorithm 4 applies $n\ell$ 2-reversals, up to $m - k < n$ 1-reversals, and up to $n$ 1-reversals to flip the sign of each negative element, which results in no more than $n\ell + n - 1 + n < n(\ell + 2)$ super short reversals.

Corollary 8.1 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. If $\mathrm{inv}(\pi) \geq n$ (resp. $\mathrm{inv}(\pi) \geq 2n$) Algorithm 4 has an approximation factor of at most 3 (resp. 2).

Sorting by Signed Super Short Reversals and Super Short Transpositions
In this section, we analyze the version of the problem when both super short reversals and Super Short Transpositions are allowed to sort signed permutations.

Let $H(\pi)$ be the inversion graph [18] of the signed permutation $\pi$, such that $V(H(\pi)) = \{\pi_1, \pi_2, \ldots, \pi_n\}$ and $E(H(\pi))$ is formed by pairs of elements from $\pi$ that are inversions. In $H(\pi)$, a component is defined as a maximal subgraph in which any two vertices are connected to each other by paths. A component from $H(\pi)$ is negative if it contains an odd number of negative elements (vertices), and it is positive otherwise.

Let $\varphi^{odd}$ be the number of negative components of $H(\pi)$. The following lemma, proved by Galvão et al. [8], gives the exact number of super short reversals and Super Short Transpositions needed to transform $\pi$ into $\iota$, which is a lower bound for our problem.

Lemma 15 Given a signed permutation $\pi$, $\mathrm{inv}(\pi) + \varphi^{odd}$ super short operations are required to transform $\pi$ into $\iota$.

Now we state in the following lemma an upper bound for this problem.

Lemma 16 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. We have that $d(\pi) \leq \mathrm{inv}(\pi) + \varphi^{odd} + 2(m - C(I(\pi, r^{\pi}, r^{\iota})))$.

Proof Suppose that we first remove all inversions of $\pi$ using the polynomial algorithm presented in [8], which uses $\mathrm{inv}(\pi) + \varphi^{odd}$ super short operations, such that all 2-reversals are of type $\rho(i, i + 1, \ell(r^{\pi}_i), 0)$, all 2-transpositions are of type $\tau(i, i + 1, i + 2, \ell(r^{\pi}_i), 0, 0)$, and all the $\varphi^{odd}$ 1-reversals are ignored (i.e., not applied), and let $\pi'$ be the resulting permutation.

The number of components of $I(\pi', r^{\pi'}, r^{\iota})$ cannot be smaller than $C(I(\pi, r^{\pi}, r^{\iota}))$, since the 2-reversals and 2-transpositions are applied inside a same component only. Let $k' = C(I(\pi', r^{\pi'}, r^{\iota})) \geq C(I(\pi, r^{\pi}, r^{\iota}))$.

By Lemma 3, we can go from $k'$ to $m$ components using $m - k'$ 1-reversals, which results in no more than $m - C(I(\pi, r^{\pi}, r^{\iota}))$ 1-reversals. After that, we will have a permutation $\pi''$ with up to $\min\{n, m - k' + \varphi^{odd}\}$ negative elements, so we can apply up to $\min\{n, m - k' + \varphi^{odd}\}$ 1-reversals of type $\rho(i, i, \ell(r^{\iota}_i), 0)$ to each negative element of $\pi''$.

Using Lemmas 15 and 16, we prove that it is possible to obtain a 5-approximation for this problem.

Theorem 9 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. The value of $d(\pi)$ is 5-approximable.

Proof Let $k = C(I(\pi, r^{\pi}, r^{\iota}))$, and let $\ell = \varphi^{odd}$. If $\frac{m-k}{2} \geq \mathrm{inv}(\pi) + \ell$ then, by Lemma 15, $d(\pi) \geq \frac{m-k}{2}$, and, by Lemma 16, $d(\pi) \leq 2(m - k) + \mathrm{inv}(\pi) + \ell \leq 2(m - k) + \frac{m-k}{2} \leq 5\,\frac{m-k}{2}$.

Otherwise, $\frac{m-k}{2} < \mathrm{inv}(\pi) + \ell$, so $2(m - k) < 4(\mathrm{inv}(\pi) + \ell)$. By Lemma 15, $d(\pi) \geq \mathrm{inv}(\pi) + \ell$, and, by Lemma 16, $d(\pi) \leq 2(m - k) + \mathrm{inv}(\pi) + \ell \leq 4(\mathrm{inv}(\pi) + \ell) + \mathrm{inv}(\pi) + \ell \leq 5(\mathrm{inv}(\pi) + \ell)$.

Algorithm 5 describes a 5-approximation algorithm that transforms a signed instance $(\pi, r^{\pi}, r^{\iota})$ into $(\iota, r^{\iota}, r^{\iota})$ using both signed super short reversals and Super Short Transpositions. Regarding the complexity, by previous algorithms we know that lines 1-3 and 6-17 take up to $O(n^2)$, and according to [8] the while loop in line 4 takes $O(n^3)$, which is then the time complexity of Algorithm 5.
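The value $\varphi^{odd}$ of Lemma 15 can be obtained by building the inversion graph $H(\pi)$ implicitly and counting the components that contain an odd number of negative elements. The following is a minimal sketch, assuming that an inversion is a pair of positions $i < j$ with $|\pi_i| > |\pi_j|$; the function name and the union-find bookkeeping are illustrative choices, not taken from the paper or from [8].

```python
def inversions_and_phi_odd(perm):
    """Return (inv(pi), phi^odd) for a signed permutation given as a list."""
    n = len(perm)
    parent = list(range(n))          # union-find forest over positions

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    inversions = 0
    for i in range(n):
        for j in range(i + 1, n):
            if abs(perm[i]) > abs(perm[j]):     # edge of H(pi)
                inversions += 1
                parent[find(i)] = find(j)

    # Count negative elements per component, then keep the odd counts.
    negatives = {}
    for i, value in enumerate(perm):
        root = find(i)
        negatives[root] = negatives.get(root, 0) + (1 if value < 0 else 0)
    phi_odd = sum(1 for count in negatives.values() if count % 2 == 1)
    return inversions, phi_odd


for perm in ([-1], [2, 1], [-2, 1], [-2, -1]):
    inv, odd = inversions_and_phi_odd(perm)
    print(perm, inv + odd)
```

The printed sum is $\mathrm{inv}(\pi) + \varphi^{odd}$, i.e. the number of super short operations given by Lemma 15 when intergenic regions are not taken into account.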

As for previous algorithms, Algorithm 5 also has a better approximation factor when the number of inversions is at least $n$. This is the purpose of the following theorem.

Theorem 10 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. If $\mathrm{inv}(\pi) \geq n$, Algorithm 5 has an approximation factor of $(1 + \frac{2}{\ell})$, where $\ell = \frac{\mathrm{inv}(\pi)}{n} \geq 1$.

Proof Let $k = C(I(\pi, r^{\pi}, r^{\iota}))$. Suppose now that $\mathrm{inv}(\pi) = n\ell$. Since $\frac{m-k}{2} < n$, by Lemma 4 we have that $d(\pi) \geq n\ell$. Algorithm 5 applies $n\ell$ operations between 2-reversals and 2-transpositions, up to $m - k < n$ 1-reversals, and up to $n$ 1-reversals to flip the sign of each negative element, which results in no more than $n\ell + n - 1 + n < n(\ell + 2)$ super short operations.

Corollary 10.1 Let $(\pi, r^{\pi})$ be an instance, $(\iota, r^{\iota})$ be the target instance, and let $m = n + 1$ be the number of intergenic regions in $r^{\pi}$ and $r^{\iota}$. If $\mathrm{inv}(\pi) \geq n$ (resp. $\mathrm{inv}(\pi) \geq 2n$) Algorithm 5 has an approximation factor of at most 3 (resp. 2).

Experimental tests
We implemented the five proposed algorithms and tested them using simulated permutations, in order to observe their performances. We generated two different permutation datasets, which we call fully-random instances (FRI) and almost random instances (ARI). Each dataset has 1,000,000 instances $(\pi, r^{\pi})$, where $\pi$ is a permutation with 100 elements and $r^{\pi}$ is a sequence of 101 intergenic region sizes.

The dataset FRI was generated in the following way: (i) let $(\iota, r^{\iota})$ be an initial instance, where $\iota$ has 100 elements and each $r^{\iota}_i$ receives a random integer $k \in [0..100]$; (ii) generate $(\pi, r^{\pi})$ by applying $w$ consecutive super short operations to $(\iota, r^{\iota})$, with randomly generated indices for both positions and intergenic sizes, always respecting the current values. We created 10,000 instances for each value of $w \in \{10, 20, 30, \ldots, 990, 1000\}$.

For Sorting by (Signed) Super Short Reversals we applied $0.8w$ 2-reversals and $0.2w$ 1-reversals, and at each step one of them was chosen at random while both were available. For Sorting by Super Short Transpositions we applied $w$ 2-transpositions. For Sorting by (Signed) Super Short Operations we applied $0.5w$ 2-transpositions, $0.4w$ 2-reversals, and $0.1w$ 1-reversals, and at each step one of them was chosen at random while more than one were available.

The dataset ARI was generated in a similar way as FRI, but when the algorithm had to apply either a 2-reversal or a 2-transposition we randomly chose a pair among all pairs of adjacent elements that were not an inversion. Since $w$ is smaller than $\frac{n(n-1)}{2}$, the maximum value of $\mathrm{inv}(\pi)$ over all permutations of $n$ elements, at least one such pair always exists.

Given any instance $(\pi, r^{\pi})$ from ARI created using $w$ SSOs, we know exactly how many inversions $(\pi, r^{\pi})$ has: it is the number of 2-reversals and 2-transpositions applied. The number of inversions on instances from FRI, however, is not known, but we can compute the expected number of inversions in a permutation with $n$ elements after $k$ random swaps (i.e., 2-reversals and 2-transpositions) applied to the identity permutation [19]:

$$E[i_{n,k}] = \frac{n(n+1)}{4} - \frac{1}{8(n+1)^2} \sum_{i,j=0}^{n} \frac{(c_j + c_i)^2}{s_j^2\, s_i^2}\, x_{ji}^{k},$$

where $c_a = \cos \alpha_a$, $s_a = \sin \alpha_a$, $x_{ab} = 1 - \frac{4}{n}(1 - c_a c_b)$, and $\alpha_a = \frac{(2a+1)\pi}{2n+2}$.

Figures 5 and 6 show the experimental results for instances of type FRI and ARI, respectively. We show the average distance returned for each algorithm described in this paper, plus the average and maximum approximation factors calculated based on the lower bound of each problem for each instance.
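The closed formula for $E[i_{n,k}]$ taken from [19] can be evaluated directly with a double loop. The sketch below is an illustration, not the authors' code; the function name is hypothetical. One assumption should be stated loudly: on small cases the formula, evaluated with $\alpha_a = \frac{(2a+1)\pi}{2n+2}$ and indices running from 0 to $n$, behaves as if $n$ were the number of adjacent pairs (a permutation of $n + 1$ elements), so the parameter below is documented that way.

```python
import math


def expected_inversions(n, k):
    """Evaluate E[i_{n,k}] from the closed formula quoted from [19].

    n -- number of adjacent pairs that can be swapped (assumed convention)
    k -- number of random adjacent swaps applied to the identity
    """
    alphas = [(2 * a + 1) * math.pi / (2 * n + 2) for a in range(n + 1)]
    c = [math.cos(alpha) for alpha in alphas]
    s = [math.sin(alpha) for alpha in alphas]   # never zero: 0 < alpha < pi
    total = 0.0
    for i in range(n + 1):
        for j in range(n + 1):
            x = 1.0 - (4.0 / n) * (1.0 - c[j] * c[i])
            total += (c[j] + c[i]) ** 2 / (s[j] ** 2 * s[i] ** 2) * x ** k
    return n * (n + 1) / 4.0 - total / (8.0 * (n + 1) ** 2)


for k in (0, 1, 2):
    print(k, round(expected_inversions(2, k), 6))
```

Two sanity checks follow from the definition: no swap leaves the identity untouched, and a single adjacent swap always creates exactly one inversion.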

Fig. 5 (a) Average returned distance, (b) maximum returned approximation, and (c) average returned approximation for instances in FRI. In (a), dotted curves represent the expected number of inversions, dashed curves represent the algorithms for signed permutations, and solid curves represent the algorithms for unsigned permutations. Colors relate problems having the expected number of inversions given by the dotted line of the same color. This means that SbSSR and SbSigSSR are expected to have inversions as in Inv. 1, SbSST is expected to have inversions as in Inv. 2, and SbSSO and SbSigSSO are expected to have inversions as in Inv. 3. All the approximations in (b) and (c) were calculated based on the lower bound of each problem.

In Figs. 5 and 6, Algorithm 1 is denoted by SbSSR, Algorithm 4 is denoted by SbSigSSR, Algorithm 2 is denoted by SbSST, Algorithm 3 is denoted by SbSSO, and Algorithm 5 is denoted by SbSigSSO. In Fig. 5, the curve Inv. 1 represents the expected number of inversions for SbSSR and SbSigSSR, the curve Inv. 2 represents the expected number of inversions for SbSST, and the curve Inv. 3 denotes the expected number of inversions for SbSSO and SbSigSSO. These three curves were generated using the formula $E[i_{n,k}]$ described above. In Fig. 6 the curves Inv. 1, 2, and 3 follow the same idea as the curves in Fig. 5, but instead of the expected number of inversions they represent the exact number of inversions.

As distance is directly related to the number of inversions, in Fig. 5a we see that, although in practice we have applied up to 1000 operations, the distance values were never greater than 300 on average: the average returned distances for each algorithm follow the trend of the dotted line that represents the expected number of inversions of those instances. Algorithms for signed permutations returned distances with a slightly higher value than the same algorithms for unsigned permutations, which is expected given that in addition to inversions and intergenic sizes, they also need to take care of elements with negative signs.

Concerning the approximation factors in Fig. 5b, c, it can be noted that despite the theoretical approximation factors of 3 and 5, the average approximation factors of instances in FRI were between 1 and 2.2. Furthermore, in our tests, no instance for SbSSR and SbSSO, whose theoretical approximation factors are 3, had approximation factor above 2.5, and no instance for SbSigSSR and SbSigSSO, whose theoretical approximation factors are 5, obtained approximation above 3.3 and 3, respectively.

In Fig. 6a we have another scenario where the returned distance follows the number of applied SSOs, but this behavior is due to our choice of applying only operations that do not destroy previously created inversions. One interesting thing about this figure is that distances returned by the algorithm for SSOs were very close to the number of inversions, especially when $w \geq 400$, something that did not happen on FRI. In Fig. 6b, c, we see that this dataset returned maximum and average approximations systematically better than for dataset FRI: no instance had an approximation factor greater than 2.5, and on average all algorithms have average approximation factors less than 1.7. For $w \geq 120$ (resp. $w \geq 240$), where we expect to have around $n$ (resp. $2n$) inversions, none of the instances had approximation factor above 2 (resp. 1.5), as we expected given Theorems 2, 4, 6, 8, and 10.

Fig. 6 (a) Average returned distance, (b) maximum returned approximation, and (c) average returned approximation for instances in ARI. In (a), dotted curves represent the exact number of inversions, dashed curves represent the algorithms for signed permutations, and solid curves represent the algorithms for unsigned permutations. Colors relate problems having the number of inversions given by the dotted line of the same color. This means that SbSSR and SbSigSSR have inversions as in Inv. 1, SbSST has inversions as in Inv. 2, and SbSSO and SbSigSSO have inversions as in Inv. 3. All the approximations in (b) and (c) were calculated based on the lower bound of each problem.

Table 1 Summary of the approximation factor of the approximation algorithms presented in this manuscript for general permutations (GP), permutations with $n\ell$ inversions for some $\ell \geq 1$ ($\ell$IP), permutations with at least $n$ inversions (1IP), and permutations with at least $2n$ inversions (2IP)

Sorting problem    GP    $\ell$IP              1IP    2IP
SbSSR              3     $1 + \frac{1}{\ell}$   2      1.5
SbSST              3     $1 + \frac{1}{\ell}$   2      1.5
SbSSO              3     $1 + \frac{1}{\ell}$   2      1.5
SbSigSSR           5     $1 + \frac{2}{\ell}$   3      2
SbSigSSO           5     $1 + \frac{2}{\ell}$   3      2

Conclusion
In this paper, we analyzed the minimum number of super short reversals and/or Super Short Transpositions needed to sort a signed or unsigned permutation $\pi$ and, at the same time, transform its intergenic regions lengths $r^{\pi}$ according to $r^{\iota}$.

We defined some bounds and a graph structure that allowed us to build five algorithms (one for each considered problem) that guarantee approximation factors of 3 for unsigned permutations (using either SSRs, SSTs, or both) and 5 for signed permutations (using either SSRs or SSOs). These algorithms have better approximation factors for instances for which the number of inversions is at least $n$ or $2n$. In the former case, it is equal to 2 for unsigned permutations (using either SSRs, SSTs, or both), and to 3 for signed permutations (using SSRs or SSOs); in the latter case it is equal to 1.5 for unsigned permutations, and to 2 for signed permutations. All of these algorithms were tested in simulated instances, showing that, on average, they behave better than their theoretical approximation factors predict.

Some questions remain open. For instance, what is the computational complexity of each of these five problems? Besides, how can we incorporate indels (insertions and deletions) of intergenic regions to these problems, to be able to compare two genomes that share the same set of genes but may differ on their total intergenic regions length?

Acknowledgements
This work was supported by the National Council for Scientific and Technological Development-CNPq (Grants 400487/2016-0, 425340/2016-3, and 140466/2018-5), the São Paulo Research Foundation-FAPESP (Grants 2013/08293-7, 2015/11937-9, 2017/12646-3, and 2017/16246-0), the Brazilian Federal Agency for the Support and Evaluation of Graduate Education-CAPES, and the CAPES/COFECUB program (Grant 831/15). We also thank the anonymous reviewers for their helpful suggestions.

Authors' contributions
First draft: ARO. Proofs: ARO, GJ, and GF. Experiments: ARO, UD, and ZD. Final manuscript: ARO, GJ, GF, UD, and ZD. All authors read and approved the final manuscript.

Competing interests
The authors declare that they have no competing interests.

Author details
1 Institute of Computing, University of Campinas, Campinas, Brazil. 2 LS2N, UMR CNRS 6004, University of Nantes, Nantes, France. 3 School of Technology, University of Campinas, Limeira, Brazil.

Received: 16 February 2019 Accepted: 14 October 2019

References
1. Bafna V, Pevzner PA. Sorting by transpositions. SIAM J Discrete Math. 1998;11(2):224–40. https://doi.org/10.1137/S089548019528280X.
2. Kececioglu JD, Sankoff D. Exact and approximation algorithms for sorting by reversals, with application to genome rearrangement. Algorithmica. 1995;13:180–210. https://doi.org/10.1007/BF01188586.
3. Yancopoulos S, Attie O, Friedberg R. Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics. 2005;21(16):3340–6. https://doi.org/10.1093/bioinformatics/bti535.
4. Hannenhalli S, Pevzner PA. Transforming men into mice (polynomial algorithm for genomic distance problem). In: Proceedings of the 36th annual symposium on foundations of computer science (FOCS'1995). Washington, DC: IEEE Computer Society Press; 1995. p. 581–92. https://doi.org/10.1109/SFCS.1995.492588.
5. Elias I, Hartman T. A 1.375-approximation algorithm for sorting by transpositions. IEEE/ACM Trans Comput Biol Bioinform. 2006;3(4):369–79. https://doi.org/10.1109/TCBB.2006.44.
6. Fertin G, Labarre A, Rusu I, Tannier E, Vialette S. Combinatorics of genome rearrangements. Computational molecular biology. London: The MIT Press; 2009.
7. Chen T, Skiena SS. Sorting with fixed-length reversals. Discrete Appl Math. 1996;71(1–3):269–95. https://doi.org/10.1016/S0166-218X(96)00069-8.
8. Galvão GR, Lee O, Dias Z. Sorting signed permutations by short operations. Algor Mol Biol. 2015;10:12. https://doi.org/10.1186/s13015-015-0040-x.
9. Lefebvre J-F, El-Mabrouk N, Tillier ERM, Sankoff D. Detection and validation of single gene inversions. Bioinformatics. 2003;19(1):190–6. https://doi.org/10.1093/bioinformatics/btg1025.
10. Dalevi DA, Eriksen N, Eriksson K, Andersson SGE. Measuring genome divergence in bacteria: a case study using Chlamydian Data. J Mol Evol. 2002;55(1):24–36. https://doi.org/10.1007/s00239-001-0087-9.
11. Seoighe C, Federspiel N, Jones T, Hansen N, Bivolarovic V, Surzycki R, Tamse R, Komp C, Huizar L, Davis RW, Scherer S, Tait E, Shaw DJ, Harris D, Murphy L, Oliver K, Taylor K, Rajandream M-A, Barrell BG, Wolfe KH. Prevalence of small inversions in yeast gene order evolution. Proc Natl Acad Sci. 2000;97(26):14433–7. https://doi.org/10.1073/pnas.240462997.
12. McLysaght A, Seoighe C, Wolfe KH. High frequency of inversions during gene order evolution. In: Sankoff D, Nadeau JH, editors. Comparative genomics: empirical and analytical approaches to gene order dynamics, map alignment and the evolution of gene families. New York: Springer; 2000. p. 47–58. https://doi.org/10.1007/978-94-011-4309-7_6.
13. Biller P, Guéguen L, Knibbe C, Tannier E. Breaking good: accounting for fragility of genomic regions in rearrangement distance estimation. Genome Biol Evol. 2016;8(5):1427–39. https://doi.org/10.1093/gbe/evw083.
14. Biller P, Knibbe C, Beslon G, Tannier E. Comparative genomics on artificial life. In: Beckmann A, Bienvenu L, Jonoska N, editors. Pursuit of the universal lecture notes in computer science. Cham: Springer International Publishing; 2016. p. 35–44. https://doi.org/10.1007/978-3-319-40189-8_4.
15. Fertin G, Jean G, Tannier E. Algorithms for computing the double cut and join distance on both gene order and intergenic sizes. Algor Mol Biol. 2017;12:16. https://doi.org/10.1186/s13015-017-0107-y.
16. Bulteau L, Fertin G, Tannier E. Genome rearrangements with indels in intergenes restrict the scenario space. BMC Bioinform. 2016;17(S14):225–31. https://doi.org/10.1186/s12859-016-1264-6.
17. Knuth DE. The art of computer programming, Volume 3: Sorting and searching. Reading: Addison-Wesley Publishing Company; 1998.
18. Rotem D, Urrutia J. Circular permutation graphs. Networks. 1982;12(4):429–37. https://doi.org/10.1002/net.3230120407.
19. Bousquet-Melou M. The expected number of inversions after n adjacent transpositions. Discrete Math Theor Comput Sci. 2010;12(2):65–88.

Publisher's Note
Springer remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
