Doerr et al. Algorithms Mol Biol (2017) 12:14 DOI 10.1186/s13015-017-0106-z Algorithms for Molecular Biology

RESEARCH Open Access The family‑free median of three Daniel Doerr1,2* , Metin Balaban1, Pedro Feijão2 and Cedric Chauve3

Abstract Background: The gene family-free framework for comparative genomics aims at providing methods for gene order analysis that do not require prior gene family assignment, but work directly on a sequence similarity graph. We study two problems related to the breakpoint median of three genomes, which asks for the construction of a fourth genome that minimizes the sum of breakpoint distances to the input genomes. Methods: We present a model for constructing a median of three genomes in this family-free setting, based on maximizing an objective function that generalizes the classical breakpoint distance by integrating sequence similarity in the score of a gene adjacency. We study its computational complexity and we describe an integer linear program (ILP) for its exact solution. We further discuss a related problem called family-free adjacencies for k genomes for the special case of k ≤ 3 and present an ILP for its solution. However, for this problem, the computation of exact solutions remains intractable for sufciently large instances. We then proceed to describe a heuristic method, FFAdj-AM, which performs well in practice. Results: The developed methods compute accurate positional orthologs for genomes comparable in size of bacte- rial genomes on simulated data and genomic data acquired from the OMA orthology database. In particular, FFAdj- AM performs equally or better when compared to the well-established gene family prediction tool MultiMSOAR. Conclusions: We study the computational complexity of a new family-free model and present algorithms for its solution. With FFAdj-AM, we propose an appealing alternative to established tools for identifying higher confdence positional orthologs. Keywords: Family-free genome comparison, Positional orthology, Breakpoint median

Background neighborhoods to the input gene orders. Comparing Te presented work relates to the branch of research gene orders of distinct species presupposes knowledge that studies the structural organization of genomes of positional- (sometimes also called main-) ortholo- across species. Genome structures are subject to change gies between their constituting . Tis is where our caused by large-scale . Such mutations per- approach difers from previous work: Whereas tradi- mute the order or alter the composition of functional, tionally genes are required to form equivalence classes inheritable entities, subsequently called genes, in genome across gene orders such that each genome contains one sequences. Te breakpoint median constitutes a fam- and only one member of each class, our model only ily of well-studied problems that mainly difer through assumes a symmetric and refexive similarity measure. varying karyotypic constraints [1]. A general, uncon- Te tasks of forming one-to-one relationships between strained variant asks to construct a fourth gene order, genes (i.e. computing a matching) and fnding a median called a median, composed of one or more linear or cir- are then combined into a single objective. Our approach cular chromosomes, from three given gene orders, such has the decisive advantage of solving what was formerly a that this median maximizes the sum of conserved gene circularity problem: a median provides valuable insights into positional conservation, yet knowledge of posi- tional orthologies are already a prerequisite of traditional *Correspondence: [email protected]‑bielefeld.de 2 Faculty of Technology and Center for Biotechnology (CeBiTec), breakpoint median problems. Resolving this antilogy, our Bielefeld University, Universitätsstr. 25, 33615 Bielefeld, Germany approach continues a research program outlined in [2] Full list of author information is available at the end of the article

© The Author(s) 2017. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/ publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Doerr et al. Algorithms Mol Biol (2017) 12:14 Page 2 of 14

(see also [3]) under the name of (gene) family-free gene algorithm. However, further methods have been pro- order comparison. So far, family-free methods have been posed that integrate synteny and sequence conserva- developed for the pairwise comparison of genomes [4–6] tion for inferring orthogroups, see [10]. Our approach and shown to be efective for orthology analysis [7]. difers frst and foremost in its family-free principle (all Te prediction of evolutionary relationships between other methods require a prior gene family assignment). genomic sequences is a long-standing problem in com- Compared to MultiMSOAR, the only other method that putational biology. According to Fitch [8], two genomic can handle more than two genomes with an optimiza- sequences are called homologous if they descended from tion criterion that considers gene order , both a common ancestral sequence. Furthermore, Fitch iden- MultiMSOAR (for three genomes) and FF-Median aim tifes diferent events that give rise to a branching point at computing a maximum weight tripartite matching. in the phylogeny of homologous sequences, leading to However we difer fundamentally from MultiMSOAR the well-established concepts of orthologous genes (who by the full integration of sequence and synteny conserva- descend from their last common ancestor through a spe- tion into the objective function, while MultiMSOAR pro- ciation) and paralogous genes (descending from their ceeds frst by computing pairwise orthology assignments last common ancestor through a duplication) [9]. Until to defne a multipartite graph. quite recently, orthology and paralogy relationships were mostly inferred from sequence similarity. However it is The gene family‑free median of three now well accepted that the syntenic context can carry The family‑free principle valuable evolutionary information, which has lead to the In the gene family-free framework, we are given all- notion of positional orthologs [10], which are orthologs against-all gene similarities through a symmetric and R whose syntenic context was not changed in a duplication refexive similarity measure σ : � × � → ≥0 over the event. universe of genes [2]. We use sequence similarity but Most methods for detecting potential orthologous other similarity measures can ft the previous defni- groups require a prior clustering of the genes of the con- tion. Tis leads to the formalization of the gene similarity sidered genomes into homologous gene families, defned graph [2], i.e. a graph where each vertex corresponds to a as groups of genes assumed to originate from a single gene of the dataset and where each pair of vertices asso- ancestral gene. Yet clustering sequences into ciated with genes of distinct genomes are connected by a families is already in itself a difcult problem. In the pre- strictly positively weighted edge according to gene simi- sent work, we describe two methods to infer likely posi- larity measure σ. Ten gene family or assign- tional orthologies for a group of three genomes. Te frst ments represent a particular subgroup of gene similarity method solves a new problem we introduce, the gene functions that require transitivity. Independent of the family-free median of three. It generalizes the traditional particular similarity measure σ, relations between genes breakpoint median problem [1]. Our second method imposed by σ are considered as candidates for homology makes use of the frst exact algorithm that solves the assignments. problem family-free adjacencies for k genomes (FF-Adja- cencies) that has been introduced by Braga et al. in [2], Extant genomes, genes and adjacencies for the special case where k ≤ 3. We then discuss the In this work, a genome G is entirely represented by a methods’ abilities to solve the biological question at hand tuple G ≡ (C, A), where C denotes a non-empty set of and study their computational complexity. We show that unique genes, and A is a set of adjacencies. Genes are t h our approach can be used for positional ortholog predic- represented by their extremities, i.e., a gene g ≡ (g , g ) , h t tion in simulated and real data sets of bacterial genomes. g ∈ C, consists of a head g and a tail g . are modeled explicitly, as special genes of C(G) with a single a b Related problems extremity, denoted by “◦”. Extremities g1 , g2, a, b ∈{h, t} a b Te FF-Median problem relates to previously studied of any two genes g1, g2 form an adjacency {g1 , g2 } if they gene order evolution problems. It is a generalization of are immediate neighbors in their genome sequence. In the tractable mixed multichromosomal median problem the following, we will conveniently use the notation C(G) introduced in [1], that can indeed be defned as an FF- and A(G) to denote the set of genes and the set of adja- Median problem with a similarity graph composed of cencies of genome G, respectively. We indicate the pres- a b disjoint 3-cliques and edges having all the same weight. ence of an adjacency {x1, x2} in an extant genome X by Te FF-Median problem also bears similarity with prob- 1 if { a, b}∈A I a, b = x1 x2 (X) lem FF-Adjacencies described in [2] as well as methods X (x1 x2) 0 otherwise. (1) aimed at detecting groups of orthologous genes based on gene order evolution, especially the MultiMSOAR [11] Doerr et al. Algorithms Mol Biol (2017) 12:14 Page 3 of 14

a a b b (M)= s(m1,πX (m1) ,m2,πX (m2) ), Given two genomes G and H and gene similarity measure F a b m1 ,m2 (M) X G,H,I , a b a b { }∈A a∈{ }b σ {g , g }∈A(G) {h , h }∈A(H) πX (m1) ,πX (m2) (X) , two adjacencies, 1 2 and 1 2 { }∈A (4) with a, b ∈{h, t} are conserved if σ(g1, h1)>0 and σ(g2, h2)>0. We subsequently defne the adjacency score where a, b ∈{h, t} and s(·) is the adjacency score as a b c d of any four extremities g , h , i , j , where a, b, c, d ∈{h, t} defned by Eq. (2). and g, h, i, j ∈ as the geometric mean of their corre- sponding gene similarities [2]: Remark 1 Te adjacency score for a median adja- a, b a b c d cency {m1 m2} with respect to the corresponding s(g , h , i , j ) ≡ σ(g, h) · σ(i, j) (2) a b potential extant adjacency {πX (m1) , πX (m2) }, where a b A Median genome, genes and adjacencies Informally, the {m1, m2}∈ (M) and X ∈{G, H, I}, can be entirely family-free median problem asks for a fourth genome expressed in terms of pairwise similarities between genes M that maximizes the sum of pairwise adjacency scores of extant genomes using Eq. (3): to three given extant genomes G, H, and I. In doing so, s ma, m a, mb, m b the gene content of the requested median M must frst ( 1 πX ( 1) 2 πX ( 2) ) m ∈ C(M) be defned: each gene must be unambiguously = 6 σ (πY (m ), πZ (m )) · σ (πY (m ), πZ (m )) ∈ C( ) 1 1 2 2 associated with a triple of extant genes (g, h, i), g G , {Y ,Z}⊂{G,H,I}  h ∈ C(H), and i ∈ C(I). Moreover, we want to associate to a median gene m a sequence similarity score (g, h, i) In the following, a median gene m and its extant coun- relative to its extant genes g, h, and i. As the sequence of terparts (g, h, i) are treated as equivalent. We denote the the median gene is obviously not available, we defne this set of all candidate median genes by score as the geometric mean of their pairwise similarities Σ = (g, h, i) g (G),h (H),i (I):σ(g, h) σ(g, i) σ(h, i) > 0 . { | ∈C ∈C ∈C · · } (5) (see Fig. 1a): (g1,h1,i1), (g2,h2,i2) Σ σ(g, m) = σ(h, m) = σ(i, m) ≡ 3 σ(g, h) · σ(g, i) · σ(h, i) Each pair of median genes ∈ and extremities a, b ∈{h, t} give rise to a candi- (3) a a a b b b date median adjacency {(g1 , h1, i1), (g2 , h2, i2)} if ≡ a a a b b b a a a b b b In the following we make use of mapping πG(m) g, (g1 , h1, i1) �= (g2 , h2, i2), and (g1 , h1, i1) and (g2 , h2, i2) ≡ ≡ πH (m) h, and πI (m) i to relate gene m with its extant are non-conficting. We denote the set of all candidate counterparts. Two candidate median genes or telomeres median adjacencies and the set of all conserved (i.e. pre- �= m1 and m2 are conficting if m1 m2 and the intersection sent in at least one extant genome) candidate median a b {π (m1), π (m1), π (m1)} = m ,m m1,m2 Σ ,a,b h, t between associated gene sets G H I adjacencies by A {{ 1 2}| ∈ ∈{ }} C a b (π (m )a,π (m )b) 1 {πG(m2), πH (m2), πI (m2)} = m1,m2 X G,H,I IX X 1 X 2 and is non-empty (see Fig. 1b and A {{ }∈A | ∈{ } ≥ } , for example). A set of candidate median genes or tel- respectively. omeres C is called confict-free if no two of its mem- , ∈ C bers m1 m2 are conficting. Tis defnition trivially Remark 2 A median gene can only belong to a median extends to the notion of a confict-free median. adjacency with non-zero adjacency score if all pairwise similarities of its corresponding extant genes g, h, i are Problem 1 (FF-Median) Given three genomes G, H, non-zero. Tus, the search for median genes can be limited and I, and gene similarity measure σ, fnd a confict-free to 3-cliques (triangles) in the tripartite similarity graph. median M, which maximizes the following formula: Remark 3 Te right-hand side of the above formula for the weight of an adjacency is independent of genome X. a b From Eq. (4), an adjacency in median M has only an impact g1 g2 g3 g4 in a solution to problem FF-Median if it participates in a i G h3 gene adjacency in at least one extant genome. So includ- ? ing in a median genome median genes that do not belong σ(g, i) σ(h, i) H C m h1 h2 to a candidate median adjacency in A do not increase the ? ? objective function. g h I σ(g, h) i1 i2 i3 Fig. 1 a Illustration of the score of a candidate median gene. b Gene Accounting for gene family evolution similarity graph of three genomes G, H, and I. Colored components Duplication and loss are two important phenomena of gene indicate candidate median genes m1 = (g1, h1, i2), m2 = (g2, h2, i1), family evolution that afect the gene order. Figure 2 visu- m g , h , i m g , h , i m , m 3 = ( 3 3 2), and 4 = ( 4 3 3). Median gene pairs 1 3 and alizes the outcome of a duplication of a gene belonging to m3, m4 are conficting gene family a as well as a deletion of a gene from gene family Doerr et al. Algorithms Mol Biol (2017) 12:14 Page 4 of 14

e G

H

I abac df

duplication deletion Fig. 2 The efect of duplication and deletion of a single gene in problem FF-Median. Colored arcs correspond to potential median adjacencies e. Both events occurred along the evolutionary path from genes that are similar to each other. In this section, we genome M leading to I. Such efects of gene family evolu- review a more fexible model where the constructed tion on the gene order must be accounted for in gene order matching also includes smaller components: analysis. Yet, they can only be detected once the gene fami- k lies are inferred. Consequently, family-free methods must Defnition 1 (partial -matching) Given a gene simi- provide internal mechanisms for their resolution. Problem larity graph B = (G1, ..., Gk , E), a partial k-matching FF-Median meets this ambitious demand to some extend. M ⊆ E is a subset of edges such that for each connected M For instance, the true ancestral gene order “a b c” of the component C in BM ≡ (G1, ..., Gk , ) no two genes in example visualized in Fig. 2 will be recovered by solving C belong to the same genome. problem FF-Median as long as the cumulative score of the adjacency between a and b (yellow arcs), which is conserved A partial 3-matching M ⊆ E in gene similarity graph in all three extant genomes, plus the score of the twofold B = (G, H, I, E) of genomes G, H, and I induces subge- conserved adjacency between b and c (red arcs) is larger nomes GM ⊆ G, HM ⊆ H, and IM ⊆ I with gene sets than the cumulative score of the onefold conserved adjacen- C(GM), C(HM), and C(IM), respectively, corresponding cies b, a (blue arc) and a, c (green arc) of genome I. In other to the set of vertices incident to edges of matching M. ′ cases where immediate neighborhoods of true positional In doing so, a subgenome X ⊂ X may contain adjacen- A a b homologs are less conserved, problem FF-Median likely fails cies that are not part of (X): two gene extremities x1, x2 a b A ′ A to obtain the correct ancestral gene order. Even worse, it is form an adjacency {x1, x2}∈ (X ) �⊆ (X) if all genes generally afected by gene deletion events, such as the one that lie in between x1 and x2 in genome X are not con- ′ shown in the example on the right side of Fig. 2. tained in C(X ). In the following, we discuss a related problem called We then aim to fnd a partial 3-matching that maxi- family-free adjacencies, initially introduced by Braga mizes a linear combination of a sum of conserved adja- et al. [2], that can tolerate the efects of both gene dupli- cencies and a sum of similarities between the matched cations and losses. genes:

Family‑free adjacencies for three genomes Problem 2 (family-free adjacencies for three genomes In the previous section we introduced problem FF- (FF-Adjacencies) [2]) Given a gene similarity graph Median that asks for the construction of a median from B = (G, H, I, E) and some α with 0 ≤ α ≤ 1, fnd a par- three extant genome sequences. In doing so, the median tial 3-matching M ⊆ E that maximizes the following corresponds to a 3-(partite) matching between extant formula:

a a b b Fα(M) = α · s(x1, y1, x2, y2) + (1 − α) · σ(x, y) , , ∈M {x1, y1}, { x2, y2}∈M (x y) (6) a b a b {x1, x2}, {y1, y2}∈AM Doerr et al. Algorithms Mol Biol (2017) 12:14 Page 5 of 14

A A where M =∪X∈{G,H,I} (XM). computing exact solutions for the comparison of bacte- Problem FF-Adjacencies accounts for gene duplications rial-sized genomes in the next section. We present FF- and losses, as well as perturbations in the assessment of Median, an integer linear program (ILP), for the solution gene similarities by (i) considering conserved adjacen- of the correspondent problem. In order to speed up the cies between genes that are not immediate neighbors computation in practice, we additionally present algo- but lie two, three, or more genes apart, (ii) relaxing the rithm ICF-SEG that detects local optimal structures that 3-matching to a partial 3-matching, and (iii) maximiz- commonly appear when comparing genomes of reason- ing similarities between matched genes. Te set of con- ably close species. nected components that satisfy the matching constraint Further, we present ILP FFAdj-3G for the solution of form subcomponents of cliques of size three in the gene problem FF-Adjacencies. However, the problem’s supe- similarity graph of extant genomes G, H, and I. Figure 3 rior capability (compared to problem FF-Median) of visualizes the seven possible subcomponents permitted resolving events of gene family evolution comes at the by a partial 3-matching. Te matching implies orthol- expense of a dramatically increased search space. Tak- ogy assignments between genes conserved in at least two ing adjacencies between genes into consideration that extant genomes. Because of (iii) and unlike in problem are further apart leads to an explosion of conficting FF-Median, connected components are not bound to conserved adjacencies. Tis number is then potentized participate in conserved adjacencies. Tus, problem FF- by the number of possible subcomponents in a partial Adjacencies can also infer orthology assignments that are 3-matching, making the computation of solutions even unsupported by synteny. more challenging. Tus, it is impossible to calculate In the next two sections we describe our theoretical exact solutions to problem FF-Adjacencies with pro- results: a study of computational complexity for prob- gram FFAdj-3G for average-sized bacterial genomes lems FF-Median and FF-Adjacencies, two methods to in reasonable runtime. Addressing problem FF-Adja- compute their exact solutions, and a heuristic that con- cencies in pairwise comparisons, Doerr proposed in [3] structs feasible, but possibly suboptimal solutions to FF- an efective method to identify optimal substructures in Adjacencies based on solutions to problem FF-Median. practical instances, allowing the computation of exact solutions for bacterial-sized genomes. As of the time of Complexity results writing, the search for similar structures in the case of three genomes has been unsuccessful. Terefore, we pro- Teorem 1 Problem FF-Median is MAX SNP-hard. pose an alternative, practically motivated method, called FFAdj-AM, which frst computes a solution to problem We describe the full hardness proof in Additional fle 1: FF-Median, then treating the matching implied by the Section 1. It is based on a reduction from the Maximum obtained median as invariant in the search for a (possibly Independent Set for Graphs of Bounded Degree 3. Also, suboptimal) solution to problem FF-Adjacencies. (Note problem FF-Adjacencies has proven NP-hard: Kowada that every solution to FF-Median is a feasible solution et al. showed that already for the case of pairwise com- to problem FF-Adjacencies.) More precisely, FFAdj-AM parisons and uniform similarity scores the problem calls frst program FF-Median on a given gene simi- = , , , becomes intractable [6]. larity graph B (G H I E) and subsequently treats its In the past decades, numerous problems in the feld output as a partial, feasible solution for problem FF-Adja- of computational biology have been shown NP-hard, yet cencies. Ten it executes program FFAdj-3G to improve the hope of computing fast solutions has not diminished on this solution by exploring the subgraph of B that is not for all. In fact, many instances of such problems arising contained in the initially computed family-free median. in practical applications are less complex and hence can Tis approach turns out to be feasible in practice. We be algorithmically solved rather fast. We are therefore show this in our evaluation by computing exact solutions also concerned about the practical computability of the on a biological dataset composed of 15 γ-proteobacterial problems at hand. In doing so, we devise methods for genomes. Algorithmic results An exact ILP algorithm to problem FF‑Median G We now present program FF-Median, described in H Fig. 4, that exploits the specifc properties of problem 5 FF-Median to design an ILP using O(n ) variables and I statements. Program FF-Median makes use of two a b Fig. 3 The seven valid types of components of a partial 3-matching types of binary variables and as declared in domain Doerr et al. Algorithms Mol Biol (2017) 12:14 Page 6 of 14

Objective: Maximize

a b a b a b a a b b b(g1 ,g2,h1 ,h2,i1 ,i2) s(m1 ,πX (m1) ,m2,πX (m2) ) C X G,H,I , m1,m2 , ∈{ } { }∈A π (m )a,π (m )b (X) m1 =(g1,h1,i1), { X 1 X 2 }∈A m2 =(g2,h2,i2), a,b h, t ∈{ } Constraints: (C.01) g (G): a(g ,h,i) 1 ∀ ∈C ≤ (g ,h,i) Σ ∈ h (H): a(g, h ,i) 1 ∀ ∈C ≤ (g,h ,i) Σ ∈ i (I): a(g, h, i ) 1 ∀ ∈C ≤ (g,h,i ) Σ ∈

(C.02) (g ,h ,i ), (g ,h ,i ) C and a, b h, t : ∀{ 1 1 1 2 2 2 }∈A ∀ ∈{ } 2 b(ga,gb,ha,hb ,ia,ib ) a(g ,h ,i )+a(g ,h ,i ) · 1 2 1 2 1 2 ≤ 1 1 1 2 2 2 (C.03) (g ,h ,i ) Σ and a h, t : ∀ 1 1 1 ∈ ∀ ∈{ } b(ga,gb,ha,hb ,ia,ib ) 1 1 2 1 2 1 2 ≤ (g ,h ,i ) Σ ,b h, t 2 2 2 ∈ ∈{ } Domains: (D.01) (g, h, i) Σ : a(g,h, i) 0, 1 ∀ ∈ ∈{ } (D.02) (g ,h ,i ), (g ,h ,i ) C and a, b h, t : ∀{ 1 1 1 2 2 2 }∈A ∀ ∈{ } b(ga,gb,ha,hb ,ia,ib ) 0, 1 1 2 1 2 1 2 ∈{ } Fig. 4 Program FF-Median, an ILP for solving problem FF-Median

specifcations (D.01) and (D.02), that defnes the set Remark 4 Te output of the algorithm FF-Median is Σ of median genesC and of candidate conserved median a set of adjacencies between median genes that defne a adjacencies A (Remark 3). Te former variable type set of linear and/or circular orders, called CARs (Contig- indicates the presence or absence of candidate genes in uous Ancestral Regions), where linear segments are not b an optimal median M. Te latter, variable type , specifes capped by telomeres. So formally the computed median if an adjacency between two gene extremities or telom- might not be a valid genome. However,C as adding adja- eres is established in M. Constraint (C.01) ensures that cencies that do not belong to A do not modify the score M is confict-free, by demanding that each extant gene of a given median, a set of median adjacencies can always (or ) can be associated with at most one median be completed into a valid genome by such adjacencies gene (or telomere). Further, constraint (C.02) dictates that join the linear segments together and add telomeres. that a median adjacency can only be established between Tese extra adjacencies would not be supported by any genes that both are part of the median. Lastly, constraint extant genome and thus can be considered as dubious,

(C.03) guarantees that each gene extremity and telomere and in our implementation, we only return the medianC of the median participates in at most one adjacency. adjacencies computed by the ILP, i.e. a subset of A .

Property 1 Te size (i.e. number of variables and state- Remark 5 Following Remark 2, preprocessing the input ments) of any ILP returned by program is extant genomes requires to handle the extant genes that 5 FF-Median limited by O(n ) where n = max(|C(G)|, |C(H)|, |C(I)|). do not belong to at least one 3-clique in the similarity Doerr et al. Algorithms Mol Biol (2017) 12:14 Page 7 of 14

graph. Such genes can not be part of any median. So one confict is a confict between two genes from S; an exter- could decide to leave them in the input, and the ILP can nal confict is a confict between a gene from S and a handle them and ensures they are never part of the out- candidate median gene not in S. We say that S is contigu- S put solution. However, discarding them from the extant ous in extant genome X if the set πX ( ) forms a unique, genomes can help recover adjacencies that have been dis- contiguous, segment in X. We say that S is an internal rupted by the insertion of a mobile element for example, confict-free segment (IC-free segment) if it contains no so in our implementation we follow this approach. internal confict and is contiguous in all three extant genomes; this can be seen as the family-free equivalent of As discussed at the end of the previous section, the FF- the notion of common interval in permutations [12]. An Median problem is a generalization of the mixed multi- IC-free segment is a run if the order of the extant genes is chromosomal breakpoint median problem [1]. Tannier conserved in all three extant genomes, up to a full rever- et al. presented in [1] an approach for its solution based sal of the segment. on a Maximum-Weight Matching (MWM) algorithm. Intuitively, one can fnd an optimal solution to the sub- Tis motivates the results presented in the next para- instance defned by an IC-free segment, but it might not graph that also use a MWM algorithm to identify optimal be part of an optimal median for the whole instance due median substructures by focusing on confict-free sets of to side efects of the rest of the instance. So we need to median genes. adapt the graph to which we apply an MWM algorithm to account for such side efects. To do so, we defne the Finding local optimal segments potential of a candidate median gene m as

Tannier et al. [1] solve the mixed multichromosomal a b a b ∆(m)= max w( m1,m )+w( m ,m2 ) . ma,mb , ma,mb { } { } breakpoint median problem by transforming it into an { 1 } { 2}∈A MWM problem, that we outline now. A graph is defned S in which each extremity of a candidate median gene and We then extend graph Ŵ( ) =: (V , E) to graph ′ S ′ each telomere gives rise to a vertex. Any two vertices are Ŵ ( ) := (V , E ) by adding edges between the extremi- connected by an edge, weighted according to the number ties of each candidate median gene of an IC-free seg- S ′ h t S of observed adjacencies between the two gene extremities ment , i.e. E = E ∪ {{m , m }|m ∈ } (note that when S h t S in extant genomes. Edges corresponding to adjacencies | | > 1, w({m , m }) = 0 since is contiguous in all between a gene extremity and telomeres are weighted three extant genomes). In the following we refer to these only by half as much. An MWM in this graph induces a edges as confict edges. Let C(m) be the set of candidate set of adjacencies that defnes an optimal median. median genes that are involved in an (external) confict S We frst describe how this approach applies to our with a given candidate median gene m of , then the h t ′ problem. We defne a graph Γ(Σ ) constructed from an confict edge {m , m }∈E is weighted by the maximum FF-Median instance (G, H, I, σ) that is similar to that of potential of a non-conficting subset of C(m), Tannier et al. deviating by defning vertices as candidate w′({mh, mt }) = median gene extremities and weighting an edge between a b max({ �(m′) | C′ ⊆ C(m) : C′ is conflict-free}) . two vertices m1, m2, a, b ∈{h, t}, by m′∈C′ a b a b w({m , m }) = IX (πX (m1) , πX (m2) ) ′ 1 2 A confict-free matching in Ŵ (S) is a matching without a X∈{G,H,I} a a b b confict edge. ·s(m1, πX (m1) , m2, πX (m2) ). (7) We make frst the following observation, where a con- Given an internal confict-free segment S, any Lemma 1 ′ fict-free matching is a matching that does not contain maximum weight matching in graph Ŵ (S) that is confict- two conficting vertices (candidate median genes): free defnes a set of median genes and adjacencies that belong to at least one optimal FF-Median of the whole Observation 1 Any confict-free matching in graph instance. Γ(Σ ) of maximum weight defnes an optimal median. S Proof Given an IC-free segment ={m1, ..., mk } of We show now that we can defne notions of sub- an FF-Median instance (G, H, I, σ). Let M be a confict- ′ S instances—of a full FF-Median instance—that contains free matching in graph Ŵ ( ). Because M is confict- S no internal conficts, for which applying the MWM can free and contiguous in all three extant genomes, M allow to detect if the set of median genes defning the must contain all candidate median genes of S. Now, let ′ S C ′ sub-instance is part of at least one optimal FF-Median. M be a median such that �⊆ (M ). Further, let C(m) Let S be a set of candidate median genes. An internal be the set of candidate median genes that are involved Doerr et al. Algorithms Mol Biol (2017) 12:14 Page 8 of 14

in a confict with with a given median gene m of S and for three genomes G, H, and I, given their gene similarity C ′ S , , , X = (M ) ∩ ( m∈S C(m) ∪ ). Clearly, X �= ∅ and graph B = (G H I E). (X) (X) ( ) for the contribution must hold F ≥F S , Te objective of the integer linear program is to maxi- ′ F otherwise M is not optimal since it is straightforward mize a linear combination of the sum of adjacency scores to construct a median higher score which includes S. of pairs of matched genes and the sum of similarities of Clearly, the contribution F(X) to the median is bounded matched genes. To evaluate the former sum, program max( m C ∆(m ) C C(m): C is conflict-free )+ ( ) by { ∈ | ⊆ } F S . iterates over the sets of candidate adjacences, FFadj-3G ′ S A⋆ ′ A But since gives rise to a confict-free match- defned as (X) ≡∪X ⊆X (X ) over all subgenomes ′′ ′ ing with maximum score, also median M X ⊆ X of a given genome X. C ′′ = C ′ \ ∪ C S with (M ) ( (M ) X) ( ) and FFAdj-3G makes use of three types of binary vari- A ′′ = A ′ \ A ∪ A c, d e (M ) ( (M ) (X)) (S)) must be an (optimal) ables , and (see domains (D.01) (D.03)). Vari- c − median. ables (x, y) indicate if edge {x, y} in gene similarity graph B is part of the anticipated matching M. Likewise, each d C C C Lemma 1 leads to a procedure (Fig. 5) that iteratively variable (x), x ∈ (G) ∪ (H) ∪ (I), encodes if ver- identifes and tests IC-free segments in the FF-Median tex x in gene similarity graph B is potentially incident M e a, a, b, b instance. For each identifed IC-free segment S an adja- to an edge in . Lastly, variables (x1 y1 x2 y2) indi- ′ a, b, a, b , h, t cency graph Ŵ (S) is constructed and a maximum weight cate if gene extremities x1 x2 y1 y2, with a b ∈{ } of M matching is computed (Line 2–3). If the resulting match- the -induced subgenomes XM and YM can possibly a, b A ing is confict-free (Line 4), adjacencies of IC-free seg- form conserved adjacencies, i.e., {x1 x2}∈ (XM) and a, b A ment S are reported and S is removed from an FF-Median {y1 y2}∈ (YM). instance by masking its internal adjacencies and remov- Constraints (C.01) and (C.02) ensure that the result- M ing all candidate median genes (and consequently their ing matching forms a valid partial 3-matching. Tat M associated candidate median adjacencies) corresponding is, no two genes of a connected component in the to external conficts (Line 5–6). It then follows immedi- -induced subgraph of gene similarity graph B belong to ately from Lemma 1 that the set median genes returned the same genome (see Defnition 1). In doing so, (C.01) by Fig. 5 belongs to at least one optimal solution to the establishes pairwise matching constraints, i.e., it guaran- FF-Median problem. tees that in the matching-induced subgraph, each gene In the experiments, IC-free runs are used instead of is connected to at most one gene per genome. Note that d segments. Step 1 is performed efciently by frst identi- variables are assigned 1 for each gene that is incident M fying maximal IC-free runs, then breaking it down into to at least one edge of partial 3-matching . Tat is, the b smaller runs whenever the condition in Step 4 is not value of a variable can be 1 even though its correspond- M satisfed. ing gene is not incident to an edge of . But then, pro- gram FFAdj-3G permits a gene to be incident to several Solving problem FF‑Adjacencies for three genomes edges of M, if each of these edges is incident to genes of We now describe program FFAdj-3G, as shown in Fig. 6. diferent genomes. Additional constraints are enforced It returns an exact solution to problem FF-Adjacencies by (C.02) on every pair of edges that share a common

Input: FF-Median instance (G, H, I,σ) Output: Setof adjacencies AdjM that is partofamedian M of (G, H, I,σ). 1: while thereexistsanunobserved IC-free conserved segment S in (G, H, I,σ) do 2: Construct adjacencygraph Γ (S)of S 3: Findmaximumweight matching E(Γ (S)) M⊆ 4: if A(S)= then M 5: Add A(S)to Adj M 6: Remove S including external conflictsfrom(G, H, I,σ) 7: endif 8: end while Fig. 5 Algorithm ICF-SEG Doerr et al. Algorithms Mol Biol (2017) 12:14 Page 9 of 14

Objective: Maximize

α s(xa,ya,xb ,yb) e(xa,ya,xb ,yb)+(1 α) σ(x, y) c(x, y) · 1 1 2 2 · 1 1 2 2 − · · X,Y G,H,I xa,xb (X), x (X), { }⊂{ } { 1 2}∈A y∈C (Y ) ya,yb (Y ) ∈C { 1 2}∈A Constraints: (C.01) X, Y G, H, I , x (X), ∀{ }⊂{ } ∀ ∈C c(x, y) d(x) ≤ y (Y ) ∈C (C.02) g (G), h (H), i (I), ∀ ∈C ∀ ∈C ∀ ∈C if σ(g, h) > 0 and σ(g,i) > 0, c(h , i)+c(g, h)+c(g, i) 2, c(h, i )+c(g, h)+c(g, i) 2 ≤ ≤ h (H) i (I) ∈C ∈C h =h i =i if σ(g, h) > 0 and σ(h, i) > 0, c(g ,i)+c(g, h)+c(h, i) 2, c(g, i )+c(g, h)+c(h, i) 2 ≤ ≤ g (G) i (I) ∈C ∈C g =g i =i if σ(g, i) > 0 and σ(h, i) > 0, c(g ,h)+c(g, i)+c(h, i) 2, c(g, h )+c(g, i)+c(h, i) 2 ≤ ≤ g (G) h (H) ∈C ∈C g =g h =h (C.03) X, Y G, H, I , ∀{ }⊂{ } xa,xb (X), ya,yb (Y ) ∀{ 1 2}∈A ∀{ 1 2}∈A c(xa,yb)+c(xa,yb) 2 e(xa,ya,xb ,yb) 0, 1 1 2 2 − · 1 1 2 2 ≥ if xa,xb (X), then for each x in x ,...,x (X) { 1 2 { 3 n}⊆A such that xa,xa3 , xb3 ,xa4 , ..., xbn ,xb (X), {{ 1 3 } { 3 4 } { n 2}} ⊆A d(x)+e(xa,ya,xb ,yb) 1, 1 1 2 2 ≤ if ya,yb (Y ), then foreach y in y ,...,y (Y ) { 1 2 { 3 n}⊆A such that ya,ya3 , yb3 ,ya4 , ..., ybn ,yb (Y ), {{ 1 3 } { 3 4 } { n 2}} ⊆A d(y)+e(xa,ya,xb ,yb) 1 1 1 2 2 ≤ Domains: (D.01) x (X)and y (X), X, Y G, H, I , c(x, y) 0, 1 ∀ ∈C ∈C { }⊂{ } ∈{ } (D.02) x (G) (H) (I), d(x) 0, 1 , ∀ ∈C ∪C ∪C ∈{ } (D.03) xa,xb (X) and ya,yb (Y ), X, Y G, H, I , ∀{ 1 2}∈A ∀{ 1 2}∈A { }⊂{ } e(xa,ya,xb ,yb) 0, 1 1 1 2 2 ∈{ } Fig. 6 Program FFAdj-3G, an ILP for solving FF-Adjacencies for three genomes gene in one genome, but are incident to genes of diferent colored green. Te fgure schematizes all 16 combinations genomes. Let us consider three genes g ∈ G, h ∈ H, and in which edges in the neighborhood of {g, h} and {g, i} i ∈ I, which are connected by two edges {g, h}, {g, i}∈E . (including {g, h} and {g, i}) can participate in a matching Tis scenario is represented in Fig. 7, where the two only constrained by (C.01). Saturated edges are indicated edges {g, h} and {g, i} that share the common gene g are by thick continuous lines, unsaturated edges by dashed Doerr et al. Algorithms Mol Biol (2017) 12:14 Page 10 of 14

ABCD g g g g

h h h h

i i i i

EFGH g g g g

h h h h

i i i i

IJKL g g g g

h h h h

i i i i

MNOP g g g g

h h h h

i i i i Fig. 7 The implications of Constraint (C.02) on combinations of saturated edges. Parts a–p visualize all 16 possibilities that are valid under Con- g, h straint (C.01). The parts show how edges incident to genes i and h are efected by the frst case of Constraint (C.02) that acts on edges { } and {g, i} (green lines). Saturated edges are indicated by thick continuous lines, unsaturated edges by dashed lines. Dotted gray lines are not considered by the constraint and can be either saturated or unsaturated. Only combinations shown in Parts h, l, and p violate constraint (C.02)

lines, and gray dotted lines (which can be either saturated dotted line between genes h and i indicates that edge {h, i} or unsaturated) are not considered by the two sum con- is not considered by the constraints of (C.02). In case straints. For instance, Fig. 7a represents the case in which edge {h, i} is saturated, it may be in confict with saturated no edge incident to vertices g, h, or i is saturated. When blue and red edges which results in violations of the pair- applying Constraint (C.02) on these 16 combinations, wise matching constraints of (C.01). it is ensured that (i) the sum of saturated edges that are Lastly, Constraint (C.03) covers the rules of form- e red or green is less than or equal to two, and (ii) that the ing conserved adjacencies: (i) it ensures that a variable , sum of saturated edges that are blue or green is less than which indicates a conserved adjacency for two edges, is or equal to two. Combinations that violate any of the two set to 1 only if the edges are saturated; (ii) using variables d sum constraints, shown in Fig. 7h, l, p, are exactly those , it prohibits that no gene (and thus no incident edge) that violate the partial 3-matching property. Te gray within a conserved adjacency is part of the matching. Doerr et al. Algorithms Mol Biol (2017) 12:14 Page 11 of 14

Experimental results and discussion 7 × 10 data sets with increasing evolutionary distance Our algorithms have been implemented in Python and ranging from 10 to 130 percent accepted mutations require CPLEX1; they are freely available as part of the (PAM). Details about the generated data sets are shown family-free genome comparison tool FFGC downloadable in Additional fle 1: Table 1 of Section 2. Figure 8a, at http://bibiserv.cebitec.uni-bielefeld.de/fgc. b show the outcome of our analysis with respect to preci- In subsequent analyses, gene similarities are based on sion and recall2 of inferring positional orthologs. In all local alignment hits identifed with BLASTP on protein simulations, program FF-Median and heuristic 10−5 sequences using an e-value threshold of . In gene FFAdj-AM generated no or very few false positives, lead- similarity graphs, we discard spurious edges by applying ing to perfect or near-perfect precision score, consist- a stringency flter proposed by Lechner et al. [13] that uti- ently outperforming MultiMSOAR. Te comparison 0, 1 lizes a local threshold parameter f ∈[ ] and BLAST between orthologs inferred by FF-Median and FFAdj- bit-scores: a BLAST hit from a gene g to h is only retained AM shows that the additional orthologies identifed by if it is has a higher or equal score than f times the best do not deteriorate the precision, but only ′ FFAdj-AM BLAST hit from h to any gene g that is member of the improve its recall. Tus, our heuristic method consist- same genome as g. In all our experiments, we set f to 0.5. ently outperforms MultiMSOAR in precision and recall Edge weights of the gene similarity graph are then cal- over all evolutionary distances. culated according to the relative reciprocal BLAST score (RRBS) [14]. Finally we applied algorithm ICF-SEG with Evaluation on real data conserved segments defned as runs. We study 15 γ-proteobacterial genomes that span a large For running programs FF-Median and FFAdj-3G, taxonomic spectrum and are contained in the OMA we granted CPLEX 64 CPU cores, 64 GB memory and a database [16]. A complete list of species names is given time limit of 1 h per dataset. In both simulated and real in Additional fle 1: Table 2 of Section 3. We obtained the data we set the FFAdj-3G’s parameter α to 0.9. genomic sequences from the NCBI database and con- In our experiments, we compare ourselves against the structed for each combination of three genomes a gene orthology prediction tool MultiMSOAR [11]. Tis tool similarity graph following the same procedure as in the requires precomputed gene families, which we con- simulated dataset. In 9 out of the 455 combinations of structed by following the workfow described in [11]. genomes the time limit prohibited CPLEX from fnding an optimal solution for program FF-Median. Likewise Evaluation on simulated data for FFAdj-AM, CPLEX was unable to fnd and optimal We frst evaluate our algorithms on simulated data sets solution in 69 combinations within the provided 1h time obtained by ALF [15]. Te ALF simulator covers many frame. However, in all those cases CPLEX was still able to aspects of genome evolution from point mutations to fnd integer feasible suboptimal solutions, many of which global modifcations. Te latter includes inversions and were less than a factor of 10% from the optimal. Figure 8e transpositions as genome rearrangement operations. displays statistics of the medians constructed from the Various options are available to customize the process of real dataset. Te number of candidate median genes gene family evolution. In our simulations, we mainly use and adjacencies ranges from 756 to 18,005 and 3164 to standard parameters suggested by the authors of ALF 2,261,716, respectively, giving rise to up to 3223 median and we focus on three parameters that primarily infu- genes that are distributed on 5 to 90 CARs per median. ence the outcome of gene family-free genome analysis: (i) Some CARs are circular, indicating dubious conforma- the rate of sequence evolution, (ii) the rate of genome tions mostly arising from tandem duplications, but the rearrangements, and (iii) the rate of gene duplications number of such cases were low (mean: 2.76, max: 14). and losses. We keep all three rates constant, only varying We observed that the gene families in the OMA data- the evolutionary distance between the generated extant base are clustered tightly and therefore missing many genomes. We confne our simulations to protein coding true orthologies in the considered triples of genomes. As sequences. A comprehensive list of parameter settings a result, many of the orthologous groups inferred by FF- used in our simulations is shown in Additional fle 1: Median/FFAdj-AM and MultiMSOAR fall into more Table 2 of Section 2. As root genome in the simulations, than one gene family inferred by OMA. We therefore we used the genomic sequence of an evaluate our results by classifying the inferred ortholo- K-12 strain (Accession No: NC_000913.2) which com- gous groups into three categories: An orthologous group prises 4320 protein coding genes. We then generated

2 Precision: #true positives/(#true positives #false positives), recall: #true 1 + http://www.ibm.com/software/integration/optimization/cplex-optimizer/. positives/(#true positives + #false negatives). Doerr et al. Algorithms Mol Biol (2017) 12:14 Page 12 of 14

ab ALF simulated datasets ALF simulated datasets 1 1

0.98 0.98

0.96 0.96

0.94 0.94

0.92 precision FF-Median 0.92 precision FFAdj-AM recall FF-Median recall FFAdj-AM 0.9 0.9 precision MultiMSOAR precision MultiMSOAR 0.88 recall MultiMSOAR 0.88 recall MultiMSOAR

10 30 50 70 90 110130 10 30 50 70 90 110130 evolutionary distance (PAM) evolutionary distance (PAM) cd OMA: FFAdj-AM OMA: MultiMSOAR 5000 5000 agreeing agreeing compatible compatible 4000 4000 disagreeing disagreeing

3000 3000

2000 2000

1000 1000

0 0 0 100 200 300400 500 0100 200 300400 500 combinations of 3 input genomes combinations of 3 input genomes

ef real data: median statistics real data: fragile orthologies 107 20 #edges in GSG 106 #median genes #cars 5 15 10 #circular cars

104 10 103

2 10 5

101

100 0 0 100 200 300400 500 -1000 010002000 combinations of 3 input genomes fragile(FFAdj-AM)- fragile(MultiMSOAR) Fig. 8 Top Precision and recall of a FF-Median and b FFAdj-AM in comparison with MultiMSOAR in simulations; Middle agreement, compat- ibility and disagreement of positional orthologs inferred by c FFAdj-AM and d MultiMSOAR with the OMA database; Bottom e statistical assess- ment of CARs and median genes inferred by FF-Median on real datasets; f histogram of fragile orthologies in results obtained by FFAdj-AM and MultiMSOAR

agrees with OMA if all its genes are in the same OMA of orthologous groups of FFAdj-AM and MultiMSOAR group. It disagrees with OMA if any two of its genes x and in each of the three categories. Figure 8c, d give an over- y (of genomes X and Y respectively) are in diferent OMA view on the outcome this analysis, showing that FFAdj- groups but the OMA group of x contains another gene AM and MultiMSOAR perform roughly equally well. from genome Y. It is compatible with OMA if it neither Te number of orthologous groups that disagree with agrees nor disagrees with OMA. We measure the number OMA is comparably low for both FFAdj-AM (mean: Doerr et al. Algorithms Mol Biol (2017) 12:14 Page 13 of 14

44.43, var: 129) and MultiMSOAR (mean: 44.63, var: Additional fle 243). In total, FFAdj-AM is able to infer 7865 orthologies more that are agree and 94 less that disagree with OMA. Additional fle 1. This fle contains a hardness proof for problem FF- Conversely, fnds 69,769 more compatible Median and additional information regarding the simulated and real MultiMSOAR dataset that were used in the evaluation of this study. orthologies than FFAdj-AM. We then performed another analysis to assess the fra- gility of the positional orthology predictions. To this end, Authors’ contributions we look at orthologous groups across multiple datasets All authors participated in discussing, formulating, and modulating the that share two extant genomes, but vary in the third. research. DD, PF, and CC initiated and directed the research. DD and CC wrote Given two genes, x of genome X and y of genome Y, an the manuscript. DD developed the presented methods, MB proposed and implemented algorithm ICF-SEG. DD, MB, and CC performed the experi- orthologous group that contains x and y is called frag- ments on simulated and real data. All authors read and approved the fnal ile if x and y no longer occur not in the same ortholo- manuscript. gous group if the third extant genome is exchanged for Author details another. We computed the total count of fragile orthol- 1 School of Computer and Communication Sciences, EPFL, INJ211 Station 14, ogies produced by FFAdj-AM and MultiMSOAR for 1015 Lausanne, Switzerland. 2 Faculty of Technology and Center for Biotech- all 105 genome pairs in our dataset, see Fig. 8f. In 88 nology (CeBiTec), Bielefeld University, Universitätsstr. 25, 33615 Bielefeld, Ger- 83.8% many. 3 Department of Mathematics, Simon Fraser University, 8888 University pairwise comparisons ( ) the orthologous groups Drive, Burnaby, BC V5A 1S6, Canada. inferred by FFAdj-AM have fewer fragile orthologies than those by MultiMSOAR. Acknowledgements DD and MB wish to thank Bernard M.E. Moret for helpful discussions. Overall, we can observe that FFAdj-AM performs equally well or better as MultiMSOAR—which is con- Competing interests sistent with our observation on simulated data—while The authors declare that they have no competing interests. producing less fragile orthologies in general. Tis sug- Availability of data and materials gests FFAdj-AM is an interesting alternative to identify The presented algorithms are implemented in the Python 2 programming higher confdence positional orthologs. language and freely available as part of the family-free genome comparison tool FFGC downloadable at http://bibiserv.cebitec.uni-bielefeld.de/fgc. The datasets used and analysed during the current study available from the cor- Conclusions and future work responding author on reasonable request. Our main contributions in this work are (i) the intro- Funding duction and analysis of a new problem, FF-Median, a CC acknowledges funding from the NSERC Discovery Grant. generalization of the unconstrained breakpoint median of three, (ii) FFAdj-3G, an exact algorithm for solv- Publisher’s Note ing problem FF-Adjacencies for three genomes, and (iii) Springer Nature remains neutral with regard to jurisdictional claims in pub- FFAdj-AM, a heuristic method combining both pro- lished maps and institutional afliations. grams FF-Median and FFAdj-3G. Our heuristic shows Received: 6 January 2017 Accepted: 18 May 2017 superior performance in simulations and comparable performance on real data compared to MultiMSOAR, a competing software tool. One aim of future work is to investigate alternative References methods to reduce the computational load of programs 1. Tannier E, Zheng C, Sankof D. Multichromosomal median and halving FF-Median and FFAdj-3G by identifying further problems under diferent genomic distances. BMC Bioinform. 2009;10:120. strictly sub-optimal and optimal substructures, which 2. Braga MDV, Chauve C, Doerr D, Jahn K, Stoye J, Thévenin A, Wittler R. The potential of family-free genome comparison. In: Models and algorithms might require a better understanding of the impact of for genome evolution. London: Springer; 2013. p. 287–323. internal conficts within substructures defned by inter- 3. Dörr D. Gene family-free genome comparison. PhD thesis, Univer- vals in the extant genomes. Without the need to modify sität Bielefeld, Bielefeld, Germany. 2016. https://pub.uni-bielefeld.de/ publication/2902049. drastically either the FF-Median/FF-Adjacencies prob- 4. Doerr D, Thévenin A, Stoye J. Gene family assignment-free comparative lem defnition or the ILP, one can think about more genomics. BMC Bioinform. 2012;13(Suppl 19):3. complex weighting schemes for adjacencies that could 5. Martinez FV, Feijão P, Braga MDV, Stoye J. On the family-free DCJ distance and similarity. Algorithms Mol Biol. 2015;10:13. account for known divergence time between genomes. 6. Kowada LAB, Doerr D, Dantas S, Stoye J. New genome similarity measures With regard to program FF-Median, it would prob- based on conserved gene adjacencies. RECOMB. 2016;2016:204–24. ably be interesting to combine this with the use of com- 7. Lechner M, Hernandez-Rosales M, Doerr D, Wieseke N, Thévenin A, Stoye J, Hartmann RK, Prohaska SJ, Stadler PF. Orthology detection mon intervals instead of runs to defne confict-free combining clustering and synteny for very large datasets. PLoS ONE. sub-instances. 2014;9(8):105015. Doerr et al. Algorithms Mol Biol (2017) 12:14 Page 14 of 14

8. Fitch WM. Homology a personal view on some of the problems. Trends 14. Pesquita C, Faria D, Bastos H, Ferreira AEN, Falcão AO, Couto FM. Metrics Genet. 2000;16:227–31. for GO based protein semantic similarity: a systematic evaluation. BMC 9. Gabaldón T, Koonin EV. Functional and evolutionary implications of gene Bioinform. 2008;9(Suppl 5):4. orthology. Nat Rev Genet. 2013;14:360–6. 15. Dalquen DA, Anisimova M, Gonnet GH, Dessimoz C. Alf—a simulation 10. Dewey CN. Positional orthology: putting genomic evolutionary relation- framework for genome evolution. Mol Biol Evol. 2012;29(4):1115–23. ships into context. Brief Bioinform. 2011;12(5):401–12. 16. Altenhof AM, Skunca N, Glover N, Train C, Sueki A, Pilizota I, Gori 11. Shi G, Peng M-C, Jiang T. Multimsoar 2.0: an accurate tool to identify K, Tomiczek B, Müller S, Redestig H, Gonnet GH, Dessimoz C. The ortholog groups among multiple genomes. PLoS ONE. 2011;6(6):20892. OMA orthology database in 2015: function predictions, better plant 12. Bergeron A, Chauve C, Gingras Y. Formal models of gene clusters. In: support, synteny view and other improvements. Nucleic Acids Res. Bioinformatics algorithms: techniques and applications, vol. 8. New York: 2015;43(Database–Issue):240–9. Wiley; 2008. p. 177–202. 13. Lechner M, Findeiß S, Steiner L, Marz M, Stadler PF, Prohaska SJ. Proteinortho: detection of (co-)orthologs in large-scale analysis. BMC Bioinform. 2011;12:124.

Submit your next manuscript to BioMed Central and we will help you at every step: • We accept pre-submission inquiries • Our selector tool helps you to find the most relevant journal • We provide round the clock customer support • Convenient online submission • Thorough peer review • Inclusion in PubMed and all major indexing services • Maximum visibility for your research

Submit your manuscript at www.biomedcentral.com/submit