Linear-Time Computation of Minimal Absent Words Using Suffix Array Carl Barton1, Alice Heliou2,3, Laurent Mouchard4 and Solon P Pissis1*

Linear-time computation of minimal absent words using suﬀix array Carl Barton, Alice Héliou, Laurent Mouchard, Solon P. Pissis To cite this version: Carl Barton, Alice Héliou, Laurent Mouchard, Solon P. Pissis. Linear-time computation of minimal absent words using suﬀix array. BMC Bioinformatics, BioMed Central, 2014, 15, pp.11. 10.1186/s12859- 014-0388-9. hal-01110274 HAL Id: hal-01110274 https://hal.inria.fr/hal-01110274 Submitted on 27 Jan 2015 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Barton et al. BMC Bioinformatics (2014) 15:388 DOI 10.1186/s12859-014-0388-9 RESEARCH ARTICLE Open Access Linear-time computation of minimal absent words using suffix array Carl Barton1, Alice Heliou2,3, Laurent Mouchard4 and Solon P Pissis1* Abstract Background: An absent word of a word y of length n isawordthatdoesnotoccuriny.Itisaminimal absent word if all its proper factors occur in y. Minimal absent words have been computed in genomes of organisms from all domains of life; their computation also provides a fast alternative for measuring approximation in sequence comparison. There exists an O(n)-time and O(n)-space algorithm for computing all minimal absent words on a fixed-sized alphabet based on the construction of suffix automata (Crochemore et al., 1998). No implementation of this algorithm is publicly available. There also exists an O(n2)-time and O(n)-space algorithm for the same problem based on the construction of suffix arrays (Pinho et al., 2009). An implementation of this algorithm was also provided by the authors and is currently the fastest available. Results: Our contribution in this article is twofold: first, we bridge this unpleasant gap by presenting an O(n)-time and O(n)-space algorithm for computing all minimal absent words based on the construction of suffix arrays; and second, we provide the respective implementation of this algorithm. Experimental results, using real and synthetic data, show that this implementation outperforms the one by Pinho et al. The open-source code of our implementation is freely available at http://github.com/solonas13/maw. Conclusions: Classical notions for sequence comparison are increasingly being replaced by other similarity measures that refer to the composition of sequences in terms of their constituent patterns. One such measure is the minimal absent words. In this article, we present a new linear-time and linear-space algorithm for the computation of minimal absent words based on the suffix array. Keywords: Absent words, Minimal absent words, Suffix array Background computationally expensive and hardly significant. These Sequence comparison is an important step in many observations have led to increased research into alignment important tasks in bioinformatics. It is used in many free techniques for sequence comparison. A number of applications; from phylogenies reconstruction to the alignment free techniques have been proposed: in [1], a reconstruction of genomes. Traditional techniques for method based on the computation of the shortest unique measuring approximation in sequence comparison are factors of each sequence is proposed; other approaches basedonthenotionsofdistanceorofsimilaritybetween estimate the number of mismatches per site based on the sequences; and these are computed through sequence length of exact matches between pairs of sequences [2]. alignment techniques. An issue with using alignment Thus standard notions are gradually being comple- techniques is that they are computationally expen- mented (or even supplanted) by other measures that refer, sive: they require quadratic time in the length of the implicitly or explicitly, to the composition of sequences sequences. Moreover, in molecular taxonomy and phy- in terms of their constituent patterns. One such measure logeny, for instance, whole-genome alignment proves both is the notion of words absent in a sequence. A word is an absent word of some sequence if it does not occur in *Correspondence: [email protected] the sequence. These words represent a type of negative 1 Department of Informatics, King’s College London, The Strand, WC2R 2LS information: information about what does not occur in London, UK Full list of author information is available at the end of the article the sequence. Noting the words which do occur in one © 2014 Barton et al.; licensee BioMed Central. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Barton et al. BMC Bioinformatics (2014) 15:388 Page 2 of 10 sequence but do not occur in another can be used to (or of tries) may have a large memory footprint. Due detect mutations or other biologically significant events. to this, an important problem is to be able to com- Given a sequence of length n, the number of absent pute the minimal absent words of a sequence without words of length at most n can be exponential in n,mean- the use of data structures such as the suffix automa- ing that using all the absent words for sequence com- ton. To this end, the computation of minimal absent parison is more expensive than alignments. However, the words based on the construction of suffix arrays was number of certain subsets of absent words is only lin- considered in [6]; although fast in practice, the worst- ear in n. An absent word of a sequence is a shortest case runtime of this algorithm is O(n2). Alternatively, absent word if all words shorter than it do occur in one could make use of the succinct representations of the sequence. An O(mn)-time algorithm for computing the bidirectional BWT, recently presented in [14], to shortest absent words was presented in [3], where m is compute all minimal absent words in time O(n).How- a user-specified threshold on the length of the short- ever, an implementation of these representations was est absent words. This was later improved by [4], who not made available by the authors; and it is also rather presented an O(n log log n)-time algorithm for the same unlikely that such an implementation will outperform an problem. This has been further improved and an O(n)- O(n)-time algorithm based on the construction of suffix time algorithm was presented in [5]. arrays. A minimal absent word of a sequence is an absent word whose proper factors all occur in the sequence. Our contribution Notice that minimal absent words are a superset of short- In this article, we bridge this unpleasant gap by presenting est absent words [6]. An upper bound on the number the first O(n)-time and O(n)-space algorithm for com- of minimal absent words is O(σn) [7,8], where σ is the puting all minimal absent words of a sequence of length size of the alphabet. This suggests that it may be possi- n based on the construction of suffix arrays. In addi- ble to compare sequences in time proportional to their tion, we provide the respective implementation of this lengths, for a fixed-sized alphabet, instead of propor- algorithm. This implementation is shown to be more effi- tional to the product of their lengths [1]. Theory and cient than existing tools, both in terms of speed and some applications of minimal absent words can be found memory. in [9]. Recently, there has been a number of biological stud- ies on the significance of absent words. The most com- Methods prehensive study on the significance of absent words is Definitions and notation probably [10]; in this, the authors suggest that the deficit To provide an overview of our result and algorithm, we of certain subsets of absent words in vertebrates may begin with a few definitions. Let y = y[0] y[1] . .y[n − 1] be be explained by the hypermutability of the genome. It a word of length n =|y| over a finite ordered alphabet was later found in [11] that the compositional biases of size σ =||=O(1). We denote by y[i..j] = y[i]..y[j] observed in [10] for vertebrates are not uniform through- the factor of y that starts at position i and ends at position out different sets of minimal absent words. Moreover, j and by ε the empty word, word of length 0. We recall that the analyses in [11] support the hypothesis of the inheri- aprefixofy is a factor that starts at position 0 (y[0. .j]) and tance of minimal absent words through a common ances- a suffix is a factor that ends at position n − 1(y[i..n − 1]), tor, in addition to lineage-specific inheritance, only in and that a factor of y is a proper factor if it is not the empty vertebrates. In [12], the minimal absent words in four word or y itself. human genomes were computed, and it was shown that, Let x be a word of length 0 < m ≤ n. We say that there as expected, intra-species variations in minimal absent exists an occurrence of x in y,or,moresimply,thatx occurs words were lower than inter-species variations.

Load more