Generalized Suffix Trees for Biological Sequence Data: Applications and Implementation

Paul Bieganski, John Riedl and John V. Carlis
Computer Science Department, University of Minnesota

Ernest F. Retzel
Medical School, University of Minnesota

Proceedings of the Twenty-Seventh Annual Hawaii International Conference on System Sciences, 1994. 1060-3425/94 $3.00 © 1994 IEEE.

This paper addresses applications of suffix trees and generalized suffix trees (GSTs) to biological sequence data analysis. We define a basic set of suffix tree and GST operations needed to support sequence data analysis. While those definitions are straightforward, the construction and manipulation of disk-based GST structures for large volumes of sequence data requires intricate design. GST processing is fast because the structure is content addressable, supporting efficient searches for all sequences that contain particular subsequences. Instead of laboriously searching sequences stored as arrays, we search by walking down the tree. We present a new GST-based sequence alignment algorithm, called GESTALT. GESTALT finds all exact matches in parallel, and uses best-first search to extend them to produce alignments. Our implementation experiences with applications using GST structures for sequence analysis lead us to conclude that GSTs are valuable tools for analyzing biological sequence data.

1 Introduction

Genetic codings transmit information from a parent to its progeny. The primary representation of information is sequences of symbols from a limited alphabet. The most commonly used alphabets represent the four nucleic acids forming strands of DNA and the twenty amino acids constituting polypeptides. The information encoding power of these limited sets of molecules lies in their ability to form long chains.

Sequence information is most commonly stored in computer memory in contiguous locations, in order of the molecules in the biological sequence. This storage method is not efficient for a large group of sequence information processing applications. The key problem lies in the fact that data stored sequentially must be processed sequentially. The information within the sequence is often encoded through the presence of a certain subsequence of molecules; for example a sequence of DNA coding for a certain protein. In order to detect the presence of any given subsequence the entire sequence must be accessed, and in order to detect a subsequence in a set of sequences, each sequence must be accessed sequentially. As the volume of sequence data increases, the data access time itself becomes the limiting factor of sequence information retrieval regardless of advances in sequence comparison speeds. What is needed is a system capable of content-addressing of sequence. Such a system allows a sequence to be accessed in terms of what it contains without having to specify where it is contained.

In this paper we describe the data structures and operations providing content-addressable access to sequence data, outline their possible applications and report on our work on their implementation. In section 2, we provide a short review of suffix trees and generalized suffix trees. In section 3, we review the operations that provide the basis for construction of applications using suffix tree structures. In section 4 we outline a generalized suffix tree-based sequence homology search algorithm. In section 5, we present an overview of sequence information processing applications that can be constructed using the basic operations described in section 3. In section 6 we report on our experiences with implementations of some of the applications. In section 7, we summarize and outline future work.

2 Suffix trees

A trie is an indexing structure used for indexing sets of key values of varying sizes [13]. A trie is a tree in which the branching at any level is determined by a partial key value (see figure 1). A suffix tree is a PATRICIA trie [4] built over the set of all suffixes of a given sequence S. Figure 2 illustrates a sample suffix tree of a short DNA sequence. Each path from the root of the suffix tree represents a suffix of the original string. Any individual suffix of the original sequence can be recreated by walking along a path from the root and catenating the labels of the edges traversed along the way. Nodes with a single outgoing edge can be collapsed, resulting in edges with multi-symbol labels, such as the edge labeled AC coming out of the root of the tree in figure 2.

FIGURE 1. A suffix tree for sequence ACTT

FIGURE 2. Suffix tree for sequence ACACTT. Some edges have multi-symbol labels resulting from collapsing nodes with single children.

FIGURE 3. Suffix tree for sequence CT

FIGURE 4. A generalized suffix tree for two sequences, ACACTT and CT. Notice the sequence identifier leaves (squares) denoting the origin of every suffix of every sequence. The number of identifier leaves for every sequence is equal to the number of suffixes of that sequence (equal to the number of characters in the sequence).

A suffix tree for sequence S of length n can be constructed in O(n) time [8] and the number of nodes in the tree is in general a linear function of n [30]. Heuristic arguments [30] and our experiments with trees containing the rodent subsection of GenBank indicate that the number of nodes is equal to less than twice the number of bases in the sequence for long DNA sequences.

A generalized suffix tree (GST) is an augmented version of the suffix tree allowing for multiple sequences to be stored in the same tree. A GST can be viewed as a suffix tree with additional sequence-identifier leaves added to the leaves of the original suffix tree. A generalized suffix tree containing multiple sequences contains all suffixes of each of the original sequences (see figures 2, 3 and 4). For every suffix, its sequence of origin is identified. A GST can be augmented with information about the number of different sequences that contain suffixes expressed by descendants of each node (see figure 4). This number is also known as the Color Set Count (CSC) [7]. A GST can be constructed in O(n) time [30], where n is the sum of lengths of all sequences stored in the tree. Parallel algorithms for GST construction will be described later.

3 Basic suffix tree operations

In this section we describe a set of basic operations on generalized suffix trees. Molecular biology applications are implemented by combining one or more of these operations.

In the remainder of the paper we use seq to denote a single sequence and SEQS to denote a set of sequences. In suffix trees with multi-letter edge labels, it is not sufficient to give the node in order to describe a subsequence because the subsequence may terminate within an edge. For the sake of clarity we assume subsequences terminate at nodes in the discussion.

3.1 Suffix tree construction [tree(seq)]

Suffix tree construction algorithms have been described in the literature [8, 17, 23]. They operate by constructing an initial tree with a single branch corresponding to the entire sequence and incrementally modifying the tree to include all of its suffixes. An important variable in the construction process is the choice of data structures used to represent the tree. Fixed node-size trees have a node access time advantage over child list-based ones since a descendant can be accessed directly, without having to traverse a list. They are particularly useful when the number of descendant pointers is small. For instance, each node of a DNA sequence GST can have at most four children, labelled A, C, G and T. Child list-based nodes are more space efficient in applications when the number of descendant nodes may be large, for example when representing amino acid sequences.

3.2 Generalized suffix tree construction [gst(SEQS)]

The GST construction process is usually approached [17, 23] by adding a special sequence separator symbol to the alphabet. The sequences to be included in the tree are catenated, separated from each other by the separator symbol. The GST is created using the ordinary suffix tree construction algorithm on the catenated sequence. The GST created using this process has to be kept in main memory during construction, hence this approach is not feasible when sets of thousands or more sequences are involved. We have developed an incremental disk-based GST construction method using binary merging of GSTs. Two GSTs representing two disjoint sets of sequences are merged to produce a single GST representing the union of the two sets.

[...] matching the prefix of seq can be determined by examining the sequence count information in the nearest descendant of the traversal termination point (see the description of GST in section 2).

3.5 Suffix tree matching [match(tree(s), tree(p))]

Matching of suffix trees against suffix trees is performed similarly to matching of sequences against suffix trees. Instead of traversing a single path, however, all paths corresponding to an exhaustive traversal of one tree are traversed in the other tree. If the trees were truncated every time the traversal process reaches a dead end in either tree, the resulting tree would contain all common subsequences of the two sequences. This tree can be examined to determine the lengths and locations for common subsequences. To avoid truncating the trees, summary information about the match can be collected during the traversal process. For instance, the longest common subsequence can be determined by keeping track of the longest match encountered [...]
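
The tree-against-tree matching of section 3.5 can be illustrated with a minimal Python sketch. It is not the disk-based implementation discussed in this paper. It assumes small in-memory suffix tries stored as nested dictionaries, without collapsed edges and built naively in quadratic time, and it uses "longest common subsequence" in the paper's sense of a contiguous run of symbols. The function names build_suffix_trie and longest_common_substring are hypothetical and chosen only for illustration.

```python
def build_suffix_trie(seq):
    """Build an uncompressed suffix trie as nested dicts keyed by symbol.

    Naive O(n^2) construction, for illustration only; the linear-time
    algorithms cited in section 3.1 are not reproduced here.
    """
    root = {}
    for start in range(len(seq)):
        node = root
        for symbol in seq[start:]:
            # Create the child for this symbol if it does not exist yet.
            node = node.setdefault(symbol, {})
    return root


def longest_common_substring(tree_s, tree_p):
    """Traverse both tries in lockstep and keep the longest shared path.

    A branch is followed only when the same symbol leaves both current
    nodes, so the walk stops at a dead end in either tree. Summary
    information (the best path so far) is collected during the traversal
    instead of building a truncated tree.
    """
    best = ""
    stack = [(tree_s, tree_p, "")]
    while stack:
        node_s, node_p, path = stack.pop()
        if len(path) > len(best):
            best = path
        for symbol, child_s in node_s.items():
            child_p = node_p.get(symbol)
            if child_p is not None:  # symbol present in both trees
                stack.append((child_s, child_p, path + symbol))
    return best


if __name__ == "__main__":
    s_tree = build_suffix_trie("ACACTT")
    p_tree = build_suffix_trie("CT")
    print(longest_common_substring(s_tree, p_tree))  # prints CT
```

With the two sequences of figure 4, ACACTT and CT, the shared paths are C, T and CT, so the traversal reports CT. When several common runs have the same maximal length, this sketch returns whichever one it reaches first.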
