Optimizing Phylogenetic Queries for Performance
Hasan M Jamil, Member, IEEE
Abstract—The vast majority of phylogenetic databases do not support declarative querying through which their contents can be flexibly and conveniently accessed, and the template based query interfaces they do support do not allow arbitrary speculative queries. They therefore also do not support query optimization that leverages unique phylogeny properties. While a small number of graph query languages such as XQuery, Cypher and GraphQL exist for computer savvy users, most are too general and complex to be useful for biologists, and too inefficient for large phylogeny querying. In this paper, we discuss a recently introduced visual query language, called PhyQL, that leverages phylogeny specific properties to support essential and powerful constructs for a large class of phylogenetic queries. We develop a range of pruning aids, and propose a substantial set of query optimization strategies that use these aids and are suitable for large phylogeny querying. A hybrid optimization technique that exploits a set of indices and “graphlet” partitioning is discussed. A “fail soonest” strategy is used to avoid hopeless processing and is shown to produce dividends. Possible novel optimization techniques yet to be explored are also discussed.
Index Terms—Phylogenetics, declarative queries, Datalog, visual querying, query optimization, pruning aids, graph matching.
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. X, NO. Y, MONTH 2017

1 INTRODUCTION

Interest in developing a flexible, expressive and efficient structure querying engine for phylogenetic databases has been growing steadily [1], [2], [3], [4], [5], [6]. This interest is based in part on the observations that 1) various types of evolutionary data are being generated using extremely expensive algorithms for life sciences research [7], [8], [9] and stored in public databases [10], [11], [12], [13]¹, and 2) their unique properties have not been exploited to develop scalable methods for the storage and manipulation of such vast collections of complex data structures [20], [21], [22]. Although phylogenies² are fundamentally trees, it has been observed that most well developed data manipulation techniques for graphs and trees are rendered ineffective or have unacceptable performance in phylogenetic databases. However, recent advances in graph and complex structured data management [23], [24], [25], [26], [27], [28] simultaneously show promise for a well rounded phylogenetic data management system and raise new research questions that need to be addressed.

While the recent graph matching algorithms are efficient, they are not directly suitable as a query language to support features such as part fixed and part tentative structure matching, or for computing wildcard queries such as least common ancestor or reachable nodes. The handful of languages that support declarative querying do so at the cost of high maintenance and query processing overhead. For example, to reduce the time complexity of least common ancestor (LCA) queries, PhyloFinder [10] preprocesses the trees and stores additional labeling information in nodes, and Crimson [29] uses Dewey node labeling [30]. Although Dewey labeling helps, it often requires long nested tree representation. The Crimson system eliminates this problem by storing the labels in nested subtrees to avoid long chains. Such labeling also complicates updates because insertions and modifications disrupt the Dewey order, which must then be recomputed. To deal with such labeling hurdles, nested interval encoding [31] was used in PhyloFinder, which translates essentially into a simple string search.

Evidently, the ability to conveniently store phylogenies computed using CPU intensive algorithms [32], [33], [34], [35], [36], and to later retrieve them for analyses, is increasingly becoming an imperative. So is the need for a convenient representation for effective integration of various types of phylogenies. From these standpoints, a format and application independent abstract data model and a declarative query language can play a transformative role [37], [38] in phylogenetic databases. Thus, for declarative phylogeny query languages, efficient query processing and optimization become a serious next step. The PhyloBase model and the PhyQL query language we present in this paper address both. To our knowledge, PhyQL is one of a handful of languages that allow declarative querying (other than PQL [39] and Crimson), and the only language that allows composable structure and pattern queries visually due to its declarative foundation. In this paper, our focus is on query processing and optimization in PhyQL (introduced recently in [3]) that leverages a deductive reasoner as its query engine.

H. M. Jamil is with the Department of Computer Science, University of Idaho, Moscow, ID, 83844, USA. His research was supported in part by National Science Foundation grant DRL 1515550. E-mail: [email protected].

1. Also in many specialized phylogenetic databases such as FUNYBASE [14], DarkHorse [15], TreeFam [16], PALI [17], ImmTree [18], and HvrBase++ [19].

2. Although there are several different types of evolution trees such as taxonomy, phylogenetic and gene trees, our use of the term phylogeny is generic and could represent any of these tree types.

2 RELATED RESEARCH

While a large number of phylogenetic analysis tools and applications have been developed [1], [40], [2], [41], [4], [42], [6], very few query languages are available for phylogenetic databases. Many of these tools are focused on generating phylogenies or using phylogenies already collected in specific formats. Databases such as TreeBASE [13] and PhyloExplorer [4] support custom APIs to search phylogenies in specific formats that are not actually based on tree query languages, and are often based on tree matching algorithms [43]. There have been previous efforts in developing query languages, declarative languages to be specific, for phylogenetic databases [2], [3], [39] based on the observations in [21], [44] with varying degrees of success. The limitations of these languages are still forcing procedural extensions of languages such as Python and Java [45], [46], [47], [5] for accessing and visualizing phylogenies, and forcing the user to incur significant development costs.

The PQL language [39], though declarative and usable for querying phylogenies, is specifically designed for querying pathways, has a complex syntax and semantics, and is not amenable to developing a simple and intuitive visual interface. The CDAO-Store, on the other hand, is designed to support integrated phylogenetic analysis based on the NexML format. Although it is based on OWL and a logic based model, it only supports a web browser and predefined query suites for accessing its content for a specific set of data sources in a limited way. While Nakhleh et al. [21] had proposed a declarative engine to process tree queries based on a canonical model similar to PhyloBase, its performance was a limiting factor as it used traditional negation based rules to compute LCA queries that forced expensive stable model computation in Datalog.

3 A MODEL FOR REPRESENTING AND QUERYING PHYLOGENIES

Over the years, researchers have tried to develop a canonical data model for phylogenies and have proposed several standards for representation. In some ways, all of these models have strengths and deficiencies relative to the applications they aim to support. Consequently, several popular but widely heterogeneous phylogeny representation standards such as Newick, NEXUS, PhyloXML, and NexML have emerged and evolved over the years. The multitude of phylogeny analytics that have been developed and used by the members of the community largely favor one of these standards. To reconcile the heterogeneities of these representation standards and help cross reference data from various models, many mapping and translation methods have also been developed, indicating that these standards are here to stay for an indefinite period. An interesting observation is that deviating from these standards for the purpose of representation, querying and manipulation, and query optimization in no way limits the strengths of any new model since developing a mapping procedure addresses the all too common and prevailing standardization disparity. In other words, developing a generalized data model for the representation of phylogenies for our purposes is academic and follows standard practices.

3.1 PhyloBase Data Model

across nodes within a phylogeny. The accommodation of hybridization events basically renders the model to be DAGs as opposed to trees, but treating them orthogonally as exceptions makes them trees nonetheless. The formal model discussed below can be tailored appropriately to accommodate most of the popular standards which treat phylogenies as rooted trees in which internal nodes and edges are optionally labeled, but leaf nodes are always labeled, i.e., edges and internal nodes need not be named or labeled³. A set of homogeneous phylogenies (trees) is called a collection (or forest), and a PhyloBase database essentially is a set of such collections.

Technically, the language $L$ (PhyQL) of PhyloBase is a structure of the form $\langle \mathcal{I}, \mathcal{V}, \mathcal{L}, \mathcal{C}, \lambda \rangle$ where $\mathcal{I}$ is a set of identifiers, $\mathcal{V}$ is a set of vertices, $\mathcal{L}$ is a set of labels including empty labels, $\mathcal{C}$ is a set of collections, and $\lambda$ is a labeling function. Each phylogeny $T$ in a collection $C$ is of the form $T = \langle I, V, E_a, E_h \rangle$ where $I \in \mathcal{I}$, $V \subseteq \mathcal{V}$ is a set of vertices, and $E_a \subseteq V \times V$ and $E_h \subseteq V \times V$ are sets of edges such that $\langle v_1, v_2 \rangle \in E_a \Rightarrow \langle v_1, v_2 \rangle \notin E_h$ and vice versa, $E_a \cup E_h$ is acyclic, $\forall v \in V\,(\exists u \in V\,(\langle u, v \rangle \in E_a \vee \langle v, u \rangle \in E_a))$ (i.e., $|V| \geq 2$), and $E_a$ is a tree. Since $E_h$ models horizontal transfers, we impose the constraint that for all $v_1, v_2$ with $\langle v_1, v_2 \rangle \in E_h$, neither $v_1$ nor $v_2$ is the root node in $T$⁴. The set of all such collections $C$ is the set $\mathcal{C}$.

We require $\lambda$ to be a labeling function of the form $\lambda : U \to \mathcal{L} \times \mathcal{L} \times \cdots \times \mathcal{L}$ that assigns an n-ary vector of labels, where $U$ is one of the components in $\{I, V, E_a, E_h\}$. Intuitively, this means every tree $T \in C$ and $C \in \mathcal{C}$ is unique (identified by its ID $I$), and is possibly described using attributes (i.e., $\lambda : I \to \mathcal{L} \times \mathcal{L} \times \cdots \times \mathcal{L}$), such as author and date. Each edge in $E_a$ and $E_h$ may also be optionally labeled. Finally, although we allow labeling of any node in $V$, we require all leaf nodes $v \in T$ to be labeled, i.e., $\forall v \in V\,\big(\exists u \in V\,(\langle u, v \rangle \in E_a) \wedge \neg\exists w \in V\,(\langle v, w \rangle \in E_a) \Rightarrow \lambda(v) \neq \emptyset\big)$. This definition allows a subset of internal nodes and edges to be labeled while the rest are possibly not. Finally, we adhere to and use standard terms of graphs and trees such as height or depth of trees, path between nodes, nodes, branching factor or fan out, average fan out of a node, internal and leaf nodes, and subtrees.

3.2 Persistent Storage Model

Given that some simulation phylogenies are potentially more than a million levels deep [29], and often have several million species and many millions of internal nodes, literally storing them as trees is both infeasible and not prudent from management and querying standpoints. The sheer size of the real life phylogenies also challenges the wisdom of the algorithms that try to match them in memory [48] using the tree at a time paradigm. Furthermore, Felsenstein [20] estimated that there are about $8.8 \times 10^{23}$ possible tree topologies just for a

3. For example, in species evolution trees, identities of ancestors are often not known and thus the internal node from which divergence occurred cannot be labeled. Similarly, edges in species trees often show estimated length of evolution, or descriptive information, which may not be present in gene trees.

4. Consider a tree with edges $E_a$ = {⟨a,b⟩, ⟨a,c⟩, ⟨c,d⟩
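The structural constraints that Section 3.1 places on a phylogeny T = ⟨I, V, E_a, E_h⟩ can be illustrated with a small checker. The following is a minimal Python sketch and not PhyloBase's actual storage or validation code: the class `Phylogeny`, its field names, and the function `violations` are hypothetical. It assumes edges are directed (parent, child) pairs, that acyclicity is meant in the directed sense, that the condition relating E_a and E_h means the two edge sets are disjoint, and that λ restricted to vertices can be modeled as a dictionary in which a missing or empty entry means "unlabeled".

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class Phylogeny:
    """Hypothetical container for T = <I, V, E_a, E_h> (illustration only)."""
    tree_id: str
    vertices: set
    ancestral: set                                   # E_a as (parent, child) pairs
    horizontal: set = field(default_factory=set)     # E_h as (source, target) pairs
    labels: dict = field(default_factory=dict)       # lambda on vertices


def is_acyclic(vertices, edges):
    """Kahn's algorithm: acyclic iff every vertex can be removed in topological order."""
    indegree = {v: 0 for v in vertices}
    successors = defaultdict(list)
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    queue = deque(v for v, d in indegree.items() if d == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return removed == len(vertices)


def violations(t):
    """Return the list of Section 3.1 constraints that t breaks (empty if none)."""
    endpoints = {x for edge in t.ancestral | t.horizontal for x in edge}
    if not endpoints <= t.vertices:
        return ["edges must connect declared vertices"]
    problems = []
    parents = defaultdict(set)
    for u, v in t.ancestral:
        parents[v].add(u)
    touched = {x for edge in t.ancestral for x in edge}
    if len(t.vertices) < 2 or touched != t.vertices:
        problems.append("every vertex must be an endpoint of some E_a edge")
    if t.ancestral & t.horizontal:
        problems.append("E_a and E_h must be disjoint")
    # Rooted tree: one parentless vertex, at most one parent each, no cycle.
    roots = [v for v in t.vertices if not parents[v]]
    single_parent = all(len(parents[v]) <= 1 for v in t.vertices)
    if len(roots) != 1 or not single_parent or not is_acyclic(t.vertices, t.ancestral):
        problems.append("E_a must form a rooted tree")
    if len(roots) == 1 and any(roots[0] in edge for edge in t.horizontal):
        problems.append("E_h edges must not touch the root")
    if not is_acyclic(t.vertices, t.ancestral | t.horizontal):
        problems.append("the union of E_a and E_h must be acyclic")
    internal = {u for u, _ in t.ancestral}           # vertices with a child
    if any(not t.labels.get(v) for v in t.vertices - internal):
        problems.append("every leaf must be labeled")
    return problems


# A five-node tree with one horizontal transfer between two leaves.
tree = Phylogeny(
    "T1",
    {"r", "x", "a", "b", "c"},
    {("r", "x"), ("r", "c"), ("x", "a"), ("x", "b")},
    {("a", "c")},
    {"a": "A", "b": "B", "c": "C"},
)
print(violations(tree))
```

Internal nodes `r` and `x` are left unlabeled on purpose, since the model only requires labels on leaves. Returning a list of messages rather than raising on the first failure is a design choice for the sketch; it lets one call report every broken constraint for a given tree.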