KP-Trie Algorithm for Update and Search Operations
The International Arab Journal of Information Technology, Vol. 13, No. 6, November 2016

Feras Hanandeh1, Izzat Alsmadi2, Mohammed Akour3, and Essam Al Daoud4
1Department of Computer Information Systems, Hashemite University, Jordan
2, 3Department of Computer Information Systems, Yarmouk University, Jordan
4Computer Science Department, Zarqa University, Jordan

Abstract: The radix tree is a space-optimized data structure that performs data compression by clustering nodes that share the same branch. Each node with only one child is merged with its child, which is why the structure is considered space optimized. Nevertheless, it cannot be considered speed optimized, because the root is associated with the empty string. Moreover, values are not normally associated with every node; they are associated only with leaves and with some inner nodes that correspond to keys of interest. Therefore, it takes time to move bit by bit to reach the desired word. In this paper we propose the KP-Trie, which is considered a speed- and space-optimized data structure resulting from both horizontal and vertical compression.

Keywords: Trie, radix tree, data structure, branch factor, indexing, tree structure, information retrieval.

Received January 14, 2015; accepted March 23, 2015; Published online December 23, 2015

1. Introduction

A data structure is a specialized format for efficiently organizing, retrieving, saving and storing data. Data structures are especially valuable with large amounts of data, such as large databases. The retrieval process fetches data from either main memory or secondary memory. The use of data structures varies depending on the nature or the purpose of the structure. The main criterion for assessing the quality of any data structure is its performance, that is, the speed of storing and accessing data in it. Any current natural language, such as English, French or Arabic, can have up to one million words, although dictionaries cannot contain all of them, especially as languages continuously grow by adding new words or borrowing words from other languages. Current versions of the Oxford English Dictionary may have up to half a million words. As such, a software product or web application that uses a dictionary should have efficient data structures for effective storage, access and expansion of data or words.

The concept of data structures such as trees has developed since the 19th century. Tries evolved from trees. They have different names, such as radix tree, prefix tree, compact trie, bucket trie, crit-bit tree and Practical Algorithm to Retrieve Information Coded in Alphanumeric (PATRICIA) [9].

In its existing general form, the trie is believed to have been first proposed by Morrison [14]. The variants named above (radix tree, prefix tree, compact trie, bucket trie, crit-bit tree and PATRICIA) may differ in structural detail. For example, unlike PATRICIA tree nodes, which store keys and words, the nodes of a trie, with the exception of leaf nodes, work merely as pointers to words.

A trie, also called a digital tree, is an ordered multi-way tree data structure that is useful for storing an associative array whose keys are usually strings, in the form of words or dictionaries. The word “trie” comes from the middle letters of the word “retrieval” and was coined by Fredkin [7]. The nature of the structure makes it quick to search for or retrieve queried words. A common use case for tries is finding all dictionary words that start with a particular prefix (e.g., finding all words that start with “data”). In tries, keys are stored in the leaves and the search method involves left-to-right comparison of prefixes of the keys.

In each trie, nodes are children that can in turn be parents of lower nodes. Nodes contain letters that represent keys, or pointers to words (or to the rest of the words) at the lowest leaf levels. In principle, each node can contain the searched-for word (if it is a leaf). This can change dynamically as more words are added. Finding a word in a trie depends on the size of the tree, or the number of words, along with its structure. The depth that a query can reach depends on the number of characters in the searched-for word. If the word does not exist in the tree, the search follows the longest matching node sequence.

The nature of certain trie structures facilitates not only quicker access and retrieval of information but also smooth expansion of the structure. In many cases, a dictionary must accept the addition of new words, and hence the structure should facilitate this expansion smoothly and dynamically. This can be for either fragmentation or for allocation/reallocation [9]. In some other cases, it may be necessary to compress, encode or encrypt data in those tree structures.

In computer science, a radix tree, also known as a PATRICIA trie or compact prefix tree, is a compressed version of a trie and a space-optimized data structure in which any node that is an only child is merged with its parent. Unlike in regular trees, where whole keys are compared from their beginning up to the point of inequality, the key at each node is compared chunk of bits by chunk of bits, where the number of bits in the chunk at that node is the radix r of the radix trie. Radix trees support insertion, deletion and searching in O(k), where k is the maximum length of all strings in the set. Unlike in regular tries, edges can be labeled with sequences of elements as well as with single elements.

The branching factor is essential in computing the time complexity of searching a tree. The branching factor is the number of children of each node of the tree. Usually every node has the same branching factor; the difficulty arises when different nodes at the same level of the tree have different numbers of children. In this situation, the branching factor at a given depth of the tree can be defined as the ratio of the number of nodes at that depth to the number of nodes at the next shallower depth.

The time complexity also depends on the solution depth, which is the length of a shortest solution path. Computing the branching factor and the solution depth depends on the given problem instance.

The rest of this paper is organized as follows: section two presents studies related to the subject of the paper, section three shows the proposed KP-Trie, section four presents the evaluation and experimental results, and the last section presents the conclusions.

2. Related Works

Some recent research suggests that new data structures, such as hash tables or linked structures, can be suitable alternatives to tries in terms of flexibility and performance.

Askitis and Sinha [1] suggested that the HAT-Trie is a better alternative for trie representation. It is based on a previous approach, the bucket trie, in which buckets are split using B-tree splitting. This research aims to improve burst tries through caching and the use of hash tables.

the proposed Flash trie data structure followed compression techniques to minimize the total size of the Trie. Behdadfar and Saidi [4] applied trie search to IP routing table search optimization, where they tried to adjust tries to deal with large data structures such as those of IP addresses. They tried to arrange nodes by encoding their addresses as numbers, so that nodes can be reached faster based on the encoded numbers. They concluded that some methods can be optimized for search or query, whereas similar methods can be optimized for addition or update operations on the trie table.

Bodon and Ronyai [5] proposed a modified version of the trie for frequent itemset mining, which is regarded as one of the most important data mining fields. Experiments showed that the performance of tries is close to that of hash-trees with optimal parameters at high support thresholds; at lower thresholds, however, the trie-based algorithm outperforms the hash-tree. Moreover, tries are well suited to an efficient implementation of candidate generation, because pairs of items that produce candidates have the same parents. Thus, candidates can be easily obtained by a simple scan of a part of the trie. This gives simpler algorithms, which are faster for a range of applications and avoid the need for fine tuning by employing a self-adjusting method. Ferragina and Grossi [6] introduced the String B-Tree, which links some traditional external-memory and string-matching data structures. It is a combination of B-Trees and Patricia tries for internal-node indices that is made more effective by adding extra pointers to accelerate search and update operations.

Fredkin [7] described the trie structure. Fredkin said “As defined by me, nearly 50 years ago, it is properly pronounced ‘tree’ as in the word ‘retrieval’, at least that was my intent when I gave it the name ‘Trie’, the idea behind the name was to combine reference to both the structure (a tree structure) and a major target (data storage and retrieval)”. Fu et al. [8] introduced a modified LC-Trie lookup algorithm in order to enhance its performance. The idea depends on expanding and collapsing the routing prefixes, which turns them into a disjoint, complete and minimal set. Subsequently, the base and prefix vectors are removed from the data structure. This results in a smaller data structure and fewer executed instructions, which in turn improves performance; this technique is called prefix transformation. Thereafter, the LC-Trie’s