Algorithms for Efficiently Collapsing Reads with Unique Molecular Identifiers
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Application of TRIE Data Structure and Corresponding Associative Algorithms for Process Optimization in GRID Environment
Application of TRIE data structure and corresponding associative algorithms for process optimization in GRID environment V. V. Kashanskya, I. L. Kaftannikovb South Ural State University (National Research University), 76, Lenin prospekt, Chelyabinsk, 454080, Russia E-mail: a [email protected], b [email protected] Growing interest around different BOINC powered projects made volunteer GRID model widely used last years, arranging lots of computational resources in different environments. There are many revealed problems of Big Data and horizontally scalable multiuser systems. This paper provides an analysis of TRIE data structure and possibilities of its application in contemporary volunteer GRID networks, including routing (L3 OSI) and spe- cialized key-value storage engine (L7 OSI). The main goal is to show how TRIE mechanisms can influence de- livery process of the corresponding GRID environment resources and services at different layers of networking abstraction. The relevance of the optimization topic is ensured by the fact that with increasing data flow intensi- ty, the latency of the various linear algorithm based subsystems as well increases. This leads to the general ef- fects, such as unacceptably high transmission time and processing instability. Logically paper can be divided into three parts with summary. The first part provides definition of TRIE and asymptotic estimates of corresponding algorithms (searching, deletion, insertion). The second part is devoted to the problem of routing time reduction by applying TRIE data structure. In particular, we analyze Cisco IOS switching services based on Bitwise TRIE and 256 way TRIE data structures. The third part contains general BOINC architecture review and recommenda- tions for highly-loaded projects. -
KP-Trie Algorithm for Update and Search Operations
The International Arab Journal of Information Technology, Vol. 13, No. 6, November 2016 722 KP-Trie Algorithm for Update and Search Operations Feras Hanandeh1, Izzat Alsmadi2, Mohammed Akour3, and Essam Al Daoud4 1Department of Computer Information Systems, Hashemite University, Jordan 2, 3Department of Computer Information Systems, Yarmouk University, Jordan 4Computer Science Department, Zarqa University, Jordan Abstract: Radix-Tree is a space optimized data structure that performs data compression by means of cluster nodes that share the same branch. Each node with only one child is merged with its child and is considered as space optimized. Nevertheless, it can’t be considered as speed optimized because the root is associated with the empty string. Moreover, values are not normally associated with every node; they are associated only with leaves and some inner nodes that correspond to keys of interest. Therefore, it takes time in moving bit by bit to reach the desired word. In this paper we propose the KP-Trie which is consider as speed and space optimized data structure that is resulted from both horizontal and vertical compression. Keywords: Trie, radix tree, data structure, branch factor, indexing, tree structure, information retrieval. Received January 14, 2015; accepted March 23, 2015; Published online December 23, 2015 1. Introduction the exception of leaf nodes, nodes in the trie work merely as pointers to words. Data structures are a specialized format for efficient A trie, also called digital tree, is an ordered multi- organizing, retrieving, saving and storing data. It’s way tree data structure that is useful to store an efficient with large amount of data such as: Large data associative array where the keys are usually strings, bases. -
Lecture 26 Fall 2019 Instructors: B&S Administrative Details
CSCI 136 Data Structures & Advanced Programming Lecture 26 Fall 2019 Instructors: B&S Administrative Details • Lab 9: Super Lexicon is online • Partners are permitted this week! • Please fill out the form by tonight at midnight • Lab 6 back 2 2 Today • Lab 9 • Efficient Binary search trees (Ch 14) • AVL Trees • Height is O(log n), so all operations are O(log n) • Red-Black Trees • Different height-balancing idea: height is O(log n) • All operations are O(log n) 3 2 Implementing the Lexicon as a trie There are several different data structures you could use to implement a lexicon— a sorted array, a linked list, a binary search tree, a hashtable, and many others. Each of these offers tradeoffs between the speed of word and prefix lookup, amount of memory required to store the data structure, the ease of writing and debugging the code, performance of add/remove, and so on. The implementation we will use is a special kind of tree called a trie (pronounced "try"), designed for just this purpose. A trie is a letter-tree that efficiently stores strings. A node in a trie represents a letter. A path through the trie traces out a sequence ofLab letters that9 represent : Lexicon a prefix or word in the lexicon. Instead of just two children as in a binary tree, each trie node has potentially 26 child pointers (one for each letter of the alphabet). Whereas searching a binary search tree eliminates half the words with a left or right turn, a search in a trie follows the child pointer for the next letter, which narrows the search• Goal: to just words Build starting a datawith that structure letter. -
Lecture 04 Linear Structures Sort
Algorithmics (6EAP) MTAT.03.238 Linear structures, sorting, searching, etc Jaak Vilo 2018 Fall Jaak Vilo 1 Big-Oh notation classes Class Informal Intuition Analogy f(n) ∈ ο ( g(n) ) f is dominated by g Strictly below < f(n) ∈ O( g(n) ) Bounded from above Upper bound ≤ f(n) ∈ Θ( g(n) ) Bounded from “equal to” = above and below f(n) ∈ Ω( g(n) ) Bounded from below Lower bound ≥ f(n) ∈ ω( g(n) ) f dominates g Strictly above > Conclusions • Algorithm complexity deals with the behavior in the long-term – worst case -- typical – average case -- quite hard – best case -- bogus, cheating • In practice, long-term sometimes not necessary – E.g. for sorting 20 elements, you dont need fancy algorithms… Linear, sequential, ordered, list … Memory, disk, tape etc – is an ordered sequentially addressed media. Physical ordered list ~ array • Memory /address/ – Garbage collection • Files (character/byte list/lines in text file,…) • Disk – Disk fragmentation Linear data structures: Arrays • Array • Hashed array tree • Bidirectional map • Heightmap • Bit array • Lookup table • Bit field • Matrix • Bitboard • Parallel array • Bitmap • Sorted array • Circular buffer • Sparse array • Control table • Sparse matrix • Image • Iliffe vector • Dynamic array • Variable-length array • Gap buffer Linear data structures: Lists • Doubly linked list • Array list • Xor linked list • Linked list • Zipper • Self-organizing list • Doubly connected edge • Skip list list • Unrolled linked list • Difference list • VList Lists: Array 0 1 size MAX_SIZE-1 3 6 7 5 2 L = int[MAX_SIZE] -
Efficient In-Memory Indexing with Generalized Prefix Trees
Efficient In-Memory Indexing with Generalized Prefix Trees Matthias Boehm, Benjamin Schlegel, Peter Benjamin Volk, Ulrike Fischer, Dirk Habich, Wolfgang Lehner TU Dresden; Database Technology Group; Dresden, Germany Abstract: Efficient data structures for in-memory indexing gain in importance due to (1) the exponentially increasing amount of data, (2) the growing main-memory capac- ity, and (3) the gap between main-memory and CPU speed. In consequence, there are high performance demands for in-memory data structures. Such index structures are used—with minor changes—as primary or secondary indices in almost every DBMS. Typically, tree-based or hash-based structures are used, while structures based on prefix-trees (tries) are neglected in this context. For tree-based and hash-based struc- tures, the major disadvantages are inherently caused by the need for reorganization and key comparisons. In contrast, the major disadvantage of trie-based structures in terms of high memory consumption (created and accessed nodes) could be improved. In this paper, we argue for reconsidering prefix trees as in-memory index structures and we present the generalized trie, which is a prefix tree with variable prefix length for indexing arbitrary data types of fixed or variable length. The variable prefix length enables the adjustment of the trie height and its memory consumption. Further, we introduce concepts for reducing the number of created and accessed trie levels. This trie is order-preserving and has deterministic trie paths for keys, and hence, it does not require any dynamic reorganization or key comparisons. Finally, the generalized trie yields improvements compared to existing in-memory index structures, especially for skewed data. -
The Fractal Dimension Making Similarity Queries More Efficient
The Fractal Dimension Making Similarity Queries More Efficient Adriano S. Arantes, Marcos R. Vieira, Agma J. M. Traina, Caetano Traina Jr. Computer Science Department - ICMC University of Sao Paulo at Sao Carlos Avenida do Trabalhador Sao-Carlense, 400 13560-970 - Sao Carlos, SP - Brazil Phone: +55 16-273-9693 {arantes | mrvieira | agma | caetano}@icmc.usp.br ABSTRACT time series, among others, are not useful, and the simila- This paper presents a new algorithm to answer k-nearest rity between pairs of elements is the most important pro- neighbor queries called the Fractal k-Nearest Neighbor (k- perty in such domains [5]. Thus, a new class of queries, ba- NNF ()). This algorithm takes advantage of the fractal di- sed on algorithms that search for similarity among elements mension of the dataset under scan to estimate a suitable emerged as more adequate to manipulate these data. To radius to shrinks a query that retrieves the k-nearest neigh- be able to apply similarity queries over elements of a data bors of a query object. k-NN() algorithms starts searching domain, a dissimilarity function, or a “distance function”, for elements at any distance from the query center, progres- must be defined on the domain. A distance function δ() ta- sively reducing the allowed distance used to consider ele- kes pairs of elements of the domain, quantifying the dissimi- ments as worth to analyze. If a proper radius can be set larity among them. If δ() satisfies the following properties: to start the process, a significant reduction in the number for any elements x, y and z of the data domain, δ(x, x) = 0 of distance calculations can be achieved. -
Search Trees for Strings a Balanced Binary Search Tree Is a Powerful Data Structure That Stores a Set of Objects and Supports Many Operations Including
Search Trees for Strings A balanced binary search tree is a powerful data structure that stores a set of objects and supports many operations including: Insert and Delete. Lookup: Find if a given object is in the set, and if it is, possibly return some data associated with the object. Range query: Find all objects in a given range. The time complexity of the operations for a set of size n is O(log n) (plus the size of the result) assuming constant time comparisons. There are also alternative data structures, particularly if we do not need to support all operations: • A hash table supports operations in constant time but does not support range queries. • An ordered array is simpler, faster and more space efficient in practice, but does not support insertions and deletions. A data structure is called dynamic if it supports insertions and deletions and static if not. 107 When the objects are strings, operations slow down: • Comparison are slower. For example, the average case time complexity is O(log n logσ n) for operations in a binary search tree storing a random set of strings. • Computing a hash function is slower too. For a string set R, there are also new types of queries: Lcp query: What is the length of the longest prefix of the query string S that is also a prefix of some string in R. Prefix query: Find all strings in R that have S as a prefix. The prefix query is a special type of range query. 108 Trie A trie is a rooted tree with the following properties: • Edges are labelled with symbols from an alphabet Σ. -
X-Fast and Y-Fast Tries
x-Fast and y-Fast Tries Outline for Today ● Bitwise Tries ● A simple ordered dictionary for integers. ● x-Fast Tries ● Tries + Hashing ● y-Fast Tries ● Tries + Hashing + Subdivision + Balanced Trees + Amortization Recap from Last Time Ordered Dictionaries ● An ordered dictionary is a data structure that maintains a set S of elements drawn from an ordered universe � and supports these operations: ● insert(x), which adds x to S. ● is-empty(), which returns whether S = Ø. ● lookup(x), which returns whether x ∈ S. ● delete(x), which removes x from S. ● max() / min(), which returns the maximum or minimum element of S. ● successor(x), which returns the smallest element of S greater than x, and ● predecessor(x), which returns the largest element of S smaller than x. Integer Ordered Dictionaries ● Suppose that � = [U] = {0, 1, …, U – 1}. ● A van Emde Boas tree is an ordered dictionary for [U] where ● min, max, and is-empty run in time O(1). ● All other operations run in time O(log log U). ● Space usage is Θ(U) if implemented deterministically, and O(n) if implemented using hash tables. ● Question: Is there a simpler data structure meeting these bounds? The Machine Model ● We assume a transdichotomous machine model: ● Memory is composed of words of w bits each. ● Basic arithmetic and bitwise operations on words take time O(1) each. ● w = Ω(log n). A Start: Bitwise Tries Tries Revisited ● Recall: A trie is a simple data 0 1 structure for storing strings. 0 1 0 1 ● Integers can be thought of as strings of bits. 1 0 1 1 0 ● Idea: Store integers in a bitwise trie. -
Compact Fenwick Trees for Dynamic Ranking and Selection
Compact Fenwick trees for dynamic ranking and selection Stefano Marchini Sebastiano Vigna Dipartimento di Informatica, Universit`adegli Studi di Milano, Italy October 15, 2019 Abstract The Fenwick tree [3] is a classical implicit data structure that stores an array in such a way that modifying an element, accessing an element, computing a prefix sum and performing a predecessor search on prefix sums all take logarithmic time. We introduce a number of variants which improve the classical implementation of the tree: in particular, we can reduce its size when an upper bound on the array element is known, and we can perform much faster predecessor searches. Our aim is to use our variants to implement an efficient dynamic bit vector: our structure is able to perform updates, ranking and selection in logarithmic time, with a space overhead in the order of a few percents, outperforming existing data structures with the same purpose. Along the way, we highlight the pernicious interplay between the arithmetic behind the Fenwick tree and the structure of current CPU caches, suggesting simple solutions that improve performance significantly. 1 Introduction The problem of building static data structures which perform rank and select operations on vectors of n bits in constant time using additional o(n) bits has received a great deal of attention in the last two decades starting form Jacobson's seminal work on succinct data structures. [7] The rank operator takes a position in the bit vector and returns the number of preceding ones. The select operation returns the position of the k-th one in the vector, given k. -
List of Algorithms
List of Algorithms Swiss Olympiad in Informatics September 13, 2012 This list contains a few algorithms that may prove useful when solving SOI or IOI related tasks. The current IOI Syllabus can be found here: http://people.ksp.sk/~misof/ioi-syllabus/ioi-syllabus.pdf Please note that this list simply serves as an outline and that tasks are not limited by the algorithms listed below. However, one can also participate (and even be successful) without knowing the algorithms listed below. To guide you better, the most important topics are in bold. • Sorting1 • Binary Search • Data Structures Stack1 Queue1 List: Linked List, Doubly Linked List1 Tree Sets1 and Tree Maps1 (Binary Search Trees) Priority Queue1 Double ended queue1 Binary Heap Data Structure Interval Tree Segment Tree Disjoint{set Data Structure (Union{Find) Rational Numbers Bignum (Addition, Subtraction, Multiplication, Division) Quad Tree optional: Binary Indexed Tree (BIT) (Fenwick Tree) • Graph Algorithms DFS (Depth{First Search) BFS (Breadth{First Search) Connected Components Topological Sorting (toposort) Shortest Path · Dijkstra's Algorithm · Bellman{Ford Algorithm · Floyd{Warshall's Algorithm (all{pairs shortest path) MST (Minimum Spanning Tree) · Kruskal's Algorithm · Prim's Algorithm · Find Articulation Points (articfind) · Hierholzer's Algorithm (for finding Euler cycles) • Dynamic Programming (DP) Prefix Sum Edit Distance LCS (Longest Common Subsequence) LIS (Longest Increasing Subsequence) MCM (Matrix Chain Multiplication) MER (Maximum Empty Rectangle) -
Tries and String Matching
Tries and String Matching Where We've Been ● Fundamental Data Structures ● Red/black trees, B-trees, RMQ, etc. ● Isometries ● Red/black trees ≡ 2-3-4 trees, binomial heaps ≡ binary numbers, etc. ● Amortized Analysis ● Aggregate, banker's, and potential methods. Where We're Going ● String Data Structures ● Data structures for storing and manipulating text. ● Randomized Data Structures ● Using randomness as a building block. ● Integer Data Structures ● Breaking the Ω(n log n) sorting barrier. ● Dynamic Connectivity ● Maintaining connectivity in an changing world. String Data Structures Text Processing ● String processing shows up everywhere: ● Computational biology: Manipulating DNA sequences. ● NLP: Storing and organizing huge text databases. ● Computer security: Building antivirus databases. ● Many problems have polynomial-time solutions. ● Goal: Design theoretically and practically efficient algorithms that outperform brute-force approaches. Outline for Today ● Tries ● A fundamental building block in string processing algorithms. ● Aho-Corasick String Matching ● A fast and elegant algorithm for searching large texts for known substrings. Tries Ordered Dictionaries ● Suppose we want to store a set of elements supporting the following operations: ● Insertion of new elements. ● Deletion of old elements. ● Membership queries. ● Successor queries. ● Predecessor queries. ● Min/max queries. ● Can use a standard red/black tree or splay tree to get (worst-case or expected) O(log n) implementations of each. A Catch ● Suppose we want to store a set of strings. ● Comparing two strings of lengths r and s takes time O(min{r, s}). ● Operations on a balanced BST or splay tree now take time O(M log n), where M is the length of the longest string in the tree. -
An Investigation of Practical Approximate Nearest Neighbor Algorithms
An Investigation of Practical Approximate Nearest Neighbor Algorithms Ting Liu, Andrew W. Moore, Alexander Gray and Ke Yang School of Computer Science Carnegie-Mellon University Pittsburgh, PA 15213 USA ftingliu, awm, agray, [email protected] Abstract This paper concerns approximate nearest neighbor searching algorithms, which have become increasingly important, especially in high dimen- sional perception areas such as computer vision, with dozens of publica- tions in recent years. Much of this enthusiasm is due to a successful new approximate nearest neighbor approach called Locality Sensitive Hash- ing (LSH). In this paper we ask the question: can earlier spatial data structure approaches to exact nearest neighbor, such as metric trees, be altered to provide approximate answers to proximity queries and if so, how? We introduce a new kind of metric tree that allows overlap: certain datapoints may appear in both the children of a parent. We also intro- duce new approximate k-NN search algorithms on this structure. We show why these structures should be able to exploit the same random- projection-based approximations that LSH enjoys, but with a simpler al- gorithm and perhaps with greater efficiency. We then provide a detailed empirical evaluation on five large, high dimensional datasets which show up to 31-fold accelerations over LSH. This result holds true throughout the spectrum of approximation levels. 1 Introduction The k-nearest-neighbor searching problem is to find the k nearest points in a dataset X ⊂ RD containing n points to a query point q 2 RD, usually under the Euclidean distance. It has applications in a wide range of real-world settings, in particular pattern recognition, machine learning [7] and database querying [11].