Online Algorithms for Partially Ordered Sets

Date of acceptance Grade Instructor Online algorithms for partially ordered sets Mikko Herranen Helsinki October 13, 2013 MSc thesis UNIVERSITY OF HELSINKI Department of Computer Science HELSINGIN YLIOPISTO — HELSINGFORS UNIVERSITET — UNIVERSITY OF HELSINKI Tiedekunta — Fakultet — Faculty Laitos — Institution — Department Faculty of Science Department of Computer Science Tekijä — Författare — Author Mikko Herranen Työn nimi — Arbetets titel — Title Online algorithms for partially ordered sets Oppiaine — Läroämne — Subject Computer Science Työn laji — Arbetets art — Level Aika — Datum — Month and year Sivumäärä — Sidoantal — Number of pages MSc thesis October 13, 2013 50 pages + 1 appendices Tiivistelmä — Referat — Abstract Partially ordered sets (posets) have various applications in computer science ranging from database systems to distributed computing. Content-based routing in publish/subscribe systems is a major poset use case. Content-based routing requires efficient poset online algorithms, including efficient insertion and deletion algorithms. We study the query and total complexities of online operations on posets and poset-like data structures. The main data structures considered are the incidence matrix, Siena poset, ChainMerge, and poset-derived forest. The contributions of this thesis are twofold: First, we present an online adaptation of the ChainMerge data structure as well as several novel poset-derived forest variants. We study the effectiveness of a first-fit-equivalent ChainMerge online insertion algorithm and show that it performs close to optimal query-wise while requiring less CPU processing in a benchmark setting. Second, we present the results of an empirical performance evaluation. In the evaluation we compare the data structures in terms of query complexity and total complexity. The results indicate ChainMerge as the best structure overall. The incidence matrix, although simple, excels in some benchmarks. Poset-derived forest is very fast overall if a “true” poset data structure is not a requirement. Placing elements in smaller poset-derived forests and then merging them is an efficient way to construct poset-derived forests. Lazy evaluation for poset-derived forests shows some promise as well. ACM Computing Classification System (CCS): A.1 [Introductory and Survey], E.1 [Data Structures] Avainsanat — Nyckelord — Keywords poset, publish-subscribe, benchmark Säilytyspaikka — Förvaringsställe — Where deposited Muita tietoja — övriga uppgifter — Additional information ii Contents 1 Introduction 1 2 Definitions 2 3 Use cases for posets 3 3.1 Publish/subscribe systems . .3 3.2 Recommender systems . .5 4 Problem description 6 4.1 The set of operations . .6 4.2 Types of complexities . .7 4.3 Certain theoretical lower bounds . .8 5 Poset data structures 8 5.1 Incidence matrix . .8 5.2 Siena filters poset . 11 5.3 ChainMerge . 12 5.3.1 Offline insertion algorithms . 13 5.3.2 Online insertion algorithm . 13 5.3.3 Delete and look-up algorithms . 15 5.3.4 Root set computation . 16 5.3.5 Maintaining a minimum chain decomposition . 17 5.3.6 The Merge algorithm . 18 5.4 Worst-case complexities . 21 5.5 Other data structures . 22 6 Poset-derived forest 22 6.1 Definitions . 23 6.2 The algorithms . 23 iii 6.3 Balanced PF . 25 6.4 Lazy-evaluated PF . 26 6.5 PF merging . 27 7 Experimental evaluation 28 7.1 Input values . 28 7.2 Element comparison scheme . 33 7.3 Benchmarking setup . 34 7.4 Poset structures: results and analysis . 35 7.5 Poset-derived forest: results and analysis . 36 8 Discussion 45 9 Future work 47 10 Conclusions 48 References 48 Appendices 1 Algorithm pseudo code listings 1 1 Introduction Partially ordered sets or posets have various applications in computer science ranging from database systems to distributed computing. Posets have uses in ranking scenarios where certain pairs of elements are incomparable, such as ranking con- ference submissions [DKM+11]. In a recommender system, a poset data structure may be used to record the partially known preferences of the users [RV97]. In publish/subscribe systems, posets are used for message filtering [TK06] [CRW01]. In database systems, posets can be used e.g. to aid decomposition of a database schema [Heg94]. In this thesis we focus on poset online operations. An online algorithm operates on incomplete information. For example, a content-based router might store the client subscriptions in a poset data structure. The data structure has to be updated every time the clients make changes to their subscriptions; the entire set of input elements cannot be known in advance, which places additional requirements on the poset data structures and algorithms. In contrast, an offline algorithm knows the entire input set in advance and can thus make an optimal choice at each step. With online operations we sometimes have to settle for a non-optimal solution. A poset data structure may have to store large amounts of data. Additionally, the element comparisons might be expensive. This creates a need for efficient poset data structures and algorithms. The online requirement and the use cases considered in this thesis necessitate also fast insertion and deletion operations. We study the complexity of poset online operations on various poset and poset- like data structures. We seek to find the most efficient poset data structures and algorithms in terms of a fixed set of online operations which was chosen to support the online case while remaining small enough to fit in the scope of a master’s thesis. We discuss the issue both from a theoretical and empirical aspect although the focus is on the empirical part. In the empirical part, we present the results of an empirical evaluation that was carried out by implementing all studied data structures from scratch in Java and executing a series of benchmark runs in a controlled environment. This thesis is organized as follows. First, in Section 2 we define the terms and syntax used in the rest of the paper. In Section 3 we study a couple of the poset use cases in more detail. In Section 4 we discuss a few necessary preliminaries before introducing the poset data structures in Section 5 and poset-derived forest in Section 6. In Section 7 we present the results of the experimental evaluation and finally in 2 sections 8 to 10 we discuss the results and possible future work. Algorithm pseudo code listings can be found in the appendix. 2 Definitions A partially ordered set or poset P is a partial ordering of the elements of a set. Poset P is formally defined as P = (P; ) where P is the set of elements in P and is an irreflexive, transitive binary relation. a b claims element a 2 P precedes, or dominates, element b 2 P and a b claims a does not precede b. If either a b or b a holds, the elements are said to be comparable, written as a ∼ b. Otherwise the elements are said to be incomparable, written as a b. An oracle is a theoretical black box that answers queries about the relations of elements. Given elements a and b, the oracle answers with a b, b a, or a b. A chain of P is a subset of mutually comparable elements of P. A chain C ⊆ P is defined as ci; cj 2 C; i 6= j such that for any ci; cj either ci cj or cj ci holds. An antichain of P is a subset of mutually incomparable elements of P. An antichain A ⊆ P is defined as ai; aj 2 A; i 6= j such that ai aj for any ai; aj. A chain decomposition of P is a set of chains C ⊆ P such that their union equals P . A minimum chain decomposition of P is a chain decomposition that contains the fewest number of chains possible given P and a maximum chain decomposition of P is a chain decomposition that contains the largest number of chains given P. The width of a poset, w(P), is the number of elements in the largest antichain of the poset [DKM+11] [BKS10], which equals the number of chains in a minimum chain decomposition of that poset [Dil50]. For our purposes we also define another width, wmax(P), as the number of chains in a maximum chain decomposition of P. The widths are characterized by the following equation: w(P) ≤ wmax(P) ≤ n, where n is the number of elements in P. The root set of poset P is the set of elements p 2 P such that pi p for all pi 2 P; pi 6= p, that is the set of elements that are not covered by any other elements of P. It is also called the non-covered set. The covered set of an element e is a set of elements p 2 P such that e p. 3 3 Use cases for posets In this section we briefly discuss two poset use cases in order to provide background for the rest of the paper. In Section 3.1 we discuss content-based routing in the context of publish/subscribe systems, and in Section 3.2 we discuss recommender systems. The treatment given in this section is necessarily brief and the reader is encouraged to read the referenced papers for more information. 3.1 Publish/subscribe systems A publish/subscribe system decouples the information producers and the information consumers. A publish/subscribe system consists of a set of producers, who publish notifications, a set of consumers, who subscribe to notifications, and a set of message brokers who deliver the notifications from the producers to the subscribers. There are many benefits to publish/subscribe systems. The following benefits are given by Mühl [Müh02]. The first benefit is loose coupling or decoupling in space: the producers need not address or know the consumers and vice versa.

Load more