Seeking for the Best Priority Queue: Lessons Learnt
Updated 20 October 2013

Seeking for the best priority queue: Lessons learnt

Jyrki Katajainen^{1,2}, Stefan Edelkamp^3, Amr Elmasry^4, Claus Jensen^5
1 University of Copenhagen   2 Jyrki Katajainen and Company   3 University of Bremen
4 Alexandria University      5 The Royal Library

These slides are available from my research information system (see http://www.diku.dk/~jyrki/ under Presentations).

© Performance Engineering Laboratory, Seminar 13391: Schloss Dagstuhl, Germany

Categorization of priority queues

Operation signatures:
  minimum    input: none                  output: locator
  insert     input: locator               output: none
  delete     input: locator p             output: none
  decrease   input: locator, value        output: none
  union      input: two priority queues   output: one priority queue

Categories:
  Elementary:  minimum(), insert, extract-min
  Addressable: also delete(p) and decrease
  Mergeable:   also union

Structure of this talk: Research question

What is the "best" priority queue?
  I:   Elementary
  II:  Addressable
  III: Mergeable

Binary heaps

Example: N = 8, a = [8, 10, 26, 75, 12, 46, 75, 80] (indices 0 to 7); the array stores a complete binary tree with the minimum, 8, at the root a[0].

  minimum()
    return a[0]

  insert(x)
    a[N] = x
    siftup(N)
    N += 1

  extract-min()
    min = a[0]
    N -= 1
    a[0] = a[N]
    siftdown(0)
    return min

  left-child(i):  return 2i + 1
  right-child(i): return 2i + 2
  parent(i):      return ⌊(i − 1)/2⌋

Elementary: array-based.  Addressable: pointer-based.  Mergeable: forest of heaps.
(A compilable C++ sketch of these routines is given below.)

Market analysis

  Efficiency    binary heap      binomial queue       Fibonacci heap   run-relaxed heap
  Operation     [Wil64]          [Vui78]              [FT87]           [DGST88]
                worst case       worst case           amortized        worst case
  minimum       Θ(1)             Θ(1)                 Θ(1)             Θ(1)
  insert        Θ(lg N)          Θ(1)                 Θ(1)             Θ(1)
  decrease      Θ(lg N)          Θ(lg N)              Θ(1)             Θ(1)
  delete        Θ(lg N)          Θ(lg N)              Θ(lg N)          Θ(lg N)
  union         Θ(lg M × lg N)   Θ(min{lg M, lg N})   Θ(1)             Θ(min{lg M, lg N})

Here M and N denote the number of elements in the priority queues just prior to the operation.

I

What is the best solution when handling a request sequence consisting of N insert and N extract-min operations?

Best: comparison complexity / running time
In-place: O(1) extra words

# element comparisons

  Data structure                   insert              extract-min
  binary heaps [Wil64]             lg N + O(1)         2 lg N + O(1)
  heaps on heaps [GM86]            lg lg N ± O(1) a)   lg N + log* N ± O(1) b)
  optimal in-place heaps [EEK13]   O(1)                lg N + O(1)
  lower bounds                     Ω(1)                lg N − O(1)

minimum: O(1) worst-case time
a) Binary search on the siftup path (also in [Car87]); a code sketch of this idea is given below.
b) Go lg N − lg lg N levels down along the siftdown path, sift up in a binary-search manner, or recur further down.

Optimal in-place heaps are galactic; only a masochist would implement them. The ideas are interesting, though.

Binary-heap lower bounds: assumptions
1. You should not help others.
2. You should keep your house clean after each visit.

What is the importance of other complexity measures?
1. # element moves
2. # instructions
3. # branch mispredictions
4. # cache misses
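The following is a minimal, compilable C++ rendering of the binary-heap routines on the slide above. The fixed-capacity array, the class name, and the use of operator< are my own choices for illustration, not code from the talk or the CPH STL.

  // Array-based binary min-heap, following the slide's pseudocode.
  // Assumes the usual preconditions (heap not empty for extract_min,
  // not full for insert), just as the slide does.
  #include <cstddef>
  #include <utility>

  template <typename T, std::size_t Capacity>
  class binary_heap {
   public:
    const T& minimum() const { return a[0]; }      // return a[0]

    void insert(const T& x) {                      // a[N] = x; siftup(N); N += 1
      a[N] = x;
      siftup(N);
      N += 1;
    }

    T extract_min() {                              // min = a[0]; N -= 1; a[0] = a[N]; siftdown(0)
      T min = a[0];
      N -= 1;
      a[0] = a[N];
      siftdown(0);
      return min;
    }

    std::size_t size() const { return N; }

   private:
    static std::size_t left_child(std::size_t i)  { return 2 * i + 1; }
    static std::size_t right_child(std::size_t i) { return 2 * i + 2; }
    static std::size_t parent(std::size_t i)      { return (i - 1) / 2; }

    void siftup(std::size_t i) {
      while (i != 0 && a[i] < a[parent(i)]) {      // move the new element up
        std::swap(a[i], a[parent(i)]);
        i = parent(i);
      }
    }

    void siftdown(std::size_t i) {
      while (left_child(i) < N) {
        std::size_t j = left_child(i);
        if (j + 1 < N && a[j + 1] < a[j]) {        // pick the smaller child
          j += 1;
        }
        if (!(a[j] < a[i])) break;
        std::swap(a[i], a[j]);
        i = j;
      }
    }

    T a[Capacity];
    std::size_t N = 0;
  };

With this interface the request sequences studied below can be written directly, for example binary_heap<int, 1u << 20> pq, followed by N calls to pq.insert and N calls to pq.extract_min.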
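Footnote a) above refers to performing the insert's siftup with a binary search over the ancestor path, which brings the comparison count down to roughly lg lg N per insert. The sketch below illustrates that idea on a std::vector-based min-heap; the function name, the explicit path array, and taking the element by value are my own choices, not the code of [GM86] or [Car87].

  // Insert with a binary search on the siftup path: on the path from the new
  // leaf to the root the values are non-increasing (parent <= child), so the
  // ancestors larger than x form a prefix of the upward path and its length
  // can be found with ~lg lg N element comparisons.
  #include <cstddef>
  #include <vector>

  template <typename T>
  void heap_insert_with_binary_search(std::vector<T>& a, T x) {
    a.push_back(x);
    std::size_t leaf = a.size() - 1;

    // Collect the ancestor path root-wards; path[k] is k + 1 levels above leaf.
    std::vector<std::size_t> path;
    for (std::size_t i = leaf; i != 0; i = (i - 1) / 2) {
      path.push_back((i - 1) / 2);
    }

    // Binary search for the number of ancestors whose value is greater than x.
    std::size_t lo = 0, hi = path.size();
    while (lo < hi) {
      std::size_t mid = lo + (hi - lo) / 2;
      if (x < a[path[mid]]) {
        lo = mid + 1;                      // a[path[0..mid]] are all > x
      } else {
        hi = mid;
      }
    }

    // Shift those lo ancestors one level down and place x in the topmost hole.
    std::size_t hole = leaf;
    for (std::size_t k = 0; k < lo; ++k) {
      a[hole] = a[path[k]];
      hole = path[k];
    }
    a[hole] = x;
  }

The element moves and index arithmetic are still Θ(lg N); only the number of element comparisons drops, which is exactly the measure the table above is about.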
Theory

# element moves
  Source    insert        extract-min
  [Wil64]   ~lg N         ~lg N
  [LL96]    ~(1/2) lg N   ~(1/2) lg N

# pure-C instructions
  Source    insert    extract-min
  [Wil64]   ~9 lg N   ~12 lg N
  [BKS00]   ~5 lg N   ~9 lg N

# branch mispredictions
  Source    insert   extract-min
  [Wil64]   ~lg N    ~lg N
  [EK12]    O(1)     O(1)

# cache misses
  Source    insert                extract-min
  [Wil64]   O(1)                  ~lg(N/M)
  [WT89]    ~(1/B) lg(N/M) a)     ~(1/B) lg(N/M) a)

a) amortized
M: # elements in the cache; B: # elements in a cache line

The misprediction-friendly variant replaces a branch by arithmetic on the comparison result (a fuller loop in this style is sketched below):

  < if (less(a[j], a[j + 1])) {
  <   j += 1;
  < }
  ---
  > j += less(a[j], a[j + 1]);

Experimental environment for sanity checks

  Processor         Intel® Core™ i5-2520M CPU @ 2.50 GHz × 4
  Memory system     8-way-associative L1 cache: 32 KB
                    12-way-associative L3 cache: 3 MB
                    cache lines: 64 B
                    main memory: 3.8 GB
  Operating system  Ubuntu 13.04 (Linux kernel 3.5.0-37-generic)
  Compiler          g++ (gcc version 4.7.3) with optimization -O3
  Profiler          valgrind simulators (version 3.8.1)

Practice: Key performance indicators

# element comparisons
  Source    2^10   2^15   2^20   2^25
  [Wil64]   1.73   1.82   1.87   1.89
  [Car87]   1.49   1.37   1.37   1.3

# element moves
  Source    2^10   2^15   2^20   2^25
  [Wil64]   1.96   1.64   1.48   1.39
  [LL96]    1.41   1.11   0.95   0.86

# instructions
  Source    2^10   2^15   2^20   2^25
  [Wil64]   15.1   14.9   14.8   14.7
  [BKS00]   15.5   14.8   14.5   14.3

# branch mispredictions
  Source    2^10   2^15   2^20   2^25
  [Wil64]   0.59   0.56   0.54   0.53
  [EK12]    0.20   0.14   0.10   0.08

# L1 cache misses
  Source    2^10   2^15   2^20   2^25
  [Wil64]   1.00   13.1   19.1   19.7
  [WT89]    1.06   6.83   4.26   3.63

Request:     N × insert + N × extract-min
Repetitions: r = 2^26 / N
Input:       random ints
Reported:    grand total divided by r × N lg N, or by r × (N/B) lg(N / max{2, M}) for cache misses

Practice: CPU times

  Source \ N                  2^10   2^15   2^20   2^25
  std library [g++]           6.41   6.09   6.96   12.6
  original [Wil64]            5.59   5.45   6.47   12.2
  comparison opt. [Car87]     9.69   8.34   9.02   14.5
  instruction opt. [BKS00]    5.44   5.41   6.31   11.8
  misprediction opt. [EK12]   3.92   4.16   7.2    25.8
  bottom-up [Weg93]           6.77   6.63   7.34   13.1
  4-ary [LL96]                5.47   5.49   6.3    10.4
  external [WT89]             5.23   5.27   5.69   6.04

Request:     N × insert + N × extract-min
Repetitions: r = 2^26 / N; each for a different array (a sketch of such a driver is given below)
Input:       random permutation of {0, 1, ..., N − 1}; type int
Reported:    running time divided by r × N lg N [ns]

II

What is the best solution when handling a request sequence consisting of N insert, N extract-min, and M decrease operations?

Best: comparison complexity / running time

Theory

Rank-relaxed weak heaps are better than Fibonacci heaps! [EEK12]

  Data structure           # element comparisons
  Fibonacci heap           2M + 2.89 N lg N
  Rank-relaxed weak heap   2M + 1.5 N lg N

But they are not simpler!

  Data structure           Lines of code
  Binary heap              205
  Fibonacci heap           296
  Rank-relaxed weak heap   883

Does a factor of two matter?  T vs. 2T
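The two-line diff on the Theory slide above replaces a data-dependent branch inside siftdown by arithmetic on the comparison result. Below is a fuller sketch of a siftdown loop written in that style, in the spirit of [EK12]; the loop structure, the boundary guard, and all names are my own assumptions, not the code measured in the talk.

  // Siftdown in which the choice of the smaller child does not branch on the
  // element comparison: the 0/1 result of less() is added to j directly, which
  // compilers typically compile without a data-dependent jump.  The boundary
  // test j + 1 < N is well predicted; only the element comparison is random.
  #include <cstddef>
  #include <functional>
  #include <utility>
  #include <vector>

  template <typename T, typename Compare = std::less<T>>
  void siftdown_branchfree(std::vector<T>& a, std::size_t i, Compare less = Compare()) {
    const std::size_t N = a.size();
    while (2 * i + 1 < N) {
      std::size_t j = 2 * i + 1;                       // left child
      j += (j + 1 < N) && less(a[j + 1], a[j]);        // advance to the smaller child, branch-free
      if (!less(a[j], a[i])) {
        break;                                         // heap order restored
      }
      std::swap(a[i], a[j]);
      i = j;
    }
  }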
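For reference, here is a minimal sketch of the kind of driver behind the Practice tables above: r = 2^26/N repetitions of N inserts followed by N extract-mins on a fresh random permutation of {0, 1, ..., N − 1}, reported as nanoseconds per r × N × lg N. The priority-queue interface, the clock, and the seeding are assumptions, not the talk's actual benchmark harness.

  #include <algorithm>
  #include <chrono>
  #include <cmath>
  #include <cstddef>
  #include <numeric>
  #include <random>
  #include <vector>

  template <typename PriorityQueue>
  double benchmark(std::size_t N) {
    const std::size_t r = (std::size_t{1} << 26) / N;   // repetitions
    std::mt19937 generator(2013);
    double total_ns = 0.0;

    for (std::size_t rep = 0; rep < r; ++rep) {
      std::vector<int> input(N);                        // a fresh array per repetition
      std::iota(input.begin(), input.end(), 0);
      std::shuffle(input.begin(), input.end(), generator);

      auto start = std::chrono::steady_clock::now();
      PriorityQueue queue;
      for (int x : input) queue.insert(x);
      for (std::size_t i = 0; i < N; ++i) queue.extract_min();
      auto stop = std::chrono::steady_clock::now();
      total_ns += std::chrono::duration<double, std::nano>(stop - start).count();
    }
    return total_ns / (double(r) * double(N) * std::log2(double(N)));
  }

Any default-constructible queue with insert and extract_min, such as the binary-heap sketch earlier, can be plugged in as PriorityQueue.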
Parameterized design

weak heap:
  element type:         int, double, ...
  comparator type:      std::less, std::greater, ...
  node type:            weak-heap node, combined weak-heap node & graph node, ...
  modifier type:        relaxed-heap modifier, ...
  level-registry type:  leaf registry, ...
  mark-registry type:   naive mark registry, eager mark registry, lazy mark registry, ...

• comparators shared
• nodes shared
• transformations shared
• level registries shared
• mark registries shared

→ a factor of two less code
(A template sketch of this parameterization is given at the end of these notes.)

Our play with Dijkstra's algorithm

• Which algorithm
• Which graph representation
• Which priority queue
• Which tuning level

With your search engine, you will find many experimental studies on Dijkstra's algorithm. Be critical when you read the results.

[Figure: the priority queue holds the labelled vertices; the minimum, the source, and the scanned, labelled, and unlabelled vertices are indicated.]

→ a factor of two speed-up

Policy-based benchmarking

[Figure: Distance comparisons [M = 2N lg N]; # comparisons / (M + N lg N) plotted against N (logarithmic scale, 10^4 to 10^6) for rank-relaxed weak queue, pairing heap, weak queue, Fibonacci heap, rank-relaxed weak heap, binary heap, and weak heap.]

Policy-based benchmarking

[Figure: Running times [M = 2N lg N]; time / (M + N lg N) [ns] plotted against N (logarithmic scale, 10^4 to 10^6) for rank-relaxed weak heap, rank-relaxed weak queue, Fibonacci heap, weak heap, weak queue, binary heap, and pairing heap.]

Graph representation

CPH STL:
• adjacency arrays
• simple
• static
• 16M + 16N + O(1) bytes for a graph with M edges and N vertices

LEDA:
• adjacency lists
• nice interface
• fully dynamic
• parameterized
• 52M + 60N + O(1) bytes [LEDA Book, §6.14]

[Figure: a vertex record stores the tentative distance, the state, and a pointer to its edges; an edge record stores its end points and its weight.]

→ a factor of two speed-up

Avoiding indirection

Combine the graph vertex and the priority-queue node [Knu94]
→ improves cache behaviour
→ a factor of two speed-up
(A struct sketch combining the two records is given at the end of these notes.)

Tuning

Running time / N [µs]
  Operation     N            CPH STL Fibonacci heap   LEDA 6.2 Fibonacci heap
  insert        10 000       0.10                     0.18
                100 000      0.09                     0.15
                1 000 000    0.09                     0.15
  decrease      10 000       0.03                     0.06
                100 000      0.05                     0.22
                1 000 000    0.06                     0.31
  extract-min   10 000       0.7                      1.2
                100 000      1.4                      2.7
                1 000 000    2.8                      4.5

# element comparisons / N
  Operation     N            CPH STL Fibonacci heap   LEDA 6.2 Fibonacci heap
  insert        10 000       0                        1
                100 000      0                        1
                1 000 000    0                        1
  decrease      10 000       0                        2
                100 000      0                        2
                1 000 000    0                        2
  extract-min   10 000       16.2                     29.9
                100 000      21.2                     38.3
                1 000 000    26.2
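The Parameterized design slide lists the policies of the weak-heap family. The fragment below illustrates, hypothetically, how such a policy-based container can be declared in C++; the policy class names merely echo the slide's vocabulary and are not the actual CPH STL types.

  // Policy-based priority queue: every design decision is a template
  // parameter, so comparators, nodes, transformations, and registries can be
  // shared between the different heap variants built from the same code.
  #include <functional>

  template <typename V> struct weak_heap_node { /* value, links, rank, ... */ };
  struct relaxed_heap_modifier { /* decrease/delete transformations */ };
  struct leaf_registry         { /* per-level bookkeeping of nodes */ };
  struct lazy_mark_registry    { /* bookkeeping of marked (potentially violating) nodes */ };

  template <typename V,
            typename Comparator    = std::less<V>,
            typename Node          = weak_heap_node<V>,
            typename Modifier      = relaxed_heap_modifier,
            typename LevelRegistry = leaf_registry,
            typename MarkRegistry  = lazy_mark_registry>
  class weak_heap {
    // insert, extract_min, decrease, ... implemented in terms of the policies
  };

  // Swapping one policy yields a different data structure from the same code,
  // e.g. a heap whose nodes double as graph vertices (hypothetical policy):
  //   weak_heap<int, std::less<int>, combined_vertex_node<int>> frontier;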
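The Graph representation and Avoiding indirection slides advocate adjacency arrays and fusing the graph vertex with the priority-queue node. The sketch below shows one way such records could look; all field names, types, and the back link used by decrease are assumptions for illustration, not the measured implementation, and the struct sizes here are not meant to reproduce the byte counts quoted above.

  // Adjacency-array graph whose vertex record is at the same time the
  // priority-queue node, so Dijkstra's algorithm needs no extra indirection
  // between the frontier and the graph [Knu94].
  #include <cstddef>
  #include <vector>

  struct edge {
    std::size_t target;          // end point
    double      weight;
  };

  struct vertex {                // graph vertex and priority-queue node combined
    double      tentative_distance;
    int         state;           // unlabelled / labelled / scanned
    std::size_t first_edge;      // index into the edge array (adjacency array)
    std::size_t heap_index;      // back link used when decrease relocates the node
  };

  struct graph {
    std::vector<vertex> vertices;  // one extra sentinel vertex at the end
    std::vector<edge>   edges;
    // edges of v are edges[vertices[v].first_edge .. vertices[v + 1].first_edge)
  };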